API

checks

This module contains the functions that perform the actual assertions. You can also use these functions directly during interactive sessions, probably via the DataFrame pipe method.

checks.py

Each function in here should

  • Take a DataFrame as its first argument, optionally followed by other arguments
  • Make its assertion on the result
  • Return the original DataFrame
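
The three rules above can be sketched with plain pandas. This is only an illustration of the contract, not code from engarde; the function name no_negative_values and the sample data are hypothetical.

```python
import pandas as pd


def no_negative_values(df, columns=None):
    """Hypothetical check that follows the three rules listed above."""
    # Rule 1: the DataFrame comes first, optional arguments follow.
    subset = df if columns is None else df[columns]
    # Rule 2: make the assertion.
    assert (subset >= 0).all().all(), "negative values found"
    # Rule 3: return the original DataFrame so calls can be chained.
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# Because the check returns its input, it slots into a pipe chain.
result = df.pipe(no_negative_values).pipe(no_negative_values, columns=["a"])
```

Returning the input unchanged is what lets a check sit in the middle of a method chain without affecting the data that flows through it.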
engarde.checks.is_monotonic(df, items=None, increasing=None, strict=False)

Asserts that the DataFrame is monotonic.

Parameters:
df : Series or DataFrame
items : dict

mapping columns to conditions (increasing, strict)

increasing : None or bool

None accepts either increasing or decreasing.

strict : bool

whether the comparison should be strict

Returns:
df : DataFrame
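
To make the increasing and strict parameters concrete, here is a minimal sketch for a single Series. It is an assumption about the intended semantics, not engarde's implementation; check_monotonic is a hypothetical name, and the per-column items mapping is left out.

```python
import pandas as pd


def check_monotonic(series, increasing=None, strict=False):
    """Hypothetical single-Series version of the check described above."""
    # Differences between consecutive values; the first one is NaN.
    diffs = series.diff().dropna()
    if strict:
        up, down = (diffs > 0).all(), (diffs < 0).all()
    else:
        up, down = (diffs >= 0).all(), (diffs <= 0).all()
    if increasing is None:
        ok = up or down  # either direction is acceptable
    else:
        ok = up if increasing else down
    assert ok, "series is not monotonic"
    return series


check_monotonic(pd.Series([1, 2, 2, 3]))                   # non-strict, passes
check_monotonic(pd.Series([3, 2, 1]), increasing=False)    # decreasing, passes
```

Note that this sketch drops NaN differences, so missing values in the data are silently skipped; a real check may treat them differently.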
engarde.checks.is_same_as(df, df_to_compare, **kwargs)

Assert that two pandas DataFrames are equal.

Parameters:
df : pandas DataFrame
df_to_compare : pandas DataFrame
**kwargs : dict

keyword arguments passed through to pandas’ assert_frame_equal

Returns:
df : DataFrame
engarde.checks.is_shape(df, shape)

Asserts that the DataFrame is of a known shape.

Parameters:
df : DataFrame
shape : tuple

(n_rows, n_columns). Use None or -1 if you don’t care about a dimension.

Returns:
df : DataFrame
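
The wildcard behaviour of shape can be sketched as follows. This is a pandas-only illustration under the assumption that None and -1 both mean "skip this dimension"; check_shape is a hypothetical name, not engarde's code.

```python
import pandas as pd


def check_shape(df, shape):
    """Hypothetical shape check with None / -1 as wildcards."""
    for actual, expected in zip(df.shape, shape):
        if expected is None or expected == -1:
            continue  # this dimension is not constrained
        assert actual == expected, f"expected {expected}, got {actual}"
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
check_shape(df, (3, 2))      # both dimensions checked
check_shape(df, (None, 2))   # only the column count is checked
check_shape(df, (-1, 2))     # same, using -1 as the wildcard
```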
engarde.checks.none_missing(df, columns=None)

Asserts that there are no missing values (NaNs) in the DataFrame.

Parameters:
df : DataFrame
columns : list

list of columns to restrict the check to

Returns:
df : DataFrame

same as the original
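
As a rough illustration of how columns restricts the check, the sketch below uses only pandas. The name check_none_missing and the sample frame are hypothetical; engarde's own implementation may differ.

```python
import pandas as pd


def check_none_missing(df, columns=None):
    """Hypothetical missing-value check, optionally limited to some columns."""
    subset = df if columns is None else df[columns]
    assert not subset.isnull().any().any(), "missing values found"
    return df


df = pd.DataFrame({"a": [1.0, None], "b": [1, 2]})

# Column "a" holds a NaN, but restricting the check to "b" passes.
check_none_missing(df, columns=["b"])
```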

engarde.checks.unique_index(df)

Assert that the index is unique

Parameters:
df : DataFrame
Returns:
df : DataFrame
engarde.checks.within_n_std(df, n=3)

Assert that every value is within n standard deviations of its column’s mean.

Parameters:
df : DataFrame
n : int

number of standard deviations from the mean

Returns:
df : DataFrame
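
One way to read "within n standard deviations of its column’s mean" is sketched below with plain pandas. This assumes the sample standard deviation (the pandas default) and an inclusive bound; check_within_n_std is a hypothetical name, not engarde's code.

```python
import pandas as pd


def check_within_n_std(df, n=3):
    """Hypothetical per-column outlier check."""
    means = df.mean()
    stds = df.std()  # sample standard deviation (ddof=1)
    # Compare each column's absolute deviation with n times that column's std.
    within = (df - means).abs().le(n * stds, axis="columns")
    assert within.all().all(), f"values further than {n} std from the mean"
    return df


df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2.0, 4.0, 6.0, 8.0, 10.0]})
check_within_n_std(df)  # passes with the default n=3
```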
engarde.checks.within_range(df, items=None)

Assert that a DataFrame is within a range.

Parameters:
df : DataFrame
items : dict

mapping of columns (k) to a (low, high) tuple (v) that df[k] is expected to be between.

Returns:
df : DataFrame
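
The items mapping can be illustrated with a small pandas-only sketch. Whether the real check treats the bounds as inclusive is not stated here, so the inclusive comparison below is an assumption; check_within_range is a hypothetical name.

```python
import pandas as pd


def check_within_range(df, items):
    """Hypothetical range check: items maps column -> (low, high)."""
    for col, (low, high) in items.items():
        # Series.between is inclusive on both ends by default.
        assert df[col].between(low, high).all(), f"{col} outside [{low}, {high}]"
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
check_within_range(df, {"a": (0, 3), "b": (10, 30)})
```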
engarde.checks.within_set(df, items=None)

Assert that the values in df are a subset of items.

Parameters:
df : DataFrame
items : dict

mapping of columns (k) to array-like of values (v) that df[k] is expected to be a subset of

Returns:
df : DataFrame
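
A minimal sketch of the subset test, using only pandas. The name check_within_set and the example data are hypothetical, and this is not engarde's implementation.

```python
import pandas as pd


def check_within_set(df, items):
    """Hypothetical membership check: items maps column -> allowed values."""
    for col, allowed in items.items():
        assert df[col].isin(allowed).all(), f"unexpected values in {col}"
    return df


df = pd.DataFrame({"size": ["S", "M", "L"], "qty": [1, 2, 3]})
check_within_set(df, {"size": {"S", "M", "L", "XL"}})
```

Columns that are not listed in items are not checked at all.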
engarde.checks.has_dtypes(df, items)

Assert that a DataFrame has the expected dtypes.

Parameters:
df : DataFrame
items : dict

mapping of columns to dtype.

Returns:
df : DataFrame
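
The column-to-dtype mapping might look like the sketch below. It assumes dtypes are compared by simple equality, which may not match what engarde does; check_dtypes is a hypothetical name.

```python
import pandas as pd


def check_dtypes(df, items):
    """Hypothetical dtype check: items maps column -> expected dtype."""
    for col, expected in items.items():
        actual = df[col].dtype
        assert actual == expected, f"{col}: expected {expected}, got {actual}"
    return df


# Dtypes are set explicitly so the example does not depend on the platform.
df = pd.DataFrame({
    "a": pd.Series([1, 2], dtype="int64"),
    "b": pd.Series([0.5, 1.5], dtype="float64"),
})
check_dtypes(df, {"a": "int64", "b": "float64"})
```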
engarde.checks.verify(df, check, *args, **kwargs)

Generic verify. Assert that check(df, *args, **kwargs) is true.

Parameters:
df : DataFrame
check : function

Should take a DataFrame and **kwargs, and return a bool.

Returns:
df : DataFrame

same as the input.

engarde.checks.verify_all(df, check, *args, **kwargs)

Verify that all the entries in check(df, *args, **kwargs) are true.

engarde.checks.verify_any(df, check, *args, **kwargs)

Verify that any of the entries in check(df, *args, **kwargs) is true
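
The difference between the three generic verifiers can be sketched as follows. These are hypothetical stand-ins written with plain pandas (the sketch_ prefix marks them as such), assuming that verify expects a single truth value while verify_all and verify_any expect a boolean Series or DataFrame.

```python
import pandas as pd


def sketch_verify(df, check, *args, **kwargs):
    # check returns one truth value for the whole DataFrame.
    assert check(df, *args, **kwargs)
    return df


def sketch_verify_all(df, check, *args, **kwargs):
    # check returns a boolean Series/DataFrame; every entry must be true.
    assert check(df, *args, **kwargs).to_numpy().all()
    return df


def sketch_verify_any(df, check, *args, **kwargs):
    # check returns a boolean Series/DataFrame; one true entry is enough.
    assert check(df, *args, **kwargs).to_numpy().any()
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
sketch_verify(df, lambda d: len(d) == 3)
sketch_verify_all(df, lambda d: d["a"] < d["b"])
sketch_verify_any(df, lambda d, target: d["a"] == target, 2)  # extra arg forwarded
```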

engarde.checks.one_to_many(df, unitcol, manycol)

Assert that a many-to-one relationship is preserved between two columns. For example, a retail store will have distinct departments, each with several employees. If each employee may only work in a single department, then the relationship of the department to the employees is one to many.

Parameters:
df : DataFrame
unitcol : str

The column that encapsulates the groups in manycol.

manycol : str

The column that must remain unique in the distinct pairs between manycol and unitcol.

Returns:
df : DataFrame
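
Using the department and employee example, the relationship can be tested roughly as below. This is a pandas-only sketch of the idea described for manycol (unique among the distinct pairs), not engarde's implementation; check_one_to_many and the column names are hypothetical.

```python
import pandas as pd


def check_one_to_many(df, unitcol, manycol):
    """Hypothetical check: each manycol value maps to one unitcol value."""
    # Distinct (manycol, unitcol) pairs; repeated rows are harmless.
    pairs = df[[manycol, unitcol]].drop_duplicates()
    # A manycol value appearing in two pairs belongs to two units.
    assert not pairs[manycol].duplicated().any(), "relationship violated"
    return df


staff = pd.DataFrame({
    "department": ["toys", "toys", "garden"],
    "employee": ["ann", "bob", "cat"],
})
check_one_to_many(staff, unitcol="department", manycol="employee")
```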

decorators

engarde.decorators.none_missing(columns=None)

Asserts that no missing values (NaN) are found

engarde.decorators.within_range(items)

Check that a DataFrame’s values are within a range.

Parameters:
items : dict or array-like

A dict maps columns to (lower, upper); an array-like checks the same (lower, upper) for each column.

engarde.decorators.within_set(items)

Check that DataFrame values are within set.

>>> from engarde.decorators import within_set
>>> @within_set({'A': {1, 3}})
... def f(df):
...     return df
engarde.decorators.has_dtypes(items)

Tests that the dtypes are as specified in items.

engarde.decorators.verify(func, *args, **kwargs)

Assert that func(df, *args, **kwargs) is true.

engarde.decorators.verify_all(func, *args, **kwargs)

Assert that all of func(*args, **kwargs) are true.

engarde.decorators.verify_any(func, *args, **kwargs)

Assert that any of func(*args, **kwargs) are true.

engarde.decorators.within_n_std(n=3)

Tests that all values are within n standard deviations of their column’s mean (n defaults to 3).

engarde.decorators.one_to_many(unitcol, manycol)

Tests that each value in manycol is associated with exactly one value in unitcol.

This module provides a decorator for each of the checks, designed to fit into an ETL pipeline. Each decorator defined here can be applied to a function that returns a DataFrame.
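
The general pattern of wrapping a check around a DataFrame-returning function can be sketched like this. It is an illustration written with pandas and functools only; requires_none_missing and load are hypothetical names, and engarde's decorators may be built differently.

```python
import functools

import pandas as pd


def requires_none_missing(columns=None):
    """Hypothetical decorator factory that checks a function's result."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            subset = result if columns is None else result[columns]
            # Run the check on the returned DataFrame before handing it on.
            assert not subset.isnull().any().any(), "missing values found"
            return result
        return wrapper
    return decorate


@requires_none_missing(columns=["a"])
def load():
    # Stand-in for an ETL step; column "b" has a NaN but is not checked.
    return pd.DataFrame({"a": [1, 2], "b": [None, 3.0]})


frame = load()
```

With this shape, each pipeline step states its own guarantees at the point of definition, and a failed assertion points to the step that produced the bad data.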