Example¶

Engarde really shines when you have a dataset that regularly receives updates. We’ll work with a data set of customer preferences on trains. This is a static dataset and isn’t being updated, but you could imagine that each month the Dutch authorities upload a new month’s worth of data.

We can start by making some very basic assertions, that the dataset is the correct shape, and that a few columns are the correct dtypes. Assertions are made as decorators to functions that return a DataFrame.

In [1]: import pandas as pd

In [2]: import engarde.decorators as ed

In [3]: pd.set_option('display.max_rows', 10)

In [4]: dtypes = dict(
   ...:     price1=int,
   ...:     price2=int,
   ...:     time1=int,
   ...:     time2=int,
   ...:     change1=int,
   ...:     change2=int,
   ...:     comfort1=int,
   ...:     comfort2=int
   ...: )

In [5]: @ed.is_shape((None, 11))
   ...: @ed.has_dtypes(items=dtypes)
   ...: def unload():
   ...:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ...:     return trains

In [6]: unload()
Out[6]:
     id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  \
1     1         1  choice1    2400    150        0         1    4000    150
2     1         2  choice1    2400    150        0         1    3200    130
3     1         3  choice1    2400    115        0         1    4000    115
4     1         4  choice2    4000    130        0         1    3200    150
5     1         5  choice2    2400    150        0         1    3200    150
..   ..       ...      ...     ...    ...      ...       ...     ...    ...
347  30         7  choice1    2100    135        1         1    2800    135
348  30         8  choice1    2100    125        1         1    3500    125
349  30         9  choice1    2100    150        0         0    2800    125
350  30        10  choice1    2800    125        0         1    2800    135
351  30        11  choice2    3500    125        1         0    2800    135

     change2  comfort2
1          0         1
2          0         1
3          0         0
4          0         0
5          0         0
..       ...       ...
347        1         0
348        1         0
349        0         1
350        1         0
351        1         0

[351 rows x 11 columns]

One very important part of the design of Engarde is that your code, the code actually doing the work, shouldn’t have to change. I don’t want a bunch of asserts cluttering up the logic of what’s happening. This is a perfect case for decorators.

The order of execution here is unload returns the DataFrame, trains. Next, ed.has_dtypes asserts that trains has the correct dtypes, as specified with dtypes. Once that assert passes, has_dtypes passes trains along to the next check, and so on, until the original caller gets back trains.

Each row of this dataset contains a passengers preference over two routes. Each route has an associated cost, travel time, comfort level, and number of changes. Like any good economist, we’ll assume people are rational: their first choice is surely going to be better in at least one way than their second choice (faster, more comfortable, …). This is fundamental to our analysis later on, so we’ll explicitly state it in our code, and check it in our data.

 In [7]: def rational(df):
    ...:     """
    ...:     Check that at least one criteria is better.
    ...:     """
    ...:     r = ((df.price1 < df.price2) | (df.time1 < df.time2) |
    ...:          (df.change1 < df.change2) | (df.comfort1 > df.comfort2))
    ...:     return r
    ...:

 In [8]: @ed.is_shape((None, 11))
    ...: @ed.has_dtypes(items=dtypes)
    ...: @ed.verify_all(rational)
    ...: def unload():
    ...:     trains = pd.read_csv("data/trains.csv", index_col=0)
    ...:     return trains
    ...:

 In [9]: df = unload()
 ---------------------------------------------------------------------------
 AssertionError                            Traceback (most recent call last)
 <ipython-input-9-b108f050ce4e> in <module>()
 ----> 1 df = unload()

 /Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*args, **kwargs)
      22         @wraps(func)
      23         def wrapper(*args, **kwargs):
 ---> 24             result = func(*args, **kwargs)
      25             ck.is_shape(result, shape)
      26             return result

 /Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*args, **kwargs)
     115         @wraps(func)
     116         def wrapper(*args, **kwargs):
 --> 117             result = func(*args, **kwargs)
     118             ck.has_dtypes(result, items)
     119             return result

 /Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*operation_args, **operation_kwargs)
     147         def wrapper(*operation_args, **operation_kwargs):
     148             result = operation_func(*operation_args, **operation_kwargs)
 --> 149             vfunc(result, func, *args, **kwargs)
     150             return result
     151         return wrapper

 /Users/tom.augspurger/sandbox/engarde/engarde/generic.py in verify_all(df, check, *args, **kwargs)
      40     result = check(df, *args, **kwargs)
      41     try:
 ---> 42         assert np.all(result)
      43     except AssertionError as e:
      44         msg = "{} not true for all".format(check.__name__)

 AssertionError: ('rational not true for all',      id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  \
 13    2         3  choice2    2450    121        0         0    2450     93
 18    2         8  choice2    2975    108        0         0    2450    108
 27    3         6  choice2    1920    106        0         0    1440     96
 28    3         7  choice1    1920    106        0         0    1920     96
 33    4         1  choice2     545    105        1         1     545     85
 ..   ..       ...      ...     ...    ...      ...       ...     ...    ...
 306  27         9  choice1    3920    140        1         1    3920    125
 319  28         8  choice2    2450    133        1         1    2450    108
 325  28        14  choice2    2450    123        0         1    2450    108
328  28        17  choice2    2815    108        0         1    2450    108
330  29         2  choice2    2800    140        2         0    2800    120

     change2  comfort2
13         0         1
18         0         1
27         0         1
28         0         1
33         1         1
..       ...       ...
306        0         2
319        0         2
325        0         2
328        0         2
330        0         1

[42 rows x 11 columns])

So our check failed, apparently people aren’t rational… Engarde has printed the name of the failed assertion and the rows that are False. We’re simply reusing pandas printing machinery, so set pd.options.display.max_rows to display more or fewer rows.

We’ll fix this problem by ignoring those people (why change your mind when you can change the data?).

In [16]: @ed.verify_all(rational)
   ....: def drop_silly_people(df):
   ....:     r = df.query("price1 < price2 | time1 < time2 |"
   ....:                  "change1 < change2 | comfort1 > comfort2")
   ....:     return r
   ....:

In [17]: @ed.is_shape((None, 11))
   ....: @ed.has_dtypes(items=dtypes)
   ....: def unload():
   ....:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ....:     return trains

In [18]: df = unload().pipe(drop_silly_people)

In [19]: df.head()
Out[19]:
   id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  \
1   1         1  choice1    2400    150        0         1    4000    150
2   1         2  choice1    2400    150        0         1    3200    130
3   1         3  choice1    2400    115        0         1    4000    115
4   1         4  choice2    4000    130        0         1    3200    150
5   1         5  choice2    2400    150        0         1    3200    150

   change2  comfort2
1        0         1
2        0         1
3        0         0
4        0         0
5        0         0

All of our assertions have “passed” now, so we’re happy and our analysis can proceed.