Data Analysis guide

Note: this document is a work in progress. For sections that aren’t completed, I have included links to useful documentation and examples.

See also these resources:

Overview

[2]:
from siuba.data import cars

cars
[2]:
cyl mpg hp
0 6 21.0 110
1 6 21.0 110
... ... ... ...
30 8 15.0 335
31 4 21.4 109

32 rows × 3 columns

Split-apply-combine

🚧 Coming soon. In the meantime, check out these docs with many examples in the User API.

Dates and times

🚧 Coming soon. See this article on timeseries in the pandas docs.

Strings

🚧 Coming soon. See this article on working with text in the pandas docs.

Reshaping

🚧 Coming soon. See the following User API entries for reshaping verbs.

Table joins

🚧 Coming soon. See siuba’s User API entry for joins.

Nested data

🚧 Coming soon. See siuba’s User API entry for nest and unnest.

Custom functions

Custom functions are created using symbolic_dispatch.

[3]:
import pandas as pd
from siuba.siu import symbolic_dispatch


@symbolic_dispatch(cls = pd.Series)
def add(x, y):
    return x + y


from siuba import _, mutate

df = pd.DataFrame({
        "x": [1, 2, 3],
        "y": [4, 5, 6],
        })

df >> mutate(res = add(_.x, _.y) + 100)
[3]:
x y res
0 1 4 105
1 2 5 107
2 3 6 109

One important feature of symbolic_dispatch is how it handles _. When a dispatched function is called with _ rather than with data, it returns a Symbolic object instead of a result, which lets you use the call inside larger expressions.

[4]:
add(_.x, _.y) + 100
[4]:
█─+
├─█─'__call__'
│ ├─█─'__custom_func__'
│ │ └─<function add at 0x7f1998853cb0>
│ ├─█─.
│ │ ├─_
│ │ └─'x'
│ └─█─.
│   ├─_
│   └─'y'
└─100
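To make the role of cls = pd.Series more concrete, here is a rough analogy using functools.singledispatch from the standard library. It is only a sketch of type-based dispatch on the first argument, not siuba's implementation, and it has no special handling of _. The fallback message and the name _add_series are made up for illustration.

```python
from functools import singledispatch

import pandas as pd


@singledispatch
def add(x, y):
    # fallback for types with no registered implementation (illustrative only)
    raise NotImplementedError(f"add is not implemented for {type(x).__name__}")


@add.register(pd.Series)
def _add_series(x, y):
    # chosen when the first argument is a pandas Series
    return x + y


result = add(pd.Series([1, 2, 3]), pd.Series([4, 5, 6]))
print(result.tolist())  # [5, 7, 9]
```

Calling add(1, 2) in this sketch raises NotImplementedError, since no implementation is registered for int.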

Debugging

This section covers the four most common issues people seem to hit.

  1. Referring to a column that doesn’t exist

  2. A pandas Series method raising an error

  3. Python syntax errors

  4. Any of the above in a pipe

Note that the stack traces shown here are shorter than normal, to help make them clearer. This is something siuba does for SQL by default, and will be implemented for pandas in the future.

[5]:
import pandas as pd
from siuba import mutate, _

df = pd.DataFrame({
    'g': ['a','a','b'],
    'x': [1,2,3]
})

Missing columns

[7]:
mutate(df, y = _.X + 1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-6985da088c9c> in <module>
----> 1 mutate(df, y = _.X + 1)

AttributeError: 'DataFrame' object has no attribute 'X'

In this case, the data doesn’t have a column named “X”. Column names are case sensitive, so it helps to check the exact names in the data.

[8]:
df.columns
[8]:
Index(['g', 'x'], dtype='object')
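Because column names are case sensitive, a quick case-insensitive lookup can point to the column that was probably intended. This is a minimal sketch: find_similar_columns is a hypothetical helper, not part of siuba or pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'g': ['a', 'a', 'b'],
    'x': [1, 2, 3]
})


def find_similar_columns(data, name):
    """Return the columns that match `name` when letter case is ignored."""
    return [col for col in data.columns if str(col).lower() == name.lower()]


print(find_similar_columns(df, "X"))  # ['x']
```

An empty list means no column matches even when case is ignored, so the name itself is probably wrong.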

Series method error

[9]:
mutate(df, y = _.x.mean(bad_arg = True))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7d091be01caf> in <module>
----> 1 mutate(df, y = _.x.mean(bad_arg = True))

TypeError: mean() got an unexpected keyword argument 'bad_arg'

In this case, it’s helpful to try replacing _ with the actual data.

[10]:
# expression to debug
_.x.mean(bad_arg = True)

# replacing _ with the data
df.x.mean(bad_arg = True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-74696ea76235> in <module>
      3
      4 # replacing _ with the data
----> 5 df.x.mean(bad_arg = True)

TypeError: mean() got an unexpected keyword argument 'bad_arg'
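Once the direct call reproduces the error, the fix can be tried on the data before putting _ back. A small sketch of that loop, assuming pandas raises a TypeError for the unknown keyword as in the traceback above:

```python
import pandas as pd

df = pd.DataFrame({
    'g': ['a', 'a', 'b'],
    'x': [1, 2, 3]
})

# reproduce the failure directly on the data, without _
try:
    df.x.mean(bad_arg = True)
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)

# drop the bad argument and confirm the call works on the data
fixed = df.x.mean()
print(fixed)  # 2.0
```

With the call working on the data, df.x.mean() can go back into the verb as _.x.mean().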

Python syntax errors

[11]:
df
    >> mutate(y = _.x + 1)
  File "<ipython-input-11-fded324be8e0>", line 2
    >> mutate(y = _.x + 1)
    ^
IndentationError: unexpected indent

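Python treats the line containing only df as a complete statement, so the indented line that follows is unexpected. To continue a pipe across lines, either end the line with a backslash or wrap the whole expression in parentheses.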

[12]:
df \
    >> mutate(y = _.x + 1)

(df
    >> mutate(y = _.x + 1)
)
[12]:
g x y
0 a 1 2
1 a 2 3
2 b 3 4

Pipes

When the error occurs in a pipe, it’s helpful to comment out parts of the pipe.

For example, consider the three-step pipe below.

[13]:
from siuba import select, arrange, mutate

(df
   >> select(_.g, _.x)
   >> mutate(res = _.X + 1)
   >> arrange(-_.res)
)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-8a0872ddae08> in <module>
      4    >> select(_.g, _.x)
      5    >> mutate(res = _.X + 1)
----> 6    >> arrange(-_.res)
      7 )

AttributeError: 'DataFrame' object has no attribute 'X'

Notice the arrow pointing to line 6. This is not because that’s where the error is, but because Python will always point to the last line of a pipe.

Let’s debug by running only the first line, then only the first two, and so on, until we find the error.

[14]:
(df
   >> select(_.g, _.x)
#    >> mutate(res = _.X + 1)
#    >> arrange(-_.res)
)
[14]:
g x
0 a 1
1 a 2
2 b 3

Select works okay, so let’s uncomment the next line.

[15]:
(df
   >> select(_.g, _.x)
   >> mutate(res = _.X + 1)
#    >> arrange(-_.res)
)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-f317e7f1dcb7> in <module>
      1 (df
      2    >> select(_.g, _.x)
----> 3    >> mutate(res = _.X + 1)
      4 #    >> arrange(-_.res)
      5 )

AttributeError: 'DataFrame' object has no attribute 'X'

We found our bug! Note that when working with SQL, siuba prints out the name of the verb where the error occurred. This is very useful, and will be added for pandas in the future!
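The step-by-step approach can also be scripted. The sketch below uses plain pandas functions as stand-ins for the siuba verbs, and run_steps is a hypothetical helper rather than anything siuba provides.

```python
import pandas as pd

df = pd.DataFrame({
    'g': ['a', 'a', 'b'],
    'x': [1, 2, 3]
})


def run_steps(data, steps):
    """Apply (name, function) steps in order.

    Returns (last_good_result, failed_step_name). The name is None
    when every step succeeds.
    """
    result = data
    for name, step in steps:
        try:
            result = step(result)
        except Exception as err:
            print(f"{name} failed: {type(err).__name__}: {err}")
            return result, name
    return result, None


# plain pandas stand-ins for select, mutate, and arrange
steps = [
    ("select", lambda d: d[["g", "x"]]),
    ("mutate", lambda d: d.assign(res = d.X + 1)),  # same typo as in the pipe
    ("arrange", lambda d: d.sort_values("res", ascending = False)),
]

last_good, failed = run_steps(df, steps)
print(failed)  # mutate
```

Here last_good holds the output of the last step that worked, which makes it easy to inspect the data the failing step received.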
