Data Analysis guide¶

Note: this document is a work in progress. For sections that aren’t completed, I have included links to useful documentation and examples.

Overview¶

[2]:

from siuba.data import cars

cars

[2]:

	cyl	mpg	hp
0	6	21.0	110
1	6	21.0	110
...	...	...	...
30	8	15.0	335
31	4	21.4	109

32 rows × 3 columns

Split-apply-combine¶

🚧 Coming soon. In the meantime, check out these docs with many examples in the User API.

Dates and times¶

🚧 Coming soon. See this article on timeseries in the pandas docs.

Strings¶

🚧 Coming soon. See this article on working with text in the pandas docs.

Reshaping¶

🚧 Coming soon. See the User API entries for reshaping verbs.

Table joins¶

🚧 Coming soon. See siuba’s User API entry for joins.

Nested data¶

🚧 Coming soon. See siuba’s User API entry for nest and unnest.

Custom functions¶

Custom functions are created using the symbolic_dispatch decorator. In the example below, add is registered for pd.Series inputs.

[3]:

import pandas as pd
from siuba.siu import symbolic_dispatch


@symbolic_dispatch(cls = pd.Series)
def add(x, y):
    return x + y


from siuba import _, mutate

df = pd.DataFrame({
        "x": [1, 2, 3],
        "y": [4, 5, 6],
        })

df >> mutate(res = add(_.x, _.y) + 100)

[3]:

	x	y	res
0	1	4	105
1	2	5	107
2	3	6	109

One important feature of symbolic_dispatch is how it handles _. When the function is called with _ expressions as arguments, it does not run right away. Instead, it returns a Symbolic object, which lets you use the call inside larger expressions.

[4]:

add(_.x, _.y) + 100

[4]:

█─+
├─█─'__call__'
│ ├─█─'__custom_func__'
│ │ └─<function add at 0x7f1998853cb0>
│ ├─█─.
│ │ ├─_
│ │ └─'x'
│ └─█─.
│   ├─_
│   └─'y'
└─100

Debugging¶

This section covers the four most common issues people seem to hit.

Referring to a column that doesn’t exist
A pandas Series method raising an error
Python syntax errors
Any of the above in a pipe

Note that stack traces shown here are shorter than normal, to help make them clearer. This is something siuba does for SQL by default, and will be implemented for pandas in the future.

[5]:

import pandas as pd
from siuba import mutate, _

df = pd.DataFrame({
    'g': ['a','a','b'],
    'x': [1,2,3]
})

Missing columns¶

[7]:

mutate(df, y = _.X + 1)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-6985da088c9c> in <module>
----> 1 mutate(df, y = _.X + 1)

AttributeError: 'DataFrame' object has no attribute 'X'

In this case, the data doesn’t have a column named “X”. Column names are case-sensitive, so it helps to check the names that actually exist.

[8]:

df.columns

[8]:

Index(['g', 'x'], dtype='object')
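When there are many columns, a near miss like “X” versus “x” can be hard to spot by eye. The helper below is a small sketch, using only the standard library, that suggests existing column names resembling the one that failed. The name suggest_columns is hypothetical (it is not part of siuba or pandas), and it assumes the column names are strings. With a DataFrame you could pass df.columns.

```python
import difflib


def suggest_columns(name, columns, n=3):
    """Return existing column names that look like `name` (hypothetical helper)."""
    columns = list(columns)

    # Names that differ only by case come first (e.g. "X" vs "x").
    same_ignoring_case = [c for c in columns if c.lower() == name.lower()]
    if same_ignoring_case:
        return same_ignoring_case

    # Otherwise fall back to fuzzy matching on lower-cased names.
    lowered = {c.lower(): c for c in columns}
    close = difflib.get_close_matches(name.lower(), list(lowered), n=n)
    return [lowered[c] for c in close]


print(suggest_columns("X", ["g", "x"]))
```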

Series method error¶

[9]:

mutate(df, y = _.x.mean(bad_arg = True))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7d091be01caf> in <module>
----> 1 mutate(df, y = _.x.mean(bad_arg = True))

TypeError: mean() got an unexpected keyword argument 'bad_arg'

In this case, it’s helpful to try replacing _ with the actual data.

[10]:

# expression to debug
_.x.mean(bad_arg = True)

# replacing _ with the data
df.x.mean(bad_arg = True)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-74696ea76235> in <module>
      3
      4 # replacing _ with the data
----> 5 df.x.mean(bad_arg = True)

TypeError: mean() got an unexpected keyword argument 'bad_arg'

Python syntax errors¶

[11]:

df
    >> mutate(y = _.x + 1)

  File "<ipython-input-11-fded324be8e0>", line 2
    >> mutate(y = _.x + 1)
    ^
IndentationError: unexpected indent

In this case, we either need to use a backslash, or put the code in parentheses.

[12]:

df \
    >> mutate(y = _.x + 1)

(df
    >> mutate(y = _.x + 1)
)

[12]:

	g	x	y
0	a	1	2
1	a	2	3
2	b	3	4
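The three layouts can be compared without running them, since Python’s built-in compile only parses the source. The sketch below is an illustration under that assumption: the helper name parses is hypothetical, and df and mutate never need to exist because the code is never executed.

```python
BARE = "df\n    >> mutate(y = _.x + 1)\n"
BACKSLASH = "df \\\n    >> mutate(y = _.x + 1)\n"
PARENS = "(df\n    >> mutate(y = _.x + 1)\n)\n"


def parses(source):
    """Return True if the source is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


for label, source in [("bare", BARE), ("backslash", BACKSLASH), ("parens", PARENS)]:
    print(label, parses(source))
```

The bare version fails because df on its own line is already a complete statement, so the indented >> line is read as the start of a new, unexpectedly indented statement.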

Pipes¶

When the error occurs in a pipe, it’s helpful to comment out parts of the pipe.

For example, consider the three-step pipe below.

[13]:

from siuba import select, arrange, mutate

(df
   >> select(_.g, _.x)
   >> mutate(res = _.X + 1)
   >> arrange(-_.res)
)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-8a0872ddae08> in <module>
      4    >> select(_.g, _.x)
      5    >> mutate(res = _.X + 1)
----> 6    >> arrange(-_.res)
      7 )

AttributeError: 'DataFrame' object has no attribute 'X'

Notice the arrow pointing to line 6. This is not because the error is on that line, but because Python will always point to the last line of a pipe.

Let’s debug by running only the first line, then only the first two, and so on, until we find the error.

[14]:

(df
   >> select(_.g, _.x)
#    >> mutate(res = _.X + 1)
#    >> arrange(-_.res)
)

[14]:

	g	x
0	a	1
1	a	2
2	b	3

Select works okay. Now let’s uncomment the next line.

[15]:

(df
   >> select(_.g, _.x)
   >> mutate(res = _.X + 1)
#    >> arrange(-_.res)
)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-f317e7f1dcb7> in <module>
      1 (df
      2    >> select(_.g, _.x)
----> 3    >> mutate(res = _.X + 1)
      4 #    >> arrange(-_.res)
      5 )

AttributeError: 'DataFrame' object has no attribute 'X'

We found our bug! Note that when working with SQL, siuba prints out the name of the verb where the error occurred. This is very useful, and will be added for pandas in the future!
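The comment-and-rerun routine can also be automated. The sketch below is one possible approach, not a siuba feature: run_steps is a hypothetical helper that applies each step in turn and reports which one failed. It assumes every step is a plain callable that takes the data and returns new data, and it uses a list of dicts in place of a DataFrame so that it runs with only the standard library.

```python
def run_steps(data, steps):
    """Apply each step in order and report the first one that fails."""
    result = data
    for i, step in enumerate(steps, start=1):
        try:
            result = step(result)
        except Exception as err:
            name = getattr(step, "__name__", repr(step))
            raise RuntimeError(f"step {i} ({name}) failed: {err!r}") from err
    return result


def keep_g_and_x(rows):
    return [{"g": r["g"], "x": r["x"]} for r in rows]


def add_res(rows):
    # Bug on purpose: the key is "x", not "X".
    return [{**r, "res": r["X"] + 1} for r in rows]


def sort_by_res(rows):
    return sorted(rows, key=lambda r: -r["res"])


rows = [{"g": "a", "x": 1}, {"g": "a", "x": 2}, {"g": "b", "x": 3}]

message = None
try:
    run_steps(rows, [keep_g_and_x, add_res, sort_by_res])
except RuntimeError as err:
    message = str(err)

print(message)
```

Because the failing step is named in the message, there is no need to rely on which line the traceback arrow points to.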
