[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)

Group by

This function is used to specify groups in your data for verbs like mutate, filter, and summarize to perform operations over.

For example, in the mtcars dataset, there are 3 possible values for cylinders (cyl). You could use group_by to say that you want to perform operations separately for each of these 3 groups of values.

An important compliment to group_by is ungroup, which removes all current groupings.

[2]:
from siuba import _, group_by, ungroup, filter, mutate, summarize
from siuba.data import mtcars

small_cars = mtcars[["cyl", "gear", "hp"]]

small_cars
[2]:
cyl gear hp
0 6 4 110
1 6 4 110
... ... ... ...
30 8 5 335
31 4 4 109

32 rows × 3 columns

Grouping by column

The simplest way to use group by is to specify your grouping column directly. This is shown below, by grouping mtcars according to its 3 groups of cylinder values (4, 6, or 8 cylinders).

[3]:
g_cyl = small_cars >> group_by(_.cyl)

g_cyl
[3]:

(grouped data frame)

cyl gear hp
0 6 4 110
1 6 4 110
... ... ... ...
30 8 5 335
31 4 4 109

32 rows × 3 columns

Note that the result is simply a pandas GroupedDataFrame, which is what is returned if you use mtcars.groupby('cyl'). Normally, a GroupedDataFrame doesn’t print out a preview of itself, but siuba modifies it to do so, since this is very handy.

The group_by function is most often used with filter, mutate, and summarize.

[4]:
# keep rows where hp is greater than mean hp within cyl group
g_cyl >> filter(_.hp > _.hp.mean())
[4]:

(grouped data frame)

cyl gear hp
2 4 4 93
6 8 3 245
... ... ... ...
30 8 5 335
31 4 4 109

15 rows × 3 columns

[5]:
g_cyl >> mutate(avg_hp = _.hp.mean())
[5]:

(grouped data frame)

cyl gear hp avg_hp
0 6 4 110 122.285714
1 6 4 110 122.285714
... ... ... ... ...
30 8 5 335 209.214286
31 4 4 109 82.636364

32 rows × 4 columns

[6]:
g_cyl >> summarize(avg_hp = _.hp.mean())
[6]:
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286

Grouping by multiple columns

In order to group by multiple columns, simply specify them all as arguments to group_by.

[7]:
small_cars >> group_by(_.cyl, _.gear)
[7]:

(grouped data frame)

cyl gear hp
0 6 4 110
1 6 4 110
... ... ... ...
30 8 5 335
31 4 4 109

32 rows × 3 columns

Defining a new column for grouping

[8]:
small_cars >> group_by(high_hp = _.hp > 300)
[8]:

(grouped data frame)

cyl gear hp high_hp
0 6 4 110 False
1 6 4 110 False
... ... ... ... ...
30 8 5 335 True
31 4 4 109 False

32 rows × 4 columns

Ungrouping

[9]:
small_cars >> group_by(_.cyl) >> ungroup()
[9]:
cyl gear hp
0 6 4 110
1 6 4 110
... ... ... ...
30 8 5 335
31 4 4 109

32 rows × 3 columns

Edit page on github here. Interactive version: Binder badge