[1]:

import pandas as pd
pd.set_option("display.max_rows", 5)

Group by

This function specifies groups in your data, so that verbs like mutate, filter, and summarize perform their operations separately within each group.

For example, in the mtcars dataset, there are 3 possible values for cylinders (cyl). You could use group_by to say that you want to perform operations separately for each of these 3 groups of values.

An important complement to group_by is ungroup, which removes all current groupings.

[2]:

from siuba import _, group_by, ungroup, filter, mutate, summarize
from siuba.data import mtcars

small_cars = mtcars[["cyl", "gear", "hp"]]

small_cars

[2]:

	cyl	gear	hp
0	6	4	110
1	6	4	110
...	...	...	...
30	8	5	335
31	4	4	109

32 rows × 3 columns

Grouping by column

The simplest way to use group_by is to specify your grouping column directly. The example below groups small_cars (a subset of mtcars) by its 3 cylinder values (4, 6, or 8 cylinders).

[3]:

g_cyl = small_cars >> group_by(_.cyl)

g_cyl

[3]:

(grouped data frame)

	cyl	gear	hp
0	6	4	110
1	6	4	110
...	...	...	...
30	8	5	335
31	4	4	109

32 rows × 3 columns

Note that the result is simply a pandas DataFrameGroupBy object, which is what is returned if you use mtcars.groupby('cyl'). Normally, a DataFrameGroupBy doesn’t print out a preview of itself, but siuba modifies it to do so, since this is very handy.
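To make the point about the return type concrete, here is a minimal sketch in plain pandas. The `cars` frame is toy data invented for illustration (not the real mtcars values), and the check uses `pd.core.groupby.DataFrameGroupBy`, the class pandas returns from `DataFrame.groupby`.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not mtcars).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [90, 110, 100, 120, 200, 340],
})

# Plain pandas grouping on the cyl column.
g = cars.groupby("cyl")

# The grouped object is a DataFrameGroupBy with one group per cyl value.
is_grouped = isinstance(g, pd.core.groupby.DataFrameGroupBy)
n_groups = g.ngroups
```

Here `is_grouped` is `True` and `n_groups` is 3, one group for each distinct cylinder value in the toy frame.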

The group_by function is most often used with filter, mutate, and summarize.

[4]:

# keep rows where hp is greater than mean hp within cyl group
g_cyl >> filter(_.hp > _.hp.mean())

[4]:

(grouped data frame)

	cyl	gear	hp
2	4	4	93
6	8	3	245
...	...	...	...
30	8	5	335
31	4	4	109

15 rows × 3 columns
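For comparison, a grouped filter like the one above can be sketched in plain pandas with `transform`. This is one possible pandas formulation on a toy `cars` frame (hypothetical values), not a description of how siuba implements filter.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not mtcars).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [90, 110, 100, 120, 200, 340],
})

# transform("mean") returns one value per row:
# the mean hp of the cyl group that row belongs to.
group_mean = cars.groupby("cyl")["hp"].transform("mean")

# Keep only rows whose hp exceeds the mean of their own group.
above_avg = cars[cars["hp"] > group_mean]
```

Because the comparison is made against each row's own group mean, a car can be kept even if its hp is below the overall mean of the whole frame.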

[5]:

g_cyl >> mutate(avg_hp = _.hp.mean())

[5]:

(grouped data frame)

	cyl	gear	hp	avg_hp
0	6	4	110	122.285714
1	6	4	110	122.285714
...	...	...	...	...
30	8	5	335	209.214286
31	4	4	109	82.636364

32 rows × 4 columns
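A rough pandas counterpart to the grouped mutate above, assuming the same toy `cars` frame (hypothetical values): the group mean is broadcast back to every row, so the row count is unchanged.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not mtcars).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [90, 110, 100, 120, 200, 340],
})

# Add a column holding the mean hp of each row's cyl group.
with_avg = cars.assign(
    avg_hp=cars.groupby("cyl")["hp"].transform("mean")
)
```

Every row in the same `cyl` group receives the same `avg_hp` value, which mirrors the repeated values in the output shown above.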

[6]:

g_cyl >> summarize(avg_hp = _.hp.mean())

[6]:

	cyl	avg_hp
0	4	82.636364
1	6	122.285714
2	8	209.214286
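The grouped summarize above collapses each group to a single row. A minimal pandas sketch of the same idea, using named aggregation on a toy `cars` frame (hypothetical values), might look like this:

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not mtcars).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [90, 110, 100, 120, 200, 340],
})

# One output row per cyl group; as_index=False keeps cyl as a regular column.
summary = cars.groupby("cyl", as_index=False).agg(avg_hp=("hp", "mean"))
```

Unlike the mutate case, the result has as many rows as there are groups (3 here), not as many rows as the input.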

Grouping by multiple columns

To group by multiple columns, pass each of them as an argument to group_by.

[7]:

small_cars >> group_by(_.cyl, _.gear)

[7]:

(grouped data frame)

	cyl	gear	hp
0	6	4	110
1	6	4	110
...	...	...	...
30	8	5	335
31	4	4	109

32 rows × 3 columns
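To see what grouping by two columns means in practice, the sketch below aggregates over each distinct (cyl, gear) pair in plain pandas. The `cars` frame is toy data with hypothetical values.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not mtcars).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "gear": [4, 4, 4, 3, 3, 5],
    "hp": [90, 110, 100, 120, 200, 340],
})

# Each distinct combination of cyl and gear forms its own group.
by_cyl_gear = cars.groupby(["cyl", "gear"], as_index=False).agg(
    avg_hp=("hp", "mean")
)
```

The result has one row per combination that actually occurs in the data, so combinations that never appear (such as 4 cylinders with 5 gears in this toy frame) produce no row.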

Defining a new column for grouping

[8]:

small_cars >> group_by(high_hp = _.hp > 300)

[8]:

(grouped data frame)

	cyl	gear	hp	high_hp
0	6	4	110	False
1	6	4	110	False
...	...	...	...	...
30	8	5	335	True
31	4	4	109	False

32 rows × 4 columns
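As the output above shows, passing a named expression to group_by adds the new column and groups by it. A two-step pandas sketch of the same idea, on a toy `cars` frame (hypothetical values), is:

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not mtcars).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [90, 110, 100, 120, 200, 340],
})

# Step 1: create the boolean column.
flagged = cars.assign(high_hp=cars["hp"] > 300)

# Step 2: group by the new column and count rows per group.
sizes = flagged.groupby("high_hp").size()
```

In this toy frame only one car exceeds 300 hp, so the `True` group has a single row and the `False` group has the remaining five.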

Ungrouping

[9]:

small_cars >> group_by(_.cyl) >> ungroup()

[9]:

	cyl	gear	hp
0	6	4	110
1	6	4	110
...	...	...	...
30	8	5	335
31	4	4	109

32 rows × 3 columns
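In plain pandas, one way to get back to an ungrouped frame is the `.obj` attribute of the grouped object, which holds the DataFrame that was grouped. The sketch below uses a toy `cars` frame (hypothetical values); whether siuba's ungroup relies on this attribute internally is an assumption, not something stated here.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not mtcars).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8, 8],
    "hp": [90, 110, 100, 120, 200, 340],
})

g = cars.groupby("cyl")

# .obj is the underlying, ungrouped DataFrame.
ungrouped = g.obj

is_plain_frame = isinstance(ungrouped, pd.DataFrame)
same_data = ungrouped.equals(cars)
```

Both `is_plain_frame` and `same_data` are `True`: grouping does not alter the underlying rows, so ungrouping simply returns them without the group information.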
