[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
Group by¶
This function is used to specify groups in your data for verbs like mutate
, filter
, and summarize
to perform operations over.
For example, in the mtcars
dataset, there are 3 possible values for cylinders (cyl
). You could use group_by
to say that you want to perform operations separately for each of these 3 groups of values.
An important compliment to group_by
is ungroup
, which removes all current groupings.
[2]:
from siuba import _, group_by, ungroup, filter, mutate, summarize
from siuba.data import mtcars
small_cars = mtcars[["cyl", "gear", "hp"]]
small_cars
[2]:
cyl | gear | hp | |
---|---|---|---|
0 | 6 | 4 | 110 |
1 | 6 | 4 | 110 |
... | ... | ... | ... |
30 | 8 | 5 | 335 |
31 | 4 | 4 | 109 |
32 rows × 3 columns
Grouping by column¶
The simplest way to use group by is to specify your grouping column directly. This is shown below, by grouping mtcars
according to its 3 groups of cylinder values (4, 6, or 8 cylinders).
[3]:
g_cyl = small_cars >> group_by(_.cyl)
g_cyl
[3]:
(grouped data frame)
cyl | gear | hp | |
---|---|---|---|
0 | 6 | 4 | 110 |
1 | 6 | 4 | 110 |
... | ... | ... | ... |
30 | 8 | 5 | 335 |
31 | 4 | 4 | 109 |
32 rows × 3 columns
Note that the result is simply a pandas GroupedDataFrame, which is what is returned if you use mtcars.groupby('cyl')
. Normally, a GroupedDataFrame doesn’t print out a preview of itself, but siuba
modifies it to do so, since this is very handy.
The group_by
function is most often used with filter
, mutate
, and summarize
.
[4]:
# keep rows where hp is greater than mean hp within cyl group
g_cyl >> filter(_.hp > _.hp.mean())
[4]:
(grouped data frame)
cyl | gear | hp | |
---|---|---|---|
2 | 4 | 4 | 93 |
6 | 8 | 3 | 245 |
... | ... | ... | ... |
30 | 8 | 5 | 335 |
31 | 4 | 4 | 109 |
15 rows × 3 columns
[5]:
g_cyl >> mutate(avg_hp = _.hp.mean())
[5]:
(grouped data frame)
cyl | gear | hp | avg_hp | |
---|---|---|---|---|
0 | 6 | 4 | 110 | 122.285714 |
1 | 6 | 4 | 110 | 122.285714 |
... | ... | ... | ... | ... |
30 | 8 | 5 | 335 | 209.214286 |
31 | 4 | 4 | 109 | 82.636364 |
32 rows × 4 columns
[6]:
g_cyl >> summarize(avg_hp = _.hp.mean())
[6]:
cyl | avg_hp | |
---|---|---|
0 | 4 | 82.636364 |
1 | 6 | 122.285714 |
2 | 8 | 209.214286 |
Grouping by multiple columns¶
In order to group by multiple columns, simply specify them all as arguments to group_by
.
[7]:
small_cars >> group_by(_.cyl, _.gear)
[7]:
(grouped data frame)
cyl | gear | hp | |
---|---|---|---|
0 | 6 | 4 | 110 |
1 | 6 | 4 | 110 |
... | ... | ... | ... |
30 | 8 | 5 | 335 |
31 | 4 | 4 | 109 |
32 rows × 3 columns
Defining a new column for grouping¶
[8]:
small_cars >> group_by(high_hp = _.hp > 300)
[8]:
(grouped data frame)
cyl | gear | hp | high_hp | |
---|---|---|---|---|
0 | 6 | 4 | 110 | False |
1 | 6 | 4 | 110 | False |
... | ... | ... | ... | ... |
30 | 8 | 5 | 335 | True |
31 | 4 | 4 | 109 | False |
32 rows × 4 columns
Ungrouping¶
[9]:
small_cars >> group_by(_.cyl) >> ungroup()
[9]:
cyl | gear | hp | |
---|---|---|---|
0 | 6 | 4 | 110 |
1 | 6 | 4 | 110 |
... | ... | ... | ... |
30 | 8 | 5 | 335 |
31 | 4 | 4 | 109 |
32 rows × 3 columns
Edit page on github here. Interactive version: