[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
Summarize
This function lets you define new columns that each hold a single value, calculated either across all the data or within specified groups. The result is a DataFrame with one row per unique group, or a single row if no groups are defined.
[2]:
from siuba import _, group_by, summarize
from siuba.data import mtcars
Summarize over everything
When you use summarize with an ungrouped DataFrame, the result is a single row.
[3]:
mtcars >> summarize(avg_mpg = _.mpg.mean())
[3]:
|   | avg_mpg |
|---|---|
| 0 | 20.090625 |
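For comparison, the same one-row shape can be sketched in plain pandas. This is only an illustration of what the output looks like, not a description of how siuba implements summarize; the small `cars` frame is made-up toy data standing in for mtcars.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars (values are made up).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 8],
    "mpg": [30.0, 20.0, 18.0, 12.0],
})

# Collapse the whole frame to a single row holding one summary value.
overall = pd.DataFrame({"avg_mpg": [cars["mpg"].mean()]})
print(overall)
```

However many rows the input has, `overall` always has exactly one.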
Summarizing per group
When you use summarize with a grouped DataFrame, the result has one row per group. For example, the number of cylinders (cyl) takes 3 distinct values in mtcars (4, 6, or 8), so the result has 3 rows.
[4]:
(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_mpg = _.mpg.mean())
)
[4]:
|   | cyl | avg_mpg |
|---|---|---|
| 0 | 4 | 26.663636 |
| 1 | 6 | 19.742857 |
| 2 | 8 | 15.100000 |
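A rough pandas counterpart of the grouped summary uses `groupby` with named aggregation. This is a sketch for comparison only, and the `cars` frame is again made-up toy data rather than mtcars.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars (values are made up).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 8, 8],
    "mpg": [30.0, 20.0, 18.0, 14.0, 10.0],
})

# One output row per distinct cyl value; as_index=False keeps cyl as a column.
per_cyl = cars.groupby("cyl", as_index=False).agg(avg_mpg=("mpg", "mean"))
print(per_cyl)

# The number of rows matches the number of groups.
print(len(per_cyl) == cars["cyl"].nunique())
```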
Note that summarize also accepts a constant value, like a string or number, which is repeated in every row of the result.
[5]:
(mtcars
  >> group_by(_.cyl)
  >> summarize(
      measure = "mean miles per gallon",
      value = _.mpg.mean()
  )
)
[5]:
|   | cyl | measure | value |
|---|---|---|---|
| 0 | 4 | mean miles per gallon | 26.663636 |
| 1 | 6 | mean miles per gallon | 19.742857 |
| 2 | 8 | mean miles per gallon | 15.100000 |
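In plain pandas, a similar labelled summary could be sketched by aggregating first and then attaching the constant with `assign`. This is an assumption-laden comparison on made-up toy data, not siuba's own mechanism.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars (values are made up).
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 8, 8],
    "mpg": [30.0, 20.0, 18.0, 14.0, 10.0],
})

labelled = (
    cars.groupby("cyl", as_index=False)
    .agg(value=("mpg", "mean"))
    # A scalar passed to assign is broadcast to every row.
    .assign(measure="mean miles per gallon")
    # Reorder columns to match the layout shown above.
    [["cyl", "measure", "value"]]
)
print(labelled)
```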