[1]:

import pandas as pd
pd.set_option("display.max_rows", 5)

Summarize

The summarize function defines new columns, each holding a single value calculated either across all the data or within each specified group. The result is a DataFrame with one row per unique group or, if no groups are defined, a single row.

[2]:

from siuba import _, group_by, summarize
from siuba.data import mtcars

Summarize over everything

When you use summarize with an ungrouped DataFrame, the result is a single row.

[3]:

mtcars >> summarize(avg_mpg = _.mpg.mean())

[3]:

	avg_mpg
0	20.090625
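For comparison, the same single-row result can be approximated in plain pandas. This is only a sketch: the `cars` DataFrame below is a small made-up stand-in for mtcars, so its values do not match the output above.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars (values are made up).
cars = pd.DataFrame({"cyl": [4, 4, 6, 8], "mpg": [30.0, 26.0, 20.0, 15.0]})

# Wrap the scalar mean in a one-row DataFrame, roughly what
# summarize(avg_mpg = _.mpg.mean()) returns for ungrouped data.
overall = pd.DataFrame({"avg_mpg": [cars["mpg"].mean()]})
print(overall)
```

The point of the sketch is the shape: one aggregate in, one row out, regardless of how many rows the input has.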

Summarizing per group

When you use summarize with a grouped DataFrame, the result has one row per group. For example, the cylinders column (cyl) takes 3 distinct values in mtcars (4, 6, or 8), so the result has 3 rows.

[4]:

(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_mpg = _.mpg.mean())
  )

[4]:

	cyl	avg_mpg
0	4	26.663636
1	6	19.742857
2	8	15.100000
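A rough plain-pandas counterpart of the grouped summary uses `groupby` with named aggregation. As a sketch under stated assumptions, it runs on a made-up `cars` DataFrame rather than mtcars, so the numbers differ from the output above.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars (values are made up).
cars = pd.DataFrame({"cyl": [4, 4, 6, 8], "mpg": [30.0, 26.0, 20.0, 15.0]})

# as_index=False keeps cyl as a regular column, similar to the summarize output.
# Named aggregation: new_column=(source_column, aggregation).
per_cyl = cars.groupby("cyl", as_index=False).agg(avg_mpg=("mpg", "mean"))
print(per_cyl)
```

The result has one row per distinct `cyl` value, which is the same row-count rule described for summarize on grouped data.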

Note that summarize also accepts a constant, such as a string or number, which is repeated in every row of the result.

[5]:

(mtcars
  >> group_by(_.cyl)
  >> summarize(
       measure = "mean miles per gallon",
       value = _.mpg.mean()
       )
  )

[5]:

	cyl	measure	value
0	4	mean miles per gallon	26.663636
1	6	mean miles per gallon	19.742857
2	8	mean miles per gallon	15.100000
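In plain pandas, a constant label column can be added with `assign`, which broadcasts a scalar to every row. The sketch below again uses a made-up `cars` DataFrame in place of mtcars, so it illustrates the layout rather than reproducing the values above.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars (values are made up).
cars = pd.DataFrame({"cyl": [4, 4, 6, 8], "mpg": [30.0, 26.0, 20.0, 15.0]})

labelled = (
    cars.groupby("cyl", as_index=False)
    .agg(value=("mpg", "mean"))
    # assign repeats the scalar string in every row of the summary
    .assign(measure="mean miles per gallon")
    # reorder columns to match the cyl / measure / value layout
    .loc[:, ["cyl", "measure", "value"]]
)
print(labelled)
```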
