--- jupyter: jupytext: text_representation: extension: .Rmd format_name: rmarkdown format_version: '1.1' jupytext_version: 1.1.1 kernelspec: display_name: Python 3 language: python name: python3 --- ```{python nbsphinx=hidden} import pandas as pd pd.set_option("display.max_rows", 5) ``` ## Group by This function is used to specify groups in your data for verbs like `mutate`, `filter`, and `summarize` to perform operations over. For example, in the `mtcars` dataset, there are 3 possible values for cylinders (`cyl`). You could use `group_by` to say that you want to perform operations separately for each of these 3 groups of values. An important compliment to `group_by` is `ungroup`, which removes all current groupings. ```{python} from siuba import _, group_by, ungroup, filter, mutate, summarize from siuba.data import mtcars small_cars = mtcars[["cyl", "gear", "hp"]] small_cars ``` ### Grouping by column The simplest way to use group by is to specify your grouping column directly. This is shown below, by grouping `mtcars` according to its 3 groups of cylinder values (4, 6, or 8 cylinders). ```{python} g_cyl = small_cars >> group_by(_.cyl) g_cyl ``` Note that the result is simply a pandas GroupedDataFrame, which is what is returned if you use `mtcars.groupby('cyl')`. Normally, a GroupedDataFrame doesn't print out a preview of itself, but `siuba` modifies it to do so, since this is very handy. The `group_by` function is most often used with `filter`, `mutate`, and `summarize`. ```{python} # keep rows where hp is greater than mean hp within cyl group g_cyl >> filter(_.hp > _.hp.mean()) ``` ```{python} g_cyl >> mutate(avg_hp = _.hp.mean()) ``` ```{python} g_cyl >> summarize(avg_hp = _.hp.mean()) ``` ### Grouping by multiple columns In order to group by multiple columns, simply specify them all as arguments to `group_by`. ```{python} small_cars >> group_by(_.cyl, _.gear) ``` ### Defining a new column for grouping ```{python} small_cars >> group_by(high_hp = _.hp > 300) ``` ### Ungrouping ```{python} small_cars >> group_by(_.cyl) >> ungroup() ```