[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
Distinct
The distinct function keeps only the unique values of a specified column. If multiple columns are specified, it keeps only the unique combinations of values across those columns.
[2]:
from siuba import _, distinct
from siuba.data import mtcars
Specifying distinct columns
[3]:
mtcars >> distinct(_.cyl, _.gear)
[3]:
| | cyl | gear |
|---|---|---|
| 0 | 6 | 4 |
| 1 | 4 | 4 |
| ... | ... | ... |
| 6 | 8 | 5 |
| 7 | 6 | 5 |
8 rows × 2 columns
Note that by default, only the columns passed to distinct are returned.
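For readers more familiar with plain pandas, the behavior described above can be approximated with column selection followed by drop_duplicates. This is only a rough sketch on a small, made-up data frame (the cars frame and its values are hypothetical stand-ins for mtcars), not a statement of how siuba implements distinct.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars (values are made up).
cars = pd.DataFrame({
    "cyl":  [6, 6, 4, 8, 4],
    "gear": [4, 4, 4, 3, 4],
    "mpg":  [21.0, 21.0, 22.8, 18.7, 24.4],
})

# Keep only the columns of interest, drop repeated (cyl, gear) pairs,
# and renumber the remaining rows from 0.
distinct_pairs = (
    cars[["cyl", "gear"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(distinct_pairs)
```

Only cyl and gear appear in the result, which mirrors the default of returning just the columns passed in.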
Keeping all other columns
To keep all the columns from the data, use the _keep_all argument. In this case, the first row encountered for each set of distinct values is returned.
[4]:
mtcars >> distinct(_.cyl, _.gear, _keep_all = True)
[4]:
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6 | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.17 | 14.50 | 0 | 1 | 5 | 4 |
| 7 | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.77 | 15.50 | 0 | 1 | 5 | 6 |
8 rows × 11 columns
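The "first row encountered" rule can be illustrated in plain pandas with drop_duplicates and its subset and keep parameters. The sketch below uses a hypothetical cars frame with made-up values; it is meant as a comparable pandas idiom, not as siuba's actual implementation.

```python
import pandas as pd

# Hypothetical toy data; rows 0 and 1 share the same (cyl, gear) pair
# but differ in mpg, so we can see which row survives.
cars = pd.DataFrame({
    "cyl":  [6, 6, 4, 8, 4],
    "gear": [4, 4, 4, 3, 4],
    "mpg":  [21.0, 21.4, 22.8, 18.7, 24.4],
})

# Deduplicate on (cyl, gear) only, keeping every column and the
# first row seen for each pair.
first_rows = (
    cars
    .drop_duplicates(subset=["cyl", "gear"], keep="first")
    .reset_index(drop=True)
)

print(first_rows)
```

Here the row with mpg 21.0 is kept and the later row with mpg 21.4 is dropped, because both share cyl 6 and gear 4.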
Specifying column expressions
The distinct function also accepts column expressions, as long as they are passed as keyword arguments. This is illustrated below by calculating the distinct values of mpg, rounded to the nearest whole number.
[5]:
mtcars >> distinct(round_mpg = _.mpg.round())
[5]:
| | round_mpg |
|---|---|
| 0 | 21.0 |
| 1 | 23.0 |
| ... | ... |
| 16 | 26.0 |
| 17 | 20.0 |
18 rows × 1 columns
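A comparable result in plain pandas can be sketched by first computing the new column with assign and then deduplicating on it. The cars frame and its values below are hypothetical, and the example only approximates the behavior shown above.

```python
import pandas as pd

# Hypothetical toy data (values chosen to avoid .5 rounding ties).
cars = pd.DataFrame({"mpg": [21.0, 21.4, 22.8, 18.7, 23.1]})

# Compute the rounded column, keep only that column,
# then drop repeated values and renumber the rows.
round_mpg = (
    cars
    .assign(round_mpg=lambda d: d["mpg"].round())
    [["round_mpg"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(round_mpg)
```

As in the siuba output, the keyword name (round_mpg) becomes the name of the single returned column.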