[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)

Distinct

The distinct function keeps only the unique values of a specified column. If multiple columns are specified, it keeps only the unique combinations of values across those columns.

[2]:
from siuba import _, distinct
from siuba.data import mtcars

Specifying distinct columns

[3]:
mtcars >> distinct(_.cyl, _.gear)
[3]:
cyl gear
0 6 4
1 4 4
... ... ...
6 8 5
7 6 5

8 rows × 2 columns

Note that by default, only the columns that are passed to distinct are returned.
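For readers more familiar with plain pandas, the behavior above can be approximated with drop_duplicates. The following is a minimal sketch, not siuba's implementation. The `cars` data frame is a small toy table made up for illustration (it is not the mtcars data), and `unique_pairs` is a hypothetical name.

```python
import pandas as pd

# Toy data for illustration only (not the mtcars dataset).
cars = pd.DataFrame({
    "mpg": [21.0, 21.4, 22.8, 18.7],
    "cyl": [6, 6, 4, 8],
    "gear": [4, 4, 4, 3],
})

# Keep only the columns of interest, drop repeated rows,
# and renumber the index from zero.
unique_pairs = cars[["cyl", "gear"]].drop_duplicates().reset_index(drop=True)
print(unique_pairs)
```

As with distinct, the result in this sketch contains only the selected columns, with one row per unique combination.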

Keeping all other columns

To keep all the columns from the data, set the _keep_all argument to True. In this case, the first row encountered for each set of distinct values is returned.

[4]:
mtcars >> distinct(_.cyl, _.gear, _keep_all = True)
[4]:
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.62 16.46 0 1 4 4
1 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
... ... ... ... ... ... ... ... ... ... ... ...
6 15.8 8 351.0 264 4.22 3.17 14.50 0 1 5 4
7 19.7 6 145.0 175 3.62 2.77 15.50 0 1 5 6

8 rows × 11 columns
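The "first row encountered" rule can be illustrated in plain pandas with the subset argument of drop_duplicates, which by default keeps the first occurrence. This is a rough sketch under the assumption that a pandas analogue is helpful here. The `cars` table is toy data made up for illustration, and `first_rows` is a hypothetical name.

```python
import pandas as pd

# Toy data for illustration only (not the mtcars dataset).
cars = pd.DataFrame({
    "mpg": [21.0, 21.4, 22.8, 18.7],
    "cyl": [6, 6, 4, 8],
    "gear": [4, 4, 4, 3],
})

# Compare rows on cyl and gear only, but keep every column.
# The default keep="first" retains the first row of each group.
first_rows = cars.drop_duplicates(subset=["cyl", "gear"]).reset_index(drop=True)
print(first_rows)
```

In this sketch the two rows with cyl 6 and gear 4 collapse into one, and the mpg value that survives comes from the earlier of the two rows.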

Specifying column expressions

The distinct function also accepts column expressions, as long as they are passed as keyword arguments. The example below calculates the distinct values of mpg, rounded to the nearest whole number.

[5]:
mtcars >> distinct(round_mpg = _.mpg.round())
[5]:
round_mpg
0 21.0
1 23.0
... ...
16 26.0
17 20.0

18 rows × 1 columns
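A comparable result can be sketched in plain pandas by first computing the new column and then dropping duplicates on it. This is an approximation for illustration, not siuba's implementation. The `cars` table is toy data made up for this example, and `rounded` and `unique_rounded` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only (not the mtcars dataset).
cars = pd.DataFrame({
    "mpg": [21.0, 21.4, 22.8, 18.7],
    "cyl": [6, 6, 4, 8],
    "gear": [4, 4, 4, 3],
})

# Step 1: compute the expression as a new, named column.
rounded = cars.assign(round_mpg=cars["mpg"].round())

# Step 2: keep only that column and drop repeated values.
unique_rounded = rounded[["round_mpg"]].drop_duplicates().reset_index(drop=True)
print(unique_rounded)
```

The keyword name plays the same role in both versions: it supplies the name of the resulting column.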
