[1]:

import pandas as pd
pd.set_option("display.max_rows", 5)

Distinct

The distinct function keeps only the unique values of a specified column. If multiple columns are specified, it keeps only the unique combinations of values across those columns.

[2]:

from siuba import _, distinct
from siuba.data import mtcars

Specifying distinct columns

[3]:

mtcars >> distinct(_.cyl, _.gear)

[3]:

	cyl	gear
0	6	4
1	4	4
...	...	...
6	8	5
7	6	5

8 rows × 2 columns

Note that by default, only the columns that are passed to distinct are returned.
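For readers more familiar with plain pandas, the call above behaves roughly like selecting the columns and then dropping duplicate rows. The sketch below is only an illustration of that idea, not siuba's actual implementation. The cars data frame and its values are made up for the example and are not the real mtcars data.

```python
import pandas as pd

# Hypothetical toy data (NOT the real mtcars values).
cars = pd.DataFrame({
    "cyl":  [6, 6, 4, 8, 4],
    "gear": [4, 4, 4, 3, 4],
    "mpg":  [21.0, 21.4, 22.8, 18.7, 24.4],
})

# Keep only the columns of interest, drop repeated combinations,
# and renumber the remaining rows from 0.
unique_pairs = (
    cars[["cyl", "gear"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(unique_pairs)
```

Only cyl and gear appear in the result, which mirrors the default behavior described above.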

Keeping all other columns

To keep all columns from the data, use the _keep_all argument. In this case, the first row encountered for each distinct combination of values is returned.

[4]:

mtcars >> distinct(_.cyl, _.gear, _keep_all=True)

[4]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	21.0	6	160.0	110	3.90	2.62	16.46	0	1	4	4
1	22.8	4	108.0	93	3.85	2.32	18.61	1	1	4	1
...	...	...	...	...	...	...	...	...	...	...	...
6	15.8	8	351.0	264	4.22	3.17	14.50	0	1	5	4
7	19.7	6	145.0	175	3.62	2.77	15.50	0	1	5	6

8 rows × 11 columns
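The "first row encountered" behavior can be sketched in plain pandas with drop_duplicates and its subset parameter, which by default keeps the first occurrence of each combination. This is an approximation for illustration, assuming a small made-up data frame named cars (not the real mtcars data).

```python
import pandas as pd

# Hypothetical toy data (NOT the real mtcars values).
cars = pd.DataFrame({
    "cyl":  [6, 6, 4, 8, 4],
    "gear": [4, 4, 4, 3, 4],
    "mpg":  [21.0, 21.4, 22.8, 18.7, 24.4],
})

# Deduplicate on cyl and gear only, but keep every column.
# keep="first" is the pandas default, so the first row seen for
# each (cyl, gear) combination is the one that survives.
first_rows = (
    cars
    .drop_duplicates(subset=["cyl", "gear"])
    .reset_index(drop=True)
)

print(first_rows)
```

Here the two rows with cyl 6 and gear 4 collapse to the first one, so the mpg value 21.0 is kept and 21.4 is dropped.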

Specifying column expressions

The distinct function also accepts column expressions, as long as they are passed as keyword arguments. The example below computes the distinct values of mpg, rounded to the nearest whole number.

[5]:

mtcars >> distinct(round_mpg=_.mpg.round())

[5]:

	round_mpg
0	21.0
1	23.0
...	...
16	26.0
17	20.0

18 rows × 1 columns
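A comparable result can be sketched in plain pandas by first computing the new column and then deduplicating on it. As before, this is a rough illustration rather than siuba's implementation, and the cars data frame and its values are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data (NOT the real mtcars values).
cars = pd.DataFrame({
    "mpg": [21.0, 21.4, 22.8, 18.7, 22.6],
})

# Compute the expression as a new column, keep only that column,
# then drop repeated values and renumber the rows.
rounded = (
    cars
    .assign(round_mpg=cars["mpg"].round())
    [["round_mpg"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(rounded)
```

The keyword name (round_mpg) becomes the name of the resulting column, just as in the distinct call above.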
