[1]:

import pandas as pd
pd.set_option("display.max_rows", 5)

Filter¶

This function keeps rows of data that meet all specified conditions.

At its core, it follows the three following rules:

if each condition is true for a row, then it keeps that row.
it does not keep a row when a condition results in NA values.
it works with normal functions, including lambdas.

[2]:

from siuba import _, filter, group_by
from siuba.data import mtcars

Filter basics¶

This section demonstrates each of the three rules listed above.

For example, in the code below, a row must meet two conditions to be kept in the data. You can verbalize these conditions as, “cyl is equal to four, and gear is equal to five”.

[3]:

mtcars >> filter(_.cyl == 4, _.gear == 5)

[3]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	26.0	4	120.3	91	4.43	2.140	16.7	0	1	5	2
27	30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

As with most subsetting in pandas, when a condition evalutes to an NA value, the row is automatically excluded. This is different from pandas indexing, where NA values produce errors.

[4]:

df = pd.DataFrame({
    "x": [True, False, None],
    })

df

[4]:

	x
0	True
1	False
2	None

[5]:

df >> filter(_.x)

[5]:

	x
0	True

Finally, while the simplest method is to use filter with siu expressions, both these forms work:

_.cyl == 4
lambda _: _.cyl == 4

[6]:

mtcars >> filter(lambda _: _.cyl == 4, lambda _: _.gear == 5)

[6]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	26.0	4	120.3	91	4.43	2.140	16.7	0	1	5	2
27	30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

Under the hood, filter will call the lambda on the underlying data. However, keep in mind that using a lambda function prevents siuba from working with SQL, and from optimizing filter conditions over pandas DataFrames in the future.

Grouped filters¶

In the example below, we keep rows where the horsepower (hp) is above the median horsepower for each cylinder group. This means that the filter performas the following operations.

calculates the median hp for each cyl grouping.
for each row, based on the cyl group it belongs to, tests where the hp is greater than the median hp.
keeps only rows where this test passes.

[7]:

(mtcars
  >> group_by(_.cyl)
  >> filter(_.hp > _.hp.median())
  )

[7]:

(grouped data frame)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
2	22.8	4	108.0	93	3.85	2.32	18.61	1	1	4	1
6	14.3	8	360.0	245	3.21	3.57	15.84	0	0	3	4
...	...	...	...	...	...	...	...	...	...	...	...
30	15.0	8	301.0	335	3.54	3.57	14.60	0	1	5	8
31	21.4	4	121.0	109	4.11	2.78	18.60	1	1	4	2

15 rows × 11 columns

Filter with siuba vector functions¶

The siuba library includes a number of helper functions for operating on arrays of data (i.e. pandas Series or numpy arrays).

These functions can also do other useful calculations inside a filter, like…

return the row number of an entry
calculate a cumulative sum
shift all values down (lag) or up (lead) one

Keeping the two lowest horsepower rows per cylinder¶

For example, in the code below we filter to…

sort the data by ascending horse power (hp).
get the row number of each entry within a cyl group.
filter it to keep the two lowest hp cars per group.

Note that since there are 3 cyl groups (4, 6, or 8 cylinders), this returns 6 rows.

[8]:

from siuba.dply.vector import row_number, lag
from siuba import arrange

(mtcars
  >> arrange(_.hp)
  >> group_by(_.cyl)
  >> filter(row_number(_) <= 2)
  )

[8]:

(grouped data frame)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
5	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
...	...	...	...	...	...	...	...	...	...	...	...
21	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2
22	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2

6 rows × 11 columns

Comparing shifts in hp across rows¶

Below we sort the data by ascending horsepower (hp), then filter to keep rows where it increases by 50.

[9]:

(mtcars
  >> arrange(_.hp)
  >> filter(_.hp - lag(_.hp) > 50)
  )

[9]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
30	15.0	8	301.0	335	3.54	3.57	14.6	0	1	5	8

Edit page on github here. Interactive version:

siuba

Navigation

Related Topics

Filter¶

Filter basics¶

Grouped filters¶

Filter with siuba vector functions¶

Keeping the two lowest horsepower rows per cylinder¶

Comparing shifts in hp across rows¶