[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
Filter¶
This function keeps rows of data that meet all specified conditions.
At its core, it follows the three following rules:
if each condition is true for a row, then it keeps that row.
it does not keep a row when a condition results in NA values.
it works with normal functions, including lambdas.
[2]:
from siuba import _, filter, group_by
from siuba.data import mtcars
Filter basics¶
This section demonstrates each of the three rules listed above.
For example, in the code below, a row must meet two conditions to be kept in the data. You can verbalize these conditions as, “cyl
is equal to four, and gear
is equal to five”.
[3]:
mtcars >> filter(_.cyl == 4, _.gear == 5)
[3]:
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
26 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.7 | 0 | 1 | 5 | 2 |
27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |
As with most subsetting in pandas, when a condition evalutes to an NA
value, the row is automatically excluded. This is different from pandas indexing, where NA
values produce errors.
[4]:
df = pd.DataFrame({
"x": [True, False, None],
})
df
[4]:
x | |
---|---|
0 | True |
1 | False |
2 | None |
[5]:
df >> filter(_.x)
[5]:
x | |
---|---|
0 | True |
Finally, while the simplest method is to use filter
with siu expressions, both these forms work:
_.cyl == 4
lambda _: _.cyl == 4
[6]:
mtcars >> filter(lambda _: _.cyl == 4, lambda _: _.gear == 5)
[6]:
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
26 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.7 | 0 | 1 | 5 | 2 |
27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |
Under the hood, filter
will call the lambda on the underlying data. However, keep in mind that using a lambda function prevents siuba from working with SQL, and from optimizing filter conditions over pandas DataFrames in the future.
Grouped filters¶
In the example below, we keep rows where the horsepower (hp) is above the median horsepower for each cylinder group. This means that the filter performas the following operations.
calculates the median
hp
for eachcyl
grouping.for each row, based on the
cyl
group it belongs to, tests where thehp
is greater than the medianhp
.keeps only rows where this test passes.
[7]:
(mtcars
>> group_by(_.cyl)
>> filter(_.hp > _.hp.median())
)
[7]:
(grouped data frame)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
6 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.60 | 0 | 1 | 5 | 8 |
31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.60 | 1 | 1 | 4 | 2 |
15 rows × 11 columns
Filter with siuba vector functions¶
The siuba library includes a number of helper functions for operating on arrays of data (i.e. pandas Series or numpy arrays).
These functions can also do other useful calculations inside a filter, like…
return the row number of an entry
calculate a cumulative sum
shift all values down (lag) or up (lead) one
Keeping the two lowest horsepower rows per cylinder¶
For example, in the code below we filter to…
sort the data by ascending horse power (
hp
).get the row number of each entry within a
cyl
group.filter it to keep the two lowest
hp
cars per group.
Note that since there are 3 cyl
groups (4, 6, or 8 cylinders), this returns 6 rows.
[8]:
from siuba.dply.vector import row_number, lag
from siuba import arrange
(mtcars
>> arrange(_.hp)
>> group_by(_.cyl)
>> filter(row_number(_) <= 2)
)
[8]:
(grouped data frame)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
5 | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
21 | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
22 | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
6 rows × 11 columns
Comparing shifts in hp across rows¶
Below we sort the data by ascending horsepower (hp
), then filter to keep rows where it increases by 50.
[9]:
(mtcars
>> arrange(_.hp)
>> filter(_.hp - lag(_.hp) > 50)
)
[9]:
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |
Edit page on github here. Interactive version: