```
[1]:
```

```
import pandas as pd
pd.set_option("display.max_rows", 5)
```

# Filter¶

This function keeps rows of data that meet all specified conditions.

At its core, it follows the three following rules:

if each condition is true for a row, then it keeps that row.

it does not keep a row when a condition results in NA values.

it works with normal functions, including lambdas.

```
[2]:
```

```
from siuba import _, filter, group_by
from siuba.data import mtcars
```

## Filter basics¶

This section demonstrates each of the three rules listed above.

For example, in the code below, a row must meet two conditions to be kept in the data. You can verbalize these conditions as, “`cyl`

is equal to four, and `gear`

is equal to five”.

```
[3]:
```

```
mtcars >> filter(_.cyl == 4, _.gear == 5)
```

```
[3]:
```

mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|

26 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.7 | 0 | 1 | 5 | 2 |

27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |

## Filters with OR conditions¶

In order to keep a row when one of several conditions is met, use the bar (`|`

) operator.

```
[4]:
```

```
mtcars >> filter((_.cyl == 4) | (_.gear == 5))
```

```
[4]:
```

mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|

2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |

7 | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.00 | 1 | 0 | 4 | 2 |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.60 | 0 | 1 | 5 | 8 |

31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.60 | 1 | 1 | 4 | 2 |

14 rows × 11 columns

The code above keeps rows where `cyl`

is equal to 4 OR `gear`

is equal to 5.

Be sure to explicitly put parentheses around both sides. Otherwise, python will group the operation like `_.cyl == (4 | _.gear) == 5`

.

## Dropping NAs¶

As with most subsetting in pandas, when a condition evaluates to an `NA`

value, the row is automatically excluded. This is different from pandas indexing, where `NA`

values produce errors.

```
[5]:
```

```
df = pd.DataFrame({
"x": [True, False, None],
})
df
```

```
[5]:
```

x | |
---|---|

0 | True |

1 | False |

2 | None |

```
[6]:
```

```
df >> filter(_.x)
```

```
[6]:
```

x | |
---|---|

0 | True |

Finally, while the simplest method is to use `filter`

with siu expressions, both these forms work:

`_.cyl == 4`

`lambda _: _.cyl == 4`

```
[7]:
```

```
mtcars >> filter(lambda _: _.cyl == 4, lambda _: _.gear == 5)
```

```
[7]:
```

mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|

26 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.7 | 0 | 1 | 5 | 2 |

27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |

Under the hood, `filter`

will call the lambda on the underlying data. However, keep in mind that using a lambda function prevents siuba from working with SQL, and from optimizing filter conditions over pandas DataFrames in the future.

## Grouped filters¶

In the example below, we keep rows where the horsepower (hp) is above the median horsepower for each cylinder group. This means that the filter performas the following operations.

calculates the median

`hp`

for each`cyl`

grouping.for each row, based on the

`cyl`

group it belongs to, tests where the`hp`

is greater than the median`hp`

.keeps only rows where this test passes.

```
[8]:
```

```
(mtcars
>> group_by(_.cyl)
>> filter(_.hp > _.hp.median())
)
```

```
[8]:
```

(grouped data frame)

mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|

2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |

6 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.60 | 0 | 1 | 5 | 8 |

31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.60 | 1 | 1 | 4 | 2 |

15 rows × 11 columns

## Filter with siuba vector functions¶

The siuba library includes a number of helper functions for operating on arrays of data (i.e. pandas Series or numpy arrays).

These functions can also do other useful calculations inside a filter, like…

return the row number of an entry

calculate a cumulative sum

shift all values down (lag) or up (lead) one

### Keeping the two lowest horsepower rows per cylinder¶

For example, in the code below we filter to…

sort the data by ascending horse power (

`hp`

).get the row number of each entry within a

`cyl`

group.filter it to keep the two lowest

`hp`

cars per group.

Note that since there are 3 `cyl`

groups (4, 6, or 8 cylinders), this returns 6 rows.

```
[9]:
```

```
from siuba.dply.vector import row_number, lag
from siuba import arrange
(mtcars
>> arrange(_.hp)
>> group_by(_.cyl)
>> filter(row_number(_) <= 2)
)
```

```
[9]:
```

(grouped data frame)

mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|

0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |

5 | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

21 | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |

22 | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |

6 rows × 11 columns

### Comparing shifts in hp across rows¶

Below we sort the data by ascending horsepower (`hp`

), then filter to keep rows where it increases by 50.

```
[10]:
```

```
(mtcars
>> arrange(_.hp)
>> filter(_.hp - lag(_.hp) > 50)
)
```

```
[10]:
```

mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|

30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |

Edit page on github here. Interactive version: