[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
Select¶
This function lets you select specific columns of your data to keep. Each selection may include up to three pieces…
specifying column(s) to include or remove
excluding some specified columns
renaming a column
The documentation below will illustrate these pieces when specifying one column at a time, specifying multiple columns, or searching columns using functions like contains
.
[2]:
from siuba import _, select
from siuba.data import mtcars
mtcars
[2]:
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
32 rows × 11 columns
Specifying one column at a time¶
Specify columns by name or position¶
The cleanest way to specify a column is to refer to it by name. By default, referring to a column will keep it.
[3]:
mtcars >> select(_.mpg, _.cyl)
[3]:
mpg | cyl | |
---|---|---|
0 | 21.0 | 6 |
1 | 21.0 | 6 |
... | ... | ... |
30 | 15.0 | 8 |
31 | 21.4 | 4 |
32 rows × 2 columns
This approach ensures that you can easily rename, or exclude it from the data (shown in following sections). However, you can also refer to a column using a string, or its 0-indexed column position.
[4]:
# two other ways to keep the same columns
mtcars >> select(0, 1)
mtcars >> select("mpg", "cyl")
[4]:
mpg | cyl | |
---|---|---|
0 | 21.0 | 6 |
1 | 21.0 | 6 |
... | ... | ... |
30 | 15.0 | 8 |
31 | 21.4 | 4 |
32 rows × 2 columns
Excluding columns¶
You can remove a column from the data by specifying it with a minus sign (-
) in front of it. This action can be performed on multiple columns.
[5]:
# simple select with exclusion
mtcars >> select(-_.mpg, -_.cyl)
[5]:
disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|
0 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
30 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
31 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
32 rows × 9 columns
Renaming columns¶
You can rename a specified column by using the equality operator (==
). This operation takes the following form.
_.new_name == _.old_name
[6]:
# select with rename
mtcars >> select(_.miles_per_gallon == _.mpg, _.cyl)
[6]:
miles_per_gallon | cyl | |
---|---|---|
0 | 21.0 | 6 |
1 | 21.0 | 6 |
... | ... | ... |
30 | 15.0 | 8 |
31 | 21.4 | 4 |
32 rows × 2 columns
Note that expressing the new column name on the left is similar to how creating a python dictionary works. For example…
select(_.a == _.x, _.b == _.y)
dict(a = "x", b = "y")
both create new entries named “a” and “b”.
However, keep in mind that pandas DataFrame.rename
method uses the opposite approach.
Select a slice of columns¶
When the columns you want to select are adjacent to each other, you can select them using a special slicing syntax. This syntax takes the form…
_["start_col":"end_col"]
where “start_col” and “end_col” can be any of the three methods to specify a column: _.some_col
, “some_col”, or its position number.
[7]:
mtcars >> select(_["mpg": "hp"])
[7]:
mpg | cyl | disp | hp | |
---|---|---|---|---|
0 | 21.0 | 6 | 160.0 | 110 |
1 | 21.0 | 6 | 160.0 | 110 |
... | ... | ... | ... | ... |
30 | 15.0 | 8 | 301.0 | 335 |
31 | 21.4 | 4 | 121.0 | 109 |
32 rows × 4 columns
Note that when position number is used to slice columns, the columns you specify are exactly the ones you would be from indexing the DataFrame.columns
attribute.
[8]:
print(mtcars.columns[0:4])
mtcars >> select(_[0:4])
Index(['mpg', 'cyl', 'disp', 'hp'], dtype='object')
[8]:
mpg | cyl | disp | hp | |
---|---|---|---|---|
0 | 21.0 | 6 | 160.0 | 110 |
1 | 21.0 | 6 | 160.0 | 110 |
... | ... | ... | ... | ... |
30 | 15.0 | 8 | 301.0 | 335 |
31 | 21.4 | 4 | 121.0 | 109 |
32 rows × 4 columns
Finally, columns selected through slicing can be excluded using the minus operator (-
).
[9]:
mtcars >> select(-_["mpg": "hp"])
[9]:
drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|
0 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
30 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
31 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
32 rows × 7 columns
Searching with methods like startswith
or contains
¶
The final, most flexible way to specify columns is to use any of the methods on the DataFrame.columns.str
attribute. This is done by calling any of these methods in a siu expression (e.g. _.startswith('a')
).
[10]:
# prints columns that contain the letter d
columns = mtcars.columns
print(columns[columns.str.contains('d')])
# uses the same method to select only these columns
mtcars >> select(_.contains('d'))
Index(['disp', 'drat'], dtype='object')
[10]:
disp | drat | |
---|---|---|
0 | 160.0 | 3.90 |
1 | 160.0 | 3.90 |
... | ... | ... |
30 | 301.0 | 3.54 |
31 | 121.0 | 4.11 |
32 rows × 2 columns
As with the other approaches of specifying columns, you can also choose to exclude them.
[11]:
mtcars >> select(-_.contains('d'))
[11]:
mpg | cyl | hp | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|
0 | 21.0 | 6 | 110 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 21.0 | 6 | 110 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
30 | 15.0 | 8 | 335 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
31 | 21.4 | 4 | 109 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
32 rows × 9 columns
There are many string methods that can be accessed from DataFrame.colname.str
. See their pandas docs, or their docstrings (e.g. help(mtcars.cyl.str.contains)
) for more information.
For convenience, the names of these methods are listed below.
[12]:
str_methods = dir(mtcars.columns.str)
str_useful = [x for x in str_methods if not x.startswith("_")]
print(str_useful)
['capitalize', 'casefold', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'fullmatch', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill']
Edit page on github here. Interactive version: