[1]:

import pandas as pd
pd.set_option("display.max_rows", 5)

Select¶

This function lets you select specific columns of your data to keep. Each selection may include up to three pieces…

specifying column(s) to include or remove
excluding some specified columns
renaming a column

The documentation below will illustrate these pieces when specifying one column at a time, specifying multiple columns, or searching columns using functions like contains.

[2]:

from siuba import _, select
from siuba.data import mtcars

mtcars

[2]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
...	...	...	...	...	...	...	...	...	...	...	...
30	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
31	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

32 rows × 11 columns

Specifying one column at a time¶

Specify columns by name or position¶

The cleanest way to specify a column is to refer to it by name. By default, referring to a column will keep it.

[3]:

mtcars >> select(_.mpg, _.cyl)

[3]:

	mpg	cyl
0	21.0	6
1	21.0	6
...	...	...
30	15.0	8
31	21.4	4

32 rows × 2 columns

This approach ensures that you can easily rename, or exclude it from the data (shown in following sections). However, you can also refer to a column using a string, or its 0-indexed column position.

[4]:

# two other ways to keep the same columns
mtcars >> select(0, 1)
mtcars >> select("mpg", "cyl")

[4]:

	mpg	cyl
0	21.0	6
1	21.0	6
...	...	...
30	15.0	8
31	21.4	4

32 rows × 2 columns

Excluding columns¶

You can remove a column from the data by specifying it with a minus sign (-) in front of it. This action can be performed on multiple columns.

[5]:

# simple select with exclusion
mtcars >> select(-_.mpg, -_.cyl)

[5]:

	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	160.0	110	3.90	2.620	16.46	0	1	4	4
1	160.0	110	3.90	2.875	17.02	0	1	4	4
...	...	...	...	...	...	...	...	...	...
30	301.0	335	3.54	3.570	14.60	0	1	5	8
31	121.0	109	4.11	2.780	18.60	1	1	4	2

32 rows × 9 columns

Renaming columns¶

You can rename a specified column by using the equality operator (==). This operation takes the following form.

_.new_name == _.old_name

[6]:

# select with rename
mtcars >> select(_.miles_per_gallon == _.mpg, _.cyl)

[6]:

	miles_per_gallon	cyl
0	21.0	6
1	21.0	6
...	...	...
30	15.0	8
31	21.4	4

32 rows × 2 columns

Note that expressing the new column name on the left is similar to how creating a python dictionary works. For example…

select(_.a == _.x, _.b == _.y)
dict(a = "x", b = "y")

both create new entries named “a” and “b”.

However, keep in mind that pandas DataFrame.rename method uses the opposite approach.

Select a slice of columns¶

When the columns you want to select are adjacent to each other, you can select them using a special slicing syntax. This syntax takes the form…

_["start_col":"end_col"]

where “start_col” and “end_col” can be any of the three methods to specify a column: _.some_col, “some_col”, or its position number.

[7]:

mtcars >> select(_["mpg": "hp"])

[7]:

	mpg	cyl	disp	hp
0	21.0	6	160.0	110
1	21.0	6	160.0	110
...	...	...	...	...
30	15.0	8	301.0	335
31	21.4	4	121.0	109

32 rows × 4 columns

Note that when position number is used to slice columns, the columns you specify are exactly the ones you would be from indexing the DataFrame.columns attribute.

[8]:

print(mtcars.columns[0:4])

mtcars >> select(_[0:4])

Index(['mpg', 'cyl', 'disp', 'hp'], dtype='object')

[8]:

	mpg	cyl	disp	hp
0	21.0	6	160.0	110
1	21.0	6	160.0	110
...	...	...	...	...
30	15.0	8	301.0	335
31	21.4	4	121.0	109

32 rows × 4 columns

Finally, columns selected through slicing can be excluded using the minus operator (-).

[9]:

mtcars >> select(-_["mpg": "hp"])

[9]:

	drat	wt	qsec	vs	am	gear	carb
0	3.90	2.620	16.46	0	1	4	4
1	3.90	2.875	17.02	0	1	4	4
...	...	...	...	...	...	...	...
30	3.54	3.570	14.60	0	1	5	8
31	4.11	2.780	18.60	1	1	4	2

32 rows × 7 columns

Searching with methods like `startswith` or `contains`¶

The final, most flexible way to specify columns is to use any of the methods on the DataFrame.columns.str attribute. This is done by calling any of these methods in a siu expression (e.g. _.startswith('a')).

[10]:

# prints columns that contain the letter d
columns = mtcars.columns
print(columns[columns.str.contains('d')])

# uses the same method to select only these columns
mtcars >> select(_.contains('d'))

Index(['disp', 'drat'], dtype='object')

[10]:

	disp	drat
0	160.0	3.90
1	160.0	3.90
...	...	...
30	301.0	3.54
31	121.0	4.11

32 rows × 2 columns

As with the other approaches of specifying columns, you can also choose to exclude them.

[11]:

mtcars >> select(-_.contains('d'))

[11]:

	mpg	cyl	hp	wt	qsec	vs	am	gear	carb
0	21.0	6	110	2.620	16.46	0	1	4	4
1	21.0	6	110	2.875	17.02	0	1	4	4
...	...	...	...	...	...	...	...	...	...
30	15.0	8	335	3.570	14.60	0	1	5	8
31	21.4	4	109	2.780	18.60	1	1	4	2

32 rows × 9 columns

There are many string methods that can be accessed from DataFrame.colname.str. See their pandas docs, or their docstrings (e.g. help(mtcars.cyl.str.contains)) for more information.

For convenience, the names of these methods are listed below.

[12]:

str_methods = dir(mtcars.columns.str)
str_useful = [x for x in str_methods if not x.startswith("_")]

print(str_useful)

['capitalize', 'casefold', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'fullmatch', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill']

Edit page on github here. Interactive version:

siuba

Navigation

Related Topics

Select¶

Specifying one column at a time¶

Specify columns by name or position¶

Excluding columns¶

Renaming columns¶

Select a slice of columns¶

Searching with methods like `startswith` or `contains`¶

Select¶

Specifying one column at a time¶

Specify columns by name or position¶

Excluding columns¶

Renaming columns¶

Select a slice of columns¶

Searching with methods like startswith or contains¶

Searching with methods like `startswith` or `contains`¶