---
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.1'
      jupytext_version: 1.1.1
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

```{python nbsphinx=hidden}
import pandas as pd
pd.set_option("display.max_rows", 5)
```

## Select

This function lets you select specific columns of your data to keep.
Each selection may include up to three pieces...

* specifying column(s) to include or remove
* excluding some specified columns
* renaming a column

The documentation below will illustrate these pieces when specifying one column at a time, specifying multiple columns, or searching columns using functions like `contains`.

```{python}
from siuba import _, select
from siuba.data import mtcars

mtcars
```

### Specifying one column at a time


#### Specify columns by name or position

The cleanest way to specify a column is to refer to it by name.
By default, referring to a column will keep it.

```{python}
mtcars >> select(_.mpg, _.cyl)
```

This approach ensures that you can easily rename, or exclude it from the data (shown in following sections). However, you can also refer to a column using a string, or its 0-indexed column position.

```{python}
# two other ways to keep the same columns
mtcars >> select(0, 1)
mtcars >> select("mpg", "cyl")
```

#### Excluding columns

You can remove a column from the data by specifying it with a minus sign (`-`) in front of it.
This action can be performed on multiple columns.

```{python}
# simple select with exclusion
mtcars >> select(-_.mpg, -_.cyl)
```

#### Renaming columns

You can rename a specified column by using the equality operator (`==`).
This operation takes the following form.

* `_.new_name == _.old_name`

```{python}
# select with rename
mtcars >> select(_.miles_per_gallon == _.mpg, _.cyl)
```

Note that expressing the new column name on the left is similar to how creating a python dictionary works. For example...

* `select(_.a == _.x, _.b == _.y)`
* `dict(a = "x", b = "y")`

both create new entries named "a" and "b".

However, keep in mind that pandas `DataFrame.rename` method uses the **opposite** approach.


### Select a slice of columns

When the columns you want to select are adjacent to each other, you can select them using a special slicing syntax. This syntax takes the form...

* `_["start_col":"end_col"]`

where "start_col" and "end_col" can be any of the three methods to specify a column: `_.some_col`, "some_col", or its position number.

```{python}
mtcars >> select(_["mpg": "hp"])
```

Note that when position number is used to slice columns, the columns you specify are exactly the ones you would be from indexing the `DataFrame.columns` attribute.

```{python}
print(mtcars.columns[0:4])

mtcars >> select(_[0:4])
```

Finally, columns selected through slicing can be excluded using the minus operator (`-`).

```{python}
mtcars >> select(-_["mpg": "hp"])
```

### Searching with methods like `startswith` or `contains`


The final, most flexible way to specify columns is to use any of the methods on the `DataFrame.columns.str` attribute. This is done by calling any of these methods in a siu expression (e.g. `_.startswith('a')`).

```{python}
# prints columns that contain the letter d
columns = mtcars.columns
print(columns[columns.str.contains('d')])

# uses the same method to select only these columns
mtcars >> select(_.contains('d'))
```

As with the other approaches of specifying columns, you can also choose to exclude them.

```{python}
mtcars >> select(-_.contains('d'))
```

There are many string methods that can be accessed from `DataFrame.colname.str`. 
See [their pandas docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), or their docstrings (e.g. `help(mtcars.cyl.str.contains)`) for more information.

For convenience, the names of these methods are listed below.

```{python}
str_methods = dir(mtcars.columns.str)
str_useful = [x for x in str_methods if not x.startswith("_")]

print(str_useful)
```