[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
Arrange¶
This function lets you arrange the rows of your data in two steps:
choosing columns to arrange by
specifying an order (ascending or descending)
Below, we’ll illustrate this function with a single variable, multiple variables, and more general expressions.
[2]:
from siuba import _, arrange, select
from siuba.data import mtcars
small_mtcars = mtcars >> select(_.cyl, _.mpg, _.hp)
small_mtcars
[2]:
| | cyl | mpg | hp |
---|---|---|---|
0 | 6 | 21.0 | 110 |
1 | 6 | 21.0 | 110 |
... | ... | ... | ... |
30 | 8 | 15.0 | 335 |
31 | 4 | 21.4 | 109 |
32 rows × 3 columns
Arranging rows by a single variable¶
The simplest way to use arrange is to specify a column name. The arrange function uses the pandas DataFrame.sort_values method under the hood, and arranges rows in ascending order.
For example, the code below arranges the rows from least to greatest horsepower (hp).
[3]:
# arrange rows by a single variable (ascending hp)
small_mtcars >> arrange(_.hp)
[3]:
| | cyl | mpg | hp |
---|---|---|---|
18 | 4 | 30.4 | 52 |
7 | 4 | 24.4 | 62 |
... | ... | ... | ... |
28 | 8 | 15.8 | 264 |
30 | 8 | 15.0 | 335 |
32 rows × 3 columns
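Since arrange relies on pandas sorting, the result above should match a direct sort_values call. The sketch below is a plain-pandas illustration, not siuba's own code. The frame `cars` is a hand-built stand-in for small_mtcars (four rows copied from the output above, name chosen only for this example) so that it runs without siuba installed.

```python
import pandas as pd

# Stand-in for small_mtcars: four rows copied from the output above,
# keeping their original index labels. `cars` is an illustrative name.
cars = pd.DataFrame(
    {
        "cyl": [8, 4, 8, 4],
        "mpg": [15.0, 30.4, 15.8, 24.4],
        "hp": [335, 52, 264, 62],
    },
    index=[30, 18, 28, 7],
)

# Plain-pandas counterpart of `cars >> arrange(_.hp)`.
# sort_values sorts in ascending order by default.
by_hp = cars.sort_values("hp")
print(by_hp)
```

Note that the original index labels travel with their rows, which is why the outputs in this page show indices such as 18 and 7 at the top.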
If you add a - before a column or expression, arrange will sort the rows in descending order. This applies to all types of columns, including arrays of strings and categories!
[4]:
small_mtcars >> arrange(-_.hp)
[4]:
| | cyl | mpg | hp |
---|---|---|---|
30 | 8 | 15.0 | 335 |
28 | 8 | 15.8 | 264 |
... | ... | ... | ... |
7 | 4 | 24.4 | 62 |
18 | 4 | 30.4 | 52 |
32 rows × 3 columns
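For comparison, plain pandas expresses descending order through the `ascending` argument of sort_values rather than a leading minus sign. The sketch below uses a small hypothetical frame (`df`, with made-up `name` and `hp` columns) to show the pandas counterpart for both a numeric and a string column. It illustrates the equivalent sort only, and makes no claim about how arrange handles the minus sign internally.

```python
import pandas as pd

# Hypothetical frame with one string column and one numeric column.
df = pd.DataFrame({"name": ["b", "c", "a"], "hp": [110, 52, 335]})

# Counterpart of arrange(-_.hp): largest hp first.
hp_desc = df.sort_values("hp", ascending=False)

# Strings cannot be negated in pandas, so descending order is requested
# with ascending=False instead of writing -df["name"].
name_desc = df.sort_values("name", ascending=False)

print(hp_desc)
print(name_desc)
```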
Arranging rows by multiple variables¶
When arrange receives multiple arguments, it sorts by the first argument, then breaks ties using the second, and so on. As a result, the column specified first changes the slowest down the rows.
[5]:
small_mtcars >> arrange(_.cyl, _.mpg)
[5]:
| | cyl | mpg | hp |
---|---|---|---|
31 | 4 | 21.4 | 109 |
20 | 4 | 21.5 | 97 |
... | ... | ... | ... |
4 | 8 | 18.7 | 175 |
24 | 8 | 19.2 | 175 |
32 rows × 3 columns
[6]:
small_mtcars >> arrange(_.cyl, -_.mpg)
[6]:
| | cyl | mpg | hp |
---|---|---|---|
19 | 4 | 33.9 | 65 |
17 | 4 | 32.4 | 66 |
... | ... | ... | ... |
14 | 8 | 10.4 | 205 |
15 | 8 | 10.4 | 215 |
32 rows × 3 columns
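The two calls above can be mirrored in plain pandas by passing a list of columns and, for mixed directions, a matching list of `ascending` flags. This is a sketch on a hand-built four-row stand-in for small_mtcars (`cars`, an illustrative name with values taken from the outputs on this page), not siuba's implementation.

```python
import pandas as pd

# Four rows taken from the outputs on this page, with original index labels.
cars = pd.DataFrame(
    {
        "cyl": [8, 4, 8, 4],
        "mpg": [15.0, 30.4, 15.8, 24.4],
        "hp": [335, 52, 264, 62],
    },
    index=[30, 18, 28, 7],
)

# Counterpart of arrange(_.cyl, _.mpg): both keys ascending.
both_asc = cars.sort_values(["cyl", "mpg"])

# Counterpart of arrange(_.cyl, -_.mpg): cyl ascending, mpg descending
# within each value of cyl.
mixed = cars.sort_values(["cyl", "mpg"], ascending=[True, False])

print(both_asc)
print(mixed)
```

In both results cyl changes the slowest, and mpg only decides the order among rows that share the same cyl.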
Expressions¶
You can also arrange the rows of your data using more complex expressions, similar to those you would use in a mutate.
For example, the code below sorts by horsepower (hp) per cylinder (cyl).
[7]:
small_mtcars >> arrange(_.hp / _.cyl)
[7]:
| | cyl | mpg | hp |
---|---|---|---|
18 | 4 | 30.4 | 52 |
7 | 4 | 24.4 | 62 |
... | ... | ... | ... |
28 | 8 | 15.8 | 264 |
30 | 8 | 15.0 | 335 |
32 rows × 3 columns
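One way to picture sorting by an expression is to compute the expression as a temporary key, sort that key, and then reorder the rows to match. The sketch below does this in plain pandas on a hand-built stand-in for small_mtcars (`cars` and `key` are illustrative names). It is an assumption about an equivalent approach, not a description of siuba's internals, and it relies on the index labels being unique.

```python
import pandas as pd

# Four rows taken from the outputs on this page, with original index labels.
cars = pd.DataFrame(
    {
        "cyl": [8, 4, 8, 4],
        "mpg": [15.0, 30.4, 15.8, 24.4],
        "hp": [335, 52, 264, 62],
    },
    index=[30, 18, 28, 7],
)

# Temporary sort key: horsepower per cylinder.
key = cars["hp"] / cars["cyl"]

# Sort the key, then reorder the rows by the sorted index labels.
# This assumes the index labels are unique.
by_hp_per_cyl = cars.loc[key.sort_values().index]
print(by_hp_per_cyl)
```

The key is never added to the result, so the output keeps only the original columns, as in the arrange output above.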
Arranging Categorical series¶
Note that a categorical series is arranged in the order of its categories, not in the order of its values. For example, the DataFrame below contains a single categorical column with three entries.
[8]:
df = pd.DataFrame({
"x_cat": pd.Categorical(["c", "b", "a"])
})
df
[8]:
| | x_cat |
---|---|
0 | c |
1 | b |
2 | a |
While the values in the column go from “c” to “a”, the default categories of a categorical are sorted, so they go from “a” to “c”. This can be seen in the last line of the output below.
[9]:
df.x_cat
[9]:
0 c
1 b
2 a
Name: x_cat, dtype: category
Categories (3, object): ['a', 'b', 'c']
Since the pandas sort_values method sorts a categorical according to the order listed under “Categories”, arrange does the same.
[10]:
df >> arrange(_.x_cat)
[10]:
| | x_cat |
---|---|
2 | a |
1 | b |
0 | c |
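In the example above the category order happens to be alphabetical, so the effect is easy to miss. The plain-pandas sketch below uses a hypothetical `size` column whose categories are deliberately listed in a non-alphabetical order, and compares sorting the categorical with sorting the same values as plain strings.

```python
import pandas as pd

# Hypothetical column whose category order is not alphabetical.
sizes = pd.DataFrame(
    {
        "size": pd.Categorical(
            ["small", "large", "medium"],
            categories=["small", "medium", "large"],
        )
    }
)

# Sorting the categorical follows the category order: small, medium, large.
by_category = sizes.sort_values("size")

# Sorting the same values as plain strings is alphabetical instead.
as_text = sizes.assign(size=sizes["size"].astype(str)).sort_values("size")

print(by_category)
print(as_text)
```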
This means that if you reorder the categories, arrange will follow that reordering!
[11]:
from siuba.dply.forcats import fct_rev
df["rev_x_cat"] = fct_rev(df.x_cat)
df.rev_x_cat
[11]:
0 c
1 b
2 a
Name: rev_x_cat, dtype: category
Categories (3, object): ['c', 'b', 'a']
[12]:
df >> arrange(_.rev_x_cat)
[12]:
| | x_cat | rev_x_cat |
---|---|---|
0 | c | c |
1 | b | b |
2 | a | a |
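The same reordering can be sketched with the pandas categorical accessor alone. The example below reverses the categories with `cat.reorder_categories` and then sorts. It is meant as a plain-pandas illustration of the effect shown above, not as a description of what fct_rev does internally.

```python
import pandas as pd

df = pd.DataFrame({"x_cat": pd.Categorical(["c", "b", "a"])})

# Reverse the category order: ['a', 'b', 'c'] becomes ['c', 'b', 'a'].
reversed_cats = list(df["x_cat"].cat.categories[::-1])
df["rev_x_cat"] = df["x_cat"].cat.reorder_categories(reversed_cats)

# Sorting by the original column puts "a" first,
# sorting by the reversed column puts "c" first.
by_original = df.sort_values("x_cat")
by_reversed = df.sort_values("rev_x_cat")

print(by_original)
print(by_reversed)
```

Only the category order changes here. The values stored in each row are the same in both columns.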