---
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.4.1
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
  nbsphinx:
    allow_errors: true
---

```{python nbsphinx="hidden"}
import pandas as pd

pd.set_option("display.max_rows", 5)
```

# Data Analysis guide

**Note: this document is a work in progress. For sections that aren't completed, I have included links to useful documentation and examples.**

See also these resources:

* [pandas Series methods API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)
* [siuba verb API reference](api_user_index.rst)
* [siuba examples](examples.ipynb)

## Overview

```{python}
from siuba.data import cars

cars
```

## Split-apply-combine

> 🚧 Coming soon. In the meantime, check out these docs with many examples in the User API.

* [filter](api_table_core/01_filter.Rmd)
* [arrange](api_table_core/02_arrange.Rmd)
* [select](api_table_core/03_select.Rmd)
* [mutate](api_table_core/05_mutate.Rmd)
* [summarize](api_table_core/07_summarize.Rmd)
* [group_by](api_table_core/08_group_by.Rmd)

## Dates and times

> 🚧 Coming soon. See this [article on timeseries](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) in the pandas docs.

## Strings

> 🚧 Coming soon. See this [article on working with text](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) in the pandas docs.

## Reshaping

> 🚧 Coming soon. See the following User API entries for reshaping verbs.

* [gather](api_tidy/02_gather.Rmd)
* [spread](api_tidy/03_spread.Rmd)

## Table joins

> 🚧 Coming soon. See siuba's User API entry for joins.

* [joins](api_table_two/joins.Rmd)

## Nested data

> 🚧 Coming soon. See siuba's User API entry for nest and unnest.

* [nest and unnest](api_tidy/01_nest.Rmd)

## Custom functions

Custom functions are created using `symbolic_dispatch`.

```{python}
import pandas as pd
from siuba.siu import symbolic_dispatch

@symbolic_dispatch(cls = pd.Series)
def add(x, y):
    return x + y

from siuba import _, mutate

df = pd.DataFrame({
    "x": [1, 2, 3],
    "y": [4, 5, 6],
})

df >> mutate(res = add(_.x, _.y) + 100)
```

One important feature of `symbolic_dispatch` is how it handles `_`. When the function is called with `_`, it returns a Symbolic object instead of a result, which lets you use the call inside more complex expressions.

```{python}
add(_.x, _.y) + 100
```

## Debugging

This section covers the four most common issues people seem to hit.

1. Referring to a column that doesn't exist
2. A pandas Series method raising an error
3. Python syntax errors
4. Any of the above in a pipe

> Note that stack traces shown here are shorter than normal, to help make them clearer. This is something siuba does for SQL by default, and will be implemented for pandas in the future.

```{python}
import pandas as pd
from siuba import mutate, _

df = pd.DataFrame({
    'g': ['a','a','b'],
    'x': [1,2,3]
})
```

```{python jupyter={'source_hidden': True}, nbsphinx="hidden"}
test = {}
def limit_traceback(f, keep_first = True, limit = 1):
    """Wraps the ipython shell._showtraceback, to cut out some pieces.

    Note: ipython allows Exceptions to have a _render_traceback_ method, to do
    what this wrapper does, but that doesn't help us change the behavior of
    existing classes. This is a situation where generic function dispatch would help.
    """
    from functools import wraps

    if getattr(f, '_wrapped_lt', False):
        # don't wrap multiple times.
        # re-wrap original
        f = f.__wrapped__

    @wraps(f)
    def wrapper(etype, evalue, stb):
        test['stb'] = stb
        header = stb[0:3] if keep_first and len(stb) > 3 else []
        body = stb[-limit:]
        f(etype, evalue, [*header, *body])

    # ensure we don't wrap multiple times
    wrapper._wrapped_lt = True

    # otherwise, return wrapper
    return wrapper

from IPython.core.magic import (register_line_magic, register_cell_magic,
                                register_line_cell_magic)

@register_cell_magic
def short_traceback(line, cell):
    shell = get_ipython()
    shell._showtraceback = limit_traceback(shell._showtraceback, limit = 1)
    shell.run_cell(cell)
    shell._showtraceback = shell._showtraceback.__wrapped__

shell = get_ipython()
shell._showtraceback = limit_traceback(shell._showtraceback, limit = 1)
```

### Missing columns

```{python}
mutate(df, y = _.X + 1)
```

In this case, the data doesn't have a column named "X".

```{python}
df.columns
```

### Series method error

```{python}
mutate(df, y = _.x.mean(bad_arg = True))
```

In this case, it's helpful to try replacing `_` with the actual data.

```{python}
# expression to debug
_.x.mean(bad_arg = True)

# replacing _ with the data
df.x.mean(bad_arg = True)
```

### Python syntax errors

```{python}
df
  >> mutate(y = _.x + 1)
```

In this case, we need to either end the first line with a backslash or wrap the whole pipe in parentheses.

```{python}
df \
  >> mutate(y = _.x + 1)

(df
  >> mutate(y = _.x + 1)
)
```

### Pipes

When the error occurs in a pipe, it's helpful to comment out parts of the pipe. For example, consider the 3 step pipe below.

```{python}
from siuba import select, arrange, mutate

(df
  >> select(_.g, _.x)
  >> mutate(res = _.X + 1)
  >> arrange(-_.res)
)
```

Notice the arrow pointing to line 6. This is not because that's where the error is, but because Python will always point to the last line of a pipe. Let's debug by running only the first line, then only the first two, and so on, until we find the error.

```{python}
(df
  >> select(_.g, _.x)
#   >> mutate(res = _.X + 1)
#   >> arrange(-_.res)
)
```

`select` works okay, so let's uncomment the next line.

```{python}
(df
  >> select(_.g, _.x)
  >> mutate(res = _.X + 1)
#   >> arrange(-_.res)
)
```

We found our bug! Note that when working with SQL, siuba prints out the name of the verb where the error occurred. This is very useful, and will be added for pandas in the future!
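
The comment-out-and-rerun approach above can also be sketched as a small helper that applies one step at a time and reports the first step that raises. This is only an illustration, not part of siuba: `first_failing_step` is a hypothetical name, and the example uses a plain dict of columns as a stand-in table so that it runs without pandas or siuba installed.

```python
def first_failing_step(data, steps):
    """Apply each (label, function) pair in order; stop at the first error.

    Returns (label, exception) for the first step that raises,
    or (None, None) if every step succeeds.
    """
    for label, step in steps:
        try:
            data = step(data)
        except Exception as exc:
            return label, exc
    return None, None


# Stand-in "table": a dict of columns (an assumption made for this sketch).
table = {"g": ["a", "a", "b"], "x": [1, 2, 3]}

steps = [
    ("select", lambda d: {k: d[k] for k in ("g", "x")}),
    # Same bug as in the pipe above: the column is "x", not "X".
    ("mutate", lambda d: {**d, "res": [v + 1 for v in d["X"]]}),
    ("arrange", lambda d: d),  # placeholder step; never reached here
]

label, error = first_failing_step(table, steps)
print(f"first failing step: {label} ({type(error).__name__}: {error})")
```

With a real DataFrame, each step could presumably be written in the pipe syntax used throughout this guide, for example `lambda d: d >> mutate(res = _.X + 1)`. The helper then names the failing verb directly, so there is no need to count commented-out lines.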