--- jupyter: jupytext: text_representation: extension: .Rmd format_name: rmarkdown format_version: '1.2' jupytext_version: 1.4.2 kernelspec: display_name: Python 3 language: python name: python3 --- ```{python nbsphinx=hidden} import pandas as pd pd.set_option("display.max_rows", 5) ``` # SQL backend ## Step 1: Column Translation Column translation requires three pieces: 1. **Locals:** Functions for creating the sqlalchemy clause corresponding to an operation. 2. **Column Data:** Classes representing columns under normal and aggregate settings. 3. **Translator:** A class that can take a symbolic expression (e.g. `_.x.mean()`) and return the correct sqlachemy clause. ```{python} # Column data ================================================================= from siuba.sql.translate import SqlColumn, SqlColumnAgg # used if you want to add a scalar or window translation # (eg. something that gets used in mutate) class WowSqlColumn(SqlColumn): pass # used if you want to add a aggregate translation # (eg. something that gets used in a query with a GROUP BY clause) class WowSqlColumnAgg(WowSqlColumn, SqlColumn): pass # Locals ====================================================================== from siuba.sql.translate import ( win_over, win_agg, sql_agg, sql_scalar, sql_colmeth, sql_not_impl, ) scalar = { "__add__": sql_colmeth("__add__"), "round": sql_scalar("round"), } window = { "rank": win_over("rank"), "mean": win_agg("mean"), } aggregation = { "rank": sql_not_impl(), "mean": sql_agg("mean"), } # Translator ================================================================== from siuba.sql.translate import SqlTranslator translator = SqlTranslator.from_mappings( scalar, window, aggregation, WowSqlColumn, WowSqlColumnAgg ) ``` ## Column Data There are two kinds of data classes, corresponding to whether the generated outermost query in the generated SQL will use a `GROUP BY` clause. ## Locals The entries of each local dictionary are functions that take a sqlalchemy.sql.ClauseElement--which is the parent class of many sqlalchemy elements--and returns a ClauseElement. ```{python} from sqlalchemy import sql expr_rank = window["rank"](sql.column("a_col")) expr_rank ``` ```{python} print(expr_rank) ``` ## Translator Below, we set up a sqlalchemy select statement in order to demonstrate the translator in action. ```{python} from siuba import _ from sqlalchemy.sql import column, select sel = select([column('x'), column('y')]) ``` Then we feed the columns to the translated call. ```{python} call_add = translator.translate(_.x + _.y) call_add(sel.columns) ``` Note that behind the scenes, the translator goes down the call tree and swaps functions like `"__add__"` with the local translations. ```{python} # the root node is __add__. shown as +. _.x + _.y ``` ```{python} # We can see this in action by calling the translation directly. scalar["__add__"](sel.columns.x, sel.columns.y) ``` By default the translate method assumes the expression is using window functions, so operations like `.mean()` return SqlAlchemy Over clauses. ```{python} f_translate = translator.translate(_.x.mean()) expr = f_translate(sel.columns) expr ``` ```{python} print(expr) ``` Keep in mind--since the translator doesn't know about grouping variables--it returns an empty over clause. This separation of concerns: * translator: handles simple column ops, including returning over clauses to be filled. * backend: handles broader contexts--like if the data has been grouped, arranged, or limited--by visiting the result of each translation. ## User defined functions Remember those column data classes we made and passed to the translator? They allow users to register custom column operation functions! Below we show a custom round function, which calls the sqlalchemy that usually corresponds to `ROUND()`. ```{python} from siuba.siu import _, symbolic_dispatch from sqlalchemy import sql @symbolic_dispatch(cls = WowSqlColumn) def round(col): print("running round function") return sql.function.round(col) # Creates a special symbolic object round(_) ``` Note that a special feature of `symbolic_dispatch`, is that it let's you form complex expressions by passing `_` to your function call. ```{python} # Symbolic objects let you express complex operations round(_) + 9999 ``` ## . ## Step 2: Backend class So far we've only discussed how to translate symbolic expressions like `_.x + _.y` in the mutate call below. ```python mutate(backend, _.x + _.y) ``` The last piece is implementing a backend that works with the mutate function itself. If you are just adding support for a new SQL dialect, you can use the LazyTbl class provided by `siuba.sql`. While the translations above work on columns of data, this class handles tables and queries over data (e.g. select statements). More specifically, LazyTbl has 2 jobs: 1. Representing a specific table of data from the database. 2. Creating SQL queries via table verbs. This includes using translators above, as well as broader SQL constructs (e.g. ordering, grouping / partitions, limiting rows, etc..). ```{python} # Setup example data ---- from sqlalchemy import create_engine from siuba.data import mtcars # # copy pandas DataFrame to sqlite engine = create_engine("sqlite:///:memory:") mtcars.to_sql("mtcars", engine, if_exists = "replace") ``` ## Representing SQL tables ```{python} from siuba.sql import LazyTbl tbl_cars = LazyTbl(engine, "mtcars") tbl_cars ``` Note that you can access a number of useful attributes. ```{python} # the underlying translator f_add = tbl_cars.translator.translate(_.mpg + _.hp) f_add(tbl_cars.last_op.columns) ``` ```{python} # the underlying sqlalchemy source table tbl_cars.tbl ``` ```{python} # grouping info from siuba import group_by, _ tbl2 = group_by(tbl_cars, rounded_mpg = _.mpg.round(10)) tbl2.group_by ``` ## Table verbs Backends register on a dispatcher called singledispatch2. This is shown below with an `assign` verb that is a limited implementation of mutate. ```{python} from siuba.dply.verbs import singledispatch2 from sqlalchemy.sql import select @singledispatch2(LazyTbl) def assign(__data, **kwargs): new_cols = [] for k, expr in kwargs.items(): # .shape_call mostly data.translator under the hood new_call = __data.shape_call(expr) sql_expr = new_call(__data.last_op.columns) new_cols.append(sql_expr.label(k)) # copy of data, where .last_op is the new select statement new_data = __data.append_op(select(new_cols)) return new_data # test out tbl_assigned = assign(tbl_cars, res = _.mpg + _.hp) tbl_assigned ``` Note that the `backend.last_op` property holds the current select statement, so the verb can adapt it, or wrap it in another select statement. ```{python} print(tbl_assigned.last_op) ``` ## Unit tests Siuba generates a table of all supported operations per backend, and tests against a simple example for each, to ensure they produce the same result as the pandas backend. However, some translations may deviate in the following ways: * returning a float rather than an int (or vice-versa). * an aggregation that works in `summarize`, but not in `mutate` verbs. In order to mark translations as deviating, you can use the annotation functions. ```{python} from sqlalchemy import sql from siuba.sql.translate import annotate, wrap_annotate # puts an `operation` attribute on function, describing limitations f_and = annotate(lambda col: col & False, input_type = "bool") f_and(sql.column('a')) ``` ```{python} f_and.operation ``` ```{python} # creates a new function (wrapping the old one), and annotates that f_and2 = wrap_annotate(f_and, something_else = False) f_and2.operation ```