[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)
from siuba.siu.format import Formatter
show_tree = lambda x: print(Formatter().format(x))
SQL backend¶
⚠️: This document is being revised (though the code runs correctly!).
Step 1: Column Translation¶
Column translation requires three pieces:
Locals: Functions for creating the sqlalchemy clause corresponding to an operation.
Column Data: Classes representing columns under normal and aggregate settings.
Translator: A class that can take a symbolic expression (e.g.
_.x.mean()
) and return it in call form:mean(_.x)
.Codata visitor: A class that takes the above call, and swaps in the sql dialect version of each call.
[2]:
# Column data =================================================================
from siuba.sql.translate import SqlColumn, SqlColumnAgg
# used if you want to add a scalar or window translation
# (eg. something that gets used in mutate)
class WowSqlColumn(SqlColumn): pass
# used if you want to add a aggregate translation
# (eg. something that gets used in a query with a GROUP BY clause)
class WowSqlColumnAgg(WowSqlColumn, SqlColumn): pass
# Locals ======================================================================
from siuba.sql.translate import (
win_over,
win_agg,
sql_agg,
sql_scalar,
sql_colmeth,
sql_not_impl,
)
scalar = {
"__add__": sql_colmeth("__add__"),
"round": sql_scalar("round"),
}
window = {
"rank": win_over("rank"),
"mean": win_agg("mean"),
}
aggregation = {
"rank": sql_not_impl(),
"mean": sql_agg("mean"),
}
# Translator ==================================================================
from siuba.sql.translate import SqlTranslator
translator = SqlTranslator.from_mappings(
WowSqlColumn, WowSqlColumnAgg
)
# TODO: how to work in codata visitor?
Column Data¶
There are two kinds of data classes, corresponding to whether the generated outermost query in the generated SQL will use a GROUP BY
clause.
Locals¶
The entries of each local dictionary are functions that take a sqlalchemy.sql.ClauseElement–which is the parent class of many sqlalchemy elements–and returns a ClauseElement.
[3]:
from sqlalchemy import sql
expr_rank = window["rank"](WowSqlColumn(), sql.column("a_col"))
expr_rank
[3]:
<siuba.sql.translate.RankOver object at 0x7fdd04854b20>
[4]:
print(expr_rank)
rank() OVER (ORDER BY a_col)
Translator¶
Below, we set up a sqlalchemy select statement in order to demonstrate the translator in action.
[5]:
from siuba import _
from sqlalchemy.sql import column, select
sel = select([column('x'), column('y')])
Then we feed the columns to the translated call.
[6]:
call_add = translator.translate(_.x + _.y)
show_tree(call_add)
█─'__call__'
├─█─'__custom_func__'
│ └─<function singledispatch.<locals>.wrapper at 0x7fdcfdd189d0>
├─█─.
│ ├─_
│ └─'x'
└─█─.
├─_
└─'y'
Note that behind the scenes, the translator goes down the call tree and swaps functions like "__add__"
with the local translations.
[7]:
from siuba.siu.visitors import CodataVisitor
codata = CodataVisitor(WowSqlColumn, object)
call_add_final = codata.visit(call_add)
show_tree(call_add_final)
█─'__call__'
├─█─'__custom_func__'
│ └─<function sql_colmeth.<locals>.f at 0x7fdcfdb7c040>
├─<__main__.WowSqlColumn object at 0x7fdd04854d30>
├─█─.
│ ├─_
│ └─'x'
└─█─.
├─_
└─'y'
[8]:
# the root node is __add__. shown as +.
_.x + _.y
[8]:
█─+
├─█─.
│ ├─_
│ └─'x'
└─█─.
├─_
└─'y'
[9]:
# We can see this in action by calling the translation directly.
scalar["__add__"](WowSqlColumn(), sel.columns.x, sel.columns.y)
/tmp/ipykernel_2338/1405687756.py:2: SADeprecationWarning: The SelectBase.c and SelectBase.columns attributes are deprecated and will be removed in a future release; these attributes implicitly create a subquery that should be explicit. Please call SelectBase.subquery() first in order to create a subquery, which then contains this attribute. To access the columns that this SELECT object SELECTs from, use the SelectBase.selected_columns attribute. (deprecated since: 1.4)
scalar["__add__"](WowSqlColumn(), sel.columns.x, sel.columns.y)
[9]:
<sqlalchemy.sql.elements.BinaryExpression object at 0x7fdd048540d0>
By default the translate method assumes the expression is using window functions, so operations like .mean()
return SqlAlchemy Over clauses.
[10]:
f_translate = translator.translate(_.x.mean())
f_translate_co = codata.visit(f_translate)
expr = f_translate_co(sel.columns)
expr
/tmp/ipykernel_2338/2826175941.py:4: SADeprecationWarning: The SelectBase.c and SelectBase.columns attributes are deprecated and will be removed in a future release; these attributes implicitly create a subquery that should be explicit. Please call SelectBase.subquery() first in order to create a subquery, which then contains this attribute. To access the columns that this SELECT object SELECTs from, use the SelectBase.selected_columns attribute. (deprecated since: 1.4)
expr = f_translate_co(sel.columns)
[10]:
<siuba.sql.translate.AggOver object at 0x7fdd2836de20>
[11]:
print(expr)
avg(anon_1.x) OVER ()
Keep in mind–since the translator doesn’t know about grouping variables–it returns an empty over clause. This separation of concerns:
translator: handles simple column ops, including returning over clauses to be filled.
backend: handles broader contexts–like if the data has been grouped, arranged, or limited–by visiting the result of each translation.
User defined functions¶
Remember those column data classes we made and passed to the translator? They allow users to register custom column operation functions!
Below we show a custom round function, which calls the sqlalchemy that usually corresponds to ROUND(<col>)
.
[12]:
from siuba.siu import _, symbolic_dispatch
from sqlalchemy import sql
@symbolic_dispatch(cls = WowSqlColumn)
def round(self, col):
print("running round function")
return sql.function.round(col)
# Creates a special symbolic object
round(_)
[12]:
█─'__call__'
├─█─'__custom_func__'
│ └─<function round at 0x7fdcfdafbca0>
└─_
Note that a special feature of symbolic_dispatch
, is that it let’s you form complex expressions by passing _
to your function call.
[13]:
# Symbolic objects let you express complex operations
round(_) + 9999
[13]:
█─+
├─█─'__call__'
│ ├─█─'__custom_func__'
│ │ └─<function round at 0x7fdcfdafbca0>
│ └─_
└─9999
.¶
Step 2: Backend class¶
So far we’ve only discussed how to translate symbolic expressions like _.x + _.y
in the mutate call below.
mutate(backend, _.x + _.y)
The last piece is implementing a backend that works with the mutate function itself. If you are just adding support for a new SQL dialect, you can use the LazyTbl class provided by siuba.sql
. While the translations above work on columns of data, this class handles tables and queries over data (e.g. select statements).
More specifically, LazyTbl has 2 jobs:
Representing a specific table of data from the database.
Creating SQL queries via table verbs. This includes using translators above, as well as broader SQL constructs (e.g. ordering, grouping / partitions, limiting rows, etc..).
[14]:
# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars
# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")
[14]:
32
Representing SQL tables¶
[15]:
from siuba.sql import LazyTbl
tbl_cars = LazyTbl(engine, "mtcars")
tbl_cars
[15]:
# Source: lazy query # DB Conn: Engine(sqlite:///:memory:) # Preview:
index | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 1 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | 2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | 3 | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | 4 | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
# .. may have more rows
Note that you can access a number of useful attributes.
[16]:
# calls the underlying translator and codata
f_add = tbl_cars.shape_call(_.mpg + _.hp)
f_add(tbl_cars.last_op.columns)
/tmp/ipykernel_2338/710220926.py:3: SADeprecationWarning: The SelectBase.c and SelectBase.columns attributes are deprecated and will be removed in a future release; these attributes implicitly create a subquery that should be explicit. Please call SelectBase.subquery() first in order to create a subquery, which then contains this attribute. To access the columns that this SELECT object SELECTs from, use the SelectBase.selected_columns attribute. (deprecated since: 1.4)
f_add(tbl_cars.last_op.columns)
[16]:
<sqlalchemy.sql.elements.BinaryExpression object at 0x7fdcfd9aafd0>
[17]:
# the underlying sqlalchemy source table
tbl_cars.tbl
[17]:
Table('mtcars', MetaData(bind=Engine(sqlite:///:memory:)), Column('index', BIGINT(), table=<mtcars>), Column('mpg', FLOAT(), table=<mtcars>), Column('cyl', BIGINT(), table=<mtcars>), Column('disp', FLOAT(), table=<mtcars>), Column('hp', BIGINT(), table=<mtcars>), Column('drat', FLOAT(), table=<mtcars>), Column('wt', FLOAT(), table=<mtcars>), Column('qsec', FLOAT(), table=<mtcars>), Column('vs', BIGINT(), table=<mtcars>), Column('am', BIGINT(), table=<mtcars>), Column('gear', BIGINT(), table=<mtcars>), Column('carb', BIGINT(), table=<mtcars>), schema=None)
[18]:
# grouping info
from siuba import group_by, _
tbl2 = group_by(tbl_cars, rounded_mpg = _.mpg.round(10))
tbl2.group_by
[18]:
('rounded_mpg',)
Table verbs¶
Backends register on a dispatcher called singledispatch2. This is shown below with an assign
verb that is a limited implementation of mutate.
[19]:
from siuba.dply.verbs import singledispatch2
from sqlalchemy.sql import select
@singledispatch2(LazyTbl)
def assign(__data, **kwargs):
new_cols = []
for k, expr in kwargs.items():
# .shape_call mostly data.translator under the hood
new_call = __data.shape_call(expr)
sql_expr = new_call(__data.last_op.columns)
new_cols.append(sql_expr.label(k))
# copy of data, where .last_op is the new select statement
new_data = __data.append_op(select(new_cols))
return new_data
# test out
tbl_assigned = assign(tbl_cars, res = _.mpg + _.hp)
tbl_assigned
/tmp/ipykernel_2338/1042114963.py:12: SADeprecationWarning: The SelectBase.c and SelectBase.columns attributes are deprecated and will be removed in a future release; these attributes implicitly create a subquery that should be explicit. Please call SelectBase.subquery() first in order to create a subquery, which then contains this attribute. To access the columns that this SELECT object SELECTs from, use the SelectBase.selected_columns attribute. (deprecated since: 1.4)
sql_expr = new_call(__data.last_op.columns)
[19]:
# Source: lazy query # DB Conn: Engine(sqlite:///:memory:) # Preview:
res | |
---|---|
0 | 131.0 |
1 | 131.0 |
2 | 115.8 |
3 | 131.4 |
4 | 193.7 |
# .. may have more rows
Note that the backend.last_op
property holds the current select statement, so the verb can adapt it, or wrap it in another select statement.
[20]:
print(tbl_assigned.last_op)
SELECT anon_1.mpg + anon_1.hp AS res
FROM (SELECT mtcars."index" AS "index", mtcars.mpg AS mpg, mtcars.cyl AS cyl, mtcars.disp AS disp, mtcars.hp AS hp, mtcars.drat AS drat, mtcars.wt AS wt, mtcars.qsec AS qsec, mtcars.vs AS vs, mtcars.am AS am, mtcars.gear AS gear, mtcars.carb AS carb
FROM mtcars) AS anon_1
Unit tests¶
Siuba generates a table of all supported operations per backend, and tests against a simple example for each, to ensure they produce the same result as the pandas backend.
However, some translations may deviate in the following ways:
returning a float rather than an int (or vice-versa).
an aggregation that works in
summarize
, but not inmutate
verbs.
In order to mark translations as deviating, you can use the annotation functions.
[21]:
from sqlalchemy import sql
from siuba.sql.translate import annotate, wrap_annotate
# puts an `operation` attribute on function, describing limitations
f_and = annotate(lambda col: col & False, input_type = "bool")
f_and(sql.column('a'))
[21]:
<sqlalchemy.sql.elements.AsBoolean object at 0x7fdcfd9efcd0>
[22]:
f_and.operation
[22]:
{'input_type': 'bool'}
[23]:
# creates a new function (wrapping the old one), and annotates that
f_and2 = wrap_annotate(f_and, something_else = False)
f_and2.operation
[23]:
{'something_else': False}
Edit page on github here. Interactive version: