[1]:

import pandas as pd
pd.set_option("display.max_rows", 5)

SQL backend¶

Step 1: Column Translation¶

Column translation requires three pieces:

Locals: Functions for creating the sqlalchemy clause corresponding to an operation.
Column Data: Classes representing columns under normal and aggregate settings.
Translator: A class that can take a symbolic expression (e.g. _.x.mean()) and return the correct sqlachemy clause.

[2]:

# Column data =================================================================

from siuba.sql.translate import SqlColumn, SqlColumnAgg

# used if you want to add a scalar or window translation
# (eg. something that gets used in mutate)
class WowSqlColumn(SqlColumn): pass

# used if you want to add a aggregate translation
# (eg. something that gets used in a query with a GROUP BY clause)
class WowSqlColumnAgg(WowSqlColumn, SqlColumn): pass


# Locals ======================================================================

from siuba.sql.translate import (
        win_over,
        win_agg,
        sql_agg,
        sql_scalar,
        sql_colmeth,
        sql_not_impl,
        )

scalar = {
    "__add__": sql_colmeth("__add__"),
    "round": sql_scalar("round"),
}

window = {
    "rank": win_over("rank"),
    "mean": win_agg("mean"),
}

aggregation = {
    "rank": sql_not_impl(),
    "mean": sql_agg("mean"),
}


# Translator ==================================================================

from siuba.sql.translate import SqlTranslator

translator = SqlTranslator.from_mappings(
    scalar, window, aggregation,
    WowSqlColumn, WowSqlColumnAgg
)

Column Data¶

There are two kinds of data classes, corresponding to whether the generated outermost query in the generated SQL will use a GROUP BY clause.

Locals¶

The entries of each local dictionary are functions that take a sqlalchemy.sql.ClauseElement–which is the parent class of many sqlalchemy elements–and returns a ClauseElement.

[3]:

from sqlalchemy import sql

expr_rank = window["rank"](sql.column("a_col"))
expr_rank

[3]:

<siuba.sql.translate.RankOver object at 0x7f1e921903d0>

[4]:

print(expr_rank)

rank() OVER (ORDER BY a_col)

Translator¶

Below, we set up a sqlalchemy select statement in order to demonstrate the translator in action.

[5]:

from siuba import _
from sqlalchemy.sql import column, select

sel = select([column('x'), column('y')])

Then we feed the columns to the translated call.

[6]:

call_add = translator.translate(_.x + _.y)
call_add(sel.columns)

[6]:

<sqlalchemy.sql.elements.BinaryExpression object at 0x7f1eb4c54fd0>

Note that behind the scenes, the translator goes down the call tree and swaps functions like "__add__" with the local translations.

[7]:

# the root node is __add__. shown as +.
_.x + _.y

[7]:

█─+
├─█─.
│ ├─_
│ └─'x'
└─█─.
  ├─_
  └─'y'

[8]:

# We can see this in action by calling the translation directly.
scalar["__add__"](sel.columns.x, sel.columns.y)

[8]:

<sqlalchemy.sql.elements.BinaryExpression object at 0x7f1e91101ad0>

By default the translate method assumes the expression is using window functions, so operations like .mean() return SqlAlchemy Over clauses.

[9]:

f_translate = translator.translate(_.x.mean())
expr = f_translate(sel.columns)

expr

[9]:

<siuba.sql.translate.AggOver object at 0x7f1e91101e90>

[10]:

print(expr)

mean(x) OVER ()

Keep in mind–since the translator doesn’t know about grouping variables–it returns an empty over clause. This separation of concerns:

translator: handles simple column ops, including returning over clauses to be filled.
backend: handles broader contexts–like if the data has been grouped, arranged, or limited–by visiting the result of each translation.

User defined functions¶

Remember those column data classes we made and passed to the translator? They allow users to register custom column operation functions!

Below we show a custom round function, which calls the sqlalchemy that usually corresponds to ROUND(<col>).

[11]:

from siuba.siu import _, symbolic_dispatch
from sqlalchemy import sql

@symbolic_dispatch(cls = WowSqlColumn)
def round(col):
    print("running round function")

    return sql.function.round(col)

# Creates a special symbolic object
round(_)

[11]:

█─'__call__'
├─█─'__custom_func__'
│ └─<function round at 0x7f1e91167680>
└─_

Note that a special feature of symbolic_dispatch, is that it let’s you form complex expressions by passing _ to your function call.

[12]:

# Symbolic objects let you express complex operations
round(_) + 9999

[12]:

█─+
├─█─'__call__'
│ ├─█─'__custom_func__'
│ │ └─<function round at 0x7f1e91167680>
│ └─_
└─9999

.¶

Step 2: Backend class¶

So far we’ve only discussed how to translate symbolic expressions like _.x + _.y in the mutate call below.

mutate(backend, _.x + _.y)

The last piece is implementing a backend that works with the mutate function itself. If you are just adding support for a new SQL dialect, you can use the LazyTbl class provided by siuba.sql. While the translations above work on columns of data, this class handles tables and queries over data (e.g. select statements).

More specifically, LazyTbl has 2 jobs:

Representing a specific table of data from the database.
Creating SQL queries via table verbs. This includes using translators above, as well as broader SQL constructs (e.g. ordering, grouping / partitions, limiting rows, etc..).

[13]:

# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars

# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")

Representing SQL tables¶

[14]:

from siuba.sql import LazyTbl

tbl_cars = LazyTbl(engine, "mtcars")
tbl_cars

[14]:

# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:

	index	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	0	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	1	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
2	2	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
3	3	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
4	4	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2

# .. may have more rows

Note that you can access a number of useful attributes.

[15]:

# the underlying translator
f_add = tbl_cars.translator.translate(_.mpg + _.hp)
f_add(tbl_cars.last_op.columns)

[15]:

<sqlalchemy.sql.elements.BinaryExpression object at 0x7f1e90b696d0>

[16]:

# the underlying sqlalchemy source table
tbl_cars.tbl

[16]:

Table('mtcars', MetaData(bind=Engine(sqlite:///:memory:)), Column('index', BIGINT(), table=<mtcars>), Column('mpg', FLOAT(), table=<mtcars>), Column('cyl', BIGINT(), table=<mtcars>), Column('disp', FLOAT(), table=<mtcars>), Column('hp', BIGINT(), table=<mtcars>), Column('drat', FLOAT(), table=<mtcars>), Column('wt', FLOAT(), table=<mtcars>), Column('qsec', FLOAT(), table=<mtcars>), Column('vs', BIGINT(), table=<mtcars>), Column('am', BIGINT(), table=<mtcars>), Column('gear', BIGINT(), table=<mtcars>), Column('carb', BIGINT(), table=<mtcars>), schema=None)

[17]:

# grouping info
from siuba import group_by, _

tbl2 = group_by(tbl_cars, rounded_mpg = _.mpg.round(10))
tbl2.group_by

[17]:

('rounded_mpg',)

Table verbs¶

Backends register on a dispatcher called singledispatch2. This is shown below with an assign verb that is a limited implementation of mutate.

[18]:

from siuba.dply.verbs import singledispatch2
from sqlalchemy.sql import select

@singledispatch2(LazyTbl)
def assign(__data, **kwargs):

    new_cols = []
    for k, expr in kwargs.items():

        # .shape_call mostly data.translator under the hood
        new_call = __data.shape_call(expr)
        sql_expr = new_call(__data.last_op.columns)

        new_cols.append(sql_expr.label(k))

    # copy of data, where .last_op is the new select statement
    new_data = __data.append_op(select(new_cols))

    return new_data


# test out
tbl_assigned = assign(tbl_cars, res = _.mpg + _.hp)

tbl_assigned

[18]:

# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:

	res
0	131.0
1	131.0
2	115.8
3	131.4
4	193.7

# .. may have more rows

Note that the backend.last_op property holds the current select statement, so the verb can adapt it, or wrap it in another select statement.

[19]:

print(tbl_assigned.last_op)

SELECT mpg + hp AS res
FROM (SELECT mtcars."index" AS "index", mtcars.mpg AS mpg, mtcars.cyl AS cyl, mtcars.disp AS disp, mtcars.hp AS hp, mtcars.drat AS drat, mtcars.wt AS wt, mtcars.qsec AS qsec, mtcars.vs AS vs, mtcars.am AS am, mtcars.gear AS gear, mtcars.carb AS carb
FROM mtcars)

Unit tests¶

Siuba generates a table of all supported operations per backend, and tests against a simple example for each, to ensure they produce the same result as the pandas backend.

However, some translations may deviate in the following ways:

returning a float rather than an int (or vice-versa).
an aggregation that works in summarize, but not in mutate verbs.

In order to mark translations as deviating, you can use the annotation functions.

[20]:

from sqlalchemy import sql
from siuba.sql.translate import annotate, wrap_annotate

# puts an `operation` attribute on function, describing limitations
f_and = annotate(lambda col: col & False, input_type = "bool")
f_and(sql.column('a'))

[20]:

<sqlalchemy.sql.elements.AsBoolean object at 0x7f1e90b69c10>

[21]:

f_and.operation

[21]:

{'input_type': 'bool'}

[22]:

# creates a new function (wrapping the old one), and annotates that
f_and2 = wrap_annotate(f_and, something_else = False)

f_and2.operation

[22]:

{'something_else': False}

Edit page on github here. Interactive version:

siuba

Navigation

Related Topics

SQL backend¶

Step 1: Column Translation¶

Column Data¶

Locals¶

Translator¶

User defined functions¶

.¶

Step 2: Backend class¶

Representing SQL tables¶

Table verbs¶

Unit tests¶