[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)

SQL backend

Step 1: Column Translation

Column translation requires three pieces:

  1. Locals: Functions for creating the sqlalchemy clause corresponding to an operation.

  2. Column Data: Classes representing columns under normal and aggregate settings.

  3. Translator: A class that can take a symbolic expression (e.g. _.x.mean()) and return the correct sqlachemy clause.

[2]:
# Column data =================================================================

from siuba.sql.translate import SqlColumn, SqlColumnAgg

# used if you want to add a scalar or window translation
# (eg. something that gets used in mutate)
class WowSqlColumn(SqlColumn): pass

# used if you want to add a aggregate translation
# (eg. something that gets used in a query with a GROUP BY clause)
class WowSqlColumnAgg(WowSqlColumn, SqlColumn): pass


# Locals ======================================================================

from siuba.sql.translate import (
        win_over,
        win_agg,
        sql_agg,
        sql_scalar,
        sql_colmeth,
        sql_not_impl,
        )

scalar = {
    "__add__": sql_colmeth("__add__"),
    "round": sql_scalar("round"),
}

window = {
    "rank": win_over("rank"),
    "mean": win_agg("mean"),
}

aggregation = {
    "rank": sql_not_impl(),
    "mean": sql_agg("mean"),
}


# Translator ==================================================================

from siuba.sql.translate import SqlTranslator

translator = SqlTranslator.from_mappings(
    scalar, window, aggregation,
    WowSqlColumn, WowSqlColumnAgg
)

Column Data

There are two kinds of data classes, corresponding to whether the generated outermost query in the generated SQL will use a GROUP BY clause.

Locals

The entries of each local dictionary are functions that take a sqlalchemy.sql.ClauseElement–which is the parent class of many sqlalchemy elements–and returns a ClauseElement.

[3]:
from sqlalchemy import sql

expr_rank = window["rank"](sql.column("a_col"))
expr_rank
[3]:
<siuba.sql.translate.RankOver object at 0x7f1e921903d0>
[4]:
print(expr_rank)
rank() OVER (ORDER BY a_col)

Translator

Below, we set up a sqlalchemy select statement in order to demonstrate the translator in action.

[5]:
from siuba import _
from sqlalchemy.sql import column, select

sel = select([column('x'), column('y')])

Then we feed the columns to the translated call.

[6]:
call_add = translator.translate(_.x + _.y)
call_add(sel.columns)
[6]:
<sqlalchemy.sql.elements.BinaryExpression object at 0x7f1eb4c54fd0>

Note that behind the scenes, the translator goes down the call tree and swaps functions like "__add__" with the local translations.

[7]:
# the root node is __add__. shown as +.
_.x + _.y
[7]:
█─+
├─█─.
│ ├─_
│ └─'x'
└─█─.
  ├─_
  └─'y'
[8]:
# We can see this in action by calling the translation directly.
scalar["__add__"](sel.columns.x, sel.columns.y)
[8]:
<sqlalchemy.sql.elements.BinaryExpression object at 0x7f1e91101ad0>

By default the translate method assumes the expression is using window functions, so operations like .mean() return SqlAlchemy Over clauses.

[9]:
f_translate = translator.translate(_.x.mean())
expr = f_translate(sel.columns)

expr
[9]:
<siuba.sql.translate.AggOver object at 0x7f1e91101e90>
[10]:
print(expr)
mean(x) OVER ()

Keep in mind–since the translator doesn’t know about grouping variables–it returns an empty over clause. This separation of concerns:

  • translator: handles simple column ops, including returning over clauses to be filled.

  • backend: handles broader contexts–like if the data has been grouped, arranged, or limited–by visiting the result of each translation.

User defined functions

Remember those column data classes we made and passed to the translator? They allow users to register custom column operation functions!

Below we show a custom round function, which calls the sqlalchemy that usually corresponds to ROUND(<col>).

[11]:
from siuba.siu import _, symbolic_dispatch
from sqlalchemy import sql

@symbolic_dispatch(cls = WowSqlColumn)
def round(col):
    print("running round function")

    return sql.function.round(col)

# Creates a special symbolic object
round(_)
[11]:
█─'__call__'
├─█─'__custom_func__'
│ └─<function round at 0x7f1e91167680>
└─_

Note that a special feature of symbolic_dispatch, is that it let’s you form complex expressions by passing _ to your function call.

[12]:
# Symbolic objects let you express complex operations
round(_) + 9999
[12]:
█─+
├─█─'__call__'
│ ├─█─'__custom_func__'
│ │ └─<function round at 0x7f1e91167680>
│ └─_
└─9999

.

Step 2: Backend class

So far we’ve only discussed how to translate symbolic expressions like _.x + _.y in the mutate call below.

mutate(backend, _.x + _.y)

The last piece is implementing a backend that works with the mutate function itself. If you are just adding support for a new SQL dialect, you can use the LazyTbl class provided by siuba.sql. While the translations above work on columns of data, this class handles tables and queries over data (e.g. select statements).

More specifically, LazyTbl has 2 jobs:

  1. Representing a specific table of data from the database.

  2. Creating SQL queries via table verbs. This includes using translators above, as well as broader SQL constructs (e.g. ordering, grouping / partitions, limiting rows, etc..).

[13]:
# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars

# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")

Representing SQL tables

[14]:
from siuba.sql import LazyTbl

tbl_cars = LazyTbl(engine, "mtcars")
tbl_cars
[14]:
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
index mpg cyl disp hp drat wt qsec vs am gear carb
0 0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

# .. may have more rows

Note that you can access a number of useful attributes.

[15]:
# the underlying translator
f_add = tbl_cars.translator.translate(_.mpg + _.hp)
f_add(tbl_cars.last_op.columns)
[15]:
<sqlalchemy.sql.elements.BinaryExpression object at 0x7f1e90b696d0>
[16]:
# the underlying sqlalchemy source table
tbl_cars.tbl
[16]:
Table('mtcars', MetaData(bind=Engine(sqlite:///:memory:)), Column('index', BIGINT(), table=<mtcars>), Column('mpg', FLOAT(), table=<mtcars>), Column('cyl', BIGINT(), table=<mtcars>), Column('disp', FLOAT(), table=<mtcars>), Column('hp', BIGINT(), table=<mtcars>), Column('drat', FLOAT(), table=<mtcars>), Column('wt', FLOAT(), table=<mtcars>), Column('qsec', FLOAT(), table=<mtcars>), Column('vs', BIGINT(), table=<mtcars>), Column('am', BIGINT(), table=<mtcars>), Column('gear', BIGINT(), table=<mtcars>), Column('carb', BIGINT(), table=<mtcars>), schema=None)
[17]:
# grouping info
from siuba import group_by, _

tbl2 = group_by(tbl_cars, rounded_mpg = _.mpg.round(10))
tbl2.group_by
[17]:
('rounded_mpg',)

Table verbs

Backends register on a dispatcher called singledispatch2. This is shown below with an assign verb that is a limited implementation of mutate.

[18]:
from siuba.dply.verbs import singledispatch2
from sqlalchemy.sql import select

@singledispatch2(LazyTbl)
def assign(__data, **kwargs):

    new_cols = []
    for k, expr in kwargs.items():

        # .shape_call mostly data.translator under the hood
        new_call = __data.shape_call(expr)
        sql_expr = new_call(__data.last_op.columns)

        new_cols.append(sql_expr.label(k))

    # copy of data, where .last_op is the new select statement
    new_data = __data.append_op(select(new_cols))

    return new_data


# test out
tbl_assigned = assign(tbl_cars, res = _.mpg + _.hp)

tbl_assigned
[18]:
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
res
0 131.0
1 131.0
2 115.8
3 131.4
4 193.7

# .. may have more rows

Note that the backend.last_op property holds the current select statement, so the verb can adapt it, or wrap it in another select statement.

[19]:
print(tbl_assigned.last_op)
SELECT mpg + hp AS res
FROM (SELECT mtcars."index" AS "index", mtcars.mpg AS mpg, mtcars.cyl AS cyl, mtcars.disp AS disp, mtcars.hp AS hp, mtcars.drat AS drat, mtcars.wt AS wt, mtcars.qsec AS qsec, mtcars.vs AS vs, mtcars.am AS am, mtcars.gear AS gear, mtcars.carb AS carb
FROM mtcars)

Unit tests

Siuba generates a table of all supported operations per backend, and tests against a simple example for each, to ensure they produce the same result as the pandas backend.

However, some translations may deviate in the following ways:

  • returning a float rather than an int (or vice-versa).

  • an aggregation that works in summarize, but not in mutate verbs.

In order to mark translations as deviating, you can use the annotation functions.

[20]:
from sqlalchemy import sql
from siuba.sql.translate import annotate, wrap_annotate

# puts an `operation` attribute on function, describing limitations
f_and = annotate(lambda col: col & False, input_type = "bool")
f_and(sql.column('a'))
[20]:
<sqlalchemy.sql.elements.AsBoolean object at 0x7f1e90b69c10>
[21]:
f_and.operation
[21]:
{'input_type': 'bool'}
[22]:
# creates a new function (wrapping the old one), and annotates that
f_and2 = wrap_annotate(f_and, something_else = False)

f_and2.operation
[22]:
{'something_else': False}

Edit page on github here. Interactive version: Binder badge