This document describes how to add a new SQL backend to dbplyr. To begin:
Ensure that you have a DBI compliant database backend. If not,
you’ll need to first create it by following the instructions in
vignette("backend", package = "DBI").
You’ll need a working knowledge of S3. Make sure that you’re familiar with the basics before you start.
This document is still a work in progress, but it will hopefully get you started. I’d also strongly recommend reading the bundled source code for SQLite, MySQL, and PostgreSQL.
For interactive exploitation, attach dplyr and DBI. If you’re creating a package, you’ll need to import dplyr and DBI.
Check that you can create a tbl from a connection, like:
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)
tbl(con, "mtcars")
#> # A query: ?? x 11
#> # Database: sqlite 3.52.0 []
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> # ℹ more rowsIf you can’t, this likely indicates some problem with the DBI methods. Use DBItest to narrow down the problem.
The first method of your dbplyr backend should always be for the
dbplyr_edition() generic:
#' @importFrom dbplyr dbplyr_edition
#' @export
dbplyr_edition.myConnectionClass <- function(con) 2LThis declares that your package uses version 2 of the API, which is the version that this vignette documents.
Next, define a dialect for your backend using
sql_dialect() and new_sql_dialect(). A dialect
encapsulates the SQL syntax rules for your database:
#' @export
sql_dialect.myConnectionClass <- function(con) {
new_sql_dialect(
"mybackend",
quote_identifier = function(x) sql_quote(x, '"')
)
}The new_sql_dialect() function requires two
arguments:
dialect: A unique name for your backend (used to create
the class name).quote_identifier: A function that quotes identifiers.
This will typically be either a call to sql_quote() or to
DBI::dbQuoteIdentifier()new_sql_dialect() has a number of additional arguments
that describe the capabilities of the SQL dialect; see
?new_sql_dialect for details.
The dialect system allows you to write methods that dispatch on the
dialect class (e.g., sql_translation.sql_dialect_mybackend)
rather than the connection class. This makes it easier to share SQL
generation code between different connection types that use the same
database (e.g. direct, ODBC, JDBC, ADBC).
In general, generics that start with sql_ should have a
sql_dialect method, and generics that start with db_ should
use a DBI connection method. It should be relatively rare to need a
method for db_ generic, but occasionally dbplyr’s SQL
generation system is not quite flexible enough. If you find this, it’s
worth filing an issue so I can look into it.
Next, check that copy_to(), collapse(),
compute(), and collect() work:
If copy_to() fails, you probably need a method for
sql_table_analyze() or sql_table_index(). If
copy_to() fails during creation of the tbl, you may need a
method for sql_query_fields().
If collapse() fails, your database has a
non-standard way of constructing subqueries. Add a method for
sql_subquery().
If compute() fails, your database has a non-standard
way of saving queries in temporary tables. Add a method for
db_save_query().
Make sure you’ve read vignette("translation-verb") so
you have the lay of the land. First check that SQL translation for the
key verbs work:
summarise(), mutate(),
filter() etc: powered by
sql_query_select()left_join(), inner_join(): powered by
sql_query_join()semi_join(), anti_join(): powered by
sql_query_semi_join()union(), intersect(),
setdiff(): powered by sql_query_set_op()If you need to generate your own SQL, we recommend using
sql_glue2(). It uses glue syntax with type markers for safe
SQL generation:
con <- simulate_dbi()
# Create an index
index_name <- "index"
table <- I("schema.table")
columns <- c("column1", "column2")
sql_glue2(con, "CREATE INDEX {.id index_name} ON {.tbl table} {.id columns*}")
#> <SQL> CREATE INDEX "index" ON schema.table ("column1", "column2")
# Insert values safely
name <- "O'Brien"
sql_glue2(con, "INSERT INTO students (name) VALUES {name*}")
#> <SQL> INSERT INTO students (name) VALUES ('O''Brien')
# Build a query
table <- "my_table"
cols <- c("id", "name", "value")
sql_glue2(con, "SELECT {.id cols} FROM {.tbl table}")
#> <SQL> SELECT "id", "name", "value" FROM "my_table"The high-level translation from R code to SQL takes place in a few steps, using two intermediate representations: a “lazy query” and a “query”. A lazy query is closer to R and a query is closer to SQL. This continuum of representation allows us to work with expressions when trying to reason about the transformation pipeline, then later switch to strings of SQL when we get closer to generating a query.
All dbplyr pipelines are built on top of the tbl_lazy
data structure, which is the only data structure that the user will ever
see. It has two fields:
connection which is used to specialise SQL
translation.lazy_query which represents the lazy computation
described by the pipeline. It always starts off as a
lazy_base_query which represents a remote table or SQL
query.Each verb takes a tbl_lazy as input and returns a
modified tbl_lazy, modifying the lazy_query
with the new requirements. At the simplest level, this involves wrapping
the existing lazy query within a new lazy query, such as
lazy_query_select() or
lazy_query_multi_join(). These represent what later will
become a subquery so that (e.g.) filter(),
mutate(), select(), arrange(),
and summarise() all use lazy_select_query(),
whereas left_join() and inner_join() use
lazy_query_multi_join().
These lazy queries are typically created by a helper function like
add_select(), add_mutate(), and
add_filter(). These helpers have two roles:
They abstract away repeated code for substantially similar dplyr
verbs. For example, select(), rename(),
rename_with() and relocate() all use
add_select().
They are responsible for figuring out if we can re-use the
existing query or we need to wrap in a subquery. Typically this work is
done by a function called can_inline_select() which will
use what we know about SQL and the verb to determine whether or not we
have to introduce a subquery.
The lazy query gives us a representation of the data manipulation
pipeline with enough detail that we can reason about when we need a
subquery. Then when the user calls collect(), we need to
generate an SQL query. This happens in two steps: building and
rendering.
First, sql_build() recurses over the lazy query,
building up query objects like select_query() and
multi_join_query() that represent the different subtypes of
SELECT queries. Compared to the previous “lazy” query
representation, these are much closer to SQL than R. They can be much
simpler because we no longer need to reason about subqueries, but we
still need some structure to facilitate backend-specific
translations.
Next, sql_render() is called on the result of
sql_build() and dispatches to generics like
sql_query_select(), sql_query_join(),
sql_query_semi_join(), and sql_query_set_op().
These generics have methods for each backend that are responsible for
creating the needed SQL.
Finally, you may have to provide custom R -> SQL translation for
functions that work with vectors within verbs. You can do this by
providing a method for sql_translation(), which returns an
object created by sql_variant().
You can define translation methods either on your connection class or on your dialect class. Using the dialect class is recommended as it allows sharing translations between different connection types:
# Method on dialect class (recommended)
sql_translation.sql_dialect_mybackend <- function(con) {
sql_variant(
scalar = sql_translator(base_scalar, ...), # Functions in SELECT (non-aggregated)
aggregate = sql_translator(base_aggregate, ...), # Aggregation functions (mean, sum, etc.)
window = sql_translator(base_win, ...) # Window functions (lead, lag, rank, etc.)
)
}Each translator will inherit from the base (ANSI SQL) translator and override only what’s different for your backend:
sql_translator(
base_scalar, # Inherit most translations
# Override specific functions for your backend
`+` = sql_infix("+"),
mean = sql_aggregate("AVG", "mean")
)dbplyr provides several helper functions to make it easier to translate R functions to SQL:
sql_prefix(f, n = NULL): For standard
SQL functions. The n argument optionally specifies the
number of arguments.sql_infix(f): For infix operators like
+, *, or ==.sql_cast(type): For type casting
functions.sql_not_supported(f): For functions
with no SQL translation.Here’s an example showing all of these helpers in use:
sql_translation.sql_dialect_mybackend <- function(con) {
sql_variant(
scalar = sql_translator(
base_scalar,
# Standard SQL functions
cos = sql_prefix("COS", 1),
round = sql_prefix("ROUND", 2),
# Infix operators
`+` = sql_infix("+"),
`*` = sql_infix("*"),
`==` = sql_infix("="),
# Type casting
as.numeric = sql_cast("NUMERIC"),
as.character = sql_cast("VARCHAR")
),
aggregate = sql_translator(base_agg),
window = sql_translator(base_win)
)
}sql_aggregate(f, f_r = f): For
single-argument SQL aggregate functions. The f_r argument
gives the name of the R function.sql_aggregate_2(f): For two-argument
SQL aggregate functions.sql_aggregate_n(f, f_r = f): For
variadic SQL aggregate functions.sql_translation.sql_dialect_mybackend <- function(con) {
sql_variant(
scalar = sql_translator(base_scalar),
aggregate = sql_translator(
.parent = base_agg,
# Single-argument aggregates
mean = sql_aggregate("AVG", "mean"),
var = sql_aggregate("VAR_SAMP", "var"),
# Two-argument aggregates
cov = sql_aggregate_2("COVAR_SAMP"),
# Variadic aggregates
pmin = sql_aggregate_n("LEAST", "pmin"),
pmax = sql_aggregate_n("GREATEST", "pmax"),
# Unsupported functions
median = sql_not_supported("median")
),
window = sql_translator(base_win)
)
}Window functions have their own set of helpers:
win_rank(f): For ranking
functions.win_aggregate(f): For aggregate
functions used as window functions.win_cumulative(f): For cumulative
functions.win_absent(f): For backends that don’t
support certain window functions.Here’s an example showing all of these helpers in use:
window = sql_translator(
base_win,
# Ranking functions
row_number = win_rank("ROW_NUMBER"),
rank = win_rank("RANK"),
dense_rank = win_rank("DENSE_RANK"),
# Aggregate functions as window functions
mean = win_aggregate("AVG"),
sum = win_aggregate("SUM"),
# Cumulative functions
cumsum = win_cumulative("SUM"),
# Absent functions
cume_dist = win_absent("cume_dist")
)For more complex translations, you can write custom functions that
return SQL expressions using sql_glue(). This uses glue
syntax for string interpolation with automatic escaping.
scalar = sql_translator(
base_scalar,
# Custom log function with change of base
log = function(x, base = exp(1)) {
if (isTRUE(all.equal(base, exp(1)))) {
sql_glue("LN({x})")
} else {
sql_glue("LOG({x}) / LOG({base})")
}
},
# Custom paste function using CONCAT
paste = function(..., sep = " ") {
sql_glue("CONCAT_WS({sep}, {...})")
}
)