% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tbl_dbi.R
\name{correlate.tbl_dbi}
\alias{correlate.tbl_dbi}
\title{Compute the correlation coefficient between two numerical variables}
\usage{
\method{correlate}{tbl_dbi}(.data, ..., in_database = FALSE, collect_size = Inf)
}
\arguments{
\item{.data}{a tbl_dbi.}

\item{...}{one or more unquoted expressions separated by commas.
You can treat variable names as if they were positions.
Positive values select variables; negative values drop variables.
If the first expression is negative, correlate() automatically starts with all variables.
These arguments are automatically quoted and evaluated in a context where column names
represent column positions.
They support unquoting and splicing.}

\item{in_database}{Specifies whether to perform in-database operations.
If TRUE, most operations are performed in the DBMS. If FALSE,
the table data is collected into R and processed in memory. in_database = TRUE is not yet supported.}

\item{collect_size}{an integer. The number of rows to fetch from the DBMS into R.
Applies only if in_database = FALSE.

See vignette("EDA") for an introduction to these concepts.}
}
\description{
correlate() computes Pearson's correlation coefficient
for the numerical (INTEGER, NUMBER, etc.) columns of
a DBMS table accessed through a tbl_dbi.
}
\details{
This function is useful when combined with the group_by() function of the dplyr package.
If you want to compute the coefficients for each level of a categorical variable of interest,
rather than over all observations, group the data with group_by() first (see \code{\link{grouped_df}}).
The coefficients are computed with stats::cor() using the use = "pairwise.complete.obs" option.
}
\section{Correlation coefficient information}{

The information computed from the numerical data is as follows.

\itemize{
\item var1 : name of the numerical variable
\item var2 : name of the corresponding numerical variable
\item coef_corr : Pearson's correlation coefficient
}
}

\examples{
library(dplyr)

# Generate data for the example
carseats <- ISLR::Carseats
carseats[sample(seq(NROW(carseats)), 20), "Income"] <- NA
carseats[sample(seq(NROW(carseats)), 5), "Urban"] <- NA

# Connect to the DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy carseats to the DBMS as a table named TB_CARSEATS
copy_to(con_sqlite, carseats, name = "TB_CARSEATS", overwrite = TRUE)

# Using pipes ---------------------------------
# Correlation coefficients of all numerical variables
con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  correlate()
 
# Positive values select variables
con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  correlate(Sales, Price)
 
# Negative values drop variables; in-memory mode with a collect size of 200
con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  correlate(-Sales, -Price, collect_size = 200)
 
# Positional values select variables
con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  correlate(1)
 
# Negative positional values drop variables
con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  correlate(-1, -2, -3, -5, -6)
 
# ---------------------------------------------
# Correlation coefficients with redundant
# combinations of variables removed
con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  correlate() \%>\%
  filter(as.integer(var1) > as.integer(var2))

con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  correlate(Sales, Price) \%>\%
  filter(as.integer(var1) > as.integer(var2))
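
# Cross-check with stats::cor() ---------------
# A minimal sketch, not the package's actual implementation: it assumes that
# in-memory mode is roughly equivalent to collecting the table into R and
# calling stats::cor() with use = "pairwise.complete.obs", as described in
# Details. The name carseats_local is introduced here for illustration only.
carseats_local <- con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  select(Sales, Price, Income) \%>\%
  collect()

# Returns a correlation matrix rather than the long var1/var2/coef_corr table
stats::cor(carseats_local, use = "pairwise.complete.obs")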

# Using pipes & dplyr -------------------------
# Compute the correlation coefficients of the 'Sales' variable by the
# 'ShelveLoc' and 'US' variables, and keep only those whose absolute
# value is at least 0.5
con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  group_by(ShelveLoc, US) \%>\%
  correlate(Sales) \%>\%
  filter(abs(coef_corr) >= 0.5)

# Keep only the rows where the 'ShelveLoc' variable is "Good",
# and compute the correlation coefficients of the 'Sales' variable
# by the 'Urban' and 'US' variables.
# Then keep only negative coefficients whose absolute value is greater than 0.5
con_sqlite \%>\% 
  tbl("TB_CARSEATS") \%>\% 
  filter(ShelveLoc == "Good") \%>\%
  group_by(Urban, US) \%>\%
  correlate(Sales) \%>\%
  filter(coef_corr < 0) \%>\%
  filter(abs(coef_corr) > 0.5)

# Disconnect from the DBMS
DBI::dbDisconnect(con_sqlite)
 
}
\seealso{
\code{\link{correlate.data.frame}}, \code{\link{cor}}.
}
