---
title: "sport an R package for online update algorithms"
author: "Dawid Kałędkowski"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
 %\VignetteIndexEntry{sport an R package for online update algorithms}
 %\VignetteEngine{knitr::rmarkdown}
 %\usepackage[utf8]{inputenc}
---

# About
Name `sport` is an abbreviation for Sequential Pairwise Online Rating Techniques.
Package contains functions calculating ratings for two-player or multi-player
matchups. Methods included in package are able to estimate ratings 
(players strengths) and their evolution in time, also able to predict output of
challenge. Algorithms are based on Bayesian Approximation Method, and they don't
involve any matrix inversions nor likelihood estimation. `sport` incorporates
glicko algorithm, glicko2, bayesian Bradley-Terry and dynamic logistic 
regression. Parameters are updated sequentially, and computation doesn't require
any additional RAM to make estimation feasible. Additionally, package is written
in `c++` what makes computations even faster.  

Before start, it's recommended to read theoretical foundations of algorithms in 
other `sport` vignette "The theory of the online update algorithms". 

Package can be installed from CRAN or from github.

```{r eval=FALSE, warning=FALSE , message=FALSE}
install.packages("sport")
devtools::install_github("gogonzo/sport")
```


# Package Usage

## Available Data
Package contains actual data from Speedway Grand-Prix. There are two data.frames: 

1. `gpheats` - results SGP heats. Column `rank` is a numeric version of column 
`position` - rider position in race.

2. `gpsquads` - summarized results of the events, with sum of point and final 
position.

```{r echo=TRUE, warning=FALSE , message=FALSE}
library(sport)
data <- gpheats[1:1002, ]
str(data)
```

Data used in `sport` package must be in so called long format. Typically 
`data.frame` contains at least `id`, `name` of the player and `rank`, 
with one row for one player within specific match. Package allows for any number
of players within event and allows ties also.  

In all methods, output variable needs to be expressed as a rank/position in event. 
Don't mix up rank output with typical 1-win, 0-lost. In `sport` package output 
for two player game should be coded as 1=winner 2=looser. Below example of two
matches with 4 players each.

```{r echo=FALSE, warning=FALSE , message=FALSE}
data[1:8, c("id", "rider", "rank")]
```

## Estimate dynamic ratings

To compute ratings using each algorithms one has to specify formula. 
- RHS of the formula have to be specified with `player(player)` term or 
`player(player | team)` when players competes in team match. `player(...)` is a 
term function which helps identify column with `player` names and/or `team` names.
- LHS of the formula should contain `rank` term which points to column where
results (ranks) are stored and `id` (optional). RHS should rather be specified 
by `rank | id` to split matches - if `id` is missing all data will be computed 
under same event id.

```{r warning=FALSE , message=FALSE}
glicko <- glicko_run(formula = rank | id ~ player(rider), data = data)
glicko2 <- glicko2_run(formula = rank | id ~ player(rider), data = data)
bbt <- bbt_run(formula = rank | id ~ player(rider), data = data)
dbl <- dbl_run(formula = rank | id ~ player(rider), data = data)

print(glicko)
```

## Output

Objects returned by `<method>_run` are of class `rating` and have their own 
`print` and `summary` which provides simple overview. `print.sport` shows  
condensed informations about model performance like accuracy and consistency of 
model predictions with observed probabilities. More precise overview are  
given by `summary` by showing ratings, ratings deviations and comparing model 
win probabilities with observed.

```{r warning=FALSE , message=FALSE}
summary(dbl)
```

To visualize top n ratings with their 95% confidence interval one can use 
dedicated `plot.rating` function. For `dbl` method top coefficients are 
presented which doesn't have to be player specific (ratings). It's also possible 
to examine ratings evolution in time, by specifying `players` argument.

```{r message=FALSE, fig.show='hold', out.width = "50%", warnings=FALSE}
plot(glicko, n = 15)
plot(glicko, players = c("Greg Hancock", "Tomasz Gollob", "Tony Rickardsson"))
```

Except dedicated `print`,`summary` and `plot` there is possibility to extract 
more detailed information for analyses. `rating` object contains following 
elements:

```{r warning=FALSE , message=FALSE}
names(glicko)
```

* `rating$final_r` and `rating$final_rd` contains the last estimate of the 
ratings and ratings deviations. For `glicko2` there is also `rating$final_sigma`.

* `r` contains `data.table` with prior ratings estimations from first event 
to the last. Number of rows in `r` equals number of rows in input data.

* `pairs` pairwise combinations of players in analyzed events with prior 
probability and result of a challenge. 

```{r warning=FALSE , message=FALSE}
tail(glicko$r)
tail(glicko$pairs)
```

# Advanced sport
Examples presented in package overview might be sufficient in most cases, but 
sometimes it is necessary to adjust algorithms to fit data better. One 
characteristic of the online update algorithms is that variance of the parameters 
drops quickly to zero. Especially, when the number of events for the player is
big ($\small n_i>100 $), after hundreds iterations rating parameters are very
difficult to change, and output probabilities use to be extreme. To avoid these 
mistakes some additional controls should be applied, which is explained in this 
section with easy to learn examples.

## Formula

In all methods formula must contain `rank | id ~ player(player)` elements, to 
correctly specify the model.

- `rank` denotes column with output (order). 

- `id` denotes event id, within which update is computed.  

- `player(...)` function helps to identify column in which names of the players 
are stored. `player(...)` can be specified in two ways:  
  
  1. `player(player)` if results of the event are observed per player.
  
      ```{r}
glicko2 <- glicko2_run(
  formula = rank_player | id ~ player(player),
  data = data.frame(
    id = c(1, 1, 1, 1),
    player = c("a", "b", "c", "d"),
    rank_player = c(3, 4, 1, 2)
  )
)
      ```
  
  1. `player(player | team)` when players competes within teams, and results are 
  observed per team. This option is not available in `dbl_run` which requires only
  formula for player matchups.
  
      ```{r}
glicko2 <- glicko2_run(
  formula = rank_team | id ~ player(player | team),
  data = data.frame(
    id = c(1, 1, 1, 1),
    team = c("A", "A", "B", "B"),
    player = c("a", "b", "c", "d"),
    rank_team = c(1, 1, 2, 2)
  )
)
      ```
    

- other variables - available only in `dbl_run`, which allows to specify other 
factors in model. 

  ```{r}
dbl <- dbl_run(
  formula = rank | id ~ player(name) + gate * factor1,
  data = data.frame(
    id = c(1, 1, 1, 1),
    name = c("A", "B", "C", "D"),
    rank = c(3, 4, 1, 2),
    gate = c(1, 2, 3, 4),
    factor1 = c("a", "a", "b", "b"),
    factor2 = c("a", "b", "a", "b")
  )
)
  ```


## Prior beliefs about `r` and `rd`
Main functionality which is common between all algorithms is to specify prior `r`
and `rd`. Both parameters can be set by creating named vectors. Let's suppose we
have 4 players `c("A","B","C","D")` competing in an event, and we have players 
prior `r` and `rd` estimates. It's important to have `r` and `rd` names 
corresponding with levels of `name` variable. One can run algorithm, to obtain 
new estimates.

```{r warning=FALSE , message=FALSE}
model <- glicko_run(
  formula = rank | id ~ player(rider),
  data = gpheats[1:16, ]
)
```

We can also run models re-using previously estimated parameters from `model$final_r` 
and `model$final_rd` in the future when new data appear.

```{r warning=FALSE , message=FALSE}
glicko_run(
  formula = rank | id ~ player(rider),
  data = gpheats[17:20, ],
  r = model$final_r,
  rd = model$final_rd
)$final_r
```

## Controlling update size by `weight`

All algorithms have a weight argument which increases or decreases update size. 
Higher weight increasing impact of corresponding event. Effect of the weight on 
update size can be expressed directly by following formula - 
$\small R_i^{'} \leftarrow R_i \pm \omega_i * \Omega_i$. 
To specify weight $\omega_i$  one needs to create additional column in input 
data, and pass the name of the column to `weight` argument. For example weight 
could depend on importance of competition. In speedway Grand-Prix last three 
heats determine event winner, thus they weight more.

```{r warning=FALSE , message=FALSE, echo=FALSE}
library(dplyr)
data <- mutate(data,
  weight = ifelse(heat >= (max(heat) - 3), 2, 1)
)

glicko <- glicko_run(
  formula = rank | id ~ player(rider),
  data = data,
  weight = "weight"
)
```

## Avoiding excessive RD shrinkage with `kappa`

In situation when player plays games very frequently, `rd` can quickly decrease 
to zero, making further changes limited. Setting `kappa` (single value) avoids 
rating deviation decrease to be lower than specified fraction of `rd`. In  other 
words final `rd` can't be lower than initial `RD` times `kappa` 

$$\small RD' \geq RD * kappa$$

```{r warning=FALSE , message=FALSE, echo=FALSE}
bbt1 <- bbt_run(
  formula = rank | id ~ player(rider),
  data = data,
  kappa = 0.99
) # RD decreases at most 1%
```

## Control output uncertainty by `lambda`

In some cases player ratings tend to be more uncertain. If scientist have prior
knowledge about higher risk of event or uncertainty of specific player 
performance, then one might create another column with relevant values and pass 
the column name to `lambda` argument.

```{r warning=FALSE , message=FALSE}
# bbt example
data <- data %>%
  group_by(rider) %>%
  mutate(idle_30d = if_else(as.integer(date - lag(date)) > 30, 1.0, 2.0)) %>%
  filter(!is.na(idle_30d))

bbt <- bbt_run(
  formula = rank | id ~ player(rider),
  data = data,
  lambda = "idle_30d"
)
```

## Players nested within teams

In above examples players competes as individuals, and each is ranked at the 
finish line. There are sports where players, competes in teams, and results
are reported per team. `sport` is able to compute player ratings, and requires 
only changing formula from `player(player)` to `player(player | team)`. `data.frame`
should always be a long format, with one player for each row. Ratings are updated
according to their contribution in team efforts. `share` argument can be added
optionally if scientist have some knowledge about players contribution in match 
(eg. minutes spent on the field from all possible minutes).

```{r warning=FALSE , message=FALSE}
glicko2 <- glicko2_run(
  data = data.frame(
    id = c(1, 1, 1, 1),
    team = c("A", "A", "B", "B"),
    player = c("a", "b", "c", "d"),
    rank_team = c(1, 1, 2, 2),
    share = c(0.4, 0.6, 0.5, 0.5)
  ),
  formula = rank_team | id ~ player(player | team),
  share = "share"
)

glicko2$final_r
```


Output object contains the same elements as normal, with one difference -
`pairs` contains probability and output per team, and `r` contains prior ratings 
per individuals.

```{r warning=FALSE , message=FALSE}
glicko2$pairs
glicko2$r
```
