--- title: "simDAG Cookbook" output: rmarkdown::html_vignette author: "Robin Denz" vignette: > %\VignetteIndexEntry{simDAG Cookbook} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} knitr::opts_chunk$set( collapse=TRUE, comment="#>", fig.align="center" ) ``` # Introduction The `simDAG` package can be used to generate all kinds of data, as shown in the many examples and other vignettes of this package. In this vignette we will further illustrate the capabilities of the package, by giving very short and simplified examples on how to use `simDAG` to generate multiple different kinds of data. The vignette can be seen as a sort of "cookbook", in the sense that it includes the building blocks for many different possible data generation processes (DGP), which users can expand on for their specific needs. Note that the examples shown are not meant to be realistic, they are only meant to show the general structure of the required code. This vignette assumes that the reader is already somewhat familiar with the `simDAG` syntax. More detailed explanations of how to simulate data using a time-fixed DAG and DAGs including time-dependent variables is given in other vignettes and the documentation of the functions. ```{r, message=FALSE, warning=FALSE} library(simDAG) library(data.table) ``` # Randomized Controlled Trials First, we will give some examples on how data for randomized controlled trials (RCT) could be generated. In a classic RCT, the treatment of interest (below named `Treatment`) is randomly assigned to the individuals of the study, making it a "root node" in the terminology of DAGs. ## Two Treatment Groups Below is an example for an RCT with two treatment groups, a binary outcome and two baseline covariates (`Age` and `Sex`): ```{r} dag <- empty_dag() + node("Age", type="rnorm", mean=55, sd=5) + node("Sex", type="rbernoulli", p=0.5) + node("Treatment", type="rbernoulli", p=0.5) + node("Outcome", type="binomial", formula= ~ -12 + Age*0.2 + Sex*1.1 + Treatment*-0.5) data <- sim_from_dag(dag, n_sim=1000) head(data) ``` ## Three or More Treatment Groups The example from earlier can easily be made a little more complex, by including three treatment groups instead of just two. This can be achieved by using the `rcategorical()` function as node type instead of the `rbernoulli()` function. ```{r} dag <- empty_dag() + node("Age", type="rnorm", mean=55, sd=5) + node("Sex", type="rbernoulli", p=0.5, output="numeric") + node("Treatment", type="rcategorical", probs=c(0.33333, 0.33333, 0.33333), labels=c("Placebo", "Med1", "Med2"), output="factor") + node("Outcome", type="binomial", formula= ~ -12 + Age*0.2 + Sex*1.1 + TreatmentMed1*-0.5 + TreatmentMed2*-1) data <- sim_from_dag(dag, n_sim=1000) head(data) ``` By setting `probs=c(0.33333, 0.33333, 0.33333)`, each treatment group is choosen with a probability of 0.33333, meaning that the resulting groups should be roughly of equal size in expectation. ## Multiple Outcome Measurements The previous two examples assumed that there was a single fixed time at which the binary outcome was measured. Below we switch things up a bit by using a continuous outcome that is measured at 5 different points in time after baseline. The syntax is nearly the same as before, with the major difference being that we now use a `node_td()` call to define the outcome, because it is time-dependent. Additionally, because of this inclusion of a time-dependent node, we have to use the `sim_discrete_time()` function instead of the `sim_from_dag()` function. Finally, the `sim2data()` has to be called to obtain the desired output in the long-format. ```{r} dag <- empty_dag() + node("Age", type="rnorm", mean=55, sd=5) + node("Sex", type="rbernoulli", p=0.5, output="numeric") + node("Treatment", type="rbernoulli", p=0.5) + node_td("Outcome", type="gaussian", formula= ~ -12 + Age*0.2 + Sex*1.1 + Treatment*-0.5, error=1) sim <- sim_discrete_time(dag, n_sim=1000, max_t=5, save_states="all") data <- sim2data(sim, to="long") head(data) ``` ## Non-Compliance to Treatment Assignment Previously we assumed that the treatment was assigned at baseline and never changed throughout the study. In real RCTs, individuals are often assigned to *treatment strategies* instead (for example: one pill per week). Some individuals might not adhere to their assigned treatment strategy, by for example "switching back" to not taking the pill. The following code shows one possible way to simulate such data. In the shown DGP, individuals are assigned to a treatment strategy at baseline. Here, `Treatment = FALSE` refers to the control condition in which the individual should not take any pills. In the intervention group (`Treatment = TRUE`), however, the individual is assigned to take a pill every week. On average, 5% of these individuals stop taking their pills with each passing week. Once they stop, they never start taking them again. The continuous outcome at each observation time depends on how many pills each individual took in total. None of the individuals in the control group start taking pills. For simplicity, the simulation is run without any further covariates for 5 weeks. ```{r} # function to calculate the probability of taking the pill at t, # given the current treatment status of the person prob_treat <- function(data) { fifelse(!data$Treatment_event, 0, 0.95) } dag <- empty_dag() + node("Treatment_event", type="rbernoulli", p=0.5) + node_td("Treatment", type="time_to_event", parents=c("Treatment_event"), prob_fun=prob_treat, event_count=TRUE, event_duration=1) + node_td("Outcome", type="gaussian", formula= ~ -2 + Treatment_event_count*-0.3, error=2) sim <- sim_discrete_time(dag, n_sim=1000, max_t=5, save_states="all") data <- sim2data(sim, to="long") head(data) ``` What this DAG essentially means is that first, the `Treatment_event` column is generated, which includes the assigned treatment at baseline. We called it `Treatment_event` here instead of just `Treatment`, because the value in this column should be updated with each iteration and nodes of type `"time_to_event"` always split the node into two columns: status and time. By setting `event_count=TRUE` in the `node_td()` call for the `Treatment` node, a count of the amount of pills taken up to $t$ is directly calculated at each point in time (column `Treatment_event_count`), which can then be used directly to generate the `Outcome` node. This simulation could be made more realistic (or just more complex), by for example adding either of the following things: * making the probability of switching dependent on the outcome at $t - 1$ * making the probability of switching dependent on other variables * allowing patients to switch back to taking the pill after discontinuation * allowing patients in the control group to switch to the treatment These additions are left as an exercise to the user. ## With Cluster Randomization Instead of randomly assigning *individuals* to the treatment, many trials actually randomly assign the treatment at the level of clinics or other forms of *clusters*, which is fittingly called cluster randomization. This may be implemented in `simDAG` using the following syntax: ```{r} dag <- empty_dag() + node("Clinic", type="rcategorical", probs=rep(0.02, 50)) + node("Treatment", type="identity", formula= ~ Clinic >= 25) + node("Outcome", type="poisson", formula= ~ -1 + Treatment*4 + (1|Clinic), var_corr=0.5) data <- sim_from_dag(dag, n_sim=1000) head(data) ``` In this DGP, each individual is randomly assigned to one of 50 Clinics with equal probability. All patients in clinics numbered 0-24 are assigned to the control group, while the other patients are assigned to the treatment group. The outcome is then generated using a Poisson regression with a random intercept based on the clinic. # Observational Studies In all previous examples, the variable of interest was a treatment that was randomly assigned and did not depend on other variables (excluding the `Clinic` for the cluster randomization example). In observational studies, the variable of interest is usually not randomly assigned. Below are some examples for generating such observational study data. ## Crossectional Data ## Panel Data # Miscellaneous ## Including Multi-Level Data ## Including "Outliers" ## Including Missing Values Real data usually includes at least some missing values in key variables. Below is a very simple example on how users could include missing values in their data: ```{r} dag <- empty_dag() + node("A_real", type="rnorm", mean=10, sd=3) + node("A_missing", type="rbernoulli", p=0.5) + node("A_observed", type="identity", formula= ~ fifelse(A_missing, NA, A_real)) data <- sim_from_dag(dag, n_sim=10) head(data) ``` In this DAG, the real values of node `A` are generated first. Afterwards, an indicator of whether the corresponding values is observed is drawn randomly for each individual in node `A_missing`. The actually observed value, denoted `A_observed`, is then generated by simply using either the value of `A_real` or a simple `NA` if `A_missing==TRUE`. The missingness shown here would correspond to **missing completely at random** (MCAR) in the categorisation by XX. Although this might seem a little cumbersome at first, it does allow quite a lot of flexibility in the specification of different missingness mechanisms. For example, to simulate **missing at random** (MAR) patterns, we could use the following code instead: ```{r} dag <- empty_dag() + node("A_real", type="rnorm", mean=0, sd=1) + node("B_real", type="rbernoulli", p=0.5) + node("A_missing", type="rbernoulli", p=0.1) + node("B_missing", type="binomial", formula= ~ -5 + A_real*0.1) + node("A_observed", type="identity", formula= ~ fifelse(A_missing, NA, A_real)) + node("B_observed", type="identity", formula= ~ fifelse(B_missing, NA, B_real)) data <- sim_from_dag(dag, n_sim=10) head(data) ``` Here, the missingness in `A_observed` is again MCAR, because it is independent of everything. The missingness in `B_observed`, however, is MAR because the probability of missingness is dependent on the actually observed value of `A_real`. ## Including Measurement Error Measurement error refers to situations in which variables of interest are not measured perfectly. For example, the disease of interest may only be detected in 90% of all patients with the disease and may falsely be detected in 1% of all patients without the disease. The same strategy shown for missing values could be used to simulate such data using `simDAG`: ```{r} probs <- list(`TRUE`=0.9, `FALSE`=0.01) dag <- empty_dag() + node("Disease_real", type="rbernoulli", p=0.5) + node("Disease_observed", type="conditional_prob", parents="Disease_real", probs=probs) data <- sim_from_dag(dag, n_sim=10) head(data) ``` In this example, the disease is present in 50% of all individuals. By using a node with `type="conditional_prob"` we can easily draw new values for the observed disease status by specifying the `probs` argument correctly. We could similarly extend this example to make the probability of misclassification dependent on another variable: ```{r} # first TRUE / FALSE refers to Sex = TRUE / FALSE # second TRUE / FALSE refers to Disease = TRUE / FALSE probs <- list(TRUE.TRUE=0.9, TRUE.FALSE=0.01, FALSE.TRUE=0.8, FALSE.FALSE=0.05) dag <- empty_dag() + node("Sex", type="rbernoulli", p=0.5) + node("Disease_real", type="rbernoulli", p=0.5) + node("Disease_observed", type="conditional_prob", parents=c("Sex", "Disease_real"), probs=probs) data <- sim_from_dag(dag, n_sim=1000) head(data) ``` In this extended example, `Sex` is equally distributed among the population (with `TRUE` = "female" and `FALSE` = "male"). In this example, the probability of being diagnosed with the disease (`Disease_observed`) if the disease is actually present is 0.9 for females and only 0.8 for males. Similarly, the probability of being diagnosed if the disease is not present is 0.01 for females and 0.05 for males. ## Very Heterogeneous Data ## Ensuring Event-Order in Discrete-Time Simulations ## Correlated Error Terms