Title: | Simulate Data from a DAG and Associated Node Information |
---|---|
Description: | Simulate complex data from a given directed acyclic graph and information about each individual node. Root nodes are simply sampled from the specified distribution. Child Nodes are simulated according to one of many implemented regressions, such as logistic regression, linear regression, poisson regression and more. Also includes a comprehensive framework for discrete-time simulation, which can generate even more complex longitudinal data. |
Authors: | Robin Denz [aut, cre], Katharina Meiszl [aut] |
Maintainer: | Robin Denz <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.2 |
Built: | 2025-02-23 15:24:04 UTC |
Source: | https://github.com/robindenz1/simdag |
What is this package about?
This package aims to give a comprehensive framework to simulate static and longitudinal data given a directed acyclic graph and some information about each node. Our goal is to make this package as user-friendly and intuitive as possible, while allowing extreme flexibility and while keeping the underlying code as fast and RAM efficient as possible.
What features are included in this package?
This package includes two main simulation functions: the sim_from_dag
function, which can be used to simulate data from a previously defined causal DAG and node information and the sim_discrete_time
function, which implements a framework to conduct discrete-time simulations. The former is very easy to use, but cannot deal with time-varying variable easily. The latter is a little more difficult to use (usually requiring the user to write some functions himself), but allows the simulation of arbitrarily complex longitudinal data.
Through a collection of implemented node types, this package allows the user to generate data with a mix of binary, categorical, count and time-to-event data. The sim_discrete_time
function additionally enables the user to generate time-to-event data with, if desired, a mix of competing events, recurrent events, time-varying variables that influence each other and any types of censoring.
The package also includes a few functions to transform resulting data into multiple formats, to augment existing DAGs, to plot DAGs and to plot a flow-chart of the data generation process.
What does a typical workflow using this package look like?
Users should start by defining a DAG
object using the empty_dag
and node
functions. This DAG
can then be passed to one of the two simulation functions included in this package. More information on how to do this can be found in the respective documentation pages and the three vignettes of this package.
When should I use sim_from_dag
and when sim_discrete_time
?
If you want to simulate data that is easily described using a standard DAG without time-varying variables, you should use the sim_from_dag
function. If the DAG includes time-varying variables, but you only want to consider a few points in time and can easily describe the relations between those manually, you can still use the sim_from_dag
function. If you want more complex data with time-varying variables, particularly with time-to-event outcomes, you should consider using the sim_discrete_time
function.
What features are missing from this package?
The package currently only implements some possible child nodes. In the future we would like to implement more child node types, such as nodes with generalized mixed linear models or more complex survival time models.
Why should I use this package instead of the simCausal package?
The simCausal package was a big inspiration for this package. In contrast to it, however, it allows quite a bit more flexibility. A big difference is that this package includes a comprehensive framework for discrete-time simulations and the simCausal package does not.
Where can I get more information?
The documentation pages contain a lot of information, relevant examples and some literature references. Additional examples can be found in the vignettes of this package, which can be accessed using:
vignette(topic="v_sim_from_dag", package="simDAG")
vignette(topic="v_sim_discrete_time", package="simDAG")
vignette(topic="v_covid_example", package="simDAG")
vignette(topic="v_using_formulas", package="simDAG")
We are also working on a separate article on this package that is going to be published in a peer-reviewed journal.
I have a problem using the sim_discrete_time
function
The sim_discrete_time
function can become difficult to use depending on what kind of data the user wants to generate. For this reason we put in extra effort to make the documentation and examples as clear and helpful as possible. Please consult the relevant documentation pages and the vignettes before contacting the authors directly with programming related questions that are not clearly bugs in the code.
I want to suggest a new feature / I want to report a bug. Where can I do this?
Bug reports, suggestions and feature requests are highly welcome. Please file an issue on the official github page or contact the author directly using the supplied e-mail address.
Robin Denz, <[email protected]>
Banks, Jerry, John S. Carson II, Barry L. Nelson, and David M. Nicol (2014). Discrete-Event System Simulation. Vol. 5. Edinburgh Gate: Pearson Education Limited.
DAG.node
object to a DAG
object
This function allows users to add DAG.node
objects created using the node
or node_td
function to DAG
objects created using the empty_dag
function, which makes it easy to fully specify a DAG to use in the sim_from_dag
function and sim_discrete_time
.
add_node(dag, node) ## S3 method for class 'DAG' object_1 + object_2
add_node(dag, node) ## S3 method for class 'DAG' object_1 + object_2
dag |
A |
node |
A |
object_1 |
Either a |
object_2 |
See argument |
The two ways of adding a node to a DAG
object are: dag <- add_node(dag, node(...))
and dag <- dag + node(...)
, which give identical results (note that the ...
should be replaced with actual arguments and that the initial dag
should be created with a call to empty_dag
). See node
for more information on how to specify a DAG
for use in the sim_from_dag
and node_td
functions.
Returns an DAG
object with the DAG.node
object added to it.
Robin Denz
library(simDAG) ## add nodes to DAG using + dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=5) + node("sex", type="rbernoulli", p=0.5) + node("income", type="gaussian", parents=c("age", "sex"), betas=c(1.1, 0.2), intercept=-5, error=4) ## add nodes to DAG using add_node() dag <- empty_dag() dag <- add_node(dag, node("age", type="rnorm", mean=50, sd=5))
library(simDAG) ## add nodes to DAG using + dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=5) + node("sex", type="rbernoulli", p=0.5) + node("income", type="gaussian", parents=c("age", "sex"), betas=c(1.1, 0.2), intercept=-5, error=4) ## add nodes to DAG using add_node() dag <- empty_dag() dag <- add_node(dag, node("age", type="rnorm", mean=50, sd=5))
igraph
object
This function extends the as.igraph
function from the igraph
package to allow the input of a DAG
object. The result is an igraph
object that includes only the structure of the DAG, not any specifications. May be useful for plotting purposes.
## S3 method for class 'DAG' as.igraph(x, ...)
## S3 method for class 'DAG' as.igraph(x, ...)
x |
A |
... |
Currently not used. |
Returns a igraph
object.
Robin Denz
library(simDAG) # some example DAG dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) if (requireNamespace("igraph")) { g <- igraph::as.igraph(dag) plot(g) }
library(simDAG) # some example DAG dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) if (requireNamespace("igraph")) { g <- igraph::as.igraph(dag) plot(g) }
DAG
object with parameters estimated from reference data
Given a partially specified DAG
object, where only the name
, type
and the parents
are specified plus a data.frame
containing realizations of these nodes, return a fully specified DAG
(with beta-coefficients, intercepts, errors, ...). The returned DAG
can be used directly to simulate data with the sim_from_dag
function.
dag_from_data(dag, data, return_models=FALSE, na.rm=FALSE)
dag_from_data(dag, data, return_models=FALSE, na.rm=FALSE)
dag |
A partially specified |
data |
A |
return_models |
Whether to return a list of all models that were fit to estimate the information for all child nodes (elements in |
na.rm |
Whether to remove missing values or not. |
How it works:
It can be cumbersome to specify all the node information needed for the simulation, especially when there are a lot of nodes to consider. Additionally, if data is available, it is natural to fit appropriate models to the data to get an empirical estimate of the node information for the simulation. This function automates this process. If the user has a reasonable DAG and knows the node types, this is a very fast way to generate synthetic data that corresponds well to the empirical data.
All the user has to do is create a minimal DAG
object including only information on the parents
, the name
and the node type
. For root nodes, the required distribution parameters are extracted from the data. For child nodes, regression models corresponding to the specified type
are fit to the data using the parents
as independent covariates and the name
as dependent variable. All required information is extracted from these models and added to the respective node. The output contains a fully specified DAG
object which can then be used directly in the sim_from_dag
function. It may also include a list containing the fitted models for further inspection, if return_models=TRUE
.
Supported root node types:
Currently, the following root node types are supported:
"rnorm"
: Estimates parameters of a normal distribution.
"rbernoulli"
: Estimates the p
parameter of a Bernoulli distribution.
"rcategorical"
: Estimates the class probabilities in a categorical distribution.
Other types need to be implemented by the user.
Supported child node types:
Currently, the following child node types are supported:
"gaussian"
: Estimates parameters for a node of type "gaussian"
.
"binomial"
: Estimates parameters for a node of type "binomial"
.
"poisson"
: Estimates parameters for a node of type "poisson"
.
"negative_binomial"
: Estimates parameters for a node of type "negative_binomial"
.
"conditional_prob"
: Estimates parameters for a node of type "conditional_prob"
.
Other types need to be implemented by the user.
Support for custom nodes:
The sim_from_dag
function supports custom node functions, as described in node_custom
. It is impossible for us to directly support these custom types in this function directly. However, the user can extend this function easily to accommodate any of his/her custom types. Similar to defining a custom node type, the user simply has to write a function that returns a correctly specified node.DAG
object, given the named arguments name
, parents
, type
, data
and return_model
. The first three arguments should simply be added directly to the output. The data
should be used inside your function to fit a model or obtain the required parameters in some other way. The return_model
argument should control whether the model should be added to the output (in a named argument called model
). The function name should be paste0("gen_node_", YOURTYPE)
. An examples is given below.
Interactions & cubic terms:
This function currently does not support the usage of interaction effects or non-linear terms (such as using A ~ B + I(B^2)
as a formula). Instead, it will be assumed that all values in parents
have a linear effect on the respective node. For example, using parents=c("A", "B")
for a node named "C"
will use the formula C ~ A + B
. If other behavior is desired, users need to integrate this into their own custom function as described above.
A list of length two containing the new fully specified DAG
object named dag
and a list of the fitted models (if return_models=TRUE
) in the object named models
.
Robin Denz
library(simDAG) set.seed(457456) # get some example data from a known DAG dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) data <- sim_from_dag(dag=dag, n_sim=1000) # suppose we only know the causal structure and the node type: dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex")) + node("age", type="rnorm") + node("sex", type="rbernoulli") + node("smoking", type="binomial", parents=c("sex", "age")) # get parameter estimates from data dag_full <- dag_from_data(dag=dag, data=data) # can now be used to simulate data data2 <- sim_from_dag(dag=dag_full$dag, n_sim=100)
library(simDAG) set.seed(457456) # get some example data from a known DAG dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) data <- sim_from_dag(dag=dag, n_sim=1000) # suppose we only know the causal structure and the node type: dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex")) + node("age", type="rnorm") + node("sex", type="rbernoulli") + node("smoking", type="binomial", parents=c("sex", "age")) # get parameter estimates from data dag_full <- dag_from_data(dag=dag, data=data) # can now be used to simulate data data2 <- sim_from_dag(dag=dag_full$dag, n_sim=100)
DAG
object
The sim_from_dag
function requires the user to specify the causal relationships inside a DAG
object containing node information. This function takes this object as input and outputs the underlying adjacency matrix. This can be useful to plot the theoretical DAG or to check if the nodes have been specified correctly.
dag2matrix(dag, include_root_nodes=TRUE, include_td_nodes=FALSE)
dag2matrix(dag, include_root_nodes=TRUE, include_td_nodes=FALSE)
dag |
A |
include_root_nodes |
Whether to include root nodes in the output matrix. Should usually be kept at |
include_td_nodes |
Whether to include time-dependent nodes added to the |
An adjacency matrix is simply a square matrix in which each node has one column and one row associated with it. For example, if the node A has a causal effect on node B, the matrix will contain 1
in the spot matrix["A", "B"]
.
If a time-varying node is also defined as a time-fixed node, the parents of both parts will be pooled when creating the output matrix.
Returns a numeric square matrix with one row and one column per used node in dag
.
Robin Denz
library(simDAG) # some example DAG dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) # get adjacency matrix dag2matrix(dag) # get adjacency matrix using only the child nodes dag2matrix(dag, include_root_nodes=FALSE) ## adding time-varying nodes dag <- dag + node_td("disease", type="time_to_event", parents=c("age", "smoking"), prob_fun=0.01) + node_td("death", type="time_to_event", parents=c("age", "sex", "smoking", "disease"), prob_fun=0.001, event_duration=Inf) # get adjacency matrix including all nodes dag2matrix(dag, include_td_nodes=TRUE) # get adjacency matrix including only time-constant nodes dag2matrix(dag, include_td_nodes=FALSE) # get adjacency matrix using only the child nodes dag2matrix(dag, include_root_nodes=FALSE)
library(simDAG) # some example DAG dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) # get adjacency matrix dag2matrix(dag) # get adjacency matrix using only the child nodes dag2matrix(dag, include_root_nodes=FALSE) ## adding time-varying nodes dag <- dag + node_td("disease", type="time_to_event", parents=c("age", "smoking"), prob_fun=0.01) + node_td("death", type="time_to_event", parents=c("age", "sex", "smoking", "disease"), prob_fun=0.001, event_duration=Inf) # get adjacency matrix including all nodes dag2matrix(dag, include_td_nodes=TRUE) # get adjacency matrix including only time-constant nodes dag2matrix(dag, include_td_nodes=FALSE) # get adjacency matrix using only the child nodes dag2matrix(dag, include_root_nodes=FALSE)
DAG
objects
This function can be used to set one or more nodes in a given DAG
object to a specific value, which corresponds to an intervention on a DAG as defined by the do-operator introduced by Judea Pearl.
do(dag, names, values)
do(dag, names, values)
dag |
A |
names |
A character string specifying names of nodes in the |
values |
A vector or list of any values. These nodes defined with the |
Internally this function simply removes the old node definition of all nodes in names
and replaces it with a new node definition that defines the node as a constant value, irrespective of the original definition. The same effect can be created by directly specifying the DAG
in this way from the start (see examples).
This function does not alter the original DAG
in place. Instead, it returns a modified version of the DAG
. In other words, using only do(dag, names="A", values=3)
will not change the dag
object.
Returns a DAG
object with updated node definitions.
Robin Denz
Judea Pearl (2009). Causality: Models, Reasoning and Inference. 2nd ed. Cambridge: Cambridge University Press
library(simDAG) # define some initial DAG dag <- empty_dag() + node("death", "binomial", c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) # return new DAG with do(smoking = TRUE) dag2 <- do(dag, names="smoking", values=TRUE) # which is equivalent to dag2 <- empty_dag() + node("death", "binomial", c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", type="rconstant", constant=TRUE) # use do() on multiple variables: do(smoking = TRUE, sex = FALSE) dag2 <- do(dag, names=c("smoking", "sex"), values=list(TRUE, FALSE))
library(simDAG) # define some initial DAG dag <- empty_dag() + node("death", "binomial", c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) # return new DAG with do(smoking = TRUE) dag2 <- do(dag, names="smoking", values=TRUE) # which is equivalent to dag2 <- empty_dag() + node("death", "binomial", c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", type="rconstant", constant=TRUE) # use do() on multiple variables: do(smoking = TRUE, sex = FALSE) dag2 <- do(dag, names=c("smoking", "sex"), values=list(TRUE, FALSE))
DAG
object
This function should be used in conjunction with multiple calls to node
or node_td
to create a DAG
object, which can then be used to simulate data using the sim_from_dag
and sim_discrete_time
functions.
empty_dag()
empty_dag()
Note that this function is only used to initialize an empty DAG
object. Actual information about the respective nodes have to be added using the node
function or the node_td
function. The documentation page of that function contains more information on how to correctly do this.
Returns an empty DAG
object.
Robin Denz
library(simDAG) # just an empty DAG empty_dag() # adding a node to it empty_dag() + node("age", type="rnorm", mean=20, sd=5)
library(simDAG) # just an empty DAG empty_dag() # adding a node to it empty_dag() + node("age", type="rnorm", mean=20, sd=5)
data.table
in the long-format to a data.table
in the start-stop format
This function transforms a data.table
in the long-format (one row per person per time point) to a data.table
in the start-stop format (one row per person-specific period in which no variables changed).
long2start_stop(data, id, time, varying, overlap=FALSE, check_inputs=TRUE)
long2start_stop(data, id, time, varying, overlap=FALSE, check_inputs=TRUE)
data |
A |
id |
A single character string specifying a unique person identifier included in in |
time |
A single character string specifying a time variable included in in |
varying |
A character vector specifying names of variables included in in |
overlap |
Specifies whether the intervals should overlap or not. If |
check_inputs |
Whether to check if the user input is correct or not. Can be turned off by setting it to |
This function relies on data.table
syntax to make the data transformation as RAM efficient and fast as possible.
Returns a data.table
containing the columns .id
(the unique person identifier), .time
(an integer variable encoding the time) and all other variables included in the input data
in the long format.
Robin Denz
library(simDAG) library(data.table) # generate example data in long format long <- data.table(.id=rep(seq_len(10), each=5), .time=rep(seq_len(5), 10), A=c(rep(FALSE, 43), TRUE, TRUE, rep(FALSE, 3), TRUE, TRUE), B=FALSE) setkey(long, .id, .time) # transform to start-stop format long2start_stop(data=long, id=".id", time=".time", varying=c("A", "B"))
library(simDAG) library(data.table) # generate example data in long format long <- data.table(.id=rep(seq_len(10), each=5), .time=rep(seq_len(5), 10), A=c(rep(FALSE, 43), TRUE, TRUE, rep(FALSE, 3), TRUE, TRUE), B=FALSE) setkey(long, .id, .time) # transform to start-stop format long2start_stop(data=long, id=".id", time=".time", varying=c("A", "B"))
DAG
object from a Adjacency Matrix and a List of Node Types
The sim_from_dag
function requires the user to specify the causal relationships inside a DAG
object containing node information. This function creates such an object using a adjacency matrix and a list of node types. The resulting DAG
will be only partially specified, which may be useful for the dag_from_data
function.
matrix2dag(mat, type)
matrix2dag(mat, type)
mat |
A p x p adjacency matrix where p is the number of variables. The matrix should be filled with zeros. Only places where the variable specified by the row has a direct causal effect on the variable specified by the column should be 1. Both the columns and the rows should be named with the corresponding variable names. |
type |
A named list with one entry for each variable in |
An adjacency matrix is simply a square matrix in which each node has one column and one row associated with it. For example, if the node A has a causal effect on node B, the matrix will contain 1
in the spot matrix["A", "B"]
. This function uses this kind of matrix and additional information about the node type to create a DAG
object. The resulting DAG
cannot be used in the sim_from_dag
function directly, because it will not contain the necessary parameters such as beta-coefficients or intercepts etc. It may, however, be passed directly to the dag_from_data
function. This is pretty much it's only valid use-case. If the goal is to to specify a full DAG
manually, the user should use the empty_dag
function in conjunction with node
calls instead, as described in the respective documentation pages and the vignettes.
The output will never contain time-dependent nodes. If this is necessary, the user needs to manually define the DAG.
Returns a partially specified DAG
object.
Robin Denz
empty_dag
, node
, node_td
, dag_from_data
library(simDAG) # simple example adjacency matrix mat <- matrix(c(0, 0, 1, 0, 0, 1, 0, 0, 0), ncol=3, byrow=TRUE) colnames(mat) <- c("age", "sex", "death") rownames(mat) <- c("age", "sex", "death") type <- list(age="rnorm", sex="rbernoulli", death="binomial") matrix2dag(mat=mat, type=type)
library(simDAG) # simple example adjacency matrix mat <- matrix(c(0, 0, 1, 0, 0, 1, 0, 0, 0), ncol=3, byrow=TRUE) colnames(mat) <- c("age", "sex", "death") rownames(mat) <- c("age", "sex", "death") type <- list(age="rnorm", sex="rbernoulli", death="binomial") matrix2dag(mat=mat, type=type)
These functions should be used in conjunction with the empty_dag
function to create DAG
objects, which can then be used to simulate data using the sim_from_dag
function or the sim_discrete_time
function.
node(name, type, parents=NULL, formula=NULL, ...) node_td(name, type, parents=NULL, formula=NULL, ...)
node(name, type, parents=NULL, formula=NULL, ...) node_td(name, type, parents=NULL, formula=NULL, ...)
name |
A character vector with at least one entry specifying the name of the node. If a character vector containing multiple different names is supplied, one separate node will be created for each name. These nodes are completely independent, but have the exact same node definition as supplied by the user. If only a single character string is provided, only one node is generated. |
type |
A single character string specifying the type of the node. Depending on whether the node is a root node, a child node or a time-dependent node different node types are allowed. See details. Alternatively, a suitable function may be passed directly to this argument. |
parents |
A character vector of names, specifying the parents of the node or |
formula |
An optional |
... |
Further named arguments needed to specify the node. Those can be parameters of distribution functions such as the |
To generate data using the sim_from_dag
function or the sim_discrete_time
function, it is required to create a DAG
object first. This object needs to contain information about the causal structure of the data (e.g. which variable causes which variable) and the specific structural equations for each variable (information about causal coefficients, type of distribution etc.). In this package, the node
and/or node_td
function is used in conjunction with the empty_dag
function to create this object.
This works by first initializing an empty DAG
using the empty_dag
function and then adding multiple calls to the node
and/or node_td
functions to it using a simple +
, where each call to node
and/or node_td
adds information about a single node that should be generated. Multiple examples are given below.
In each call to node
or node_td
the user needs to indicate what the node should be called (name
), which function should be used to generate the node (type
), whether the node has any parents and if so which (parents
) and any additional arguments needed to actually call the data-generating function of this node later passed to the three-dot syntax (...
).
node
vs. node_td
:
By calling node
you are indicating that this node is a time-fixed variable which should only be generated once. By using node_td
you are indicating that it is a time-dependent node, which will be updated at each step in time when using a discrete-time simulation.
node_td
should only be used if you are planning to perform a discrete-time simulation with the sim_discrete_time
function. DAG
objects including time-dependent nodes may not be used in the sim_from_dag
function.
Implemented Root Node Types:
Any function can be used to generate root nodes. The only requirement is that the function has at least one named argument called n
which controls the length of the resulting vector. For example, the user could specify a node of type "rnorm"
to create a normally distributed node with no parents. The argument n
will be set internally, but any additional arguments can be specified using the ...
syntax. In the type="rnorm"
example, the user could set the mean and standard deviation using node(name="example", type="rnorm", mean=10, sd=5)
.
For convenience, this package additionally includes three custom root-node functions:
"rbernoulli": Draws randomly from a bernoulli distribution.
"rcategorical": Draws randomly from any discrete probability density function.
"rconstant": Used to set a variable to a constant value.
Implemented Child Node Types:
Currently, the following node types are implemented directly for convenience:
"gaussian": A node based on linear regression.
"binomial": A node based on logistic regression.
"conditional_prob": A node based on conditional probabilities.
"conditional_distr": A node based on conditional draws from different distributions.
"multinomial": A node based on multinomial regression.
"poisson": A node based on poisson regression.
"negative_binomial": A node based on negative binomial regression.
"cox": A node based on cox-regression.
"identity": A node that is just some R expression of other nodes.
For custom child node types, see below.
Implemented Time-Dependent Node Types:
Currently, the following node types are implemented directly for convenience to use in node_td
calls:
"time_to_event": A node based on repeatedly checking whether an event occurs at each point in time.
"competing_events": A node based on repeatedly checking whether one of multiple mutually exclusive events occurs at each point in time.
However, the user may also use any of the child node types in a node_td
call directly. For custom time-dependent node types, see below.
Custom Node Types
It is very simple to write a new custom node_function
to be used instead, allowing the user to use any type
of data-generation mechanism for any type of node (root / child / time-dependent). All that is required of this function is, that it has the named arguments data
(the sample as generated so far) and, if it's a child node, parents
(a character vector specifying the parents) and outputs either a vector containing n_sim
entries, or a data.frame
with n_sim
rows and an arbitrary amount of columns. More information about this can be found on the node_custom
documentation page.
Using child nodes as parents for other nodes:
If the data generated by a child node is categorical (such as when using node_multinomial
) they can still be used as parents of other nodes for most standard node types without issues. All the user has to do is to use formula
argument to supply an enhanced formula, instead of defining the parents
and betas
argument directly. This works well for all node types that directly support formula
input. For other node types, users may need to write custom functions to make this work. See the associated vignette: vignette(topic="v_using_formulas", package="simDAG")
for more information on how to correctly use formulas.
Cyclic causal structures:
The name DAG (directed acyclic graph) implies that cycles are not allowed. This means that if you start from any node and only follow the arrows in the direction they are pointing, there should be no way to get back to your original node. This is necessary both theoretically and for practical reasons if we are dealing with static DAGs created using the node
function. If the user attempts to generate data from a static cyclic graph using the sim_from_dag
function, an error will be produced.
However, in the realm of discrete-time simulations, cyclic causal structures are perfectly reasonable. A variable at
may influence a variable
at
, which in turn may influence variable
at
again. Therefore, when using the
node_td
function to simulate time-dependent data using the sim_discrete_time
function, cyclic structures are allowed to be present and no error will be produced.
Returns a DAG.node
object which can be added to a DAG
object directly.
Contrary to the R standard, this function does NOT support partial matching of argument names. This means that supplying nam="age"
will not be recognized as name="age"
and instead will be added as additional node argument used in the respective data-generating function call when using sim_from_dag
.
Robin Denz
library(simDAG) # creating a DAG with a single root node dag <- empty_dag() + node("age", type="rnorm", mean=30, sd=4) # creating a DAG with multiple root nodes # (passing the functions directly to 'type' works too) dag <- empty_dag() + node("sex", type=rbernoulli, p=0.5) + node("income", type=rnorm, mean=2700, sd=500) # creating a DAG with multiple root nodes + multiple names in one node dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node(c("income_1", "income_2"), type="rnorm", mean=2700, sd=500) # also using child nodes dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node("income", type="rnorm", mean=2700, sd=500) + node("sickness", type="binomial", parents=c("sex", "income"), betas=c(1.2, -0.3), intercept=-15) + node("death", type="binomial", parents=c("sex", "income", "sickness"), betas=c(0.1, -0.4, 0.8), intercept=-20) # creating the same DAG as above, but using the enhanced formula interface dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node("income", type="rnorm", mean=2700, sd=500) + node("sickness", type="binomial", formula= ~ -15 + sexTRUE*1.2 + income*-0.3) + node("death", type="binomial", formula= ~ -20 + sexTRUE*0.1 + income*-0.4 + sickness*0.8) # using time-dependent nodes # NOTE: to simulate data from this DAG, the sim_discrete_time() function needs # to be used due to "sickness" being a time-dependent node dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node("income", type="rnorm", mean=2700, sd=500) + node_td("sickness", type="binomial", parents=c("sex", "income"), betas=c(0.1, -0.4), intercept=-50) # we could also use a DAG with only time-varying variables dag <- empty_dag() + node_td("vaccine", type="time_to_event", prob_fun=0.001, event_duration=21) + node_td("covid", type="time_to_event", prob_fun=0.01, event_duration=15, immunity_duration=100)
library(simDAG) # creating a DAG with a single root node dag <- empty_dag() + node("age", type="rnorm", mean=30, sd=4) # creating a DAG with multiple root nodes # (passing the functions directly to 'type' works too) dag <- empty_dag() + node("sex", type=rbernoulli, p=0.5) + node("income", type=rnorm, mean=2700, sd=500) # creating a DAG with multiple root nodes + multiple names in one node dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node(c("income_1", "income_2"), type="rnorm", mean=2700, sd=500) # also using child nodes dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node("income", type="rnorm", mean=2700, sd=500) + node("sickness", type="binomial", parents=c("sex", "income"), betas=c(1.2, -0.3), intercept=-15) + node("death", type="binomial", parents=c("sex", "income", "sickness"), betas=c(0.1, -0.4, 0.8), intercept=-20) # creating the same DAG as above, but using the enhanced formula interface dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node("income", type="rnorm", mean=2700, sd=500) + node("sickness", type="binomial", formula= ~ -15 + sexTRUE*1.2 + income*-0.3) + node("death", type="binomial", formula= ~ -20 + sexTRUE*0.1 + income*-0.4 + sickness*0.8) # using time-dependent nodes # NOTE: to simulate data from this DAG, the sim_discrete_time() function needs # to be used due to "sickness" being a time-dependent node dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node("income", type="rnorm", mean=2700, sd=500) + node_td("sickness", type="binomial", parents=c("sex", "income"), betas=c(0.1, -0.4), intercept=-50) # we could also use a DAG with only time-varying variables dag <- empty_dag() + node_td("vaccine", type="time_to_event", prob_fun=0.001, event_duration=21) + node_td("covid", type="time_to_event", prob_fun=0.01, event_duration=15, immunity_duration=100)
Data from the parents is used to generate the node using logistic regression by predicting the covariate specific probability of 1 and sampling from a Bernoulli distribution accordingly.
node_binomial(data, parents, formula=NULL, betas, intercept, return_prob=FALSE, output="logical", labels=NULL)
node_binomial(data, parents, formula=NULL, betas, intercept, return_prob=FALSE, output="logical", labels=NULL)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. If non-linear combinations or interaction effects should be included, the user may specify the |
formula |
An optional |
betas |
A numeric vector with length equal to |
intercept |
A single number specifying the intercept that should be used when generating the node. |
return_prob |
Either |
output |
A single character string, must be either |
labels |
A character vector of length 2 or |
Using the normal form a logistic regression model, the observation specific event probability is generated for every observation in the dataset. Using the rbernoulli
function, this probability is then used to take one bernoulli sample for each observation in the dataset. If only the probability should be returned return_prob
should be set to TRUE
.
Formal Description:
Formally, the data generation can be described as:
where denotes one Bernoulli trial with success probability
,
is the number of parents (
length(parents)
) and the function is defined as:
For example, given intercept=-15
, parents=c("A", "B")
and betas=c(0.2, 1.3)
the data generation process is defined as:
Output Format:
By default this function returns a logical vector containing only TRUE
and FALSE
entries, where TRUE
corresponds to an event and FALSE
to no event. This may be changed by using the output
and labels
arguments. The last three arguments of this function are ignored if return_prob
is set to TRUE
.
Returns a logical vector (or numeric vector if return_prob=TRUE
) of length nrow(data)
.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) set.seed(5425) # define needed DAG dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="binomial", parents=c("age", "sex"), betas=c(1.1, 0.4), intercept=-2) # define the same DAG, but using a pretty formula dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="binomial", formula= ~ -2 + age*1.1 + sexTRUE*0.4) # simulate data from it sim_dat <- sim_from_dag(dag=dag, n_sim=100) # returning only the estimated probability instead dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="binomial", parents=c("age", "sex"), betas=c(1.1, 0.4), intercept=-2, return_prob=TRUE) sim_dat <- sim_from_dag(dag=dag, n_sim=100)
library(simDAG) set.seed(5425) # define needed DAG dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="binomial", parents=c("age", "sex"), betas=c(1.1, 0.4), intercept=-2) # define the same DAG, but using a pretty formula dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="binomial", formula= ~ -2 + age*1.1 + sexTRUE*0.4) # simulate data from it sim_dat <- sim_from_dag(dag=dag, n_sim=100) # returning only the estimated probability instead dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="binomial", parents=c("age", "sex"), betas=c(1.1, 0.4), intercept=-2, return_prob=TRUE) sim_dat <- sim_from_dag(dag=dag, n_sim=100)
This node essentially models a categorical time-dependent variable for which the time and the type of the event will be important for later usage. It adds two columns to data
: name_event
(which type of event the person is currently experiencing) and name_time
(the time at which the current event started). Can only be used inside of the sim_discrete_time
function, not outside of it. Past events and their kind are stored in two lists. See details.
node_competing_events(data, parents, sim_time, name, prob_fun, ..., event_duration=c(1, 1), immunity_duration=max(event_duration), save_past_events=TRUE, check_inputs=TRUE, envir)
node_competing_events(data, parents, sim_time, name, prob_fun, ..., event_duration=c(1, 1), immunity_duration=max(event_duration), save_past_events=TRUE, check_inputs=TRUE, envir)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. |
sim_time |
The current time of the simulation. |
name |
The name of the node. This will be used as prefix before the |
prob_fun |
A function that returns a numeric matrix with |
... |
An arbitrary number of additional named arguments passed to |
event_duration |
A numeric vector containing one positive integer for each type of event of interest, specifying how long that event should last. For example, if we are interested in modelling the time to a cardiovascular event with death as competing event, this argument would need 2 entries. One would specify the duration of the cardiovascular event and the other would be |
immunity_duration |
A single number >= |
save_past_events |
When the event modeled using this node is recurrent ( |
check_inputs |
Whether to perform plausibility checks for the user input or not. Is set to |
envir |
Only used internally to efficiently store the past event times. Cannot be used by the user. |
When performing discrete-time simulation using the sim_discrete_time
function, the standard node functions implemented in this package are usually not sufficient because they don't capture the time-dependent nature of some very interesting variables. Often, the variable that should be modelled has some probability of occurring at each point in time. Once it does occur, it has some kind of influence on other variables for a period of time until it goes back to normal (or doesn't). This could be a car crash, a surgery, a vaccination etc. The node_time_to_event
node function can be used to model these kinds of nodes in a fairly straightforward fashion.
This function is an extended version of the node_time_to_event
function. Instead of simulating a binary event, it can generate multiple competing events, where the occurrence of one event at time is mutually exclusive with the occurrence of an other event at that time. In other words, multiple events are possible, but only one can occur at a time.
How it Works:
At , this node will be initialized for the first time. It adds two columns to the data:
name_event
(whether the person currently has an event) and name_time
(the time at which the current event started) where name
is the name of the node. Additionally, it adds a list with max_t
entries to the ce_past_events
list returned by the sim_discrete_time
function, which records which individuals experienced a new event at each point in time. The ce_past_causes
list additionally records which kind of event happened at that time.
In a nutshell, it simply models the occurrence of some event by calculating the probability of occurrence at and drawing a single multinomial trial from this probability. If the trial is a "success", the corresponding event column will be set to the drawn event type (described using integers, where 0 is no event and all other events are numbered consecutively), the time column will be set to the current simulation time
and the columns storing the past event times and types will receive an entry.
The event column will stay at its new integer value until the event is over. The duration for that is controlled by the event_duration
parameter. When modeling terminal events such as death, one can simply set this parameter to Inf
, making the event eternal. In many cases it will also be necessary to implement some kind of immunity after the event, which can be done using the immunity_duration
argument. This effectively sets the probability of another occurrence of the event to 0 in the next immunity_duration
time steps. During the immunity duration, the event may be > 0
(if the event is still ongoing) or 0
(if the event_duration
for that event type has already passed).
The probability of occurrence is calculated using the function provided by the user using the prob_fun
argument. This can be an arbitrary complex function. The only requirement is that it takes data
as a first argument. The columns defined by the parents
argument will be passed to this argument automatically. If it has an argument called sim_time
, the current time of the simulation will automatically be passed to it as well. Any further arguments can be passed using the prob_fun_args
argument. A simple example could be a multinomial logistic regression node, in which the probabilities are calculated as an additive linear combination of the columns defined by parents
. A more complex function could include simulation-time dependent effects, further effects dependent on past event times etc. Examples can be found below and in the vignettes.
What can be done with it:
This type of node naturally support the implementation of competing events, where some may be terminal or recurrent in nature and may be influenced by pretty much anything. By specifying the parents
and prob_fun
arguments correctly, it is possible to create an event type that is dependent on past events of itself or other time-to-event variables and other variables in general. The user can include any amount of these nodes in their simulation. It may also be used to simulate any kind of binary time-dependent variable that one would usually not associate with the name "event" as well. It is very flexible, but it does require the user to do some coding by themselves.
What can't be done with it:
This function may only be used to generate competing events, meaning that the occurrence of event 1 at makes it impossible for event 2 at
to occur. If the user wants to generate multiple events that are not mutually exclusive, he or she may add multiple
node_time_to_event
based nodes to the dag
argument of the sim_discrete_time
function.
In fact, a competing events node may be simulated using multiple calls to the node_time_to_event
based nodes as well, by defining the prob_fun
argument of these nodes in such a way that the occurrence of event A makes the occurrence of event B impossible. This might actually be easier to implement in some situations, because it doesn't require the user to manually define a probability function that outputs a matrix of subject-specific probabilities.
Returns a data.table
containing the updated columns of the node.
This function cannot be called outside of the sim_discrete_time
function. It only makes sense to use it as a type in a node_td
function call, as described in the documentation and vignettes.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) ## a competing_events node with only terminal events, all with a constant ## probability of occurrence, independent of any other variable prob_death_illness <- function(data) { # simply repeat the same probabilities for everyone n <- nrow(data) p_mat <- matrix(c(rep(0.9, n), rep(0.005, n), rep(0.005, n)), byrow = FALSE, ncol=3) return(p_mat) } dag <- empty_dag() + node_td("death_illness", type="competing_events", prob_fun=prob_death_illness, event_duration=c(Inf, Inf)) ## making one of the event-types terminal and the other recurrent dag <- empty_dag() + node_td("death_illness", type="competing_events", prob_fun=prob_death_illness, event_duration=c(15, Inf)) ## call the sim_discrete_time function to generate data from it sim <- sim_discrete_time(dag, n_sim=100, max_t=500) ## more examples on how to use the sim_discrete_time function can be found ## in the documentation page of the node_time_to_event function and ## in the package vignettes
library(simDAG) ## a competing_events node with only terminal events, all with a constant ## probability of occurrence, independent of any other variable prob_death_illness <- function(data) { # simply repeat the same probabilities for everyone n <- nrow(data) p_mat <- matrix(c(rep(0.9, n), rep(0.005, n), rep(0.005, n)), byrow = FALSE, ncol=3) return(p_mat) } dag <- empty_dag() + node_td("death_illness", type="competing_events", prob_fun=prob_death_illness, event_duration=c(Inf, Inf)) ## making one of the event-types terminal and the other recurrent dag <- empty_dag() + node_td("death_illness", type="competing_events", prob_fun=prob_death_illness, event_duration=c(15, Inf)) ## call the sim_discrete_time function to generate data from it sim <- sim_discrete_time(dag, n_sim=100, max_t=500) ## more examples on how to use the sim_discrete_time function can be found ## in the documentation page of the node_time_to_event function and ## in the package vignettes
This function can be used to generate any kind of dichotomous, categorical or numeric variables dependent on one or more categorical variables by randomly sampling from user-defined distributions in each strata defined by the nodes parents.
node_conditional_distr(data, parents, distr, default_distr=NULL, default_distr_args=list(), default_val=NA_real_, coerce2numeric=TRUE, check_inputs=TRUE)
node_conditional_distr(data, parents, distr, default_distr=NULL, default_distr_args=list(), default_val=NA_real_, coerce2numeric=TRUE, check_inputs=TRUE)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. |
distr |
A named list where each element corresponds to one stratum defined by parents. If only one name is given in |
default_distr |
A function that should be used to generate values for all strata that are not explicitly mentioned in the |
default_distr_args |
A named list of arguments which are passed to the function defined by the |
default_val |
A single value which is used as an output for strata that are not mentioned in |
coerce2numeric |
A single logical value specifying whether to try to coerce the resulting variable to numeric or not. |
check_inputs |
A single logical value specifying whether to perform input checks or not. May be set to |
Utilizing the user-defined distribution in each stratum of parents
(supplied using the distr
argument), this function simply calls the user-defined function with the arguments given by the user to generate a new variable. This allows the new variable to consist of a mix of different distributions, based on categorical parents
.
Formal Description:
Formally, the data generation process can be described as a series of conditional equations. For example, suppose that there is just one parent node sex
with the levels male
and female
with the goal of creating a continuous outcome that has a normal distribution of for males and
for females. The conditional equation is then:
If there are more than two variables, the conditional distribution would be stratified by the intersection of all subgroups defined by the variables.
Returns a numeric vector of length nrow(data)
.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) set.seed(42) #### with one parent node #### # define conditional distributions distr <- list(male=list("rnorm", mean=100, sd=5), female=list("rcategorical", probs=c(0.1, 0.2, 0.7))) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.4, 0.6)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_distr", parents="sex", distr=distr) # generate data data <- sim_from_dag(dag=dag, n_sim=1000) #### with two parent nodes #### # define conditional distributions with interaction between parents distr <- list(male.FALSE=list("rnorm", mean=100, sd=5), male.TRUE=list("rnorm", mean=100, sd=20), female.FALSE=list("rbernoulli", p=0.5), female.TRUE=list("rcategorical", probs=c(0.1, 0.2, 0.7))) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.4, 0.6)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_distr", parents=c("sex", "chemo"), distr=distr) # generate data data <- sim_from_dag(dag=dag, n_sim=1000)
library(simDAG) set.seed(42) #### with one parent node #### # define conditional distributions distr <- list(male=list("rnorm", mean=100, sd=5), female=list("rcategorical", probs=c(0.1, 0.2, 0.7))) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.4, 0.6)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_distr", parents="sex", distr=distr) # generate data data <- sim_from_dag(dag=dag, n_sim=1000) #### with two parent nodes #### # define conditional distributions with interaction between parents distr <- list(male.FALSE=list("rnorm", mean=100, sd=5), male.TRUE=list("rnorm", mean=100, sd=20), female.FALSE=list("rbernoulli", p=0.5), female.TRUE=list("rcategorical", probs=c(0.1, 0.2, 0.7))) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.4, 0.6)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_distr", parents=c("sex", "chemo"), distr=distr) # generate data data <- sim_from_dag(dag=dag, n_sim=1000)
This function can be used to generate dichotomous or categorical variables dependent on one or more categorical variables where the probabilities of occurrence in each strata defined by those variables is known.
node_conditional_prob(data, parents, probs, default_probs=NULL, default_val=NA, labels=NULL, coerce2factor=FALSE, check_inputs=TRUE)
node_conditional_prob(data, parents, probs, default_probs=NULL, default_val=NA, labels=NULL, coerce2factor=FALSE, check_inputs=TRUE)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. |
probs |
A named list where each element corresponds to one stratum defined by parents. If only one name is given in |
default_probs |
If not all possible strata of |
default_val |
Value of the produced variable in strata that are not included in the |
labels |
A vector of labels for the generated output. If |
coerce2factor |
A single logical value specifying whether to return the drawn events as a factor or not. |
check_inputs |
A single logical value specifying whether input checks should be performed or not. Set to |
Utilizing the user-defined discrete probability distribution in each stratum of parents
(supplied using the probs
argument), this function simply calls either the rbernoulli
or the rcategorical
function.
Formal Description:
Formally, the data generation process can be described as a series of conditional equations. For example, suppose that there is just one parent node sex
with the levels male
and female
with the goal of creating a binary outcome that has a probability of occurrence of 0.5 for males and 0.7 for females. The conditional equation is then:
where:
and is the Bernoulli distribution with success probability
. If the outcome has more than two categories, the Bernoulli distribution would be replaced by
with
being replaced by a matrix of class probabilities. If there are more than two variables, the conditional distribution would be stratified by the intersection of all subgroups defined by the variables.
Returns a numeric vector of length nrow(data)
.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) set.seed(42) #### two classes, one parent node #### # define conditional probs probs <- list(male=0.5, female=0.8) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.5, 0.5)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_prob", parents="sex", probs=probs) # generate data data <- sim_from_dag(dag=dag, n_sim=1000) #### three classes, one parent node #### # define conditional probs probs <- list(male=c(0.5, 0.2, 0.3), female=c(0.8, 0.1, 0.1)) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.5, 0.5)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_prob", parents="sex", probs=probs) # generate data data <- sim_from_dag(dag=dag, n_sim=1000) #### two classes, two parent nodes #### # define conditional probs probs <- list(male.FALSE=0.5, male.TRUE=0.8, female.FALSE=0.1, female.TRUE=0.3) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.5, 0.5)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_prob", parents=c("sex", "chemo"), probs=probs) # generate data data <- sim_from_dag(dag=dag, n_sim=1000) #### three classes, two parent nodes #### # define conditional probs probs <- list(male.FALSE=c(0.5, 0.1, 0.4), male.TRUE=c(0.8, 0.1, 0.1), female.FALSE=c(0.1, 0.7, 0.2), female.TRUE=c(0.3, 0.4, 0.3)) # define dag dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.5, 0.5)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_prob", parents=c("sex", "chemo"), probs=probs) # generate data data <- sim_from_dag(dag=dag, n_sim=1000)
library(simDAG) set.seed(42) #### two classes, one parent node #### # define conditional probs probs <- list(male=0.5, female=0.8) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.5, 0.5)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_prob", parents="sex", probs=probs) # generate data data <- sim_from_dag(dag=dag, n_sim=1000) #### three classes, one parent node #### # define conditional probs probs <- list(male=c(0.5, 0.2, 0.3), female=c(0.8, 0.1, 0.1)) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.5, 0.5)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_prob", parents="sex", probs=probs) # generate data data <- sim_from_dag(dag=dag, n_sim=1000) #### two classes, two parent nodes #### # define conditional probs probs <- list(male.FALSE=0.5, male.TRUE=0.8, female.FALSE=0.1, female.TRUE=0.3) # define DAG dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.5, 0.5)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_prob", parents=c("sex", "chemo"), probs=probs) # generate data data <- sim_from_dag(dag=dag, n_sim=1000) #### three classes, two parent nodes #### # define conditional probs probs <- list(male.FALSE=c(0.5, 0.1, 0.4), male.TRUE=c(0.8, 0.1, 0.1), female.FALSE=c(0.1, 0.7, 0.2), female.TRUE=c(0.3, 0.4, 0.3)) # define dag dag <- empty_dag() + node("sex", type="rcategorical", labels=c("male", "female"), output="factor", probs=c(0.5, 0.5)) + node("chemo", type="rbernoulli", p=0.5) + node("A", type="conditional_prob", parents=c("sex", "chemo"), probs=probs) # generate data data <- sim_from_dag(dag=dag, n_sim=1000)
Data from the parents is used to generate the node using cox-regression using the method of Bender et al. (2005).
node_cox(data, parents, formula=NULL, betas, surv_dist, lambda, gamma, cens_dist, cens_args, name)
node_cox(data, parents, formula=NULL, betas, surv_dist, lambda, gamma, cens_dist, cens_args, name)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. If non-linear combinations or interaction effects should be included, the user may specify the |
formula |
An optional |
betas |
A numeric vector with length equal to |
surv_dist |
A single character specifying the distribution that should be used when generating the survival times. Can be either |
lambda |
A single number used as parameter defined by |
gamma |
A single number used as parameter defined by |
cens_dist |
A single character naming the distribution function that should be used to generate the censoring times. For example, |
cens_args |
A list of named arguments which will be passed to the function specified by the |
name |
A single character string specifying the name of the node. |
The survival times are generated according to the cox proportional-hazards regression model as defined by the user. How exactly the data-generation works is described in detail in Bender et al. (2005). To also include censoring, this function allows the user to supply a function that generates random censoring times. If the censoring time is smaller than the generated survival time, the individual is considered censored.
Unlike the other node
type functions, this function adds two columns to the resulting dataset instead of one. The first column is called paste0(name, "_event")
and is a logical variable, where TRUE
indicates that the event has happened and FALSE
indicates right-censoring. The second column is named paste0(name, "_time")
and includes the survival or censoring time corresponding to the previously mentioned event indicator. This is the standard format for right-censored time-to-event data without time-varying covariates.
To simulate more complex time-to-event data, the user may need to use the sim_discrete_time
function instead.
Returns a data.table
of length nrow(data)
containing two columns. Both starting with the nodes name
and ending with _event
and _time
. The first is a logical vector, the second a numeric one.
Robin Denz
Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine. 2005; 24 (11): 1713-1723.
library(simDAG) set.seed(3454) # define DAG dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("death", type="cox", parents=c("sex", "age"), betas=c(1.1, 0.4), surv_dist="weibull", lambda=1.1, gamma=0.7, cens_dist="runif", cens_args=list(min=0, max=1)) sim_dat <- sim_from_dag(dag=dag, n_sim=1000)
library(simDAG) set.seed(3454) # define DAG dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("death", type="cox", parents=c("sex", "age"), betas=c(1.1, 0.4), surv_dist="weibull", lambda=1.1, gamma=0.7, cens_dist="runif", cens_args=list(min=0, max=1)) sim_dat <- sim_from_dag(dag=dag, n_sim=1000)
This page describes in detail how to define custom functions to allow the usage of root nodes, child nodes or time-dependent nodes that are not directly implemented in this package. By doing so, users may create data with any functional dependence they can think of.
The number of available types of nodes is limited, but this package allows the user to easily implement their own node types by writing a single custom function. Users may create their own root nodes, child nodes and time-dependent nodes. The requirements for each node type are listed below. Some simple examples for each node type are given further below.
If you think that your custom node type might be useful to others, please contact the maintainer of this package via the supplied e-mail address or github and we might add it to this package.
Root Nodes:
Any function that generates some vector of size with
n==nrow(data)
, or a data.frame
with as many rows as the current data can be used as a child node. The only requirement is:
1.) The function should have an argument called n
which controls how many samples to generate.
Some examples that are already implemented in R outside of this package are rnorm()
, rgamma()
and rbeta()
. The function may take any amount of further arguments, which will be passed through the three-dot syntax.
Child Nodes:
Again, almost any function may be used to generate a child node. Only four things are required for this to work properly:
1.) Its' name should start with node_
(if you want to use a string to define it in type
).
2.) It should contain an argument called data
(contains the already generated data).
3.) It should contain an argument called parents
(contains a vector of the child nodes parents).
4.) It should return either a vector of length n_sim
or a data.frame
with any number of columns and n_sim
rows.
The function may include any amount of additional arguments specified by the user.
Time-Dependent Nodes:
By time-dependent nodes we mean nodes that are created using the node_td
function. In general, this works in essentially the same way as for simple root nodes or child nodes. The requirements are:
1.) Its' name should start with node_
(if you want to use a string to define it in type
).
2.) It should contain an argument called data
(contains the already generated data).
3.) If it is a child node, it should contain an argument called parents
(contains a vector of the child nodes parents). This is not necessary for nodes that are independently generated.
4.) It should return either a vector of length n_sim
or a data.frame
with any number of columns and n_sim
rows.
Again, any number of additional arguments is allowed and will be passed through the three-dot syntax. Additionally, users may add an argument to this function called sim_time
. If included in the function definition, the current time of the simulation will be passed to the function on every call made to it. Similarly, the argument past_states
may be added. If done so, a list containing all previous states of the simulation (as saved using the save_states
argument of the sim_discrete_time
) function) will be passed to it internally, giving the user access to the data generated at previous points in time.
Should return either a vector of length nrow(data)
or a data.table
or data.frame
with nrow(data)
rows.
Robin Denz
library(simDAG) set.seed(3545) ################ Custom Root Nodes ################### # using external functions without defining them yourself can be done this way dag <- empty_dag() + node("A", type="rgamma", shape=0.1, rate=2) + node("B", type="rbeta", shape1=2, shape2=0.3) ## define your own root node instead # this function takes the sum of a normally distributed random number and an # uniformly distributed random number custom_root <- function(n, min=0, max=1, mean=0, sd=1) { out <- runif(n, min=min, max=max) + rnorm(n, mean=mean, sd=sd) return(out) } dag <- empty_dag() + node("A", type="custom_root", min=0, max=10, mean=5, sd=2) # equivalently, the function can be supplied directly dag <- empty_dag() + node("A", type=custom_root, min=0, max=10, mean=5, sd=2) ############### Custom Child Nodes ################### # create a custom node function, which is just a gaussian node that # includes (bad) truncation node_gaussian_trunc <- function(data, parents, betas, intercept, error, left, right) { out <- node_gaussian(data=data, parents=parents, betas=betas, intercept=intercept, error=error) out <- ifelse(out <= left, left, ifelse(out >= right, right, out)) return(out) } # another custom node function, which simply returns a sum of the parents parents_sum <- function(data, parents, betas=NULL) { out <- rowSums(data[, parents, with=FALSE]) return(out) } # an example of using these new node types in a simulation dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("custom_1", type="gaussian_trunc", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=-2, error=2, left=10, right=25) + node("custom_2", type=parents_sum, parents=c("age", "custom_1")) sim_dat <- sim_from_dag(dag=dag, n_sim=100) ########## Custom Time-Dependent Nodes ############### ## example for a custom time-dependent node with no parents # this node simply draws a new value from a normal distribution at # each point in time node_custom_root_td <- function(data, n, mean=0, sd=1) { return(rnorm(n=n, mean=mean, sd=sd)) } n_sim <- 100 dag <- empty_dag() + node_td(name="Something", type=node_custom_root_td, n=n_sim, mean=10, sd=5) sim <- sim_discrete_time(dag, n_sim=n_sim, max_t=10) ## example for a custom time-dependent child node # draw from a normal distribution with different specifications based on # whether a previously updated time-dependent node is currently TRUE node_custom_child <- function(data, parents) { out <- numeric(nrow(data)) out[data$other_event] <- rnorm(n=sum(data$other_event), mean=10, sd=3) out[!data$other_event] <- rnorm(n=sum(!data$other_event), mean=5, sd=10) return(out) } dag <- empty_dag() + node_td("other", type="time_to_event", prob_fun=0.1) + node_td("whatever", type="custom_child", parents="other_event") sim <- sim_discrete_time(dag, n_sim=50, max_t=10) ## using the sim_time argument in a custom node function # this function returns a continuous variable that is simply the # current simulation time squared node_square_sim_time <- function(data, sim_time, n_sim) { return(rep(sim_time^2, n=n_sim)) } # note that we should not actually define the sim_time argument in the # node_td() call below, because it will be passed internally, just like data dag <- empty_dag() + node_td("unclear", type=node_square_sim_time, n_sim=100) sim <- sim_discrete_time(dag, n_sim=100, max_t=10) ## a node using previous states of the simulation # this function simply returns the value used two simulation time steps ago + # a normally distributed random value node_prev_state <- function(data, past_states, sim_time) { if (sim_time < 3) { return(rnorm(n=nrow(data))) } else { return(past_states[[sim_time-2]]$A + rnorm(n=nrow(data))) } } # note that we again do not specify the sim_time and past_states argument # directly here, because they are set internally dag <- empty_dag() + node_td("A", type=node_prev_state, parents="A") # save_states="all" is needed, because we use them internally sim <- sim_discrete_time(dag, n_sim=100, max_t=10, save_states="all")
library(simDAG) set.seed(3545) ################ Custom Root Nodes ################### # using external functions without defining them yourself can be done this way dag <- empty_dag() + node("A", type="rgamma", shape=0.1, rate=2) + node("B", type="rbeta", shape1=2, shape2=0.3) ## define your own root node instead # this function takes the sum of a normally distributed random number and an # uniformly distributed random number custom_root <- function(n, min=0, max=1, mean=0, sd=1) { out <- runif(n, min=min, max=max) + rnorm(n, mean=mean, sd=sd) return(out) } dag <- empty_dag() + node("A", type="custom_root", min=0, max=10, mean=5, sd=2) # equivalently, the function can be supplied directly dag <- empty_dag() + node("A", type=custom_root, min=0, max=10, mean=5, sd=2) ############### Custom Child Nodes ################### # create a custom node function, which is just a gaussian node that # includes (bad) truncation node_gaussian_trunc <- function(data, parents, betas, intercept, error, left, right) { out <- node_gaussian(data=data, parents=parents, betas=betas, intercept=intercept, error=error) out <- ifelse(out <= left, left, ifelse(out >= right, right, out)) return(out) } # another custom node function, which simply returns a sum of the parents parents_sum <- function(data, parents, betas=NULL) { out <- rowSums(data[, parents, with=FALSE]) return(out) } # an example of using these new node types in a simulation dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("custom_1", type="gaussian_trunc", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=-2, error=2, left=10, right=25) + node("custom_2", type=parents_sum, parents=c("age", "custom_1")) sim_dat <- sim_from_dag(dag=dag, n_sim=100) ########## Custom Time-Dependent Nodes ############### ## example for a custom time-dependent node with no parents # this node simply draws a new value from a normal distribution at # each point in time node_custom_root_td <- function(data, n, mean=0, sd=1) { return(rnorm(n=n, mean=mean, sd=sd)) } n_sim <- 100 dag <- empty_dag() + node_td(name="Something", type=node_custom_root_td, n=n_sim, mean=10, sd=5) sim <- sim_discrete_time(dag, n_sim=n_sim, max_t=10) ## example for a custom time-dependent child node # draw from a normal distribution with different specifications based on # whether a previously updated time-dependent node is currently TRUE node_custom_child <- function(data, parents) { out <- numeric(nrow(data)) out[data$other_event] <- rnorm(n=sum(data$other_event), mean=10, sd=3) out[!data$other_event] <- rnorm(n=sum(!data$other_event), mean=5, sd=10) return(out) } dag <- empty_dag() + node_td("other", type="time_to_event", prob_fun=0.1) + node_td("whatever", type="custom_child", parents="other_event") sim <- sim_discrete_time(dag, n_sim=50, max_t=10) ## using the sim_time argument in a custom node function # this function returns a continuous variable that is simply the # current simulation time squared node_square_sim_time <- function(data, sim_time, n_sim) { return(rep(sim_time^2, n=n_sim)) } # note that we should not actually define the sim_time argument in the # node_td() call below, because it will be passed internally, just like data dag <- empty_dag() + node_td("unclear", type=node_square_sim_time, n_sim=100) sim <- sim_discrete_time(dag, n_sim=100, max_t=10) ## a node using previous states of the simulation # this function simply returns the value used two simulation time steps ago + # a normally distributed random value node_prev_state <- function(data, past_states, sim_time) { if (sim_time < 3) { return(rnorm(n=nrow(data))) } else { return(past_states[[sim_time-2]]$A + rnorm(n=nrow(data))) } } # note that we again do not specify the sim_time and past_states argument # directly here, because they are set internally dag <- empty_dag() + node_td("A", type=node_prev_state, parents="A") # save_states="all" is needed, because we use them internally sim <- sim_discrete_time(dag, n_sim=100, max_t=10, save_states="all")
Data from the parents is used to generate the node using linear regression by predicting the covariate specific mean and sampling from a normal distribution with that mean and a specified standard deviation.
node_gaussian(data, parents, formula=NULL, betas, intercept, error)
node_gaussian(data, parents, formula=NULL, betas, intercept, error)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. If non-linear combinations or interaction effects should be included, the user may specify the |
formula |
An optional |
betas |
A numeric vector with length equal to |
intercept |
A single number specifying the intercept that should be used when generating the node. |
error |
A single number specifying the sigma error that should be used when generating the node. |
Using the general linear regression equation, the observation-specific value that would be expected given the model is generated for every observation in the dataset generated thus far. We could stop here, but this would create a perfect fit for the node, which is unrealistic. Instead, we add an error term by taking one sample of a normal distribution for each observation with mean zero and standard deviation error
. This error term is then added to the predicted mean.
Formal Description:
Formally, the data generation can be described as:
where denotes the normal distribution with mean 0 and a standard deviation of
error
and is the number of parents (
length(parents)
).
For example, given intercept=-15
, parents=c("A", "B")
, betas=c(0.2, 1.3)
and error=2
the data generation process is defined as:
Returns a numeric vector of length nrow(data)
.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) set.seed(12455432) # define a DAG dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="gaussian", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=12, error=2) # define the same DAG, but with a pretty formula for the child node dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="gaussian", error=2, formula= ~ 12 + sexTRUE*1.1 + age*0.4) sim_dat <- sim_from_dag(dag=dag, n_sim=100)
library(simDAG) set.seed(12455432) # define a DAG dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="gaussian", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=12, error=2) # define the same DAG, but with a pretty formula for the child node dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="gaussian", error=2, formula= ~ 12 + sexTRUE*1.1 + age*0.4) sim_dat <- sim_from_dag(dag=dag, n_sim=100)
This node type may be used to generate a new node given a regular R expression that may include function calls or any other valid R syntax. This may be useful to combine components of a node which need to be simulated with separate node
calls, or just as a convenient shorthand for some variable transformations.
node_identity(data, parents, formula)
node_identity(data, parents, formula)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. When using this function as a node type in |
formula |
A |
Custom functions and objects can be used without issues in the formula
, but they need to be present in the global environment, otherwise the underlying eval()
function call will fail. Using this function outside of node
or node_td
is essentially equal to using with(data, eval(formula))
(without the ~
in the formula
).
Returns a numeric vector of length nrow(data)
.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) set.seed(12455432) # define a DAG dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="identity", formula= ~ age + sex + 2) sim_dat <- sim_from_dag(dag=dag, n_sim=100) head(sim_dat) # more complex alternative dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="identity", formula= ~ age / 2 + age^2 - ifelse(sex, 2, 3) + 2) sim_dat <- sim_from_dag(dag=dag, n_sim=100) head(sim_dat)
library(simDAG) set.seed(12455432) # define a DAG dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="identity", formula= ~ age + sex + 2) sim_dat <- sim_from_dag(dag=dag, n_sim=100) head(sim_dat) # more complex alternative dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="identity", formula= ~ age / 2 + age^2 - ifelse(sex, 2, 3) + 2) sim_dat <- sim_from_dag(dag=dag, n_sim=100) head(sim_dat)
Data from the parents is used to generate the node using multinomial regression by predicting the covariate specific probability of each class and sampling from a multinomial distribution accordingly.
node_multinomial(data, parents, betas, intercepts, labels=NULL, output="factor", return_prob=FALSE)
node_multinomial(data, parents, betas, intercepts, labels=NULL, output="factor", return_prob=FALSE)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. |
betas |
A numeric matrix with |
intercepts |
A numeric vector with one entry for each class that should be simulated, specifying the intercepts used to generate the node. |
labels |
An optional character vector giving the factor levels of the generated classes. If |
output |
A single character string specifying the output format. Must be one of |
return_prob |
Either |
This function works essentially like the node_binomial
function. First, the matrix of betas
coefficients is used in conjunction with the values defined in the parents
nodes and the intercepts
to calculate the expected subject-specific probabilities of occurrence for each possible category. This is done using the standard multinomial regression equations. Using those probabilities in conjunction with the rcategorical
function, a single one of the possible categories is drawn for each individual.
When actually fitting a multinomial regression model (with functions such as multinom
from the nnet package), the coefficients will usually not be equal to the ones supplied in betas
. The reason is that these functions usually standardize the coefficients to the coefficient of the reference category.
Returns a vector of length nrow(data)
. Depending on the used arguments, this vector may be of type character, numeric of factor. If return_prob
was used it instead returns a numeric matrix containing one column per possible event and nrow(data)
rows.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) set.seed(3345235) dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("UICC", type="multinomial", parents=c("sex", "age"), betas=matrix(c(0.2, 0.4, 0.1, 0.5, 1.1, 1.2), ncol=2), intercepts=1) sim_dat <- sim_from_dag(dag=dag, n_sim=100)
library(simDAG) set.seed(3345235) dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("UICC", type="multinomial", parents=c("sex", "age"), betas=matrix(c(0.2, 0.4, 0.1, 0.5, 1.1, 1.2), ncol=2), intercepts=1) sim_dat <- sim_from_dag(dag=dag, n_sim=100)
Data from the parents is used to generate the node using negative binomial regression by applying the betas to the design matrix and sampling from the rnbinom
function.
node_negative_binomial(data, parents, formula=NULL, betas, intercept, theta)
node_negative_binomial(data, parents, formula=NULL, betas, intercept, theta)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. If non-linear combinations or interaction effects should be included, the user may specify the |
formula |
An optional |
betas |
A numeric vector with length equal to |
intercept |
A single number specifying the intercept that should be used when generating the node. |
theta |
A single number specifying the theta parameter ( |
This function uses the linear predictor defined by the betas
and the input design matrix to sample from a subject-specific negative binomial distribution. It does to by calculating the linear predictor using the data
, betas
and intercept
, exponentiating it and passing it to the mu
argument of the rnbinom
function of the stats package.
Returns a numeric vector of length nrow(data)
.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) set.seed(124554) dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="negative_binomial", theta=0.05, formula= ~ -2 + sexTRUE*1.1 + age*0.4) sim_dat <- sim_from_dag(dag=dag, n_sim=100, sort_dag=FALSE)
library(simDAG) set.seed(124554) dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="negative_binomial", theta=0.05, formula= ~ -2 + sexTRUE*1.1 + age*0.4) sim_dat <- sim_from_dag(dag=dag, n_sim=100, sort_dag=FALSE)
Data from the parents is used to generate the node using poisson regression by predicting the covariate specific lambda and sampling from a poisson distribution accordingly.
node_poisson(data, parents, formula=NULL, betas, intercept)
node_poisson(data, parents, formula=NULL, betas, intercept)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. If non-linear combinations or interaction effects should be included, the user may specify the |
formula |
An optional |
betas |
A numeric vector with length equal to |
intercept |
A single number specifying the intercept that should be used when generating the node. |
Essentially, this function simply calculates the linear predictor defined by the betas
-coefficients, the intercept
and the values of the parents
. The exponential function is then applied to this predictor and the result is passed to the rpois
function. The result is a draw from a subject-specific poisson distribution, resembling the user-defined poisson regression model.
Formal Description:
Formally, the data generation can be described as:
where means that the variable is Poisson distributed with:
Here, is the count and
is eulers number. The parameter
is determined as:
where is the number of parents (
length(parents)
).
For example, given intercept=-15
, parents=c("A", "B")
, betas=c(0.2, 1.3)
the data generation process is defined as:
Returns a numeric vector of length nrow(data)
.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
library(simDAG) set.seed(345345) dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="poisson", formula= ~ -2 + sexTRUE*1.1 + age*0.4) sim_dat <- sim_from_dag(dag=dag, n_sim=100)
library(simDAG) set.seed(345345) dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="poisson", formula= ~ -2 + sexTRUE*1.1 + age*0.4) sim_dat <- sim_from_dag(dag=dag, n_sim=100)
This node essentially models a dichotomous time-dependent variable for which the time of the event will be important for later usage. It adds two columns to data
: name_event
(whether the person currently has an event) and name_time
(the time at which the current event started). Past events are stored in a list. Can only be used inside of the sim_discrete_time
function, not outside of it. See details.
node_time_to_event(data, parents, sim_time, past_states, name, prob_fun, ..., event_duration=1, immunity_duration=event_duration, time_since_last=FALSE, event_count=FALSE, save_past_events=TRUE, check_inputs=TRUE, envir)
node_time_to_event(data, parents, sim_time, past_states, name, prob_fun, ..., event_duration=1, immunity_duration=event_duration, time_since_last=FALSE, event_count=FALSE, save_past_events=TRUE, check_inputs=TRUE, envir)
data |
A |
parents |
A character vector specifying the names of the parents that this particular child node has. Those child nodes should be valid column names in |
sim_time |
The current time of the simulation. If |
past_states |
A list of |
name |
The name of the node. This will be used as prefix before the |
prob_fun |
A function that returns a numeric vector of size |
... |
An arbitrary amount of additional named arguments passed to |
event_duration |
A single number > 0 specifying how long the event should last. The point in time at which an event occurs also counts into this duration. For example, if an event occurs at |
immunity_duration |
A single number >= |
time_since_last |
Either |
event_count |
Either |
save_past_events |
When the event modeled using this node is recurrent ( |
check_inputs |
Whether to perform plausibility checks for the user input or not. Is set to |
envir |
Only used internally to efficiently store the past event times. Cannot be used by the user. |
When performing discrete-time simulation using the sim_discrete_time
function, the standard node functions implemented in this package are usually not sufficient because they don't capture the time-dependent nature of some very interesting variables. Often, the variable that should be modelled has some probability of occurring at each point in time. Once it does occur, it has some kind of influence on other variables for a period of time until it goes back to normal (or doesn't). This could be a car crash, a surgery, a vaccination etc. The time_to_event
node function can be used to model these kinds of nodes in a fairly straightforward fashion.
How it Works:
At , this node will be initialized for the first time. It adds two columns to the data:
name_event
(whether the person currently has an event) and name_time
(the time at which the current event started) where name
is the name of the node. Additionally, it adds a list with max_t
entries to the tte_past_events
list returned by the sim_discrete_time
function, which records which individuals experienced a new event at each point in time.
In a nutshell, it simply models the occurrence of some event by calculating the probability of occurrence at and drawing a single bernoulli trial from this probability. If the trial is a "success", the corresponding event column will be set to
TRUE
, the time column will be set to the current simulation time and the column storing the past event times will receive an entry.
The _event
column will stay TRUE
until the event is over. The duration for that is controlled by the event_duration
parameter. When modeling terminal events such as death, one can simply set this parameter to Inf
, making the event eternal. In many cases it will also be necessary to implement some kind of immunity after the event, which can be done using the immunity_duration
argument. This effectively sets the probability of another occurrence of the event to 0 in the next immunity_duration
time steps. During the immunity duration, the event may be TRUE
(if the event is still ongoing) or FALSE
(if the event_duration
has already passed). The _time
column is similarly set to the time of occurrence of the event and reset to NA
when the immunity_duration
is over.
The probability of occurrence is calculated using the function provided by the user using the prob_fun
argument. This can be an arbitrary complex function. The only requirement is that it takes data
as a first argument. The columns defined by the parents
argument will be passed to this argument automatically. If it has an argument called sim_time
, the current time of the simulation will automatically be passed to it as well. Any further arguments can be passed using the ...
syntax. A simple example could be a logistic regression node, in which the probability is calculated as an additive linear combination of the columns defined by parents
. A more complex function could include simulation-time dependent effects, further effects dependent on past event times etc. Examples can be found below and in the vignettes.
How it is Used:
This function should never be called directly by the user. Instead, the user should define a DAG object using the empty_dag
and node_td
functions and set the type
argument inside of a node_td
call to "time_to_event"
. This DAG can be passed to the sim_discrete_time
function to generate the desired data. Many examples and more explanations are given below and in the vignettes of this package.
What can be done with it:
This type of node naturally supports the implementation of terminal and recurrent events that may be influenced by pretty much anything. By specifying the parents
and prob_fun
arguments correctly, it is possible to create an event type that is dependent on past events of itself or other time-to-event variables and other variables in general. The user can include any amount of these nodes in their simulation. It may also be used to simulate any kind of binary time-dependent variable that one would usually not associate with the name "event" as well. It is very flexible, but it does require the user to do some coding by themselves (e.g. creating a suitable function for the prob_fun
argument).
What can't be done with it:
Currently this function only allows binary events. Categorical event types may be implemented using the node_competing_events
function, which works in a very similar fashion.
Returns a data.table
containing at least two columns with updated values of the node.
This function cannot be called outside of the sim_discrete_time
function. It only makes sense to use it as a type in a node_td
function call, as described in the documentation and vignettes.
Robin Denz, Katharina Meiszl
empty_dag
, node_td
, sim_discrete_time
library(simDAG) ## a simple terminal time-to-event node, with a constant probability of ## occurrence, independent of any other variable dag <- empty_dag() + node_td("death", type="time_to_event", prob_fun=0.0001, event_duration=Inf) ## a simple recurrent time-to-event node with a constant probability of ## occurrence, independent of any other variable dag <- empty_dag() + node_td("car_crash", type="time_to_event", prob_fun=0.001, event_duration=1) ## a time-to-event node with a time-dependent probability function that ## has an additional argument prob_car_crash <- function(data, sim_time, base_p) { return(base_p + sim_time * 0.0001) } dag <- empty_dag() + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, event_duration=1, base_p=0.0001) ## a time-to-event node with a probability function dependent on a ## time-fixed variable prob_car_crash <- function(data) { ifelse(data$sex==1, 0.001, 0.01) } dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, parents="sex") ## a little more complex car crash simulation, where the probability for ## a car crash is dependent on the sex, and the probability of death is ## highly increased for 3 days after a car crash happened prob_car_crash <- function(data) { ifelse(data$sex==1, 0.001, 0.01) } prob_death <- function(data) { ifelse(data$car_crash_event, 0.1, 0.0001) } dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, parents="sex") + node_td("death", type="time_to_event", prob_fun=prob_death, parents="car_crash_event") # use the sim_discrete_time function to simulate data from one of these DAGs: sim <- sim_discrete_time(dag, n_sim=20, max_t=500) ## more examples can be found in the vignettes of this package
library(simDAG) ## a simple terminal time-to-event node, with a constant probability of ## occurrence, independent of any other variable dag <- empty_dag() + node_td("death", type="time_to_event", prob_fun=0.0001, event_duration=Inf) ## a simple recurrent time-to-event node with a constant probability of ## occurrence, independent of any other variable dag <- empty_dag() + node_td("car_crash", type="time_to_event", prob_fun=0.001, event_duration=1) ## a time-to-event node with a time-dependent probability function that ## has an additional argument prob_car_crash <- function(data, sim_time, base_p) { return(base_p + sim_time * 0.0001) } dag <- empty_dag() + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, event_duration=1, base_p=0.0001) ## a time-to-event node with a probability function dependent on a ## time-fixed variable prob_car_crash <- function(data) { ifelse(data$sex==1, 0.001, 0.01) } dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, parents="sex") ## a little more complex car crash simulation, where the probability for ## a car crash is dependent on the sex, and the probability of death is ## highly increased for 3 days after a car crash happened prob_car_crash <- function(data) { ifelse(data$sex==1, 0.001, 0.01) } prob_death <- function(data) { ifelse(data$car_crash_event, 0.1, 0.0001) } dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, parents="sex") + node_td("death", type="time_to_event", prob_fun=prob_death, parents="car_crash_event") # use the sim_discrete_time function to simulate data from one of these DAGs: sim <- sim_discrete_time(dag, n_sim=20, max_t=500) ## more examples can be found in the vignettes of this package
DAG
object
Using the node information contained in the DAG
object this function plots the corresponding DAG in a quick and convenient way. Some options to customize the plot are available, but it may be advisable to use other packages made explicitly to visualize DAGs instead if those do not meet the users needs.
## S3 method for class 'DAG' plot(x, layout="nicely", node_size=0.2, node_names=NULL, node_color="black", node_fill="red", node_linewidth=0.5, node_linetype="solid", node_alpha=1, node_text_color="black", node_text_alpha=1, node_text_size=8, node_text_family="sans", node_text_fontface="bold", arrow_color="black", arrow_linetype="solid", arrow_linewidth=1, arrow_alpha=1, arrow_head_size=0.3, arrow_head_unit="cm", arrow_type="closed", arrow_node_dist=0.03, gg_theme=ggplot2::theme_void(), include_td_nodes=TRUE, mark_td_nodes=TRUE, ...)
## S3 method for class 'DAG' plot(x, layout="nicely", node_size=0.2, node_names=NULL, node_color="black", node_fill="red", node_linewidth=0.5, node_linetype="solid", node_alpha=1, node_text_color="black", node_text_alpha=1, node_text_size=8, node_text_family="sans", node_text_fontface="bold", arrow_color="black", arrow_linetype="solid", arrow_linewidth=1, arrow_alpha=1, arrow_head_size=0.3, arrow_head_unit="cm", arrow_type="closed", arrow_node_dist=0.03, gg_theme=ggplot2::theme_void(), include_td_nodes=TRUE, mark_td_nodes=TRUE, ...)
x |
A |
layout |
A single character string specifying the layout of the plot. This internally calls the |
node_size |
Either a single positive number or a numeric vector with one entry per node in the DAG, specifying the radius of the circles used to draw the nodes. If a single number is supplied, all nodes will be the same size (default). |
node_names |
A character vector with one entry for each node in the DAG specifying names that should be used for in the nodes or |
node_color |
A single character string specifying the color of the outline of the node circles. |
node_fill |
A single character string specifying the color with which the nodes are filled. Ignored if time-varying nodes are present and both |
node_linewidth |
A single number specifying the width of the outline of the node circles. |
node_linetype |
A single character string specifying the linetype of the outline of the node circles. |
node_alpha |
A single number between 0 and 1 specifying the transparency level of the nodes. |
node_text_color |
A single character string specifying the color of the text inside the node circles. |
node_text_alpha |
A single number between 0 and 1 specifying the transparency level of the text inside the node circles. |
node_text_size |
A single number specifying the size of the text inside of the node circles. |
node_text_family |
A single character string specifying the family of the text inside the node circles. |
node_text_fontface |
A single character string specifying the fontface of the text inside the node circles. |
arrow_color |
A single character string specifying the color of the arrows between the nodes. |
arrow_linetype |
A single character string specifying the linetype of the arrows. |
arrow_linewidth |
A single number specifying the width of the arrows. |
arrow_alpha |
A single number between 0 and 1 specifying the transparency level of the arrows. |
arrow_head_size |
A single number specifying the size of the arrow heads. The unit for this size parameter can be changed using the |
arrow_head_unit |
A single character string specifying the unit of the |
arrow_type |
Either |
arrow_node_dist |
A single positive number specifying the distance between nodes and the arrows. By setting this to values greater than 0 the arrows will not touch the node circles, leaving a bit of space instead. |
gg_theme |
A |
include_td_nodes |
Whether to include time-varying nodes added to the |
mark_td_nodes |
Whether to distinguish time-varying and time-fixed nodes by |
... |
Further arguments passed to the |
This function uses the igraph package to find a suitable layout for the plot and then uses the ggplot2 package in conjunction with the geom_circle
function of the ggforce package to plot the directed acyclic graph defined by a DAG
object. Since it returns a ggplot
object, the user may use any standard ggplot2
syntax to augment the plot or to save it using the ggsave
function.
Note that there are multiple great packages specifically designed to plot directed acyclic graphs, such as the igraph package. This function is not meant to be a competitor to those packages. The functionality offered here is rather limited. It is designed to produce decent plots for small DAGs which are easy to create. If this function is not enough to create an adequate plot, users can use the dag2matrix
function to obtain an adjacency matrix from the DAG
object and directly use this matrix and the igraph package (or similar ones) to get much better plots.
If the DAG
supplied to this function contains time-varying variables, the resulting plot may contain cycles or even bi-directional arrows, depending on the DAG
. The reason for that is, that the time-dimension is not shown in the plot. Note also that even though, technically, every time-varying node has itself as a parent, no arrows showing this dependence will be added to the plot.
Returns a standard ggplot2
object.
Robin Denz
library(simDAG) # 2 root nodes, 1 child node dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="binomial", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=-2) if (requireNamespace("ggplot2") & requireNamespace("ggforce")) { library(ggplot2) library(igraph) library(ggforce) plot(dag) # get plot using the igraph package instead g1 <- as.igraph(dag) plot(g1) # plot with a time-varying node dag <- dag + node_td("lottery", type="time_to_event", parents=c("age", "smoking")) plot(dag) }
library(simDAG) # 2 root nodes, 1 child node dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("smoking", type="binomial", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=-2) if (requireNamespace("ggplot2") & requireNamespace("ggforce")) { library(ggplot2) library(igraph) library(ggforce) plot(dag) # get plot using the igraph package instead g1 <- as.igraph(dag) plot(g1) # plot with a time-varying node dag <- dag + node_td("lottery", type="time_to_event", parents=c("age", "smoking")) plot(dag) }
Given a simDT
object obtained with the sim_discrete_time
function, plots a relatively simple flowchart of how the simulation was performed. Shows only some general information extracted from the dag
.
## S3 method for class 'simDT' plot(x, right_boxes=TRUE, box_hdist=1, box_vdist=1, box_l_width=0.35, box_l_height=0.23, box_r_width=box_l_width, box_r_height=box_l_height + 0.1, box_alpha=0.5, box_linetype="solid", box_linewidth=0.5, box_border_colors=NULL, box_fill_colors=NULL, box_text_color="black", box_text_alpha=1, box_text_angle=0, box_text_family="sans", box_text_fontface="plain", box_text_size=5, box_text_lineheight=1, box_1_text_left="Create initial data", box_1_text_right=NULL, box_2_text="Increase t by 1", box_l_node_labels=NULL, box_r_node_labels=NULL, box_last_text=paste0("t <= ", x$max_t, "?"), arrow_line_type="solid", arrow_line_width=0.5, arrow_line_color="black", arrow_line_alpha=1, arrow_head_angle=30, arrow_head_size=0.3, arrow_head_unit="cm", arrow_head_type="closed", arrow_left_pad=0.3, hline_width=0.5, hline_type="dashed", hline_color="black", hline_alpha=1, ...)
## S3 method for class 'simDT' plot(x, right_boxes=TRUE, box_hdist=1, box_vdist=1, box_l_width=0.35, box_l_height=0.23, box_r_width=box_l_width, box_r_height=box_l_height + 0.1, box_alpha=0.5, box_linetype="solid", box_linewidth=0.5, box_border_colors=NULL, box_fill_colors=NULL, box_text_color="black", box_text_alpha=1, box_text_angle=0, box_text_family="sans", box_text_fontface="plain", box_text_size=5, box_text_lineheight=1, box_1_text_left="Create initial data", box_1_text_right=NULL, box_2_text="Increase t by 1", box_l_node_labels=NULL, box_r_node_labels=NULL, box_last_text=paste0("t <= ", x$max_t, "?"), arrow_line_type="solid", arrow_line_width=0.5, arrow_line_color="black", arrow_line_alpha=1, arrow_head_angle=30, arrow_head_size=0.3, arrow_head_unit="cm", arrow_head_type="closed", arrow_left_pad=0.3, hline_width=0.5, hline_type="dashed", hline_color="black", hline_alpha=1, ...)
x |
A |
right_boxes |
Either |
box_hdist |
A single positive number specifying the horizontal distance of the left and the right boxes. |
box_vdist |
A single positive number specifying the vertical distance of the boxes. |
box_l_width |
A single positive number specifying the width of the boxes on the left side. |
box_l_height |
A single positive number specifying the height of the boxes on the left side. |
box_r_width |
A single positive number specifying the width of the boxes on the right side. Ignored if |
box_r_height |
A single positive number specifying the height of the boxes on the right side. Ignored if |
box_alpha |
A single number between 0 and 1 specifying the transparency level of the boxes. |
box_linetype |
A single positive number specifying the linetype of the box outlines. |
box_linewidth |
A single positive number specifying the width of the box outlines. |
box_border_colors |
A character vector of length two specifying the colors of the box outlines. Set to |
box_fill_colors |
A character vector of length two specifying the colors of the inside of the boxes. Set to |
box_text_color |
A single character string specifying the color of the text inside the boxes. |
box_text_alpha |
A single number between 0 and 1 specifying the transparency level of the text inside the boxes. |
box_text_angle |
A single positive number specifying the angle of the text inside the boxes. |
box_text_family |
A single character string specifying the family of the text inside the boxes. May be one of |
box_text_fontface |
A single character string specifying the fontface of the text inside the boxes. May be one of |
box_text_size |
A single number specifying the size of the text inside the boxes. |
box_text_lineheight |
A single number specifying the lineheight of the text inside the boxes. |
box_1_text_left |
A single character string specifying the text inside the first box from the top on the left side. |
box_1_text_right |
A single character string specifying the text inside the first box from the top on the right side or |
box_2_text |
A single character string specifying the text inside the second box from the top. |
box_l_node_labels |
A character vector with one entry for each time-varying node used in the simulation. These will be used to fill the boxes on the left side of the plot. Set to |
box_r_node_labels |
A character vector with one entry for each time-varying node used in the simulation. These will be used to fill the boxes on the right side of the plot. Set to |
box_last_text |
A single character string specifying the text inside the last box on the left side. By default it uses the |
arrow_line_type |
A single character string specifying the linetype of the arrows. |
arrow_line_width |
A single positive number specifying the line width of the arrows. |
arrow_line_color |
A single character string specifying the color of the arrows. |
arrow_line_alpha |
A single number between 0 and 1 specifying the transparency level of the arrows. |
arrow_head_angle |
A single number specifying the angle of the arrow heads. |
arrow_head_size |
A single number specifying the size of the arrow heads. The unit is defined by the |
arrow_head_unit |
A single character string specifying which unit to use when specifying the |
arrow_head_type |
A single character string specifying which type of arrow head to use. See |
arrow_left_pad |
A single positive number specifying the distance between the left boxes and the arrow line to the left of it. |
hline_width |
A single number specifying the width of the horizontal lines between the left and right boxes. |
hline_type |
A single character string specifying the linetype of the horizontal lines between the left and right boxes. |
hline_color |
A single character string specifying the color of the horizontal lines between the left and right boxes. |
hline_alpha |
A single number between 0 and 1 specifying the transparency level of the horizontal lines between the left and right boxes. |
... |
Currently not used. |
The resulting flowchart includes two columns of boxes next to each other. On the left side it always starts with the same two boxes: a box about the creation of the initial data and a box about increasing the simulation time by 1. Next, there will be a box for each time-varying variable in the simDT
object. Afterwards there is another box which asks if the maximum simulation time was reached. An arrow to the left that points back to the second box from the top indicates the iterative nature of the simulation process. The right column of boxes includes additional information about the boxes on the left.
The text in all boxes may be changed to custom text by using the box_1_text_left
, box_1_text_right
, box_2_text
, box_l_node_labels
, box_r_node_labels
and box_last_text
arguments. It is also possible to completely remove the left line of boxes and to change various sizes and appearances. Although these are quite some options, it is still a rather fixed function in nature. One cannot add further boxes or arrows in a simple way. The general structure may also not be changed. It may be useful to visualize a general idea of the simulation flow, but it may be too limited for usage in scientific publications if the simulation is more complex.
The graphic is created using the ggplot2
package and the output is a standard ggplot
object. This means that the user can change the result using standard ggplot
syntax (adding more stuff, changing geoms, ...).
Returns a standard ggplot
object.
Robin Denz
empty_dag
, node
, node_td
, sim_discrete_time
library(simDAG) set.seed(435345) ## exemplary car crash simulation, where the probability for ## a car crash is dependent on the sex, and the probability of death is ## highly increased for 3 days after a car crash happened prob_car_crash <- function(data) { ifelse(data$sex==1, 0.001, 0.01) } prob_death <- function(data) { ifelse(data$car_crash_event, 0.1, 0.0001) } dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, parents="sex") + node_td("death", type="time_to_event", prob_fun=prob_death, parents="car_crash_event") # generate some data sim <- sim_discrete_time(dag, n_sim=20, max_t=500, save_states="last") if (requireNamespace("ggplot2")) { # default plot plot(sim) # removing boxes on the right plot(sim, right_boxes=FALSE) }
library(simDAG) set.seed(435345) ## exemplary car crash simulation, where the probability for ## a car crash is dependent on the sex, and the probability of death is ## highly increased for 3 days after a car crash happened prob_car_crash <- function(data) { ifelse(data$sex==1, 0.001, 0.01) } prob_death <- function(data) { ifelse(data$car_crash_event, 0.1, 0.0001) } dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, parents="sex") + node_td("death", type="time_to_event", prob_fun=prob_death, parents="car_crash_event") # generate some data sim <- sim_discrete_time(dag, n_sim=20, max_t=500, save_states="last") if (requireNamespace("ggplot2")) { # default plot plot(sim) # removing boxes on the right plot(sim, right_boxes=FALSE) }
A very fast implementation for generating bernoulli trials. Can take a vector of probabilities which makes it very useful for simulation studies.
rbernoulli(n, p=0.5, output="logical")
rbernoulli(n, p=0.5, output="logical")
n |
How many draws to make. |
p |
A numeric vector of probabilities, used when drawing the trials. |
output |
A single character string, specifying which format the output should be returned as. Must be one of |
Internally, it uses only a single call to runif
, making it much faster and more memory efficient than using rbinomial
.
Note that this function accepts values of p
that are smaller then 0 and greater than 1. For p < 0
it will always return FALSE
, for p > 1
it will always return TRUE
.
Returns a vector of length n
in the desired output format.
Robin Denz
library(simDAG) # generating 5 bernoulli random draws from an unbiased coin rbernoulli(n=5, p=0.5) # using different probabilities for each coin throw rbernoulli(n=5, p=c(0.1, 0.2, 0.3, 0.2, 0.7)) # return as numeric instead rbernoulli(n=5, p=0.5, output="numeric")
library(simDAG) # generating 5 bernoulli random draws from an unbiased coin rbernoulli(n=5, p=0.5) # using different probabilities for each coin throw rbernoulli(n=5, p=c(0.1, 0.2, 0.3, 0.2, 0.7)) # return as numeric instead rbernoulli(n=5, p=0.5, output="numeric")
Allows different class probabilities for each person by supplying a matrix with one column for each class and one row for each person.
rcategorical(n, probs, labels=NULL, output="numeric")
rcategorical(n, probs, labels=NULL, output="numeric")
n |
How many draws to make. Passed to the |
probs |
Either a numeric vector of probabilities which sums to one or a matrix with one column for each desired class and |
labels |
A vector of labels to draw from. If |
output |
A single character string specifying the output format of the results. Must be either |
In case of a simple numeric vector (class probabilities should be the same for all draws), this function is only a wrapper for the sample
function, to make the code more consistent. It uses weighted sampling with replacement. Otherwise, custom code is used which is faster than the standard rmultinom
function.
Returns a numeric vector (or factor vector if coerce2factor=TRUE
) of length n
.
Robin Denz
library(simDAG) rcategorical(n=5, labels=c("A", "B", "C"), probs=c(0.1, 0.2, 0.7)) rcategorical(n=2, probs=matrix(c(0.1, 0.2, 0.5, 0.7, 0.4, 0.1), nrow=2))
library(simDAG) rcategorical(n=5, labels=c("A", "B", "C"), probs=c(0.1, 0.2, 0.7)) rcategorical(n=2, probs=matrix(c(0.1, 0.2, 0.5, 0.7, 0.4, 0.1), nrow=2))
This is a small convenience function that simply returns the value passed to it, in order to allow the use of a constant node as root node in the sim_from_dag
function.
rconstant(n, constant)
rconstant(n, constant)
n |
The number of times the constant should be repeated. |
constant |
A single value of any kind which is used as the only value of the resulting variable. |
Returns a vector of length n
with the same type as constant
.
Robin Denz
library(simDAG) rconstant(n=10, constant=7) rconstant(n=4, constant="Male")
library(simDAG) rconstant(n=10, constant=7) rconstant(n=4, constant="Male")
Similar to the sim_from_dag
function, this function can be used to generate data from a given DAG. In contrast to the sim_from_dag
function, this function utilizes a discrete-time simulation approach. This is not an "off-the-shelves" simulation function, it should rather be seen as a "framework-function", making it easier to create discrete-time-simulations. It usually requires custom functions written by the user. See details.
sim_discrete_time(dag, n_sim=NULL, t0_sort_dag=FALSE, t0_data=NULL, t0_transform_fun=NULL, t0_transform_args=list(), max_t, tx_nodes_order=NULL, tx_transform_fun=NULL, tx_transform_args=list(), save_states="last", save_states_at=NULL, verbose=FALSE, check_inputs=TRUE)
sim_discrete_time(dag, n_sim=NULL, t0_sort_dag=FALSE, t0_data=NULL, t0_transform_fun=NULL, t0_transform_args=list(), max_t, tx_nodes_order=NULL, tx_transform_fun=NULL, tx_transform_args=list(), save_states="last", save_states_at=NULL, verbose=FALSE, check_inputs=TRUE)
dag |
A |
n_sim |
A single number specifying how many observations should be generated. If a |
t0_sort_dag |
Corresponds to the |
t0_data |
An optional |
t0_transform_fun |
An optional function that takes the data created at |
t0_transform_args |
A named list of additional arguments passed to the |
max_t |
A single integer specifying the final point in time to which the simulation should be carried out. The simulation will start at |
tx_nodes_order |
A numeric vector specifying the order in which the time-dependent nodes added to the |
tx_transform_fun |
An optional function that takes the data created after every point in time |
tx_transform_args |
A named list of additional arguments passed to the |
save_states |
Specifies the amount of simulation states that should be saved in the output object. Has to be one of |
save_states_at |
The specific points in time at which the simulated |
verbose |
If |
check_inputs |
Whether to perform plausibility checks for the user input or not. Is set to |
Sometimes it is necessary to simulate complex data that cannot be described easily with a single DAG and node information. This may be the case if the desired data should contain multiple time-dependent variables or time-to-event variables in which the event has time-dependent effects on other events. An example for this is data on vaccinations and their effects on the occurrence of adverse events (see vignette). Discrete-Time Simulation can be an effective tool to generate these kinds of datasets.
What is Discrete-Time Simulation?:
In a discrete-time simulation, there are entities who have certain states associated with them that only change at discrete points in time. For example, the entities could be people and the state could be alive or dead. In this example we could generate 100 people with some covariates such as age, sex etc.. We then start by increasing the simulation time by one day. For each person we now check if the person has died using a bernoulli trial, where the probability of dying is generated at each point in time based on some of the covariates. The simulation time is then increased again and the process is repeated until we reach max_t
.
Due to the iterative process it is very easy to simulate arbitrarily complex data. The covariates may change over time in arbitrary ways, the event probability can have any functional relationship with the covariates and so on. If we want to model an event type that is not terminal, such as occurrence of cardiovascular disease, events can easily be simulated to be dependent on the timing and number of previous events. Since Discrete-Time Simulation is a special case of Discrete-Event Simulation, introductory textbooks on the latter can be of great help in getting a better understanding of the former.
How it Works:
Internally, this function works by first simulating data using the sim_from_dag
function. Alternatively, the user can supply a custom data.table
using the t0_data
argument. This data defines the state of all entities at . Afterwards, the simulation time is increased by one unit and the data is transformed in place by calling each node function defined by the time-dependent nodes which were added to the
dag
using the node_td
function (either in the order in which they were added to the dag
object or by the order defined by the tx_nodes_order
argument). Usually, each transformation changes the state of the entities in some way. For example if there is an age
variable, we would probably increase the age of each person by one time unit at every step. Once max_t
is reached, the resulting data.table
will be returned. It contains the state of all entities at the last step with additional information of when they experienced some events (if node_time_to_event
was used as time-dependent node). Multiple in-depth examples can be found in the vignettes of this package.
Specifying the dag
argument:
The dag
argument should be specified as described in the node
documentation page. More examples specific to discrete-time simulations can be found in the vignettes and the examples. The only difference to specifying a dag
for the sim_from_dag
function is that the dag
here should contain at least one time-dependent node added using the node_td
function. Usage of the formula
argument with non-linear or interaction terms is discouraged for performance reasons.
Speed Considerations:
All functions in this package rely on the data.table
backend in order to make them more memory efficient and faster. It is however important to note that the time to simulate a dataset increases non-linearly with an increasing max_t
value and additional time-dependent nodes. This is usually not a concern for smaller datasets, but if n_sim
is very large (say > 1 million) this function will get rather slow. Note also that using the formula
argument is a lot more computationally expensive than using the parents
, betas
approach to specify certain nodes.
What do I do with the output?:
This function outputs a simDT
object, not a data.table
. To obtain an actual dataset from the output of this function, users should use the sim2data
function to transform it into the desired format. Currently, the long-format, the wide-format and the start-stop format are supported. See sim2data
for more information.
A Few Words of Caution:
In most cases it will be necessary for the user to write their own functions in order to actually use the sim_discrete_time
function. Unlike the sim_from_dag
function, in which many popular node types can be implemented in a re-usable way, discrete-time simulation will always require some custom input by the user. This is the price users have to pay for the almost unlimited flexibility offered by this simulation methodology.
Returns a simDT
object, containing some general information about the simulated data as well as the final state of the simulated dataset (and more states, depending on the specification of the save_states
argument). In particular, it includes the following objects:
past_states
: A list containing the generated data at the specified points in time.
save_states
: The value of the save_states
argument supplied by the user.
data
: The data at time max_t
.
tte_past_events
: A list storing the times at which events happened in variables of type "time_to_event"
, if specified.
ce_past_events
: A list storing the times at which events happened in variables of type "competing_events"
, if specified.
ce_past_causes
: A list storing the types of events which happened at in variables of type "competing_events"
, if specified.
tx_nodes
: A list of all time-varying nodes, as specified in the supplied dag
object.
max_t
: The value of max_t
, as supplied by the user.
t0_var_names
: A character vector containing the names of all variable names that do not vary over time.
To obtain a single dataset from this function that can be processed further, please use the sim2data
function.
Robin Denz, Katharina Meiszl
Tang, Jiangjun, George Leu, und Hussein A. Abbass. 2020. Simulation and Computational Red Teaming for Problem Solving. Hoboken: IEEE Press.
Banks, Jerry, John S. Carson II, Barry L. Nelson, and David M. Nicol (2014). Discrete-Event System Simulation. Vol. 5. Edinburgh Gate: Pearson Education Limited.
empty_dag
, node
, node_td
, sim2data
, plot.simDT
library(simDAG) set.seed(454236) ## simulating death dependent on age, sex, bmi ## NOTE: this example is explained in detail in one of the vignettes # initializing a DAG with nodes for generating data at t0 dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="gaussian", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=12, error=2) # a function that increases age as time goes on node_advance_age <- function(data) { return(data$age + 1/365) } # a function to calculate the probability of death as a # linear combination of age, sex and bmi on the log scale prob_death <- function(data, beta_age, beta_sex, beta_bmi, intercept) { prob <- intercept + data$age*beta_age + data$sex*beta_sex + data$bmi*beta_bmi prob <- 1/(1 + exp(-prob)) return(prob) } # adding time-dependent nodes to the dag dag <- dag + node_td("age", type="advance_age", parents="age") + node_td("death", type="time_to_event", parents=c("age", "sex", "bmi"), prob_fun=prob_death, beta_age=0.1, beta_bmi=0.3, beta_sex=-0.2, intercept=-20, event_duration=Inf, save_past_events=FALSE) # run simulation for 100 people, 50 days long sim_dt <- sim_discrete_time(n_sim=100, dag=dag, max_t=50, verbose=FALSE)
library(simDAG) set.seed(454236) ## simulating death dependent on age, sex, bmi ## NOTE: this example is explained in detail in one of the vignettes # initializing a DAG with nodes for generating data at t0 dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="gaussian", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=12, error=2) # a function that increases age as time goes on node_advance_age <- function(data) { return(data$age + 1/365) } # a function to calculate the probability of death as a # linear combination of age, sex and bmi on the log scale prob_death <- function(data, beta_age, beta_sex, beta_bmi, intercept) { prob <- intercept + data$age*beta_age + data$sex*beta_sex + data$bmi*beta_bmi prob <- 1/(1 + exp(-prob)) return(prob) } # adding time-dependent nodes to the dag dag <- dag + node_td("age", type="advance_age", parents="age") + node_td("death", type="time_to_event", parents=c("age", "sex", "bmi"), prob_fun=prob_death, beta_age=0.1, beta_bmi=0.3, beta_sex=-0.2, intercept=-20, event_duration=Inf, save_past_events=FALSE) # run simulation for 100 people, 50 days long sim_dt <- sim_discrete_time(n_sim=100, dag=dag, max_t=50, verbose=FALSE)
This function can be used to generate data from a given DAG. It additionally requires information on node distributions, beta coefficients and, depending on the node type, more parameters such as intercepts.
sim_from_dag(dag, n_sim, sort_dag=FALSE, check_inputs=TRUE)
sim_from_dag(dag, n_sim, sort_dag=FALSE, check_inputs=TRUE)
dag |
A |
n_sim |
A single number specifying how many observations should be generated. |
sort_dag |
Whether to topologically sort the DAG before starting the simulation or not. If the nodes in |
check_inputs |
Whether to perform plausibility checks for the user input or not. Is set to |
How it Works:
First, n_sim
i.i.d. samples from the root nodes are drawn. Children of these nodes are then generated one by one according to specified relationships and causal coefficients. For example, lets suppose there are two root nodes, age
and sex
. Those are generated from a normal distribution and a bernoulli distribution respectively. Afterward, the child node height
is generated using both of these variables as parents according to a linear regression with defined coefficients, intercept and sigma (random error). This works because every DAG has at least one topological ordering, which is a linear ordering of vertices such that for every directed edge
, vertex
comes before
in the ordering. By using
sort_dag=TRUE
it is ensured that the nodes are processed in such an ordering.
This procedure is simple in theory, but can get very complex when manually coded. This function offers a simplified workflow by only requiring the user to define the dag
object with appropriate information (see documentation of node
function). A sample of size n_sim
is then generated from the DAG specified by those two arguments.
Specifying the DAG:
Concrete details on how to specify the needed dag
object are given in the documentation page of the node
function and in the vignettes of this package.
Can this function create longitudinal data?
Yes and no. It theoretically can, but only if the user-specified dag
directly specifies a node for each desired point in time. Using the sim_discrete_time
is better in some cases. A brief discussion about this topic can be found in the vignettes of this package.
If time-dependent nodes were added to the dag
using node_td
calls, this function may not be used. Only the sim_discrete_time
function will work in that case.
Returns a single data.table
including the simulated data with (at least) one column per node specified in dag
and n_sim
rows.
Robin Denz
empty_dag
, node
, plot.DAG
, sim_discrete_time
library(simDAG) set.seed(345345) dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="gaussian", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=12, error=2) sim_dat <- sim_from_dag(dag=dag, n_sim=1000) # More examples for each directly supported node type as well as for custom # nodes can be found in the documentation page of the respective node function
library(simDAG) set.seed(345345) dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("bmi", type="gaussian", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=12, error=2) sim_dat <- sim_from_dag(dag=dag, n_sim=1000) # More examples for each directly supported node type as well as for custom # nodes can be found in the documentation page of the respective node function
DAG
object
This function takes a single DAG
object and generates a list of multiple datasets, possible using parallel processing
sim_n_datasets(dag, n_sim, n_repeats, n_cores=parallel::detectCores(), data_format="raw", data_format_args=list(), seed=stats::runif(1), progressbar=TRUE, ...)
sim_n_datasets(dag, n_sim, n_repeats, n_cores=parallel::detectCores(), data_format="raw", data_format_args=list(), seed=stats::runif(1), progressbar=TRUE, ...)
dag |
A |
n_sim |
A single number specifying how many observations per dataset should be generated. |
n_repeats |
A single number specifying how many datasets should be generated. |
n_cores |
A single number specifying the amount of cores that should be used. If |
data_format |
An optional character string specifying the output format of the generated datasets. If |
data_format_args |
An optional list of named arguments passed to the function specified by |
seed |
A seed for the random number generator. By supplying a value to this argument, the results will be replicable, even if parallel processing is used to generate the datasets (using |
progressbar |
Either |
... |
Further arguments passed to the |
Generating a number of datasets from a single defined dag
object is usually the first step when conducting monte-carlo simulation studies. This is simply a convenience function which automates this process using parallel processing (if specified).
Note that for more complex monte-carlo simulations this function may not be ideal, because it does not allow the user to vary aspects of the data-generation mechanism inside the main for loop, because it can only handle a single dag
. For example, if the user wants to simulate n_repeats
datasets with confounding and n_repeats
datasets without confounding, he/she has to call this function twice. This is not optimal, because setting up the clusters for parallel processing takes some processing time. If many different dag
s should be used, it would make more sense to write a single function that generates the dag
itself for each of the desired settings. This can sadly not be automated by us though.
Returns a list of length n_repeats
containing datasets generated according to the supplied dag
object.
Robin Denz
empty_dag
, node
, node_td
, sim_from_dag
, sim_discrete_time
, sim2data
library(simDAG) # some example DAG dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) # generate 10 datasets without parallel processing out <- sim_n_datasets(dag, n_repeats=10, n_cores=1, n_sim=100) if (requireNamespace("doSNOW") & requireNamespace("doRNG") & requireNamespace("foreach")) { # generate 10 datasets with parallel processing out <- sim_n_datasets(dag, n_repeats=10, n_cores=2, n_sim=100) } # generate 10 datasets and transforming the output # (using the sim2data function internally) dag <- dag + node_td("CV", type="time_to_event", prob_fun=0.01) out <- sim_n_datasets(dag, n_repeats=10, n_cores=1, n_sim=100, max_t=20, data_format="start_stop")
library(simDAG) # some example DAG dag <- empty_dag() + node("death", type="binomial", parents=c("age", "sex"), betas=c(1, 2), intercept=-10) + node("age", type="rnorm", mean=10, sd=2) + node("sex", parents="", type="rbernoulli", p=0.5) + node("smoking", parents=c("sex", "age"), type="binomial", betas=c(0.6, 0.2), intercept=-2) # generate 10 datasets without parallel processing out <- sim_n_datasets(dag, n_repeats=10, n_cores=1, n_sim=100) if (requireNamespace("doSNOW") & requireNamespace("doRNG") & requireNamespace("foreach")) { # generate 10 datasets with parallel processing out <- sim_n_datasets(dag, n_repeats=10, n_cores=2, n_sim=100) } # generate 10 datasets and transforming the output # (using the sim2data function internally) dag <- dag + node_td("CV", type="time_to_event", prob_fun=0.01) out <- sim_n_datasets(dag, n_repeats=10, n_cores=1, n_sim=100, max_t=20, data_format="start_stop")
sim_discrete_time
output into the start-stop, long- or wide-format
This function transforms the output of the sim_discrete_time
function into a single data.table
structured in the start-stop format (also known as counting process format), the long format (one row per person per point in time) or the wide format (one row per person, one column per point in time for time-varying variables). See details.
sim2data(sim, to, use_saved_states=sim$save_states=="all", overlap=FALSE, target_event=NULL, keep_only_first=FALSE, remove_not_at_risk=FALSE, as_data_frame=FALSE, check_inputs=TRUE, ...) ## S3 method for class 'simDT' as.data.table(x, keep.rownames=FALSE, to, overlap=FALSE, target_event=NULL, keep_only_first=FALSE, remove_not_at_risk=FALSE, use_saved_states=x$save_states=="all", check_inputs=TRUE, ...) ## S3 method for class 'simDT' as.data.frame(x, row.names=NULL, optional=FALSE, to, overlap=FALSE, target_event=NULL, keep_only_first=FALSE, remove_not_at_risk=FALSE, use_saved_states=x$save_states=="all", check_inputs=TRUE, ...)
sim2data(sim, to, use_saved_states=sim$save_states=="all", overlap=FALSE, target_event=NULL, keep_only_first=FALSE, remove_not_at_risk=FALSE, as_data_frame=FALSE, check_inputs=TRUE, ...) ## S3 method for class 'simDT' as.data.table(x, keep.rownames=FALSE, to, overlap=FALSE, target_event=NULL, keep_only_first=FALSE, remove_not_at_risk=FALSE, use_saved_states=x$save_states=="all", check_inputs=TRUE, ...) ## S3 method for class 'simDT' as.data.frame(x, row.names=NULL, optional=FALSE, to, overlap=FALSE, target_event=NULL, keep_only_first=FALSE, remove_not_at_risk=FALSE, use_saved_states=x$save_states=="all", check_inputs=TRUE, ...)
sim , x
|
An object created with the |
to |
Specifies the format of the output data. Must be one of: |
use_saved_states |
Whether the saved simulation states (argument |
overlap |
Only used when |
target_event |
Only used when |
keep_only_first |
Only used when |
remove_not_at_risk |
Only used when |
as_data_frame |
Set this argument to |
check_inputs |
Whether to perform input checks ( |
keep.rownames |
Currently not used. |
row.names |
Passed to the |
optional |
Passed to the |
... |
Further arguments passed to |
The raw output of the sim_discrete_time
function may be difficult to use for further analysis. Using one of these functions, it is straightforward to transform that output into three different formats, which are described below. Note that some caution needs to be applied when using this function, which is also described below. Both as.data.table
and as.data.frame
internally call sim2data
and only exist for user convenience.
The start-stop format:
The start-stop format (to="start_stop"
), also known as counting process or period format corresponds to a data.table
containing multiple rows per person, where each row corresponds to a period of time in which no variables changed. These intervals are defined by the start
and stop
columns. The start
column gives the time at which the period started, the stop
column denotes the time when the period ended. By default these intervals are coded to be non-overlapping, meaning that the edges of the periods are included in the period itself. For example, if the respective period is exactly 1 point in time long, start
will be equal to stop
. If non-overlapping periods are desired, the user can specify overlap=TRUE
instead.
By default, all time-to-event nodes are treated equally. This is not optimal when the goal is to fit survival regression models. In this case, we usually want the target event to be treated in a special way (see for example Chiou et al. 2023). In general, instead of creating new intervals for it we want existing intervals to end at event times with the corresponding event indicator. This can be achieved by naming the target outcome in the target_event
variable. The previously specified duration of this target event is then ignored. To additionally remove all time periods in which individuals are not at-risk due to the event still going on or them being immune to it (as specified using the event_duration
and immunity_duration
parameters of node_time_to_event
), users may set remove_not_at_risk=TRUE
. If only the first occurrence of the event is of interest, users may also set keep_only_first=TRUE
to keep only information up until the first event per person.
The long format:
The long format (to="long"
) corresponds to a data.table
in which there is one row per person per point in time. The unique person identifier is stored in the .id
column and the unique points in time are given in the .time
column.
The wide format:
The wide format (to="wide"
) corresponds to a data.table
with exactly one row per person and multiple columns per points in time for each time-varying variable. All time-varying variables are coded as their original variable name with an underscore and the time-point appended to the end. For example, the variable sickness
at time-point 3 is named "sickness_3"
.
Output with use_saved_states=TRUE
:
If use_saved_states=TRUE
, this function will use only the data that is stored in the past_states
list of the sim
object to construct the resulting data.table
. This results in the following behavior, depending on which save_states
option was used in the original sim_discrete_time
function call:
save_states="all"
: A complete data.table
in the desired format with information for all observations at all points in time for all variables will be created. This is the safest option, but also uses the most RAM and computational time.
save_states="at_t"
: A data.table
in the desired format with correct information for all observations at the user specified times (save_states_at
argument) for all variables will be created. The state of the simulation at all other times will be ignored, because it wasn't stored. This may be useful in some scenarios, but is generally discouraged unless you have good reasons to use it. A warning message about this is printed if check_inputs=TRUE
.
save_states="last"
: Since only the last state of the simulation was saved, an error message is returned. No data.table
is produced.
Output with use_saved_states=FALSE
:
If use_saved_states=FALSE
, this function will use only the data that is stored in the final state of the simulation (data
object in sim
) and information about node_time_to_event
objects. If all tx_nodes
are time_to_event
nodes or if all the user cares about are the time_to_event
nodes and time-fixed variables, this is the best option.
A data.table
in the desired format with correct information about all observations
at all times
is produced, but only with correct entries for some time-varying variables, namely time_to_event
nodes. Note that this information will also only be correct if the user used save_past_events=TRUE
in all time_to_event
nodes. Support for competing_events
nodes will be implemented in the future as well.
The other time-varying variables specified in the tx_nodes
argument will still appear in the output, but it will only be the value that was observed at the last state of the simulation.
Optional columns created using a time_to_event
node:
When using a time-dependent node of type "time_to_event"
with event_count=TRUE
or time_since_last=TRUE
, the columns created using either argument are not included in the output if to="start_stop"
, but will be included if to
is set to either "long"
or "wide"
. The reason for this behavior is that including these columns would lead to nonsense intervals in the start-stop format, but makes sense in the other formats.
What about tx_nodes
that are not time_to_event
nodes?:
If you want the correct output for all tx_nodes
and one or more of those are not time_to_event
nodes, you will have to use save_states="all"
in the original sim_discrete_time
call. We plan to add support for competing_events
with other save_states
arguments in the near future. Support for arbitrary tx_nodes
will probably take longer.
Returns a single data.table
(or data.frame
) containing all simulated variables in the desired format.
Using the node names "start"
, "stop"
, ".id"
, ".time"
or names that are automatically generated by time-dependent nodes of type "time_to_event"
may break this function.
Robin Denz
Sy Han Chiou, Gongjun Xu, Jun Yan, and Chiung-Yu Huang (2023). "Regression Modeling for Recurrent Events Possibly with an Informative Terminal Event Using R Package reReg". In: Journal of Statistical Software. 105.5, pp. 1-34.
library(simDAG) set.seed(435345) ## exemplary car crash simulation, where the probability for ## a car crash is dependent on the sex, and the probability of death is ## highly increased for 3 days after a car crash happened prob_car_crash <- function(data) { ifelse(data$sex==1, 0.001, 0.01) } prob_death <- function(data) { ifelse(data$car_crash_event, 0.1, 0.001) } dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, parents="sex", event_duration=3) + node_td("death", type="time_to_event", prob_fun=prob_death, parents="car_crash_event", event_duration=Inf) # generate some data, only saving the last state # not a problem here, because the only time-varying nodes are # time-to-event nodes where the event times are saved sim <- sim_discrete_time(dag, n_sim=20, max_t=500, save_states="last") # transform to standard start-stop format d_start_stop <- sim2data(sim, to="start_stop") head(d_start_stop) # transform to "death" centric start-stop format # and keep only information until death, cause it's a terminal event # (this could be used in a Cox model) d_start_stop <- sim2data(sim, to="start_stop", target_event="death", keep_only_first=TRUE, overlap=TRUE) head(d_start_stop) # transform to long-format d_long <- sim2data(sim, to="long") head(d_long) # transform to wide-format d_wide <- sim2data(sim, to="wide") #head(d_wide)
library(simDAG) set.seed(435345) ## exemplary car crash simulation, where the probability for ## a car crash is dependent on the sex, and the probability of death is ## highly increased for 3 days after a car crash happened prob_car_crash <- function(data) { ifelse(data$sex==1, 0.001, 0.01) } prob_death <- function(data) { ifelse(data$car_crash_event, 0.1, 0.001) } dag <- empty_dag() + node("sex", type="rbernoulli", p=0.5) + node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash, parents="sex", event_duration=3) + node_td("death", type="time_to_event", prob_fun=prob_death, parents="car_crash_event", event_duration=Inf) # generate some data, only saving the last state # not a problem here, because the only time-varying nodes are # time-to-event nodes where the event times are saved sim <- sim_discrete_time(dag, n_sim=20, max_t=500, save_states="last") # transform to standard start-stop format d_start_stop <- sim2data(sim, to="start_stop") head(d_start_stop) # transform to "death" centric start-stop format # and keep only information until death, cause it's a terminal event # (this could be used in a Cox model) d_start_stop <- sim2data(sim, to="start_stop", target_event="death", keep_only_first=TRUE, overlap=TRUE) head(d_start_stop) # transform to long-format d_long <- sim2data(sim, to="long") head(d_long) # transform to wide-format d_wide <- sim2data(sim, to="wide") #head(d_wide)