Synthesising Data from Marginal Distributions

Data is synthesised by sampling from a multivariate cumulative distribution (a copula) using the simstudy package.
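The general idea of copula sampling can be sketched in base R. This is an illustration only, not the code used by RESIDE or simstudy: it assumes a Gaussian copula, and the variable names, marginal parameters and target correlation are hypothetical.

```r
# Illustrative Gaussian copula sketch in base R (not RESIDE or simstudy code).
set.seed(42)
n <- 1000

# Assumed target correlation between the two latent normal variables.
rho <- matrix(c(1.0, 0.6,
                0.6, 1.0), nrow = 2)

# 1. Draw independent standard normals and correlate them
#    using the Cholesky factor of the correlation matrix.
z <- matrix(rnorm(n * 2), ncol = 2) %*% chol(rho)

# 2. Map each column onto the unit interval with the normal CDF.
u <- pnorm(z)

# 3. Push the uniforms through each marginal's quantile function.
age <- qnorm(u[, 1], mean = 50, sd = 10)  # hypothetical continuous marginal
sex <- ifelse(u[, 2] < 0.5, "F", "M")     # hypothetical binary marginal

sketch_data <- data.frame(age = age, sex = sex)
```

Each column keeps its own marginal distribution, while the dependence between columns comes from the correlated uniforms in step 2.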

Without Correlations

Data can be synthesised from marginal distributions using the synthesise_data() function:

library(RESIDE)
marginals <- import_marginal_distributions()
simulated_data <- synthesise_data(marginals)

With Correlations

User-specified correlations can be added to the synthesised data by supplying a correlation matrix. An empty correlation matrix can be generated using the export_empty_cor_matrix() function, supplying the marginals imported using import_marginal_distributions() and a folder path, respectively:

library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals, folder_path = tempdir())
  • By default the file will be named correlation_matrix.csv; this can be changed with the file_name parameter.

The exported CSV file is a symmetric table whose row names and column names are the same set of variables.

Correlations should then be added to the CSV file, without modifying the column or row names. The values should be rank-order correlations. Categorical variables are represented as dummy variables, named using the format variable name, underscore, category name, e.g. SEX_F. Note that the correlation matrix must be symmetrical and positive semi-definite.
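As a rough sketch of what filling in the matrix involves, the base R snippet below builds a small matrix by hand. The variables (AGE, SEX), the example values and the helper set_correlation() are hypothetical, and it assumes a diagonal of ones; the dummy variables present in a real exported file depend on the imported marginals.

```r
# Hypothetical reference values used only to illustrate a rank-order correlation.
age <- c(23, 35, 47, 51, 62, 68)
sex <- c("F", "M", "F", "F", "M", "M")

# Dummy variable names follow the pattern <variable>_<category>, e.g. SEX_F.
dummy_names <- paste("SEX", c("F", "M"), sep = "_")
variable_names <- c("AGE", dummy_names)

# Start from an identity matrix (assumed: each variable correlates 1 with itself).
cor_matrix <- diag(length(variable_names))
dimnames(cor_matrix) <- list(variable_names, variable_names)

# Spearman (rank-order) correlation between AGE and the SEX_F indicator.
sex_f <- as.integer(sex == "F")
rho_age_sex_f <- cor(age, sex_f, method = "spearman")

# Hypothetical helper: write a value to both triangles so the matrix stays symmetric.
set_correlation <- function(m, a, b, value) {
  m[a, b] <- value
  m[b, a] <- value
  m
}
cor_matrix <- set_correlation(cor_matrix, "AGE", "SEX_F", rho_age_sex_f)
```

Writing each value to both the upper and lower triangle is what keeps the table symmetrical when editing the CSV file by hand.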

Once the correlations have been added to the CSV file, they can be imported using the import_cor_matrix() function:

library(RESIDE)
correlation_matrix <- import_cor_matrix()

By default, the correlation matrix is imported from the current working directory using the default exported filename (correlation_matrix.csv). This can be changed with the file_path parameter of the import_cor_matrix() function, which accepts a relative or absolute file path.

The import_cor_matrix() function will produce an error if the matrix is not symmetrical, if it is not positive semi-definite, or if the file does not exist.
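To check a matrix before importing it, the two conditions can be tested in base R. The function below is a hypothetical sketch, not the check that import_cor_matrix() actually performs, and the tolerance is an assumed value.

```r
# Hypothetical pre-check: TRUE if m is symmetric and positive semi-definite.
is_valid_cor_matrix <- function(m, tol = 1e-8) {
  m <- unname(as.matrix(m))
  if (nrow(m) != ncol(m) || !isSymmetric(m, tol = tol)) {
    return(FALSE)
  }
  # A symmetric matrix is positive semi-definite when no eigenvalue is negative
  # (allowing a small tolerance for floating point error).
  eigenvalues <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(eigenvalues >= -tol)
}

valid <- matrix(c(1, 0.5,
                  0.5, 1), nrow = 2)
asymmetric <- matrix(c(1, 0.5,
                       0.3, 1), nrow = 2)
# Symmetric, but the three pairwise correlations are mutually inconsistent.
not_psd <- matrix(c(1, 0.9, -0.9,
                    0.9, 1, 0.9,
                    -0.9, 0.9, 1), nrow = 3)

is_valid_cor_matrix(valid)
is_valid_cor_matrix(asymmetric)
is_valid_cor_matrix(not_psd)
```

The third matrix shows why symmetry alone is not enough: each pair of values is a legitimate correlation, yet no dataset could satisfy all three at once.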

Once a correlation matrix has been imported, data can be synthesised with the user-specified correlations by passing the matrix to the synthesise_data() function:

library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals)
# Add the correlations to the exported CSV file before importing it
correlation_matrix <- import_cor_matrix()
simulated_data <- synthesise_data(
  marginals,
  correlation_matrix
)

NB: It is not possible to entirely maintain all the marginal distributions when specifying correlations. This is a known limitation and is not likely to change.