Title: | Rapid Easy Synthesis to Inform Data Extraction |
---|---|
Description: | Developed to assist researchers with planning analysis prior to obtaining data from Trusted Research Environments (TREs), also known as safe havens. Provides functionality to export and import marginal distributions, and to synthesise data from these marginal distributions both with and without correlations, using a copula (a multivariate cumulative distribution). Additionally, the International Stroke Trial (IST) is included as an example dataset under the ODC-By licence: Sandercock et al. (2011) <doi:10.7488/ds/104>, Sandercock et al. (2011) <doi:10.1186/1745-6215-12-101>. |
Authors: | Ryan Field [aut, cre] , David McAllister [aut] , Claudia Geue [ctb] |
Maintainer: | Ryan Field <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.3.3 |
Built: | 2024-11-20 06:07:34 UTC |
Source: | https://github.com/hehta/reside |
The RESIDE Package
This work was supported by the UKRI Strength in Places Fund (SIPF) Competition, project number 107140. The project title is SIPF The Living Laboratory driving economic growth in Glasgow through real world implementation of precision medicine.
Maintainer: Ryan Field [email protected] (ORCID)
Authors:
David McAllister [email protected] (ORCID)
Other contributors:
Claudia Geue [email protected] (ORCID) [contributor]
A function to export a correlation matrix with the required variables as a csv file.
export_empty_cor_matrix( marginals, folder_path, file_name = "correlation_matrix.csv", create_folder = TRUE )
marginals | The marginal distributions |
folder_path | Folder to export to. |
file_name | (optional) file name, Default: 'correlation_matrix.csv' |
create_folder | Whether the folder should be created, Default: TRUE |
This function will export an empty correlation matrix as a csv file. It will contain all the necessary variables, including dummy variables for factors. Dummy variables for factors may contain a missing category to represent missing data. Correlations should be added to the empty CSV, which is then imported using the import_cor_matrix function.
Correlations should be supplied using rank order correlations. The correlation matrix should be symmetric and positive semi-definite.
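The symmetry and positive semi-definiteness requirements can be checked in base R before a completed matrix is imported. The following is a minimal sketch and not part of the package: the helper name `is_valid_cor_matrix` and its tolerance are assumptions introduced for illustration.

```r
# Hypothetical helper (not part of RESIDE): checks that a matrix is a
# plausible correlation matrix before it is used.
is_valid_cor_matrix <- function(m, tol = 1e-8) {
  if (!is.matrix(m) || !is.numeric(m) || nrow(m) != ncol(m)) return(FALSE)
  # Cells left empty in the CSV would show up as NA
  if (anyNA(m)) return(FALSE)
  # Symmetric, with ones on the diagonal and all entries in [-1, 1]
  if (!isSymmetric(unname(m), tol = tol)) return(FALSE)
  if (any(abs(diag(m) - 1) > tol) || any(abs(m) > 1 + tol)) return(FALSE)
  # Positive semi-definite: no eigenvalue below zero (within tolerance)
  eig <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(eig >= -tol)
}

ok <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)

# Pairwise correlations that cannot all hold at once:
# symmetric, but not positive semi-definite
bad <- matrix(c( 1.0, 0.9, -0.9,
                 0.9, 1.0,  0.9,
                -0.9, 0.9,  1.0), nrow = 3, byrow = TRUE)

is_valid_cor_matrix(ok)   # TRUE
is_valid_cor_matrix(bad)  # FALSE
```

The second matrix shows why the check matters: each entry is a valid correlation on its own, yet the three together are inconsistent, so no dataset could reproduce them.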
No return value, called for exportation of files.
import_marginal_distributions
import_cor_matrix
## Not run:
marginals <- import_marginal_distributions()
export_empty_cor_matrix(
  marginals,
  folder_path = tempdir()
)
## End(Not run)
Export the marginal distributions to CSV files
export_marginal_distributions( marginals, folder_path, create_folder = FALSE, force = FALSE )
marginals | An object of class RESIDE. |
folder_path | Path to the folder in which to save the files. |
create_folder | If the folder does not exist, should it be created? Default: FALSE |
force | If the folder already contains marginal distribution files, should they be removed? Default: FALSE |
Exports each of the marginal distributions to CSV files within a given folder, along with the continuous quantiles.
No return value, called for exportation of files.
marginal_distributions <- get_marginal_distributions(IST)
export_marginal_distributions(
  marginal_distributions,
  folder_path = tempdir()
)
Generate Marginal Distributions from a given data frame with options to specify which variables to use.
get_marginal_distributions(df, variables = c(), print = FALSE)
df | Data frame to get the marginal distributions from |
variables | (Optional) variables (columns) to select, Default: c() |
print | Whether to print the marginal distributions to the console, Default: FALSE |
A function to generate marginal distributions from a given data frame. The marginals differ depending on the variable type. For binary variables, a mean and the number of missing values are generated. Continuous variables are first transformed, and both the mean and sd of the transformed variables are stored along with the quantile mapping for back transformation. For categorical variables, the number in each category is stored, and missing values are categorised as "missing".
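The three kinds of summary described above can be sketched in base R. This is an illustration only and not the package's implementation: the toy data frame, the object names, and in particular the rank-based normal-score transform used for the continuous variable are assumptions, since the transformation the package applies is not specified here.

```r
# Illustrative only: summaries of the kind described above, in base R.
# The package's own transformation and storage format may differ.
df <- data.frame(
  treated = c(1, 0, 1, NA, 1, 0),
  sex     = factor(c("M", "F", NA, "F", "M", "F")),
  sbp     = c(120, 135, 150, 128, 142, 160)
)

# Binary: mean of the observed values and number missing
binary_summary <- list(
  mean    = mean(df$treated, na.rm = TRUE),
  missing = sum(is.na(df$treated))
)

# Categorical: count per category, with NA counted as "missing"
sex_chr <- as.character(df$sex)
sex_chr[is.na(sex_chr)] <- "missing"
categorical_summary <- table(sex_chr)

# Continuous: store quantiles, then map values to normal scores
# (assumption: a rank-based normal-score transform stands in for the
# package's transformation)
probs <- seq(0, 1, by = 0.25)
sbp_quantiles <- quantile(df$sbp, probs = probs, names = FALSE)
z <- qnorm((rank(df$sbp) - 0.5) / length(df$sbp))
continuous_summary <- list(
  mean = mean(z),
  sd = sd(z),
  quantiles = sbp_quantiles
)

# Back transformation: normal score -> probability -> original scale,
# interpolating linearly between the stored quantiles
back <- approx(x = probs, y = sbp_quantiles, xout = pnorm(z))$y
```

Only aggregate quantities (means, counts, quantiles) are kept in these summaries, which is what makes marginals of this kind suitable for export from a restricted environment.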
A list of marginal distributions of an S3 RESIDE Class
marginal_distributions <- get_marginal_distributions(
  IST,
  variables = c(
    "SEX",
    "AGE",
    "ID14",
    "RSBP",
    "RATRIAL"
  )
)
Imports a correlation matrix from a csv file generated by export_empty_cor_matrix
import_cor_matrix(file_path = "./correlation_matrix.csv")
file_path | A path to the csv file, Default: './correlation_matrix.csv' |
A function to import the user-specified correlations from the csv file exported by the export_empty_cor_matrix function.
Correlations should be entered into the CSV file using rank order correlations. The correlation matrix should be symmetric and positive semi-definite.
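Rank order (Spearman) correlations of the kind requested above can be computed in base R with `cor(method = "spearman")`. The sketch below is illustrative: the data, the column names, and the CSV layout (variable names as row names in the first column) are assumptions and may not match the layout of the file produced by the package.

```r
# Illustrative data; the column names are hypothetical
dat <- data.frame(
  age   = c(34, 51, 47, 62, 70, 58, 45, 66),
  sbp   = c(118, 130, 141, 150, 138, 160, 125, 155),
  delay = c(20, 4, 12, 30, 8, 16, 24, 2)
)

# Spearman (rank order) correlation matrix
rho <- cor(dat, method = "spearman")

# Rank correlations are unchanged by monotone transformations
rho_log <- cor(transform(dat, sbp = log(sbp)), method = "spearman")

# Round trip through a CSV file, keeping variable names as row names
path <- tempfile(fileext = ".csv")
write.csv(rho, path)
rho_in <- as.matrix(read.csv(path, row.names = 1, check.names = FALSE))

isSymmetric(rho_in)
```

Because rank correlations do not depend on the scale of a variable, they can be chosen or estimated without knowing how a continuous variable was transformed.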
a matrix of correlations that can be used with synthesise_data
export_empty_cor_matrix
is.positive.semi.definite
## Not run:
import_cor_matrix("correlation_matrix.csv")
## End(Not run)
Import the marginal distribution as exported from a Trusted Research Environment (TRE)
import_marginal_distributions( folder_path = ".", binary_variables_file = "", categorical_variables_file = "", continuous_variables_file = "", continuous_quantiles_file = "", summary_file = "summary.csv" )
folder_path | Where the marginal distribution files are located, Default: '.' see details. |
binary_variables_file | filename for the binary variables file, Default: '' see details. |
categorical_variables_file | filename for the categorical variables file, Default: '' see details. |
continuous_variables_file | filename for the continuous variables file, Default: '' see details. |
continuous_quantiles_file | filename for the continuous quantiles file, Default: '' see details. |
summary_file | filename for the summary file, Default: 'summary.csv' see details. |
This function will import marginal distributions as generated within a Trusted Research Environment (TRE) using the function export_marginal_distributions.
The folder_path allows the path of the files provided by the TRE to be imported; this will default to the current working directory. The file parameters will provide the default file names if no filenames are specified.
Returns an object of a RESIDE class
## Not run:
marginals <- import_marginal_distributions()
## End(Not run)
The International Stroke Trial Dataset
IST
A data frame with 19435 rows and 112 columns:
Randomisation data: Age in years
Other data and derived variables: Compliant for aspirin
Other data and derived variables: Compliant for heparin
Other data and derived variables: Country code
Other data and derived variables: Abbreviated country code
Recurrent stroke within 14 days: Discharged alive from hospital
Recurrent stroke within 14 days: Date Discharged alive from hospital
Data collected on 14 day/discharge form about treatments given in hospital: Non trial antiplatelet drug (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Aspirin given for 14 days or till death or discharge (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Discharged on long term aspirin (Y/N)
Randomisation data: Estimate of local day of week (assuming RDATE is Oxford)
Data collected on 14 day/discharge form about treatments given in hospital: Calcium antagonists (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Carotid surgery (Y/N)
Other events within 14 days: Dead on discharge form
Other events within 14 days: Cause of death (1-Initial stroke /2-Recurrent stroke (ischaemic or unknown) /3-Recurrent stroke (haemorrhagic) /4-Pneumonia /5-Coronary heart disease /6-Pulmonary embolism /7-Other vascular or unknown /8-Non-vascular /0-unknown)
Date of dead on discharge form (yyyy/mm/dd); NOTE: this death is not necessarily within 14 days of randomisation
Other events within 14 days: Comment on death
Final diagnosis of initial event: Haemorrhagic stroke
Final diagnosis of initial event: Ischaemic stroke
Final diagnosis of initial event: Indeterminate stroke
Indicator variables for specific causes of death: Initial stroke
Indicator variables for specific causes of death: Recurrent ischaemic/unknown stroke
Indicator variables for specific causes of death: Recurrent haemorrhagic stroke
Indicator variables for specific causes of death: Pneumonia
Indicator variables for specific causes of death: Coronary heart disease
Indicator variables for specific causes of death: Pulmonary embolism
Indicator variables for specific causes of death: Other vascular or unknown
Indicator variables for specific causes of death: Non vascular
Data collected on 14 day/discharge form about treatments given in hospital: Glycerol or mannitol (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Haemodilution (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Medium dose heparin given for 14 days etc in pilot (combine with above)
Other data and derived variables: Indicator variable for death (1=died; 0=did not die)
Data collected on 14 day/discharge form about treatments given in hospital: Non trial intravenous heparin (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Low dose heparin given for 14 days or till death/discharge (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Major non-cerebral haemorrhage (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Date of Major non-cerebral haemorrhage (yyyy/mm/dd)
Data collected on 14 day/discharge form about treatments given in hospital: Comment of Major non-cerebral haemorrhage
Data collected on 14 day/discharge form about treatments given in hospital: Date of Major non-cerebral haemorrhage (yyyy/mm/dd)
Final diagnosis of initial event: Not a stroke
Final diagnosis of initial event: Comment on Not a stroke
Data collected on 14 day/discharge form about treatments given in hospital: Other anticoagulants (Y/N)
Other events within 14 days: Pulmonary embolism
Other events within 14 days: Date of Pulmonary embolism (yyyy/mm/dd)
Other events within 14 days: Discharge destination (A-Home /B-Relatives home /C-Residential care /D-Nursing home /E-Other hospital departments /U-Unknown)
Recurrent stroke within 14 days: Haemorrhagic stroke
Recurrent stroke within 14 days: Date of Haemorrhagic stroke (yyyy/mm/dd)
Recurrent stroke within 14 days: Ischaemic recurrent stroke
Recurrent stroke within 14 days: Date of Ischaemic recurrent stroke (yyyy/mm/dd)
Recurrent stroke within 14 days: Unknown type
Recurrent stroke within 14 days: Date of Unknown type (yyyy/mm/dd)
Data collected on 14 day/discharge form about treatments given in hospital: Non trial subcutaneous heparin (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Other side effect (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Date of Other side effect
Data collected on 14 day/discharge form about treatments given in hospital: Comment of Other side effect
Data collected on 14 day/discharge form about treatments given in hospital: Steroids (Y/N)
Data collected on 14 day/discharge form about treatments given in hospital: Thrombolysis (Y/N)
Indicator variables for specific causes of death: Indicator of deep vein thrombosis on discharge form
Other data and derived variables: Predicted probability of death at 14 days
Other data and derived variables: Predicted probability of death at 6 months
Other data and derived variables: Predicted probability of death/dependence at 6 months
Data collected at 6 months: On antiplatelet drugs
Data collected at 6 months: Dead at six month follow-up (Y/N)
Data collected at 6 months: Cause of death (1-Initial stroke /2-Recurrent stroke (ischaemic or unknown) /3-Recurrent stroke (haemorrhagic) /4-Pneumonia /5-Coronary heart disease /6-Pulmonary embolism /7-Other vascular or unknown /8-Non-vascular /0-unknown)
Data collected at 6 months: Date of death; NOTE: this death is not necessarily within 6 months of randomisation
Data collected at 6 months: Comment on death
Data collected at 6 months: Dependent at 6 month follow-up (Y/N)
Data collected at 6 months: Date of last contact
Data collected at 6 months: On anticoagulants
Data collected at 6 months: Place of residence at 6 month follow-up (A-Home /B-Relatives home /C-Residential care /D-Nursing home /E-Other hospital departments /U-Unknown)
Data collected at 6 months: Fully recovered at 6 month follow-up (Y/N)
Other data and derived variables: Date discharge form completed
Other data and derived variables: Date discharge form received
Other data and derived variables: Date 6 month follow-up done
Indicator variables for specific causes of death: Cerebral bleed/haemorrhagic stroke within 14 days; this is a slightly wider definition than DRSH and is used for analysis of cerebral bleeds
Randomisation data: Hospital number
Randomisation data: Local time – hours
Indicator variables for specific causes of death: Indicator of haemorrhagic transformation within 14 days
Other data and derived variables: Indicator of death at 14 days
Indicator variables for specific causes of death: Indicator of ischaemic stroke within 14 days
Randomisation data: Local time – minutes
Indicator variables for specific causes of death: Indicator of any non-cerebral bleed within 14 days
Other data and derived variables: Coding of compliance (see Table 3) doi:10.1186/1745-6215-13-24
Indicator variables for specific causes of death: Indicator of indeterminate stroke within 14 days
Other data and derived variables: Six month outcome (1-dead /2-dependent /3-not recovered /4-recovered /8 or 9 – missing status)
Data collected on 14 day/discharge form about treatments given in hospital: Estimate of time in days on trial treatment
Indicator variables for specific causes of death: Indicator of pulmonary embolism within 14 days
Randomisation data: Aspirin within 3 days prior to randomisation (Y/N)
Randomisation data: Atrial fibrillation (Y/N); not coded for pilot phase - 984 patients
Randomisation data: Conscious state at randomisation (F - fully alert, D - drowsy, U - unconscious)
Randomisation data: CT before randomisation (Y/N)
Randomisation data: Date of randomisation
Randomisation data: Face deficit (Y/N/C=can't assess)
Randomisation data: Arm/hand deficit (Y/N/C=can't assess)
Randomisation data: Leg/foot deficit (Y/N/C=can't assess)
Randomisation data: Dysphasia (Y/N/C=can't assess)
Randomisation data: Hemianopia (Y/N/C=can't assess)
Randomisation data: Visuospatial disorder (Y/N/C=can't assess)
Randomisation data: Brainstem/cerebellar signs (Y/N/C=can't assess)
Randomisation data: Other deficit (Y/N/C=can't assess)
Randomisation data: Delay between stroke and randomisation in hours
Randomisation data: Heparin within 24 hours prior to randomisation (Y/N)
Randomisation data: Systolic blood pressure at randomisation (mmHg)
Randomisation data: Symptoms noted on waking (Y/N)
Randomisation data: Infarct visible on CT (Y/N)
Randomisation data: Trial aspirin allocated (Y/N)
Randomisation data: Trial heparin allocated (M/L/N) [M is coded as H=high in pilot]
Other data and derived variables: Known to be dead or alive at 14 days (1=Yes, 0=No); this does not necessarily mean that we know outcome at 6 months – see OCCODE for this
Randomisation data: M=male; F=female
Indicator variables for specific causes of death: Indicator of any stroke within 14 days
Randomisation data: Stroke subtype (TACS/PACS/POCS/LACS/other)
Other data and derived variables: Time of death or censoring in days
Indicator variables for specific causes of death: Indicator of major non-cerebral bleed within 14 days
...
Obtained from Sandercock, Peter; Niewada, Maciej; Czlonkowska, Anna. (2011). International Stroke Trial database (version 2), [dataset]. University of Edinburgh. Department of Clinical Neurosciences. doi:10.7488/ds/104 Under ODC-by licence
Sandercock P et al. [email protected]
S3 override for print RESIDE
## S3 method for class 'RESIDE'
print(x, ...)
x | an object of class RESIDE |
... | Other parameters; currently none are used |
S3 Override for RESIDE Class
No return value, called to print to the terminal.
print(
  marginal_distributions <- get_marginal_distributions(
    IST,
    variables = c(
      "SEX",
      "AGE",
      "ID14",
      "RSBP",
      "RATRIAL"
    )
  )
)
Allows the synthesis of data from marginal distributions obtained from a Trusted Research Environment (TRE)
synthesise_data(marginals, correlation_matrix = NULL, ...)
synthesize_data(marginals, correlation_matrix = NULL, ...)
marginals | an object of class RESIDE |
correlation_matrix | (optional) correlation matrix, Default: NULL; see details. |
... | Additional parameters; currently none are used. |
This function will synthesise a dataset from marginals imported using import_marginal_distributions.
By default the dataset will not contain correlations; however, user-specified correlations can be added using the correlation_matrix parameter. See export_empty_cor_matrix and import_cor_matrix for more details.
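The general idea of adding correlations to data drawn from marginal distributions can be sketched in base R. This is a sketch under stated assumptions and not the package's implementation: it assumes a Gaussian copula, treats the supplied correlation as the correlation of the latent normal variables, and uses hypothetical quantiles and proportions in place of imported marginals.

```r
# Sketch of copula-based synthesis (assumption: a Gaussian copula;
# this is not the package's implementation).
set.seed(42)
n <- 1000

# User-specified correlation between two variables
cor_mat <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)

# 1. Draw correlated standard normals using the Cholesky factor
#    (chol() returns upper-triangular R with t(R) %*% R == cor_mat)
z <- matrix(rnorm(n * 2), ncol = 2) %*% chol(cor_mat)

# 2. Convert to correlated uniforms on (0, 1)
u <- pnorm(z)

# 3. Map the uniforms onto each marginal
#    Continuous: interpolate stored quantiles (hypothetical values)
probs <- c(0, 0.25, 0.5, 0.75, 1)
sbp_quantiles <- c(90, 130, 150, 170, 230)
sbp <- approx(x = probs, y = sbp_quantiles, xout = u[, 1])$y
#    Binary: threshold at the stored proportion (hypothetical value)
p_treated <- 0.3
treated <- as.integer(u[, 2] > 1 - p_treated)

synthetic <- data.frame(sbp = sbp, treated = treated)

# The dependence carries through to the synthetic variables
cor(synthetic$sbp, synthetic$treated, method = "spearman")
```

Leaving out step 1, so that each column of `z` is drawn independently, corresponds to the default case of a synthetic dataset without correlations.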
a data frame of simulated data
export_empty_cor_matrix
import_cor_matrix
## Not run:
marginals <- import_marginal_distributions()
df <- synthesise_data(marginals)
## End(Not run)