devMSMs/ExampleDataPrep.Rmd at main · istallworthy/devMSMs · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
---
title: "Data Preparation for using devMSMs with Longitudinal Data"
output: html_document
date: "2024-05-17"
---
The following notebook provides a guide to assist the user in preparingn their data for use with *devMSMs*.

Please review the accompanying manuscript for a full conceptual and practical introduction to MSMs in the context of developmental data. Please also see the *Data Requirements & Preparation* vignette on the *devMSMs* website for step-by-step guidance on the use of this code: https://istallworthy.github.io/devMSMs/index.html.

Headings denote accompanying website sections and steps. We suggest using the interactive outline tool (located above the Console) for ease of navigation.

We advise users implement the appropriate data prepeartion steps in accordance with your starting data, with the goal of assigning to `data` one of the following wide data formats (see Figure 1) for use in the package:

* a single data frame of data in wide format with no missing data

* a mids object (output from mice::mice()) of data imputed in wide format

* a list of data imputed in wide format as data frames.


Users have several options for reading in data. They can begin this workflow with the following options:

* long data with missingness can can be formatted and converted to wide data (P1a) for imputation (P2)

* wide with missingness can be formatted (P1b) before imputing (P2)

* data already imputed in wide format can be read in as a list (P2).

* data with no missingness (P3)
<br>

## Installation
You will always need to install the *devMSMs helper* functions from Github (*devMSMsHelpers*).

```{r}
install.packages("devtools", quiet = TRUE)
library(devtools)

install_github("istallworthy/devMSMsHelpers", quiet = TRUE)
library(devMSMsHelpers)

install_github("istallworthy/devMSMs", quiet = TRUE)
library(devMSMs, include.only = "initMSM")
```

Preliminary steps P1 (formulate hypotheses) and P2 (create a DAG) are detailed further in the accompanying manuscript. These following recommended preliminary steps are designed to assist the user in preparing and inspecting their data to ensure appropriate use of the package.
<br>

# P3. Specify Core Inputs
Please see the *Specify Core Inputs* vignette for further details.

```{r}
exposure = c("ESETA1.6", "ESETA1.15", "ESETA1.24", "ESETA1.35", "ESETA1.58")

outcome <- "StrDif_Tot.58"

ti_conf =  c("state", "BioDadInHH2", "PmAge2", "PmBlac2", "TcBlac2", "PmMrSt2", "PmEd2", "KFASTScr",
             "RMomAgeU", "RHealth", "HomeOwnd", "SWghtLB", "SurpPreg", "SmokTotl", "DrnkFreq",
             "peri_health", "caregiv_health", "gov_assist")

tv_conf = c("SAAmylase.6","SAAmylase.15", "SAAmylase.24",
            "MDI.6", "MDI.15",
            "RHasSO.6", "RHasSO.15", "RHasSO.24","RHasSO.35", "RHasSO.58",
            "WndNbrhood.6","WndNbrhood.24", "WndNbrhood.35", "WndNbrhood.58",
            "IBRAttn.6", "IBRAttn.15", "IBRAttn.24",
            "B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58",
            "HOMEETA1.6", "HOMEETA1.15", "HOMEETA1.24", "HOMEETA1.35", "HOMEETA1.58",
            "InRatioCor.6", "InRatioCor.15", "InRatioCor.24", "InRatioCor.35", "InRatioCor.58",
            "CORTB.6", "CORTB.15", "CORTB.24",
            "EARS_TJo.24", "EARS_TJo.35",
            "LESMnPos.24", "LESMnPos.35",
            "LESMnNeg.24", "LESMnNeg.35",
            "StrDif_Tot.35", "StrDif_Tot.58",
            "fscore.35", "fscore.58")

home_dir = '/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/isa/'

```
<br>


# P.4 Data Preparation & Inspection
## P4.1. Single Long Data Frame
Users beginning with a single data frame in long format (with or without missingness).

### P4.1a. Format Long Data
Users have the option of formatting their long data if their data columns are not yet labeled correctly using the `formatLongData()` helper function.

Read in long data and specify any factors and integers
```{r}
data_long <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_long_missing_unformatted.csv",
                      header = TRUE)

factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
                        "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
                        "RHasSO")

integer_confounders <- c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB",
                         "peri_health", "caregiv_health" ,
                         "gov_assist", "B18Raw", "EARS_TJo", "MDI")
```

```{r}
data_long_f <- formatLongData(data = data_long,
                              exposure = exposure,
                              outcome = outcome,
                              sep = "\\.",
                              time_var = "Tage",
                              id_var = "S_ID",
                              missing = -9999,
                              factor_confounders = factor_confounders,
                              integer_confounders = integer_confounders,
                              home_dir = home_dir,
                              save.out = TRUE)
head(data_long_f)
```

Make sure to inspect your data at each step of the way to ensure formatting looks correct.


### P4.1b. Tranform Formatted Long Data to Wide
Users with correctly formatted data in long format have the option of using the following code to transform their data into wide format, to proceed to using the package (if there is no missing data) or imputing (with < 20% missing data MAR). They may see some warnings about *NA* values.

```{r}

sep <- "\\." # time delimiter
v <- sapply(strsplit(tv_conf[!grepl("\\:", tv_conf)], sep), head, 1)
v <- c(v[!duplicated(v)], sapply(strsplit(exposure[1], sep), head, 1))

library(stats)
data_wide_f <- stats::reshape(data = data_long_f,
                            idvar = "ID",
                            v.names = v,
                            timevar = "WAVE",
                            times = c(6, 15, 24, 35, 58),
                            direction = "wide")

data_wide_f <- data_wide_f[, colSums(is.na(data_wide_f)) < nrow(data_wide_f)]

head(data_wide_f)
```


## P4.2. Single Wide Data Frame
Alternatively, we could start with a single data frame of wide data (with or without missingness).

### P4.2a. Format Wide Data
Users can draw on the `formatWideData()` helper function to format their wide data.

Read in long data and specify any factors and integers
```{r}
data_wide <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/Tutorial paper/merged_tutorial_filtered.csv",
                      header = TRUE)

factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
                        "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
                        "RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")

integer_confounders = c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB", "peri_health", "caregiv_health" ,
                        "gov_assist", "B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58",
                        "EARS_TJo.24", "EARS_TJo.35", "MDI.6", "MDI.15")
```

Format wide data
```{r}
data_wide_f <- formatWideData(data = data_wide,
                              exposure = exposure,
                              outcome = outcome,
                              sep = "\\.",
                              id_var = "ID",
                              missing = NA,
                              factor_confounders = factor_confounders,
                              integer_confounders = integer_confounders,
                              home_dir = home_dir, save.out = TRUE)

head(data_wide_f)
```


## P4.3. Formatted Wide Data with Missingness
Most data collected from humans will have some degree of missing data.

### P4.3a. Multiply Impute Formatted, Wide Data Frame using MICE
Users have the option of imputing their wide, formatted data using the *mice* package, for use with *devMSMs*.

Specify imputation parameters
```{r}
s <- 1234

m <- 5

method <- "rf"

maxit <- 0 # 5
```

Impute data
```{r}
set.seed(s)

imputed_data <- imputeData(data = data_wide_f,
                           exposure = exposure,
                           outcome = outcome,
                           sep = "\\.",
                           m = m,
                           method = method,
                           maxit = maxit,
                           para_proc = TRUE,
                           seed = s,
                           read_imps_from_file = FALSE,
                           home_dir = home_dir,
                           save.out = TRUE)
```

To use the package with imputed data
```{r}

data <- imputed_data

library(mice)
summary(mice::complete(data, 1))
```


Alternatively, users could read in a saved mids object for use with *devMSMS*.
```{r}
imputed_data <- readRDS("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_wide_imputed_mids.rds")
```


### P4.3b. Read in as a List of Wide Imputed Data Saved Locally
Alternatively, if a user has imputed datasets already created using a program other than *mice*, they can read in, as a list, files saved locally as .csv files (labeled “1”:m) in a single folder.

```{r}
folder <- "/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/imputations/"

files <- list.files(folder, full.names = TRUE, pattern = "\\.csv")
```

To use the package with a list of imputed data from above
```{r}
data <- lapply(files, function(file) {
  imp_data <- read.csv(file)
  imp_data
})


# check for character variables

any(as.logical(unlist(lapply(data, function(x) { # if this is TRUE, run next lines
  any(sapply(x, class) == "character") }))))
names(data[[1]])[sapply(data[[1]], class) == "character"] # find names of any character variables

# run this for each variable that needs char -> factor

data <- lapply(data, function(x){
  x[, "state"] <- factor(x[, "state"], labels = c(1, 0))
})


# make factors; list factor variables here

factor_covars <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
                   "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
                   "RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")

data <- lapply(data, function(x) {
  x[, factor_covars] <- as.data.frame(lapply(x[, factor_covars], as.factor))
  x })
```

Having read in their data, users should now have have assigned to `data` wide, complete data as: a data frame, mice object, or list of imputed data frames for use with the *devMSMs* functions.