DNAmArray_workflow/04_Normalization.Rmd at master · molepi/DNAmArray_workflow · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
```{r, child="_setup.Rmd"}
```

***

# Normalization #

## Motivation ##

Our workflow outlines the use of functional normalization<sup>16</sup>, which exploits internal control probes designed to detect technical variation without assaying biological differences, and of dasen, as implemented in [**wateRmelon**](https://www.bioconductor.org/packages/devel/bioc/html/wateRmelon.html)<sup>30</sup>. We use the adjusted versions of both methods, which have been updated to use the interpolatedXY method<sup>31</sup>.

Functional normalization has been shown to perform favourably when compared to other approaches<sup>17</sup>. Using the internal control probes avoids the problems associated with global normalization methods, where biological variation can be mistaken for a technical effect and removed. This is especially important in studies where groups are expected to have differential methylation signatures, such as multiple tissue studies<sup>18</sup>.

Conversations on the best approaches for normalization in DNAm data pipelines are ongoing<sup>19</sup>.

***

# Principal Components #

The default of two principal components is often too low for this type of data. A scree plot of the proportion of variance explained per principal component usually shows a clear drop-off after a certain number of components, and the position of that drop-off is a sensible choice for the number of components to use.

```{r 401scree}
var_explained %>% ggplot(aes(x=PC, y=var_explained)) +
  geom_line() +
  geom_point(color='grey5', fill='#6DACBC', shape=21, size=3) +
  scale_x_continuous(breaks=1:ncol(pca$x)) +
  xlab("Principal Component") +
  ylab("Proportion of variance explained") +
  theme_bw()
```
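
The scree plot above expects a `pca` object and a `var_explained` data frame with columns `PC` and `var_explained`. As a minimal sketch of how objects of that shape could be built, the chunk below runs `prcomp` on simulated data. The matrix `controls_demo` and the `_demo` names are hypothetical stand-ins for illustration; they are not the objects used elsewhere in this workflow.

```{r eval=F}
# Simulated stand-in for a samples x control-summaries matrix (assumption)
set.seed(1)
controls_demo <- matrix(rnorm(20 * 8), nrow = 20, ncol = 8)

# Principal component analysis on centred and scaled columns
pca_demo <- prcomp(controls_demo, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained_demo <- data.frame(
  PC = seq_along(pca_demo$sdev),
  var_explained = pca_demo$sdev^2 / sum(pca_demo$sdev^2)
)

var_explained_demo
```

Because `prcomp` orders components by decreasing standard deviation, the proportions never increase from one component to the next, which is what makes the drop-off in a scree plot readable.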

***

# Running Normalization #

To run normalization on EPIC arrays, the annotation of the `RGset` must first be updated.

```{r 402anno}
RGset@annotation <- c(array = "IlluminaHumanMethylationEPIC", annotation = "ilm10b4.hg19")
```

We use the `adjustedFunnorm` function from [**wateRmelon**](https://www.bioconductor.org/packages/devel/bioc/html/wateRmelon.html), which uses the interpolatedXY method<sup>31</sup>. By default, functional normalization also returns normalized copy number data, which makes the returned `GenomicRatioSet` twice as large as necessary when only beta-values or M-values are required. Therefore, we set `keepCN` to `FALSE`.

```{r 403funnorm}
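# Functional normalization with interpolatedXY handling of sex-chromosome probes
GRset <- adjustedFunnorm(
  rgSet = RGset,
  nPCs = 4,                                     # number of principal components (see scree plot)
  sex = ifelse(targets$sex == "Female", 0, 1),  # numeric sex coding, one value per sample
  keepCN = FALSE,                               # do not return copy number data
  verbose = TRUE
)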

GRset
```
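
Note that `ifelse(targets$sex == "Female", 0, 1)` silently codes every label other than the exact string `"Female"` as 1, including differently capitalised labels such as `"female"`, and passes missing values through as `NA`. As a minimal sketch of a stricter encoding, the hypothetical helper `encode_sex` below (not part of wateRmelon or DNAmArray) keeps the same 0/1 coding but stops on unexpected labels. It assumes the labels are some spelling of "Female" and "Male".

```{r eval=F}
# Hypothetical helper: encode sex labels as 0 (female) / 1 (male),
# stopping on any label that is missing or not recognised
encode_sex <- function(sex_labels) {
  labels <- tolower(trimws(as.character(sex_labels)))
  known <- c(female = 0, male = 1)
  bad <- !(labels %in% names(known))
  if (any(bad)) {
    stop("Unexpected sex labels: ",
         paste(unique(sex_labels[bad]), collapse = ", "))
  }
  unname(known[labels])
}

encode_sex(c("Female", "Male", "female "))
```

In the call above, `sex = encode_sex(targets$sex)` could then replace the `ifelse` expression, assuming `targets$sex` holds such labels.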

It is also possible to use `adjustedDasen`, which applies dasen normalization to the autosomal CpGs and infers the values of the sex-chromosome CpGs by linear interpolation on the corrected autosomal CpGs. Instead of a `GenomicRatioSet` such as `GRset`, this function returns the normalized beta values directly, so the DNAmArray `reduce` function is no longer needed in the next steps.

```{r eval=F}
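# Returns a matrix of normalized beta values rather than a GenomicRatioSet
betas <- adjustedDasen(mns = methylated(RGset),             # methylated intensities
                       uns = unmethylated(RGset),           # unmethylated intensities
                       onetwo = fData(RGset)[,fot(RGset)],  # probe design type (I or II)
                       chr = fData(RGset)$CHR)              # chromosome of each CpG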
```
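
To give an intuition for the interpolation step described above, the chunk below is a toy sketch of linear interpolation with base R's `approx`. It is not wateRmelon's implementation. All values and variable names are invented for illustration, and it assumes a single sample in which each autosomal probe has a raw and a normalized intensity.

```{r eval=F}
# Invented raw and normalized intensities for five autosomal probes
raw_auto  <- c(100, 200, 400, 800, 1600)
norm_auto <- c(120, 210, 390, 760, 1500)

# Invented raw intensities for four sex-chromosome probes
raw_xy <- c(150, 400, 1200, 2000)

# Map sex-chromosome probes through the autosomal raw-to-normalized
# relationship; rule = 2 clamps values outside the autosomal range
norm_xy <- approx(x = raw_auto, y = norm_auto, xout = raw_xy, rule = 2)$y

norm_xy
```

In this sketch the sex-chromosome probes play no part in estimating the correction; they are only passed through the relationship learned from the autosomal probes.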

***