An Environment-Wide Association Study (EWAS) examines the association between a health event and a series of exposures, testing each exposure in turn. With this package, it is possible to carry out an EWAS analysis simply and to obtain easily interpretable results in the output.
To do this, you must first complete three steps:
Prepare your dataset
Determine which health event you want to study
Run the analysis
The Elja package works step by step to perform an EWAS analysis:
It structures the dataset you have provided
It fits one model per exposure, using the type of model chosen for the type of outcome (continuous, binary categorical, etc.)
It returns a data frame containing, for each tested exposure (and all associated modalities): the value of the estimator (odds ratio or coefficient), its 95% confidence interval, the associated p-value, the number of values taken into account in the model, and the AIC of the model.
It can also display two types of Manhattan plot, both with visual indicators of the alpha threshold at 0.05 and of the alpha thresholds corrected according to the Bonferroni method and the Benjamini-Hochberg False Discovery Rate (FDR). The first represents all the variables of the EWAS analysis; the second shows only the significant ones.
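The steps above can be sketched in base R to make the idea concrete. This is only an illustration under stated assumptions, not the internal code of Elja: the data are simulated, the helper name ewas_sketch is hypothetical, only numeric exposures are handled, and the confidence intervals are Wald intervals, which may differ from the intervals the package reports.

```r
# Toy data: simulated, not taken from the Elja package or the PIMA dataset.
set.seed(42)
n <- 200
toy <- data.frame(
  exposure_a = rnorm(n),
  exposure_b = rnorm(n),
  exposure_c = rnorm(n)
)
# In this simulation only exposure_a truly influences the outcome.
toy$event <- factor(
  ifelse(rbinom(n, 1, plogis(0.8 * toy$exposure_a)) == 1, "yes", "no"),
  levels = c("no", "yes")
)

# Hypothetical helper: one logistic model per exposure, one result row each.
ewas_sketch <- function(data, outcome) {
  exposures <- setdiff(names(data), outcome)
  z <- qnorm(0.975)  # multiplier for a 95% Wald interval (an assumption)
  rows <- lapply(exposures, function(x) {
    fit <- glm(reformulate(x, response = outcome),
               data = data, family = binomial)
    est <- coef(summary(fit))[2, ]  # row of the (numeric) exposure term
    data.frame(
      exposure   = x,
      odds_ratio = exp(est[["Estimate"]]),
      ci_low     = exp(est[["Estimate"]] - z * est[["Std. Error"]]),
      ci_high    = exp(est[["Estimate"]] + z * est[["Std. Error"]]),
      p_value    = est[["Pr(>|z|)"]],
      n          = nobs(fit),
      AIC        = AIC(fit)
    )
  })
  do.call(rbind, rows)
}

res <- ewas_sketch(toy, "event")
print(res)
```

Each exposure gets its own univariate model, so the rows of the result are independent of one another; this is what makes a multiple-testing correction necessary when reading the p-values.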
This document introduces the basic use of this package in an EWAS analysis.
To show the use of the Elja package in a simple way, we will use the PIMA dataset, which is available in the mlbench package (https://mlbench.github.io/).
library(mlbench)
data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)
#> pregnant glucose pressure triceps insulin mass pedigree age diabetes
#> 1 6 148 72 35 0 33.6 0.627 50 pos
#> 2 1 85 66 29 0 26.6 0.351 31 neg
#> 3 8 183 64 0 0 23.3 0.672 32 pos
#> 4 1 89 66 23 94 28.1 0.167 21 neg
#> 5 0 137 40 35 168 43.1 2.288 33 pos
#> 6 5 116 74 0 0 25.6 0.201 30 neg
This dataset, which contains a health event (diabetes), will allow us to illustrate how the Elja package works.
Before running the function, we have to make sure that the dataset is well structured.
To do so, we have to check two elements:
The health event (outcome) must be in the same data frame as all the exposures, and the data frame should not contain other outcomes: this avoids fitting a model in which another outcome is treated as an exposure.
The variables must have the correct class: to check this, use str().
str(PimaIndiansDiabetes)
#> 'data.frame': 768 obs. of 9 variables:
#> $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
#> $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
#> $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
#> $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
#> $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
#> $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
#> $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
#> $ age : num 50 31 32 21 33 30 26 29 53 54 ...
#> $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
Diabetes, which is our target health event, is the only outcome in the data frame; all the other columns are exposures. In addition, every variable has the correct class.
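When a variable does not have the expected class, it can be recoded before the analysis. The sketch below uses a small made-up data frame (the column names and values are hypothetical) in which the outcome is coded 0/1 and a categorical exposure is stored as character.

```r
# Toy data frame (hypothetical): outcome read as numeric, exposure as character.
toy <- data.frame(
  event    = c(0, 1, 1, 0, 1),
  exposure = c(2.1, 3.4, 1.9, 4.0, 2.8),
  smoker   = c("no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# One class per column: a compact alternative to str() for a quick check.
print(sapply(toy, class))

# Recode the outcome and the character exposure as factors.
toy$event  <- factor(toy$event, levels = c(0, 1), labels = c("neg", "pos"))
toy$smoker <- factor(toy$smoker)

classes <- sapply(toy, class)
print(classes)
```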
The appropriate model depends on the class of the outcome. It is therefore necessary to choose the right model for the type of variable chosen as the health event.
We have seen previously that our health event is binary categorical: diabetes (pos/neg).
We can therefore use a logistic regression model.
The approach shown here for logistic regression is similar for linear models, with the ELJAlinear function, and for Generalized Linear Models, with the ELJAglm function.
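For readers less familiar with these model families, the sketch below shows their base R counterparts on simulated data. It does not call ELJAlinear or ELJAglm, whose arguments are not described in this document; the variable names and the choice of a Poisson family for the generalized linear model are assumptions made for illustration.

```r
# Simulated data (hypothetical): one exposure and three kinds of outcome.
set.seed(1)
toy <- data.frame(exposure = rnorm(100))
toy$continuous <- 2 + 0.5 * toy$exposure + rnorm(100)
toy$binary <- factor(rbinom(100, 1, plogis(toy$exposure)),
                     levels = c(0, 1), labels = c("neg", "pos"))
toy$count <- rpois(100, exp(0.3 * toy$exposure))

# Continuous outcome: linear model.
fit_linear <- lm(continuous ~ exposure, data = toy)

# Binary categorical outcome: logistic regression.
fit_logistic <- glm(binary ~ exposure, data = toy, family = binomial)

# Count outcome: one example of another generalized linear model.
fit_poisson <- glm(count ~ exposure, data = toy, family = poisson)

# The exposure coefficient is read on a different scale in each model:
# a mean difference, a log odds ratio, and a log rate ratio respectively.
print(coef(fit_linear)[["exposure"]])
print(exp(coef(fit_logistic)[["exposure"]]))
print(exp(coef(fit_poisson)[["exposure"]]))
```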
With the dataset prepared and the type of model chosen, we can proceed to the analysis.
To do so, the following arguments are needed:
var: Name of the outcome / health event, given as a character string; it must be categorical for a logistic regression model
data: Dataframe that contains the outcome and all the exposures to be tested
Optional arguments control what is displayed in the output of the function:
manplot: Indicates whether to display a Manhattan plot of the results
nbvalmanplot: Indicates the number of values to display in the Manhattan plot (in order not to overload the graph)
Bonferroni: Indicates whether to display the Bonferroni threshold on the Manhattan plot
FDR: Indicates whether to display the False Discovery Rate threshold according to the Benjamini-Hochberg method on the Manhattan plot
manplotsign: Indicates whether to display a Manhattan plot containing only the significant values (p-value < 0.05)
library(Elja)
ELJAlogistic(var = 'diabetes', data = PimaIndiansDiabetes, manplot = TRUE,
             Bonferroni = TRUE, FDR = TRUE, nbvalmanplot = 30, manplotsign = FALSE)
results
#> level odd_ratio ci_low ci_high p_value n
#> pregnant_pregnant pregnant 1.147008 1.0970869 1.200315 2.147445e-09 768
#> glucose_glucose glucose 1.038599 1.0321816 1.045439 2.378098e-31 768
#> pressure_pressure pressure 1.007452 0.9994922 1.015902 7.299362e-02 768
#> triceps_triceps triceps 1.009911 1.0005344 1.019455 3.881576e-02 768
#> insulin_insulin insulin 1.002301 1.0010311 1.003607 4.353455e-04 768
#> mass_mass mass 1.098044 1.0730012 1.124942 8.449577e-15 768
#> pedigree_pedigree pedigree 2.953073 1.8799770 4.713627 3.702926e-06 768
#> age_age age 1.042922 1.0296867 1.056659 1.773155e-10 768
#> AIC
#> pregnant_pregnant 960.2099
#> glucose_glucose 812.7196
#> pressure_pressure 994.1276
#> triceps_triceps 993.1890
#> insulin_insulin 984.8104
#> mass_mass 924.7142
#> pedigree_pedigree 974.8609
#> age_age 954.7203
The function displays a Manhattan plot of the EWAS results, and the results data frame gives the detailed estimates for each exposure.
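To see how the three significance criteria compare, the p-values printed above can be checked by hand in base R. This is a sketch rather than the package's own computation: the p-values are copied from the output above, and the Benjamini-Hochberg criterion is applied through p.adjust(), which is an assumption about how the FDR threshold is derived.

```r
# p-values copied from the results table above.
p_values <- c(
  pregnant = 2.147445e-09, glucose = 2.378098e-31, pressure = 7.299362e-02,
  triceps  = 3.881576e-02, insulin = 4.353455e-04, mass     = 8.449577e-15,
  pedigree = 3.702926e-06, age     = 1.773155e-10
)

alpha <- 0.05

# Bonferroni: divide alpha by the number of tested exposures.
bonferroni_threshold <- alpha / length(p_values)

# Benjamini-Hochberg: adjust the p-values, then compare them with alpha.
p_bh <- p.adjust(p_values, method = "BH")

summary_table <- data.frame(
  p_value    = p_values,
  nominal    = p_values < alpha,
  bonferroni = p_values < bonferroni_threshold,
  fdr_bh     = p_bh < alpha
)
print(summary_table)
```

With these eight p-values, the Bonferroni threshold is 0.05 / 8 = 0.00625. Pressure fails all three criteria, triceps passes the nominal and Benjamini-Hochberg criteria but not Bonferroni, and the six other exposures pass all three.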