1 Introduction

Genes are pleiotropic and getting a better knowledge of their function requires a comprehensive characterization of their mutants. Within the ProMetIS project, a multi-level dataset has been generated, which combines phenomic, proteomic and metabolomic acquisitions from liver tissues of a mouse knock-out model lacking the Lat (linker for activation of T cells) gene (Imbert et al. 2021). The data and processing code are publicly available in the ProMetIS R package to ensure accessibility, interoperability, and reusability.

Here, we propose to perform a statistical analysis of this dataset and characterize the impact of the deletion of the Lat gene.

2 Hands-on

2.1 Options

Sys.setlocale("LC_CTYPE", "French")

## Warning in Sys.setlocale("LC_CTYPE", "French"): using locale code page other
## than 65001 ("UTF-8") may cause problems

## [1] "French_France.1252"

2.2 Loading the dataset

The multi-omics dataset is stored in the ProMetIS package as three tabulated files (dataMatrix.tsv, sampleMetadata.tsv, and variableMetadata.tsv) for each of the 9 datasets (see the Appendix section below):

lat_mset_dir.c <- ProMetIS::aggregated_dir.c("LAT")

The whole multi-omics dataset can be loaded as a MultiDataSet object (Hernandez-Ferrer et al. 2016), by using the phenomis::reading method from the phenomis package (Imbert et al. 2021):

lat.mset <- phenomis::reading(lat_mset_dir.c, report.c = "none", output.c = "set")

Let us view the content of this multi-dataset:

lat.mset

## Object of class 'MultiDataSet'
##  . assayData: 9 elements
##     . metabolomics_liver_c18hypersil_pos: 5665 features, 28 samples 
##     . metabolomics_liver_hilic_neg: 2866 features, 28 samples 
##     . metabolomics_plasma_c18acquity_neg: 1584 features, 28 samples 
##     . metabolomics_plasma_c18acquity_pos: 6104 features, 28 samples 
##     . metabolomics_plasma_c18hypersil_pos: 4788 features, 28 samples 
##     . metabolomics_plasma_hilic_neg: 3131 features, 28 samples 
##     . preclinical: 167 features, 28 samples 
##     . proteomics_liver: 2098 features, 28 samples 
##     . proteomics_plasma: 419 features, 24 samples 
##  . featureData:
##     . metabolomics_liver_c18hypersil_pos: 5665 rows, 34 cols (set, ..., biosigner)
##     . metabolomics_liver_hilic_neg: 2866 rows, 34 cols (set, ..., biosigner)
##     . metabolomics_plasma_c18acquity_neg: 1584 rows, 24 cols (set, ..., biosigner)
##     . metabolomics_plasma_c18acquity_pos: 6104 rows, 24 cols (set, ..., biosigner)
##     . metabolomics_plasma_c18hypersil_pos: 4788 rows, 34 cols (set, ..., biosigner)
##     . metabolomics_plasma_hilic_neg: 3131 rows, 34 cols (set, ..., biosigner)
##     . preclinical: 167 rows, 15 cols (set, ..., biosigner)
##     . proteomics_liver: 2098 rows, 14 cols (set, ..., biosigner)
##     . proteomics_plasma: 419 rows, 14 cols (set, ..., biosigner)
##  . rowRanges:
##     . metabolomics_liver_c18hypersil_pos: NO
##     . metabolomics_liver_hilic_neg: NO
##     . metabolomics_plasma_c18acquity_neg: NO
##     . metabolomics_plasma_c18acquity_pos: NO
##     . metabolomics_plasma_c18hypersil_pos: NO
##     . metabolomics_plasma_hilic_neg: NO
##     . preclinical: NO
##     . proteomics_liver: NO
##     . proteomics_plasma: NO
##  . phenoData:
##     . metabolomics_liver_c18hypersil_pos: 28 samples, 4 cols (gene, ..., sex)
##     . metabolomics_liver_hilic_neg: 28 samples, 4 cols (gene, ..., sex)
##     . metabolomics_plasma_c18acquity_neg: 28 samples, 4 cols (gene, ..., sex)
##     . metabolomics_plasma_c18acquity_pos: 28 samples, 4 cols (gene, ..., sex)
##     . metabolomics_plasma_c18hypersil_pos: 28 samples, 4 cols (gene, ..., sex)
##     . metabolomics_plasma_hilic_neg: 28 samples, 4 cols (gene, ..., sex)
##     . preclinical: 28 samples, 10 cols (gene, ..., impc_phenotype)
##     . proteomics_liver: 28 samples, 4 cols (gene, ..., sex)
##     . proteomics_plasma: 24 samples, 4 cols (gene, ..., sex)

Questions:

How many datasets are included?

How many technologies?

How many tissues?

How many samples?

How many features?

We now restrict the dataset to the liver samples:

# subsetting the MultiDataSet to the datasets containing 'liver' in their name
latliv.mset <- lat.mset[, grep("liver", names(lat.mset), value = TRUE)]

and to the annotated metabolomics features only:

# for each dataset
for (set.c in grep("metabolomics", names(latliv.mset), value = TRUE)) {

  # extracting the ExpressionSet
  eset <- latliv.mset[[set.c]]

  # restricting to the features with a name
  eset <- eset[Biobase::fData(eset)[, "name"] != "", ]

  # updating the MultiDataSet
  stopifnot(methods::validObject(eset))

  latliv.mset <- MultiDataSet::add_eset(latliv.mset,
                                        eset,
                                        dataset.type = set.c,
                                        GRanges = NA,
                                        overwrite = TRUE,
                                        warnings = FALSE)

}

We finally rename the datasets ‘proteo’, ‘metabo_c18’ and ‘metabo_hilic’:

latliv_oldname.mset <- latliv.mset

# creating a new MultiDataSet to store the renamed ExpressionSets
latliv.mset <- MultiDataSet::createMultiDataSet()

for (set.c in names(latliv_oldname.mset)) {

  # extracting the ExpressionSet
  eset <- latliv_oldname.mset[[set.c]]

  # including the renamed ExpressionSet
  latliv.mset <- MultiDataSet::add_eset(latliv.mset,
                                        eset,
                                        dataset.type = switch(set.c,
                                                              proteomics_liver = "proteo",
                                                              metabolomics_liver_c18hypersil_pos = "metabo_c18",
                                                              metabolomics_liver_hilic_neg = "metabo_hilic"),
                                        GRanges = NA,
                                        overwrite = TRUE,
                                        warnings = FALSE)

}
# re-ordering the datasets
latliv.mset <- latliv.mset[, c("proteo", "metabo_c18", "metabo_hilic")]

We have now the final multi-dataset which we will study in the rest of the session. Let us first have a look at the summary:

latliv.mset

## Object of class 'MultiDataSet'
##  . assayData: 3 elements
##     . proteo: 2098 features, 28 samples 
##     . metabo_c18: 138 features, 28 samples 
##     . metabo_hilic: 199 features, 28 samples 
##  . featureData:
##     . proteo: 2098 rows, 14 cols (set, ..., biosigner)
##     . metabo_c18: 138 rows, 34 cols (set, ..., biosigner)
##     . metabo_hilic: 199 rows, 34 cols (set, ..., biosigner)
##  . rowRanges:
##     . proteo: NO
##     . metabo_c18: NO
##     . metabo_hilic: NO
##  . phenoData:
##     . proteo: 28 samples, 4 cols (gene, ..., sex)
##     . metabo_c18: 28 samples, 4 cols (gene, ..., sex)
##     . metabo_hilic: 28 samples, 4 cols (gene, ..., sex)

Questions:

How many datasets are included?

How many technologies?

How many tissues?

How many samples?

How many features?

Note that the phenomis::inspecting method may be used to restrict the datasets to common samples only:

latliv.mset <- MultiDataSet::commonSamples(latliv.mset)

Finally, we extract the vectors of the our two factors of interest: the ‘gene’ (‘LAT’ or ‘WT’) and ‘sex’ (‘M’ or ‘F’)

# getting one of the sampleMetadata (e.g. the one from the 'proteo' dataset)
pdata.df <- Biobase::pData(latliv.mset[["proteo"]])
# getting the two vectors
gene.vc <- pdata.df[, "gene"]
print(table(gene.vc))

## gene.vc
## LAT  WT 
##  13  15

sex.vc <- pdata.df[, "sex"]
print(table(sex.vc))

## sex.vc
##  F  M 
## 14 14

2.3 Single-omics data exploration

2.3.1 Overview

The phenomis::inspecting method from the phenomis package applied to a MultiDataSet (or ExpressionSet) provides a numerical and graphical summary of each dataset:

latliv.mset <- phenomis::inspecting(latliv.mset, report.c = "none")

Questions:

What shows each graphic?

How many missing values are present in the datasets?

How many zero values are present in the datasets?

Have the intensities been log transformed?

Are there outliers?

2.3.2 Principal Component Analysis (PCA)

We continue our exploration of the individual datasets by focusing on the Principal Component Analysis.

The ropls::opls method from the ropls package enables to build the PCA models for each dataset from a MultiDataSet object (Thévenot et al. 2015):

latliv.pca <- ropls::opls(latliv.mset, info.txtC = "none")

Questions:

For each dataset, how many components contribute to more than 10% of the total variance?

Let us study if a separation between the genotypes is visible on the score-plot:

ropls::plot(latliv.pca, typeVc = "x-score",
            parAsColFcVn = gene.vc)

Questions:

Are the samples from the two genotypes clustered on the score-plot?

Which omic approach gives the best separation?

What happens with the protein data?

Is there any separation according to ‘sex’?

The MultiDataSet object with new columns for the PCA scores (respectively, loadings) included in the sampleMetadata (respectively, variableMetadata) can be extracted from the PCA model object with the ropls::getMset method:

latliv.mset <- ropls::getMset(latliv.pca)

2.3.3 Univariate Hypothesis Testing

We now would like to assess how much information about the gene is contained in each block. We therefore perform a univariate analysis. Since we have observed that there is also a variation according to sex in the datasets, we want to test the effects of gene, sex and their interaction simultaneously.

Questions:

What are the tests available with the phenomis::hypotesting method?

Which one would you suggest?

What kind of correction for multiple testing are available? What are the default method and threshold?

We now perform a 2-way ANOVA with interaction, with the default Benjamini and Hochberg control of the False Discovery Rate at the 5% threshold (Benjamini and Hochberg 1995).

latliv.mset <- phenomis::hypotesting(latliv.mset,
                                     factor_names.vc = c("gene", "sex"),
                                     test.c = "anova2waysInter",
                                     report.c = "none")

Questions:

Are there features with significant difference between means according to the genotype in all datasets?

Are there features with significant difference between means according to the sex in all datasets?

Are there interactions between genotype and sex?

What would be the result of a single Student’s t-test of a significant difference between means according to the genotype?

2.3.4 Partial Least Squares - Discriminant Analysis (PLS-DA)

Partial Least Squares - Discriminant Analysis is a machine learning approach which is popular in the field of multivariate analysis of mass spectrometry data (Trygg, Holmes, and Lundstedt 2007). It is a latent variable approach based on the maximisation of the covariance between the components and the response (Wold, Sjöström, and Eriksson 2001).

The ropls::opls method from the ropls package enables to build PLS models by specifying the response as the y parameter:

latliv.plsda <- ropls::opls(latliv.mset,
                            y = "gene",
                            info.txtC = "none")

Questions:

How many components were computed?

What does the top right graphic tell you? Are the models significant?

What is the predictive performance of the models?

Are the results in agreement with the univariate analysis?

The MultiDataSet object with new columns for the PLS-DA scores (respectively, Variable Importance in Projection values) included in the sampleMetadata (respectively, variableMetadata) can be extracted from the PLS-DA model object with the ropls::getMset method:

latliv.mset <- ropls::getMset(latliv.plsda)

Note: An Orthogonal Partial Least Squares - Discriminant Analysis (OPLS-DA) modeling can be performed with the method by setting the orthI argument to NA (instead of the 0 default value)

latliv.oplsda <- ropls::opls(latliv.mset,
                             y = "gene",
                             orthoI = NA,
                             info.txtC = "none")

## Warning: OPLS: number of predictive components ('predI' argument) set to 1

## Warning: OPLS: number of predictive components ('predI' argument) set to 1

## Warning: OPLS: number of predictive components ('predI' argument) set to 1

2.4 Multi-omics data integration

Many statistical approaches are available for integrative analysis of multi-omics datasets (Pierre-Jean et al. 2019)(Huang, Chaudhary, and Garmire 2017)(Cantini et al. 2021). Here we provide two examples of (unsupervised) multiple co-inertia analysis (Meng et al. 2014) and supervised multi-block PLS-DA (Rohart et al. 2017).

2.4.1 Multiple co-intertia analysis (MCIA)

Multiple Co-Inertia Analysis (MCIA) is a multi-omics exploratory data analysis technique (Meng et al. 2016). The datasets are projected into the same dimensional space by defining both ‘global’ and ‘block-specific’ scores (and loadings), and maximizing the sum squared covariance between them (Meng et al. 2014).

2.4.1.1 Building the model

Exporting the data matrices:

latliv_mn.ls <- MultiDataSet::as.list(latliv.mset)

Performing MCIA:

latliv.mcia <- omicade4::mcia(latliv_mn.ls)

2.4.1.2 Interpreting the results

Plotting the results (samples are colored by genotype):

require(ade4)

## Le chargement a nécessité le package : ade4

omicade4::plot.mcia(latliv.mcia, axes = 1:2,
                    phenovec = gene.vc,
                    sample.lab = FALSE)

Questions:

What is shown by each graphic?

Do you see any clustering of the two genotypes?

What about performing the same analysis with the full metabolomics datasets (i.e., including also the non-annotated variables?)

# restricting the full LAT MultiDataSet to the liver datasets
latliv_full.mset <- lat.mset[, grep("liver", names(lat.mset), value = TRUE)]

# renaming the 3 datasets
latliv_full_oldname.mset <- latliv_full.mset
latliv_full.mset <- MultiDataSet::createMultiDataSet()
for (set.c in names(latliv_full_oldname.mset)) {
  eset <- latliv_full_oldname.mset[[set.c]]
  latliv_full.mset <- MultiDataSet::add_eset(latliv_full.mset,
                                        eset,
                                        dataset.type = switch(set.c,
                                                              proteomics_liver = "proteo",
                                                              metabolomics_liver_c18hypersil_pos = "metabo_c18",
                                                              metabolomics_liver_hilic_neg = "metabo_hilic"),
                                        GRanges = NA,
                                        overwrite = TRUE,
                                        warnings = FALSE)

}

# re-ordering the datasets
latliv_full.mset <- latliv_full.mset[, c("proteo", "metabo_c18", "metabo_hilic")]

# MCIA modeling
latliv_full_mn.ls <- MultiDataSet::as.list(latliv_full.mset)

latliv_full.mcia <- omicade4::mcia(latliv_full_mn.ls)

# MCIA results
require(ade4)
omicade4::plot.mcia(latliv_full.mcia, axes = 1:2,
                    phenovec = gene.vc,
                    sample.lab = FALSE)

# Note: the variables most contributing to the (second) axis may be viewed with:
# omicade4::selectVar(latliv_full.mcia,
#                     a1.lim = c(-Inf, Inf), a2.lim = c(1, Inf))

Questions:

Conclusion?

What about coloring the sample plot according to ‘sex’?

omicade4:::splot.sample.mcia(latliv_full.mcia,
                    phenovec = sex.vc,
                    sample.lab = FALSE)

2.4.2 Multi-block sparse PLS-DA

The Data Integration Analysis for Biomarker discovery using a Latent component method for Omics studies (DIABLO) method extends sparse PLS-Discriminant Analysis to multi-omics analyses (Singh et al. 2019). It is based on sparse generalized canonical correlation analysis (sGCCA) (Tenenhaus et al. 2014). sGCCA is a multivariate dimension reduction technique that uses singular value decomposition and selects co-expressed (correlated) variables from several omics datasets. sGCCA maximizes the covariance between linear combinations of variables (latent component scores). The selection of the correlated molecules across omics levels is performed internally in sGCCA with L1–penalization on the variable coefficient vector defining the linear combinations. To extend sGCCA for a classification framework, one omics dataset is substituted with a dummy indicator matrix Y. In addition, the L1 penalty parameter was replaced by the number of variables to select in each dataset and each component (Singh et al. 2019). The mixOmics::block.splsda is available in the mixOmics package Rohart et al. (2017).

2.4.2.1 Building the model

We first extract the list of (transposed) data matrices:

exprs.ls <- MultiDataSet::as.list(latliv.mset)
texprs.ls <- lapply(exprs.ls, t)

We build the design matrix defining the link between the blocks. The weights were set to 0.1 between the blocks of omics and to 1 between the blocks of omics and the response block.

design.mn <- matrix(0.1,
                    ncol = length(exprs.ls),
                    nrow = length(exprs.ls), 
                    dimnames = list(names(exprs.ls),
                                    names(exprs.ls)))
diag(design.mn) <- 0

We next determine the optimal number of components:

latliv_5comp.blocksplsda <- mixOmics::block.splsda(X = texprs.ls,
                                                   Y = factor(gene.vc),
                                                   design = design.mn,
                                                   ncomp = 5)

## Design matrix has changed to include Y; each block will be
##             linked to Y.

latliv_perf.blocksplsda <- mixOmics::perf(latliv_5comp.blocksplsda,
                                          validation = 'Mfold', folds = 5, nrepeat = 10)
plot(latliv_perf.blocksplsda)

blocksplsda_comp.i <- latliv_perf.blocksplsda$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]

message("Number of selected components: ", blocksplsda_comp.i)

## Number of selected components: 2

Questions:

How many components would you select?

We now specify how many features are to be selected from each dataset for the 2 components by using cross-validation on a grid of values. Note: the code below will not be run in this session for the sake of time.

# grid for variable selection
varsel_grid.ls <- list(proteo = c(seq(20, 100, 10)),
                       metabo_c18 = c(seq(10, 60, 10)),
                       metabo_hilic = c(seq(10, 60, 10)))

tune.blocksplsda = mixOmics::tune.block.splsda(X = texprs.ls,
                                  Y = factor(gene.vc),
                                  design = design.mn,
                                  ncomp = blocksplsda_comp.i, 
                                  test.keepX = varsel_grid.ls,
                                  validation = 'Mfold',folds=5, nrepeat=50,
                                  cpus = cpu, dist = "centroids.dist")

# optimum
varsel.ls = tune.blocksplsda$choice.keepX

# number of selected for each block and components: 
varsel.ls

varsel.ls <- list(proteo = c(20, 20),
                  metabo_c18 = c(10, 20),
                  metabo_hilic = c(60, 60))

We can now build the model:

latliv.blocksplsda <- mixOmics::block.splsda(X = texprs.ls,
                                             Y = factor(gene.vc),
                                             design = design.mn,
                                             scheme = 'horst',
                                             keepX = varsel.ls)

2.4.2.2 Interpreting the results

2.4.2.2.1 Within blocks

We first look at the individual blocks.

2.4.2.2.1.1 Samples

mixOmics::plotIndiv(latliv.blocksplsda,
                    ind.names = TRUE, 
                    legend = TRUE,
                    ellipse = TRUE,
                    col.per.group = ProMetIS::palette.vc()[c("LAT", "WT")])

Questions:

How do you interpret the results?

2.4.2.2.1.2 Variables

mixOmics::plotLoadings(latliv.blocksplsda,
                       comp = 1,
                       legend.color = ProMetIS::palette.vc()[c("LAT", "WT")],
                       contrib = 'max',
                       method = 'median')

# lapply(within(mixOmics::selectVar(latliv.blocksplsda), rm(comp)),
#        function(block_sel.ls) block_sel.ls[["name"]])

Questions:

How do you interpret the results?

2.4.2.2.2 Between blocks

Let us look at the relationships between the block components.

2.4.2.2.2.1 Samples

mixOmics::plotDiablo(latliv.blocksplsda)

Questions:

How do you interpret the results?

2.4.2.2.2.2 Variables

mixOmics::circosPlot(latliv.blocksplsda,
                     cutoff = 0.7,
                     line = TRUE,
                     comp = 1,
                     color.Y = ProMetIS::palette.vc()[c("LAT", "WT")])

Questions:

How do you interpret the results?

latliv.network <- mixOmics::network(latliv.blocksplsda,
                                    comp = list(proteo = 1,
                                                metabo_c18 = 1,
                                                metabo_hilic = 1),
                                    shape.node = rep("circle", 3),
                                    lty.edge = "solid",
                                    lwd.edge = 3,
                                    color.edge = mixOmics::color.spectral(50),
                                    blocks = c(1, 2, 3), cutoff = 0.75)

latliv.graph <- latliv.network$gR

latliv.graph$layout <- igraph::layout.fruchterman.reingold(latliv.graph,
                                                weights = (1 - abs(igraph::E(latliv.graph)$weight)))
igraph::V(latliv.graph)$size <- 5

plot(latliv.graph)

# knitr::kable(names(igraph::V(latliv.graph)))

Questions:

What biological interpretation(s) would you suggest?

2.4.2.2.2.3 Samples and variables

heatmap.ls <- mixOmics::cimDiablo(latliv.blocksplsda, margins = c(10, 20),
                    color.Y = ProMetIS::palette.vc()[c("LAT", "WT")],
                    transpose = TRUE, comp = 1,
                    size.legend = 0.6,  legend.position = "right",
                    dist.method = c("correlation", "correlation"))

## 
## trimming values to [-3, 3] range for cim visualisation. See 'trim' arg in ?cimDiablo

Questions:

What kind of clustering is used?

What additional information is provided by this graphic?

2.4.2.2.3 Model Performance

perf.blocksplsda <- mixOmics::perf(latliv.blocksplsda, validation = 'Mfold',
                                   folds = 5, nrepeat = 10, 
                                   dist = 'centroids.dist', auc = TRUE)
plot(perf.blocksplsda)

message("Error rate:")

## Error rate:

print(perf.blocksplsda$error.rate)

## $proteo
##       centroids.dist
## comp1      0.1428571
## comp2      0.1607143
## 
## $metabo_c18
##       centroids.dist
## comp1     0.08214286
## comp2     0.06428571
## 
## $metabo_hilic
##       centroids.dist
## comp1    0.017857143
## comp2    0.003571429

message("AUC:")

## AUC:

print(perf.blocksplsda$auc)

## $comp1
##                 AUC     p-value
## LAT vs WT 0.9993493 6.69088e-05
## 
## $comp2
##                 AUC      p-value
## LAT vs WT 0.9999447 6.529187e-05

2.4.2.3 Including the variable selection results into the MultiDataSet

for (set.c in names(latliv.mset)) {
  
  eset <- latliv.mset[[set.c]]
  
  fdata.df <- Biobase::fData(eset)
    
  loadings.ls <- latliv.blocksplsda[["loadings"]]
  
  loadings.vn <- loadings.ls[[set.c]][, "comp1"]
  
  loadings.vn <- loadings.vn[abs(loadings.vn) > 0]
  
  mixomics.vn <- numeric(nrow(fdata.df))
  names(mixomics.vn) <- rownames(fdata.df)
  
  stopifnot(all(names(loadings.vn) %in% names(mixomics.vn)))
  
  mixomics.vn[names(loadings.vn)] <- loadings.vn
  
  fdata.df[, "mixOmics"] <- mixomics.vn
  
  Biobase::fData(eset) <- fdata.df
  
  stopifnot(methods::validObject(eset))
  
  latliv.mset <- MultiDataSet::add_eset(latliv.mset,
                                        eset,
                                        dataset.type = set.c,
                                        GRanges = NA,
                                        overwrite = TRUE,
                                        warnings = FALSE)
  
}

2.5 Saving the results

phenomis::writing(latliv.mset, dir.c = "my_preferred_directory")

3 Appendix

3.1 3 tabular file format used for import/export

Each dataset consists of:

a numeric matrix of intensities (dataMatrix)
a data frame of sample metadata (sampleMetadata)
a data frame of variable metadata (variableMetadata)

Theses 3 tables can be conveniently imported to/exported from R as tabular files:

3.1.1 ExpressionSet class for a single dataset

The following table describes useful Biobase methods for the handling of ExpressionSet objects (see the ‘An introduction to Biobase and ExpressionSets’ documentation from the Biobase package):

Biobase methods	Description
dims(eset)	2-row numeric matrix of ‘Features’ and ‘Samples’ dimensions
exprs(eset)	‘variable times samples’ numeric matrix - dataMatrix
pData(eset)	sample metadata data frame - sampleMetadata
fData(eset)	variable metadata data frame - variableMetadata
sampleNames(eset)	sample names
featureNames(eset)	variable names
varLabels(eset)	Column names of the sampleMetadata, pData(eset)
fvarLabels(eset)	Column names of the variableMetadata, fData(eset)

Creating an ExpressionSet with Biobase::ExpressionSet:

lat_dir.c <- ProMetIS::aggregated_dir.c("LAT")
latlivprot_dir.c <- file.path(lat_dir.c, "proteomics_liver")
dataMatrix.mn <- as.matrix(read.table(file.path(latlivprot_dir.c, "dataMatrix.tsv"),
                                      header = TRUE,
                                      row.names = 1,
                                      sep = "\t",
                                      comment.char = "",
                                      quote = "\""))
sampleMetadata.df <- read.table(file.path(latlivprot_dir.c, "sampleMetadata.tsv"),
                                header = TRUE,
                                row.names = 1,
                                sep = "\t",
                                comment.char = "",
                                quote = "\"")
variableMetadata.df <- read.table(file.path(latlivprot_dir.c, "variableMetadata.tsv"),
                                  header = TRUE,
                                  row.names = 1,
                                  sep = "\t",
                                  comment.char = "",
                                  quote = "\"")
latlivprot.eset <- Biobase::ExpressionSet(assayData = dataMatrix.mn,
                                          phenoData = Biobase::AnnotatedDataFrame(data = sampleMetadata.df),
                                          featureData = Biobase::AnnotatedDataFrame(data = variableMetadata.df),
                                          experimentData = Biobase::MIAME(title = "lat_proteomics_liver"))

which is equivalent to:

latlivprot.eset <- phenomis::reading(latlivprot_dir.c, output.c = "set")

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 2098 features, 28 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: W616f W617f ... L862f (28 total)
##   varLabels: gene mouse_id sex id
##   varMetadata: labelDescription
## featureData
##   featureNames: Q8R1I1_Cytochrome b-c1 complex . P62852_40S ribosomal
##     protein S25 ... Q61553_Fascin (2098 total)
##   fvarLabels: set gene ... mixomics (14 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

Brackets may be used to restrict the ExpressionSet to specific sample or features:

wt.eset <- latlivprot.eset[, grep("W", Biobase::sampleNames(latlivprot.eset))]
Biobase::dims(wt.eset)

##          exprs
## Features  2098
## Samples     15

signif.eset <- latlivprot.eset[Biobase::fData(latlivprot.eset)[, "WT.KO_signif"] > 0, ]
Biobase::dims(signif.eset)

##          exprs
## Features   275
## Samples     28

Exporting the ExpressionSet in the 3 table format:

phenomis::writing(latlivprot.eset, dir.c = "your_prefered_directory")

3.1.2 MultiDataSet class for multiple datasets

The following table describes useful MultiDataSet (and Biobase) methods for the handling of MultiDataSet objects (Hernandez-Ferrer et al. 2016):

MultiDataSet and Biobase methods	Description
MultiDataSet::length(mset)	number of datasets
MultiDataSet::names(eset)	names of the datasets
Biobase::dims(mset)	list of 2-element numeric named (‘Features’ and ‘Samples’) vectors of data set dimensions
MultiDataSet::commonSamples(mset)	multi-dataset restricted to the common samples
MultiDataSet::as.list(mset)	list of data matrices
Biobase::pData(mset)	list of sample metadata data frames
Biobase::fData(mset)	list of variable metadata data frames
BiocGenerics::subset(mset)	subsetting on common columns from the sampleMetadata and/or variableMetadata

Creating a MultiDataSet with MultiDataSet::createMultiDataSet:

lat_dir.c <- ProMetIS::aggregated_dir.c("LAT")
proteo.eset <- phenomis::reading(file.path(lat_dir.c, "proteomics_liver"),
                                 report.c = "none", output.c = "set")
metabo_c18.eset <- phenomis::reading(file.path(lat_dir.c, "metabolomics_liver_c18hypersil_pos"),
                                     report.c = "none", output.c = "set")
metabo_hilic.eset <- phenomis::reading(file.path(lat_dir.c, "metabolomics_liver_hilic_neg"),
                                       report.c = "none", output.c = "set")

latliv.mset <- MultiDataSet::createMultiDataSet()
latliv.mset <- MultiDataSet::add_eset(latliv.mset,
                                      proteo.eset,
                                      dataset.type = "proteomics_liver",
                                      GRanges = NA)
latliv.mset <- MultiDataSet::add_eset(latliv.mset,
                                      metabo_c18.eset,
                                      dataset.type = "metabolomics_liver_c18hypersil_pos",
                                      GRanges = NA)
latliv.mset <- MultiDataSet::add_eset(latliv.mset,
                                      metabo_hilic.eset,
                                      dataset.type = "metabolomics_liver_hilic_neg",
                                      GRanges = NA)

which is equivalent to:

latliv.mset <- phenomis::reading(lat_dir.c,
                                 subsets.vc = c("proteomics_liver",
                                                "metabolomics_liver_c18hypersil_pos",
                                                "metabolomics_liver_hilic_neg"),
                                 report.c = "none",
                                 output.c = "set")

Brackets may be used to restrict the MultiDataSet to specific samples or datasets:

wt.mset <- latliv.mset[grep("W", Biobase::sampleNames(latliv.mset)[["proteomics_liver"]], value = TRUE), ]
Biobase::dims(wt.mset)

## $proteomics_liver
## Features  Samples 
##     2098       15 
## 
## $metabolomics_liver_c18hypersil_pos
## Features  Samples 
##     5665       15 
## 
## $metabolomics_liver_hilic_neg
## Features  Samples 
##     2866       15

metabo.mset <- latliv.mset[, c("metabolomics_liver_c18hypersil_pos", "metabolomics_liver_hilic_neg")]
names(metabo.mset)

## [1] "metabolomics_liver_c18hypersil_pos" "metabolomics_liver_hilic_neg"

Note: In case of the restriction to a single dataset, the single bracket (as shown above) will outpout a MultiDataSet object. To get an ExpressionSet, either add the drop = TRUE argument or use the double brackets:

proteo.eset <- latliv.mset[["proteomics_liver"]]

Subsetting based on (common) columns from the sampleMetadata and/or variableMetadata may be performed with BiocGenerics::subset:

female.mset <- BiocGenerics::subset(latliv.mset, , sex == "F")
# or
signif.mset <- BiocGenerics::subset(latliv.mset, WT.KO_signif > 0, )

Exporting the MultiDataSet into the 3 table format (each dataset will be written as 3 tables in a specific subfolder):

phenomis::writing(latliv.mset, dir.c = "your_prefered_directory")

References

Benjamini, Y., and Y. Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach for Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological) 57: 289–300. http://www.jstor.org/stable/2346101.

Cantini, Laura, Pooya Zakeri, Celine Hernandez, Aurelien Naldi, Denis Thieffry, Elisabeth Remy, and Anaïs Baudot. 2021. “Benchmarking Joint Multi-Omics Dimensionality Reduction Approaches for the Study of Cancer.” Nature Communications 12 (1). https://doi.org/10.1038/s41467-020-20430-7.

Hernandez-Ferrer, Carles, Carlos Ruiz-Arenas, Alba Beltran-Gomila, and Juan R. Gonzalez. 2016. “MultiDataSet: An R Package for Encapsulating Multiple Data Sets with Application to Omic Data Integration.” BMC Bioinformatics 18 (PMC5240259): 36. https://doi.org/10.1186/s12859-016-1455-1.

Huang, Sijia, Kumardeep Chaudhary, and Lana X. Garmire. 2017. “More Is Better: Recent Progress in Multi-Omics Data Integration Methods.” Frontiers in Genetics 8 (June). https://doi.org/10.3389/fgene.2017.00084.

Imbert, Alyssa, Magali Rompais, Mohammed Selloum, Florence Castelli, Emmanuelle Mouton-Barbosa, Marion Brandolini-Bunlon, Emeline Chu-Van, et al. 2021. “ProMetIS, Deep Phenotyping of Mouse Models by Combined Proteomics and Metabolomics Analysis.” Scientific Data 8 (1): 311. https://doi.org/10.1038/s41597-021-01095-3.

Meng, Chen, Bernhard Kuster, Aedı́n C Culhane, and Amin Gholami. 2014. “A Multivariate Approach to the Integration of Multi-Omics Datasets.” BMC Bioinformatics 15 (1): 162. https://doi.org/10.1186/1471-2105-15-162.

Meng, Chen, Oana A. Zeleznik, Gerhard G. Thallinger, Bernhard Kuster, Amin M. Gholami, and Aedı́n C. Culhane. 2016. “Dimension Reduction Techniques for the Integrative Analysis of Multi-Omics Data.” Briefings in Bioinformatics 17 (4): 628–41. https://doi.org/10.1093/bib/bbv108.

Pierre-Jean, Morgane, Jean-François Deleuze, Edith Le Floch, and Florence Mauger. 2019. “Clustering and Variable Selection Evaluation of 13 Unsupervised Methods for Multi-Omics Data Integration.” Briefings in Bioinformatics, December. https://doi.org/10.1093/bib/bbz138.

Rohart, Florian, Benoit Gautier, Amrit Singh, and Kim-Anh Le Cao. 2017. “mixOmics: An r Package for Omics Feature Selection and Multiple Data Integration.” Edited by Dina Schneidman. PLOS Computational Biology 13 (11): e1005752. https://doi.org/10.1371/journal.pcbi.1005752.

Singh, Amrit, Casey P Shannon, Benoit Gautier, Florian Rohart, Michaël Vacher, Scott J Tebbutt, and Kim-Anh Le Cao. 2019. “DIABLO: An Integrative Approach for Identifying Key Molecular Drivers from Multi-Omics Assays.” Edited by Inanc Birol. Bioinformatics 35 (17): 3055–62. https://doi.org/10.1093/bioinformatics/bty1054.

Tenenhaus, A., C. Philippe, V. Guillemot, K.-A. Le Cao, J. Grill, and V. Frouin. 2014. “Variable Selection for Generalized Canonical Correlation Analysis.” Biostatistics 15 (3): 569–83. https://doi.org/10.1093/biostatistics/kxu001.

Thévenot, Etienne A., Aurélie Roux, Ying Xu, Eric Ezan, and Christophe Junot. 2015. “Analysis of the Human Adult Urinary Metabolome Variations with Age, Body Mass Index and Gender by Implementing a Comprehensive Workflow for Univariate and OPLS Statistical Analyses.” Journal of Proteome Research 14 (8): 3322–35. https://doi.org/10.1021/acs.jproteome.5b00354.

Trygg, Johan, Elaine Holmes, and Torbjörn Lundstedt. 2007. “Chemometrics in Metabonomics.” Journal of Proteome Research 6 (2): 469–79. https://doi.org/10.1021/pr060594q.

Wold, Svante, Michael Sjöström, and Lennart Eriksson. 2001. “PLS-Regression: A Basic Tool of Chemometrics.” Chemometrics and Intelligent Laboratory Systems 58 (2): 109–30. https://doi.org/https://doi.org/10.1016/S0169-7439(01)00155-1.

ProMetIS: proteomics and metabolomics data integration

21 janvier 2023

1 Introduction

2 Hands-on

2.1 Options

2.2 Loading the dataset

2.3 Single-omics data exploration

2.3.1 Overview

2.3.2 Principal Component Analysis (PCA)

2.3.3 Univariate Hypothesis Testing

2.3.4 Partial Least Squares - Discriminant Analysis (PLS-DA)

2.4 Multi-omics data integration

2.4.1 Multiple co-intertia analysis (MCIA)

2.4.1.1 Building the model

2.4.1.2 Interpreting the results

2.4.2 Multi-block sparse PLS-DA

2.4.2.1 Building the model

2.4.2.2 Interpreting the results

2.4.2.2.1 Within blocks

2.4.2.2.1.1 Samples

2.4.2.2.1.2 Variables

2.4.2.2.2 Between blocks

2.4.2.2.2.1 Samples

2.4.2.2.2.2 Variables

2.4.2.2.2.3 Samples and variables

2.4.2.2.3 Model Performance

2.4.2.3 Including the variable selection results into the MultiDataSet

2.5 Saving the results

3 Appendix

3.1 3 tabular file format used for import/export

3.1.1 ExpressionSet class for a single dataset

3.1.2 MultiDataSet class for multiple datasets

References