Practical - exploring omics data

Introduction

The goal of this practical is to combine different statistical approaches to explore an omics dataset (transcriptome, proteome).

graphical representations of the distribution
impact of the normalisation and standardisation
multidimensional scaling with Principal Component Analysis (PCA)
class discovery by clustering

Normalisation and standardisation

We prepared a notebook (in R markdown) detailing all the dirty cooking details to download the datasets from the Zenodo repository, and to run some preprocessing steps

log2 transformation
filtering out of the undetected features (genes, proteins)
median-based sample-wise centering
IQR-based sample-wise scaling

Data loading

We prepared memory images that enable you to reload the preprocessed datasets.

R script to reload data

Instructions

The following tasks will be covered in 2 sessions + a bit of personal work.

Write the solution of each exercise in a separate R script file.
Take care to properly document the code
Once the code is satisfying, write a report in R markdown that will incoroporate the R files, and comment the results.

Alternatively you can immediately write the code inside a R markdown document, as far as it combines a properly documented code and relevant interpretation comments.

Exercises

1. Data loading

Choose one of the datasets and load the memory image

2. Descriptive statistics

Compute sample-wise and feature-wise descriptive statistics

mean,
median,
sd,
var,
IQR,
some relevant percentiles (0, 05, 25, 50 , 75, 95, 100)

Use different graphical representations to compare the values and get familiar with your data. For example :

histogram of all the values
boxplot of the values per sample
feature means versus medians
feature standard deviation versus IQR
mean versus variance plot

Solution by the teachers: [02_descriptive_stats.R]

3. Summary per condition

Compute the mean value per condition (mean between the replicates)
Draw a dot plot to compare the values between each time point and the control.

4. Feature selection

Select the 500 top features according to two different criteria

highest variance
differential analysis (will be provided by the teachers)

5. Comparison metrics

Compute different metrics indicating the relations between samples (columns) of the log2-transformed values

covariance
Pearson correlation + derived distance matrix
Spearman correlation + derived distance matrix
Euclidian distance

Compute these on the commplete dataset and on the 500 matrix selections

Draw graphical representations of the result with corrplot().

Comment the results

differences between metrics
differences between results with all the features and selected features

6. Clustering

Run hierarchical clustering with hclust() to extract clusters from the dataset with selected variables.
- with different dissimlarity metrics (Euclidian, Pearson, Spearman)
- with different agglomeration rules (single, average, complete, ward)
Plot the feature trees and compare the results obtained with the different choices.
Plot heatmpas with the different feature trees obtained before, and compare the results. Inactivate the individual clustering (default in the heatmap).

8. Principal Component Analysis (PCA)

We provide hereby

An R script with the principal component analysis of the expression set: [04_PCA.R]
An R markdown model of scientific report: [YOUR-NAME_REPORT-TOPICS.Rmd]

Write a short report that will integrate some pieces of chunks from the R script in the R markdown notebook, add your interpretation of the results, and compile it as a self-contained HTML file.

9. Functionnal enrichment analysis

analyse the features declared positive (provided by the teachers) to the gProfiler function profiling web tool (https://biit.cs.ut.ee/gprofiler/gost)
write a piece of R code that runs the same anlaysis with the gProfileR package
analyse (with either R or the Web site) the different gene groups discovered by the hclust approach, and evaluate if the results are more relevant when you submit all the differentially expressed features at once, or when you submit them cluster per cluster