Introduction

The goal of this practical is to combine different statistical approaches to explore an omics dataset (transcriptome, proteome):

  • graphical representations of the distribution
  • impact of the normalisation and standardisation
  • multidimensional scaling with Principal Component Analysis (PCA)
  • class discovery by clustering

Normalisation and standardisation

We prepared a notebook (in R markdown) that details all the steps needed to download the datasets from the Zenodo repository and to run the following preprocessing steps:

  • log2 transformation
  • filtering out of the undetected features (genes, proteins)
  • median-based sample-wise centering
  • IQR-based sample-wise scaling
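The preprocessing steps listed above can be sketched as follows on a simulated matrix. This is only an illustration, not the code of the notebook: the object names, the pseudo-count of 1 and the filtering rule (discard features that are zero in every sample) are assumptions.

```r
## Simulated raw data: 200 features (rows) x 6 samples (columns)
set.seed(1)
raw_values <- matrix(rnbinom(200 * 6, mu = 100, size = 1), nrow = 200,
                     dimnames = list(paste0("feature", 1:200),
                                     paste0("sample", 1:6)))
raw_values[1:10, ] <- 0  # simulate 10 undetected features

## 1. log2 transformation (pseudo-count of 1 to avoid log2(0): an assumption)
log2_values <- log2(raw_values + 1)

## 2. Filter out undetected features (assumed rule: zero in all samples)
detected <- rowSums(log2_values > 0) > 0
filtered <- log2_values[detected, ]

## 3. Median-based sample-wise centering
sample_median <- apply(filtered, 2, median)
centered <- sweep(filtered, 2, sample_median, "-")

## 4. IQR-based sample-wise scaling
sample_iqr <- apply(centered, 2, IQR)
standardised <- sweep(centered, 2, sample_iqr, "/")

## After these steps each sample has a median of 0 and an IQR of 1
round(apply(standardised, 2, median), 3)
round(apply(standardised, 2, IQR), 3)
```

Using the median and the IQR rather than the mean and the standard deviation makes the centering and scaling less sensitive to outliers.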

Data loading

We prepared memory images that enable you to reload the preprocessed datasets.

R script to reload data

Instructions

The following tasks will be covered in 2 sessions + a bit of personal work.

  • Write the solution of each exercise in a separate R script file.
  • Take care to properly document the code.
  • Once the code is satisfactory, write a report in R markdown that incorporates the R files and comments on the results.

Alternatively, you can write the code directly inside an R markdown document, as long as it combines properly documented code with relevant interpretation comments.

Exercises

1. Data loading

Choose one of the datasets and load the memory image
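A memory image is reloaded with load(). The sketch below first creates a toy image so that it can be run as is; the file name and the object it contains are hypothetical and do not correspond to the files provided for the practical.

```r
## Create a toy memory image so that the example is runnable
toy_matrix <- matrix(1:6, nrow = 3)
image_file <- file.path(tempdir(), "toy_memory_image.Rdata")  # hypothetical name
save(toy_matrix, file = image_file)
rm(toy_matrix)

## Reload the image; load() returns the names of the restored objects
loaded_names <- load(image_file)
print(loaded_names)

## Check what was restored before going further
dim(toy_matrix)
```

Storing the value returned by load() is a convenient way to find out which objects a memory image contains.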

2. Descriptive statistics

Compute sample-wise and feature-wise descriptive statistics

  • mean,
  • median,
  • sd,
  • var,
  • IQR,
  • some relevant percentiles (0, 5, 25, 50, 75, 95, 100)
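One possible way to compute all these statistics at once is to wrap them in a small function and apply it to the columns (samples) and to the rows (features). The matrix x is simulated and the helper describe() is a hypothetical name chosen for this sketch.

```r
## Simulated data: 100 features (rows) x 6 samples (columns)
set.seed(2)
x <- matrix(rnorm(100 * 6), nrow = 100,
            dimnames = list(paste0("feature", 1:100), paste0("sample", 1:6)))

## Hypothetical helper returning all the statistics for one vector
describe <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), var = var(v),
    IQR = IQR(v),
    quantile(v, probs = c(0, 0.05, 0.25, 0.5, 0.75, 0.95, 1)))
}

sample_stats <- t(apply(x, 2, describe))   # one row per sample
feature_stats <- t(apply(x, 1, describe))  # one row per feature

head(feature_stats)
sample_stats
```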

Use different graphical representations to compare the values and get familiar with your data. For example:

  • histogram of all the values
  • boxplot of the values per sample
  • feature means versus medians
  • feature standard deviation versus IQR
  • mean versus variance plot
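The plots listed above can be drawn with base R graphics. This is a minimal sketch on simulated values (the matrix x is an assumption), meant only to show which function fits each plot.

```r
## Simulated log2-like values: 100 features x 6 samples
set.seed(3)
x <- matrix(rnorm(100 * 6, mean = 8, sd = 2), nrow = 100,
            dimnames = list(paste0("feature", 1:100), paste0("sample", 1:6)))

## Feature-wise statistics
feature_mean <- rowMeans(x)
feature_median <- apply(x, 1, median)
feature_sd <- apply(x, 1, sd)
feature_iqr <- apply(x, 1, IQR)
feature_var <- apply(x, 1, var)

par(mfrow = c(2, 3))
hist(as.vector(x), breaks = 30, main = "All values", xlab = "Value")
boxplot(x, las = 2, main = "Values per sample")
plot(feature_mean, feature_median, main = "Mean vs median")
abline(a = 0, b = 1, col = "grey")  # diagonal: mean equals median
plot(feature_sd, feature_iqr, main = "SD vs IQR")
plot(feature_mean, feature_var, main = "Mean vs variance")
par(mfrow = c(1, 1))
```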

Solution by the teachers: [02_descriptive_stats.R]

3. Summary per condition

  • Compute the mean value per condition (mean between the replicates)
  • Draw a dot plot to compare the values between each time point and the control.
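A possible sketch of this summary, assuming a design with one control and two time points in triplicate. The condition labels and the simulated matrix are hypothetical; with the real data, the conditions come from the sample metadata.

```r
## Hypothetical design: 3 conditions x 3 replicates
set.seed(4)
conditions <- rep(c("control", "t1", "t2"), each = 3)
x <- matrix(rnorm(50 * 9), nrow = 50,
            dimnames = list(paste0("feature", 1:50),
                            paste0(conditions, "_rep", 1:3)))

## Mean per condition: one column per condition, one row per feature
condition_means <- sapply(unique(conditions), function(cond) {
  rowMeans(x[, conditions == cond, drop = FALSE])
})

## Dot plot of each time point against the control
time_points <- setdiff(colnames(condition_means), "control")
par(mfrow = c(1, length(time_points)))
for (tp in time_points) {
  plot(condition_means[, "control"], condition_means[, tp],
       xlab = "control", ylab = tp, main = paste(tp, "vs control"))
  abline(a = 0, b = 1, col = "grey")  # diagonal: no change
}
par(mfrow = c(1, 1))
```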

4. Feature selection

Select the 500 top features according to two different criteria

  1. highest variance
  2. differential analysis (will be provided by the teachers)
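The first criterion (highest variance) can be sketched as follows on a simulated matrix; the object names are assumptions. The second criterion relies on the differential analysis results provided by the teachers and is not covered here.

```r
## Simulated data: 2000 features x 6 samples, each feature with its own spread
set.seed(5)
x <- matrix(rnorm(2000 * 6), nrow = 2000,
            dimnames = list(paste0("feature", 1:2000), paste0("sample", 1:6)))
x <- x * runif(2000, min = 0.5, max = 3)  # row-wise scaling by recycling

## Rank the features by decreasing variance and keep the top 500
feature_var <- apply(x, 1, var)
top_n <- 500
top_features <- names(sort(feature_var, decreasing = TRUE))[1:top_n]
x_top <- x[top_features, ]

dim(x_top)
```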

5. Comparison metrics

Compute different metrics indicating the relations between samples (columns) of the log2-transformed values

  • covariance
  • Pearson correlation + derived distance matrix
  • Spearman correlation + derived distance matrix
  • Euclidean distance

Compute these on the complete dataset and on each of the two selections of 500 features.

Draw graphical representations of the result with corrplot().
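A minimal sketch of these computations on a simulated matrix. Converting a correlation into a distance as 1 - r is one common choice among several and is an assumption here. The corrplot package is only used if it is installed.

```r
## Simulated data: 300 features (rows) x 6 samples (columns)
set.seed(6)
x <- matrix(rnorm(300 * 6), nrow = 300,
            dimnames = list(paste0("feature", 1:300), paste0("sample", 1:6)))

## cov() and cor() compare the columns, i.e. the samples
cov_mat <- cov(x)
pearson <- cor(x, method = "pearson")
spearman <- cor(x, method = "spearman")

## Distances derived from the correlations (assumed conversion: 1 - r)
pearson_dist <- as.dist(1 - pearson)
spearman_dist <- as.dist(1 - spearman)

## dist() compares the rows, hence the transposition
euclidean_dist <- dist(t(x), method = "euclidean")

if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(pearson, method = "color")
  ## is.corr = FALSE is needed for matrices that are not correlations
  corrplot::corrplot(as.matrix(euclidean_dist), is.corr = FALSE,
                     method = "color")
}
```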

Comment on the results:

  • differences between metrics
  • differences between results with all the features and selected features

6. Clustering

  • Run hierarchical clustering with hclust() to extract clusters from the dataset with selected variables.

    • with different dissimilarity metrics (Euclidean, Pearson, Spearman)
    • with different agglomeration rules (single, average, complete, Ward)
  • Plot the feature trees and compare the results obtained with the different choices.

  • Plot heatmaps with the different feature trees obtained before, and compare the results. Disable the clustering of the individuals (samples), which is enabled by default in the heatmap.
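The clustering steps above can be sketched as follows on simulated data with two groups of features. The object names are assumptions, and "ward.D2" is used here as the Ward rule, although hclust() also offers "ward.D".

```r
## Simulated data: two groups of 20 features with opposite profiles
set.seed(7)
profile_a <- c(2, 2, 2, 0, 0, 0)
profile_b <- c(0, 0, 0, 2, 2, 2)
x <- rbind(t(replicate(20, profile_a + rnorm(6, sd = 0.3))),
           t(replicate(20, profile_b + rnorm(6, sd = 0.3))))
dimnames(x) <- list(paste0("feature", 1:40), paste0("sample", 1:6))

## Dissimilarities between features (rows)
feature_dist <- list(
  euclidean = dist(x),
  pearson = as.dist(1 - cor(t(x), method = "pearson")),
  spearman = as.dist(1 - cor(t(x), method = "spearman"))
)
rules <- c("single", "average", "complete", "ward.D2")

## One tree per combination of metric and agglomeration rule
trees <- list()
for (d in names(feature_dist)) {
  for (r in rules) {
    trees[[paste(d, r, sep = "_")]] <- hclust(feature_dist[[d]], method = r)
  }
}

## Plot one of the feature trees
plot(trees[["pearson_average"]], main = "Pearson, average linkage", cex = 0.6)

## Heatmap using this feature tree; Colv = NA disables sample clustering
heatmap(x, Rowv = as.dendrogram(trees[["pearson_average"]]), Colv = NA,
        scale = "none")

## Cut the tree to extract two clusters of features
clusters <- cutree(trees[["pearson_average"]], k = 2)
table(clusters)
```

Keeping all the trees in a named list makes it easy to loop over them when comparing the dendrograms and heatmaps.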

8. Principal Component Analysis (PCA)

We provide hereby

Write a short report that integrates some chunks of code from the R script into the R markdown notebook, add your interpretation of the results, and compile it as a self-contained HTML file.
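For the PCA itself, a minimal sketch with prcomp() on simulated data is given below. The two-condition design and the object names are hypothetical. Since prcomp() expects the individuals in rows, the matrix is transposed so that each sample becomes one point.

```r
## Simulated data: 200 features x 6 samples, 50 features shifted in "treated"
set.seed(8)
conditions <- rep(c("control", "treated"), each = 3)
x <- matrix(rnorm(200 * 6), nrow = 200,
            dimnames = list(paste0("feature", 1:200),
                            paste0(conditions, "_rep", 1:3)))
x[1:50, conditions == "treated"] <- x[1:50, conditions == "treated"] + 3

## PCA on the samples (rows after transposition)
pca <- prcomp(t(x), center = TRUE, scale. = FALSE)

## Proportion of variance explained by each component
explained <- pca$sdev^2 / sum(pca$sdev^2)
barplot(100 * explained, names.arg = paste0("PC", seq_along(explained)),
        ylab = "% of variance")

## Samples on the first two components, coloured by condition
plot(pca$x[, 1], pca$x[, 2], pch = 19,
     col = as.integer(factor(conditions)),
     xlab = "PC1", ylab = "PC2", main = "Samples on PC1 and PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3, cex = 0.7)
```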

9. Functional enrichment analysis

  • submit the features declared positive (provided by the teachers) to the gProfiler functional profiling web tool (https://biit.cs.ut.ee/gprofiler/gost)

  • write a piece of R code that runs the same analysis with the gProfileR package

  • analyse (with either R or the Web site) the different gene groups discovered by the hclust approach, and evaluate if the results are more relevant when you submit all the differentially expressed features at once, or when you submit them cluster per cluster