study-cases

Study cases for the Diplôme Universitaire en Bioinformatique Intégrative (DU-Bii)


TCGA study case


Introduction

This page describes a study case based on data from The Cancer Genome Atlas (TCGA; https://cancergenome.nih.gov/). This dataset contains more than 11,285 samples from patients suffering from a wide variety of cancer types. We use subsets of this large dataset in different courses of the Diplôme Universitaire en Bioinformatique Intégrative (DU-Bii).

The full datasets are available in the NCBI databases (Gene Expression Omnibus, Sequence Read Archive).

For the sake of simplicity, we took advantage of pre-processed data made available by Ron Shamir’s team.

We provide here

Data sources

Publications

TCGA web site

https://cancergenome.nih.gov/

Preprocessed datasets made available by Ron Shamir’s team

Recount2

Data preprocessing

We downloaded the TCGA raw counts from the Recount2 database, and applied the following preprocessing steps:

  1. select the samples belonging to the Breast Invasive Cancer (BIC) study;
  2. define the cancer type (used as class label for supervised classification) based on the three immuno markers;
  3. filter out “undetected” genes, i.e. genes having zero counts in almost all samples;
  4. standardise the library sizes;
  5. log2-transform the counts;
  6. detect differentially expressed genes;
  7. select a reduced subset of genes (likely to be relevant for clustering and supervised classification) by keeping the 1000 genes with the lowest adjusted p-values in the differential expression analysis;
  8. export the different results (raw counts, filtered, normalised, differentially expressed) to TSV files.
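
Steps 3, 4, 5 and 7 of the list above can be illustrated with a minimal pure-Python sketch. It is not the R markdown code used to produce the study-case files, and all function names are hypothetical. Three choices are assumptions, since the steps above do not specify them: the threshold that defines "almost all samples", the use of the median library size as the scaling target, and the pseudo-count of 1 added before the log2 transform. The adjusted p-values are taken as given; the differential expression test itself (step 6) is not shown.

```python
import math
from statistics import median


def filter_undetected(counts, max_zero_fraction=0.95):
    """Step 3: drop genes with zero counts in more than max_zero_fraction
    of the samples (the threshold value is an assumption)."""
    kept = {}
    for gene, values in counts.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[gene] = values
    return kept


def standardise_library_sizes(counts):
    """Step 4: rescale each sample so that its total count equals the
    median library size (one possible choice among several)."""
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    if any(size == 0 for size in lib_sizes):
        raise ValueError("a sample with a total count of zero cannot be rescaled")
    target = median(lib_sizes)
    return {
        g: [counts[g][j] * target / lib_sizes[j] for j in range(n_samples)]
        for g in genes
    }


def log2_transform(counts, pseudo_count=1.0):
    """Step 5: log2-transform after adding a pseudo-count to avoid log2(0)."""
    return {
        g: [math.log2(v + pseudo_count) for v in values]
        for g, values in counts.items()
    }


def select_top_genes(adjusted_pvalues, n=1000):
    """Step 7: keep the n genes with the lowest adjusted p-values."""
    return sorted(adjusted_pvalues, key=adjusted_pvalues.get)[:n]


# Toy data: one list of counts per gene, one value per sample (made-up numbers).
raw_counts = {
    "geneA": [10, 20, 30, 40],
    "geneB": [0, 0, 0, 0],
    "geneC": [90, 180, 270, 360],
    "geneD": [0, 0, 0, 5],
}

# A low threshold is used here only because the toy table has four samples.
filtered = filter_undetected(raw_counts, max_zero_fraction=0.5)
normalised = standardise_library_sizes(filtered)
log_counts = log2_transform(normalised)

# Made-up adjusted p-values, standing in for the output of step 6.
top_genes = select_top_genes({"geneA": 0.04, "geneC": 0.001}, n=1)
```

In this toy table, geneB and geneD are discarded by the filter, and after rescaling every sample has the same total count. On the real data the selection would use n=1000, as described in step 7.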

The preprocessing is implemented in an R markdown file, so that anyone can reproduce the results and follow each step.

Use in the different courses