The preprocessed datasets are available on the course github repository.
We will use the function download_only_once()
that we defined in a previous course on data exploration.
This function takes as input a base URL, a file name and a local folder.
This facilitates the downloading of the different files required for the practical. Open the code box below and run the code in your R console.
#' @title Download a file only if it is not yet here
#' @author Jacques van Helden email{Jacques.van-Helden@@france-bioinformatique.fr}
#' @param url_base base of the URL, that will be prepended to the file name
#' @param file_name name of the file (should not contain any path)
#' @param local_folder path of a local folder where the file should be stored
#' @return the function returns the path of the local file, built from local_folder and file_name
#' @export©
<- function(
download_only_once
url_base,
file_name,
local_folder) {
## Define the source URL
<- file.path(url_base, file_name)
url message("Source URL\n\t", url)
## Define the local file
<- file.path(local_folder, file_name)
local_file
## Create the local data folder if it does not exist
dir.create(local_folder, showWarnings = FALSE, recursive = TRUE)
## Download the file ONLY if it is not already there
if (!file.exists(local_file)) {
message("Downloading file from source URL to local file\n\t",
local_file)download.file(url = url, destfile = local_file)
else {
} message("Local file already exists, no need to download\n\t",
local_file)
}
return(local_file)
}
download_only_once()
to download the BIC transciptome data and metadata from the github web site and store it in a local folder (for example ~/m3-stat-R/TCGA-BIC_analysis
).Base URL: https://github.com/DU-Bii/module-3-Stat-R/raw/master/stat-R_2021/data/TCGA_BIC_subset/
Files :
Expression table (1000 top-ranking genes from differential analysis):
File name: BIC_log2-norm-counts_edgeR_DEG_top_1000.tsv.gz
This file contains log2-transformed and standardised counts, with 1000 genes (rows) x 819 samples (columns)
Metadata:
File name: BIC_sample-classes.tsv.gz
This file indicates the status of the 3 marker genes traditionnally used to diagnose the breast cancer type, as well as the cancer class derived from these 3 markers
For more information, see the pre-processing report: import_TCGA_from_Recount.html
## Define the remote URL and local folder
<- "https://github.com/DU-Bii/module-3-Stat-R/raw/master/stat-R_2021/data/TCGA_BIC_subset/"
bic_url
<- "~/m3-stat-R/TCGA-BIC_analysis"
bic_folder
## Download and load the expression data table
## Note: we use check.names=FALSE to avoid replacing hyphens by dots
## in sample names, because we want to keep them as in the
## original data files.
<- download_only_once(
bic_expr_file url_base = bic_url,
file_name = "BIC_log2-norm-counts_edgeR_DEG_top_1000.tsv.gz",
local_folder = bic_folder)
<- read.delim(file = bic_expr_file,
bic_expr header = TRUE,
row.names = 1,
check.names = FALSE)
# colnames(bic_expr)
# dim(bic_expr)
# View(head(bic_expr))
## Download the metadata file
<- download_only_once(
bic_meta_file url_base = bic_url,
file_name = "BIC_sample-classes.tsv.gz",
local_folder = bic_folder)
<- read.delim(file = bic_meta_file,
bic_meta header = TRUE,
row.names = 1,
check.names = FALSE)
Check the content of the metadata file by looking at the first rows, and count the number of samples per class.
## Show the head of the metadata table
kable(head(bic_meta, n = 10), caption = "First rows of the BIC metadata table")
cancer.type | ER1 | PR1 | Her2 | |
---|---|---|---|---|
1AB92ADA-637E-4A42-A39A-70CEEEA41AE3 | Luminal.A | Positive | Positive | Negative |
DA98A67C-F11F-41D3-8223-1161EBFF8B58 | Unclassified | Positive | Negative | Negative |
06CCFD0F-7FB8-471E-B823-C7876582D6FC | HER2pos | Negative | Negative | Positive |
A33B2F42-6EC6-4FB2-8BE5-542407A0382E | Unclassified | Positive | Negative | Negative |
D021A258-8713-4383-9DCA-45E2F54A0411 | Luminal.A | Positive | Positive | Negative |
C705FA90-D9AA-4949-BACA-1C022A14CB03 | Luminal.A | Positive | Positive | Negative |
85380A2D-9951-4D4B-A2A4-6F5F2AFC54E3 | Luminal.A | Positive | Positive | Negative |
F53A9C63-1AF7-4CBC-B8B7-4AA7AAED3364 | Luminal.A | Positive | Positive | Negative |
13EF5323-EAD9-4BC7-8AC4-33875BF12E17 | Luminal.B | Positive | Positive | Positive |
079EACA1-0319-4B54-B20B-673F4576C69D | Basal.like | Negative | Negative | Negative |
## Number of samples per class
kable(sort(table(bic_meta$cancer.type), decreasing = TRUE),
caption = "Number of samples per class")
Var1 | Freq |
---|---|
Luminal.A | 422 |
Basal.like | 131 |
Luminal.B | 118 |
Unclassified | 107 |
HER2pos | 41 |
Sort the expression and metadata tables so that the samples with the same class come together. This is a bit triccky so we provide immediately the solution, but you might attempt todo it in you way if you have time.
The expression table has one row per gene and one column per sample, whereas the metadata file has one row per sample.
You can use the function order()
to obtain the indices of the metadata table by order of sample class. This will return a vector with all the indixes of the first class, then all the indices of the second class, etc.
You can then use the indices of the dataframes to re-order them.
We are not sure that the samples have the same order in the metadata and in the expression tables. So we will use the sample class to re-order the rows (samples) of the metadata file, but after this is done we will use the sample IDs of the metadata file (in the row names) in order to sort the columns of the expression table. This will guarantee that samples are in a consistent order between data and metadata tables.
## Check that the row names of the metadata contain the same set
## of IDs than the column names of the expression table
length(rownames(bic_meta)) ## the metadata contains 819 rows
[1] 819
length(colnames(bic_expr)) ## the data contains 819 columns
[1] 819
length(intersect(rownames(bic_meta), colnames(bic_expr))) ## Their intersection contains the same number of elements
[1] 819
## Get the sample order according to the cancer type column of the metadata
<- order(bic_meta$cancer.type)
sample_order print(bic_meta$cancer.type[sample_order])
[1] "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like"
[20] "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like"
[39] "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like"
[58] "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like"
[77] "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like"
[96] "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like"
[115] "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "Basal.like" "HER2pos" "HER2pos"
[134] "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos"
[153] "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos" "HER2pos"
[172] "HER2pos" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[191] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[210] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[229] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[248] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[267] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[286] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[305] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[324] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[343] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[362] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[381] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[400] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[419] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[438] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[457] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[476] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[495] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[514] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[533] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[552] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[571] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A"
[590] "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.A" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B"
[609] "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B"
[628] "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B"
[647] "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B"
[666] "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B"
[685] "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B"
[704] "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Luminal.B" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified"
[723] "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified"
[742] "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified"
[761] "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified"
[780] "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified"
[799] "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified" "Unclassified"
[818] "Unclassified" "Unclassified"
## Sort the metadata rows (samples) according to this order
<- bic_meta[sample_order, ]
bic_meta
## Sort the expression table to make sure that the samples (columns)
## come in the same order as the rows of the metadata table
<- bic_expr[, row.names(bic_meta)] bic_expr
The TCGA samples have very long and structured IDs, whcih make them inconvenient to display on the type of graphs we would like to generate (box plots, PC plots, heatmaps, …).
We will thus generate readable labels, and add them as a separate column to the metadata table.
For this, we will use the class name and use the funciotn make.names()
to avoid having duplicate labels (this will add a number besides the class name).
#### Sample labels ####
$label <- make.names(
bic_metanames = bic_meta$cancer.type,
unique = TRUE)
The row names of the expression table indicate the Ensembl gene ID and the version of the genome annotation (a number after the dot). This ID is quite long, and we would prefer to have more readable labels for the genes. For this, we will get the gene name (also called “gene symbol”).
#### Get gene names ####
<- data.frame(
gene_info ENSG = sub(x = row.names(bic_expr),
perl = TRUE,
replacement = "",
pattern = '\\..*'),
row.names = row.names(bic_expr)
)# head(gene_info)
## Get gene names from ENSEMBL using biomaRt package
library("biomaRt")
<- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
ensembl <- getBM(attributes = c('ensembl_gene_id',
ensembl_info 'external_gene_name',
'description'),
filters = 'ensembl_gene_id',
values = gene_info$ENSG,
mart = ensembl)
row.names(ensembl_info) <- ensembl_info$ensembl_gene_id
## Add gene name in a new column of the gene_info table
$name <- ensembl_info[gene_info$ENSG, "external_gene_name"]
gene_info$description <- ensembl_info[gene_info$ENSG, "description"]
gene_info
## There is one gene with no name
# sum(is.na(gene_info$name))
## Replace undefined gene names by the gene ID
is.na(gene_info$name), "name"] <-
gene_info[is.na(gene_info$name), "ENSG"]
gene_info[
## Print the first rows of the gene info table
kable(head(gene_info, n = 10), caption = "First rows of the gene information table collected from BioMart")
ENSG | name | description | |
---|---|---|---|
ENSG00000000003.14 | ENSG00000000003 | TSPAN6 | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] |
ENSG00000000419.12 | ENSG00000000419 | DPM1 | dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005] |
ENSG00000000457.13 | ENSG00000000457 | SCYL3 | SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285] |
ENSG00000000460.16 | ENSG00000000460 | C1orf112 | chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565] |
ENSG00000000938.12 | ENSG00000000938 | FGR | FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697] |
ENSG00000000971.15 | ENSG00000000971 | CFH | complement factor H [Source:HGNC Symbol;Acc:HGNC:4883] |
ENSG00000001036.13 | ENSG00000001036 | FUCA2 | alpha-L-fucosidase 2 [Source:HGNC Symbol;Acc:HGNC:4008] |
ENSG00000001084.10 | ENSG00000001084 | GCLC | glutamate-cysteine ligase catalytic subunit [Source:HGNC Symbol;Acc:HGNC:4311] |
ENSG00000001167.14 | ENSG00000001167 | NFYA | nuclear transcription factor Y subunit alpha [Source:HGNC Symbol;Acc:HGNC:7804] |
ENSG00000001460.17 | ENSG00000001460 | STPG1 | sperm tail PG-rich repeat containing 1 [Source:HGNC Symbol;Acc:HGNC:28070] |
## Export the gene info table in a TSV file
<- file.path(bic_folder, "BIC_top1000_gene_info.tsv")
gene_info_file_tsv message("Saving gene info table in TSV file\n\t",
gene_info_file_tsv)write.table(x = gene_info,
file = gene_info_file_tsv,
quote = FALSE,
sep = "\t",
row.names = TRUE,
col.names = NA)
## Export the gene info table in Excel (xlsx) file
<- file.path(bic_folder, "BIC_top1000_gene_info.xlsx")
gene_info_file_xlsx message("Saving gene info table in xlsx file\n\t",
gene_info_file_xlsx)::write.xlsx(x = gene_info, file = gene_info_file_xlsx) openxlsx
We add here a code to assign a color to each sample according to its class.
#### Class and sample colors ####
<- unique(bic_meta$cancer.type)
class_names
## Assign one color to each class
<- c(
class_color "Basal.like" = "#DDEEFF",
"HER2pos" = "#88FF88",
"Luminal.A" = "#FFBB55",
"Luminal.B" = "#EE88FF",
"Unclassified" = "#8888FF"
)
## Assign a color to each sample according to its class
$color <- class_color[bic_meta$cancer.type]
bic_meta# table(bic_meta$color)
We will standardise the samples using robust estimators, to avoid the impact of extreme values.
#### Sample standardisation ####
## Compute descriptive stats on the samples
$mean <- apply(bic_expr, 2, mean)
bic_meta$sd <- apply(bic_expr, 2, sd)
bic_meta$median <- apply(bic_expr, 2, median)
bic_meta$iqr <- apply(bic_expr, 2, IQR)
bic_meta
## Centering on the mean
<- median(unlist(bic_expr))
expr_median <- t(t(bic_expr) - bic_meta$median) + expr_median
bic_expr_centered # dim(bic_expr)
# dim(bic_expr_centered)
# apply(bic_expr_centered, 2, median)
## Compute the IQR of the whole expression table
<- IQR(unlist(bic_expr_centered))
expr_iqr
## Scale each sample based on its IQR
<- t((t(bic_expr) - unlist(bic_meta$median))
bic_expr_std / unlist(bic_meta$iqr)) * expr_iqr + expr_median
# apply(bic_expr_std, 2, IQR)
# range( apply(bic_expr_std, 2, IQR))
# range( apply(bic_expr_std, 2, median))
Create a dataset with
#### Data set with readable
<- bic_expr_std
bic_expr_labels dim(bic_expr_labels)
[1] 1000 819
colnames(bic_expr_labels) <- bic_meta$label
rownames(bic_expr_labels) <- make.names(gene_info$name)
Select 30 samples at random, and draw a boxplot.
Sort the samples according to their index in the columns of the expression table (this will ensure that samples of the same class come together).
#### Box plot ####
<- 30
sample_size
## select sample indices
<- sort(sample(
selected_samples x = 1:ncol(bic_expr),
size = sample_size,
replace = FALSE))
## Box plots
<- par(no.readonly = TRUE)
par.ori par(mfrow = c(1, 3))
par(mar = c(5, 6, 4, 1))
## Original data
boxplot(x = bic_expr[, selected_samples],
names = bic_meta$label[selected_samples],
col = bic_meta$color[selected_samples],
xlab = "log2(counts)",
main = "Before normalisation",
horizontal = TRUE,
las = 1, notch = TRUE,
cex.axis = 1)
boxplot(x = bic_expr_centered[, selected_samples],
names = bic_meta$label[selected_samples],
col = bic_meta$color[selected_samples],
xlab = "log2(counts)",
main = "Median-based centering",
horizontal = TRUE,
las = 1, notch = TRUE,
cex.axis = 0.7)
boxplot(x = bic_expr_std[, selected_samples],
names = bic_meta$label[selected_samples],
col = bic_meta$color[selected_samples],
xlab = "log2(counts)",
main = "IQR-based scaling",
horizontal = TRUE,
las = 1, notch = TRUE,
cex.axis = 0.7)
par(par.ori)
Generate a graph with the two first components and color samples according to their class. Do the sample classes segregate on this PC plot?
#### PC plot of the TAGC BIC samples ####
library(factoextra)
## Compute the principal components
## This is done on the transposed data table
<- PCA(t(bic_expr_labels),
bic_pca scale.unit = FALSE,
ncp = ncol(bic_expr_labels),
graph = FALSE)
## Table with the coordinates of each sample in the PC space
<- bic_pca$ind$coord
bic_pcs
# Check the PC dimensions
dim(bic_pcs)
[1] 819 818
#### Plot the variance per component ####
fviz_eig(bic_pca, addlabels = TRUE)
The plots of individuals show that the cancer types segregate partly onb the first and second PCs.
#### Plot individuals on the 2 first components ####
fviz_pca_ind(X = bic_pca,
axes = c(1, 2),
col.ind = bic_meta$cancer.type,
label = "none")
The 3rd and 4th PCs seem much less informative with respect to the cancer types: the individuals seem intermingled irrespective of their cancer type.
#### Plot individuals on the 2 first components ####
fviz_pca_ind(X = bic_pca,
axes = c(3, 4),
col.ind = bic_meta$cancer.type,
label = "none")
At the end of this tutorial, we dispose of the following data types.
Variable name | Data content |
---|---|
bic_meta |
metadata with a few added columns (sample color, estimators of central tendency and dispersion) |
gene_info |
ID, name and description of the 1000 genes used here |
bic_expr |
non-normalised expression table |
bic_expr_centered |
median-based expression table |
bic_expr_std |
sample-wise standardised expression table, all samples having the same median and IQR |
bic_expr_labels |
same content as bic_expr_std but with row names replaced by human-readable gene names, and column names by human-readable sample labels |
We can now save a memory image, which will enable us to reload the whole data with a single command for the next tutorials.
#### Save memory image ####
<- file.path(bic_folder, "bic_data.Rdata")
mem_image message("Saving memory image in file\n\t", mem_image)
save.image(file = mem_image)
The memory image has been saved in a local folder, in the following file: ~/m3-stat-R/TCGA-BIC_analysis/bic_data.Rdata
The memory image generated at the end of this tutorial is stored on the github repository, and can be reloaded as follows.
#### Reload memory image from github repository ####
<- "https://github.com/DU-Bii/module-3-Stat-R/blob/master/stat-R_2021/data/TCGA_BIC_subset/bic_data.Rdata?raw=true"
github_mem_img
## Define local destination folder
<- "~/m3-stat-R/TCGA-BIC_analysis"
bic_folder ## Create it if required
dir.create(bic_folder, showWarnings = FALSE, recursive = TRUE)
## Define local destination for the memory image
<- file.path(bic_folder, "bic_data.Rdata")
mem_image if (file.exists(mem_image)) {
message("Memory image already there, skipping download")
else {
} message("Downloading memory image from\n", github_mem_img)
download.file(url = github_mem_img, destfile = mem_image)
message("Local memory image\t", mem_image)
}
## Load the memory image
message("Loading memory image", mem_image)
load(mem_image)
As usual, we write the session info in the report for the sake of traceability.
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.44.4 openxlsx_4.2.3 factoextra_1.0.7 ggplot2_3.3.3 FactoMineR_2.4 knitr_1.31
loaded via a namespace (and not attached):
[1] bit64_4.0.5 progress_1.2.2 httr_1.4.2 tools_4.0.2 backports_1.2.1 bslib_0.2.4 utf8_1.2.1 R6_2.5.0 DT_0.17 DBI_1.1.1 BiocGenerics_0.34.0 colorspace_2.0-0 withr_2.4.1 tidyselect_1.1.0
[15] prettyunits_1.1.1 bit_4.0.4 curl_4.3 compiler_4.0.2 Biobase_2.48.0 flashClust_1.01-2 xml2_1.3.2 labeling_0.4.2 sass_0.3.1 scales_1.1.1 askpass_1.1 rappdirs_0.3.3 stringr_1.4.0 digest_0.6.27
[29] foreign_0.8-81 rmarkdown_2.7 rio_0.5.26 pkgconfig_2.0.3 htmltools_0.5.1.1 dbplyr_2.1.0 fastmap_1.1.0 highr_0.8 htmlwidgets_1.5.3 rlang_0.4.10 readxl_1.3.1 RSQLite_2.2.5 farver_2.1.0 jquerylib_0.1.3
[43] generics_0.1.0 jsonlite_1.7.2 dplyr_1.0.5 zip_2.1.1 car_3.0-10 magrittr_2.0.1 leaps_3.1 Rcpp_1.0.6 munsell_0.5.0 S4Vectors_0.26.1 fansi_0.4.2 abind_1.4-5 lifecycle_1.0.0 scatterplot3d_0.3-41
[57] stringi_1.5.3 yaml_2.2.1 carData_3.0-4 MASS_7.3-53.1 BiocFileCache_1.12.1 grid_4.0.2 blob_1.2.1 parallel_4.0.2 ggrepel_0.9.1 forcats_0.5.1 crayon_1.4.1 lattice_0.20-41 haven_2.3.1 hms_1.0.0
[71] pillar_1.5.1 ggpubr_0.4.0 ggsignif_0.6.1 stats4_4.0.2 XML_3.99-0.6 glue_1.4.2 evaluate_0.14 data.table_1.14.0 BiocManager_1.30.12 vctrs_0.3.6 cellranger_1.1.0 gtable_0.3.0 openssl_1.4.3 purrr_0.3.4
[85] tidyr_1.1.3 assertthat_0.2.1 cachem_1.0.4 xfun_0.22 broom_0.7.5 rstatix_0.7.0 tibble_3.1.0 AnnotationDbi_1.50.3 memoise_2.0.0 IRanges_2.22.2 cluster_2.1.1 ellipsis_0.3.1