We provide hereafter a code to load the prepared data from a memory image on the github repository. This image has been generated at the end of the tutorial on data loading and exploration.
#### Reload memory image from github repository ####
<- "https://github.com/DU-Bii/module-3-Stat-R/blob/master/stat-R_2021/data/TCGA_BIC_subset/bic_data.Rdata?raw=true"
github_mem_img
## Define local destination folder
<- "~/m3-stat-R/TCGA-BIC_analysis"
bic_folder ## Create it if required
dir.create(bic_folder, showWarnings = FALSE, recursive = TRUE)
## Define local destination for the memory image
<- file.path(bic_folder, "bic_data.Rdata")
mem_image if (file.exists(mem_image)) {
message("Memory image already there, skipping download")
else {
} message("Downloading memory image from\n", github_mem_img)
download.file(url = github_mem_img, destfile = mem_image)
message("Local memory image\t", mem_image)
}
## Load the memory image
message("Loading memory image", mem_image)
load(mem_image)
The table below indicates the main variables loaded with this memory image.
Variable name | Data content |
---|---|
class_color |
a vector specifying the color to be associated to each sample class (cancer type) |
bic_meta |
metadata with a few added columns (sample color, estimators of central tendency and dispersion) |
gene_info |
ID, name and description of the 1000 genes used here |
bic_expr |
non-normalised expression table |
bic_expr_centered |
median-based expression table |
bic_expr_std |
sample-wise standardised expression table, all samples having the same median and IQR |
bic_expr_labels |
same content as bic_expr_std but with row names replaced by human-readable gene names, and column names by human-readable sample labels |
Use the command View()
or head()
or any other convenient way of your choice to check the content of these variables.
The original data table contains 819 samples (columns) x 1000 genes (rows). These dimensions are not convenient for visualisation, so we will start by selecting a smaller set of samples and genes.
Note that this smaller data subset will serve only for didactic purposes; at the end of the practical, we will apply the clustering to all the genes with all the samples, in order to get relevant results.
#### Samples per cancer type ####
kable(sort(table(bic_meta$cancer.type), decreasing = TRUE),
caption = "Number of samples per cancer type")
Var1 | Freq |
---|---|
Luminal.A | 422 |
Basal.like | 131 |
Luminal.B | 118 |
Unclassified | 107 |
HER2pos | 41 |
sample_subset
that will contain 20 sample IDs from each cancer type except the unclassified.Tip: use the commands subset()
and sample()
. If you are not familiar with these commands, use help()
to read their doc.
#### Sample subset selection ####
<- 20
samples_per_class
## Get the list of cancer types
<- unique(bic_meta$cancer.type)
cancer_types
## Initialise the result vector
<- vector()
sample_subset
## Select samples for each class
for (type in cancer_types) {
if (type != "Unclassified") { ## Skip Unclassified samples
## Select the samples of the current cancer type
<- subset(x = bic_meta, cancer.type == type,
type_sample select = "label")
<- append(
sample_subset
sample_subset, sample(row.names(type_sample),
size = samples_per_class,
replace = FALSE)
)
} }
subse_meta
with the metadata corresponding to this random selection and check the number of samples per class.#### Subset metadata ####
<- bic_meta[sample_subset, ]
subset_meta
kable(x = table(subset_meta$cancer.type),
caption = "Number of samples per class in the selected subset")
Var1 | Freq |
---|---|
Basal.like | 20 |
HER2pos | 20 |
Luminal.A | 20 |
Luminal.B | 20 |
Generate an expression table named subset_expr_labels
from the expression table bic_expr_labels
by keeping only
#### Select a subset of the expression table ####
<- 100
subset_gene_nb <- subset_meta$label
subset_labels <- bic_expr_labels[1:subset_gene_nb, subset_labels] subset_expr_labels
Read the help page of the enhanced distance function of the package factoextra
(factoextra::get_dist
).
Use this method to compute 3 dissimilarity matrices, using respectively Euclidian, Pearson and Spearman distances. Name these matrices respectively sample_dist_euclidian
, sample_dist_pearson
, and sample_dist_spearman
,
Tips:
get_dist()
computes distances betwee the rows of a data matrix. For sample clustering we will thus need to transpose the data.frame with the function t()
.#### Inter-sample distances ####
library(factoextra)
# ?factoextra::get_dist
## Euclidian distance
<- factoextra::get_dist(
sample_dist_euclidian x = t(subset_expr_labels), method = "euclidian")
Use get_dist()
to compute correlation-based distances with respectiely Pearson and Spearman correlations. Save the results in variables named sample_dist_pearson
and sample_dist_spearman
.
## Pearson distance
<- factoextra::get_dist(
sample_dist_pearson x = t(subset_expr_labels), method = "pearson")
## Spearman distance
<- factoextra::get_dist(
sample_dist_spearman x = t(subset_expr_labels), method = "spearman")
Use the function pheatmap()
(from the package of the same name) to plot the distance matrices. Use the annotation
option to display colors above the columns indicating the cancer type and the status of the three gene markers (Her2, PR1 and ER1).
#### Heatmap of the Euclidian distance ####s
<- subset_meta[, c("cancer.type", "ER1", "PR1", "Her2")]
subset_annotation rownames(subset_annotation) <- subset_meta$label
pheatmap(sample_dist_euclidian,
cluster_rows = FALSE,
cluster_cols = FALSE,
main = "Euclidian distance",
annotation_col = subset_annotation)
Run the same analysis for the Pearson and Spearman distance matrices. Interpret the results: do some of these measures seem more relevant to highlight the similarities between samples?
#### Pearson distance heatmap ####
#### Spearman distance heatmap ####
Comparing the three distance heatmaps above, can you already conclude something about the most relevant measure of dissimilarity between samples?
Use the function hclust()
to get trees from the previously computed distance matrices.
#### Sample tree with Euclidian distance, complete linkage ####
<- hclust(
sample_tree_euclidian_complete d = sample_dist_euclidian, method = "complete")
plot(sample_tree_euclidian_complete,
xlab = NA, ylab = NA, hang = -1)
We can also use some fancy library (e.g. ClassDiscovery
) to color the tree labels. Other librairies can be used to generate nice tree drawings, don’t hesitate to google and improve.
library(ClassDiscovery)
::plotColoredClusters(
ClassDiscovery
sample_tree_euclidian_complete, labs = subset_meta$cancer.type, ## class labels
col = "grey", ## tree color
cols = subset_meta$color,
cex = 0.6,
main = "Euclidian distance, Ward linkage")
legend("topright",
legend = names(class_color),
col = class_color,
text.col = class_color, cex=0.9)
Test the different agglomeration rules (single, average, complete, ward) and the different disimilarity measures (Euclidian, Spearman, Pearson).
Based on the analyses above, would you consider that some agglomeration rules are generally more relevant than other ones with this dataset? If so, is it systematic or does the best agglomeration rule vary depending on the dissimilarity measure?
Cut the tree to obtain 4 samples clusters from the tree obtained with Euclidian distance and complete linkage.
Create a confusion matrix to compare the sample compositon of the clusters with the cancer types annotated in the metadata.
Compute the ARI from this confusion matrix.
#### Cluster - cancer type comparison ####
## Cut the tree
<- 4
nclust <- cutree(
clusters_euclidian_complete tree = sample_tree_euclidian_complete, k = nclust)
<- table(
confusion_euclidian_complete
clusters_euclidian_complete,$cancer.type)
subset_meta
kable(confusion_euclidian_complete, row.names = TRUE)
Basal.like | HER2pos | Luminal.A | Luminal.B | |
---|---|---|---|---|
1 | 16 | 16 | 0 | 2 |
2 | 3 | 1 | 3 | 1 |
3 | 1 | 0 | 0 | 0 |
4 | 0 | 3 | 17 | 17 |
## Compute Adjusted Rand Index (ARI)
<- round(
ari_euclidian_complete digits = 3,
::ARI(
aricodec1 = clusters_euclidian_complete,
c2 = subset_meta$cancer.type))
message("ARI for Euclidian distance and complete linkage: ",
ari_euclidian_complete)
With Euclidian distance and the complete lin,kage, we obtain an Adjusted Rank Index (ARI) of 0.312, computed with a subset of the expression table.
The next question is of course: can we obtain better results with other distances and agglomeration rules? This will be left as an exercise.
Compute the ARI index with the same similarity measure and the same agglomeration rule, but with the full dataset (bic_expr_labels
)
## Choose the clustering parameters
<- "euclidian"
distance <- "complete"
rule
message("Computing ARI with ",
" dissimilarity and ",
distance, " linkage")
rule,
## Euclidian distance
<- factoextra::get_dist(
bic_dist x = t(bic_expr_labels), method = distance)
#### Sample tree with Euclidian distance, complete linkage ####
<- hclust(
sample_tree d = bic_dist, method = rule)
## Confusion matrix
<- 4
nclust <- cutree(
bic_clusters tree = sample_tree, k = nclust)
<- table(
bic_confusion
bic_clusters,$cancer.type)
bic_meta
kable(bic_confusion, row.names = TRUE,
caption = paste0(
"Confusion table of the sample clusters versus cancer subtypes (all samples and 1000 genes). ",
"Distance: ", distance, ". ",
"Agglomeration rule: ", rule, ". "
))
Basal.like | HER2pos | Luminal.A | Luminal.B | Unclassified | |
---|---|---|---|---|---|
1 | 117 | 40 | 378 | 106 | 100 |
2 | 4 | 0 | 0 | 0 | 3 |
3 | 9 | 1 | 18 | 7 | 4 |
4 | 1 | 0 | 26 | 5 | 0 |
## Compute Adjusted Rand Index (ARI)
<- round(
bic_ari digits = 3,
::ARI(
aricodec1 = bic_clusters,
c2 = bic_meta$cancer.type))
message("ARI for Euclidian distance and complete linkage (full dataset): ",
bic_ari)
With Euclidian distance and the complete lin,kage, we obtain an Adjusted Rank Index (ARI) of -0.011, computed with a subset of the expression table.
Biclustering amounts to cluster both the individuals (patient samples in our case) and variables (genes in our case).
We will illustrate this with the Spearman correlation as distance and Ward as agglomeration rule.
#### Biclustering ####
# left as an exercise
These analyses can be led collectively: each trainee could take one combination of parameters.
Based on the examples above, find the most relevant parameters (dissimilarity matrix, agglomeration rule) to cluster the samples of the subset expression matrix. Choose these parameters based on the confusion matrix and the ARI.
Draw different graphical representations of the best result:
Use these optimal parameters to run clustering with all the samples of the BIC dataset and the `1000 genes instead of the subset of 100. Export a table with the sample clusters and another one with the gene clusters.
As usual, we write the session info in the report for the sake of traceability.
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ClassDiscovery_3.3.13 oompaBase_3.2.9 cluster_2.1.1 scales_1.1.1 pheatmap_1.0.12 aricode_1.0.0 factoextra_1.0.7 ggplot2_3.3.3 knitr_1.31
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 oompaData_3.1.1 highr_0.8 RColorBrewer_1.1-2 bslib_0.2.4 compiler_4.0.2 pillar_1.5.1 jquerylib_0.1.3 tools_4.0.2 mclust_5.4.7 digest_0.6.27 lattice_0.20-41 jsonlite_1.7.2 evaluate_0.14 lifecycle_1.0.0
[16] tibble_3.1.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.10 Matrix_1.3-2 DBI_1.1.1 ggrepel_0.9.1 yaml_2.2.1 xfun_0.22 withr_2.4.1 stringr_1.4.0 dplyr_1.0.5 generics_0.1.0 sass_0.3.1 vctrs_0.3.6
[31] grid_4.0.2 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0 fansi_0.4.2 rmarkdown_2.7 farver_2.1.0 purrr_0.3.4 magrittr_2.0.1 ellipsis_0.3.1 htmltools_0.5.1.1 assertthat_0.2.1 colorspace_2.0-0 utf8_1.2.1 stringi_1.5.3
[46] munsell_0.5.0 crayon_1.4.1