We provide hereafter the code to load the prepared data from a memory image stored on the GitHub repository. This image was generated at the end of the tutorial on data loading and exploration.
```r
#### Reload memory image from GitHub repository ####
github_mem_img <- "https://github.com/DU-Bii/module-3-Stat-R/blob/master/stat-R_2021/data/TCGA_BIC_subset/bic_data.Rdata?raw=true"

## Define local destination folder
bic_folder <- "~/m3-stat-R/TCGA-BIC_analysis"

## Create it if required
dir.create(bic_folder, showWarnings = FALSE, recursive = TRUE)

## Define local destination for the memory image
mem_image <- file.path(bic_folder, "bic_data.Rdata")

if (file.exists(mem_image)) {
  message("Memory image already there, skipping download")
} else {
  message("Downloading memory image from\n", github_mem_img)
  download.file(url = github_mem_img, destfile = mem_image)
  message("Local memory image\t", mem_image)
}

## Load the memory image
message("Loading memory image ", mem_image)
load(mem_image)
```
The table below indicates the main variables loaded with this memory image.
Variable name | Data content |
---|---|
class_color | a vector specifying the color to be associated with each sample class (cancer type) |
bic_meta | metadata with a few added columns (sample color, estimators of central tendency and dispersion) |
gene_info | ID, name and description of the 1000 genes used here |
bic_expr | non-normalised expression table |
bic_expr_centered | median-centered expression table |
bic_expr_std | sample-wise standardised expression table, all samples having the same median and IQR |
bic_expr_labels | same content as bic_expr_std, but with row names replaced by human-readable gene names and column names by human-readable sample labels |
Use the command `View()` or `head()`, or any other convenient way of your choice, to check the content of these variables.
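For instance, a minimal set of checks could look like this (any equivalent commands are fine):

```r
## Quick sanity checks on the loaded variables
dim(bic_expr)                 # dimensions of the expression table
head(bic_meta)                # first rows of the metadata table
head(gene_info)               # gene IDs, names and descriptions
table(bic_meta$cancer.type)   # number of samples per cancer type
```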
Generate an expression matrix and a metadata table without the unclassified samples, since these cannot be used to train a classifier.
```r
#### Discard unclassified ####
bic_meta_ok <- bic_meta[bic_meta$cancer.type != "Unclassified", ]

## Check that the metadata table has only the right classes
nrow(bic_meta_ok)
```

```
[1] 712
```

```r
table(bic_meta_ok$cancer.type)
```

```
Basal.like    HER2pos  Luminal.A  Luminal.B 
       131         41        422        118 
```

```r
## Select the corresponding columns (samples) of the expression table.
## We also transpose the data table for the supervised classification:
## samples in rows, genes in columns.
bic_expr_class_ok <- t(bic_expr_labels[, bic_meta_ok$label])
dim(bic_expr_class_ok)
```

```
[1]  712 1000
```

```r
bic_classes_ok <- bic_meta_ok$cancer.type
```
Split the sample set into:

- a training set (2/3 of the samples)
- a testing set (the remaining 1/3 of the samples)
```r
#### Split samples into training and testing sets ####

## Count the number of samples (total, train and test)
nsamples <- nrow(bic_expr_class_ok)
nsamples_train <- round(nsamples * 2/3)
nsamples_test <- nsamples - nsamples_train

## Perform a random sampling of indices
resampled_indices <- sample(1:nsamples, replace = FALSE)
train_indices <- resampled_indices[1:nsamples_train]
test_indices <- setdiff(resampled_indices, train_indices)
```
Use the `svm()` function (from the `e1071` package) to train a support vector machine with the training set.
```r
#### Train the SVM with the training subset ####
library(e1071)  # provides svm()

svm_kernel <- "radial"

## Define training set: expression matrix and class labels
training_expr <- bic_expr_class_ok[train_indices, ]
training_classes <- bic_meta_ok[train_indices, "cancer.type"]
table(training_classes)
```

```
training_classes
Basal.like    HER2pos  Luminal.A  Luminal.B 
        87         27        279         82 
```

```r
## Train the SVM
svm_model <- svm(x = training_expr,
                 y = as.factor(training_classes),
                 type = "C-classification",
                 kernel = svm_kernel)
```
```r
#### Predict the classes of the testing set ####
testing_expr <- bic_expr_class_ok[test_indices, ]
testing_pred <- as.vector(predict(svm_model, testing_expr))

## Check the number of predicted samples per class
table(testing_pred)
```

```
testing_pred
Basal.like    HER2pos  Luminal.A  Luminal.B 
        44          2        187          4 
```
```r
#### Evaluate the performance of the classifier ####

## Generate a contingency table comparing
## the known and predicted classes for the testing set

## Get the annotated classes for the testing set
testing_classes <- bic_meta_ok[test_indices, "cancer.type"]
table(testing_classes)
```

```
testing_classes
Basal.like    HER2pos  Luminal.A  Luminal.B 
        44         14        143         36 
```

```r
## Compare the annotated and predicted cancer types
library(knitr)  # provides kable()
contingency <- table(testing_classes, testing_pred)
kable(contingency, row.names = TRUE)
```
 | Basal.like | HER2pos | Luminal.A | Luminal.B |
---|---|---|---|---|
Basal.like | 40 | 0 | 4 | 0 |
HER2pos | 2 | 2 | 10 | 0 |
Luminal.A | 1 | 0 | 142 | 0 |
Luminal.B | 1 | 0 | 31 | 4 |
```r
## Flag the erroneous predictions
errors <- testing_pred != testing_classes

## Compute the misclassification error rate (MER)
mer <- sum(errors) / length(errors)
message("MER = ", mer)
```
In this training/testing experiment we obtained a misclassification error rate of \(MER = 0.2067511\) (49 misclassified samples out of the 237 samples of the testing set). However, we just performed a single random sampling, so we can expect to obtain different results if we repeat this experiment.
Several strategies are classically used to obtain a more reliable estimate of the performance:

- K-fold cross-validation: split the dataset into \(k\) subsets, each of which is used in turn as testing set for one train/test experiment (the \(k - 1\) remaining subsets being used for training). We could for example run a 10-fold cross-validation with this dataset (see the sketch below).
- Leave-one-out (LOO, also called jack-knife) is a particular case of k-fold cross-validation where \(k = n\): each individual (biological sample in our case) is in turn discarded from the dataset ("left out"), a classifier is trained with the \(n - 1\) remaining individuals, and the trained classifier ("model") is used to predict the class of the left-out individual.
- Iterative subsampling consists in repeating the above procedure, where we select a given proportion of the individuals for training (e.g. 2/3) and the rest for testing.

Each of these methods of performance estimation has its pros and cons.
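As an illustration, here is a minimal sketch of a 10-fold cross-validation with the SVM classifier used above. This sketch is not part of the original protocol; it assumes the objects bic_expr_class_ok and bic_classes_ok defined earlier.

```r
## Minimal sketch of a 10-fold cross-validation (illustrative)
library(e1071)

k <- 10
n <- nrow(bic_expr_class_ok)

## Randomly assign each sample to one of the k folds
fold <- sample(rep(1:k, length.out = n))

## Each fold serves in turn as testing set
fold_mer <- sapply(1:k, function(i) {
  test_i <- which(fold == i)
  model <- svm(x = bic_expr_class_ok[-test_i, ],
               y = as.factor(bic_classes_ok[-test_i]),
               type = "C-classification",
               kernel = "radial")
  pred <- predict(model, bic_expr_class_ok[test_i, ])
  mean(pred != bic_classes_ok[test_i])  # MER for this fold
})

mean(fold_mer)  # cross-validated estimate of the MER
```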
For this course, we will run a collective iterative subsampling: each trainee will run a training/testing experiment and we will write the results in a shared spreadsheet, which will then enable us to compute some statistics on the MER values (mean, standard deviation, min, max).
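An automated version of this procedure could look like the following sketch (the number of iterations, 10 here, is arbitrary; it reuses the objects defined in the splitting step above):

```r
## Iterative subsampling sketch (10 iterations, chosen arbitrarily)
iter_mer <- replicate(10, {
  train_i <- sample(nsamples, size = nsamples_train)
  model <- svm(x = bic_expr_class_ok[train_i, ],
               y = as.factor(bic_classes_ok[train_i]),
               type = "C-classification",
               kernel = "radial")
  pred <- predict(model, bic_expr_class_ok[-train_i, ])
  mean(pred != bic_classes_ok[-train_i])
})

## Statistics on the MER values, as in the shared spreadsheet
c(mean = mean(iter_mer), sd = sd(iter_mer),
  min = min(iter_mer), max = max(iter_mer))
```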
Based on the above example, test the 4 SVM kernels (linear, radial, sigmoid, polynomial) and compare their performances. Which kernel gives the best results?
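One possible way to organise this comparison, as a sketch using a single train/test split (the kernel names are the values accepted by `svm()`):

```r
## Compare the four SVM kernels on the same train/test split
kernels <- c("linear", "radial", "sigmoid", "polynomial")
kernel_mer <- sapply(kernels, function(kern) {
  model <- svm(x = training_expr,
               y = as.factor(training_classes),
               type = "C-classification",
               kernel = kern)
  pred <- predict(model, testing_expr)
  mean(pred != testing_classes)
})
kernel_mer  # one MER per kernel
```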
Use the best kernel defined above, and estimate the MER with the top 100, 200, 300, 500 and 1000 genes, respectively.
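Here is a sketch of one way to run this exercise. The ranking criterion for the "top" genes is not specified in this section; we assume, for illustration only, a ranking by decreasing variance across samples.

```r
## Estimate the MER for increasing numbers of top-ranked genes
## (assumption: genes ranked by decreasing variance across samples)
gene_order <- order(apply(bic_expr_class_ok, 2, var), decreasing = TRUE)

for (n_genes in c(100, 200, 300, 500, 1000)) {
  top_genes <- gene_order[1:n_genes]
  model <- svm(x = training_expr[, top_genes],
               y = as.factor(training_classes),
               type = "C-classification",
               kernel = "radial")  # replace with the best kernel found above
  pred <- predict(model, testing_expr[, top_genes])
  message(n_genes, " genes\tMER = ",
          round(mean(pred != testing_classes), 4))
}
```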
Use the `tune.svm()` function to find the optimal SVM parameters.

```r
#### Tune SVM ####
svm_tuning_result <- tune.svm(
  x = training_expr,
  y = as.factor(training_classes))
```
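Called as above, `tune.svm()` relies on its default parameter values; to actually explore a grid, candidate values can be passed explicitly. The ranges below are illustrative, not prescribed by the course:

```r
## Explore a grid of gamma and cost values (illustrative ranges)
svm_tuning_result <- tune.svm(
  x = training_expr,
  y = as.factor(training_classes),
  gamma = 10^(-4:-1),
  cost = 10^(0:2))

summary(svm_tuning_result)         # performance of each parameter combination
svm_tuning_result$best.parameters  # best gamma / cost combination
```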
As usual, we write the session info in the report for the sake of traceability.
```r
sessionInfo()
```

```
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-86    ggplot2_3.3.3   lattice_0.20-41 e1071_1.7-6     knitr_1.31     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           lubridate_1.7.10     class_7.3-18         assertthat_0.2.1     digest_0.6.27        ipred_0.9-11         foreach_1.5.1        utf8_1.2.1           R6_2.5.0             plyr_1.8.6           stats4_4.0.2         evaluate_0.14        highr_0.8            pillar_1.5.1        
[15] rlang_0.4.10         data.table_1.14.0    jquerylib_0.1.3      rpart_4.1-15         Matrix_1.3-2         rmarkdown_2.7        splines_4.0.2        gower_0.2.2          stringr_1.4.0        munsell_0.5.0        proxy_0.4-25         compiler_4.0.2       xfun_0.22            pkgconfig_2.0.3     
[29] htmltools_0.5.1.1    nnet_7.3-15          tidyselect_1.1.0     tibble_3.1.0         prodlim_2019.11.13   codetools_0.2-18     fansi_0.4.2          crayon_1.4.1         dplyr_1.0.5          withr_2.4.1          MASS_7.3-53.1        recipes_0.1.15       ModelMetrics_1.2.2.2 grid_4.0.2          
[43] nlme_3.1-152         jsonlite_1.7.2       gtable_0.3.0         lifecycle_1.0.0      DBI_1.1.1            magrittr_2.0.1       pROC_1.17.0.1        scales_1.1.1         stringi_1.5.3        reshape2_1.4.4       timeDate_3043.102    bslib_0.2.4          ellipsis_0.3.1       generics_0.1.0      
[57] vctrs_0.3.6          lava_1.6.9           iterators_1.0.13     tools_4.0.2          glue_1.4.2           purrr_0.3.4          survival_3.2-10      yaml_2.2.1           colorspace_2.0-0     sass_0.3.1          
```