We provide hereafter the code to load the prepared data from a memory image stored on the GitHub repository. This image was generated at the end of the tutorial on data loading and exploration.
```r
#### Reload memory image from GitHub repository ####
github_mem_img <- "https://github.com/DU-Bii/module-3-Stat-R/blob/master/stat-R_2021/data/TCGA_BIC_subset/bic_data.Rdata?raw=true"

## Define local destination folder
bic_folder <- "~/m3-stat-R/TCGA-BIC_analysis"

## Create it if required
dir.create(bic_folder, showWarnings = FALSE, recursive = TRUE)

## Define local destination for the memory image
mem_image <- file.path(bic_folder, "bic_data.Rdata")

if (file.exists(mem_image)) {
  message("Memory image already there, skipping download")
} else {
  message("Downloading memory image from\n", github_mem_img)
  download.file(url = github_mem_img, destfile = mem_image)
  message("Local memory image\t", mem_image)
}

## Load the memory image
message("Loading memory image ", mem_image)
load(mem_image)
```
The table below indicates the main variables loaded with this memory image.
Variable name | Data content |
---|---|
class_color | a vector specifying the color to be associated with each sample class (cancer type) |
bic_meta | metadata with a few added columns (sample color, estimators of central tendency and dispersion) |
gene_info | ID, name and description of the 1000 genes used here |
bic_expr | non-normalised expression table |
bic_expr_centered | median-centered expression table |
bic_expr_std | sample-wise standardised expression table, all samples having the same median and IQR |
bic_expr_labels | same content as bic_expr_std, but with row names replaced by human-readable gene names and column names by human-readable sample labels |
Use the command `View()` or `head()`, or any other convenient way of your choice, to check the content of these variables.
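For instance, a minimal set of checks could look like this (any equivalent commands are fine):

```r
## Quick sanity checks on the loaded variables
dim(bic_expr)                 # dimensions of the expression table
head(bic_meta)                # first rows of the metadata table
head(gene_info)               # gene IDs, names and descriptions
table(bic_meta$cancer.type)   # number of samples per cancer type
```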
Generate an expression matrix and a metadata table without the unclassified samples, since these cannot be used to train a classifier.
```r
#### Discard unclassified ####
bic_meta_ok <- bic_meta[bic_meta$cancer.type != "Unclassified", ]

## Check that the metadata table has only the right classes
nrow(bic_meta_ok)
```

```
[1] 712
```

```r
table(bic_meta_ok$cancer.type)
```

```
Basal.like    HER2pos  Luminal.A  Luminal.B 
       131         41        422        118 
```

```r
## Select the corresponding columns (samples) of the expression table.
## We also transpose the data table for the supervised classification:
## samples in rows, genes in columns.
bic_expr_class_ok <- t(bic_expr_labels[, bic_meta_ok$label])
dim(bic_expr_class_ok)
```

```
[1]  712 1000
```

```r
bic_classes_ok <- bic_meta_ok$cancer.type
```
Split the sample set into:

- a training set (2/3 of the samples)
- a testing set (the remaining 1/3 of the samples)
```r
#### Split samples into training and testing sets ####

## Count the number of samples (total, train and test)
nsamples <- nrow(bic_expr_class_ok)
nsamples_train <- round(nsamples * 2/3)
nsamples_test <- nsamples - nsamples_train

## Perform a random sampling of indices
resampled_indices <- sample(1:nsamples, replace = FALSE)
train_indices <- resampled_indices[1:nsamples_train]
test_indices <- setdiff(resampled_indices, train_indices)
```
Use the `svm()` function (from the `e1071` package) to train a support vector machine with the training set.
```r
#### Train the SVM with the training subset ####
library(e1071)  # provides svm()

svm_kernel <- "radial"

## Define training set: expression matrix and class labels
training_expr <- bic_expr_class_ok[train_indices, ]
training_classes <- bic_meta_ok[train_indices, "cancer.type"]
table(training_classes)
```

```
training_classes
Basal.like    HER2pos  Luminal.A  Luminal.B 
        87         27        279         82 
```

```r
## Train the SVM
svm_model <- svm(x = training_expr,
                 y = as.factor(training_classes),
                 type = "C-classification",
                 kernel = svm_kernel)
```
```r
#### Predict the classes of the testing set ####
testing_expr <- bic_expr_class_ok[test_indices, ]
testing_pred <- as.vector(predict(svm_model, testing_expr))

## Check the number of predicted samples per class
table(testing_pred)
```

```
testing_pred
Basal.like    HER2pos  Luminal.A  Luminal.B 
        44          2        187          4 
```
```r
#### Evaluate the performance of the classifier ####

## Generate a contingency table comparing
## the known and predicted classes for the testing set

## Get the annotated classes for the testing set
testing_classes <- bic_meta_ok[test_indices, "cancer.type"]
table(testing_classes)
```

```
testing_classes
Basal.like    HER2pos  Luminal.A  Luminal.B 
        44         14        143         36 
```

```r
## Compare the annotated and predicted cancer types
library(knitr)  # provides kable()
contingency <- table(testing_classes, testing_pred)
kable(contingency, row.names = TRUE)
```
 | Basal.like | HER2pos | Luminal.A | Luminal.B |
---|---|---|---|---|
Basal.like | 40 | 0 | 4 | 0 |
HER2pos | 2 | 2 | 10 | 0 |
Luminal.A | 1 | 0 | 142 | 0 |
Luminal.B | 1 | 0 | 31 | 4 |
```r
## Flag the erroneous predictions
errors <- testing_pred != testing_classes

## Compute the misclassification error rate (MER)
mer <- sum(errors) / length(errors)
message("MER = ", mer)
```
In this training/testing experiment we obtained a misclassification error rate of \(MER = 0.2067511\) (49 misclassified samples out of the 237 samples of the testing set). However, we just performed a single random sampling, so we can expect to obtain different results if we repeat this experiment.
Several strategies are classically used to obtain a more reliable estimate of the performance:

- K-fold cross-validation: split the dataset into \(k\) subsets, each of which is used in turn as testing set for one train/test experiment (the \(k - 1\) remaining subsets being used for training). We could for example run a 10-fold cross-validation with this dataset (see the sketch below).
- Leave-one-out (LOO, also called jack-knife) is a particular case of k-fold cross-validation where \(k = n\): each individual (biological sample in our case) is in turn discarded from the dataset ("left out"), a classifier is trained with the \(n - 1\) remaining individuals, and the trained classifier ("model") is used to predict the class of the left-out individual.
- Iterative subsampling consists in repeating the above procedure, where we select a given proportion of the individuals for training (e.g. 2/3) and the rest for testing.

Each of these methods of performance estimation has its pros and cons.
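As an illustration, here is a minimal sketch of a 10-fold cross-validation with the SVM classifier used above. This sketch is not part of the original protocol; it assumes the objects bic_expr_class_ok and bic_classes_ok defined earlier.

```r
## Minimal sketch of a 10-fold cross-validation (illustrative)
library(e1071)

k <- 10
n <- nrow(bic_expr_class_ok)

## Randomly assign each sample to one of the k folds
fold <- sample(rep(1:k, length.out = n))

## Each fold serves in turn as testing set
fold_mer <- sapply(1:k, function(i) {
  test_i <- which(fold == i)
  model <- svm(x = bic_expr_class_ok[-test_i, ],
               y = as.factor(bic_classes_ok[-test_i]),
               type = "C-classification",
               kernel = "radial")
  pred <- predict(model, bic_expr_class_ok[test_i, ])
  mean(pred != bic_classes_ok[test_i])  # MER for this fold
})

mean(fold_mer)  # cross-validated estimate of the MER
```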
For this course, we will run a collective iterative subsampling: each trainee will run a training/testing experiment and we will write the results in a shared spreadsheet, which will then enable us to compute some statistics on the MER values (mean, standard deviation, min, max).
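An automated version of this procedure could look like the following sketch (the number of iterations, 10 here, is arbitrary; it reuses the objects defined in the splitting step above):

```r
## Iterative subsampling sketch (10 iterations, chosen arbitrarily)
iter_mer <- replicate(10, {
  train_i <- sample(nsamples, size = nsamples_train)
  model <- svm(x = bic_expr_class_ok[train_i, ],
               y = as.factor(bic_classes_ok[train_i]),
               type = "C-classification",
               kernel = "radial")
  pred <- predict(model, bic_expr_class_ok[-train_i, ])
  mean(pred != bic_classes_ok[-train_i])
})

## Statistics on the MER values, as in the shared spreadsheet
c(mean = mean(iter_mer), sd = sd(iter_mer),
  min = min(iter_mer), max = max(iter_mer))
```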
Based on the above example, test the 4 SVM kernels (linear, radial, sigmoid, polynomial) and compare their performances. Which kernel gives the best results?
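One possible way to organise this comparison, as a sketch using a single train/test split (the kernel names are the values accepted by `svm()`):

```r
## Compare the four SVM kernels on the same train/test split
kernels <- c("linear", "radial", "sigmoid", "polynomial")
kernel_mer <- sapply(kernels, function(kern) {
  model <- svm(x = training_expr,
               y = as.factor(training_classes),
               type = "C-classification",
               kernel = kern)
  pred <- predict(model, testing_expr)
  mean(pred != testing_classes)
})
kernel_mer  # one MER per kernel
```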
Use the best kernel defined above, and estimate the MER with the top 100, 200, 300, 500 and 1000 genes, respectively.
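Here is a sketch of one way to run this exercise. The ranking criterion for the "top" genes is not specified in this section; we assume, for illustration only, a ranking by decreasing variance across samples.

```r
## Estimate the MER for increasing numbers of top-ranked genes
## (assumption: genes ranked by decreasing variance across samples)
gene_order <- order(apply(bic_expr_class_ok, 2, var), decreasing = TRUE)

for (n_genes in c(100, 200, 300, 500, 1000)) {
  top_genes <- gene_order[1:n_genes]
  model <- svm(x = training_expr[, top_genes],
               y = as.factor(training_classes),
               type = "C-classification",
               kernel = "radial")  # replace with the best kernel found above
  pred <- predict(model, testing_expr[, top_genes])
  message(n_genes, " genes\tMER = ",
          round(mean(pred != testing_classes), 4))
}
```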
Use the `tune.svm()` function to find the optimal SVM parameters.

```r
#### Tune SVM ####
svm_tuning_result <- tune.svm(
  x = training_expr,
  y = as.factor(training_classes))
```
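Called as above, `tune.svm()` relies on its default parameter values; to actually explore a grid, candidate values can be passed explicitly. The ranges below are illustrative, not prescribed by the course:

```r
## Explore a grid of gamma and cost values (illustrative ranges)
svm_tuning_result <- tune.svm(
  x = training_expr,
  y = as.factor(training_classes),
  gamma = 10^(-4:-1),
  cost = 10^(0:2))

summary(svm_tuning_result)         # performance of each parameter combination
svm_tuning_result$best.parameters  # best gamma / cost combination
```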
As usual, we write the session info in the report for the sake of traceability.
```r
sessionInfo()
```

```
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-86    ggplot2_3.3.3   lattice_0.20-41 e1071_1.7-6     knitr_1.31     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           lubridate_1.7.10     class_7.3-18         assertthat_0.2.1     digest_0.6.27        ipred_0.9-11         foreach_1.5.1        utf8_1.2.1           R6_2.5.0             plyr_1.8.6           stats4_4.0.2         evaluate_0.14        highr_0.8            pillar_1.5.1        
[15] rlang_0.4.10         data.table_1.14.0    jquerylib_0.1.3      rpart_4.1-15         Matrix_1.3-2         rmarkdown_2.7        splines_4.0.2        gower_0.2.2          stringr_1.4.0        munsell_0.5.0        proxy_0.4-25         compiler_4.0.2       xfun_0.22            pkgconfig_2.0.3     
[29] htmltools_0.5.1.1    nnet_7.3-15          tidyselect_1.1.0     tibble_3.1.0         prodlim_2019.11.13   codetools_0.2-18     fansi_0.4.2          crayon_1.4.1         dplyr_1.0.5          withr_2.4.1          MASS_7.3-53.1        recipes_0.1.15       ModelMetrics_1.2.2.2 grid_4.0.2          
[43] nlme_3.1-152         jsonlite_1.7.2       gtable_0.3.0         lifecycle_1.0.0      DBI_1.1.1            magrittr_2.0.1       pROC_1.17.0.1        scales_1.1.1         stringi_1.5.3        reshape2_1.4.4       timeDate_3043.102    bslib_0.2.4          ellipsis_0.3.1       generics_0.1.0      
[57] vctrs_0.3.6          lava_1.6.9           iterators_1.0.13     tools_4.0.2          glue_1.4.2           purrr_0.3.4          survival_3.2-10      yaml_2.2.1           colorspace_2.0-0     sass_0.3.1          
```