Reference

Pavkovic, M., Pantano, L., Gerlach, C.V. et al. Multi omics analysis of fibrotic kidneys in two mouse models. Sci Data 6, 92 (2019).

Samples

Samples from two mouse models were collected. The first is a reversible, chemically induced injury model (folic acid (FA) induced nephropathy). The second is an irreversible, surgically induced fibrosis model (unilateral ureteral obstruction, UUO). mRNA and small RNA sequencing, as well as 10-plex tandem mass tag (TMT) proteomics, were performed on kidney samples from different time points over the course of fibrosis development. In the FA model, mice were sacrificed before the treatment (day 0) and 1, 2, 7, and 14 days after a single injection of folic acid. For the UUO model, mice were sacrificed before obstruction (day 0) and 3, 7, and 14 days after the ureter of the left kidney was obstructed via ligation. For both studies, kidneys were removed at each time point for total RNA isolation and protein sample preparation.

We will first explore the UUO transcriptome data.

Data sources

Total RNA for the experiment on the unilateral ureteral obstruction (UUO) model: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118339
Supplementary material of the article with all the data tables (zip archive): https://zenodo.org/record/2592516

Build metadata table

We build a rough, minimal metadata table from the information found in the data table. For a more relevant analysis, it should be refined by importing dataset-specific annotations.

Rough metadata table. Data type = transcriptome; dataset = fa.
dataType sampleName condition sampleNumber color
transcriptome day1_1 day1 1 #FFFFDD
transcriptome day1_2 day1 2 #FFFFDD
transcriptome day1_3 day1 3 #FFFFDD
transcriptome day14_1 day14 1 #FF4400
transcriptome day14_2 day14 2 #FF4400
transcriptome day14_3 day14 3 #FF4400
transcriptome day2_1 day2 1 #FFDD88
transcriptome day2_2 day2 2 #FFDD88
transcriptome day2_3 day2 3 #FFDD88
transcriptome day3_1 day3 1 #FFBB44
transcriptome day3_2 day3 2 #FFBB44
transcriptome day3_3 day3 3 #FFBB44
transcriptome day7_1 day7 1 #FF8800
transcriptome day7_2 day7 2 #FF8800
transcriptome day7_3 day7 3 #FF8800
transcriptome normal_1 normal 1 #BBFFBB
transcriptome normal_2 normal 2 #BBFFBB
transcriptome normal_3 normal 3 #BBFFBB
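A table like the one above can be derived from the column names of the count table, assuming they follow the `condition_replicate` pattern (e.g. `day1_1`). The sample names and color palette below are toy stand-ins for the real `colnames(rawValues)`:

```r
## Build a minimal metadata table from sample names of the form "condition_replicate"
sampleNames <- c("day1_1", "day1_2", "normal_1")   # stands in for colnames(rawValues)
metadata <- data.frame(
  dataType     = "transcriptome",
  sampleName   = sampleNames,
  condition    = sub("_[0-9]+$", "", sampleNames),          # strip replicate suffix
  sampleNumber = as.integer(sub("^.*_", "", sampleNames))   # keep replicate number
)

## Assign one color per condition (palette chosen arbitrarily here)
conditionColors <- c(normal = "#BBFFBB", day1 = "#FFFFDD")
metadata$color <- conditionColors[metadata$condition]
```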

Compute sample-wise statistics

We describe our data by computing sample-wise descriptors of the central tendency (mean, median), of the dispersion (standard deviation, inter-quartile range), and some milestone percentiles (5th, 25th, 50th, 75th, 95th).

dataType sampleName condition sampleNumber color mean median sd IQR min perc05 Q1 Q3 perc95 max
transcriptome day1_1 day1 1 #FFFFDD 663.06 1.1055 65219 58.764 0 0 0 58.764 754.41 13174979
transcriptome day1_2 day1 2 #FFFFDD 1224.51 3.0536 123141 111.712 0 0 0 111.712 1445.26 25118595
transcriptome day1_3 day1 3 #FFFFDD 1368.61 3.7667 133450 122.382 0 0 0 122.382 1567.42 27255572
transcriptome day14_1 day14 1 #FF4400 540.26 2.0257 54481 65.013 0 0 0 65.013 697.51 11178198
transcriptome day14_2 day14 2 #FF4400 682.00 2.0329 74825 69.057 0 0 0 69.057 805.78 15430799
transcriptome day14_3 day14 3 #FF4400 562.57 2.2473 56027 75.117 0 0 0 75.117 727.79 11575718
transcriptome day2_1 day2 1 #FFDD88 1161.76 3.5015 103599 127.047 0 0 0 127.047 1612.19 21036079
transcriptome day2_2 day2 2 #FFDD88 223.88 0.0000 20432 14.542 0 0 0 14.542 229.94 4037963
transcriptome day2_3 day2 3 #FFDD88 628.44 1.4782 57481 63.361 0 0 0 63.361 849.71 11662570
transcriptome day3_1 day3 1 #FFBB44 649.84 2.0761 55211 77.844 0 0 0 77.844 920.35 11170233
transcriptome day3_2 day3 2 #FFBB44 514.86 1.9417 42768 74.165 0 0 0 74.165 827.81 8612320
transcriptome day3_3 day3 3 #FFBB44 854.10 2.9758 80565 104.951 0 0 0 104.951 1128.02 16484449
transcriptome day7_1 day7 1 #FF8800 576.24 2.2562 53113 90.510 0 0 0 90.510 899.23 10690493
transcriptome day7_2 day7 2 #FF8800 510.27 2.5377 47373 80.575 0 0 0 80.575 815.95 9682527
transcriptome day7_3 day7 3 #FF8800 604.97 2.8747 59073 83.500 0 0 0 83.500 880.28 12164986
transcriptome normal_1 normal 1 #BBFFBB 576.30 1.1667 57644 46.384 0 0 0 46.384 616.16 11777395
transcriptome normal_2 normal 2 #BBFFBB 1172.56 2.9758 111510 103.137 0 0 0 103.137 1387.60 22658521
transcriptome normal_3 normal 3 #BBFFBB 531.86 1.0817 57204 41.807 0 0 0 41.807 521.21 11636735
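These descriptors can be computed column-wise with `apply()`. A sketch, where a simulated matrix stands in for the real `rawValues` count table:

```r
## Simulated count matrix standing in for the real data (1000 features x 3 samples)
set.seed(1)
rawValues <- matrix(rpois(3000, lambda = 5), ncol = 3,
                    dimnames = list(NULL, c("day1_1", "day1_2", "day1_3")))

## Sample-wise descriptors: central tendency, dispersion, milestone percentiles
sampleStats <- data.frame(
  mean   = apply(rawValues, 2, mean),
  median = apply(rawValues, 2, median),
  sd     = apply(rawValues, 2, sd),
  IQR    = apply(rawValues, 2, IQR),
  t(apply(rawValues, 2, quantile, probs = c(0, 0.05, 0.25, 0.75, 0.95, 1)))
)
```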

Distribution of raw values

Distribution of raw counts

The distribution of raw counts is not very informative, because its range is determined by a few outliers, i.e. features with huge values relative to the rest of the distribution.

This problem is particularly acute for RNA-seq data: even with a strong zoom on the abscissa range from 0 to 500, the histogram shows a steep drop within the first bins.

Distribution of raw counts - two-sides trimming

For visualization, it is sometimes convenient to truncate both tails of the distribution, in order to avoid both the outliers (right tail) and the sometimes large number of zero values (left tail). This is particularly useful for single-cell RNA-seq data, which can have a “zero-inflated” distribution, but it is also useful for other data types.

Distribution of raw counts
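The two-sided trimming can be sketched as follows; the 5th/95th percentile thresholds are an arbitrary choice here, and the simulated vector stands in for a column of the real count table:

```r
## Simulated counts with one extreme outlier
set.seed(1)
x <- c(rpois(1000, lambda = 5), 1e6)

## Keep only the values between the 5th and 95th percentiles
limits  <- quantile(x, probs = c(0.05, 0.95))
trimmed <- x[x >= limits[1] & x <= limits[2]]

## The histogram of the trimmed values is no longer squashed by the outlier
hist(trimmed, breaks = 50, main = "Raw counts, two-sided trimming")
```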

Log2-transformed values

A typical normalising approach is to apply a log2 transformation.

For RNA-seq data, this might however create a problem when the count of a given gene in a given sample is 0. To circumvent this, we can add an epsilon (\(\epsilon = 0.1\)) before the log2 transformation.

Note: Pavkovic’s “raw counts” are actually decimal numbers and their minimal value is higher than 0. The epsilon is thus not strictly required, but we keep it in this script in order to enable re-using it with other data sources.
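The epsilon-shifted log2 transformation amounts to a single line; the toy matrix below (which includes a zero) illustrates why the shift matters:

```r
## Epsilon-shifted log2 transformation: zeros map to log2(epsilon) instead of -Inf
epsilon <- 0.1
rawValues  <- matrix(c(0, 3, 12, 7), ncol = 2)   # toy counts including a zero
log2Values <- log2(rawValues + epsilon)
```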

Feature filtering

We filter out the features having very low values across all the samples, which we consider as undetected.

Box plot of the log2-transformed values before (left) and after (right) feature filtering.

We filtered out all the features whose maximal count value across all samples was lower than 10. Among the 46679 genes from the raw count table, 17761 were considered undetected according to this criterion. We use the remaining 28918 genes for the subsequent analyses.
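The filtering criterion described above can be sketched as follows (a toy matrix stands in for the real count table; the threshold of 10 matches the text):

```r
## Discard features whose maximal raw count across all samples is below a threshold
rawValues <- matrix(c(0, 1, 500, 2, 3, 800), ncol = 2)   # toy counts: 3 features, 2 samples
detectionThreshold <- 10
detected <- apply(rawValues, 1, max) >= detectionThreshold

## Keep only detected features, then log2-transform with the epsilon shift
log2Filtered <- log2(rawValues[detected, , drop = FALSE] + 0.1)
```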

Standardization: centering

Before going any further, it is important to ensure some standardization of the samples, in order to correct for biases due to inter-sample differences in sequencing depth (RNA-seq) or other sample-specific biases.

For the sake of simplicity, we will use here a very simple criterion: median-based centering.

When we start from the raw values, the principle is to divide the counts of each sample by a sample-specific scaling factor in order to bring each sample to the same median count.

Since we are working with log2-transformed values, we subtract the median instead, which is equivalent to applying a ratio to the raw values.

Box plot of the log2-transformed feature-filtered values before (left) and after (right) median-based standardization.

Consistently, after centering, all the medians appear aligned on the box plot.
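Median-based centering on log2 values boils down to a column-wise subtraction with `sweep()` (a sketch on a toy matrix):

```r
## Median-based centering: subtracting the sample median from log2 values
## is equivalent to dividing the raw counts by a sample-specific scaling factor
log2Filtered <- matrix(c(1, 3, 5, 2, 6, 10), ncol = 2)   # toy log2 values
sampleMedians <- apply(log2Filtered, 2, median)
log2MedianCentered <- sweep(log2Filtered, 2, sampleMedians, FUN = "-")
## After centering, every sample has median 0
```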

Standardization: scaling

The second step of standardization is to ensure homogeneous dispersion between samples. This relies on a biological assumption, so it has to be evaluated on a case-by-case basis.

My trick: for high-throughput data, I generally use the inter-quartile range (IQR) rather than the standard deviation as the scaling factor. This ensures robustness to outliers, which can have a very strong impact for some data types (e.g. RNA-seq).

Box plot of the log2-transformed feature-filtered values before (left) and after (right) IQR-based scaling.
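The IQR-based scaling is again a column-wise `sweep()`, this time with a division (a sketch on simulated centered values):

```r
## IQR-based scaling: divide each centered sample by its inter-quartile range,
## which is more robust to outliers than the standard deviation
set.seed(1)
log2MedianCentered <- matrix(rnorm(200), ncol = 2)   # toy centered log2 values
sampleIQRs <- apply(log2MedianCentered, 2, IQR)
log2Standardized <- sweep(log2MedianCentered, 2, sampleIQRs, FUN = "/")
## After scaling, every sample has an IQR of 1
```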

Raw versus standardized values

At the end of this journey, we can compare the raw and the standardized values. Needless to say, the preceding steps substantially transformed the data. Since this will likely impact all the subsequent analyses, it is worth making sure we made the right choices at each step (filtering, log2 transformation, centering, scaling). This should be evaluated based on your knowledge of your data.

Box plot of the raw values (left) and the final data after feature filtering, normalization by log2 transformation, centering and scaling (right).

Violin plots

We can also inspect the distribution of counts per sample with the vioplot() function.

Violin plot of the raw values (left) and the final data after feature filtering, normalization by log2 transformation, centering and scaling (right).
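A minimal sketch of the violin plot, assuming the `vioplot` package (attached in the session info below) and using a simulated matrix in place of the real standardized data:

```r
library(vioplot)   # CRAN package, attached in the session info

## Simulated standardized values standing in for the real data
set.seed(1)
log2Standardized <- matrix(rnorm(300), ncol = 3,
                           dimnames = list(NULL, c("s1", "s2", "s3")))

## One violin per sample
vioplot(as.data.frame(log2Standardized),
        main = "Per-sample distributions", las = 2)
```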

Scatter plots

We can also compare the counts between pairs of samples with the plot() function.
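A sketch of such a pairwise comparison, again on simulated data; in a well-standardized dataset, replicates should scatter around the diagonal:

```r
## Scatter plot comparing the standardized values of two samples
set.seed(1)
log2Standardized <- matrix(rnorm(200), ncol = 2,
                           dimnames = list(NULL, c("day1_1", "day1_2")))
plot(log2Standardized[, "day1_1"], log2Standardized[, "day1_2"],
     xlab = "day1_1", ylab = "day1_2", pch = 20, col = "grey40")
abline(a = 0, b = 1, col = "red")   # diagonal: identical values in both samples
```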

Exporting the result

We export the pre-processed data in separate tables for further reuse.

#### Save tables  ####

outfiles <- vector()

## Raw counts, all the variables
outfiles["raw"] <- file.path(outdirs$dataset, 
                             paste0(parameters$filePrefix, 
                                    "raw.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", rawValues), 
            dec = ".", 
            file = gzfile(outfiles["raw"], "w"), 
            quote = FALSE, 
            sep = "\t")

## Log2-transformed counts, all the features
outfiles["log2"] <- file.path(outdirs$dataset, 
                              paste0(parameters$filePrefix, 
                                     "log2.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", log2Values), 
            dec = ".", 
            file = gzfile(outfiles["log2"], "w"), 
            quote = FALSE, sep = "\t")

## Filtered features only, log2-transformed values
outfiles["filtered"] <- file.path(outdirs$dataset, 
                                  paste0(parameters$filePrefix, 
                                         "log2_filtered.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", log2Filtered), 
            dec = ".", 
            file = gzfile(outfiles["filtered"], "w"), 
            quote = FALSE, sep = "\t")

## Filtered genes only, log2-transformed and centered but not scaled data
outfiles["centered"] <- file.path(outdirs$dataset, 
                                  paste0(parameters$filePrefix, 
                                         "log2_median-centered.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", log2MedianCentered), 
            dec = ".", 
            file = gzfile(outfiles["centered"], "w"), 
            quote = FALSE, sep = "\t")


## Filtered genes only, log2-transformed and standardized counts
outfiles["standardized"] <- file.path(outdirs$dataset, 
                                      paste0(parameters$filePrefix, 
                                             "log2_standardized.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", log2Standardized), 
            dec = ".", 
            file = gzfile(outfiles["standardized"], "w"), 
            quote = FALSE, sep = "\t")

## Metadata
outfiles["metadata"] <- file.path(outdirs$dataset, 
                                  paste0(parameters$filePrefix, 
                                         "metadata.tsv"))
write.table(x = metadata, 
            file = outfiles["metadata"],
            quote = FALSE, sep = "\t")

Memory image

We will now save a memory image in order to enable us to reload the data exactly in the state we reached now, without having to redo all the steps.

The memory image can be reloaded with a single R command: load("./results/pavkovic2019_transcriptome_fa_memimage.Rdata")
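Saving the image is equally short; a sketch using the file path quoted above:

```r
## Save the whole workspace so the analysis can be resumed in this exact state
memImageFile <- "./results/pavkovic2019_transcriptome_fa_memimage.Rdata"
dir.create(dirname(memImageFile), showWarnings = FALSE, recursive = TRUE)
save.image(file = memImageFile)

## In a later session, restore everything with a single command:
# load(memImageFile)
```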

Session info

R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /shared/mfs/data/software/miniconda/envs/r-3.6.3/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vioplot_0.3.4 zoo_1.8-8     sm_2.2-5.6    stringr_1.4.0 knitr_1.28   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6    lattice_0.20-41 digest_0.6.25   grid_3.6.3      magrittr_1.5    evaluate_0.14   highr_0.8       rlang_0.4.6     stringi_1.4.6   rmarkdown_2.1   tools_3.6.3     xfun_0.14       yaml_2.2.1      compiler_3.6.3  tcltk_3.6.3     htmltools_0.4.0