Required libraries

Load the following libraries or install them if required.
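For instance, assuming the only non-base package needed here is qvalue (used for the Storey-Tibshirani correction mentioned below), a minimal loading sketch could be:

## Install the Bioconductor package qvalue if it is not yet available, then load it
if (!requireNamespace("qvalue", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("qvalue")
}
library(qvalue)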

Introduction

This tutorial aims to provide an empirical introduction to the application of mean comparison tests to omics data.

The goals include

  • revisiting the basic underlying concepts (sampling, estimation, hypothesis testing, risks…);
  • perceiving the problems that arise when a hypothesis test is applied to several thousand features (multiple testing);
  • introducing some methods to circumvent these problems (multiple testing corrections);
  • using graphical representations in order to grasp the results of several thousand tests at a glance:

    • p-value histogram
    • MA plot
    • volcano plot

The whole tutorial will rely on artificial data generated by drawing random numbers that follow a given probability distribution (in this case the normal distribution, but other choices could be made afterwards).

The tutorial will proceed progressively:

  1. Generate a multivariate table (with individuals in columns and features in rows) and fill it with random data following a given distribution of probability.

  2. Measure different descriptive parameters on the sampled data.

  3. Use different graphical representations to visualise the data distribution.

  4. Run a test of hypothesis on a given feature.

  5. Run the same test of hypothesis on all the features.

  6. Use different graphical representations to summarize the results of all the tests.

  7. Apply different corrections for multiple testing (Bonferroni, Benjamini-Hochberg, Storey-Tibshirani q-value).

  8. Compare the performances of the test depending on the chosen multiple testing correction.

Experimental setting

Well, by “experimental” we mean here that we will perform in silico experiments.

Let us define the parameters of our analysis. We will generate data tables of artificial data following normal distributions, with either different means (tests under \(H_1\)) or equal means (tests under \(H_0\)).

We will do this for a number of features \(m_0=10,000\) (number of rows in the “\(H_0\)” data table), which could be considered as replicates to study the impact of sampling fluctuations.

As a second step (not covered here), we could refine the script by sampling with a different mean for each feature, in order to mimic the behaviour of omics datasets (where genes have different expression levels, and proteins and metabolites different concentrations).

Parameters

Parameter Value Description
\(n_1\) \(\{2, 3, 4, 8, 16, 32, 64\}\) Size of the sample from the first population; each participant will choose a given sample size
\(n_2\) \(= n_1\) Size of the sample from the second population
\(\mu_1\) 10 or 7 Mean of the first population; each participant will choose one value
\(\mu_2\) 10 Mean of the second population
\(\sigma_1\) 2 Standard deviation of the first population
\(\sigma_2\) 3 Standard deviation of the second population
\(m_0\) \(10,000\) Number of features under the null hypothesis

Sample sizes

Each participant will choose a different sample size among the following values: \(n \in \{2, 3, 4, 5, 8, 16, 64\}\). Notably, many omics studies are carried out with a very small number of replicates (frequently 3), so it will be relevant to evaluate the impact of the statistical sample size (number of replicates) on the sensitivity (proportion of cases declared positive under \(H_1\)).

Performances of the tests

We will measure the performances of a test by running it \(r = 10,000\) times under \(H_0\) and \(r = 10,000\) times under \(H_1\), and then:

  • count the number of \(FP\), \(TP\), \(FN\), \(TN\)
  • compute the derived statistics: \(FPR\), \(FDR\) and \(Sn\)

\[FPR = \frac{FP}{m_0} = \frac{FP}{FP + TN} \]

\[FDR = \frac{FP}{\text{Positive}} = \frac{FP}{FP + TP} \]

\[Sn = \frac{TP}{m_1} = \frac{TP}{TP + FN} \]

\[PPV = \frac{TP}{\text{Positive}} = \frac{TP}{TP + FP}\]
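As an illustration, these statistics could be computed in R from the vectors of nominal p-values obtained under \(H_0\) and \(H_1\) (a sketch; the variable names pValuesH0 and pValuesH1 are hypothetical placeholders for the results of the two runs):

alpha <- 0.05                    ## significance threshold
FP <- sum(pValuesH0 < alpha)     ## false positives: H0 cases declared positive
TN <- sum(pValuesH0 >= alpha)    ## true negatives: H0 cases declared negative
TP <- sum(pValuesH1 < alpha)     ## true positives: H1 cases declared positive
FN <- sum(pValuesH1 >= alpha)    ## false negatives: H1 cases declared negative

FPR <- FP / (FP + TN)  ## false positive rate
FDR <- FP / (FP + TP)  ## false discovery rate
Sn  <- TP / (TP + FN)  ## sensitivity
PPV <- TP / (TP + FP)  ## positive predictive value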

Recommendations

Coding recommendations

  1. Adopt a coding style consistent with a reference style guide (e.g. the Google R Style Guide). In particular:

    • For variable names, use the camel back notation with a leading lowercase (e.g. myDataTable) rather than the dot separator (my.data.table).
    • For function names, use the camel back notation with a leading uppercase (e.g. MyMeanCompaTest).
  2. Define your variables with explicit names (sigma, mu rather than a, b, c, …).

  3. Comment your code

    • indicate what each variable represents
    • before each segment of code, explain what it will do
  4. Ensure consistency between the code and the report \(\rightarrow\) inject the actual values of the R variables in the markdown.
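For example, an R Markdown source can embed the current value of a variable in the text with an inline chunk, so that the report can never drift away from the code (a sketch using the result table defined later in this tutorial):

The difference between sample means was `r round(resultTable[i, "d"], 2)`,
with a p-value of `r signif(resultTable[i, "p.value"], 3)`.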

Scientific recommendations

  1. Explicitly formulate the statistical hypotheses before running a test.

  2. Discuss the assumptions underlying the test: are they all fulfilled? If not, explain why (e.g. because we want to test the impact of this parameter, …).

Tutorial

Part 1: generating random datasets

Define your parameters

Write a block of code to define the parameters specified above.

Note that each participant will have a different value for the sample sizes (\(n_1, n_2\)).

The table below lists the actual values for my parameters (your values for \(n_i\) will be different).

Parameter Value Description
\(\mu_1\) 7 Mean of the first population
\(\mu_2\) 10 Mean of the second population
\(\sigma_1\) 2 Standard deviation of the first population
\(\sigma_2\) 2 Standard deviation of the second population
\(n_1\) 16 Sample size for the first group
\(n_2\) 16 Sample size for the second group
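A minimal block of code matching this table could look as follows (your own values for \(n_1\) and \(n_2\) will differ):

## Parameters of the in silico experiment
mu1 <- 7       ## mean of the first population
mu2 <- 10      ## mean of the second population
sigma1 <- 2    ## standard deviation of the first population
sigma2 <- 2    ## standard deviation of the second population
n1 <- 16       ## sample size for the first group
n2 <- 16       ## sample size for the second group
m0 <- 10000    ## number of features under the null hypothesis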

Generate random data sets

Exercise:

  • Generate a data frame named group1 with \(m_0\) rows (the number of features under \(H_0\), defined above) and \(n_1\) columns (sample size for the first population), filled with random numbers drawn from the first population.

  • Name the columns with labels indicating the group and sample number: g1s1, g1s2, … with indices from 1 to \(n_1\).

  • Name the rows to denote the feature numbers: ft1, ft2, … with indices from 1 to \(m_0\).

  • Check the dimensions of group1.

  • Print its first and last rows to check its content and the row and column names.

  • Generate a second data frame named group2 for the samples drawn from the second population with the appropriate size, and name the columns and rows consistently.

Useful functions

  • rnorm()
  • matrix()
  • data.frame()
  • paste()
  • paste0()
  • colnames()
  • rownames()
  • dim()

Trick:

  • the function matrix() enables us to define the number of columns and the number of rows,
  • the function data.frame() does not, but it can take as input a matrix, from which it will inherit the dimensions.

Solution (click on the “code” button to check the solution).
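A possible solution sketch, using the trick above and the parameters defined earlier:

## First group: m0 features (rows) x n1 individuals (columns)
group1 <- data.frame(
  matrix(rnorm(n = m0 * n1, mean = mu1, sd = sigma1),
         nrow = m0, ncol = n1))
colnames(group1) <- paste0("g1s", 1:n1)  ## sample labels g1s1, g1s2, ...
rownames(group1) <- paste0("ft", 1:m0)   ## feature labels ft1, ft2, ...

## Second group, built in the same way
group2 <- data.frame(
  matrix(rnorm(n = m0 * n2, mean = mu2, sd = sigma2),
         nrow = m0, ncol = n2))
colnames(group2) <- paste0("g2s", 1:n2)
rownames(group2) <- paste0("ft", 1:m0)

dim(group1)   ## check the dimensions
head(group1)  ## first rows
tail(group1)  ## last rows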

Check the content of the data table from the first group.

[1] 1000   16
g1s1 g1s2 g1s3 g1s4 g1s5 g1s6 g1s7 g1s8 g1s9 g1s10 g1s11 g1s12 g1s13 g1s14 g1s15 g1s16
ft1 8.941285 6.065904 6.865641 4.859670 4.512642 6.895100 6.958049 6.571121 6.346707 9.316831 2.405495 7.560588 5.237921 7.455899 9.150481 7.584783
ft2 6.341780 5.572643 6.225146 7.607175 7.916639 9.159623 5.848137 3.714532 5.690572 8.518993 7.721817 8.796947 8.394252 5.201240 5.078987 5.059495
ft3 7.256030 5.609690 5.085780 6.251178 6.185095 5.746095 5.426706 5.937105 3.509892 9.136032 6.853375 6.274303 7.318755 4.817723 10.016098 8.024319
ft4 7.306063 9.761042 6.510410 2.995976 7.084978 9.170946 5.843874 7.777505 6.699460 7.871503 4.288901 11.089438 7.985812 6.003219 2.376816 7.505029
ft5 7.377395 5.728367 7.502510 6.681304 5.280177 4.125594 10.377625 2.956254 4.353912 8.138003 9.110175 8.887401 8.325482 5.321332 6.616982 4.688656
ft6 8.138350 7.867051 7.211847 9.442536 3.512614 7.266573 5.658330 7.845747 9.028672 7.819257 10.218765 4.371628 6.856506 6.572069 4.021843 5.959517
g1s1 g1s2 g1s3 g1s4 g1s5 g1s6 g1s7 g1s8 g1s9 g1s10 g1s11 g1s12 g1s13 g1s14 g1s15 g1s16
ft995 9.605869 5.740994 7.952485 6.786263 8.301584 5.963019 8.111063 3.566636 8.257055 6.686423 9.280676 9.945114 7.836719 5.139963 6.322468 3.879234
ft996 8.049430 8.322572 9.308645 6.249635 6.517046 6.945753 10.476903 8.258785 9.631009 5.271014 8.183092 4.636647 7.828096 4.856097 7.808470 4.576391
ft997 7.713957 8.750733 9.170241 6.859514 5.216096 8.194377 8.951150 7.481994 6.736394 5.140652 10.301499 5.520028 8.206613 4.289319 6.971314 8.807768
ft998 9.826277 5.728074 7.169590 9.398133 6.224335 6.586678 10.960816 4.886410 9.327067 10.245310 5.790441 3.356992 7.388533 6.234348 6.130373 4.776949
ft999 10.307381 8.634587 7.994570 8.684634 3.655567 7.106540 5.869480 5.938656 8.157128 7.209484 5.526127 2.734005 6.518238 8.962575 6.742045 7.910900
ft1000 6.695509 8.701822 10.610700 9.227807 5.761689 9.165080 6.451053 6.774868 7.586023 10.825243 6.221651 7.521273 6.485664 6.469598 3.654353 7.192678

Check the content of the data table from the second group.

[1] 1000   16
g2s1 g2s2 g2s3 g2s4 g2s5 g2s6 g2s7 g2s8 g2s9 g2s10 g2s11 g2s12 g2s13 g2s14 g2s15 g2s16
ft1 10.222202 6.803465 5.644039 15.310530 12.022135 9.205384 7.810089 6.824159 9.601778 3.042279 13.646228 12.040121 13.800577 5.405056 11.868858 5.146461
ft2 13.788490 11.414270 19.247855 9.104344 7.161907 14.430186 6.693538 11.771966 5.466395 13.846364 4.814672 10.611065 14.007772 9.264401 11.794563 11.574728
ft3 8.218564 9.340666 11.842295 13.045543 5.242310 7.199965 12.454361 11.112718 9.344606 11.487924 10.799383 4.877696 6.419245 13.373079 13.581551 10.329964
ft4 6.388767 14.841495 6.673942 6.147528 7.040335 15.048979 7.286110 7.894366 13.823895 14.894529 8.177395 11.067993 12.745004 13.283611 8.662708 6.765320
ft5 7.862331 8.373275 4.457363 11.435550 12.882556 13.882535 5.722801 12.662282 3.671602 17.983611 4.309335 10.602149 7.648618 8.398105 11.239692 8.139710
ft6 15.439498 5.644524 9.878748 5.295041 4.737427 11.884359 7.842318 15.196238 10.719587 12.656550 7.791976 8.960966 11.817566 4.536809 8.054717 6.694868
g2s1 g2s2 g2s3 g2s4 g2s5 g2s6 g2s7 g2s8 g2s9 g2s10 g2s11 g2s12 g2s13 g2s14 g2s15 g2s16
ft995 11.560118 11.477506 7.288923 7.257060 7.974623 11.264721 10.600428 10.025548 7.492476 15.06344 8.820455 15.313801 9.524212 8.412263 14.001309 9.712662
ft996 6.675313 12.578788 9.168974 8.874279 12.924633 8.100163 9.838190 9.654799 8.691387 12.08811 8.808767 10.163288 10.591247 11.112719 7.381530 4.043808
ft997 9.429636 10.788576 7.365419 6.562660 12.170922 11.668521 9.720355 14.542886 9.716674 11.50710 12.884846 7.063525 4.797859 8.954747 5.540972 9.692725
ft998 8.826315 11.819484 5.649444 10.404189 14.148555 8.348604 4.936733 14.004160 4.741586 10.12474 8.560174 7.037552 6.882366 9.336428 6.524944 7.723506
ft999 14.302763 5.136343 5.557525 4.767551 12.962332 8.779562 9.618915 15.235528 13.498886 10.64103 10.652171 10.128990 8.093268 6.784517 7.035637 9.486925
ft1000 7.111972 15.204233 13.274704 13.993838 7.358308 14.302453 7.925037 9.124976 12.256743 12.69099 7.320802 10.944899 12.102726 8.468052 7.981384 6.673717

Checking the properties of the data tables

Check the properties of the columns (individuals, e.g. biological samples) and rows (features, e.g. genes or proteins or metabolites) of the data tables.

  • Use the summary() function to quickly inspect the column-wise properties (statistics per individual).

  • Use apply(), mean() and sd() to generate a data frame that collects

    • the column-wise parameters (statistics per individual)
    • the row-wise parameters (statistics per feature).
      g1s1             g1s2              g1s3              g1s4              g1s5              g1s6             g1s7              g1s8              g1s9            g1s10             g1s11            g1s12            g1s13             g1s14            g1s15            g1s16        
 Min.   : 1.153   Min.   : 0.2647   Min.   :-0.8534   Min.   : 0.9221   Min.   : 0.2519   Min.   : 1.098   Min.   :-0.6261   Min.   : 0.5412   Min.   :-2.541   Min.   : 0.4808   Min.   : 1.537   Min.   : 0.614   Min.   : 0.3706   Min.   :-0.904   Min.   : 1.078   Min.   : 0.2331  
 1st Qu.: 5.574   1st Qu.: 5.7378   1st Qu.: 5.6475   1st Qu.: 5.7334   1st Qu.: 5.7508   1st Qu.: 5.718   1st Qu.: 5.5893   1st Qu.: 5.5918   1st Qu.: 5.565   1st Qu.: 5.6125   1st Qu.: 5.567   1st Qu.: 5.431   1st Qu.: 5.5188   1st Qu.: 5.503   1st Qu.: 5.699   1st Qu.: 5.7006  
 Median : 6.868   Median : 7.2550   Median : 7.0104   Median : 7.0928   Median : 7.1135   Median : 7.106   Median : 6.9690   Median : 6.9331   Median : 6.979   Median : 7.1559   Median : 7.057   Median : 6.877   Median : 6.9950   Median : 6.899   Median : 6.987   Median : 7.0582  
 Mean   : 6.891   Mean   : 7.1632   Mean   : 6.9962   Mean   : 7.1103   Mean   : 7.0383   Mean   : 7.107   Mean   : 6.9354   Mean   : 6.9583   Mean   : 6.987   Mean   : 7.0500   Mean   : 7.049   Mean   : 6.846   Mean   : 6.9139   Mean   : 6.959   Mean   : 7.005   Mean   : 6.9945  
 3rd Qu.: 8.174   3rd Qu.: 8.5433   3rd Qu.: 8.3104   3rd Qu.: 8.4067   3rd Qu.: 8.2805   3rd Qu.: 8.506   3rd Qu.: 8.2376   3rd Qu.: 8.2567   3rd Qu.: 8.349   3rd Qu.: 8.4553   3rd Qu.: 8.461   3rd Qu.: 8.207   3rd Qu.: 8.3111   3rd Qu.: 8.348   3rd Qu.: 8.262   3rd Qu.: 8.3510  
 Max.   :14.239   Max.   :12.3955   Max.   :13.7764   Max.   :14.7685   Max.   :15.2060   Max.   :13.430   Max.   :13.3995   Max.   :13.1033   Max.   :14.015   Max.   :13.6878   Max.   :13.513   Max.   :13.384   Max.   :12.5050   Max.   :13.272   Max.   :13.596   Max.   :12.9327  
      g2s1              g2s2             g2s3             g2s4               g2s5              g2s6               g2s7              g2s8             g2s9            g2s10             g2s11            g2s12            g2s13             g2s14            g2s15             g2s16        
 Min.   : 0.4878   Min.   :-1.210   Min.   : 2.164   Min.   :-0.07996   Min.   : 0.2839   Min.   : 0.05548   Min.   : 0.4454   Min.   : 0.598   Min.   : 1.132   Min.   : 0.1758   Min.   :-1.166   Min.   : 1.895   Min.   :-0.6536   Min.   : 1.289   Min.   :-0.7888   Min.   :-0.0963  
 1st Qu.: 7.8010   1st Qu.: 8.025   1st Qu.: 7.959   1st Qu.: 8.07439   1st Qu.: 7.8946   1st Qu.: 8.02792   1st Qu.: 8.0490   1st Qu.: 7.798   1st Qu.: 8.122   1st Qu.: 8.0893   1st Qu.: 8.050   1st Qu.: 8.090   1st Qu.: 7.8425   1st Qu.: 8.126   1st Qu.: 7.9821   1st Qu.: 8.0623  
 Median :10.0579   Median : 9.916   Median :10.021   Median : 9.92345   Median :10.0376   Median : 9.99913   Median :10.0563   Median :10.013   Median :10.071   Median :10.0239   Median : 9.844   Median :10.022   Median : 9.9218   Median :10.060   Median : 9.8194   Median :10.0806  
 Mean   : 9.8699   Mean   : 9.985   Mean   : 9.965   Mean   :10.00777   Mean   :10.0254   Mean   :10.05125   Mean   :10.0442   Mean   : 9.945   Mean   :10.129   Mean   :10.0470   Mean   : 9.965   Mean   :10.104   Mean   : 9.9233   Mean   : 9.961   Mean   : 9.9155   Mean   :10.0670  
 3rd Qu.:11.7515   3rd Qu.:11.961   3rd Qu.:11.942   3rd Qu.:12.03088   3rd Qu.:12.0975   3rd Qu.:12.10682   3rd Qu.:12.0168   3rd Qu.:11.963   3rd Qu.:12.214   3rd Qu.:12.0801   3rd Qu.:12.018   3rd Qu.:12.244   3rd Qu.:12.0490   3rd Qu.:12.005   3rd Qu.:11.9846   3rd Qu.:12.0891  
 Max.   :19.6257   Max.   :19.358   Max.   :20.257   Max.   :20.79365   Max.   :20.4195   Max.   :20.66677   Max.   :19.9660   Max.   :18.719   Max.   :19.274   Max.   :20.0369   Max.   :19.276   Max.   :19.836   Max.   :20.0104   Max.   :19.669   Max.   :18.6784   Max.   :20.0330  

Add a column with the difference between sample means for each feature.

Tips: this can be done in a single operation.
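A sketch of both steps (the name statsPerFeature is an arbitrary choice):

## Row-wise (per-feature) statistics for each group
statsPerFeature <- data.frame(
  m1 = apply(group1, 1, mean),  ## mean of each feature in group 1
  m2 = apply(group2, 1, mean),  ## mean of each feature in group 2
  s1 = apply(group1, 1, sd),    ## sd of each feature in group 1
  s2 = apply(group2, 1, sd))    ## sd of each feature in group 2

## Difference between sample means, in a single vectorized operation
statsPerFeature$d <- statsPerFeature$m1 - statsPerFeature$m2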

Merge the two groups in a single data frame

In omics data analysis, we typically obtain

  • a single data table with all the individuals (biological samples) of all the groups
  • a table containing the metadata, i.e. the information about each individual (biological sample)

Two methods could be envisaged a priori:

  • cbind(), which simply binds the columns of two input tables. This can however be somewhat dangerous, because it assumes that these two tables have the same number of rows (features) and that these rows contain information about the same features in the same order. However, in some cases we have data tables coming from different sources (or software tools), where the features (genes, proteins, metabolites) might have a partial overlap rather than an exact 1-to-1 correspondence, and, even when the feature sets are the same, they might be presented in different orders.

  • A much safer approach is thus to use merge(), and to explicitly indicate one or several columns on which the features from the two tables will be matched.

In our case, the two data tables only contain numeric data, and the identification will be done on the row names (which contain the feature identifiers ft1, ft2, … that we defined above). In some cases, we will have to merge data tables containing different information, including a column with identifiers (and maybe additional columns, e.g. the genotype, conditions, …), and we will use internal columns of the tables to match their rows.

We will create such a structure for further analysis.
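A sketch of how this structure could be built, merging on the row names and assembling the metadata table shown below:

## Merge the two groups on their row names (the feature identifiers)
mergedData <- merge(group1, group2, by = "row.names")
rownames(mergedData) <- mergedData$Row.names  ## restore feature IDs as row names
mergedData <- mergedData[, -1]                ## drop the Row.names column

## Metadata table: one row per individual (biological sample)
metadata <- data.frame(
  sampleNames = c(colnames(group1), colnames(group2)),
  sampleGroup = rep(c("g1", "g2"), times = c(n1, n2)),
  sampleColor = rep(c("#DDEEFF", "#FFEEDD"), times = c(n1, n2)))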

Metadata table
sampleNames sampleGroup sampleColor
g1s1 g1 #DDEEFF
g1s2 g1 #DDEEFF
g1s3 g1 #DDEEFF
g1s4 g1 #DDEEFF
g1s5 g1 #DDEEFF
g1s6 g1 #DDEEFF
g1s7 g1 #DDEEFF
g1s8 g1 #DDEEFF
g1s9 g1 #DDEEFF
g1s10 g1 #DDEEFF
g1s11 g1 #DDEEFF
g1s12 g1 #DDEEFF
g1s13 g1 #DDEEFF
g1s14 g1 #DDEEFF
g1s15 g1 #DDEEFF
g1s16 g1 #DDEEFF
g2s1 g2 #FFEEDD
g2s2 g2 #FFEEDD
g2s3 g2 #FFEEDD
g2s4 g2 #FFEEDD
g2s5 g2 #FFEEDD
g2s6 g2 #FFEEDD
g2s7 g2 #FFEEDD
g2s8 g2 #FFEEDD
g2s9 g2 #FFEEDD
g2s10 g2 #FFEEDD
g2s11 g2 #FFEEDD
g2s12 g2 #FFEEDD
g2s13 g2 #FFEEDD
g2s14 g2 #FFEEDD
g2s15 g2 #FFEEDD
g2s16 g2 #FFEEDD

Visualisation of the data

Histogram of the sampled values

  • Draw two histograms with all the values of group 1 and group 2, respectively.

Tip: use par(mfrow = c(2, 1)) to display the histograms above each other, and set the limits of the abscissa (x axis) to the same values.
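A possible sketch (the number of breaks is an arbitrary choice):

## Two histograms above each other, with common x-axis limits
xlim <- range(unlist(group1), unlist(group2))
par(mfrow = c(2, 1))  ## layout: 2 rows, 1 column of panels
hist(unlist(group1), breaks = 50, xlim = xlim,
     main = "Group 1", xlab = "value", col = "#DDEEFF")
hist(unlist(group2), breaks = 50, xlim = xlim,
     main = "Group 2", xlab = "value", col = "#FFEEDD")
par(mfrow = c(1, 1))  ## reset the layout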

Sampling distribution of the means

Draw histograms with the sampling distribution of the means in the respective groups.

[Figures: sampling distribution of the mean in each group]

  • Compare the standard deviations measured on the sampled values and on the feature means. Do they differ? Explain why.

Answer: the standard deviation of the sample means corresponds to the standard error of the mean, \(\sigma / \sqrt{n}\), which is smaller than the standard deviation of the individual values.
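This can be checked numerically (a sketch for group 1):

## Spread of the individual values vs spread of the feature means
sd(unlist(group1))          ## close to sigma1
sd(apply(group1, 1, mean))  ## close to the standard error sigma1 / sqrt(n1)
sigma1 / sqrt(n1)           ## theoretical standard error of the mean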

Part 2: hypothesis testing

Run Student test on a given feature

Since we are interested in differences in either direction, we run a two-tailed test.

Hypotheses:

\[H_0: \mu_1 = \mu_2\]

\[H_1: \mu_1 \ne \mu_2\]

Exercise: pick a given feature (e.g. the \(267^{th}\)) and run a mean comparison test. Choose the parameters according to your experimental setting.

Tips:

  • t.test()
  • you need to choose the test depending on whether the two populations have equal variance (Student) or not (Welch). Since we defined different values for the population standard deviations (\(\sigma_1\), \(\sigma_2\)), the choice is obvious.
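A sketch of the corresponding call (Welch variant, since \(\sigma_1 \ne \sigma_2\)); it produces an output like the block below:

i <- 267                   ## the chosen feature
x1 <- unlist(group1[i, ])  ## values of feature i in group 1
x2 <- unlist(group2[i, ])  ## values of feature i in group 2
t.test(x = x1, y = x2,
       alternative = "two.sided",  ## two-tailed test
       var.equal = FALSE)          ## Welch test: unequal variances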

    Welch Two Sample t-test

data:  x1 and x2
t = -2.7572, df = 24.441, p-value = 0.01086
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.8449347 -0.5546979
sample estimates:
mean of x mean of y 
 7.807264 10.007081 

Interpret the result

The difference between sample means was \(d = -2.2\).

The \(t\) test computed the \(t\) statistic, which standardizes the observed distance between sample means relative to the estimated variance of the population and to the sample sizes. With the random numbers generated above, the value is \(t_{obs} = -2.7572\).

The corresponding p-value is computed as the sum of the areas of the left and right tails of the Student distribution, with \(\nu = 24.44\) degrees of freedom (the Welch–Satterthwaite approximation; with equal variances we would have used \(\nu = n_1 + n_2 - 2 = 30\)). It indicates the probability of obtaining by chance – under the null hypothesis – a result at least as extreme as the one we observed.

In our case, we obtain \(p = 2 \cdot P(T > |t_{obs}|) = 2 \cdot P(T > 2.7572) = 0.0109\). This is lower than our threshold \(\alpha = 0.05\). We thus reject the null hypothesis.

Replicating the test for each feature

In R, loops are quite inefficient, and it is generally recommended to directly run the computations on whole vectors (R has been designed to be efficient for this), or to use specific functions in order to apply a given function to each row/column of a table, or to each element of a list.

For the sake of simplicity, we will first show how to implement a simple but inefficient code with a loop. In the advanced course (STATS2) we will see how to optimize the speed with the apply() function.

## Define the statistics we want to collect
resultColumns <- c("i",       # iteration number
                    "m1",      # first sample mean
                    "m2",      # second sample mean
                    "s1",      # sd estimation for the first population
                    "s2",      # sd estimation for the second population
                    "d",       # difference between sample means
                    "t",       # test statistics
                    "df",      # degrees of freedom
                    "p.value"  # nominal p-value
                    )

## Instantiate a result table to store the results
m <- m0  ## number of tests = number of features in the data tables
resultTable <- data.frame(matrix(nrow = m, ncol = length(resultColumns)))
colnames(resultTable) <- resultColumns # set the column names
# View(resultTable) ## Check the table: it contains NA values
  
## Iterate random number sampling followed by t-tests
## Use the function system.time() to measure the elapsed time.
## Note that system.time() can also enclose a block of several lines within curly brackets.
time.iteration <- system.time(
  for (i in 1:m) {
    ## Generate two vectors containing the values for sample 1 and sample 2, resp.
    x1 <- unlist(group1[i, ]) ## sample 1 values
    x2 <- unlist(group2[i, ]) ## sample 2 values
    
    ## Run the t test
    t.result <- t.test(
      x = x1, y = x2, 
      alternative = "two.sided", var.equal = FALSE)
    # names(t.result)
    
    ## Collect the selected statistics in the result table
    resultTable[i, "i"] <- i
    resultTable[i, "t"] <- t.result$statistic
    resultTable[i, "df"] <- t.result$parameter
    resultTable[i, "p.value"] <- t.result$p.value
    
    ## Compute some additional statistics about the samples
    resultTable[i, "m1"] <- mean(x1) ## Mean of sample 1
    resultTable[i, "m2"] <- mean(x2) ## Mean of sample 2
    resultTable[i, "s1"] <- sd(x1) ## Standard dev estimation for population 1
    resultTable[i, "s2"] <- sd(x2) ## Standard dev estimation for population 1
    resultTable[i, "d"] <- resultTable[i, "m1"] - resultTable[i, "m2"]  ## Difference between sample means
  }
  ## View(resultTable)
)  

print(time.iteration)
   user  system elapsed 
  0.846   0.292   1.107 

Creating a function to reuse the same code with different parameters

Depending on the selected task in the assignments above, we will run different tests with different parameters and compare the results. The most rudimentary way to do this is to copy-paste the chunk of code above for each test and set of parameters required for the assigned tasks.

However, having several copies of an almost identical block of code is a very bad practice in programming, for several reasons:

  • lack of readability: the code rapidly becomes very heavy;
  • difficulty of maintenance: any modification has to be done on each copy of the chunk of code;
  • risk of inconsistency: at some point we will modify one copy and forget to update another one.

A better practice is to define a function that encapsulates the code, and enables us to modify the parameters by passing them as arguments. Hereafter we define a function that

  • takes the parameters of the analysis as arguments

    • population means \(\mu_1\) and \(\mu_2\),
    • population standard deviations \(\sigma_1\) and \(\sigma_2\),
    • sample sizes \(n_1\) and \(n_2\),
    • number of iterations \(r\).
  • runs \(r\) iterations of the t-test with 2 random samples,
  • returns the results in a table with one row per iteration, and one column per resulting statistic (observed \(t\) score, p-value, difference between means, …);

#### Define a function that runs r iterations of the t-test ####

#' @title Repeat a T test with random numbers drawn from a normal distribution
#' @param mu1 mean of the first population
#' @param mu2 mean of the second population
#' @param sigma1 standard deviation of the first population
#' @param sigma2 standard deviation of the second population
#' @param n1 first sample size
#' @param n2 second sample size
#' @param m number of repetitions of the tests
#' @param ... additional parameters are passed to t.test(). This makes it possible to set var.equal, alternative, ...
#' @return  a table with one row per repetition of the test, and one column per statistics
IterateTtest <- function(mu1,    
                         mu2,
                         sigma1,
                         sigma2, 
                         n1,     
                         n2,     
                         m,    
                         ...) {
  
  
  ## Define the statistics we want to collect
  resultColumns <- c("i",       # iteration number
                      "m1",      # first sample mean
                      "m2",      # second sample mean
                      "s1",      # sd estimation for the first population
                      "s2",      # sd estimation for the second population
                      "d",       # difference between sample means
                      "t",       # test statistics
                      "df",      # degrees of freedom
                      "p.value"  # nominal p-value
  )
  
  ## Instantiate a result table to store the results
  resultTable <- data.frame(matrix(nrow = m, ncol = length(resultColumns)))
  colnames(resultTable) <- resultColumns # set the column names
  # View(resultTable) ## Check the table: it contains NA values
  
  ## Iterate random number sampling followed by t-tests
  for (i in 1:m) {
    ## Generate two vectors containing the values for sample 1 and sample 2, resp.
    x1 <- rnorm(n = n1, mean = mu1, sd = sigma1) ## sample 1 values
    x2 <- rnorm(n = n2, mean = mu2, sd = sigma2) ## sample 2 values
    
    ## Run the t test
    t.result <- t.test(
      x = x1, y = x2,
      ...)  ## extra arguments (e.g. var.equal, alternative) are forwarded here
    # names(t.result)
    
    ## Collect the selected statistics in the result table
    resultTable[i, "i"] <- i
    resultTable[i, "t"] <- t.result$statistic
    resultTable[i, "df"] <- t.result$parameter
    resultTable[i, "p.value"] <- t.result$p.value
    
    ## Compute some additional statistics about the samples
    resultTable[i, "m1"] <- mean(x1) ## Mean of sample 1
    resultTable[i, "m2"] <- mean(x2) ## Mean of sample 2
    resultTable[i, "s1"] <- sd(x1) ## sd estimate for population 1
    resultTable[i, "s2"] <- sd(x2) ## sd estimate for population 2
    resultTable[i, "d"] <- resultTable[i, "m1"] - resultTable[i, "m2"] ## Difference between sample means
  }
  
  #### Compute multiple testing corrections ####

  ## E-value
  ## Expected number of FP
  resultTable$e.value <- resultTable$p.value * nrow(resultTable)
  
  ## Family-Wise Error Rate (FWER)
  ## Probability to have at least one FP given the p-value
  resultTable$FWER <- pbinom(q = 0, 
                             size = nrow(resultTable), 
                             prob = resultTable$p.value, 
                             lower.tail = FALSE)

  
  ## Multiple testing corrections
  for (method in p.adjust.methods) {
    resultTable[, method] <- p.adjust(resultTable$p.value, method = method)
  }
  
  # ## Storey - Tibshirani q-value
  # qvalResult <- qvalue::qvalue(p = unlist(resultTable$p.value))
  # resultTable$q.value <- qvalResult$qvalues
  
  return(resultTable) ## This function returns the result table
}

We can now use this function to iterate the \(t\) test with the parameters we want. Let us measure the running time.
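For instance, the two runs below (a sketch; parameter values taken from the experimental setting above, with var.equal forwarded to t.test() via ...) produce timings like the ones shown underneath:

## Iterated tests under H0 (equal means) and under H1 (different means)
time.H0 <- system.time(
  resultsH0 <- IterateTtest(mu1 = 10, mu2 = 10, sigma1 = 2, sigma2 = 3,
                            n1 = 16, n2 = 16, m = 10000,
                            var.equal = FALSE))
time.H1 <- system.time(
  resultsH1 <- IterateTtest(mu1 = 7, mu2 = 10, sigma1 = 2, sigma2 = 3,
                            n1 = 16, n2 = 16, m = 10000,
                            var.equal = FALSE))
print(time.H0)
print(time.H1)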

   user  system elapsed 
  0.598   0.247   0.817 
   user  system elapsed 
  0.561   0.233   0.767 

This function can then be used several times, with different values of the parameters.

P-value histograms

[Figures: p-value histograms of the multiple test results with increasing effect sizes]
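A sketch of the kind of code producing such a histogram (resultsH1 denotes a result table returned by IterateTtest(), as in the run above):

## Histogram of the nominal p-values: flat under H0,
## skewed towards small values under H1
hist(resultsH1$p.value, breaks = 20,
     main = "P-value histogram", xlab = "nominal p-value",
     col = "#BBDDFF")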

Volcano plots

[Figures: volcano plots of the multiple test results with increasing effect sizes]
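A sketch of a basic volcano plot from the same result table (the BH column comes from the p.adjust() loop in IterateTtest()):

## Volcano plot: effect size (x) versus significance (y)
plot(resultsH1$d, -log10(resultsH1$p.value),
     xlab = "difference between sample means (d)",
     ylab = "-log10(p-value)", pch = 20,
     col = ifelse(resultsH1$BH < 0.05, "red", "grey"))
abline(h = -log10(0.05), lty = "dashed")  ## nominal alpha threshold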

Interpretation of the results

We should now write a report of interpretation, which will address the following questions.

  • Based on the experiments under \(H_0\), compute the number of false positives and estimate the false positive rate (FPR). Compare these values with the E-value (expected number of false positives) for the 1000 tests, and with your \(\alpha\) threshold.

  • Based on the experiments under \(H_1\), estimate the sensitivity (Sn) of the test for the different mean differences tested here.

  • Interpret the histograms of p-values obtained with the different parameters.

  • Draw a power curve (i.e. the sensitivity as a function of the actual difference between population means).

  • Discuss the adequacy of the test with respect to the conditions of our simulations.

  • Do these observations correspond to what would be expected from the theory?