DU Bii - module 3: R and stats


Session 1: tutorial on dataframes

Wednesday 3rd of March, 2021

teachers: Claire Vandiedonck & Anne Badel; helpers: Antoine Bridier-Nahmias, Bruno Toupace, Clémence Réda, Jacques van Helden

Content of this tutorial:

  1. Some reminders on R basics
        1.0. What is R?
        1.1. R as a calculator 
        1.2. Assigning data to R objects, using and reading them  
        1.3. Managing your session
        1.4. Managing objects in your R Session
        1.5. Saving your data, session, and history
             a. Data: specific variables or functions to save
             b. Session: save all variables and functions
             c. History: save all past commands
        1.6. Classes and types of R objects
             a. classes of objects
             b. main data structures in R
                 1. Vectors
                 2. Matrices
  2. Dataframes
         2.1. Creating a dataframe
         2.2. Reading a text file into R and vice versa
         2.3. Subsetting a dataframe on several criteria
         2.4. Merging dataframes
         2.5. Some basic plotting

Before going further

Caution:
1. Create a new directory "Rsession1" in your home with a right click in the left-hand panel of the lab.
2. Save a backup copy of this notebook in this folder: in the left-hand panel, right-click on this file, select "Duplicate", rename the copy with your name (e.g. "tutorial_dataframes_vandiedonck.ipynb") and move it to the new folder.
You can also make backups during the analysis. Don't forget to save your notebook regularly.
Warning: you are strongly advised to run the cells in the indicated order. If you want to rerun cells above, you can just restart the kernel to start at 1 again.
About jupyter notebooks:
- To add a new cell, click on the "+" icon in the toolbar above your notebook
- You can "click and drag" to move a cell up or down
- You choose the type of cell in the toolbar above your notebook:
    - 'Code' to enter command lines to be executed
    - 'Markdown' cells to add text, that can be formatted with some characters
- To execute a 'Code' cell, press SHIFT+ENTER or click on the "play" icon
- To display a 'Markdown' cell, press SHIFT+ENTER or click on the "play" icon
- To modify a 'Markdown' cell, double-click on it

To make nice html reports with markdown: html visualization tool 1 or html visualization tool 2, to draw nice tables, and the Ultimate guide.
Further reading on JupyterLab notebooks: Jupyter Lab documentation.

=> About this jupyter notebook

This is a Jupyter notebook in R, meaning that the commands you enter or run in Code cells are directly understood by the server in the R language.
You could run the same commands in a Terminal or in RStudio.

In this tutorial, you will run one cell at a time.

I. Some reminders on R basics



I.0 - What is R?


R is available on this website: https://www.r-project.org

The language is:

R is a statistical programming language. The project started in 1993. We are currently at version 4.0.4 (15/02/2021). There is a new release twice a year.

R includes a "core language" called R base, which can be extended with more than 3000 contributed packages. A package is a set of functions.

R can be used for:

  1. data manipulation: import, format, edit, export
  2. statistics
  3. advanced graphics

Some useful links

I.1 - R as a calculator


Some very simple examples

You can directly use R to perform mathematical operations with the usual operators: +, -, * to multiply, ^ to raise to a power, / to divide, %% to get the modulo.

You can use built-in functions like round(), log(), mean()...

You can nest functions, in the following example, exp() is nested in round()

For some functions, you need to enter several arguments. In the example below, we add the base argument for the log() function.
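
The code cells of the original notebook are not reproduced here, so the following is only a minimal sketch of the kind of commands described above. The numbers are arbitrary examples.

```r
# Usual arithmetic operators
2 + 3        # 5
7 %% 3       # 1, the remainder of 7 divided by 3
2^10         # 1024

# A built-in function nested in another one: exp() is evaluated first,
# then its result is passed to round()
round(exp(1), digits = 2)   # 2.72

# A function with an additional argument: logarithm of 8 in base 2
log(8, base = 2)            # 3
```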

Getting help on functions:

To know which arguments to use, it is recommended to always look at the help page of a function. To do so, enter the name of the function after ?, or call help() with the name of the function in the brackets. A help page will be displayed with different sections:

I.2 - Assigning data into R objects, using and reading them


We can store values in R objects/variables to reuse them in another command. To do so, use <- made with < and -. An alternative is to use =, but it is not recommended for code clarity.

Let's assign for example 2 to x:

To know what is in x just enter x:

We can do operations on x:

You can then assign the result of an operation on x to y.

To get the result y, enter it in the next command:

Caution: If you assign a new value to x, y will not change because the result of the operation x+3 was stored in y, not the operation "x + 3" itself.

So you would have to rerun the command assigning x+3 to y to change the value of y.
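
The whole sequence above can be sketched in one cell as follows. The value 10 used for the reassignment is an arbitrary choice made for this illustration.

```r
x <- 2        # assign 2 to x
x             # display the content of x
x + 3         # use x in an operation, without storing the result
y <- x + 3    # store the result (5) in y
y

x <- 10       # assign a new value to x
y             # y is still 5: only the result was stored, not the formula

y <- x + 3    # rerun the assignment to update y
y             # y is now 13
```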

In addition to numeric values, we can store other kinds of data in an object. For example, we will put a string of characters in s. Strings of characters have to be entered between "quotes".

Of note, you can check the type of an R object using class().

It is important that numeric values are properly encoded as numeric in R and not as strings of characters.

If you try to add "1" and 3, an error message is returned here since we are trying to make an impossible operation:

If you are using numeric variables, the operation can be done:
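
As a hedged illustration of the points above, the sketch below stores a string, checks classes and shows the failing addition. The use of tryCatch() is an addition of this sketch, not something the tutorial introduces: it captures the error message so that the rest of the cell can still run.

```r
s <- "this is a string of characters"
class(s)        # "character"
class(3)        # "numeric"

# Adding a string and a number is an error; we keep the message in msg
msg <- tryCatch("1" + 3, error = function(e) conditionMessage(e))
msg

# With numeric values, the operation can be done
total <- as.numeric("1") + 3
total           # 4
```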

I.3 - Managing your session


When working with R, it is always a good practice to document the R version you are using and the packages that are loaded. The function is sessionInfo().

As you can see, version 4.0.2 is the one installed on the IFB core cluster. By default, some "base" packages like stats are loaded. We will see in the next R session that we can load other packages.

To know your current working directory, use the function getwd(). The result should be like this:`'/shared/`.

Then we change it, with the function setwd(), to the Rsession1 folder in your home directory.
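
A minimal sketch of these session commands, assuming the working directory is read with getwd() and changed with setwd(). Here tempdir() is only a stand-in for your own folder, so that the example runs anywhere; in the tutorial you would give the path to the folder created in your home directory.

```r
sessionInfo()            # R version, platform and loaded packages

old_wd <- getwd()        # current working directory
old_wd

setwd(tempdir())         # stand-in path: replace it with your own folder
new_wd <- getwd()
new_wd

setwd(old_wd)            # come back to the initial working directory
```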

I.4 - Managing objects in your R Session and working directory


The objects x, y and s you have created above are only present in your R session, but they are not written in your working directory on the computer -> they are not present in the left-hand panel of Jupyter Lab.

So, to know which objects you have in your R session, you can use the function ls(), which has the same name as the Unix/bash command that lists files. The only difference is that in R you add brackets to call a function.

Similarly, you can get rid of an object with the function rm().

Conversely, you can also look at the data on your computer from R with the function dir() or list.files(). With the second function, you can add an argument to specify a pattern of interest.
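
The functions of this section could be used as sketched below. The three objects are recreated here only to make the example self-contained, and the ".ipynb" pattern is an arbitrary example.

```r
x <- 2; y <- 5; s <- "text"   # example objects for this sketch

ls()                          # names of the objects in the R session
rm(y)                         # remove y from the session
ls()                          # y is no longer listed

dir()                         # files in the working directory
list.files()                  # same result
list.files(pattern = "\\.ipynb$")   # only the files ending in .ipynb
```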

I.5 - Saving your data, session, and history


Before quitting R, you will probably want to save objects and other session information on your computer to be able to find them again next time you use R. By default, all the data and files you save will be saved in your working directory.

a - Saving specific data (or functions)

The function save() is used to save a specific object in your computer. You will have to give a name to the file on your computer. Generally, we save them with the extension .RData.

With the above command, you should have created the file x.RData in your working directory. Check it is present on the left-hand panel of Jupyter Lab.
Now, if you remove x from your R session, you can load it back again with the load() function.

You can also delete the file from the working directory with the function file.remove().

Instead of saving a single object, you can save several by listing them all as separate arguments in the save() function.
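
Here is a hedged sketch of the save / load cycle described above. The file is written in tempdir() only to keep the example self-contained; in the tutorial it would simply be saved in the working directory.

```r
x <- 2; y <- 5                                  # example objects
rdata_file <- file.path(tempdir(), "x.RData")   # demo location

save(x, file = rdata_file)      # write x on the computer
rm(x)                           # remove x from the R session
load(rdata_file)                # x is back in the session
x

save(x, y, file = rdata_file)   # several objects in the same file
file.remove(rdata_file)         # delete the file from the computer
```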

b - Saving all variables (and functions) at once

When you want to save all objects at once, it is even more efficient to use the function save.image().

And similarly, you can load them all back after removing all objects in the session or starting a new one.

c - Saving "history" = all past commands

Do not run. It does not work in R notebooks, where no history is saved because we are running independent cells! The command below would be the one to run in an R shell (Terminal > R) or in RStudio (change "lab" in the URL to "rstudio").

I.6 - Classes and types of R objects


a - Classes of R objects

The main types of variables are :

To have a more classical R display than in a notebook, you can add print().

x contains 4 numeric values. We can check it is numeric with the function is.numeric().

It returns the logical value TRUE.

You can also perform tests that will return logical values. Below we test whether the values in x are below 2.

Only the third value of x is < 2. Similarly, we can test which values of x are equal to 2.

In R, the function class() returns the class of the object. The functions is.logical(), is.numeric(), is.character(), ... test whether the values are of this type. If needed, you can convert from one type to another with as.numeric(), as.logical(), ...

Coercion rules: when elements of different types are converted or concatenated, they are coerced to the most general type, in this order: logical < integer < numeric < complex < character < list
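
The tests and coercion rules above can be tried with a sketch like this one. The four values are made up, chosen so that only the third one is below 2 as in the text.

```r
x <- c(3, 2, 1, 2)     # made-up values
print(x)

is.numeric(x)          # TRUE
x < 2                  # FALSE FALSE  TRUE FALSE
x == 2                 # FALSE  TRUE FALSE  TRUE

as.character(x)        # explicit conversion to character

# Coercion when concatenating elements of different types
class(c(TRUE, 2L))     # "integer"
class(c(2L, 3.5))      # "numeric"
class(c(3.5, "a"))     # "character"
```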

b. Main data structures in R

There are 4 main data structures in R. The heterogeneous ones accept several classes inside.

| object    | Can it be heterogeneous? |
|-----------|--------------------------|
| vector    | no                       |
| matrix    | no                       |
| dataframe | yes                      |
| list      | yes                      |

1 - Vectors

Remark:
In a Jupyter notebook like this one, each item of a vector is displayed by default separated by a `.`. If you prefer the classical display of the R console, where the items are printed in a single row, use the function print().

You can check the class of a vector but also get some information on its length with length() and structure with str().
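
A short sketch of these functions on a vector. The vector is called "weight" as later in the tutorial, but its values are made up for this example.

```r
weight <- c(60, 72, 57, 90, 95, 72)   # made-up values, in kg

weight            # notebook display
print(weight)     # classical console display, in a single row
class(weight)     # "numeric"
length(weight)    # 6
str(weight)       # compact description of the structure
```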


2 - Matrices

The function to create a matrix is matrix()

Thus by default, a matrix is filled by columns but you can change this behaviour and fill it by rows.

Printing the matrix shows you [i,j] coordinates, where i is the index of the row and j that of the column.
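
As an illustration of the filling order, here is a minimal sketch with the integers 1 to 6. The object names are chosen for this example only.

```r
m_col <- matrix(1:6, nrow = 2)                 # default: filled by columns
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled by rows

m_col          # first row is 1 3 5
m_row          # first row is 1 2 3

m_row[2, 1]    # value in row 2, column 1: 4
```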

II - Dataframes



Dataframes are two-dimensional objects that can be heterogeneous between columns (but homogeneous within a column)

II.1. - Creating a dataframe:


This can be done using existing vectors of the same length, like the previously generated "weight", "size" and "bmi".

If you do not wish to do the tutorial stepwise and prefer to start directly here, you first need to run all the cells above in order to have all required objects loaded in the session. To do so, click on "Run" in the top menu and select "Run All Above Selected Cell".

The obtained dataframe looks pretty much like the previous matrix myData2.

Note that if the vectors used to generate the dataframe are character strings, it is advised in versions < 4 to add the argument stringsAsFactors=FALSE
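
A minimal sketch of this step is given below. The vectors are recreated with made-up values so that the example runs on its own; the values used in the original notebook are not known here.

```r
weight <- c(60, 72, 57, 90, 95, 72)               # kg, made-up values
size   <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)   # m, made-up values
bmi    <- weight / size^2

# stringsAsFactors = FALSE has no effect on numeric vectors,
# but it is harmless and protects character vectors in R < 4
myDataf <- data.frame(weight, size, bmi, stringsAsFactors = FALSE)

myDataf
class(myDataf)   # "data.frame"
dim(myDataf)     # 6 rows and 3 columns
str(myDataf)
```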

If the vectors that will generate the dataframe do not exist yet in the session, but you would like to initiate a dataframe to fill it during your analysis, you could imagine creating an empty dataframe. But this method is useless as it is impossible to fill the generated dataframe having 0 columns and rows.

In that case, it is better to create an empty matrix and to convert it to a dataframe. See below.

Let's try with the object myData2 we previously created. It is a matrix:

You may also use as.data.frame() on a matrix generated by binding rows or columns:

So, similarly, we can convert an empty matrix into a dataframe, like in this example with a matrix of two rows and three columns currently filled with missing values:
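
The conversion could look like the sketch below. The object names and the value 42 are hypothetical.

```r
empty_matrix <- matrix(NA, nrow = 2, ncol = 3)   # filled with missing values
empty_df <- as.data.frame(empty_matrix)

empty_df                 # columns are named V1, V2 and V3 by default
empty_df[1, 1] <- 42     # the cells can now be filled during the analysis
empty_df
```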

You may use the same functions as the ones used for matrices: rownames() and colnames():

But it is better to use the functions dedicated to dataframes which are row.names() and names():

Caution: each row name must be unique in a dataframe!
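
A short sketch of the dedicated functions, on a small dataframe built for this example. The first names and the new column name are hypothetical.

```r
toy <- data.frame(weight = c(60, 72, 57), size = c(1.75, 1.80, 1.65))

row.names(toy) <- c("Peter", "Mary", "John")   # must all be different
names(toy)                                     # "weight" "size"
names(toy)[2] <- "height"                      # rename the second column
toy
```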

To follow more easily, let's first display myDataf again

Variables are the columns of a dataframe. You can extract the vector corresponding to a column from a dataframe with its index, with the name of the column inside "", or using the symbol $:

You have two options to do so:

  1. either by specifying the index of the row
  2. or by giving its name within "" inside the square brackets:

In both cases, you may notice that you obtain a dataframe and not a vector, even if you extract only one row. If you wish to get the vector corresponding to a row, you have to convert it with the unlist() function.
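
The extraction of columns and rows can be sketched as follows, on a small hypothetical dataframe with made-up names and values.

```r
toy <- data.frame(weight = c(60, 72, 57), size = c(1.75, 1.80, 1.65),
                  row.names = c("Peter", "Mary", "John"))

# Three equivalent ways to extract a column as a vector
toy[, 2]
toy[, "size"]
toy$size

# Extracting a row, by index or by name, returns a dataframe
toy[2, ]
one_row <- toy["Mary", ]
class(one_row)               # "data.frame"

# unlist() converts the row into a (named) vector
row_vec <- unlist(toy[2, ])
row_vec
```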

Of note, a dataframe is a special case of a list: a list of variables that all have the same number of rows, with unique row names.

Your turn: have a look at slide 31 and start thinking of answers on your own -> we will discuss the solutions together.
  1. either you add one vector at a time:

Here is another example to add a column "sex" to the dataframe myDataf using a vector called "sex". I changed the name of the vector but you could keep the same name!

  2. or add several vectors or several columns from another dataframe at once using data.frame():

Caution: You could also use cbind(), but it is risky as cbind() is rather a function for matrices. If you use it for dataframes, it will keep the data types only if you combine several variables of both dataframes. If you take only one variable from a dataframe, cbind() will convert it to a vector, with a possible risk of coercion and of factorisation in versions of R < 4.
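
Both ways of adding columns are sketched below. All names ("toy", "gender", "extra") and values are hypothetical and only serve the illustration.

```r
toy <- data.frame(weight = c(60, 72, 57), size = c(1.75, 1.80, 1.65))

# 1. one vector at a time
gender <- c("Man", "Woman", "Man")
toy$sex <- gender

# 2. several columns at once, here taken from another dataframe
extra <- data.frame(age = c(30, 41, 25), smoker = c(FALSE, TRUE, FALSE))
toy2 <- data.frame(toy, extra)

names(toy2)    # "weight" "size" "sex" "age" "smoker"
str(toy2)      # the type of each column is preserved
```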

II.2. - Reading a text file into R and vice versa


a. reading a text file into R

The function read.table() reads a delimited text file (tab-delimited, csv or with another column separator) into R and generates a dataframe.

Before importing the file Temperature.txt, let's see what it looks like. Just double-click on it. It is located in /shared/projects/dubii2021/trainers/module3/data/

You will see it is a tab-delimited text file.

Now let's import it in R by specifying the correct separator with the read.table() function:

In the above command, I used the argument stringsAsFactors=FALSE to avoid a factorisation of the columns with strings of characters (here the "Month" column). In R versions < 4, the default value for this argument is TRUE. Let's see what would have happened:

Here the "Month" column has been factorised. How?

By alphabetic order, which is not what you want! Thus always use stringsAsFactors=FALSE.

Personal work: to better understand the behaviour of factors, you will follow a tutorial on factors which will be available on Friday on the module webpage.

b. writing a dataframe on your computer

Conversely, save a dataframe into your working directory with write.table():

Have a look at it by double clicking on it in your working directory.

and check you can import it back in R again:
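
Since the content of Temperature.txt is not shown here, the sketch below uses a small made-up table written in tempdir() to illustrate the round trip between write.table() and read.table(), and the effect of stringsAsFactors. The file name, column names and values are all hypothetical.

```r
temps <- data.frame(Month = c("January", "February", "March"),
                    Paris = c(5, 6, 9),
                    stringsAsFactors = FALSE)
tmp_file <- file.path(tempdir(), "temperatures_demo.txt")

# Write a tab-delimited text file, without quotes or row names
write.table(temps, file = tmp_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back, keeping strings as characters
temps_back <- read.table(tmp_file, sep = "\t", header = TRUE,
                         stringsAsFactors = FALSE)
str(temps_back)

# Same import with factorisation, as by default in R < 4
temps_fact <- read.table(tmp_file, sep = "\t", header = TRUE,
                         stringsAsFactors = TRUE)
levels(temps_fact$Month)   # alphabetical: "February" "January" "March"
```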

II.3. - Subsetting a dataframe


a. The function which() returns the index of what is TRUE in a tested condition:

Here, we obtain a vector where 3, 4 and 6 correspond to the positions or indexes (1-based) of the occurrences of "Woman" in the vector/variable myDataf$sex. We can then use this vector as usual in a dataframe before the "," to select the corresponding rows.

Instead of "==" one can use != for "is different" to detect what does not match.

Another method would be to add ! for "not" before the test, to get the complementary result:

Caution: What happens if you do not use `which()`?

Let's make a copy of our dataframe and replace the gender of Claire by a missing value:

and rerun the same command as above without which() on the new myDataf2:

Caution: If you have missing data and you forget to use which(), you will also return them. => Always use which()
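
The behaviour of which() with and without missing values can be checked with this sketch. The dataframe is a made-up stand-in with "Woman" at positions 3, 4 and 6, and the missing value is arbitrarily placed in row 3.

```r
toy <- data.frame(id = 1:6,
                  sex = c("Man", "Man", "Woman", "Woman", "Man", "Woman"),
                  stringsAsFactors = FALSE)

which(toy$sex == "Woman")            # 3 4 6
toy[which(toy$sex == "Woman"), ]     # the corresponding rows
which(toy$sex != "Woman")            # 1 2 5
which(!toy$sex == "Woman")           # 1 2 5 as well

# Copy with a missing value in row 3
toy2 <- toy
toy2$sex[3] <- NA

# Without which(), the NA produces an extra row filled with NA
nrow(toy2[toy2$sex == "Woman", ])          # 3
# With which(), only the rows where the test is TRUE are kept
nrow(toy2[which(toy2$sex == "Woman"), ])   # 2
```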

b. One can also search for a pattern with grep():

It returns the index of what matches, even partially.

c. The function subset() is even simpler than which():

Just enter the dataframe as first argument, and the variable without "quotes" on which you do the filtering followed by the condition.

d. You can even combine conditions:
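
Points b to d can be sketched on a small made-up dataframe. The thresholds 60 and 80 are arbitrary.

```r
toy <- data.frame(sex = c("Man", "Man", "Woman", "Woman", "Man", "Woman"),
                  weight = c(60, 72, 57, 90, 95, 72),
                  stringsAsFactors = FALSE)

# b. grep() returns the indexes of partial matches
grep("Wom", toy$sex)                          # 3 4 6

# c. subset(): dataframe first, then the condition on unquoted variables
subset(toy, sex == "Woman")

# d. combined conditions, with & for "and", | for "or"
subset(toy, sex == "Woman" & weight > 60)     # rows 4 and 6
toy[which(toy$sex == "Man" | toy$weight > 80), ]   # rows 1, 2, 4 and 5
```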

II.4. - Merging dataframes: using a column as a "key"

In this example, I add one column with indexes that I will use as a key, but we can also use an existing variable as a key.

Then I generate another dataframe with handedness information on 6 samples, but one sample is new compared to the initial dataframe.

We can now merge them together by specifying the "key" column with the argument by. The all argument is used to keep all the rows of a dataframe that are not present in the other. The .x refers to the first dataframe while .y refers to the second one.

Warning: Adding sort=FALSE prevents the merged dataframe from being sorted by the "key" column.

In the merged dataframe, we start with all the rows present in both dataframes. The next row contains the data only present in the first dataframe with missing data for the columns in the second dataframe. The last rows are the ones with data only present in the second dataframe with missing data for the first dataframe.

Unless the merge is done on the row names (by=0 or by="row.names"), in which case they are kept in an extra column called Row.names, the row names of the initial dataframes are lost. The new dataframe has its own row names.

If two columns have the same name in both dataframes, by default R adds an ".x" to the one from the first dataframe and ".y" to the one of the second dataframe. The names can be changed with the argument suffixes.
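
A hedged sketch of such a merge is given below. The two dataframes, their column names and values are made up: the second one describes 6 samples, one of which (id 7) is absent from the first, and it shares a "weight" column with the first to show the suffixes.

```r
a <- data.frame(id = 1:6, weight = c(60, 72, 57, 90, 95, 72))
b <- data.frame(id = c(7, 5, 4, 3, 2, 1),
                handedness = c("R", "L", "R", "R", "L", "R"),
                weight = c(80, 96, 90, 57, 73, 60))

m_inner <- merge(a, b, by = "id")               # only the 5 shared ids
m_all   <- merge(a, b, by = "id", all = TRUE)   # all 7 ids, sorted by id
names(m_all)    # "id" "weight.x" "handedness" "weight.y"
m_all           # NA where an id is missing in one dataframe

# Unsorted result with custom suffixes for the shared column name
m_named <- merge(a, b, by = "id", all = TRUE, sort = FALSE,
                 suffixes = c("_first", "_second"))
names(m_named)
```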


II.5 - Some basic plotting

We will see in more depth how to generate basic plots for different kinds of variables in your personal work on Wednesday, and during session 2 of R how to generate custom plots either with R base or ggplot.

But let's have a quick view of what can be done on our dataframe.

a. scatter plot with the function plot()

b. Representation of quantitative data distribution:

or using ~ to display boxplots on the same plot depending on a categorical variable:
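
These basic plots could be obtained as in the sketch below, on a made-up dataframe. The axis labels are an optional addition of this example.

```r
toy <- data.frame(sex = c("Man", "Man", "Woman", "Woman", "Man", "Woman"),
                  weight = c(60, 72, 57, 90, 95, 72),
                  size = c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91))

# a. scatter plot of weight against size
plot(toy$size, toy$weight, xlab = "size (m)", ylab = "weight (kg)")

# b. distribution of a quantitative variable
h <- hist(toy$weight)        # histogram
boxplot(toy$weight)          # single boxplot

# one boxplot per category, using the formula notation with ~
bp <- boxplot(weight ~ sex, data = toy)
```

Both hist() and boxplot() also return, invisibly, the values used to draw the plot, which is why their results can be stored in h and bp.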

We will further see that graphs have three-level functions:

  1. primary graph functions like plot(), boxplot() and hist() to display the main graphs in R

  2. secondary graph functions to complement an existing plot

  3. graphical parameters to modify the plots display:

    • either as options of the primary and secondary functions
    • or permanently with the par() function before plotting the graph.


Success: Well done! You now know all the main functions to create and manipulate dataframes.

Let's save all the main objects of this session into an RData file:

We will keep myDataf and temperatures.

Caution:
Don't forget to save your notebook and export a copy as an html file as well
- Open "File" in the Menu
- Select "Export Notebook As"
- Export notebook as HTML
- You can then open it in your browser even without being connected to the IFB Jupyter hub!