
# DU Bii - module 3: R and stats
---
## **Session 1: tutorial on dataframes**
*Wednesday 3rd of March, 2021*

teachers: Claire Vandiedonck & Anne Badel; helpers: Antoine Bridier-Nahmias, Bruno Toupace, Clémence Réda, Jacques van Helden

*Content of this tutorial:*

1. Some reminders on R basics
           1.0. What is R?
           1.1. R as a calculator 
           1.2. Assigning data to R objects, using and reading them  
           1.3. Managing your session
           1.4. Managing objects in your R Session
           1.5. Saving your data, session, and history
                a. Data: specific variables or functions to save
                b. Session: save all variables and functions
                c. History: save all past commands
           1.6. Classes and types of R objects
                a. classes of objects
                b. main data structures in R
                    1.Vectors
                    2.Matrices
2. Dataframes
            2.1. Creating a dataframe
            2.2. Reading a text file into R and vice versa
            2.3. Subsetting a dataframe on several criteria
            2.4. Merging dataframes
            2.5. Some basic plotting


---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b><br> 
    <b>1. Create a new directory "RSession1" </b> in your home with a right click in the left-hand panel of the lab.<br>
    <b>2. Save a backup copy of this notebook in this folder </b>: in the left-hand panel, right-click on this file and select "Duplicate" and add your name, e.g: "tutorial_dataframes_vandiedonck.ipynb" and move it to the proper folder<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly.
</div>

<div class="alert alert-block alert-warning"><b>Warning:</b> you are strongly advised to run the cells in the indicated order. If you want to rerun cells above, you can just restart the kernel to start at 1 again. </div>

<div class="alert alert-block alert-info"> 
    
<b><em> About jupyter notebooks:</em></b> <br>

- To add a new cell, click on the "+" icon in the toolbar above your notebook <br>
- You can "click and drag" to move a cell up or down <br>
- You choose the type of cell in the toolbar above your notebook: <br>
    - 'Code' to enter command lines to be executed <br>
    - 'Markdown' cells to add text, that can be formatted with some characters <br>
- To execute a 'Code' cell, press SHIFT+ENTER or click on the "play" icon  <br>
- To display a 'Markdown' cell, press SHIFT+ENTER or click on the "play" icon  <br>
- To modify a 'Markdown' cell, double-click on it <br>
<br>    

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>
    
    
</div>   

__*=> About this jupyter notebook*__

This is a Jupyter notebook in **R**, meaning that the commands you will enter or run in `Code` cells are directly understood by the server in the R language.
<br>You could run the same commands in a Terminal or in RStudio. 


> In this tutorial, you will run one cell at a time.    



## **I. Some reminders on R basics**
---
---

### **I.0 What is R ?**
---

R is available on this website: https://www.r-project.org

The language is:
- open-source
- available for Windows, Mac and Unix
- widely used in academia, finance, pharma, social sciences...

R is a statistical programming language. The project started in 1993. The current version is 4.0.4 (15/02/2021). There is a new release twice a year.

R includes a "core language" called `R base`, which is extended by more than 3000 contributed packages. A package is a set of functions.

R can be used for:
1. data manipulation: import, format, edit, export
2. statistics
3. advanced graphics

***Some useful links***
- Quick R: https://www.statmethods.net/index.html
- Emmanuel Paradis tutorial: [in French](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf) or [in English](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf)
- R cheatsheet: https://rstudio.com/resources/cheatsheets/
- R style guide: https://google.github.io/styleguide/Rguide.html

### **I.1 - R as a calculator**
---

***Some very simple examples***

You can directly use R to perform mathematical operations with the usual operators: `+`, `-`, `*` to multiply, `^` to raise to the power, `/` to divide, `%%` to get the modulo.

In [None]:
2+2
2-3
6/2
10/3
10%%3
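
The modulo operator has a companion, `%/%`, which returns the integer quotient. The short sketch below, with arbitrary numbers, also shows how parentheses change the order in which operations are evaluated.

```r
10 %/% 3                    # integer division: quotient of 10 divided by 3
10 %% 3                     # modulo: remainder of the same division
(10 %/% 3) * 3 + 10 %% 3    # quotient * divisor + remainder gives back 10
2 + 3 * 4                   # multiplication is evaluated before addition
(2 + 3) * 4                 # parentheses are evaluated first
```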

You can use built-in functions like `round()`, `log()`, `mean()`...


In [None]:
mean(c(1,2)) # as we will see, several values must first be concatenated with c()
exp(-2)

You can nest functions, in the following example, `exp()` is nested in `round()`

In [None]:
round(exp(-2), 2)

For some functions, you need to enter several arguments. In the example below, we add the `base` argument for the `log()` function.

In [None]:
log(100,base=10) #we want to get the log of 100 in base 10 
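
As a small illustration of how arguments work, the sketch below calls `log()` with its default base, with a named argument and with a positional argument. `log10()` and `log2()` are base R shortcuts for two common bases.

```r
log(100)              # natural logarithm (base e) is the default
log(100, base = 10)   # argument passed by name
log(100, 10)          # same argument passed by position
log10(100)            # dedicated shortcut for base 10
log2(8)               # dedicated shortcut for base 2
```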

***Getting help on functions:***

To know which arguments to use, it is recommended to always look at the help page of a function. To do so, enter `?` followed by the name of the function, or put the name of the function inside the parentheses of `help()`. A help page will be displayed with different sections:

- description: what is the purpose of the function?
- usage: how is it used?
- arguments: which parameters are used by the function. Default values may be specified.
- details: technical description of the function
- value: type of the output returned by the function
- see also: similar functions in R
- source/references: not always
- example: concrete examples -> the best way to learn how it works!

In [None]:
help(round)

In [None]:
?exp

### **I-2 - Assigning data into R objects, using and reading them**
---

We can store values in R objects/variables to reuse them in another command.
To do so, use the assignment operator `<-`, typed as `<` followed by `-`. *An alternative is to use `=`, but it is not recommended for code clarity.*

Let's assign for example `2` to `x`:

In [None]:
x <- 2

To know what is in `x` just enter `x`:

In [None]:
x

We can do operations on x:

In [None]:
x+x

You can then assign an operation with `x` to `y` .

In [None]:
y <- x+3

To get the result y, enter it in the next command:

In [None]:
y

In [None]:
x <- 4
y

<div class="alert alert-block alert-danger"><b>Caution:</b> 
If you assign a new value to x, y will not change because the result of the operation x+3 was stored in y, not the operation "x + 3" itself.
</div>

So you would have to rerun the command assigning `x+3` to y to change the value of y.

In [None]:
y <- x+3
y

In addition to numeric values, we can store other kinds of data in an object. For example, we will put a character string in `s`. Character strings have to be entered between "quotes".

In [None]:
s <- "this is a string of characters"
s

Of note, you can check the type of an R object using `class()`.

In [None]:
class(x)
class(s)

It is important that numeric values are encoded as numeric in R and not as character strings.

In [None]:
"1"
class("1")
class(1)

If you try to add `"1"` and `3`, an error message is returned since the operation is impossible:

In [None]:
try("1" + 3)# I added the try function to avoid stopping the notebook if you want to run all the cells

If you are using numeric variables, the operation can be done:

In [None]:
1 + 3
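
When a number has been stored as a character string, it can be converted explicitly before the operation. A minimal sketch, in which the object names `n_char` and `n_num` are hypothetical:

```r
n_char <- "1"                  # a number stored as a character string
n_num <- as.numeric(n_char)    # explicit conversion to numeric
class(n_num)
n_num + 3                      # the addition is now possible

# A string that does not represent a number is converted to NA (with a warning)
not_a_number <- suppressWarnings(as.numeric("one"))
not_a_number
```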



### **I.3 - Managing your session**
---

When working with R, it is always a good practice to document the R version you are using and the packages that are loaded. The function is `sessionInfo()`.

In [None]:
sessionInfo()

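As you can see, version 4.0.2 is the one installed on the IFB core cluster. By default, some "base" packages like `stats` are loaded. We will see in the next R session that we can load other packages.

The function `getwd()` returns the current working directory, which is the folder where R reads and writes files by default.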

In [None]:
getwd()

<div class="alert alert-block alert-warning"><b>The result should be like this:</b>`'/shared/`.</div>

Then we change it to the RSession1 folder in your home directory.

In [None]:
setwd('/shared/home/cvandiedonck/RSession1') #change with your login!!!
getwd() #change is visible


### **I-4 - Managing objects in your R Session and working directory**
___


The objects `x`, `y` and `s` you have created above are only present in your R session; they are not written to your working directory on the computer, so they do not appear in the left-hand panel of JupyterLab.

So, to know which objects you have in your R session, you can use `ls()`, which has the same name as the Unix/bash command that lists files. The only difference is that in R you add parentheses to call a function.

In [None]:
ls()

Similarly, you can get rid of an object with the function `rm()`.

In [None]:
rm(y)
ls()

Conversely, you can also look at the data on your computer from R with the function `dir()` or `list.files()`. With the second function, you can add an argument to specify a pattern of interest.

In [None]:
dir()

In [None]:
list.files(pattern="\\.ipynb$") # the pattern is a regular expression: "\\." is a literal dot, "$" marks the end of the name
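
The `pattern` argument of `list.files()` is interpreted as a regular expression, not as a shell wildcard. The sketch below illustrates the difference in a scratch folder created under `tempdir()`; the folder and file names are hypothetical and only serve this example.

```r
demo_dir <- file.path(tempdir(), "pattern_demo")   # hypothetical scratch folder
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("notes.ipynb", "notes_ipynb.txt", "data.RData")))

loose <- list.files(demo_dir, pattern = ".ipynb")       # "." matches any character
strict <- list.files(demo_dir, pattern = "\\.ipynb$")   # literal dot, at the end of the name
print(loose)
print(strict)

unlink(demo_dir, recursive = TRUE)                 # clean up the scratch folder
```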

### **I.5 - Saving your data, session, and history**
___


Before quitting R, you will probably want to save objects and other session information on your computer to be able to find them again next time you use R.
By default, all the data and files you save will be saved in your ***working directory***.

#### **a - Saving specific data *(or functions)***

The function `save()` is used to save a specific object in your computer. You will have to give a name to the file on your computer. Generally, we save them with the extension `.RData`.

In [None]:
save(x,file="x.RData")

With the above command, you should have created the file `x.RData` in your working directory. Check that it is present in the left-hand panel of JupyterLab.<br>
Now, if you remove `x` from your R session, you can load it back again with the `load()` function.

In [None]:
rm(x)
ls()

In [None]:
load("x.RData")
ls()
x #x is again accessible

You can also delete the file from the working directory with the function `file.remove()`.

In [None]:
file.remove("x.RData") #remove file: returns TRUE on successful removal

Instead of saving a single object, you can save several by listing them all as separate arguments in the `save()` function.

In [None]:
save(x,s, file="xands.RData")

In [None]:
file.remove("xands.RData")# to clean the working directory
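
`load()` restores the objects under the names they had when they were saved, and it invisibly returns these names. The sketch below checks this with two hypothetical objects (`demo_num`, `demo_str`) written to a temporary file, so that the working directory is left untouched.

```r
tmp_file <- tempfile(fileext = ".RData")   # temporary file outside the working directory
demo_num <- 2                               # hypothetical objects for this sketch
demo_str <- "some text"
save(demo_num, demo_str, file = tmp_file)

rm(demo_num, demo_str)                      # the objects are gone from the session
restored <- load(tmp_file)                  # load() returns the names of the loaded objects
print(restored)
demo_num                                    # the object is accessible again
file.remove(tmp_file)
```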

#### **b - Saving all variables *(and functions)* at once**

When you want to save all objects, it is even more efficient to use the function `save.image()`.

In [None]:
ls()
save.image(file="AllMyData.RData")

Similarly, you can load them all back after removing all objects from the session or after starting a new one.

In [None]:
rm(list=ls()) # this command removes all the objects on the R session
ls() #all variables have been removed

In [None]:
load("AllMyData.RData")
ls() #all variables are accessible again
file.remove("AllMyData.RData")
ls()

#### **c- Save "history"** = all past commands

<div class="alert alert-block alert-warning"><b> Do not run</b>. It does not work in R notebooks, where no history is saved because we are running independent cells! The commands below are the ones to run in an R shell (Terminal > R) or in RStudio (change "lab" in URL to "rstudio").</div>

In [None]:
# ls()
# savehistory(file="MyHistory.Rhistory") #save all previously run commands in a special formatted file
# loadhistory("MyHistory.Rhistory") #load all commands stored in the specified file
# my_history <- read.delim("MyHistory.Rhistory") #see how the file is formatted: number of line and associated command
# head(my_history)


### **I.6 - Classes and types of R objects**
___


#### **a - Classes of R objects**

The main types of variables are:

- numeric/integer
- character
- logical (FALSE/TRUE/NA)
- factors

In [None]:
x <- c(3,7,1,2) # we define a variable x with 4 numeric values concatenated
x

To have a more classical R display than in a notebook, you can add print().

In [None]:
print(x) 

`x` contains 4 numeric values. We can check that it is numeric with the function `is.numeric()`.

In [None]:
is.numeric(x)

It returns the logical value `TRUE`.

You can also perform tests that return logical values. Below we test whether the values in `x` are below 2.

In [None]:
x<2 # we test whether each of the 4 values is < 2

Only the third value of x is < 2. Similarly, we can test which values of x are equal to 2.

In [None]:
x==2

In R, the function `class()` returns the class of the object. The functions `is.logical()`, `is.numeric()`, `is.character()`, ... test whether the values are of this type. If needed, you can convert between types with `as.numeric()`, `as.logical()`, ...

In [None]:
class(x)
class(s)
is.character(s)
is.numeric(s)
print(as.numeric(x<2))
is.numeric("1")
is.numeric(as.numeric("1"))
is.numeric(c(1,"1"))

***Coercion rules:*** When elements of different types are concatenated or converted, R coerces them following this hierarchy: `logical < integer < numeric < complex < character < list`
- if character strings are present, everything will be coerced to a character string.
- otherwise logical values are coerced to numbers: TRUE is converted to 1, FALSE to 0
- values are converted to the simplest type required to represent all information
- object attributes (sort of metadata of objects/variables like their names) are dropped
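
The first rules above can be checked on small vectors. The values below are arbitrary:

```r
class(c(TRUE, 2L))          # logical + integer   -> "integer"
class(c(2L, 3.5))           # integer + numeric   -> "numeric"
class(c(3.5, "a"))          # numeric + character -> "character"
print(c(TRUE, FALSE, 5))    # TRUE becomes 1 and FALSE becomes 0
sum(c(TRUE, FALSE, TRUE))   # counting the TRUE values relies on the same rule
```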

#### **b. Main data structures in R**

There are 4 main data structures in R. The heterogeneous ones accept several classes inside.

|   object  | Can it be heterogeneous? |
|:---------:|:------------------------:|
|   vector  |            no            |
|   matrix  |            no            |
| dataframe |            yes           |
|    list   |            yes           |
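
As a quick sketch of the table above, the code below builds one small object of each kind. All object names and values are hypothetical.

```r
v <- c(1.5, 2, 3)                                      # vector: one type only
m <- matrix(1:6, nrow = 2)                             # matrix: one type only, two dimensions
demo_df <- data.frame(id = 1:2, name = c("a", "b"))    # dataframe: one type per column
l <- list(1, "a", TRUE, c(2, 3))                       # list: any object in each element

class(v)
class(m)
class(demo_df)
class(l)
str(demo_df)   # each column of the dataframe keeps its own type
```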



##### **1. Vectors**

- They are the most elementary R objects. They have one dimension. Some functions to create them are `c()`, `seq()`, `:`, `rep()`, `append()`...


In [None]:
a <- c()
a

In [None]:
weight <- c(60, 72, 57, 90, 95, 72)
weight

<div class="alert alert-block alert-info"><b>Remark:</b><br> In a Jupyter notebook, the items of a vector are by default displayed separated by a `.`. If you wish to display a vector in a more classical way, as in the R console, where the values are shown on one row, use the function <b>print()</b>. </div>


In [None]:
print(weight)

In [None]:
4:10
print(4:10)

In [None]:
print(seq(4,10))

In [None]:
print(seq(2,10,2))

In [None]:
print(rep(4,2))

In [None]:
print(rep(seq(4,10,2)))
print(c(rep(1,4),rep(2,4)))
print(c(5,s))

You can check the class of a vector but also get some information on its length with `length()` and structure with `str()`.

In [None]:
class(c(5,s))
length(1:10)
length(weight)
str(weight)
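
The functions listed above accept arguments that control how the vector is built. A short sketch with arbitrary values:

```r
print(seq(from = 0, to = 1, by = 0.25))        # fixed step between values
print(seq(from = 0, to = 1, length.out = 3))   # fixed number of values
print(rep(c("a", "b"), times = 2))             # repeat the whole vector
print(rep(c("a", "b"), each = 2))              # repeat each element in turn
print(append(1:3, 10, after = 1))              # insert 10 after the first element
```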

- You can perform operations directly on vectors:

In [None]:
size <- c(1.75, 1.8, 1.65, 1.9, 1.74, 1.91)
print(size^2)
print(bmi <- weight/size^2 )
print(bmi)

- You can sort them or compute summary statistics (central tendency and dispersion):

In [None]:
print(sort(size))
mean(size)
sd(size)
median(size)
min(size)
max(size)
print(range(size))
summary(size)

- You can extract values from a vector by giving their index (position) inside square brackets `[]`:

In [None]:
print(size)
size[1]
size[2]
size[6]
size[c(2,6)]
size[c(6,2)]
min(size[c(6,2)])
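
Square brackets also accept negative indexes and logical tests. The sketch below uses a copy of the same values under the hypothetical name `heights`, so that it does not modify `size`.

```r
heights <- c(1.75, 1.8, 1.65, 1.9, 1.74, 1.91)   # same values as size
print(heights[-1])              # negative index: everything except the 1st value
print(heights[-c(1, 2)])        # everything except the 1st and 2nd values
print(heights[heights > 1.8])   # logical test: keep the values for which it is TRUE
print(which(heights > 1.8))     # positions of these values
```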

- Finally you can add a name to the different values. Names on vector values are attributes of the vector. Here the function `names()` returns a vector of the names of vector `size`. 

In [None]:
names(size)
names(size) <- c("Fabien","Pierre","Sandrine","Claire","Bruno","Delphine")
size
str(size)
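
Once a vector has names, its values can be extracted by name as well as by position. A minimal sketch on a hypothetical three-value vector:

```r
named_heights <- c(Fabien = 1.75, Pierre = 1.8, Sandrine = 1.65)   # hypothetical named vector
named_heights["Pierre"]                    # one value, selected by its name
named_heights[c("Sandrine", "Fabien")]     # several values, in the requested order
unname(named_heights["Pierre"])            # same value without the name attribute
names(named_heights)[named_heights > 1.7]  # names of the values passing a test
```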


---
##### **2 - Matrices**

- 2-dimension objects (rows x columns)
- contain only one type of variable (e.g. numeric) = homogeneous

The function to create a matrix is `matrix()`

In [None]:
myData <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3)
myData
class(myData)

Thus by default, a matrix is filled by columns but you can change this behaviour and fill it by rows.

In [None]:
myData <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE)
myData

- you can check the dimensions with `dim()` or `str()`, `nrow()` or `ncol()`

In [None]:
print(dim(myData))
str(myData)
nrow(myData)
ncol(myData)

Printing the matrix shows you `[i,j]` coordinates, where `i` is the index of the row and `j` that of the column.

In [None]:
print(myData)

- values can be sliced with the `[]`

In [None]:
myData[1,2] # returns the value of the 1st row and 2nd column

In [None]:
myData[2,1] # returns the value of the 2nd row and 1st column

In [None]:
print(myData[,1]) # returns the values of the vector corresponding to the 1st column

In [None]:
print(myData[2,])  # returns the values of the vector corresponding to the 2nd row

In [None]:
myData[,2:3] # subsets the initial matrix returning a sub-matrix
             # with all rows of the 2nd and 3rd columns from the initial matrix
             # the generated matrix has 2 rows and 2 columns

In [None]:
print(dim(myData[,2:3])) # the generated matrix has 2 rows and 2 columns

In [None]:
class(myData[,1]) # we extract a vector -> thus the class is numeric and no more matrix
length(myData[1,])
length(myData[,1])
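
If a single row or column must remain a matrix, the argument `drop = FALSE` can be added inside the square brackets. The sketch below rebuilds the same matrix under the hypothetical name `m`:

```r
m <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)
class(m[, 1])                       # a vector: the matrix structure is dropped
dim(m[, 1])                         # NULL, since a vector has no dimensions
print(dim(m[, 1, drop = FALSE]))    # 2 rows and 1 column: still a matrix
is.matrix(m[, 1, drop = FALSE])
```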

- Vectors can be associated to generate a matrix with `rbind()` or `cbind()`

In [None]:
myData2 <- cbind(weight, size, bmi)
myData2
myData3 <- rbind(weight, size, bmi)
myData3

- of course, operations can be applied to the values in the matrix

In [None]:
myData2*2
summary(myData2)
mean(myData2)
mean(myData2[,1])
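
Note that `mean()` on a whole matrix pools all values together. To summarise each column or row separately, base R provides `colMeans()`, `rowSums()` and the more general `apply()`. A sketch on a small hypothetical matrix:

```r
small_m <- cbind(weight = c(60, 72, 57), size = c(1.75, 1.8, 1.65))   # hypothetical matrix
mean(small_m)             # one mean over all 6 values
colMeans(small_m)         # one mean per column
rowSums(small_m)          # one sum per row
apply(small_m, 2, max)    # apply any function by column (2) or by row (1)
```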

## **II - Dataframes**
---
---

Dataframes are two-dimensional objects that can be heterogeneous between columns (but homogeneous within a column)

### **II.1. - Creating a dataframe:**
---

- They are generated with the function `data.frame()`:

This can be done **using existing vectors of the same length** like the previously generated "weight", "size" and "bmi".

<div class="alert alert-block alert-warning"><b> If you do not wish to follow the tutorial stepwise and prefer to start directly here:</b> you first need to run all the cells above so that all required objects are loaded in the session. To do so, click on "Run" in the top menu and select "Run all above selected cell".</div>

In [None]:
myDataf <- data.frame(weight, size, bmi)
myDataf

The obtained dataframe looks pretty much like the previous matrix myData2.

In [None]:
class(myDataf)

In [None]:
str(myDataf)

In [None]:
print(dim(myDataf))

>*Note that if the vectors used to generate the dataframe are character strings, it is advised in versions < 4 to add the argument `stringsAsFactors=FALSE`*

If the vectors that will generate the dataframe do not exist yet in the session, but you would like to initialize a dataframe and fill it during your analysis, you could imagine creating an empty dataframe. But this is not practical: the generated dataframe has 0 rows and 0 columns, so, for instance, a column of values cannot be added to it with `$`.

In [None]:
d <- data.frame()
d
dim(d)

In that case, it is better to create an empty matrix and to convert it to a dataframe. See below.

- Dataframes can be generated by **converting a matrix into a dataframe** with `as.data.frame()`

Let's try with the object myData2 we previously created. It is a matrix:

In [None]:
class(myData2)
class(as.data.frame(myData2))
str(as.data.frame(myData2))

You may also use `as.data.frame()` on a matrix generated by binding rows or columns:

In [None]:
d2 <- as.data.frame(cbind(1:2, 10:11))
str(d2)

Similarly, we can convert an empty matrix into a dataframe, as in this example with a matrix of two rows and three columns currently filled with missing values:

In [None]:
d <- as.data.frame(matrix(NA,2,3))
d
dim(d)
str(d)
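
Another possible approach, shown here only as a sketch, is to create a dataframe with zero rows but with named and typed columns, and to append rows with `rbind()`. The object name `results` and its column names are hypothetical.

```r
# Empty dataframe with named, typed columns
results <- data.frame(sample = character(), value = numeric(),
                      stringsAsFactors = FALSE)
str(results)

# Rows can then be appended one at a time with rbind()
results <- rbind(results, data.frame(sample = "s1", value = 0.5))
results <- rbind(results, data.frame(sample = "s2", value = 1.2))
results
```

Appending rows one by one is convenient for a few rows but becomes slow for large tables, where filling a pre-sized matrix or dataframe is preferable.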

- Getting **row and column names** of a dataframe:

You may use the same functions as the ones used for matrices: `rownames()` and `colnames()`:

In [None]:
rownames(d)
colnames(d)

But it is better to use the functions dedicated to dataframes which are `row.names()` and `names()`:

In [None]:
row.names(d)
names(d)

<div class="alert alert-block alert-danger"><b>Caution:</b>
    each row name must be unique in a dataframe!
</div>

- **Getting a variable from a dataframe:**

To follow more easily, let's first display myDataf again:

In [None]:
print(myDataf)

Variables are the columns of a dataframe. You can extract the vector corresponding to a column from a dataframe with its `index`, with the `name` of the column inside `""`, or using the symbol `$`:

In [None]:
print(myDataf[,2])
print(myDataf[,"size"])
print(myDataf$size)

- **Extracting rows from a dataframe:**

You have two options to do so:

1. either by specifying the index of the row

In [None]:
myDataf[2,]

2. or by giving its name within `""` inside the square brackets:

In [None]:
myDataf["Pierre",]

In [None]:
class(myDataf["Pierre",])

In both cases, you may notice that you obtain a dataframe and not a vector, even if you extract only one row.
If you wish to get the vector corresponding to a row, you have to convert it with the `unlist()` function.

>*Of note, a dataframe is a special case of a list: its elements are the columns, which all have the same length, and its row names are unique.*

In [None]:
temp <- unlist(myDataf["Pierre",])
print(temp)
class(temp)
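
Be aware that `unlist()` follows the coercion rules seen earlier: if the row mixes numeric and character columns, every value becomes a character string. A sketch on a hypothetical dataframe named `people`:

```r
people <- data.frame(weight = c(60, 72), sex = c("Man", "Woman"),
                     row.names = c("Fabien", "Pierre"),
                     stringsAsFactors = FALSE)   # hypothetical mixed dataframe
row_vec <- unlist(people["Pierre", ])
print(row_vec)    # every value is now a character string
class(row_vec)
```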

<div class="alert alert-block alert-warning"><b>Your turn:</b> have a look at slide 31 and start thinking of answers on your own -> we will discuss the solutions together. </div>

- **Adding a column:** creating a new vector and including it in the dataframe

1. either you add one vector at a time:

In [None]:
d2$new <- 1:2
d2

Here is another example, adding a column "sex" to the dataframe `myDataf` using a vector called `gender`. The name of the vector differs from the name of the column here, but they could be the same.

In [None]:
gender <- c("Man","Man","Woman","Woman","Man","Woman")
print(gender)
myDataf$sex <- gender
print(myDataf$sex)
myDataf
str(myDataf)

2. or add several vectors or several columns from another dataframe at once using `data.frame()`:

In [None]:
d3 <-  data.frame(d, d2)
d3

<div class="alert alert-block alert-danger"><b>Caution:</b> 
    You could also use <b>cbind()</b>, but this is risky as cbind() is primarily a function for matrices. When used on dataframes, it keeps the data types only if you combine several variables from both dataframes. If you take a single variable from a dataframe, cbind() handles it as a vector, with a risk of coercion, and of factorisation in versions of R < 4.
</div>
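
The coercion risk can be reproduced on a small example. In the sketch below, all object names are hypothetical: binding a single character column with a numeric vector using `cbind()` yields a character matrix, whereas `data.frame()` keeps one type per column.

```r
df_a <- data.frame(id = 1:2, label = c("a", "b"), stringsAsFactors = FALSE)

with_cbind <- cbind(df_a$label, 1:2)     # two vectors -> a character matrix
class(with_cbind)
with_cbind

with_df <- data.frame(label = df_a$label, n = 1:2, stringsAsFactors = FALSE)
str(with_df)                             # each column keeps its own type
```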

### **II.2. - Reading a text file into R and vice versa**
---

#### **a. reading a text file into R**

The function `read.table()` reads a delimited text file (tab-delimited, csv or any other column separator) into R and **generates a dataframe**. 

Before importing the file `Temperatures.txt`, let's see what it looks like. Just double-click on it. It is located in `/shared/projects/dubii2021/trainers/module3/data/`

You will see it is a tab-delimited text file.

Now let's import it in R by specifying the correct separator with the `read.table()` function:

In [None]:
path_to_file <- "/shared/projects/dubii2021/trainers/module3/data/Temperatures.txt" 
temperatures <- read.table(path_to_file, sep="\t", header=T, stringsAsFactors=F)
temperatures
str(temperatures)

In the above command, I used the argument `stringsAsFactors=FALSE` to avoid a factorisation of the columns containing character strings (here the "Month" column).
In R versions < 4, the default value for this argument is `TRUE`. Let's see what would have happened:

In [None]:
temperatures.2 <- read.table(path_to_file, sep="\t", header=T, stringsAsFactors=TRUE)
str(temperatures.2)

Here the "Month" column has been factorised. How?

In [None]:
levels(temperatures.2$Month)

In alphabetical order, which is not what you want!
Thus always use `stringsAsFactors=FALSE`
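
If a factor is needed later on, its levels can be given explicitly so that they follow a meaningful order. A minimal sketch with a hypothetical vector of month names, which is not the content of the file:

```r
month_names <- c("March", "January", "February")   # hypothetical values
default_f <- factor(month_names)
levels(default_f)                                  # alphabetical order by default
ordered_f <- factor(month_names, levels = c("January", "February", "March"))
levels(ordered_f)                                  # order chosen by the user
```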

<div class="alert alert-block alert-info"><b>Personal work:</b> to better understand the behaviour of factors, you will follow a tutorial on factors which will be available on Friday on the module webpage.</div>

#### **b. writing a dataframe on your computer**

Conversely, save a dataframe into your working directory with `write.table()`:

In [None]:
# save a dataframe as a text file in the working directory
write.table(myDataf, file="bmi_data.txt", sep="\t", quote=F, col.names=T)

Have a look at it by double clicking on it in your working directory.

and check you can import it back in R again:

In [None]:
rm(myDataf)
myDataf <- read.table("bmi_data.txt", sep="\t", header=T, stringsAsFactors=F)
head(myDataf) #myDataf is again accessible
file.remove("bmi_data.txt") #to clean the working directory

### **II.3. - Subsetting a dataframe**
---

#### **a. The function `which()` returns the index of what is TRUE in a tested condition:**

In [None]:
print(which ( myDataf$sex == "Woman") )

Here, we obtain a vector where 3, 4 and 6 correspond to the positions or indexes (1-based) of the occurrences of "Woman" in the vector/variable myDataf$sex. We can then use this vector as usual in a dataframe before the "," to select the corresponding rows.

In [None]:
myDataf [ which ( myDataf$sex == "Woman") , ] 

In [None]:
str(myDataf [ which ( myDataf$sex == "Woman") , ])

Instead of `==`, one can use `!=` for "is different from" to detect what does not match.

In [None]:
print(which ( myDataf$sex != "Man"))

Another method would be to add `!` for "not" before the test, to get the complementary result:

In [None]:
print(which (! myDataf$sex == "Man"))

<div class="alert alert-block alert-danger"><b>Caution:</b>
    What happens if you do not use `which()`?
</div>

Let's make a copy of our dataframe and replace the gender of Claire by a missing value:

In [None]:
myDataf2 <- myDataf
myDataf2["Claire", "sex"] <- NA
myDataf2

and rerun the same command as above without which() on the new myDataf2:

In [None]:
myDataf2[myDataf2$sex == "Woman",]

In [None]:
myDataf2[which(myDataf2$sex == "Woman"),]
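
The difference between these two commands comes from how a test behaves on a missing value. It can be reproduced on a small hypothetical vector:

```r
sex_vec <- c("Man", "Woman", NA, "Woman")   # hypothetical vector with one missing value
sex_vec == "Woman"                          # the test returns NA for the missing value
sex_vec[sex_vec == "Woman"]                 # the NA is kept in the result
sex_vec[which(sex_vec == "Woman")]          # which() keeps only the TRUE positions
sex_vec[!is.na(sex_vec) & sex_vec == "Woman"]   # equivalent filter without which()
```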

<div class="alert alert-block alert-danger"><b>Caution:</b>
    If you have missing data and you forget to use which(), the rows with missing values are returned as well (filled with NA).<b> =>  Always use which()</b>
</div>

#### **b. One can also search for a pattern with `grep()`:**

It returns the index of what matches, even partially.

In [None]:
print(grep("Wom", myDataf$sex))

In [None]:
print(grep("Woman", myDataf$sex))

In [None]:
myDataf [grep("Woman", myDataf$sex), ] 

In [None]:
print(grep("a", row.names(myDataf)))

In [None]:
myDataf [grep("a", row.names(myDataf)),]
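
`grep()` has a few useful variants. The sketch below applies them to a hypothetical vector holding the same first names: `value = TRUE` returns the matching strings, `grepl()` returns one logical value per element, and `ignore.case = TRUE` makes the search case-insensitive.

```r
first_names <- c("Fabien", "Pierre", "Sandrine", "Claire", "Bruno", "Delphine")
print(grep("a", first_names))                  # positions of the matches
print(grep("a", first_names, value = TRUE))    # the matching strings themselves
print(grepl("a", first_names))                 # one TRUE/FALSE per element
print(grep("^s", first_names, ignore.case = TRUE, value = TRUE))   # starts with s or S
```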

#### **c. The function `subset()` is even simpler than `which()`:**

Just enter the dataframe as first argument, and the variable without "quotes" on which you do the filtering followed by the condition.

In [None]:
WomenDataf <- subset(myDataf, sex == "Woman") # filter on the column sex of myDataf
WomenDataf

#### **d. You can even combine conditions:**

- logical: `&` = AND, `|` = OR, `!` = not
- comparisons: `==`, `!=` for different, `>`, `<`, `>=`, `<=`
- "is an element of" a vector using `%in%`

In [None]:
filteredData <- myDataf [ which ( myDataf$sex == "Woman" & myDataf$weight < 80 & myDataf$bmi > 20), ]
filteredData

In [None]:
subset( myDataf, sex == "Woman" & weight < 80 & bmi > 20)
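
The operators `|`, `!` and `%in%` are listed above but not used in the examples. A sketch on a small hypothetical dataframe named `people`:

```r
people <- data.frame(name = c("Fabien", "Pierre", "Sandrine", "Claire"),
                     weight = c(60, 72, 57, 90),
                     stringsAsFactors = FALSE)     # hypothetical small dataframe
keep <- c("Pierre", "Claire", "Unknown")           # names we are looking for

people[which(people$name %in% keep), ]             # rows whose name is in keep
subset(people, weight < 60 | weight > 80)          # OR: one true condition is enough
subset(people, !(name %in% keep))                  # NOT: the complementary rows
```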

### **II.4. -Merging dataframes:** using a column as a "key"

In this example, I add one column with indexes that I will use as a key, but we can also use an existing variable as a key.

In [None]:
myDataf$index <- 1:6
myDataf

Then I generate another dataframe with handedness information on 6 samples, but one sample is new compared to the initial dataframe.

In [None]:
OtherData <- data.frame(c(1:5, 7),rep(c("right-handed","left-handed"),3))
names(OtherData) <- c("ID","handedness")
OtherData

We can now merge them together by specifying the "key" column with the argument `by`. The `all` argument is used to keep all the rows of a dataframe that are not present in the other. The `.x` refers to the first dataframe while `.y` refers to the second one.

<div class="alert alert-block alert-warning"><b>Warning:</b> Adding <b>sort=F</b> prevents the merged dataframe from being sorted on the "key" column. </div>


In [None]:
myMergedDataf <- merge(myDataf, OtherData, by.x="index", by.y="ID", all.x=T, all.y=T, sort=F)
myMergedDataf

In the merged dataframe, we start with all the rows present in both dataframes. The next row contains the data only present in the first dataframe with missing data for the columns in the second dataframe. The last rows are the ones with data only present in the second dataframe with missing data for the first dataframe.

Unless the merge is done on the row names (`by=0` or `by="row.names"`, in which case they are kept in a column called `Row.names`), the row names of the initial dataframes are lost. The new dataframe has its own row names. 

If two columns have the same name in both dataframes, by default R adds an ".x" to the one from the first dataframe and ".y" to the one of the second dataframe. The names can be changed with the argument `suffixes`.
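
The effect of the `all`, `all.x` and `suffixes` arguments can be compared on two small hypothetical dataframes that share the key `id` and both have a column named `value`:

```r
left  <- data.frame(id = 1:3, value = c("a", "b", "c"))        # hypothetical dataframes
right <- data.frame(id = c(2, 3, 4), value = c("B", "C", "D"))

inner <- merge(left, right, by = "id")                  # only the ids found in both
left_join <- merge(left, right, by = "id", all.x = TRUE)  # every row of left is kept
full <- merge(left, right, by = "id", all = TRUE)       # every row of both is kept
renamed <- merge(left, right, by = "id", suffixes = c("_left", "_right"))

inner
left_join
full
names(renamed)
```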

___

### **II.5 - Some basic plotting**

We will see in more depth how to generate basic plots for different kinds of variables in your personal work on Wednesday, and how to generate custom plots with either base R or ggplot during session 2 of R.

But let's have a quick view of what can be done on our dataframe.

#### **a. scatter plot with the function `plot()`**

In [None]:
plot(myDataf$weight~myDataf$size)  

#### **b. Representation of quantitative data distribution:** 

- as a boxplot with `boxplot()`:

In [None]:
boxplot(myDataf$weight)

or using `~ ` to display boxplots on the same plot depending on a categorical variable:

In [None]:
boxplot(myDataf$weight~myDataf$sex) 

- as a histogram with `hist()`:

In [None]:
a <- rnorm(1000) # to sample 1000 values from a normal distribution of mean 0 and standard deviation 1
hist(a, breaks=20) # the argument breaks is used to specify the number of intervals

We will further see that graphs have three-level functions:

1. primary graph functions like `plot()`, `boxplot()` and `hist()` to display the main types of graphs in R

2. secondary graph functions to complement an existing plot

3. graphical parameters to modify the plots display:
    - either as options of the primary and secondary functions
    - or permanently with the `par()` function before plotting the graph.
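
The three levels can be sketched together as follows. The vectors `demo_weight` and `demo_size` are hypothetical copies of the values used in this tutorial, and the units in the axis labels (kg and m) are assumptions.

```r
demo_weight <- c(60, 72, 57, 90, 95, 72)
demo_size <- c(1.75, 1.8, 1.65, 1.9, 1.74, 1.91)

old_par <- par(mfrow = c(1, 2))                # 3. parameter set with par(): two plots side by side
plot(demo_weight ~ demo_size,                  # 1. primary function, with options
     main = "Weight vs size", xlab = "size (m)", ylab = "weight (kg)",
     pch = 19, col = "blue")
abline(h = mean(demo_weight), lty = 2)         # 2. secondary function: adds a horizontal line
boxplot(demo_weight, main = "Weight")          # 1. another primary function
par(old_par)                                   # restore the previous parameters
```

Storing the value returned by `par()` and restoring it afterwards keeps the change from affecting later plots.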

---
---

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know all the main functions to create and manipulate dataframes.

</div>
    

Let's save the main objects of this session into an RData file:

In [None]:
ls()

We will keep `myDataf` and `temperatures`.

In [None]:
save(myDataf,temperatures, file="RSession1_tutorial.RData")

<div class="alert alert-block alert-danger"><b>Caution:</b><br> 
 Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the IFB Jupyter hub! 
</div>

In [None]:
sessionInfo()