{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# DU Bii - module 3: R and stats\n", "---\n", "## **Session 1: tutorial on dataframes**\n", "*Wednesday 3rd of March, 2021*\n", "\n", "teachers: Claire Vandiedonck & Anne Badel; helpers: Antoine Bridier-Nahmias, Bruno Toupace, Clémence Réda, Jacques van Helden\n", "\n", "*Content of this tutorial:*\n", "\n", "1. Some reminders on R basics\n", " 1.0. What is R?\n", " 1.1. R as a calculator \n", " 1.2. Assigning data to R objects, using and reading them \n", " 1.3. Managing your session\n", " 1.4. Managing objects in your R Session\n", " 1.5. Saving your data, session, and history\n", " a. Data: specific variables or functions to save\n", " b. Session: save all variables and functions\n", " c. History: save all past commands\n", " 1.6. Classes and types of R objects\n", " a. classes of objects\n", " b. main data structures in R\n", " 1.Vectors\n", " 2.Matrices\n", "2. Dataframes\n", " 2.1. Creating a dataframe\n", " 2.2. Reading a text file into RData\n", " 2.3. Subsetting a dataframe on several criteria\n", " 2.4. Merging dataframes\n", " 2.5. Some basic plotting\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## **Before going further**\n", "\n", "
Caution:
\n", " 1. Create a new directory \"RSession1\" in your home with a right click in the left-hand panel of the lab.
\n", " 2. Save a backup copy of this notebook in this folder: in the left-hand panel, right-click on this file, select \"Duplicate\", add your name to the copy (e.g. \"tutorial_dataframes_vandiedonck.ipynb\") and move it to the proper folder.
\n", "You can also make backups during the analysis. Don't forget to save your notebook regularly.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Warning: you are strongly advised to run the cells in the indicated order. If you want to rerun cells above, you can just restart the kernel to start at 1 again.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " About jupyter notebooks:
\n", "\n", "- To add a new cell, click on the \"+\" icon in the toolbar above your notebook
\n", "- You can \"click and drag\" to move a cell up or down
\n", "- You choose the type of cell in the toolbar above your notebook:
\n", " - 'Code' to enter command lines to be executed
\n", " - 'Markdown' cells to add text, that can be formatted with some characters
\n", "- To execute a 'Code' cell, press SHIFT+ENTER or click on the \"play\" icon
\n", "- To display a 'Markdown' cell, press SHIFT+ENTER or click on the \"play\" icon
\n", "- To modify a 'Markdown' cell, double-click on it
\n", "
\n", "\n", " \n", "To make nice html reports with markdown: html visualization tool 1 or html visualization tool 2, to draw nice tables, and the Ultimate guide.
\n", "Further reading on JupyterLab notebooks: Jupyter Lab documentation.
\n", "
\n", " \n", " \n", "
\n", "\n", "__*=> About this jupyter notebook*__\n", "\n", "This is a Jupyter notebook in **R**, meaning that the commands you enter or run in `Code` cells are directly understood by the server in the R language.\n", "
You could run the same commands in a Terminal or in RStudio. \n", "\n", "\n", "> In this tutorial, you will run one cell at a time. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **I. Some reminders on R basics**\n", "---\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **I.0 What is R?**\n", "---\n", "\n", "R is available on this website: https://www.r-project.org\n", "\n", "The language is:\n", "- open-source\n", "- available for Windows, Mac and Unix\n", "- widely used in academia, finance, pharma, social sciences...\n", "\n", "R is a statistical programming language. This project started in 1993. We are currently at version 4.0.4 (15/02/2021). There is a new release twice a year." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "R includes a \"core language\" called `R base`, extended by more than 3000 contributed packages. A package is a set of functions.\n", "\n", "R can be used for:\n", "1. data manipulation: import, format, edit, export\n", "2. statistics\n", "3. advanced graphics\n", "\n", "***Some useful links***\n", "- Quick R: https://www.statmethods.net/index.html\n", "- Emmanuel Paradis tutorial: [in French](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf) or [in English](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf)\n", "- R cheatsheet: https://rstudio.com/resources/cheatsheets/\n", "- R style guide: https://google.github.io/styleguide/Rguide.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **I.1 - R as a calculator**\n", "---\n", "\n", "**Some very simple examples**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can directly use R to perform mathematical operations with the usual operators: `+`, `-`, `*` to multiply, `^` to raise to the power, `/` to divide, `%%` to get the modulo." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "2+2\n", "2-3\n", "6/2\n", "10/3\n", "10%%3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use built-in functions like `round()`, `log()`, `mean()`...\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean(c(1,2)) # as we will see, several values must first be concatenated with c()\n", "exp(-2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can nest functions: in the following example, `exp()` is nested in `round()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "round(exp(-2), 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For some functions, you need to enter several arguments. In the example below, we add the `base` argument for the `log()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "log(100,base=10) #we want to get the log of 100 in base 10 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Getting help on functions:***\n", "\n", "To know which arguments to use, it is recommended to always look at the help of the function. To do so, type `?` followed by the name of the function, or call `help()` with the name of the function inside the brackets. A help page will be displayed with different sections:\n", "\n", "- description: what is the purpose of the function?\n", "- usage: how is it used?\n", "- arguments: which parameters are used by the function. Default values may be specified.\n", "- details: technical description of the function\n", "- value: type of the output returned by the function\n", "- see also: similar functions in R\n", "- source/references: not always present\n", "- example: concrete examples -> the best way to learn how it works!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(round)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?exp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **I.2 - Assigning data to R objects, using and reading them**\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can store values in R objects/variables to reuse them in another command.\n", "To do so, use `<-` made with `<` and `-`. *An alternative is to use `=`. For code clarity, it is not recommended.*\n", "\n", "Let's assign for example `2` to `x`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see what `x` contains, just enter `x`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do operations on x:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x+x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can then assign the result of an operation on `x` to `y`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y <- x+3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the result y, enter it in the next command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- 4\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Caution: \n", "If you assign a new value to x, y will not change because the result of the operation x+3 was stored in y, not the operation \"x + 3\" itself.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So you would have to rerun the command assigning `x+3` to y to change the value of y." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y <- x+3\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to numeric values, we can store other kinds of data in an object. For example, we will put a string of characters in `s`. Strings of characters have to be entered between \"quotes\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s <- \"this is a string of characters\"\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of note, you can check the type of an R object using `class()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class(x)\n", "class(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important that numeric values are well encoded as numeric in R and not as strings of characters. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"1\"\n", "class(\"1\")\n", "class(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you try to add `\"1\"` and 3, an error is returned, since this operation is impossible:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try(\"1\" + 3) # I added the try function to avoid stopping the notebook if you want to run all the cells" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are using numeric variables, the operation can be done:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "1 + 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### **I.3 - Managing your session**\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When working with R, it is always good practice to document the R version you are using and the packages that are loaded. The function to do so is `sessionInfo()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sessionInfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, version 4.0.2 is the one installed on the IFB core cluster. By default, some \"base\" packages like stats are loaded. We will see in the next R session that we can load other packages.\n", "\n", "To know where you are currently working, use `getwd()`, which returns your working directory:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getwd()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
The result should be like this:`'/shared/`.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we change it to the RSession1 folder in your home directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "setwd('/shared/home/cvandiedonck/RSession1') #change with your login!!!\n", "getwd() #change is visible" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### **I.4 - Managing objects in your R Session and working directory**\n", "___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The objects `x`, `y` and `s` you have created above are only present in your R session, but they are not written in your working directory on the computer -> they are not present in the left-hand panel of Jupyter Lab.\n", "\n", "So, to know which objects you have in your R session, you can use the same function as in Unix/bash to list the files. The only difference is that in R you add brackets to use functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, you can get rid of an object with the function `rm()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rm(y)\n", "ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conversely, you can also look at the files on your computer from R with the function `dir()` or `list.files()`. Both accept a `pattern` argument to list only the files matching a pattern of interest." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dir()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list.files(pattern=\".ipynb\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **I.5 - Saving your data, session, and history**\n", "___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before quitting R, you will probably want to save objects and other session information on your computer to be able to find them again next time you use R.\n", "By default, all the data and files you save will be saved in your ***working directory***." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **a - Saving specific data *(or functions)***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `save()` is used to save a specific object on your computer. You will have to give a name to the file on your computer. Generally, we save them with the extension `.RData`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save(x,file=\"x.RData\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the above command, you should have created the file `x.RData` in your working directory. Check it is present on the left-hand panel of Jupyter Lab.
\n", "Now, if you remove `x` from your R session, you can load it back again with the `load()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rm(x)\n", "ls()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "load(\"x.RData\")\n", "ls()\n", "x #x is again accessible" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also delete the file from the working directory with the function `file.remove()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file.remove(\"x.RData\") #remove file: returns TRUE on successful removal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of saving a single object, you can save several by listing them all as separate arguments in the `save()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save(x,s, file=\"xands.RData\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file.remove(\"xands.RData\")# to clean the working directory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **b - Saving all variables *(and functions)* at once**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you want to save all objects, it is more efficient to use the function `save.image()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ls()\n", "save.image(file=\"AllMyData.RData\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, you can load them all back after removing all objects in the session or starting a new one." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rm(list=ls()) # this command removes all the objects on the R session\n", "ls() #all variables have been removed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "load(\"AllMyData.RData\")\n", "ls() #all variables are accessible again\n", "file.remove(\"AllMyData.RData\")\n", "ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **c- Save \"history\"** = all past commands" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Do not run. This does not work in R notebooks, where no history is saved because we are running independent cells! The commands below are the ones you would run in an R shell (Terminal > R) or in RStudio (change \"lab\" in the URL to \"rstudio\").
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ls()\n", "# savehistory(file=\"MyHistory.Rhistory\") #save all previously run commands in a specially formatted file\n", "# loadhistory(\"MyHistory.Rhistory\") #load all commands stored in the specified file\n", "# my_history <- read.delim(\"MyHistory.Rhistory\") #see how the file is formatted: number of line and associated command\n", "# head(my_history)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### **I.6 - Classes and types of R objects**\n", "___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **a - Classes of R objects**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main types of variables are:\n", "\n", "- numeric/integer\n", "- character\n", "- logical (FALSE/TRUE/NA)\n", "- factors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x <- c(3,7,1,2) # we define a variable x with 4 numeric values concatenated\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To have a more classical R display than in a notebook, you can add print()." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(x) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`x` contains 4 numeric values. We can check it is numeric with the function `is.numeric()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "is.numeric(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It returns the logical value `TRUE`.\n", "\n", "You can also perform tests that will return logical values. Below we test whether the values in x are below 2." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x<2 # we test whether the 4 values are < 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only the third value of x is < 2. Similarly, we can test which values of x are equal to 2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x==2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In R, the function `class()` returns the class of the object. The functions `is.logical()`, `is.numeric()`, `is.character()`, ... test whether the values are of this type. If needed, you can do a type conversion with `as.numeric()`, `as.logical()`, ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class(x)\n", "class(s)\n", "is.character(s)\n", "is.numeric(s)\n", "print(as.numeric(x<2))\n", "is.numeric(\"1\")\n", "is.numeric(as.numeric(\"1\"))\n", "is.numeric(c(1,\"1\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Coercion rules:*** There are some coercion rules when doing conversions or concatenating elements of different types: `logical < integer < numeric < character`.\n", "\n", "#### **b - Main data structures in R**\n", "\n", "##### **1 - Vectors**\n", "\n", "Remark:
In such a Jupyter notebook, each item of a vector is by default displayed separated by a `.`. To display a vector in a more classical way, as in the R console where the items are shown on a single row, use the function print(). \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(weight)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "4:10\n", "print(4:10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(seq(4,10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(seq(2,10,2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(rep(4,2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(rep(seq(4,10,2)))\n", "print(c(rep(1,4),rep(2,4)))\n", "print(c(5,s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check the class of a vector, and also get its length with `length()` and its structure with `str()`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class(c(5,s))\n", "length(1:10)\n", "length(weight)\n", "str(weight)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- You can perform operations directly on vectors:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "size <- c(1.75, 1.8, 1.65, 1.9, 1.74, 1.91)\n", "print(size^2)\n", "print(bmi <- weight/size^2 )\n", "print(bmi)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- You can sort them or get summary statistics (central tendency and dispersion):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(sort(size))\n", "mean(size)\n", "sd(size)\n", "median(size)\n", "min(size)\n", "max(size)\n", "print(range(size))\n", "summary(size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- You can extract some values from a vector by giving the index of the values you want inside square brackets `[]`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(size)\n", "size[1]\n", "size[2]\n", "size[6]\n", "size[c(2,6)]\n", "size[c(6,2)]\n", "min(size[c(6,2)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Finally you can add a name to the different values. Names on vector values are attributes of the vector. Here the function `names()` returns a vector of the names of vector `size`. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names(size)\n", "names(size) <- c(\"Fabien\",\"Pierre\",\"Sandrine\",\"Claire\",\"Bruno\",\"Delphine\")\n", "size\n", "str(size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "---\n", "##### **2 - Matrices**\n", "\n", "- 2-dimension objects (rows x columns)\n", "- contain only one type of variable (e.g. numeric) = homogeneous\n", "\n", "The function to create a matrix is `matrix()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myData <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3)\n", "myData\n", "class(myData)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus by default, a matrix is filled by columns but you can change this behaviour and fill it by rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myData <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE)\n", "myData" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- you can check the dimensions with `dim()` or `str()`, `nrow()` or `ncol()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(dim(myData))\n", "str(myData)\n", "nrow(myData)\n", "ncol(myData)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Printing the matrix shows you `[i,j]` coordinates, where `i` is the index of the row and `j` that of the column." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(myData)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- values can be sliced with the `[]`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myData[1,2] # returns the value of the 1st row and 2nd column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myData[2,1] # returns the value of the 2nd row and 1st column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(myData[,1]) # returns the values of the vector corresponding to the 1st column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(myData[2,]) # returns the values of the vector corresponding to the 2nd row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myData[,2:3] # subsets the initial matrix returning a sub-matrix\n", " # with all rows of the 2nd and 3rd columns from the initial matrix\n", " # the generated matrix has 2 rows and 2 columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(dim(myData[,2:3])) # the generated matrix has 2 rows and 2 columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class(myData[,1]) # we extract a vector -> thus the class is numeric and no more matrix\n", "length(myData[1,])\n", "length(myData[,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Vectors can be associated to generate a matrix with `rbind()` or `cbind()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myData2 <- cbind(weight, size, bmi)\n", "myData2\n", "myData3 <- rbind(weight, size, bmi)\n", "myData3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- of 
course, operations can be applied to the values in the matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myData2*2\n", "summary(myData2)\n", "mean(myData2)\n", "mean(myData2[,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **II - Dataframes**\n", "---\n", "---\n", "\n", "Dataframes are two-dimensional objects that can be heterogeneous between columns (but homogeneous within a column)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **II.1. - Creating a dataframe:**\n", "---\n", "\n", "- They are generated with the function `data.frame()`:\n", "\n", "This can be done **using existing vectors of same length** like the previously generated \"weight\", \"size\" and \"bmi\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
If you do not wish to do the tutorial stepwise and prefer to start directly here, you first need to run all the cells above so that all required objects are loaded in the session. To do so, click on \"Run\" in the top menu and select \"Run all above selected cell\".
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf <- data.frame(weight, size, bmi)\n", "myDataf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The obtained dataframe looks pretty much like the previous matrix myData2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class(myDataf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str(myDataf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(dim(myDataf))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">*Note that if the vectors used to generate the dataframe are character strings, it is advised in versions < 4 to add the argument `stringsAsFactors=FALSE`*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the vectors that will generate the dataframe do not exist yet in the session, but you would like to initiate a dataframe to fill it during your analysis, you could imagine creating an empty dataframe. But this method is rarely convenient, since a dataframe with 0 rows and 0 columns has no structure to fill." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d <- data.frame()\n", "d\n", "dim(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In that case, it is better to create an empty matrix and to convert it to a dataframe. See below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Dataframes can be generated by **converting a matrix into a dataframe** with `as.data.frame()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try with the object myData2 we previously created. 
It is a matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class(myData2)\n", "class(as.data.frame(myData2))\n", "str(as.data.frame(myData2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may also use `as.data.frame()` on a matrix generated by binding rows or columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d2 <- as.data.frame(cbind(1:2, 10:11))\n", "str(d2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can convert an empty matrix into a dataframe, as in this example with a matrix of two rows and three columns currently filled with missing values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d <- as.data.frame(matrix(NA,2,3))\n", "d\n", "dim(d)\n", "str(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Getting **row and column names** of a dataframe:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may use the same functions as the ones used for matrices: `rownames()` and `colnames()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rownames(d)\n", "colnames(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it is better to use the functions dedicated to dataframes which are `row.names()` and `names()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "row.names(d)\n", "names(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Caution:\n", " each row name must be unique in a dataframe!\n", "
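As a minimal sketch of this constraint (the dataframe `toy` and its values are hypothetical, not part of the tutorial data), R should accept unique row names and reject duplicated ones:

```r
# Hypothetical toy dataframe, used only for this illustration
toy <- data.frame(value = c(10, 20, 30))
row.names(toy) <- c('a', 'b', 'c')   # unique names: accepted

# Duplicated names raise an error; try() keeps the notebook running
res <- try(row.names(toy) <- c('a', 'a', 'b'), silent = TRUE)
class(res)        # 'try-error': the assignment was refused
row.names(toy)    # the original names are kept
```

Wrapping the assignment in `try()` mirrors what was done earlier in this notebook, so that running all cells is not interrupted by the error.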
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **Getting a variable from a dataframe:**\n", "\n", "To follow more easily, let's first display myDataf again" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(myDataf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variables are the columns of a dataframe. You can extract the vector corresponding to a column from a dataframe with its `index`, with the `name` of the column inside `\"\"` or using the symbol `$`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(myDataf[,2])\n", "print(myDataf[,\"size\"])\n", "print(myDataf$size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **Extracting rows from a dataframe:**\n", "\n", "You have two options to do so:\n", "\n", "1. either by specifying the index of the row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf[2,]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
or by giving its name within `\"\"` inside the square brackets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf[\"Pierre\",]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class(myDataf[\"Pierre\",])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both cases, you may notice that you obtain a dataframe and not a vector, even if you extract only one row.\n", "If you wish to get the vector corresponding to a row, you have to convert it with the `unlist()` function.\n", "\n", ">*Of note, dataframes are a special case of lists: a list of variables that all have the same number of rows, with unique row names.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp <- unlist(myDataf[\"Pierre\",])\n", "print(temp)\n", "class(temp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Your turn: have a look at slide 31 and start thinking of answers on your own -> we will discuss the solutions together.
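
To make the point about `unlist()` above concrete, here is a minimal sketch on a small made-up dataframe (`toy` and its values are hypothetical, not the tutorial's `myDataf`). It suggests that a single extracted row is still a dataframe, and that unlisting a row mixing numbers and text coerces every value to character:

```r
# Hypothetical toy dataframe, made up for illustration only
toy <- data.frame(size = c(1.80, 1.65),
                  city = c('Paris', 'Lyon'),
                  row.names = c('Ann', 'Bob'),
                  stringsAsFactors = FALSE)

one_row <- toy['Ann', ]      # still a dataframe, with a single row
row_vec <- unlist(one_row)   # named vector; mixed types are coerced to character

print(class(one_row))
print(class(row_vec))
print(row_vec)
```

If all the columns were numeric, the unlisted row would stay numeric; the coercion only appears when types are mixed.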
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **adding a column:** creating a new vector with characters and including it in the dataframe\n", "\n", "1. either you add one vector at a time:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d2$new <- 1:2\n", "d2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is another example to add a colum \"sex\" to the dataframe myData using a vector called \"sex\". I changed the name of the vector but you could keep the same name!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gender <- c(\"Man\",\"Man\",\"Woman\",\"Woman\",\"Man\",\"Woman\")\n", "print(gender)\n", "myDataf$sex <- gender\n", "print(myDataf$sex)\n", "myDataf\n", "str(myDataf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. or add several vectors or several columns from another dataframe at once using `data.frame()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d3 <- data.frame(d, d2)\n", "d3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Caution: \n", " You could also use cbind(), but this is risky because cbind() is primarily a function for matrices. With dataframes, it keeps the data types only if at least one of the combined objects is a dataframe. If you combine only vectors, for example single columns extracted with $, cbind() returns a matrix and all values are coerced to a single type. In versions of R < 4, character vectors bound to a dataframe are also converted to factors.\n", "
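
The risk can be sketched as follows on two made-up vectors (`ids` and `labels` are hypothetical names chosen for illustration). Combining plain vectors with `cbind()` yields a character matrix, whereas `data.frame()` keeps one type per column:

```r
# Hypothetical vectors, for illustration only
ids <- 1:3
labels <- c('a', 'b', 'c')

m <- cbind(ids, labels)                                    # matrix: the numbers are coerced to character
df <- data.frame(ids, labels, stringsAsFactors = FALSE)    # one type per column

print(class(m))
print(typeof(m))
str(df)
```

Setting `stringsAsFactors = FALSE` explicitly keeps the behaviour identical across R versions.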
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **II.2. - Reading a text file into R and vice versa**\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **a. reading a text file into R**\n", "\n", "The function `read.table()` reads a delimited text file (tabulated, scv or other column separator) into R and **generates a dataframe**. \n", "\n", "Before importing the file `Temperature.txt` let's see how it looks like. Just double click on it. It is located in `/shared/projects/dubii2021/trainers/module3/data/`\n", "\n", "You will see it is a tab-delimited text file.\n", "\n", "Now let's import it in R by specifying the correct separator with the `read.table()` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path_to_file <- \"/shared/projects/dubii2021/trainers/module3/data/Temperatures.txt\" \n", "temperatures <- read.table(path_to_file, sep=\"\\t\", header=T, stringsAsFactors=F)\n", "temperatures\n", "str(temperatures)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above command, I used the argument `stringsAsFactors=FALSE`to avoid a factorisation of the columns with strings of character (here the \"Month\" column).\n", "In R versions < 4, the default value for this argument is `TRUE`. Let's see what would have happened:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temperatures.2 <- read.table(path_to_file, sep=\"\\t\", header=T, stringsAsFactors=TRUE)\n", "str(temperatures.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the \"Month\" column has been factorised. How?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "levels(temperatures.2$Month)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In alphabetical order, which is not what you want!\n", "Thus always use `stringsAsFactors=FALSE`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Personal work: to better understand how factors behave, you will follow a tutorial on factors, which will be available on Friday on the module webpage.
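
As a small preview, and only as a sketch on a made-up vector (`months` is hypothetical, not the content of the Temperatures file), the following shows the alphabetical ordering of levels and how an explicit `levels` argument can impose the calendar order instead:

```r
# Hypothetical vector of month names, for illustration only
months <- c('January', 'February', 'March', 'April')

f_default <- factor(months)                    # levels are sorted alphabetically
f_ordered <- factor(months, levels = months)   # levels follow the order given

print(levels(f_default))
print(levels(f_ordered))
```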
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **b. writing a dataframe on your computer**\n", "\n", "Conversely, save a dataframe into your working directory with `write.table()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save a dataframe as a text file in the working directory\n", "write.table(myDataf, file=\"bmi_data.txt\", sep=\"\\t\", quote=F, col.names=T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Have a look at it by double clicking on it in your working directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and check you can import it back in R again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rm(myDataf)\n", "myDataf <- read.table(\"bmi_data.txt\", sep=\"\\t\", header=T, stringsAsFactors=F)\n", "head(myDataf) #myDataf is again accessible\n", "file.remove(\"bmi_data.txt\") #to clean the working directory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **II.3. - Subsetting a dataframe**\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **a. The function `which()` returns the index of what is TRUE in a tested condition:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(which ( myDataf$sex == \"Woman\") )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we obtain a vector where 3, 4 and 6 corrrespond to the positions or indexes (1-based) of the occurence \"Woman\" in the vector/variable myDataf$sex. We can the use this vector as usual in a dataframe before the \",\" to select the corresponding rows." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf [ which ( myDataf$sex == \"Woman\") , ] " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str(myDataf [ which ( myDataf$sex == \"Woman\") , ])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of `==`, one can use `!=` for \"is different from\" to detect what does not match." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(which ( myDataf$sex != \"Man\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another method would be to add `!` for \"not\" before the test, to get the complementary result:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(which (! myDataf$sex == \"Man\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Caution:\n", " What happens if you do not use `which()`?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets' make a copy of our dataframe and replace the gender of Claire by a missing value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf2 <- myDataf\n", "myDataf2[\"Claire\", \"sex\"] <- NA\n", "myDataf2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and rerun the same command as above without which() on the new myDataf2:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf2[myDataf2$sex == \"Woman\",]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf2[which(myDataf2$sex == \"Woman\"),]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Caution:\n", " If you have missing data and forget to use which(), the rows with a missing value in the tested variable are returned as well, as rows filled with NA. => Always use which()\n", "
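
The mechanism behind this caution can be sketched on a tiny made-up dataframe (`toy` and its values are hypothetical). The test itself returns `NA` for the missing value, and indexing rows with `NA` produces an extra row filled with `NA`, which `which()` avoids by keeping only the positions that are `TRUE`:

```r
# Hypothetical toy dataframe with one missing value, for illustration only
toy <- data.frame(weight = c(80, 60, 55),
                  sex = c('Man', NA, 'Woman'),
                  stringsAsFactors = FALSE)

test <- toy$sex == 'Woman'   # logical vector: FALSE, NA, TRUE
print(test)

without_which <- toy[test, ]        # the NA yields an extra row filled with NA
with_which <- toy[which(test), ]    # which() keeps only the TRUE positions

print(without_which)
print(with_which)
```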
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **b. One can also search for a pattern with `grep()`:**\n", "\n", "It returns the index of what matches, even partially." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(grep(\"Wom\", myDataf$sex))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(grep(\"Woman\", myDataf$sex))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf [grep(\"Woman\", myDataf$sex), ] " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(grep(\"a\", row.names(myDataf)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf [grep(\"a\", row.names(myDataf)),]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **c. The function `subset()` is even simpler than `which()`:**\n", "\n", "Just enter the dataframe as first argument, and the variable without \"quotes\" on which you do the filtering followed by the condition." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "WomenDataf <- subset(myDataf, gender== \"Woman\")\n", "WomenDataf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **d. 
You can even combine conditions:**\n", "\n", "- logical: `&` = AND, `|` = OR, `!` = not\n", "- comparisons: `==`, `!=` for different, `>`, `<`, `>=`, `<=`\n", "- \"is an element of\" a vector using `%in%`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filteredData <- myDataf [ which ( myDataf$sex == \"Woman\" & myDataf$weight < 80 & myDataf$bmi > 20), ]\n", "filteredData" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "subset( myDataf, sex == \"Woman\" & weight < 80 & bmi > 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **II.4. - Merging dataframes:** using a column as a \"key\"\n", "\n", "In this example, I add one column with indexes that I will use as a key, but we can also use an existing variable as a key." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myDataf$index <- 1:6\n", "myDataf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then I generate another dataframe with handedness information on 6 samples, but one sample is new compared to the initial dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "OtherData <- data.frame(c(1:5, 7),rep(c(\"right-handed\",\"left-handed\"),3))\n", "names(OtherData) <- c(\"ID\",\"handedness\")\n", "OtherData" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now merge them together by specifying the \"key\" column with the argument `by`. The `all` argument is used to keep the rows of one dataframe that have no match in the other. In argument names such as `by.x` or `all.y`, `.x` refers to the first dataframe while `.y` refers to the second one.\n", "\n", "
Warning: adding sort=FALSE prevents the merged dataframe from being sorted by the \"key\" column.
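
The effect of the `all` argument described above can be sketched on two small made-up dataframes (`left` and `right` are hypothetical, not the tutorial's objects). Without `all`, only the keys present in both dataframes are kept; with `all = TRUE`, every key is kept and the gaps are filled with `NA`:

```r
# Hypothetical toy dataframes, for illustration only
left <- data.frame(index = 1:3, weight = c(60, 72, 57))
right <- data.frame(ID = c(1, 2, 4), handedness = c('right', 'left', 'right'))

inner <- merge(left, right, by.x = 'index', by.y = 'ID')               # keys found in both
outer <- merge(left, right, by.x = 'index', by.y = 'ID', all = TRUE)   # every key from both

print(inner)
print(outer)
```

The key column of the result takes its name from the first dataframe, here `index`.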
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myMergedDataf <- merge(myDataf, OtherData, by.x=\"index\", by.y=\"ID\", all.x=TRUE, all.y=TRUE, sort=FALSE)\n", "myMergedDataf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the merged dataframe, we start with all the rows present in both dataframes. The next row contains the data only present in the first dataframe with missing data for the columns in the second dataframe. The last rows are the ones with data only present in the second dataframe with missing data for the first dataframe.\n", "\n", "Unless the merge is done on the row names (`by=0` or `by=\"row.names\"`), the row names of the initial dataframe are lost. The new dataframe has its own row names. \n", "\n", "If two columns have the same name in both dataframes, by default R adds an \".x\" to the one from the first dataframe and \".y\" to the one from the second dataframe. The suffixes can be changed with the argument `suffixes`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "\n", "### **II.5 - Some basic plotting**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will look in more depth at how to generate basic plots for different kinds of variables in your personal work on Wednesday, and during session 2 of R at how to generate custom plots either with base R or ggplot.\n", "\n", "But let's have a quick look at what can be done on our dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **a. scatter plot with the function `plot()`**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot(myDataf$weight~myDataf$size) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **b. 
Representation of quantitative data distribution:** \n", "\n", "- as a boxplot with `boxplot()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boxplot(myDataf$weight)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or using `~ ` to display boxplots on the same plot depending on a categorical variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boxplot(myDataf$weight~myDataf$sex) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- as a histogram with `hist()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a <- rnorm(1000) # to sample 1000 values from a normal distribution of mean 0 and standard deviation 1\n", "hist(a, breaks=20) # the argument breaks suggests the number of intervals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will see later that graph functions work at three levels:\n", "\n", "1. primary graph functions like `plot()`, `boxplot()` and `hist()` to display the main types of graphs in R\n", "\n", "2. secondary graph functions to complement an existing plot\n", "\n", "3. graphical parameters to modify the display of the plots:\n", " - either as options of the primary and secondary functions\n", " - or permanently with the `par()` function before plotting the graph.\n", "\n", "---\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
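
As a rough sketch of how the three levels listed above fit together, on made-up values (`x` and `y` are hypothetical; `abline()` is just one possible secondary function, chosen here for illustration):

```r
pdf(NULL)  # assumption: null device so that the sketch runs without a display; drop this line in a notebook

x <- c(1.60, 1.70, 1.80, 1.90)   # hypothetical sizes
y <- c(55, 65, 78, 90)           # hypothetical weights

old_par <- par(pch = 19)                              # level 3: parameter set with par() before plotting
plot(y ~ x, col = 'blue', main = 'Weight by size')    # level 1: primary function, with parameters as options
abline(lm(y ~ x), lty = 2)                            # level 2: secondary function adding a regression line
par(old_par)                                          # restore the previous parameters

dev.off()  # closes the null device; drop this line in a notebook as well
```

Saving the value returned by `par()` makes it possible to restore the previous settings once the plot is done.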
Success: Well done! You now know all the main functions to create and manipulate dataframes.\n", "\n", "
\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save the main objects of this session into an RData file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will keep `myDataf` and `temperatures`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save(myDataf,temperatures, file=\"RSession1_tutorial.RData\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Caution:
\n", " Don't forget to save your notebook and export a copy as an HTML file as well
\n", "- Open \"File\" in the Menu
\n", "- Select \"Export Notebook As\"
\n", "- Export notebook as HTML
\n", "- You can then open it in your browser even without being connected to the IFB Jupyter hub! \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sessionInfo()" ] } ], "metadata": { "kernelspec": { "display_name": "R 4.0.2", "language": "R", "name": "r-4.0.2" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.0.2" }, "toc-showtags": true }, "nbformat": 4, "nbformat_minor": 4 }