Wednesday 3rd of March, 2021
teachers: Claire Vandiedonck & Anne Badel; helpers: Antoine Bridier-Nahmias, Bruno Toupace, Clémence Réda, Jacques van Helden
Content of this tutorial:
1.0. What is R?
1.1. R as a calculator
1.2. Assigning data to R objects, using and reading them
1.3. Managing your session
1.4. Managing objects in your R Session
1.5. Saving your data, session, and history
a. Data: specific variables or functions to save
b. Session: save all variables and functions
c. History: save all past commands
1.6. Classes and types of R objects
a. classes of objects
b. main data structures in R
1.Vectors
2.Matrices
2.1. Creating a dataframe
2.2. Reading a text file into RData
2.3. Subsetting a dataframe on several criteria
2.4. Merging dataframes
2.5. Some basic plotting
=> About this jupyter notebook
This is a Jupyter notebook in R, meaning that the commands you enter or run in Code cells are directly interpreted by the server in the R language.
You could run the same commands in a Terminal or in RStudio.
In this tutorial, you will run one cell at a time.
R is available on this website: https://www.r-project.org
The language is:
R is a statistical programming language. This project started in 1993. We are currently at version 4.0.4 (15/02/2021). There is a new release twice a year.
R includes a "core language" called R base
with more than 3000 contributed packages. A package is a set of functions.
R can be used for:
Some useful links
You can directly use R to perform mathematical operations with the usual operators: +, -, * to multiply, ^ to raise to a power, / to divide, %% to get the modulo.
2+2
2-3
6/2
10/3
10%%3
You can use built-in functions like round(), log(), mean()...
mean(c(1,2)) # as we will see, several values must first be concatenated into a vector with c()
exp(-2)
You can nest functions. In the following example, exp() is nested in round():
round(exp(-2), 2)
For some functions, you need to enter several arguments. In the example below, we add the base argument for the log() function.
log(100,base=10) #we want to get the log of 100 in base 10
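As a small illustration of how arguments are passed, the sketch below compares positional and named arguments for log(). The variable names (by_position, by_name, wrapper, natural) are only chosen for this example.

```r
# Arguments can be matched by position or by name: both calls are equivalent
by_position <- log(100, 10)
by_name     <- log(100, base = 10)
wrapper     <- log10(100)   # dedicated function for base 10
natural     <- log(100)     # base omitted: natural logarithm (base e)

print(c(by_position, by_name, wrapper))
print(natural)
```

Naming the argument (base = 10) is usually easier to read than relying on its position.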
Getting help on functions:
To know which arguments to use, it is recommended to always look at the help of the function. To do so, enter the name of the function after ? or use help() with the name of the function in the brackets. A help page will be displayed with different sections:
help(round)
Round {base} | R Documentation
ceiling takes a single numeric argument x and returns a numeric vector containing the smallest integers not less than the corresponding elements of x.
floor takes a single numeric argument x and returns a numeric vector containing the largest integers not greater than the corresponding elements of x.
trunc takes a single numeric argument x and returns a numeric vector containing the integers formed by truncating the values in x toward 0.
round rounds the values in its first argument to the specified number of decimal places (default 0). See ‘Details’ about “round to even” when rounding off a 5.
signif rounds the values in its first argument to the specified number of significant digits.
ceiling(x)
floor(x)
trunc(x, ...)
round(x, digits = 0)
signif(x, digits = 6)
x | a numeric vector. Or, for |
digits | integer indicating the number of decimal places ( |
... | arguments to be passed to methods. |
These are generic functions: methods can be defined for them individually or via the Math group generic.
Note that for rounding off a 5, the IEC 60559 standard (see also ‘IEEE 754’) is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2. However, this is dependent on OS services and on representation error (since e.g. 0.15 is not represented exactly, the rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2).
Rounding to a negative number of digits means rounding to a power of ten, so for example round(x, digits = -2) rounds to the nearest hundred.
For signif the recognized values of digits are 1...22, and non-missing values are rounded to the nearest integer in that range. Complex numbers are rounded to retain the specified number of digits in the larger of the components. Each element of the vector is rounded individually, unlike printing.
These are all primitive functions.
These are all (internally) S4 generic.
ceiling, floor and trunc are members of the Math group generic. As an S4 generic, trunc has only one argument.
round and signif are members of the Math2 group generic.
The realities of computer arithmetic can cause unexpected results, especially with floor and ceiling. For example, we ‘know’ that floor(log(x, base = 8)) for x = 8 is 1, but 0 has been seen on an R platform. It is normally necessary to use a tolerance.
Rounding to decimal digits in binary arithmetic is non-trivial (when digits != 0) and may be surprising. Be aware that most decimal fractions are not exactly representable in binary double precision. In R 4.0.0, the algorithm for round(x, d), for d > 0, has been improved to measure and round “to nearest even”, contrary to earlier versions of R (or also to sprintf() or format() based rounding).
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
The ISO/IEC/IEEE 60559:2011 standard is available for money from https://www.iso.org.
The IEEE 754:2008 standard is more openly documented, e.g, at https://en.wikipedia.org/wiki/IEEE_754.
as.integer.
Package round's roundX() for several versions or implementations of rounding, including some previous and the current R version (as version = "3d.C").
round(.5 + -2:4) # IEEE / IEC rounding: -2 0 0 2 2 4 4
## (this is *good* behaviour -- do *NOT* report it as bug !)
( x1 <- seq(-2, 4, by = .5) )
round(x1) #-- IEEE / IEC rounding !
x1[trunc(x1) != floor(x1)]
x1[round(x1) != floor(x1 + .5)]
(non.int <- ceiling(x1) != floor(x1))
x2 <- pi * 100^(-1:3)
round(x2, 3)
signif(x2, 3)
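The "round to even" rule described in this help page can be checked with a few values. This is only a minimal sketch; the variable names (halves, rounded, hundreds, three_sig) are chosen for this example.

```r
# Values ending in .5 are rounded to the nearest even integer
halves  <- c(0.5, 1.5, 2.5, -1.5)
rounded <- round(halves)
print(rounded)

# A negative number of digits rounds to a power of ten
hundreds  <- round(1234.567, digits = -2)
# signif() keeps a number of significant digits instead of decimal places
three_sig <- signif(1234.567, digits = 3)
print(c(hundreds, three_sig))
```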
?exp
log {base} | R Documentation
log computes logarithms, by default natural logarithms, log10 computes common (i.e., base 10) logarithms, and log2 computes binary (i.e., base 2) logarithms. The general form log(x, base) computes logarithms with base base.
log1p(x) computes log(1+x) accurately also for |x| << 1.
exp computes the exponential function.
expm1(x) computes exp(x) - 1 accurately also for |x| << 1.
log(x, base = exp(1))
logb(x, base = exp(1))
log10(x)
log2(x)
log1p(x)
exp(x)
expm1(x)
x | a numeric or complex vector. |
base | a positive or complex number: the base with respect to which logarithms are computed. Defaults to e= |
All except logb are generic functions: methods can be defined for them individually or via the Math group generic.
log10 and log2 are only convenience wrappers, but logs to bases 10 and 2 (whether computed via log or the wrappers) will be computed more efficiently and accurately where supported by the OS. Methods can be set for them individually (and otherwise methods for log will be used).
logb is a wrapper for log for compatibility with S. If (S3 or S4) methods are set for log they will be dispatched. Do not set S4 methods on logb itself.
All except log are primitive functions.
A vector of the same length as x containing the transformed values. log(0) gives -Inf, and log(x) for negative values of x is NaN. exp(-Inf) is 0.
For complex inputs to the log functions, the value is a complex number with imaginary part in the range [-pi, pi]: which end of the range is used might be platform-specific.
exp, expm1, log, log10, log2 and log1p are S4 generic and are members of the Math group generic.
Note that this means that the S4 generic for log has a signature with only one argument, x, but that base can be passed to methods (but will not be used for method selection). On the other hand, if you only set a method for the Math group generic then base argument of log will be ignored for your class.
log1p and expm1 may be taken from the operating system, but if not available there then they are based on the Fortran subroutine dlnrel by W. Fullerton of Los Alamos Scientific Laboratory (see http://www.netlib.org/slatec/fnlib/dlnrel.f) and (for small x) a single Newton step for the solution of log1p(y) = x respectively.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole. (for log, log10 and exp.)
Chambers, J. M. (1998) Programming with Data. A Guide to the S Language. Springer. (for logb.)
Trig, sqrt, Arithmetic.
log(exp(3))
log10(1e7) # = 7
x <- 10^-(1+2*1:9)
cbind(x, log(1+x), log1p(x), exp(x)-1, expm1(x))
We can store values in R objects/variables to reuse them in another command.
To do so, use <- made with < and -. An alternative is to use =, but for code clarity it is not recommended.
Let's assign for example 2 to x:
x <- 2
To know what is in x, just enter x:
x
We can do operations on x:
x+x
You can then assign the result of an operation on x to y.
y <- x+3
To get the result y, enter it in the next command:
y
x <- 4
y
So you would have to rerun the command assigning x+3 to y to change the value of y.
y <- x+3
y
In addition to numeric values, we can store other kinds of data in an object. For example, we will put a string of characters in s. Strings of characters have to be entered between "quotes".
s <- "this is a string of characters"
s
Of note, you can check the type of an R object using class().
class(x)
class(s)
It is important that numeric values are correctly encoded as numeric in R and not as strings of characters.
"1"
class("1")
class(1)
If you try to add "1" and 3, an error message is returned, since adding a character string to a number is not a valid operation:
try("1" + 3) # try() is used so that the notebook does not stop here if you run all the cells
Error in "1" + 3 : non-numeric argument to binary operator
If you are using numeric variables, the operation can be done:
1 + 3
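When a number has been stored as a string, one way to make the operation possible is an explicit conversion with as.numeric(). The sketch below uses variable names (value, converted, total, not_a_number) chosen only for this example.

```r
value     <- "1"                 # a number stored as a character string
converted <- as.numeric(value)   # explicit conversion to numeric
total     <- converted + 3
print(class(converted))
print(total)

# A string that does not represent a number becomes NA (R also emits a warning,
# silenced here with suppressWarnings() to keep the output short)
not_a_number <- suppressWarnings(as.numeric("one"))
print(not_a_number)
```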
When working with R, it is always good practice to document the R version you are using and the packages that are loaded. The function is sessionInfo().
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.2/lib/libopenblasp-r0.3.10.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
loaded via a namespace (and not attached):
 [1] compiler_4.0.2  ellipsis_0.3.1  IRdisplay_0.7.0 pbdZMQ_0.3-3.1
 [5] tools_4.0.2     htmltools_0.5.1 pillar_1.4.7    base64enc_0.1-3
 [9] crayon_1.3.4    uuid_0.1-4      IRkernel_1.1.1  jsonlite_1.7.2
[13] digest_0.6.27   lifecycle_0.2.0 repr_1.1.0      rlang_0.4.10
[17] evaluate_0.14
As you can see, version 4.0.2 is the one installed on the IFB core cluster. By default, some "base" packages like stats are loaded. We will see in the next R session that we can load other packages.
getwd()
Then we change it to the RSession1 folder in your home directory.
setwd('/shared/home/cvandiedonck/RSession1') #change with your login!!!
getwd() #change is visible
The objects x, y and s you have created above are only present in your R session; they are not written to your working directory on the computer, so they do not appear in the left-hand panel of Jupyter Lab.
So, to know which objects you have in your R session, you can use the same function as in Unix/bash to list the files. The only difference is that in R you add brackets to use functions.
ls()
Similarly, you can get rid of an object with the function rm().
rm(y)
ls()
Conversely, you can also look at the files on your computer from R with the function dir() or list.files(). With the second function, you can add an argument to specify a pattern of interest.
dir()
list.files(pattern="\\.ipynb$") # the pattern is a regular expression: \\. matches a literal dot, $ the end of the name
Before quitting R, you will probably want to save objects and other session information on your computer to be able to find them again next time you use R. By default, all the data and files you save will be saved in your working directory.
The function save() is used to save a specific object on your computer. You will have to give a name to the file on your computer. Generally, we save them with the extension .RData.
save(x,file="x.RData")
With the above command, you should have created the file x.RData in your working directory. Check that it is present in the left-hand panel of Jupyter Lab.
Now, if you remove x from your R session, you can load it back again with the load() function.
rm(x)
ls()
load("x.RData")
ls()
x #x is again accessible
You can also delete the file from the working directory with the function file.remove().
file.remove("x.RData") #remove file: returns TRUE on successful removal
Instead of saving a single object, you can save several by listing them all as separate arguments in the save() function.
save(x,s, file="xands.RData")
file.remove("xands.RData")# to clean the working directory
To save all objects at once, it is even more efficient to use the function save.image():
ls()
save.image(file="AllMyData.RData")
Similarly, you can load them all back after removing all objects from the session or starting a new one.
rm(list=ls()) # this command removes all the objects on the R session
ls() #all variables have been removed
load("AllMyData.RData")
ls() #all variables are accessible again
file.remove("AllMyData.RData")
ls()
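The save/load cycle above can also be tried without touching your working directory. The sketch below is only an illustration under stated assumptions: it writes to a temporary file created with tempfile() and loads the objects into a separate environment, so nothing in your session is overwritten. The names demo_x, demo_s, tmp_file and restored are hypothetical.

```r
demo_x <- c(3, 7, 1, 2)
demo_s <- "some text"

# Save two objects into a temporary .RData file
tmp_file <- tempfile(fileext = ".RData")
save(demo_x, demo_s, file = tmp_file)

# Load them into a new environment instead of the current session;
# load() returns (invisibly) the names of the objects it restored
restored     <- new.env()
loaded_names <- load(tmp_file, envir = restored)
print(loaded_names)
print(get("demo_x", envir = restored))

# Clean up the temporary file
file.remove(tmp_file)
```

Loading into a dedicated environment is a convenient way to inspect what an .RData file contains before mixing it with your own objects.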
# ls()
# savehistory(file="MyHistory.Rhistory") #save all previously run commands in a special formatted file
# loadhistory("MyHistory.Rhistory") #load all commands stored in the specified file
# my_history <- read.delim("MyHistory.Rhistory") #see how the file is formatted: number of line and associated command
# head(my_history)
The main types of variables are :
x <- c(3,7,1,2) # we define a variable x with 4 numeric values concatenated
x
To have a more classical R display than in a notebook, you can add print().
print(x)
[1] 3 7 1 2
x contains 4 numeric values. We can check it is numeric with the function is.numeric().
is.numeric(x)
It returns the logical value TRUE.
You can also perform tests that will return logical values. Below we test whether the values in x are below 2.
x<2 # we test whether the 4 values are < 2
Only the third value of x is < 2. Similarly, we can test which values of x are equal to 2.
x==2
In R, the function class() returns the class of the object. The functions is.logical(), is.numeric(), is.character(), ... test whether the values are of this type. If needed, you can convert between types with as.numeric(), as.logical(), ...
class(x)
class(s)
is.character(s)
is.numeric(s)
print(as.numeric(x<2))
is.numeric("1")
is.numeric(as.numeric("1"))
is.numeric(c(1,"1"))
[1] 0 0 1 0
Coercion rules: when elements of different types are concatenated or converted, R coerces them to the most general type, following the order: logical < integer < numeric < complex < character < list
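The coercion order can be observed directly by concatenating values of two different types and checking the class of the result. This is a minimal sketch; the variable names are chosen for this example only.

```r
# Each vector mixes two types; the result takes the more general one
logical_integer   <- c(TRUE, 2L)         # logical + integer   -> integer
integer_numeric   <- c(2L, 2.5)          # integer + numeric   -> numeric
numeric_character <- c(2.5, "a")         # numeric + character -> character
numeric_list      <- c(2.5, list("a"))   # anything + list     -> list

print(class(logical_integer))
print(class(integer_numeric))
print(class(numeric_character))
print(class(numeric_list))

# Useful consequence: logical values count as 0/1 in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))
print(n_true)
```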
There are 4 main data structures in R. The heterogeneous ones accept several classes inside.
object | Can it be heterogeneous? |
---|---|
vector | no |
matrix | no |
dataframe | yes |
list | yes |
c(), seq(), :, rep(), append()...
a <- c()
a
NULL
weight <- c(60, 72, 57, 90, 95, 72)
weight
print(weight)
[1] 60 72 57 90 95 72
4:10
print(4:10)
[1] 4 5 6 7 8 9 10
print(seq(4,10))
[1] 4 5 6 7 8 9 10
print(seq(2,10,2))
[1] 2 4 6 8 10
print(rep(4,2))
[1] 4 4
print(rep(seq(4,10,2)))
print(c(rep(1,4),rep(2,4)))
print(c(5,s))
[1] 4 6 8 10
[1] 1 1 1 1 2 2 2 2
[1] "5" "this is a string of characters"
You can check the class of a vector but also get some information on its length with length() and on its structure with str().
class(c(5,s))
length(1:10)
length(weight)
str(weight)
num [1:6] 60 72 57 90 95 72
size <- c(1.75, 1.8, 1.65, 1.9, 1.74, 1.91)
print(size^2)
print(bmi <- weight/size^2 )
print(bmi)
[1] 3.0625 3.2400 2.7225 3.6100 3.0276 3.6481
[1] 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630
[1] 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630
print(sort(size))
mean(size)
sd(size)
median(size)
min(size)
max(size)
print(range(size))
summary(size)
[1] 1.65 1.74 1.75 1.80 1.90 1.91
[1] 1.65 1.91
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.650   1.742   1.775   1.792   1.875   1.910
[]:
print(size)
size[1]
size[2]
size[6]
size[c(2,6)]
size[c(6,2)]
min(size[c(6,2)])
[1] 1.75 1.80 1.65 1.90 1.74 1.91
names() returns a vector of the names of vector size.
names(size)
names(size) <- c("Fabien","Pierre","Sandrine","Claire","Bruno","Delphine")
size
str(size)
NULL
 Named num [1:6] 1.75 1.8 1.65 1.9 1.74 1.91
 - attr(*, "names")= chr [1:6] "Fabien" "Pierre" "Sandrine" "Claire" ...
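Once a vector has names, its elements can be extracted by name as well as by position. The sketch below recreates the size vector so that it runs on its own; the variable names pierre, two_people and tall are chosen for this example.

```r
# Recreate the named vector used in the tutorial
size <- c(Fabien = 1.75, Pierre = 1.80, Sandrine = 1.65,
          Claire = 1.90, Bruno = 1.74, Delphine = 1.91)

pierre     <- size["Pierre"]              # extraction by name (the name is kept)
two_people <- size[c("Claire", "Bruno")]  # several names at once
tall       <- names(size)[size > 1.8]     # names of the elements passing a test

print(pierre)
print(two_people)
print(tall)
print(unname(pierre))                     # unname() drops the name
```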
The function to create a matrix is matrix()
myData <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3)
myData
class(myData)
1 | 3 | 12 |
2 | 11 | 13 |
Thus by default, a matrix is filled by columns but you can change this behaviour and fill it by rows.
myData <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE)
myData
1 | 2 | 3 |
11 | 12 | 13 |
dim() or str(), nrow() or ncol()
print(dim(myData))
str(myData)
nrow(myData)
ncol(myData)
[1] 2 3
 num [1:2, 1:3] 1 11 2 12 3 13
Printing the matrix shows you [i,j] coordinates, where i is the index of the row and j that of the column.
print(myData)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13
[]
myData[1,2] # returns the value of the 1st row and 2nd column
myData[2,1] # returns the value of the 2nd row and 1st column
print(myData[,1]) # returns the values of the vector corresponding to the 1st column
[1] 1 11
print(myData[2,]) # returns the values of the vector corresponding to the 2nd row
[1] 11 12 13
myData[,2:3] # subsets the initial matrix returning a sub-matrix
# with all rows of the 2nd and 3rd columns from the initial matrix
# the generated matrix has 2 rows and 2 columns
2 | 3 |
12 | 13 |
print(dim(myData[,2:3])) # the generated matrix has 2 rows and 2 columns
[1] 2 2
class(myData[,1]) # we extract a vector -> thus the class is numeric and no longer matrix
length(myData[1,])
length(myData[,1])
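As seen above, extracting a single row or column returns a plain vector. If you need to keep a matrix, the drop = FALSE argument of [ can be used. A minimal sketch, with col_vec and col_mat as example names:

```r
myData <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)

col_vec <- myData[, 1]                 # default: dimensions are dropped -> vector
col_mat <- myData[, 1, drop = FALSE]   # drop = FALSE keeps a 2 x 1 matrix

print(dim(col_vec))   # a vector has no dimensions
print(dim(col_mat))
print(is.matrix(col_mat))
```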
rbind() or cbind()
myData2 <- cbind(weight, size, bmi)
myData2
myData3 <- rbind(weight, size, bmi)
myData3
weight | size | bmi | |
---|---|---|---|
Fabien | 60 | 1.75 | 19.59184 |
Pierre | 72 | 1.80 | 22.22222 |
Sandrine | 57 | 1.65 | 20.93664 |
Claire | 90 | 1.90 | 24.93075 |
Bruno | 95 | 1.74 | 31.37799 |
Delphine | 72 | 1.91 | 19.73630 |
Fabien | Pierre | Sandrine | Claire | Bruno | Delphine | |
---|---|---|---|---|---|---|
weight | 60.00000 | 72.00000 | 57.00000 | 90.00000 | 95.00000 | 72.0000 |
size | 1.75000 | 1.80000 | 1.65000 | 1.90000 | 1.74000 | 1.9100 |
bmi | 19.59184 | 22.22222 | 20.93664 | 24.93075 | 31.37799 | 19.7363 |
myData2*2
summary(myData2)
mean(myData2)
mean(myData2[,1])
weight | size | bmi | |
---|---|---|---|
Fabien | 120 | 3.50 | 39.18367 |
Pierre | 144 | 3.60 | 44.44444 |
Sandrine | 114 | 3.30 | 41.87328 |
Claire | 180 | 3.80 | 49.86150 |
Bruno | 190 | 3.48 | 62.75598 |
Delphine | 144 | 3.82 | 39.47260 |
     weight           size            bmi
 Min.   :57.00   Min.   :1.650   Min.   :19.59
 1st Qu.:63.00   1st Qu.:1.742   1st Qu.:20.04
 Median :72.00   Median :1.775   Median :21.58
 Mean   :74.33   Mean   :1.792   Mean   :23.13
 3rd Qu.:85.50   3rd Qu.:1.875   3rd Qu.:24.25
 Max.   :95.00   Max.   :1.910   Max.   :31.38
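Note that mean(myData2) computes a single mean over all the cells of the matrix, which mixes weights, sizes and BMIs. To get one value per column or per row, colMeans() or apply() can be used instead. The sketch below recreates the matrix so that it runs on its own; col_means, col_sd and row_max are example names.

```r
weight <- c(60, 72, 57, 90, 95, 72)
size   <- c(1.75, 1.8, 1.65, 1.9, 1.74, 1.91)
bmi    <- weight / size^2
myData2 <- cbind(weight, size, bmi)

col_means <- colMeans(myData2)        # one mean per column
col_sd    <- apply(myData2, 2, sd)    # apply a function over columns (MARGIN = 2)
row_max   <- apply(myData2, 1, max)   # apply a function over rows (MARGIN = 1)

print(col_means)
print(col_sd)
print(row_max)
```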
Dataframes are two-dimensional objects that can be heterogeneous between columns (but homogeneous within a column)
data.frame():
This can be done using existing vectors of the same length, like the previously generated "weight", "size" and "bmi".
myDataf <- data.frame(weight, size, bmi)
myDataf
weight | size | bmi | |
---|---|---|---|
<dbl> | <dbl> | <dbl> | |
Fabien | 60 | 1.75 | 19.59184 |
Pierre | 72 | 1.80 | 22.22222 |
Sandrine | 57 | 1.65 | 20.93664 |
Claire | 90 | 1.90 | 24.93075 |
Bruno | 95 | 1.74 | 31.37799 |
Delphine | 72 | 1.91 | 19.73630 |
The obtained dataframe looks pretty much like the previous matrix myData2.
class(myDataf)
str(myDataf)
'data.frame': 6 obs. of 3 variables:
 $ weight: num 60 72 57 90 95 72
 $ size  : num 1.75 1.8 1.65 1.9 1.74 1.91
 $ bmi   : num 19.6 22.2 20.9 24.9 31.4 ...
print(dim(myDataf))
[1] 6 3
Note that if the vectors used to generate the dataframe are character strings, it is advised in R versions < 4 to add the argument stringsAsFactors=FALSE.
If the vectors that will generate the dataframe do not exist yet in the session, but you would like to initiate a dataframe to fill it during your analysis, you could imagine creating an empty dataframe. But this approach is rarely practical: a dataframe with 0 rows and 0 columns has no structure to fill and, for example, assigning a vector to a new column with $ fails because the numbers of rows do not match.
d <- data.frame()
d
dim(d)
In that case, it is better to create an empty matrix and to convert it to a dataframe. See below.
as.data.frame()
Let's try with the object myData2 we previously created. It is a matrix:
class(myData2)
class(as.data.frame(myData2))
str(as.data.frame(myData2))
'data.frame': 6 obs. of 3 variables:
 $ weight: num 60 72 57 90 95 72
 $ size  : num 1.75 1.8 1.65 1.9 1.74 1.91
 $ bmi   : num 19.6 22.2 20.9 24.9 31.4 ...
You may also use as.data.frame() on a matrix generated by binding rows or columns:
d2 <- as.data.frame(cbind(1:2, 10:11))
str(d2)
'data.frame': 2 obs. of 2 variables:
 $ V1: int 1 2
 $ V2: int 10 11
So, similarly, we can convert an empty matrix into a dataframe, as in this example with a matrix of two rows and three columns currently filled with missing values:
d <- as.data.frame(matrix(NA,2,3))
d
dim(d)
str(d)
V1 | V2 | V3 |
---|---|---|
<lgl> | <lgl> | <lgl> |
NA | NA | NA |
NA | NA | NA |
'data.frame': 2 obs. of 3 variables:
 $ V1: logi NA NA
 $ V2: logi NA NA
 $ V3: logi NA NA
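As a possible alternative to the NA-filled matrix (this is not the approach used in this tutorial, just a sketch), a dataframe can also be initialised with zero rows but typed, named columns, and then extended with rbind(). The names results, name and weight are hypothetical.

```r
# A dataframe with 0 rows but two typed, named columns
results <- data.frame(name = character(), weight = numeric(),
                      stringsAsFactors = FALSE)
print(dim(results))

# Rows can then be appended one at a time with rbind()
results <- rbind(results,
                 data.frame(name = "Fabien", weight = 60, stringsAsFactors = FALSE))
results <- rbind(results,
                 data.frame(name = "Pierre", weight = 72, stringsAsFactors = FALSE))
str(results)
```

Growing a dataframe row by row is fine for a handful of rows; for large data it is usually faster to build the vectors first and create the dataframe once.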
You may use the same functions as the ones used for matrices: rownames() and colnames():
rownames(d)
colnames(d)
But it is better to use the functions dedicated to dataframes, which are row.names() and names():
row.names(d)
names(d)
To follow more easily, let's first display myDataf again:
print(myDataf)
         weight size      bmi
Fabien       60 1.75 19.59184
Pierre       72 1.80 22.22222
Sandrine     57 1.65 20.93664
Claire       90 1.90 24.93075
Bruno        95 1.74 31.37799
Delphine     72 1.91 19.73630
Variables are the columns of a dataframe. You can extract the vector corresponding to a column from a dataframe with its index, with the name of the column inside "" or using the symbol $:
print(myDataf[,2])
print(myDataf[,"size"])
print(myDataf$size)
[1] 1.75 1.80 1.65 1.90 1.74 1.91
[1] 1.75 1.80 1.65 1.90 1.74 1.91
[1] 1.75 1.80 1.65 1.90 1.74 1.91
You have two options to do so:
myDataf[2,]
weight | size | bmi | |
---|---|---|---|
<dbl> | <dbl> | <dbl> | |
Pierre | 72 | 1.8 | 22.22222 |
""
inside the square brackets:
myDataf["Pierre",]
weight | size | bmi | |
---|---|---|---|
<dbl> | <dbl> | <dbl> | |
Pierre | 72 | 1.8 | 22.22222 |
class(myDataf["Pierre",])
In both cases, you may notice that you obtain a dataframe and not a vector, even if you extract only one row.
If you wish to get the vector corresponding to a row, you have to convert it with the unlist() function.
Of note, dataframes are a special case of lists: a list of variables with the same number of rows and unique row names.
temp <- unlist(myDataf["Pierre",])
print(temp)
class(temp)
  weight     size      bmi
72.00000  1.80000 22.22222
d2$new <- 1:2
d2
V1 | V2 | new |
---|---|---|
<int> | <int> | <int> |
1 | 10 | 1 |
2 | 11 | 2 |
Here is another example to add a column "sex" to the dataframe myDataf using a vector called "gender". I changed the name of the vector but you could keep the same name!
gender <- c("Man","Man","Woman","Woman","Man","Woman")
print(gender)
myDataf$sex <- gender
print(myDataf$sex)
myDataf
str(myDataf)
[1] "Man" "Man" "Woman" "Woman" "Man" "Woman"
[1] "Man" "Man" "Woman" "Woman" "Man" "Woman"
weight | size | bmi | sex | |
---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <chr> | |
Fabien | 60 | 1.75 | 19.59184 | Man |
Pierre | 72 | 1.80 | 22.22222 | Man |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
Claire | 90 | 1.90 | 24.93075 | Woman |
Bruno | 95 | 1.74 | 31.37799 | Man |
Delphine | 72 | 1.91 | 19.73630 | Woman |
'data.frame': 6 obs. of 4 variables:
 $ weight: num 60 72 57 90 95 72
 $ size  : num 1.75 1.8 1.65 1.9 1.74 1.91
 $ bmi   : num 19.6 22.2 20.9 24.9 31.4 ...
 $ sex   : chr "Man" "Man" "Woman" "Woman" ...
data.frame():
d3 <- data.frame(d, d2)
d3
V1 | V2 | V3 | V1.1 | V2.1 | new |
---|---|---|---|---|---|
<lgl> | <lgl> | <lgl> | <int> | <int> | <int> |
NA | NA | NA | 1 | 10 | 1 |
NA | NA | NA | 2 | 11 | 2 |
The function read.table() reads a delimited text file (tab-delimited, csv or other column separator) into R and generates a dataframe.
Before importing the file Temperatures.txt, let's see what it looks like. Just double click on it. It is located in /shared/projects/dubii2021/trainers/module3/data/
You will see it is a tab-delimited text file.
Now let's import it in R by specifying the correct separator with the read.table() function:
path_to_file <- "/shared/projects/dubii2021/trainers/module3/data/Temperatures.txt"
temperatures <- read.table(path_to_file, sep="\t", header=T, stringsAsFactors=F)
temperatures
str(temperatures)
Month | Mean_Temp |
---|---|
<chr> | <dbl> |
January | 2.0 |
February | 2.6 |
March | 7.9 |
April | 11.2 |
May | 15.3 |
June | 22.2 |
July | 22.9 |
August | 22.5 |
September | 17.3 |
October | 11.7 |
November | 5.2 |
December | 2.8 |
'data.frame': 12 obs. of 2 variables:
 $ Month    : chr "January" "February" "March" "April" ...
 $ Mean_Temp: num 2 2.6 7.9 11.2 15.3 22.2 22.9 22.5 17.3 11.7 ...
In the above command, I used the argument stringsAsFactors=FALSE to avoid converting the columns containing character strings (here the "Month" column) into factors.
In R versions < 4, the default value for this argument is TRUE. Let's see what would have happened:
temperatures.2 <- read.table(path_to_file, sep="\t", header=T, stringsAsFactors=TRUE)
str(temperatures.2)
'data.frame': 12 obs. of 2 variables:
 $ Month    : Factor w/ 12 levels "April","August",..: 5 4 8 1 9 7 6 2 12 11 ...
 $ Mean_Temp: num 2 2.6 7.9 11.2 15.3 22.2 22.9 22.5 17.3 11.7 ...
Here the "Month" column has been factorised. How?
levels(temperatures.2$Month)
In alphabetical order, which is not what you want!
Thus always use stringsAsFactors=FALSE
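If you do need a factor later on (for plotting, for example), one option is to create it yourself and state the order of the levels explicitly. The sketch below assumes English month names and uses the built-in constant month.name; month_labels, alphabetical and calendar are example names.

```r
month_labels <- c("January", "February", "March", "April")

# Default factor: levels are sorted alphabetically
alphabetical <- factor(month_labels)
print(levels(alphabetical))

# Explicit levels: month.name is a built-in vector with the 12 month names
calendar <- factor(month_labels, levels = month.name)
print(as.integer(calendar))   # position of each month in the calendar
print(sort(calendar))         # sorting now follows the calendar order
```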
Conversely, save a dataframe into your working directory with write.table():
# save a dataframe as a text file in the working directory
write.table(myDataf, file="bmi_data.txt", sep="\t", quote=F, col.names=T)
Have a look at it by double clicking on it in your working directory, and check that you can import it back into R:
rm(myDataf)
myDataf <- read.table("bmi_data.txt", sep="\t", header=T, stringsAsFactors=F)
head(myDataf) #myDataf is again accessible
file.remove("bmi_data.txt") #to clean the working directory
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Fabien | 60 | 1.75 | 19.59184 | Man |
Pierre | 72 | 1.80 | 22.22222 | Man |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
Claire | 90 | 1.90 | 24.93075 | Woman |
Bruno | 95 | 1.74 | 31.37799 | Man |
Delphine | 72 | 1.91 | 19.73630 | Woman |
which() returns the index of what is TRUE in a tested condition:
print(which ( myDataf$sex == "Woman") )
[1] 3 4 6
Here, we obtain a vector where 3, 4 and 6 correspond to the positions or indexes (1-based) of the occurrences of "Woman" in the vector/variable myDataf$sex. We can then use this vector as usual in a dataframe before the "," to select the corresponding rows.
myDataf [ which ( myDataf$sex == "Woman") , ]
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
Claire | 90 | 1.90 | 24.93075 | Woman |
Delphine | 72 | 1.91 | 19.73630 | Woman |
str(myDataf [ which ( myDataf$sex == "Woman") , ])
'data.frame': 3 obs. of 4 variables:
 $ weight: int 57 90 72
 $ size  : num 1.65 1.9 1.91
 $ bmi   : num 20.9 24.9 19.7
 $ sex   : chr "Woman" "Woman" "Woman"
Instead of "==" one can use != for "is different" to detect what does not match.
print(which ( myDataf$sex != "Man"))
[1] 3 4 6
Another method would be to add ! for "not" before the test, to get the complementary result:
print(which (! myDataf$sex == "Man"))
[1] 3 4 6
Let's make a copy of our dataframe and replace the gender of Claire by a missing value:
myDataf2 <- myDataf
myDataf2["Claire", "sex"] <- NA
myDataf2
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Fabien | 60 | 1.75 | 19.59184 | Man |
Pierre | 72 | 1.80 | 22.22222 | Man |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
Claire | 90 | 1.90 | 24.93075 | NA |
Bruno | 95 | 1.74 | 31.37799 | Man |
Delphine | 72 | 1.91 | 19.73630 | Woman |
and rerun the same command as above without which() on the new myDataf2:
myDataf2[myDataf2$sex == "Woman",]
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
NA | NA | NA | NA | NA |
Delphine | 72 | 1.91 | 19.73630 | Woman |
myDataf2[which(myDataf2$sex == "Woman"),]
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
Delphine | 72 | 1.91 | 19.73630 | Woman |
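The difference between the two results comes from the way R compares missing values: NA == "Woman" is neither TRUE nor FALSE but NA, and a row selected with NA shows up as a row of NAs. which() only keeps the positions that are TRUE. The sketch below illustrates this on a standalone vector (sex_na, test and the other names are chosen for this example).

```r
sex_na <- c("Man", "Man", "Woman", NA, "Man", "Woman")

test <- sex_na == "Woman"   # the comparison with NA gives NA, not FALSE
print(test)

kept <- which(test)         # which() silently drops the NA
print(kept)

# Two alternatives that never return NA
with_in    <- sex_na %in% "Woman"
with_is_na <- !is.na(sex_na) & sex_na == "Woman"
print(with_in)
print(with_is_na)
```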
grep():
It returns the index of what matches, even partially.
print(grep("Wom", myDataf$sex))
[1] 3 4 6
print(grep("Woman", myDataf$sex))
[1] 3 4 6
myDataf [grep("Woman", myDataf$sex), ]
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
Claire | 90 | 1.90 | 24.93075 | Woman |
Delphine | 72 | 1.91 | 19.73630 | Woman |
print(grep("a", row.names(myDataf)))
[1] 1 3 4
myDataf [grep("a", row.names(myDataf)),]
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Fabien | 60 | 1.75 | 19.59184 | Man |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
Claire | 90 | 1.90 | 24.93075 | Woman |
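A few variations of grep() are sketched below on a standalone vector: matching is case-sensitive by default, grepl() returns logical values instead of indexes, and value = TRUE returns the matching strings themselves. The variable names are chosen for this example.

```r
sex_vec <- c("Man", "Man", "Woman", "Woman", "Man", "Woman")

# Case-sensitive by default: "Woman" contains "man" but not "Man"
idx_man <- grep("Man", sex_vec)
print(idx_man)

# ignore.case = TRUE makes "man" match every element here
idx_any <- grep("man", sex_vec, ignore.case = TRUE)
print(idx_any)

# grepl() returns a logical vector; ^ anchors the pattern at the start
starts_wo <- grepl("^Wo", sex_vec)
print(starts_wo)

# value = TRUE returns the matching elements instead of their indexes
matched <- grep("Wom", sex_vec, value = TRUE)
print(matched)
```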
subset() is even simpler than which():
Just enter the dataframe as first argument, followed by the condition on the variable (written without "quotes") on which you do the filtering.
WomenDataf <- subset(myDataf, sex == "Woman")
WomenDataf
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
Claire | 90 | 1.90 | 24.93075 | Woman |
Delphine | 72 | 1.91 | 19.73630 | Woman |
& = AND, | = OR, ! = NOT
==, != for different, >, <, >=, <=
%in%
filteredData <- myDataf [ which ( myDataf$sex == "Woman" & myDataf$weight < 80 & myDataf$bmi > 20), ]
filteredData
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
subset( myDataf, sex == "Woman" & weight < 80 & bmi > 20)
weight | size | bmi | sex | |
---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | |
Sandrine | 57 | 1.65 | 20.93664 | Woman |
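The operators | and %in% listed above can be used in the same way. The sketch below rebuilds a reduced version of the dataframe so that it runs on its own; by_name, by_weight and man_or_light are example names.

```r
myDataf <- data.frame(
  weight = c(60, 72, 57, 90, 95, 72),
  sex    = c("Man", "Man", "Woman", "Woman", "Man", "Woman"),
  row.names = c("Fabien", "Pierre", "Sandrine", "Claire", "Bruno", "Delphine"),
  stringsAsFactors = FALSE
)

# %in% tests membership in a set of values; values absent from the data are ignored
by_name   <- myDataf[row.names(myDataf) %in% c("Pierre", "Claire", "Nobody"), ]
by_weight <- subset(myDataf, weight %in% c(72, 95))

# | keeps the rows fulfilling at least one of the conditions
man_or_light <- subset(myDataf, sex == "Man" | weight < 60)

print(by_name)
print(by_weight)
print(man_or_light)
```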
In this example, I add one column with indexes that I will use as a key, but we can also use an existing variable as a key.
myDataf$index <- 1:6
myDataf
weight | size | bmi | sex | index | |
---|---|---|---|---|---|
<int> | <dbl> | <dbl> | <chr> | <int> | |
Fabien | 60 | 1.75 | 19.59184 | Man | 1 |
Pierre | 72 | 1.80 | 22.22222 | Man | 2 |
Sandrine | 57 | 1.65 | 20.93664 | Woman | 3 |
Claire | 90 | 1.90 | 24.93075 | Woman | 4 |
Bruno | 95 | 1.74 | 31.37799 | Man | 5 |
Delphine | 72 | 1.91 | 19.73630 | Woman | 6 |
Then I generate another dataframe with handedness information on 6 samples, but one sample is new compared to the initial dataframe.
OtherData <- data.frame(c(1:5, 7),rep(c("right-handed","left-handed"),3))
names(OtherData) <- c("ID","handedness")
OtherData
ID | handedness |
---|---|
<dbl> | <chr> |
1 | right-handed |
2 | left-handed |
3 | right-handed |
4 | left-handed |
5 | right-handed |
7 | left-handed |
We can now merge them together by specifying the "key" column with the argument by. The all argument is used to keep all the rows of a dataframe that are not present in the other. The .x refers to the first dataframe while .y refers to the second one.
myMergedDataf <- merge(myDataf, OtherData, by.x="index", by.y="ID", all.x=T, all.y=T, sort=F)
myMergedDataf
index | weight | size | bmi | sex | handedness |
---|---|---|---|---|---|
<dbl> | <int> | <dbl> | <dbl> | <chr> | <chr> |
1 | 60 | 1.75 | 19.59184 | Man | right-handed |
2 | 72 | 1.80 | 22.22222 | Man | left-handed |
3 | 57 | 1.65 | 20.93664 | Woman | right-handed |
4 | 90 | 1.90 | 24.93075 | Woman | left-handed |
5 | 95 | 1.74 | 31.37799 | Man | right-handed |
6 | 72 | 1.91 | 19.73630 | Woman | NA |
7 | NA | NA | NA | NA | left-handed |
In the merged dataframe, we start with all the rows present in both dataframes. The next row contains the data only present in the first dataframe with missing data for the columns in the second dataframe. The last rows are the ones with data only present in the second dataframe with missing data for the first dataframe.
Unless the merge is done on the row names (by=0 or by="row.names"), the row names of the initial dataframe are lost. The new dataframe has its own row names.
If two columns have the same name in both dataframes, by default R adds ".x" to the one from the first dataframe and ".y" to the one from the second dataframe. The names can be changed with the argument suffixes.
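To see the effect of the all arguments, the sketch below merges two small dataframes in three ways and compares the number of rows. The dataframes left_df and right_df are simplified stand-ins for myDataf and OtherData, and all names are chosen for this example.

```r
left_df  <- data.frame(index = 1:6,
                       weight = c(60, 72, 57, 90, 95, 72))
right_df <- data.frame(ID = c(1:5, 7),
                       handedness = rep(c("right-handed", "left-handed"), 3),
                       stringsAsFactors = FALSE)

# Default: only the keys present in both dataframes are kept
inner <- merge(left_df, right_df, by.x = "index", by.y = "ID")

# all.x = TRUE: keep every row of the first dataframe
keep_left <- merge(left_df, right_df, by.x = "index", by.y = "ID", all.x = TRUE)

# all = TRUE: keep every row of both dataframes
full <- merge(left_df, right_df, by.x = "index", by.y = "ID", all = TRUE)

print(c(inner = nrow(inner), keep_left = nrow(keep_left), full = nrow(full)))
print(full)
```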
We will see in more depth how to generate basic plots for different kinds of variables in your personal work on Wednesday, and during session 2 of R how to generate custom plots either with base R or ggplot.
But let's have a quick view of what can be done on our dataframe.
plot():
plot(myDataf$weight~myDataf$size)
boxplot():
boxplot(myDataf$weight)
or using ~
to display boxplots on the same plot depending on a categorical variable:
boxplot(myDataf$weight~myDataf$sex)
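The same boxplot can be written with the data argument, which avoids repeating the name of the dataframe in the formula. This is only a sketch: the dataframe is rebuilt with two columns so that the block runs on its own, and the axis labels are arbitrary.

```r
myDataf <- data.frame(
  weight = c(60, 72, 57, 90, 95, 72),
  sex    = c("Man", "Man", "Woman", "Woman", "Man", "Woman"),
  stringsAsFactors = FALSE
)

# weight ~ sex reads "weight as a function of sex"; columns are looked up in data
bp <- boxplot(weight ~ sex, data = myDataf, xlab = "sex", ylab = "weight")

# boxplot() also returns (invisibly) the statistics used to draw the boxes
print(bp$names)
print(bp$stats[3, ])   # third row = median of each group
```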
hist():
a <- rnorm(1000) # to sample 1000 values from a normal distribution of mean 0 and standard deviation 1
hist(a, breaks=20) # the argument breaks is used to specify the number of intervals
We will see later that graphics functions fall into three levels:
primary graph functions like plot(), boxplot() and hist() to display the main types of graphs in R
secondary graph functions to complement an existing plot
graphical parameters to modify the display of the plots:
par() function before plotting the graph.
Let's save all the main objects of this session into an .RData file:
ls()
We will keep myDataf
and temperatures
.
save(myDataf,temperatures, file="RSession1_tutorial.RData")
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.2/lib/libopenblasp-r0.3.10.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
loaded via a namespace (and not attached):
 [1] digest_0.6.27   crayon_1.3.4    IRdisplay_0.7.0 repr_1.1.0
 [5] lifecycle_0.2.0 jsonlite_1.7.2  evaluate_0.14   pillar_1.4.7
 [9] rlang_0.4.10    uuid_0.1-4      vctrs_0.3.6     ellipsis_0.3.1
[13] IRkernel_1.1.1  Cairo_1.5-12.2  tools_4.0.2     compiler_4.0.2
[17] base64enc_0.1-3 htmltools_0.5.1 pbdZMQ_0.3-3.1