Goal of this tutorial

This tutorial aims at showing how to use the gProfiler2 R package. It is more or less a Rmd version of the vignette. You can refer to the vignette and reference manual (links below) for more details.

g:GOSt - functional enrichment analysis of gene lists

Here is an example in Homo sapiens with various types of gene identifiers, a SNP ID, chromosomal intervals and a term ID from the Gene Ontology (GO). Parameters are described in the vignette. The result is a named list where “result” is a data.frame with the enrichment analysis results and “meta” contains a named list with all the metadata for the query.

## Example of gene list functional enrichment analysis with gost
gostres <- gost(
  query = c("X:1000:1000000", 
            "rs17396340", 
            "GO:0005005",
            "ENSG00000156103", 
            "NLRP1"), 
  organism = "hsapiens", 
  ordered_query = FALSE, 
  multi_query = FALSE, 
  significant = TRUE, 
  exclude_iea = FALSE, 
  measure_underrepresentation = FALSE, 
  evcodes = FALSE, 
  user_threshold = 0.05, 
  correction_method = "g_SCS", 
  domain_scope = "annotated", 
  custom_bg = NULL, 
  numeric_ns = "", 
  sources = NULL, 
  as_short_link = FALSE)

names(gostres)
[1] "result" "meta"  
summary(gostres)
       Length Class      Mode
result 14     data.frame list
meta    5     -none-     list
names(gostres$meta)
[1] "query_metadata"  "result_metadata" "genes_metadata"  "timestamp"       "version"        
# head(gostres$result)
kable(head(gostres$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))
query significant p_value term_size query_size intersection_size precision recall term_id source term_name effective_domain_size source_order parents
query_1 TRUE 3.000000e-30 88 22 16 0.727 0.182 GO:0048013 GO:BP ephrin receptor signaling pathway 18092 14554 GO:0007169
query_1 TRUE 1.077092e-21 285 22 16 0.727 0.056 GO:0007411 GO:BP axon guidance 18092 3305 GO:0007409, GO:0097485
query_1 TRUE 1.140559e-21 286 22 16 0.727 0.056 GO:0097485 GO:BP neuron projection guidance 18092 21970 GO:0006928, GO:0006935, GO:0048812
query_1 TRUE 6.823529e-18 489 22 16 0.727 0.033 GO:0007409 GO:BP axonogenesis 18092 3304 GO:0048667, GO:0048812, GO:0061564
query_1 TRUE 2.809503e-17 534 22 16 0.727 0.030 GO:0061564 GO:BP axon development 18092 18331 GO:0031175
query_1 TRUE 2.846480e-16 617 22 16 0.727 0.026 GO:0048667 GO:BP cell morphogenesis involved in neuron differentiation 18092 15031 GO:0000904, GO:0048666

With the parameter evcodes = TRUE, the result will include the evidence codes. In addition, a column “intersection” will appear showing the input gene IDs that intersect with the corresponding functional term. Note that this parameter can decrease the performance and make the query slower.

# Example with intersection (can be time consuming)
gostres2 <- gost(
  query = c("X:1000:1000000", 
            "rs17396340",
            "GO:0005005", 
            "ENSG00000156103", 
            "NLRP1"), 
  organism = "hsapiens", 
  ordered_query = FALSE, 
  multi_query = FALSE, 
  significant = TRUE, 
  exclude_iea = FALSE, 
  measure_underrepresentation = FALSE, 
  evcodes = TRUE, 
  user_threshold = 0.05, 
  correction_method = "g_SCS", 
  domain_scope = "annotated", 
  custom_bg = NULL, 
  numeric_ns = "", 
  sources = NULL)

# head(gostres2$result)
kable(head(gostres2$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0, 0, 0))
query significant p_value term_size query_size intersection_size precision recall term_id source term_name effective_domain_size source_order parents evidence_codes intersection
query_1 TRUE 3.000000e-30 88 22 16 0.727 0.182 GO:0048013 GO:BP ephrin receptor signaling pathway 18092 14554 GO:0007169 IDA TAS IEA,ISS TAS IEA,TAS IEA,IDA IBA TAS,IGI TAS IEA,ISS TAS IEA,IDA TAS IEA,IDA TAS IEA,IGI ISS IBA TAS IEA,ISS TAS IEA,TAS,IDA TAS IEA,IDA TAS,TAS IEA,IDA TAS IEA,IBA TAS IEA ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1 TRUE 1.077092e-21 285 22 16 0.727 0.056 GO:0007411 GO:BP axon guidance 18092 3305 GO:0007409, GO:0097485 IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1 TRUE 1.140559e-21 286 22 16 0.727 0.056 GO:0097485 GO:BP neuron projection guidance 18092 21970 GO:0006928, GO:0006935, GO:0048812 IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1 TRUE 6.823529e-18 489 22 16 0.727 0.033 GO:0007409 GO:BP axonogenesis 18092 3304 GO:0048667, GO:0048812, GO:0061564 IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1 TRUE 2.809503e-17 534 22 16 0.727 0.030 GO:0061564 GO:BP axon development 18092 18331 GO:0031175 ISS IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1 TRUE 2.846480e-16 617 22 16 0.727 0.026 GO:0048667 GO:BP cell morphogenesis involved in neuron differentiation 18092 15031 GO:0000904, GO:0048666 IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364

The query results can also be gathered into a short-link to the g:Profiler web tool.

# Get a web link
gostres_link <- gost(
  query = c("X:1000:1000000", 
            "rs17396340", 
            "GO:0005005", 
            "ENSG00000156103", 
            "NLRP1"), 
  as_short_link = TRUE)
gostres_link
[1] "https://biit.cs.ut.ee/gplink/l/2-zGBtraRm"

The function gost also allows to perform enrichment on multiple input gene lists. If the parameter multiquery = TRUE is used, then the results from all of the input queries are grouped according to term IDs for better comparison.

# Multiple gene lists
multi_gostres1 <- gost(
  query = list("chromX" = c("X:1000:1000000", 
                            "rs17396340", 
                            "GO:0005005", 
                            "ENSG00000156103", 
                            "NLRP1"),
               "chromY" = c("Y:1:10000000", 
                            "rs17396340", 
                            "GO:0005005", 
                            "ENSG00000156103", 
                            "NLRP1")), 
                multi_query = FALSE)

# head(multi_gostres1$result)
kable(head(multi_gostres1$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))
query significant p_value term_size query_size intersection_size precision recall term_id source term_name effective_domain_size source_order parents
chromX TRUE 3.000000e-30 88 22 16 0.727 0.182 GO:0048013 GO:BP ephrin receptor signaling pathway 18092 14554 GO:0007169
chromX TRUE 1.077092e-21 285 22 16 0.727 0.056 GO:0007411 GO:BP axon guidance 18092 3305 GO:0007409, GO:0097485
chromX TRUE 1.140559e-21 286 22 16 0.727 0.056 GO:0097485 GO:BP neuron projection guidance 18092 21970 GO:0006928, GO:0006935, GO:0048812
chromX TRUE 6.823529e-18 489 22 16 0.727 0.033 GO:0007409 GO:BP axonogenesis 18092 3304 GO:0048667, GO:0048812, GO:0061564
chromX TRUE 2.809503e-17 534 22 16 0.727 0.030 GO:0061564 GO:BP axon development 18092 18331 GO:0031175
chromX TRUE 2.846480e-16 617 22 16 0.727 0.026 GO:0048667 GO:BP cell morphogenesis involved in neuron differentiation 18092 15031 GO:0000904, GO:0048666
multi_gostres2 <- gost(
  query = list("chromX" = c("X:1000:1000000", 
                            "rs17396340",
                            "GO:0005005", 
                            "ENSG00000156103", "NLRP1"),
               "chromY" = c("Y:1:10000000", 
                            "rs17396340", 
                            "GO:0005005", 
                            "ENSG00000156103", "NLRP1")), 
  multi_query = TRUE)

# head(multi_gostres2$result)
kable(head(multi_gostres2$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))
term_id p_values significant term_size query_sizes intersection_sizes source term_name effective_domain_size source_order parents
GO:0005005 7.162616e-48, 4.090780e-44 TRUE, TRUE 16 23, 33 16, 16 GO:MF transmembrane-ephrin receptor activity 18694 1548 GO:0005003
GO:0005003 6.933232e-45, 3.953788e-41 TRUE, TRUE 19 23, 33 16, 16 GO:MF ephrin receptor activity 18694 1546 GO:0004714
GO:0004714 2.581144e-33, 1.439607e-29 TRUE, TRUE 63 23, 33 16, 16 GO:MF transmembrane receptor protein tyrosine kinase activity 18694 1295 GO:0004713, GO:0019199
REAC:R-HSA-3928665 4.164481e-33, 1.836785e-32 TRUE, TRUE 49 20, 21 16, 16 REAC EPH-ephrin mediated repulsion of cells 10531 747 REAC:R-HSA-2682334
GO:0019199 2.920627e-31, 1.613380e-27 TRUE, TRUE 82 23, 33 16, 16 GO:MF transmembrane receptor protein kinase activity 18694 4539 GO:0004672, GO:0004888
GO:0048013 2.813706e-30, 8.827603e-26 TRUE, TRUE 88 22, 34 16, 16 GO:BP ephrin receptor signaling pathway 18092 14554 GO:0007169

The enrichment results can be visualized with a Manhattan-like-plot.

# Visualization
gostplot(gostres, capped = TRUE, interactive = TRUE)

If interactive = FALSE, then the function returns a static ggplot object.

The function publish_gostplot takes the static plot object as an input and enables to highlight a selection of interesting terms from the results with numbers and table of results.

# Static ggplot object
p <- gostplot(gostres, capped = FALSE, interactive = FALSE)

# Highlight terms
pp <- publish_gostplot(p, 
                       highlight_terms = c("GO:0048013", "REAC:R-HSA-3928663"), 
                       width = NA, 
                       height = NA, 
                       filename = NULL )

The gost results can also be visualized as a table.

# Table
publish_gosttable(
  gostres, 
  highlight_terms = gostres$result[c(1:2,10,100:102,120,124,125),],
  use_colors = TRUE, 
  show_columns = c("source", 
                   "term_name", 
                   "term_size", 
                   "intersection_size"),
  filename = NULL)

The same functions work also in case of multiquery results showing multiple Manhattan plots on top of each other and a multiple result table.

# Multiquery plot
gostplot(multi_gostres2, capped = TRUE, interactive = TRUE)
# Multiquery table
publish_gosttable(
  multi_gostres1, 
  highlight_terms = multi_gostres1$result[c(1, 24, 82, 176, 204, 234),],
  use_colors = TRUE, 
  show_columns = c("source", "term_name", "term_size"),
  filename = NULL)

In addition to the available GO, KEGG, etc data sources, users can upload their own custom data source using the Gene Matrix Transposed file format (GMT). The users can compose the files themselves or use pre-compiled gene sets from available dedicated websites like Molecular Signatures Database (MSigDB), etc.

Other tools

GProfiler2 also has a tool to convert gene IDs between databases, etc. It is called g:Convert.

# Gene identifier conversion with gconvert
gconvert(
  query = c("REAC:R-HSA-3928664", "rs17396340", "NLRP1"), 
  organism = "hsapiens", 
  target = "ENSG",
  mthreshold = Inf, 
  filter_na = TRUE)
   input_number              input target_number          target    name                                                                       description                           namespace
1             1 REAC:R-HSA-3928664           1.1 ENSG00000010810     FYN FYN proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:4037]                                REAC
2             1 REAC:R-HSA-3928664          1.10 ENSG00000133216   EPHB2                                EPH receptor B2 [Source:HGNC Symbol;Acc:HGNC:3393]                                REAC
3             1 REAC:R-HSA-3928664          1.11 ENSG00000136238    RAC1                      Rac family small GTPase 1 [Source:HGNC Symbol;Acc:HGNC:9801]                                REAC
4             1 REAC:R-HSA-3928664          1.12 ENSG00000137575   SDCBP                      syndecan binding protein [Source:HGNC Symbol;Acc:HGNC:10662]                                REAC
5             1 REAC:R-HSA-3928664          1.13 ENSG00000149269    PAK1                  p21 (RAC1) activated kinase 1 [Source:HGNC Symbol;Acc:HGNC:8590]                                REAC
6             1 REAC:R-HSA-3928664          1.14 ENSG00000154928   EPHB1                                EPH receptor B1 [Source:HGNC Symbol;Acc:HGNC:3392]                                REAC
7             1 REAC:R-HSA-3928664          1.15 ENSG00000180370    PAK2                  p21 (RAC1) activated kinase 2 [Source:HGNC Symbol;Acc:HGNC:8591]                                REAC
8             1 REAC:R-HSA-3928664          1.16 ENSG00000182580   EPHB3                                EPH receptor B3 [Source:HGNC Symbol;Acc:HGNC:3394]                                REAC
9             1 REAC:R-HSA-3928664          1.17 ENSG00000196411   EPHB4                                EPH receptor B4 [Source:HGNC Symbol;Acc:HGNC:3395]                                REAC
10            1 REAC:R-HSA-3928664           1.2 ENSG00000071051    NCK2                          NCK adaptor protein 2 [Source:HGNC Symbol;Acc:HGNC:7665]                                REAC
11            1 REAC:R-HSA-3928664           1.3 ENSG00000077264    PAK3                  p21 (RAC1) activated kinase 3 [Source:HGNC Symbol;Acc:HGNC:8592]                                REAC
12            1 REAC:R-HSA-3928664           1.4 ENSG00000090776   EFNB1                                      ephrin B1 [Source:HGNC Symbol;Acc:HGNC:3226]                                REAC
13            1 REAC:R-HSA-3928664           1.5 ENSG00000101608  MYL12A                        myosin light chain 12A [Source:HGNC Symbol;Acc:HGNC:16701]                                REAC
14            1 REAC:R-HSA-3928664           1.6 ENSG00000102606 ARHGEF7      Rho guanine nucleotide exchange factor 7 [Source:HGNC Symbol;Acc:HGNC:15607]                                REAC
15            1 REAC:R-HSA-3928664           1.7 ENSG00000108262    GIT1                                   GIT ArfGAP 1 [Source:HGNC Symbol;Acc:HGNC:4272]                                REAC
16            1 REAC:R-HSA-3928664           1.8 ENSG00000108947   EFNB3                                      ephrin B3 [Source:HGNC Symbol;Acc:HGNC:3228]                                REAC
17            1 REAC:R-HSA-3928664           1.9 ENSG00000125266   EFNB2                                      ephrin B2 [Source:HGNC Symbol;Acc:HGNC:3227]                                REAC
18            2         rs17396340           2.1 ENSG00000054523   KIF1B                      kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]                                    
19            3              NLRP1           3.1 ENSG00000091592   NLRP1          NLR family pyrin domain containing 1 [Source:HGNC Symbol;Acc:HGNC:14374] ENTREZGENE,HGNC,UNIPROT_GN,WIKIGENE

GProfiler2 also has a tool to map gene IDs between species. It is called g:Orth.

# Mapping homologous genes across related organisms with gorth
gorth(query = c("Klf4", "Sox2", "71950"), 
      source_organism = "mmusculus", 
      target_organism = "hsapiens", 
      mthreshold = Inf, 
      filter_na = TRUE,
      numeric_ns = "ENTREZGENE_ACC")
  input_number input         input_ensg ensg_number ortholog_name   ortholog_ensg                                                        description
1            1  Klf4 ENSMUSG00000003032       1.1.1          KLF4 ENSG00000136826           Kruppel like factor 4 [Source:HGNC Symbol;Acc:HGNC:6348]
2            2  Sox2 ENSMUSG00000074637       2.1.1          SOX2 ENSG00000181449 SRY-box transcription factor 2 [Source:HGNC Symbol;Acc:HGNC:11195]
3            3 71950 ENSMUSG00000012396       3.1.1         NANOG ENSG00000111704                 Nanog homeobox [Source:HGNC Symbol;Acc:HGNC:20857]
4            3 71950 ENSMUSG00000012396       3.1.2       NANOGP8 ENSG00000255192    Nanog homeobox retrogene P8 [Source:HGNC Symbol;Acc:HGNC:23106]

Other gProfiler versions

It is possible to use older (archived) versions or the beta version of gProfiler.

# Accessing archived version (gprofiler2 package is only compatible with versions e94_eg41_p11 and higher)
set_base_url("http://biit.cs.ut.ee/gprofiler_archive3/e95_eg42_p13")

# Accessing the beta release
set_base_url("http://biit.cs.ut.ee/gprofiler_beta")

# Check the current version
get_base_url()
[1] "http://biit.cs.ut.ee/gprofiler_beta"

Save your session info

For the sake of traceability, store the specifications of your R environment in the report, with the command sessionInfo(). This will indicate the version of R as well as of all the libraries used in this notebook.

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.2/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gprofiler2_0.2.0 knitr_1.30      

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.20         purrr_0.3.4       colorspace_2.0-0  vctrs_0.3.6       generics_0.1.0    htmltools_0.5.1.1 viridisLite_0.3.0 yaml_2.2.1        utf8_1.2.1        plotly_4.9.3      rlang_0.4.10      pillar_1.5.1      later_1.1.0.1     glue_1.4.2        DBI_1.1.1        
[17] lifecycle_1.0.0   stringr_1.4.0     munsell_0.5.0     gtable_0.3.0      htmlwidgets_1.5.3 evaluate_0.14     labeling_0.4.2    fastmap_1.1.0     httpuv_1.5.5      crosstalk_1.1.1   fansi_0.4.2       highr_0.8         Rcpp_1.0.6        xtable_1.8-4      scales_1.1.1      promises_1.2.0.1 
[33] debugme_1.1.0     jsonlite_1.7.2    farver_2.1.0      mime_0.10         gridExtra_2.3     ggplot2_3.3.3     digest_0.6.27     stringi_1.5.3     dplyr_1.0.5       shiny_1.5.0       grid_4.0.2        tools_4.0.2       bitops_1.0-6      magrittr_2.0.1    lazyeval_0.2.2    RCurl_1.98-1.3   
[49] tibble_3.1.0      crayon_1.4.1      tidyr_1.1.3       pkgconfig_2.0.3   ellipsis_0.3.1    data.table_1.14.0 assertthat_0.2.1  rmarkdown_2.5     httr_1.4.2        R6_2.5.0          compiler_4.0.2