Tutorial gProfiler

Goal of this tutorial

This tutorial aims at showing how to use the gProfiler2 R package. It is more or less a Rmd version of the vignette. You can refer to the vignette and reference manual (links below) for more details.

References

R package: https://cran.r-project.org/web/packages/gprofiler2
Reference manual: https://cran.r-project.org/web/packages/gprofiler2/gprofiler2.pdf
Vignette: https://cran.r-project.org/web/packages/gprofiler2/vignettes/gprofiler2.html
ev codes: http://geneontology.org/docs/guide-go-evidence-codes/

g:GOSt - functional enrichment analysis of gene lists

Here is an example in Homo sapiens with various types of gene identifiers, a SNP ID, chromosomal intervals and a term ID from the Gene Ontology (GO). Parameters are described in the vignette. The result is a named list where “result” is a data.frame with the enrichment analysis results and “meta” contains a named list with all the metadata for the query.

## Example of gene list functional enrichment analysis with gost
gostres <- gost(
  query = c("X:1000:1000000", 
            "rs17396340", 
            "GO:0005005",
            "ENSG00000156103", 
            "NLRP1"), 
  organism = "hsapiens", 
  ordered_query = FALSE, 
  multi_query = FALSE, 
  significant = TRUE, 
  exclude_iea = FALSE, 
  measure_underrepresentation = FALSE, 
  evcodes = FALSE, 
  user_threshold = 0.05, 
  correction_method = "g_SCS", 
  domain_scope = "annotated", 
  custom_bg = NULL, 
  numeric_ns = "", 
  sources = NULL, 
  as_short_link = FALSE)

names(gostres)

[1] "result" "meta"

summary(gostres)

       Length Class      Mode
result 14     data.frame list
meta    5     -none-     list

names(gostres$meta)

[1] "query_metadata"  "result_metadata" "genes_metadata"  "timestamp"       "version"

# head(gostres$result)
kable(head(gostres$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))

query	significant	p_value	term_size	query_size	intersection_size	precision	recall	term_id	source	term_name	effective_domain_size	source_order	parents
query_1	TRUE	3.000000e-30	88	22	16	0.727	0.182	GO:0048013	GO:BP	ephrin receptor signaling pathway	18092	14554	GO:0007169
query_1	TRUE	1.077092e-21	285	22	16	0.727	0.056	GO:0007411	GO:BP	axon guidance	18092	3305	GO:0007409, GO:0097485
query_1	TRUE	1.140559e-21	286	22	16	0.727	0.056	GO:0097485	GO:BP	neuron projection guidance	18092	21970	GO:0006928, GO:0006935, GO:0048812
query_1	TRUE	6.823529e-18	489	22	16	0.727	0.033	GO:0007409	GO:BP	axonogenesis	18092	3304	GO:0048667, GO:0048812, GO:0061564
query_1	TRUE	2.809503e-17	534	22	16	0.727	0.030	GO:0061564	GO:BP	axon development	18092	18331	GO:0031175
query_1	TRUE	2.846480e-16	617	22	16	0.727	0.026	GO:0048667	GO:BP	cell morphogenesis involved in neuron differentiation	18092	15031	GO:0000904, GO:0048666

With the parameter evcodes = TRUE, the result will include the evidence codes. In addition, a column “intersection” will appear showing the input gene IDs that intersect with the corresponding functional term. Note that this parameter can decrease the performance and make the query slower.

# Example with intersection (can be time consuming)
gostres2 <- gost(
  query = c("X:1000:1000000", 
            "rs17396340",
            "GO:0005005", 
            "ENSG00000156103", 
            "NLRP1"), 
  organism = "hsapiens", 
  ordered_query = FALSE, 
  multi_query = FALSE, 
  significant = TRUE, 
  exclude_iea = FALSE, 
  measure_underrepresentation = FALSE, 
  evcodes = TRUE, 
  user_threshold = 0.05, 
  correction_method = "g_SCS", 
  domain_scope = "annotated", 
  custom_bg = NULL, 
  numeric_ns = "", 
  sources = NULL)

# head(gostres2$result)
kable(head(gostres2$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0, 0, 0))

query	significant	p_value	term_size	query_size	intersection_size	precision	recall	term_id	source	term_name	effective_domain_size	source_order	parents	evidence_codes	intersection
query_1	TRUE	3.000000e-30	88	22	16	0.727	0.182	GO:0048013	GO:BP	ephrin receptor signaling pathway	18092	14554	GO:0007169	IDA TAS IEA,ISS TAS IEA,TAS IEA,IDA IBA TAS,IGI TAS IEA,ISS TAS IEA,IDA TAS IEA,IDA TAS IEA,IGI ISS IBA TAS IEA,ISS TAS IEA,TAS,IDA TAS IEA,IDA TAS,TAS IEA,IDA TAS IEA,IBA TAS IEA	ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1	TRUE	1.077092e-21	285	22	16	0.727	0.056	GO:0007411	GO:BP	axon guidance	18092	3305	GO:0007409, GO:0097485	IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA	ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1	TRUE	1.140559e-21	286	22	16	0.727	0.056	GO:0097485	GO:BP	neuron projection guidance	18092	21970	GO:0006928, GO:0006935, GO:0048812	IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA	ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1	TRUE	6.823529e-18	489	22	16	0.727	0.033	GO:0007409	GO:BP	axonogenesis	18092	3304	GO:0048667, GO:0048812, GO:0061564	IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA	ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1	TRUE	2.809503e-17	534	22	16	0.727	0.030	GO:0061564	GO:BP	axon development	18092	18331	GO:0031175	ISS IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA	ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
query_1	TRUE	2.846480e-16	617	22	16	0.727	0.026	GO:0048667	GO:BP	cell morphogenesis involved in neuron differentiation	18092	15031	GO:0000904, GO:0048666	IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA	ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364

The query results can also be gathered into a short-link to the g:Profiler web tool.

# Get a web link
gostres_link <- gost(
  query = c("X:1000:1000000", 
            "rs17396340", 
            "GO:0005005", 
            "ENSG00000156103", 
            "NLRP1"), 
  as_short_link = TRUE)
gostres_link

[1] "https://biit.cs.ut.ee/gplink/l/2-zGBtraRm"

The function gost also allows to perform enrichment on multiple input gene lists. If the parameter multiquery = TRUE is used, then the results from all of the input queries are grouped according to term IDs for better comparison.

# Multiple gene lists
multi_gostres1 <- gost(
  query = list("chromX" = c("X:1000:1000000", 
                            "rs17396340", 
                            "GO:0005005", 
                            "ENSG00000156103", 
                            "NLRP1"),
               "chromY" = c("Y:1:10000000", 
                            "rs17396340", 
                            "GO:0005005", 
                            "ENSG00000156103", 
                            "NLRP1")), 
                multi_query = FALSE)

# head(multi_gostres1$result)
kable(head(multi_gostres1$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))

query	significant	p_value	term_size	query_size	intersection_size	precision	recall	term_id	source	term_name	effective_domain_size	source_order	parents
chromX	TRUE	3.000000e-30	88	22	16	0.727	0.182	GO:0048013	GO:BP	ephrin receptor signaling pathway	18092	14554	GO:0007169
chromX	TRUE	1.077092e-21	285	22	16	0.727	0.056	GO:0007411	GO:BP	axon guidance	18092	3305	GO:0007409, GO:0097485
chromX	TRUE	1.140559e-21	286	22	16	0.727	0.056	GO:0097485	GO:BP	neuron projection guidance	18092	21970	GO:0006928, GO:0006935, GO:0048812
chromX	TRUE	6.823529e-18	489	22	16	0.727	0.033	GO:0007409	GO:BP	axonogenesis	18092	3304	GO:0048667, GO:0048812, GO:0061564
chromX	TRUE	2.809503e-17	534	22	16	0.727	0.030	GO:0061564	GO:BP	axon development	18092	18331	GO:0031175
chromX	TRUE	2.846480e-16	617	22	16	0.727	0.026	GO:0048667	GO:BP	cell morphogenesis involved in neuron differentiation	18092	15031	GO:0000904, GO:0048666

multi_gostres2 <- gost(
  query = list("chromX" = c("X:1000:1000000", 
                            "rs17396340",
                            "GO:0005005", 
                            "ENSG00000156103", "NLRP1"),
               "chromY" = c("Y:1:10000000", 
                            "rs17396340", 
                            "GO:0005005", 
                            "ENSG00000156103", "NLRP1")), 
  multi_query = TRUE)

# head(multi_gostres2$result)
kable(head(multi_gostres2$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))

term_id	p_values	significant	term_size	query_sizes	intersection_sizes	source	term_name	effective_domain_size	source_order	parents
GO:0005005	7.162616e-48, 4.090780e-44	TRUE, TRUE	16	23, 33	16, 16	GO:MF	transmembrane-ephrin receptor activity	18694	1548	GO:0005003
GO:0005003	6.933232e-45, 3.953788e-41	TRUE, TRUE	19	23, 33	16, 16	GO:MF	ephrin receptor activity	18694	1546	GO:0004714
GO:0004714	2.581144e-33, 1.439607e-29	TRUE, TRUE	63	23, 33	16, 16	GO:MF	transmembrane receptor protein tyrosine kinase activity	18694	1295	GO:0004713, GO:0019199
REAC:R-HSA-3928665	4.164481e-33, 1.836785e-32	TRUE, TRUE	49	20, 21	16, 16	REAC	EPH-ephrin mediated repulsion of cells	10531	747	REAC:R-HSA-2682334
GO:0019199	2.920627e-31, 1.613380e-27	TRUE, TRUE	82	23, 33	16, 16	GO:MF	transmembrane receptor protein kinase activity	18694	4539	GO:0004672, GO:0004888
GO:0048013	2.813706e-30, 8.827603e-26	TRUE, TRUE	88	22, 34	16, 16	GO:BP	ephrin receptor signaling pathway	18092	14554	GO:0007169

The enrichment results can be visualized with a Manhattan-like-plot.

# Visualization
gostplot(gostres, capped = TRUE, interactive = TRUE)

If interactive = FALSE, then the function returns a static ggplot object.

The function publish_gostplot takes the static plot object as an input and enables to highlight a selection of interesting terms from the results with numbers and table of results.

# Static ggplot object
p <- gostplot(gostres, capped = FALSE, interactive = FALSE)

# Highlight terms
pp <- publish_gostplot(p, 
                       highlight_terms = c("GO:0048013", "REAC:R-HSA-3928663"), 
                       width = NA, 
                       height = NA, 
                       filename = NULL )

The gost results can also be visualized as a table.

# Table
publish_gosttable(
  gostres, 
  highlight_terms = gostres$result[c(1:2,10,100:102,120,124,125),],
  use_colors = TRUE, 
  show_columns = c("source", 
                   "term_name", 
                   "term_size", 
                   "intersection_size"),
  filename = NULL)

The same functions work also in case of multiquery results showing multiple Manhattan plots on top of each other and a multiple result table.

# Multiquery plot
gostplot(multi_gostres2, capped = TRUE, interactive = TRUE)

# Multiquery table
publish_gosttable(
  multi_gostres1, 
  highlight_terms = multi_gostres1$result[c(1, 24, 82, 176, 204, 234),],
  use_colors = TRUE, 
  show_columns = c("source", "term_name", "term_size"),
  filename = NULL)

In addition to the available GO, KEGG, etc data sources, users can upload their own custom data source using the Gene Matrix Transposed file format (GMT). The users can compose the files themselves or use pre-compiled gene sets from available dedicated websites like Molecular Signatures Database (MSigDB), etc.

Other tools

GProfiler2 also has a tool to convert gene IDs between databases, etc. It is called g:Convert.

# Gene identifier conversion with gconvert
gconvert(
  query = c("REAC:R-HSA-3928664", "rs17396340", "NLRP1"), 
  organism = "hsapiens", 
  target = "ENSG",
  mthreshold = Inf, 
  filter_na = TRUE)

   input_number              input target_number          target    name                                                                       description                           namespace
1             1 REAC:R-HSA-3928664           1.1 ENSG00000010810     FYN FYN proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:4037]                                REAC
2             1 REAC:R-HSA-3928664          1.10 ENSG00000133216   EPHB2                                EPH receptor B2 [Source:HGNC Symbol;Acc:HGNC:3393]                                REAC
3             1 REAC:R-HSA-3928664          1.11 ENSG00000136238    RAC1                      Rac family small GTPase 1 [Source:HGNC Symbol;Acc:HGNC:9801]                                REAC
4             1 REAC:R-HSA-3928664          1.12 ENSG00000137575   SDCBP                      syndecan binding protein [Source:HGNC Symbol;Acc:HGNC:10662]                                REAC
5             1 REAC:R-HSA-3928664          1.13 ENSG00000149269    PAK1                  p21 (RAC1) activated kinase 1 [Source:HGNC Symbol;Acc:HGNC:8590]                                REAC
6             1 REAC:R-HSA-3928664          1.14 ENSG00000154928   EPHB1                                EPH receptor B1 [Source:HGNC Symbol;Acc:HGNC:3392]                                REAC
7             1 REAC:R-HSA-3928664          1.15 ENSG00000180370    PAK2                  p21 (RAC1) activated kinase 2 [Source:HGNC Symbol;Acc:HGNC:8591]                                REAC
8             1 REAC:R-HSA-3928664          1.16 ENSG00000182580   EPHB3                                EPH receptor B3 [Source:HGNC Symbol;Acc:HGNC:3394]                                REAC
9             1 REAC:R-HSA-3928664          1.17 ENSG00000196411   EPHB4                                EPH receptor B4 [Source:HGNC Symbol;Acc:HGNC:3395]                                REAC
10            1 REAC:R-HSA-3928664           1.2 ENSG00000071051    NCK2                          NCK adaptor protein 2 [Source:HGNC Symbol;Acc:HGNC:7665]                                REAC
11            1 REAC:R-HSA-3928664           1.3 ENSG00000077264    PAK3                  p21 (RAC1) activated kinase 3 [Source:HGNC Symbol;Acc:HGNC:8592]                                REAC
12            1 REAC:R-HSA-3928664           1.4 ENSG00000090776   EFNB1                                      ephrin B1 [Source:HGNC Symbol;Acc:HGNC:3226]                                REAC
13            1 REAC:R-HSA-3928664           1.5 ENSG00000101608  MYL12A                        myosin light chain 12A [Source:HGNC Symbol;Acc:HGNC:16701]                                REAC
14            1 REAC:R-HSA-3928664           1.6 ENSG00000102606 ARHGEF7      Rho guanine nucleotide exchange factor 7 [Source:HGNC Symbol;Acc:HGNC:15607]                                REAC
15            1 REAC:R-HSA-3928664           1.7 ENSG00000108262    GIT1                                   GIT ArfGAP 1 [Source:HGNC Symbol;Acc:HGNC:4272]                                REAC
16            1 REAC:R-HSA-3928664           1.8 ENSG00000108947   EFNB3                                      ephrin B3 [Source:HGNC Symbol;Acc:HGNC:3228]                                REAC
17            1 REAC:R-HSA-3928664           1.9 ENSG00000125266   EFNB2                                      ephrin B2 [Source:HGNC Symbol;Acc:HGNC:3227]                                REAC
18            2         rs17396340           2.1 ENSG00000054523   KIF1B                      kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]                                    
19            3              NLRP1           3.1 ENSG00000091592   NLRP1          NLR family pyrin domain containing 1 [Source:HGNC Symbol;Acc:HGNC:14374] ENTREZGENE,HGNC,UNIPROT_GN,WIKIGENE

GProfiler2 also has a tool to map gene IDs between species. It is called g:Orth.

# Mapping homologous genes across related organisms with gorth
gorth(query = c("Klf4", "Sox2", "71950"), 
      source_organism = "mmusculus", 
      target_organism = "hsapiens", 
      mthreshold = Inf, 
      filter_na = TRUE,
      numeric_ns = "ENTREZGENE_ACC")

  input_number input         input_ensg ensg_number ortholog_name   ortholog_ensg                                                        description
1            1  Klf4 ENSMUSG00000003032       1.1.1          KLF4 ENSG00000136826           Kruppel like factor 4 [Source:HGNC Symbol;Acc:HGNC:6348]
2            2  Sox2 ENSMUSG00000074637       2.1.1          SOX2 ENSG00000181449 SRY-box transcription factor 2 [Source:HGNC Symbol;Acc:HGNC:11195]
3            3 71950 ENSMUSG00000012396       3.1.1         NANOG ENSG00000111704                 Nanog homeobox [Source:HGNC Symbol;Acc:HGNC:20857]
4            3 71950 ENSMUSG00000012396       3.1.2       NANOGP8 ENSG00000255192    Nanog homeobox retrogene P8 [Source:HGNC Symbol;Acc:HGNC:23106]

Other gProfiler versions

It is possible to use older (archived) versions or the beta version of gProfiler.

# Accessing archived version (gprofiler2 package is only compatible with versions e94_eg41_p11 and higher)
set_base_url("http://biit.cs.ut.ee/gprofiler_archive3/e95_eg42_p13")

# Accessing the beta release
set_base_url("http://biit.cs.ut.ee/gprofiler_beta")

# Check the current version
get_base_url()

[1] "http://biit.cs.ut.ee/gprofiler_beta"

Save your session info

For the sake of traceability, store the specifications of your R environment in the report, with the command sessionInfo(). This will indicate the version of R as well as of all the libraries used in this notebook.

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.2/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gprofiler2_0.2.0 knitr_1.30      

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.20         purrr_0.3.4       colorspace_2.0-0  vctrs_0.3.6       generics_0.1.0    htmltools_0.5.1.1 viridisLite_0.3.0 yaml_2.2.1        utf8_1.2.1        plotly_4.9.3      rlang_0.4.10      pillar_1.5.1      later_1.1.0.1     glue_1.4.2        DBI_1.1.1        
[17] lifecycle_1.0.0   stringr_1.4.0     munsell_0.5.0     gtable_0.3.0      htmlwidgets_1.5.3 evaluate_0.14     labeling_0.4.2    fastmap_1.1.0     httpuv_1.5.5      crosstalk_1.1.1   fansi_0.4.2       highr_0.8         Rcpp_1.0.6        xtable_1.8-4      scales_1.1.1      promises_1.2.0.1 
[33] debugme_1.1.0     jsonlite_1.7.2    farver_2.1.0      mime_0.10         gridExtra_2.3     ggplot2_3.3.3     digest_0.6.27     stringi_1.5.3     dplyr_1.0.5       shiny_1.5.0       grid_4.0.2        tools_4.0.2       bitops_1.0-6      magrittr_2.0.1    lazyeval_0.2.2    RCurl_1.98-1.3   
[49] tibble_3.1.0      crayon_1.4.1      tidyr_1.1.3       pkgconfig_2.0.3   ellipsis_0.3.1    data.table_1.14.0 assertthat_0.2.1  rmarkdown_2.5     httr_1.4.2        R6_2.5.0          compiler_4.0.2