This tutorial aims at showing how to use the gProfiler2
R package. It is more or less a Rmd version of the vignette. You can refer to the vignette and reference manual (links below) for more details.
Here is an example in Homo sapiens with various types of gene identifiers, a SNP ID, chromosomal intervals and a term ID from the Gene Ontology (GO). Parameters are described in the vignette. The result is a named list where “result” is a data.frame
with the enrichment analysis results and “meta” contains a named list with all the metadata for the query.
## Example of gene list functional enrichment analysis with gost
<- gost(
gostres query = c("X:1000:1000000",
"rs17396340",
"GO:0005005",
"ENSG00000156103",
"NLRP1"),
organism = "hsapiens",
ordered_query = FALSE,
multi_query = FALSE,
significant = TRUE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = FALSE,
user_threshold = 0.05,
correction_method = "g_SCS",
domain_scope = "annotated",
custom_bg = NULL,
numeric_ns = "",
sources = NULL,
as_short_link = FALSE)
names(gostres)
[1] "result" "meta"
summary(gostres)
Length Class Mode
result 14 data.frame list
meta 5 -none- list
names(gostres$meta)
[1] "query_metadata" "result_metadata" "genes_metadata" "timestamp" "version"
# head(gostres$result)
kable(head(gostres$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))
query | significant | p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name | effective_domain_size | source_order | parents |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
query_1 | TRUE | 3.000000e-30 | 88 | 22 | 16 | 0.727 | 0.182 | GO:0048013 | GO:BP | ephrin receptor signaling pathway | 18092 | 14554 | GO:0007169 |
query_1 | TRUE | 1.077092e-21 | 285 | 22 | 16 | 0.727 | 0.056 | GO:0007411 | GO:BP | axon guidance | 18092 | 3305 | GO:0007409, GO:0097485 |
query_1 | TRUE | 1.140559e-21 | 286 | 22 | 16 | 0.727 | 0.056 | GO:0097485 | GO:BP | neuron projection guidance | 18092 | 21970 | GO:0006928, GO:0006935, GO:0048812 |
query_1 | TRUE | 6.823529e-18 | 489 | 22 | 16 | 0.727 | 0.033 | GO:0007409 | GO:BP | axonogenesis | 18092 | 3304 | GO:0048667, GO:0048812, GO:0061564 |
query_1 | TRUE | 2.809503e-17 | 534 | 22 | 16 | 0.727 | 0.030 | GO:0061564 | GO:BP | axon development | 18092 | 18331 | GO:0031175 |
query_1 | TRUE | 2.846480e-16 | 617 | 22 | 16 | 0.727 | 0.026 | GO:0048667 | GO:BP | cell morphogenesis involved in neuron differentiation | 18092 | 15031 | GO:0000904, GO:0048666 |
With the parameter evcodes = TRUE
, the result will include the evidence codes. In addition, a column “intersection” will appear showing the input gene IDs that intersect with the corresponding functional term. Note that this parameter can decrease the performance and make the query slower.
# Example with intersection (can be time consuming)
<- gost(
gostres2 query = c("X:1000:1000000",
"rs17396340",
"GO:0005005",
"ENSG00000156103",
"NLRP1"),
organism = "hsapiens",
ordered_query = FALSE,
multi_query = FALSE,
significant = TRUE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = TRUE,
user_threshold = 0.05,
correction_method = "g_SCS",
domain_scope = "annotated",
custom_bg = NULL,
numeric_ns = "",
sources = NULL)
# head(gostres2$result)
kable(head(gostres2$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0, 0, 0))
query | significant | p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name | effective_domain_size | source_order | parents | evidence_codes | intersection |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
query_1 | TRUE | 3.000000e-30 | 88 | 22 | 16 | 0.727 | 0.182 | GO:0048013 | GO:BP | ephrin receptor signaling pathway | 18092 | 14554 | GO:0007169 | IDA TAS IEA,ISS TAS IEA,TAS IEA,IDA IBA TAS,IGI TAS IEA,ISS TAS IEA,IDA TAS IEA,IDA TAS IEA,IGI ISS IBA TAS IEA,ISS TAS IEA,TAS,IDA TAS IEA,IDA TAS,TAS IEA,IDA TAS IEA,IBA TAS IEA | ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364 |
query_1 | TRUE | 1.077092e-21 | 285 | 22 | 16 | 0.727 | 0.056 | GO:0007411 | GO:BP | axon guidance | 18092 | 3305 | GO:0007409, GO:0097485 | IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA | ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364 |
query_1 | TRUE | 1.140559e-21 | 286 | 22 | 16 | 0.727 | 0.056 | GO:0097485 | GO:BP | neuron projection guidance | 18092 | 21970 | GO:0006928, GO:0006935, GO:0048812 | IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA | ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364 |
query_1 | TRUE | 6.823529e-18 | 489 | 22 | 16 | 0.727 | 0.033 | GO:0007409 | GO:BP | axonogenesis | 18092 | 3304 | GO:0048667, GO:0048812, GO:0061564 | IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA | ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364 |
query_1 | TRUE | 2.809503e-17 | 534 | 22 | 16 | 0.727 | 0.030 | GO:0061564 | GO:BP | axon development | 18092 | 18331 | GO:0031175 | ISS IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA | ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364 |
query_1 | TRUE | 2.846480e-16 | 617 | 22 | 16 | 0.727 | 0.026 | GO:0048667 | GO:BP | cell morphogenesis involved in neuron differentiation | 18092 | 15031 | GO:0000904, GO:0048666 | IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA | ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364 |
The query results can also be gathered into a short-link to the g:Profiler
web tool.
# Get a web link
<- gost(
gostres_link query = c("X:1000:1000000",
"rs17396340",
"GO:0005005",
"ENSG00000156103",
"NLRP1"),
as_short_link = TRUE)
gostres_link
[1] "https://biit.cs.ut.ee/gplink/l/2-zGBtraRm"
The function gost
also allows to perform enrichment on multiple input gene lists. If the parameter multiquery = TRUE
is used, then the results from all of the input queries are grouped according to term IDs for better comparison.
# Multiple gene lists
<- gost(
multi_gostres1 query = list("chromX" = c("X:1000:1000000",
"rs17396340",
"GO:0005005",
"ENSG00000156103",
"NLRP1"),
"chromY" = c("Y:1:10000000",
"rs17396340",
"GO:0005005",
"ENSG00000156103",
"NLRP1")),
multi_query = FALSE)
# head(multi_gostres1$result)
kable(head(multi_gostres1$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))
query | significant | p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name | effective_domain_size | source_order | parents |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chromX | TRUE | 3.000000e-30 | 88 | 22 | 16 | 0.727 | 0.182 | GO:0048013 | GO:BP | ephrin receptor signaling pathway | 18092 | 14554 | GO:0007169 |
chromX | TRUE | 1.077092e-21 | 285 | 22 | 16 | 0.727 | 0.056 | GO:0007411 | GO:BP | axon guidance | 18092 | 3305 | GO:0007409, GO:0097485 |
chromX | TRUE | 1.140559e-21 | 286 | 22 | 16 | 0.727 | 0.056 | GO:0097485 | GO:BP | neuron projection guidance | 18092 | 21970 | GO:0006928, GO:0006935, GO:0048812 |
chromX | TRUE | 6.823529e-18 | 489 | 22 | 16 | 0.727 | 0.033 | GO:0007409 | GO:BP | axonogenesis | 18092 | 3304 | GO:0048667, GO:0048812, GO:0061564 |
chromX | TRUE | 2.809503e-17 | 534 | 22 | 16 | 0.727 | 0.030 | GO:0061564 | GO:BP | axon development | 18092 | 18331 | GO:0031175 |
chromX | TRUE | 2.846480e-16 | 617 | 22 | 16 | 0.727 | 0.026 | GO:0048667 | GO:BP | cell morphogenesis involved in neuron differentiation | 18092 | 15031 | GO:0000904, GO:0048666 |
<- gost(
multi_gostres2 query = list("chromX" = c("X:1000:1000000",
"rs17396340",
"GO:0005005",
"ENSG00000156103", "NLRP1"),
"chromY" = c("Y:1:10000000",
"rs17396340",
"GO:0005005",
"ENSG00000156103", "NLRP1")),
multi_query = TRUE)
# head(multi_gostres2$result)
kable(head(multi_gostres2$result), digits = c(0, 0, 30, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0))
term_id | p_values | significant | term_size | query_sizes | intersection_sizes | source | term_name | effective_domain_size | source_order | parents |
---|---|---|---|---|---|---|---|---|---|---|
GO:0005005 | 7.162616e-48, 4.090780e-44 | TRUE, TRUE | 16 | 23, 33 | 16, 16 | GO:MF | transmembrane-ephrin receptor activity | 18694 | 1548 | GO:0005003 |
GO:0005003 | 6.933232e-45, 3.953788e-41 | TRUE, TRUE | 19 | 23, 33 | 16, 16 | GO:MF | ephrin receptor activity | 18694 | 1546 | GO:0004714 |
GO:0004714 | 2.581144e-33, 1.439607e-29 | TRUE, TRUE | 63 | 23, 33 | 16, 16 | GO:MF | transmembrane receptor protein tyrosine kinase activity | 18694 | 1295 | GO:0004713, GO:0019199 |
REAC:R-HSA-3928665 | 4.164481e-33, 1.836785e-32 | TRUE, TRUE | 49 | 20, 21 | 16, 16 | REAC | EPH-ephrin mediated repulsion of cells | 10531 | 747 | REAC:R-HSA-2682334 |
GO:0019199 | 2.920627e-31, 1.613380e-27 | TRUE, TRUE | 82 | 23, 33 | 16, 16 | GO:MF | transmembrane receptor protein kinase activity | 18694 | 4539 | GO:0004672, GO:0004888 |
GO:0048013 | 2.813706e-30, 8.827603e-26 | TRUE, TRUE | 88 | 22, 34 | 16, 16 | GO:BP | ephrin receptor signaling pathway | 18092 | 14554 | GO:0007169 |
The enrichment results can be visualized with a Manhattan-like-plot.
# Visualization
gostplot(gostres, capped = TRUE, interactive = TRUE)
If interactive = FALSE
, then the function returns a static ggplot object.
The function publish_gostplot
takes the static plot object as an input and enables to highlight a selection of interesting terms from the results with numbers and table of results.
# Static ggplot object
<- gostplot(gostres, capped = FALSE, interactive = FALSE)
p
# Highlight terms
<- publish_gostplot(p,
pp highlight_terms = c("GO:0048013", "REAC:R-HSA-3928663"),
width = NA,
height = NA,
filename = NULL )
The gost results can also be visualized as a table.
# Table
publish_gosttable(
gostres, highlight_terms = gostres$result[c(1:2,10,100:102,120,124,125),],
use_colors = TRUE,
show_columns = c("source",
"term_name",
"term_size",
"intersection_size"),
filename = NULL)
The same functions work also in case of multiquery results showing multiple Manhattan plots on top of each other and a multiple result table.
# Multiquery plot
gostplot(multi_gostres2, capped = TRUE, interactive = TRUE)
# Multiquery table
publish_gosttable(
multi_gostres1, highlight_terms = multi_gostres1$result[c(1, 24, 82, 176, 204, 234),],
use_colors = TRUE,
show_columns = c("source", "term_name", "term_size"),
filename = NULL)
In addition to the available GO, KEGG, etc data sources, users can upload their own custom data source using the Gene Matrix Transposed file format (GMT). The users can compose the files themselves or use pre-compiled gene sets from available dedicated websites like Molecular Signatures Database (MSigDB), etc.
GProfiler2
also has a tool to convert gene IDs between databases, etc. It is called g:Convert.
# Gene identifier conversion with gconvert
gconvert(
query = c("REAC:R-HSA-3928664", "rs17396340", "NLRP1"),
organism = "hsapiens",
target = "ENSG",
mthreshold = Inf,
filter_na = TRUE)
input_number input target_number target name description namespace
1 1 REAC:R-HSA-3928664 1.1 ENSG00000010810 FYN FYN proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:4037] REAC
2 1 REAC:R-HSA-3928664 1.10 ENSG00000133216 EPHB2 EPH receptor B2 [Source:HGNC Symbol;Acc:HGNC:3393] REAC
3 1 REAC:R-HSA-3928664 1.11 ENSG00000136238 RAC1 Rac family small GTPase 1 [Source:HGNC Symbol;Acc:HGNC:9801] REAC
4 1 REAC:R-HSA-3928664 1.12 ENSG00000137575 SDCBP syndecan binding protein [Source:HGNC Symbol;Acc:HGNC:10662] REAC
5 1 REAC:R-HSA-3928664 1.13 ENSG00000149269 PAK1 p21 (RAC1) activated kinase 1 [Source:HGNC Symbol;Acc:HGNC:8590] REAC
6 1 REAC:R-HSA-3928664 1.14 ENSG00000154928 EPHB1 EPH receptor B1 [Source:HGNC Symbol;Acc:HGNC:3392] REAC
7 1 REAC:R-HSA-3928664 1.15 ENSG00000180370 PAK2 p21 (RAC1) activated kinase 2 [Source:HGNC Symbol;Acc:HGNC:8591] REAC
8 1 REAC:R-HSA-3928664 1.16 ENSG00000182580 EPHB3 EPH receptor B3 [Source:HGNC Symbol;Acc:HGNC:3394] REAC
9 1 REAC:R-HSA-3928664 1.17 ENSG00000196411 EPHB4 EPH receptor B4 [Source:HGNC Symbol;Acc:HGNC:3395] REAC
10 1 REAC:R-HSA-3928664 1.2 ENSG00000071051 NCK2 NCK adaptor protein 2 [Source:HGNC Symbol;Acc:HGNC:7665] REAC
11 1 REAC:R-HSA-3928664 1.3 ENSG00000077264 PAK3 p21 (RAC1) activated kinase 3 [Source:HGNC Symbol;Acc:HGNC:8592] REAC
12 1 REAC:R-HSA-3928664 1.4 ENSG00000090776 EFNB1 ephrin B1 [Source:HGNC Symbol;Acc:HGNC:3226] REAC
13 1 REAC:R-HSA-3928664 1.5 ENSG00000101608 MYL12A myosin light chain 12A [Source:HGNC Symbol;Acc:HGNC:16701] REAC
14 1 REAC:R-HSA-3928664 1.6 ENSG00000102606 ARHGEF7 Rho guanine nucleotide exchange factor 7 [Source:HGNC Symbol;Acc:HGNC:15607] REAC
15 1 REAC:R-HSA-3928664 1.7 ENSG00000108262 GIT1 GIT ArfGAP 1 [Source:HGNC Symbol;Acc:HGNC:4272] REAC
16 1 REAC:R-HSA-3928664 1.8 ENSG00000108947 EFNB3 ephrin B3 [Source:HGNC Symbol;Acc:HGNC:3228] REAC
17 1 REAC:R-HSA-3928664 1.9 ENSG00000125266 EFNB2 ephrin B2 [Source:HGNC Symbol;Acc:HGNC:3227] REAC
18 2 rs17396340 2.1 ENSG00000054523 KIF1B kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]
19 3 NLRP1 3.1 ENSG00000091592 NLRP1 NLR family pyrin domain containing 1 [Source:HGNC Symbol;Acc:HGNC:14374] ENTREZGENE,HGNC,UNIPROT_GN,WIKIGENE
GProfiler2
also has a tool to map gene IDs between species. It is called g:Orth
.
# Mapping homologous genes across related organisms with gorth
gorth(query = c("Klf4", "Sox2", "71950"),
source_organism = "mmusculus",
target_organism = "hsapiens",
mthreshold = Inf,
filter_na = TRUE,
numeric_ns = "ENTREZGENE_ACC")
input_number input input_ensg ensg_number ortholog_name ortholog_ensg description
1 1 Klf4 ENSMUSG00000003032 1.1.1 KLF4 ENSG00000136826 Kruppel like factor 4 [Source:HGNC Symbol;Acc:HGNC:6348]
2 2 Sox2 ENSMUSG00000074637 2.1.1 SOX2 ENSG00000181449 SRY-box transcription factor 2 [Source:HGNC Symbol;Acc:HGNC:11195]
3 3 71950 ENSMUSG00000012396 3.1.1 NANOG ENSG00000111704 Nanog homeobox [Source:HGNC Symbol;Acc:HGNC:20857]
4 3 71950 ENSMUSG00000012396 3.1.2 NANOGP8 ENSG00000255192 Nanog homeobox retrogene P8 [Source:HGNC Symbol;Acc:HGNC:23106]
It is possible to use older (archived) versions or the beta version of gProfiler
.
# Accessing archived version (gprofiler2 package is only compatible with versions e94_eg41_p11 and higher)
set_base_url("http://biit.cs.ut.ee/gprofiler_archive3/e95_eg42_p13")
# Accessing the beta release
set_base_url("http://biit.cs.ut.ee/gprofiler_beta")
# Check the current version
get_base_url()
[1] "http://biit.cs.ut.ee/gprofiler_beta"
For the sake of traceability, store the specifications of your R environment in the report, with the command sessionInfo()
. This will indicate the version of R as well as of all the libraries used in this notebook.
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.2/lib/libopenblasp-r0.3.10.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gprofiler2_0.2.0 knitr_1.30
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 xfun_0.20 purrr_0.3.4 colorspace_2.0-0 vctrs_0.3.6 generics_0.1.0 htmltools_0.5.1.1 viridisLite_0.3.0 yaml_2.2.1 utf8_1.2.1 plotly_4.9.3 rlang_0.4.10 pillar_1.5.1 later_1.1.0.1 glue_1.4.2 DBI_1.1.1
[17] lifecycle_1.0.0 stringr_1.4.0 munsell_0.5.0 gtable_0.3.0 htmlwidgets_1.5.3 evaluate_0.14 labeling_0.4.2 fastmap_1.1.0 httpuv_1.5.5 crosstalk_1.1.1 fansi_0.4.2 highr_0.8 Rcpp_1.0.6 xtable_1.8-4 scales_1.1.1 promises_1.2.0.1
[33] debugme_1.1.0 jsonlite_1.7.2 farver_2.1.0 mime_0.10 gridExtra_2.3 ggplot2_3.3.3 digest_0.6.27 stringi_1.5.3 dplyr_1.0.5 shiny_1.5.0 grid_4.0.2 tools_4.0.2 bitops_1.0-6 magrittr_2.0.1 lazyeval_0.2.2 RCurl_1.98-1.3
[49] tibble_3.1.0 crayon_1.4.1 tidyr_1.1.3 pkgconfig_2.0.3 ellipsis_0.3.1 data.table_1.14.0 assertthat_0.2.1 rmarkdown_2.5 httr_1.4.2 R6_2.5.0 compiler_4.0.2