CALANGO

Comparative AnaLysis with ANnotation-based Genomic cOmponentes
Now published in Patterns

CALANGO logo. Drawn by Brazilian artist Berze - https://www.facebook.com/berzearte


DESCRIPTION

CALANGO is a first-principles, phylogeny-aware comparative genomics package that searches for annotation terms (e.g., Pfam IDs, GO terms or superfamilies) associated with a quantitative/rank variable (e.g., number of cell types, genome size or density of specific genomic elements). The annotation terms are formally described in a dictionary-like structure and are used to annotate genomic components.

Our software has been freely inspired by (and explicitly modeled to take into account information from) ideas and tools as diverse as comparative phylogenetics, genome annotation, gene enrichment analysis, and data visualization and interactivity.

HOW TO INSTALL

The latest version of CALANGO can be installed directly from the repository using devtools::install_github():

devtools::install_github("LABpackages/CALANGO")

Alternatively, you should soon be able to install the latest release from CRAN with:

install.packages("CALANGO")

In either case, please make sure that you are running a recent version of R (at least 3.6.1) and that all installed packages are up to date.
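As a quick pre-flight check, the version requirement above can be verified from within R before installing. This is a minimal sketch using only base R; the helper name check_r_for_calango() is hypothetical and is not part of CALANGO.

```r
# Hypothetical helper (not part of CALANGO): compares the running R version
# against the minimum version stated in this README.
check_r_for_calango <- function(min_version = "3.6.1") {
  current <- getRversion()
  ok <- current >= min_version
  if (!ok) {
    warning("R ", as.character(current),
            " is older than the required ", min_version)
  }
  ok
}

# TRUE when the current session satisfies the minimum version.
r_ok <- check_r_for_calango()
```

If r_ok is FALSE, updating R before installing CALANGO is probably the safest route.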

INSTALLING AND UPDATING DEPENDENCIES

CALANGO depends on some packages from Bioconductor. This requires an extra installation step before CALANGO can be used. Just run:

library(CALANGO)
install_bioc_dependencies(which = "all")

to install / update all dependencies (packages listed as CALANGO’s imports and suggests) to their latest versions.


HOW TO USE - OVERVIEW

To run CALANGO you need two things:

  1. a set of data
  2. an input list defining the path to that data and what is to be done.

Retrieving data

A set of example data folders and files can be accessed directly from the package using the function retrieve_data_files(). For instance, the call:

CALANGO::retrieve_data_files("./data")

will create a folder called data in the current working directory, and download the sample files into subfolders within it. These subfolders will contain:

The downloaded data will also contain files that describe the input list mentioned above, under ./data/parameters/. These full examples represent all input files required to locally reproduce several types of analyses.


Once these two pieces are in place, the CALANGO pipeline can be run by invoking the main function of the package. For instance:

output <- run_CALANGO(defs = "./data/parameters/parameters_domain2GO_freq.txt", cores = 2)

This call will return an enriched CALANGO list as the output, and generate a dynamic HTML5 output (plus several tab-separated value (tsv) files) in the directory provided as output.dir.

PREPARING YOUR INPUT FILES

CALANGO requires the following files (please check the examples in ./data/parameters/ if in doubt about file specifications):


genome annotation file

A text file for each species describing its set of biologically meaningful genomic elements and their respective annotation IDs (e.g., non-redundant proteomes annotated to GO terms, or non-redundant protein domains annotated to protein domain IDs). An example of such a file, where gene products are annotated using Gene Ontology (GO) terms and KEGG Orthology (KO) identifiers, would be as follows:

Entry   GO_IDs   KEGG_Orthology_ID
Q7L8J4  GO:0017124;GO:0005737;GO:0035556;GO:1904030;GO:0061099;GO:0004860
Q8WW27  GO:0016814;GO:0006397;GO:0008270  K18773
Q96P50  GO:0005096;GO:0046872   K12489

And is specified as:

In a more abstract representation, files representing genome annotations for a single annotation schema would have two columns and the following general structure:

genomic_element_name/ID_1     annotation_ID_1;(...);annotation_ID_N
genomic_element_name/ID_2     annotation_ID_12
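To make the two-column layout concrete, the sketch below reads such a file with base R and splits the semicolon-separated annotation IDs. It assumes a tab-separated file without a header row; the column names element and annotation_ids, and the example rows, are illustrative only and are not prescribed by CALANGO.

```r
# Example content mirroring the two-column layout above (tab-separated).
annotation_lines <- c(
  "genomic_element_1\tannotation_ID_1;annotation_ID_2;annotation_ID_3",
  "genomic_element_2\tannotation_ID_2"
)

# Read both columns as character. For a real file, replace `text = ...`
# with the path to the annotation file.
annotation <- read.delim(
  text = paste(annotation_lines, collapse = "\n"),
  header = FALSE,
  col.names = c("element", "annotation_ids"),
  stringsAsFactors = FALSE
)

# Split each semicolon-separated field into one character vector per element.
ids_by_element <- setNames(
  strsplit(annotation$annotation_ids, ";", fixed = TRUE),
  annotation$element
)

# Count how many genomic elements carry each annotation ID.
id_counts <- table(unlist(ids_by_element))
```

A quick look at id_counts is a cheap way to spot malformed rows (for instance, a wrong separator would show up as one very long "ID").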

phylogenetic tree file

Newick or PHYLIP format, containing at least:

A tree in Newick format (shown here without branch lengths) would be:

(genome_ID_1,(genome_ID_2,genome_ID_3));
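Since the example tree uses the same genome IDs that appear in the other input files, it can be useful to compare the tip labels against those IDs before running an analysis. The sketch below does this with base R string handling only. It assumes a simple Newick string without internal node labels or quoted names; for real trees, a dedicated parser is the more robust choice. The vector metadata_ids is a hypothetical stand-in for the IDs used in your own files.

```r
# A Newick string like the one above.
newick <- "(genome_ID_1,(genome_ID_2,genome_ID_3));"

# Drop branch lengths (":<number>") if present, then the structural characters.
no_lengths <- gsub(":[0-9.eE+-]+", "", newick)
tip_labels <- trimws(
  strsplit(gsub("[();]", "", no_lengths), ",", fixed = TRUE)[[1]]
)

# Hypothetical genome IDs, e.g. taken from the metadata file.
metadata_ids <- c("genome_ID_1", "genome_ID_2", "genome_ID_3")

# IDs present on one side but not on the other.
missing_from_tree <- setdiff(metadata_ids, tip_labels)
missing_from_metadata <- setdiff(tip_labels, metadata_ids)
```

Both difference vectors being empty indicates that the tree and the ID list refer to the same set of genomes.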

A metadata file

Containing species-specific information:

For the correlation analysis, the tabular format has the genome IDs in column 1, the variable used to rank genomes in column 2, and the normalizing factor in column 3. For example:

../projects/my_project/genome_ID_1  1.7  2537
../projects/my_project/genome_ID_2  1.2  10212
../projects/my_project/genome_ID_3  0.9  1534
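Before passing such a table to CALANGO, it may be worth loading it once to confirm that the columns parse as expected. The following is a rough sketch in base R, assuming a tab-separated file without a header row; the column names genome_id, rank_variable and normalizing_factor are hypothetical labels chosen for this example.

```r
# Example rows mirroring the three-column table above (tab-separated).
metadata_lines <- c(
  "../projects/my_project/genome_ID_1\t1.7\t2537",
  "../projects/my_project/genome_ID_2\t1.2\t10212",
  "../projects/my_project/genome_ID_3\t0.9\t1534"
)

# For a real file, replace `text = ...` with the path to the metadata file.
metadata <- read.delim(
  text = paste(metadata_lines, collapse = "\n"),
  header = FALSE,
  col.names = c("genome_id", "rank_variable", "normalizing_factor"),
  stringsAsFactors = FALSE
)

# Basic sanity checks: unique IDs, numeric columns, positive normalizing factor.
stopifnot(
  !anyDuplicated(metadata$genome_id),
  is.numeric(metadata$rank_variable),
  is.numeric(metadata$normalizing_factor),
  all(metadata$normalizing_factor > 0)
)

# Genomes ordered by the ranking variable, largest first.
ranked <- metadata[order(metadata$rank_variable, decreasing = TRUE), ]
```

If any check fails, stopifnot() reports which condition was violated, which usually points directly at the offending column.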

Metadata files are specified as follows:


A dictionary

A tab-delimited file linking annotation IDs to their biologically meaningful descriptions. Our software currently supports two dictionary types:

Annotation_ID     Annotation_definition
annotation_ID_1   All alpha proteins
annotation_ID_2   Globin-like
annotation_ID_3   Globin-like
annotation_ID_4   Truncated hemoglobin
(...)
annotation_ID_N   annotation_ID_description
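A dictionary in this two-column form maps naturally onto a named character vector in R. The sketch below, which uses only base R, loads the example rows and looks up descriptions by ID. It assumes the header row shown above is present in the file; the helper describe_ids() is hypothetical and not part of CALANGO, and its fallback of returning the ID itself for unknown entries is only an illustration.

```r
# Example rows mirroring the dictionary above (tab-separated, with header).
dictionary_lines <- c(
  "Annotation_ID\tAnnotation_definition",
  "annotation_ID_1\tAll alpha proteins",
  "annotation_ID_2\tGlobin-like",
  "annotation_ID_3\tGlobin-like",
  "annotation_ID_4\tTruncated hemoglobin"
)

# quote = "" keeps apostrophes or quotes inside descriptions intact.
dictionary <- read.delim(
  text = paste(dictionary_lines, collapse = "\n"),
  header = TRUE,
  quote = "",
  stringsAsFactors = FALSE
)

# Named character vector: names are annotation IDs, values are descriptions.
id_to_description <- setNames(
  dictionary$Annotation_definition,
  dictionary$Annotation_ID
)

# Hypothetical helper: returns the description for each ID, falling back to
# the ID itself when the dictionary has no entry for it.
describe_ids <- function(ids, lookup = id_to_description) {
  descriptions <- unname(lookup[ids])
  ifelse(is.na(descriptions), ids, descriptions)
}

described <- describe_ids(c("annotation_ID_4", "annotation_ID_99"))
```

Note that several IDs may share one description (as with annotation_ID_2 and annotation_ID_3 above), so the mapping is only unique in the ID-to-description direction.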

CALANGO can also treat each identifier as its own description, which removes the need to prepare a dictionary for an ontology that isn’t natively supported. To do so, do not specify any ontology file and set ontology = "other".

SETTING UP CALANGO PARAMETERS

CALANGO’s parameters are listed in the documentation of run_CALANGO(), as well as in the example files provided (./data/parameters/). They are, for the current version:

CALANGO OUTPUT

Live examples of CALANGO output HTML5 pages can be found on the examples page (https://labpackages.github.io/CALANGO/).

CALANGO produces as its main output a dynamic HTML5 website containing:

Please check our examples page at https://labpackages.github.io/CALANGO/ to explore the full output of CALANGO for a variety of examples. The required data to fully reproduce these examples can be obtained by using CALANGO::retrieve_data_files().


CALANGO MASTERMIND

OTHER DEVELOPERS

NON-CODING COLLABORATORS