Just_Annotate_My_proteins (JAMp)

Introduction

Whether working on a model or non-model species for biomedical, economical or evolutionary research, next-generation sequencing has enabled biologists to rapidly generate a reference sequence for downstream applications and hypotheses generation. With the exception of a limited number of species, functional annotation is conducted by in-silico experiments based on sequence similarity. Some groups are also enriching their data with expression studies. JAMp is a platform that allows biologists to reclaim the analysis of transcript reconstruction experiments by providing an automated process for generating functional annotations and a user-friendly overview of these in-silico experiments. The entire software is built so that novice command-line users can take a transcriptome assembly and generate websites like those found in our demo.

How does it work

Like most bioinformatic software, JAMp and its dependencies must first be installed before end-users can use it. The process is by itself simple but Linux system administration is often an onerous task so feel free to ask our user mailing list if you require help.

First, administrators install the software, including the annotation database and a website that drives the GUI. This needs to occur only once and they are very simple
- Administrators also have to install the HHblits software that we use for conducting the Hidden Markov searches. As a High Performance Computing (HPC) environment is required, your HPC administrators might be able to help (no compilation is required so that too ought to be simple - and we may be able to help)
Second, the assembly file is used to predict proteins with our quick-n-easy protein prediction software TransDecoder
Third, you conduct the in-silico experiment (currently HHblits is supported). These are conducted using Hidden-Markov searches against a SwissProt database (the universal archive of all proteins that have experimental functional information).
Optionally, you can align any RNA-Seq datasets you have against your reference and generate a number of informative graphs with our software DEW.
Once all the in-silico experiments are complete, a script then stores the output and relevant metadata (e.g. name of species) into the JAMp database. This database can drive multiple websites/datasets, each with their own user-authentication criteria.
Finally, the users (and the anyone with access to the website, such as paper reviewers and readers) use their browser to explore the functional annotations using a powerful yet intuitive GUI.

Too easy!

How does it look?

A core element of JAMp is its rich Graphical User Interface (GUI). Such interfaces are important not only because they allow biologists to explore their data without using the command line and spreadsheets but because it provide a vastly more efficient human-machine interface (i.e. yes, many expert bioinformaticians also use GUIs to conduct analyses!). See this demo for an example of how your community’s new website could look like.

Controlled Vocabularies

We use a number of control vocabularies (and ontologies) to organize the way we present functional annotation knowledge. There is no limit to what kind of vocabularies JAMp can support but we are currently using the Gene Ontology. Enzyme Classification (currently primarily for bacteria), EggNOG and Kyoto Encyclopedia of Genes and Genomes (KEGG).

Users must understand a bit about the process of electronic inference. We have an unknown protein and we wish to infer what it may do or what process it may be involved in. We don’t have any functional data so we look for (using a software, i.e. 'in-silico'), a similar known protein that has experimental evidence. We then infer that our protein has the same function. JAMp ensures that only known proteins with real experimental evidence are used in this inference, in other words, we are being careful not to do a double inference: assigning a function to our unknown protein because it is similar to another protein that has no experimental data but is similar to one that does. In theory this reduces the sensitivity of our search but the way we populate the database and identify our similar known proteins is not by BLAST but by a Hidden Markov search which is far more sensitive (see below).

High-Performance Computing

Overall, annotating with JAMp is significantly faster than annotating with the traditional BLASTx approach. This is especially true in the NGS era. In the past, transcriptomic projects were composed of a few hundred reliable genes and therefore tasks based on BLAST (e.g. BLAST2GO), were commonplace. In the RNAseq era, it is commonplace that dozens of thousands of high quality transcripts are reconstructed. Further, full genome sequencing (and gene model reconstruction) is now available for hundrends of species. Further, the so called non-model species are often significantly divergent from Arabidopsis, Fruitfly, Human/Mouse, Yeast etc that a model-based, recursive yet time-consuming searches are required (such PSI-BLAST or an HMM search).

How dow we address this problem? First, JAMp uses HHblits, a HM model-based search software that searches your protein against Uniprot, builds an alignment and a model and then searches again. Importantly, thanks to the ffindex software, we offer an MPI approach to running these in-silico experiments. Sadly, because of the cost of creating a reliable infrastructure, we currently cannot offer this is as a service to public but a number of institutes host bioinformatics enabled HPC environments. Some, such as DIAG, the Data Intensive Academic Grid, are freely available to any life-science researcher. We are interested in being to deploy this on an Amazon cloud (I guess that a single analysis would be about $300) but we do not currently have the manpower to accomplish that.

What uses cases does JAMp accomodate?

Basically, Just_Annotate_My_proteins does exactly what it reads on the tin! You give it a protein FASTA file and it annotates it with functional terms. The terms are derived from the literature and the biocuration teams working on linking known proteins with these terms. The use cases therefore are not restricted but we routinely use it for annotating:

Transcriptome experiments that produce a reference sequence (an assembly)

Obtaining JAMg

Freely distributed from here. Please subscribe to the user mailing list.

It’s not published. We really appreciate any comments and feedback you can provide, so please do no matter how minor you think they are (e.g. typos in this document). Any JAMp websites you build can be released under any license you want decides but you must keep the Author & license copyright notice (and add your copyright at the bottom).

What’s next?

We have a number of new features in mind but we haven’t yet sought any funding. If you want to sponsor or enhance JAMp with your efforts, let us know.

Example new features

Theming and language support: because not every community is the same!
Performance improvements: we haven’t done any work on improving performance and there are some clear low hanging fruits
Better and more powerful graphs: our graph user-interface can use improvements including more user-interactions.
More data mining: currently only ontology functional annotations are searchable. Some user-interface work can vastly improve user experience by e.g. allowing users to search for genes with particular expression profiles or network memberships.
More data types: we have in-silico annotation, networks and gene expression experiments covered but we still lack phylogenies, wet-lab experiments (e.g. knock-outs) and other functional experiments.
More use-cases: because there is no reason why the Tree of Life can’t have a JAMp!

Authors & license

Alexie Papanicolaou and Temi Varghese

CSIRO Ecosystem Sciences
alexie.papanicolaou@csiro.au

This software and Website is Copyright CSIRO 2013-2014. They are provided "as is" without warranty of any kind.
The software is released under a modified Apache License v.2. You can find the terms and conditions in the software distribution.
Demonstration JAMp websites are released under a Creative Commons Attribution-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-sa/4.0.

Website copyright

This website is released under a Creative Commons Attribution-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-sa/4.0.