era7 bioinformatics R&D group

Talks, Genomes, Reads and Annotations

On the 1st of June I attended the conference “Applied Bioinformatics & Public Health Microbiology”, held at the Wellcome Trust Conference Centre (Hinxton, UK).

The main purpose of the conference was to point out the importance of high-throughput technologies (mainly NGS) and bioinformatics in solving issues related to public health.

During the morning sessions of the second day (2nd of June), BGI announced the release of the sequencing data from 5 IonTorrent chips (see the announcement here). Just a few hours later, Nick Loman (@pathogenomenick) published a de novo assembly of the reads, made with MIRA, on his blog (see post here). A few hours after that, on the morning of the 3rd of June, we published the annotation of Nick’s assembly on our website. We annotated it with the BG7 pipeline, which we were presenting at the conference (see talk slides here).

We are now mining this automatic annotation data for proteins that may be involved in pathogenesis. The preliminary results look promising.

It has definitely been a really exciting experience. In the middle of a conference where we were trying to see how NGS could be applied to public health issues, we were able to get the genome of the bacterium responsible for this important outbreak. In less than 24 hours we had the reads, the assembly and the annotation. A good case study :)

Playing with Gephi, Bio4j and GO

It had been a while since I last had some fun with Gephi, so today I told myself: why not try visualizing the whole Gene Ontology and see what happens?

First of all I had to generate a file in gexf format containing all the terms and relationships of the ontology. For that I wrote a small program (GenerateGexfGo.java) which uses Bio4j to retrieve the terms and relationships, plus a couple of XML gexf wrapper classes from the GitHub project BioinfoXML.
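To give an idea of what such an export involves, here is a minimal Python sketch (not the actual GenerateGexfGo.java program, and not using Bio4j). It assumes the terms and relationships are already in memory; the function name `build_gexf`, the GEXF 1.2 namespace and the toy input data are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumption: GEXF 1.2 draft namespace; the original program may target another version.
GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(terms, relationships):
    """Return a GEXF document as a string.

    terms: dict mapping a GO term id to its name.
    relationships: list of (source_id, target_id, relationship_type) tuples.
    """
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        gexf, "graph", {"mode": "static", "defaultedgetype": "directed"}
    )

    # One <node> per ontology term.
    nodes = ET.SubElement(graph, "nodes")
    for term_id, name in terms.items():
        ET.SubElement(nodes, "node", {"id": term_id, "label": name})

    # One <edge> per relationship between two terms.
    edges = ET.SubElement(graph, "edges")
    for index, (source, target, rel_type) in enumerate(relationships):
        ET.SubElement(
            edges,
            "edge",
            {"id": str(index), "source": source, "target": target, "label": rel_type},
        )

    return ET.tostring(gexf, encoding="unicode")


if __name__ == "__main__":
    # Toy data for illustration only, not the real ontology structure.
    toy_terms = {"GO:0005575": "cellular_component", "GO:0016020": "membrane"}
    toy_relationships = [("GO:0016020", "GO:0005575", "is_a")]
    print(build_gexf(toy_terms, toy_relationships))
```

The resulting string can be saved with a .gexf extension and opened in Gephi; for the full ontology the same loop simply runs over many more terms.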

Once I had my gexf file (~17 MB) I tried opening it with Gephi on my laptop, with no success: Gephi froze forever when trying to import the file. After a quick search on Google I found out that the amount of memory available to Gephi is really easy to change: just open the file ‘etc/gephi07beta.conf’ and change the -Xmx value.
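If you prefer to script that change, the following Python sketch rewrites the -Xmx value in a piece of configuration text. The helper name `set_max_heap` and the sample line are hypothetical; the sample is not copied from a real Gephi configuration file, so check what your own file looks like before relying on it.

```python
import re


def set_max_heap(conf_text, new_size):
    """Replace every -Xmx<size> token in conf_text with -Xmx<new_size>."""
    return re.sub(r"-Xmx\d+[kKmMgG]?", "-Xmx" + new_size, conf_text)


if __name__ == "__main__":
    # Hypothetical configuration line, for illustration only.
    sample = 'default_options="--branding gephi -J-Xms64m -J-Xmx512m"'
    print(set_max_heap(sample, "2048m"))
```

Only the maximum heap setting is touched; other options such as -Xms are left as they are.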

With my file imported, I first applied the OpenOrd algorithm (which works well for large graphs) and, once the layout had an acceptable distribution, I ran some iterations of the Fruchterman Reingold algorithm for a better visualization. And this is what I got:

Color correspondence:

  • Green: Cellular component
  • Blue: Molecular function
  • Orange: Biological process

UPDATE: zoomable, independent visualizations of each ontology using the Gephi SeaDragon plugin.

Molecular function

Cellular component

Biological process

Here you can download the gexf file in case you want to experiment a bit with it.

Using Bio4j Go Tools to get the GO annotations of a set of Uniprot proteins

We have recently launched a first version of Bio4j Go Tools: an AIR application that lets the user perform different kinds of GO analysis using Bio4j as back-end. There is more info about the tool in this post on the Bio4j blog.

So far you can do the following:

  1. Get the GO annotations of a set of proteins. The annotations belonging to the three GO ontologies (Molecular function, Biological process and Cellular component) are provided separately
  2. “Perform” a GO Slim analysis

Using this tool for 1) is pretty easy. Just enter the Uniprot IDs (protein accessions) of the proteins of interest in the text field and select which kind of separator you are using (enter, tab, whitespace…). Then click on “Get results”, select the location where the result file will be saved and wait until the “File downloaded!” message appears.

New chart utilities are coming soon. For now, you can import the XML result file into Excel and create the charts there. The following are examples of the charts you can make. They represent the GO annotations of all the Helicobacter pylori proteins (11,934 proteins); only those GO terms that appear 20 times or more in the set of proteins are depicted.
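The “20 times or more” filter can also be applied outside Excel. Below is a minimal Python sketch that assumes the annotations have already been read into a dictionary of protein accession to GO term ids; it does not parse the tool's XML result file, whose layout is not described here. The function name `frequent_terms` is made up for this example, and it assumes each term is counted once per protein.

```python
from collections import Counter


def frequent_terms(annotations, min_count=20):
    """Return {GO term id: number of proteins} for terms seen in at least min_count proteins.

    annotations: dict mapping a protein accession to an iterable of GO term ids.
    """
    # set(terms) makes sure a term is counted only once per protein.
    counts = Counter(
        term for terms in annotations.values() for term in set(terms)
    )
    return {term: n for term, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Tiny made-up input; a lower threshold is used so that something passes the filter.
    toy = {
        "P1": ["GO:0016020", "GO:0005524"],
        "P2": ["GO:0016020"],
        "P3": ["GO:0005524", "GO:0003677"],
    }
    print(frequent_terms(toy, min_count=2))
```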

How to use this tool to perform GOSlim analysis in upcoming posts…

Bio4j: a graph-based DB containing most of the data available in Uniprot, Uniref and Gene Ontology

We have recently launched Bio4j, a graph based database that includes most of the data available in Uniprot, Uniref clusters and Gene Ontology. Quoting from the Bio4j wiki:

  > Bio4j provides a completely new and powerful framework for protein related information querying and management. Since it relies on a high-performance graph engine (Neo4j), data is stored in a way that semantically represents its own structure. On the contrary, traditional relational databases must flatten the data they represent into tables, creating “artificial” ids in order to connect the different tuples; which can in some cases eventually lead to domain models that have almost nothing to do with the actual structure of data.

Thanks to its graph structure, queries that could be time consuming or even impossible in relational DBs can be performed easily in Bio4j.

An example of this kind of query would be the following:

Suppose you are working on the membrane proteins of a non-model organism that unfortunately have very poor annotations. Searching Uniprot for your proteins of interest by organism name or taxonomy returns only a few unreviewed entries with little more than the sequence. The next step would probably be to use the Uniprot advanced search tool to get the proteins that are 50, 90 or 100% similar to yours (the Uniref clusters), and then check, protein by protein, which of them have annotations indicating that they are membrane proteins. From those you would select the ones that 1) have good enough annotations for your purpose and 2) are closely enough related to your protein for the functions of the well-annotated proteins to be transferred to it. This would be a manual, tedious, time-consuming and error-prone process.

This whole process (getting all the proteins of your organism, then the highly similar ones, and then only those that are in the membrane and have quality annotations) could be performed programmatically with a fairly simple graph traversal. All you would have to specify would be:

  • The level of similarity required between the proteins of your organism and the well-characterized ones
  • The criteria for considering that a protein has good enough annotations: whether the entry is reviewed, whether it has lots of cross references or functional data (Interpro, GO…)
  • The criteria for considering that a protein is in the membrane, for example that it has a GO ID indicating that it is a membrane protein, that its subcellular location is the membrane… or both…
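As a rough illustration of that traversal, here is a Python sketch over a small in-memory data set. It does not use the Bio4j API: the dictionary layout, the function name `find_annotated_relatives` and the sample accessions are all hypothetical. The three parameters in the list above map to `level`, the `reviewed`/`min_go_terms` check and the membrane GO term check; only the exact term GO:0016020 (membrane) is tested, and more specific membrane terms are ignored for simplicity.

```python
MEMBRANE_GO = "GO:0016020"  # GO term "membrane"


def find_annotated_relatives(proteins, organism, level, min_go_terms=2):
    """Map each protein of `organism` to its well-annotated membrane relatives.

    proteins: list of dicts with the keys "accession", "organism", "reviewed",
    "go_terms" (a set of GO ids) and "clusters" (similarity level -> cluster id).
    level: similarity level of the clusters to follow (for example 50, 90 or 100).
    """
    # Step 1: index every protein by the cluster it belongs to at this level.
    members = {}
    for protein in proteins:
        cluster_id = protein["clusters"].get(level)
        if cluster_id is not None:
            members.setdefault(cluster_id, []).append(protein)

    # Step 2: for each protein of the organism, walk to its cluster mates and
    # keep the ones that pass the annotation and membrane criteria.
    relatives = {}
    for protein in proteins:
        if protein["organism"] != organism:
            continue
        cluster_id = protein["clusters"].get(level)
        hits = [
            other["accession"]
            for other in members.get(cluster_id, [])
            if other is not protein
            and other["reviewed"]
            and len(other["go_terms"]) >= min_go_terms
            and MEMBRANE_GO in other["go_terms"]
        ]
        if hits:
            relatives[protein["accession"]] = sorted(hits)
    return relatives


if __name__ == "__main__":
    # Made-up proteins for illustration only.
    toy = [
        {"accession": "X1", "organism": "My organism", "reviewed": False,
         "go_terms": set(), "clusters": {90: "C1"}},
        {"accession": "R1", "organism": "Other", "reviewed": True,
         "go_terms": {"GO:0016020", "GO:0005215"}, "clusters": {90: "C1"}},
    ]
    print(find_annotated_relatives(toy, "My organism", 90))
```

In a graph database the index built in step 1 is not needed, since the cluster membership is already stored as relationships between nodes; that is what makes the traversal simple.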

More information about Bio4j in its website, the blog, the wiki and the twitter profile.

Using Uniprot Advanced search

This short post shows how to use Uniprot Advanced search to retrieve proteins of interest. In this Biostar question, the goal is to retrieve human proteins with the “FYVE” domain.

First click on Advanced search on the Uniprot website (www.uniprot.org).

Then select Domain in the Field box and type FYVE.

Then click Add & Search, select Organism in the Field box and type Homo sapiens.

Clicking again on Add & Search gives you your set of proteins. If you click on Download on the right, you can download the data in several formats.

Similar searches (and more complex ones) can be done by combining the logical operators and the available fields.
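The idea of combining fields with a logical operator can be sketched in a few lines of Python. This only builds a query string; the field names `domain` and `organism` mirror the labels of the Field box and are placeholders, not verified names from the Uniprot query syntax, and `build_query` is a name invented for this example.

```python
def build_query(conditions, operator="AND"):
    """Join (field, value) pairs into a single query string.

    Values containing spaces are wrapped in double quotes.
    """
    parts = []
    for field_name, value in conditions:
        if " " in value:
            value = '"' + value + '"'
        parts.append(field_name + ":" + value)
    return (" " + operator + " ").join(parts)


if __name__ == "__main__":
    # Placeholder field names, for illustration only.
    print(build_query([("domain", "FYVE"), ("organism", "Homo sapiens")]))
```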

Bacterial diversity identification with normalized libraries?

Bacterial identification experiments with NGS are normally based on analyzing 16S rRNA amplicons, typically with 454 technology. In this kind of experiment the diversity of the bacterial populations in the samples is inferred from the similarity of the reads to already known 16S rRNA sequences, and the abundance of each detected species can be estimated from the number of reads assigned to that species in the sample.

This kind of approach, where the abundance of something is measured by the abundance of sequences annotated with it, is also applied in RNA-seq experiments with non-normalized libraries (where gene expression is the thing being measured).
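The read-counting idea described above boils down to a very small computation. Here is a Python sketch, assuming each read has already been assigned to a species by similarity to known 16S sequences; the function name `relative_abundance` and the sample labels are made up for illustration.

```python
from collections import Counter


def relative_abundance(read_assignments):
    """Return {species: fraction of reads} from a list with one species label per read."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: n / total for species, n in counts.items()}


if __name__ == "__main__":
    # Made-up assignments: one label per read.
    reads = ["species A", "species A", "species B", "species C"]
    print(relative_abundance(reads))
```

With a normalized library these fractions would no longer reflect the real proportions, since normalization deliberately flattens the differences in abundance.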

If the goal of the transcriptome experiment is not to quantify gene expression but to detect as many different transcripts as possible, then the library is normalized.

I wonder whether this approach could also be used in bacterial identification experiments with 16S amplicons, in cases where you want to detect the maximum number of species present in your sample.

Still playing with Gephi and GO (I)

I was wondering the other day what the whole Gene Ontology would look like in Gephi. After realizing that exporting the ontology to a gexf file could be pretty easy, I couldn’t resist giving it a try. Soon enough I realized that representing the three sub-ontologies of GO (biological process, cellular component and molecular function) at the same time was kind of too much. So in the end I decided to start with the smallest one (cellular component), and this is what I got.

Click here to open the pdf in a new window.

Using Gephi with gene ontology graphs

It’s already been a couple of weeks since we started integrating Gephi in some of our services and projects and I must say I’m quite impressed with the results so far.

One of the first uses we found for it was representing Gene Ontology protein annotations and GO Slim analyses.

Here you can see a graph representation of the Gene Ontology annotations of a protein set. A different color has been applied to each ontology (molecular function, biological process and cellular component). In addition, node size is proportional to the number of protein annotations.

Click here to open the svg in a new window.
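For reference, sizing nodes in proportion to their annotation counts can be sketched as follows in Python. This is not how the graph above was produced; the function name `node_sizes` and the `max_size` value are assumptions chosen for illustration.

```python
def node_sizes(annotation_counts, max_size=40.0):
    """Return {term: node size}, with size proportional to the annotation count.

    The term with the most annotations gets max_size.
    """
    if not annotation_counts:
        return {}
    largest = max(annotation_counts.values())
    if largest == 0:
        return {term: 0.0 for term in annotation_counts}
    return {
        term: max_size * count / largest
        for term, count in annotation_counts.items()
    }


if __name__ == "__main__":
    # Made-up counts: number of proteins annotated with each term.
    print(node_sizes({"GO:0016020": 10, "GO:0005524": 5}))
```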