ohnosequences!

era7 bioinformatics R&D group

Bio4j preprint

It’s been years since we started working on Bio4j, and during all this time we have prioritized working on Bio4j over writing about (and publishing) what we were doing. Thanks in part to this, Bio4j 1.0 is going to be an amazing resource for the Bioinformatics community, built by only a handful of people; which I personally find quite impressive, given the scale and scope of the project.

However, not having a standard way of citing Bio4j was starting to cause some difficulties for Bio4j users; citing a link to a GitHub repository is (still) not generally accepted practice. This was so even for ourselves: we have several papers in the works which build on Bio4j, blocked because it would not make sense to publish them before Bio4j itself.

Well, this is no longer an issue: yesterday a preprint in the bioRxiv went online, describing what Bio4j is.

We chose the bioRxiv because preprints there are easily citable (they get a DOI assigned), and several open-access journals are happy with publishing a manuscript based on a preprint submitted to it (something which we will certainly do in the next few months).

I’m including below the citing info and abstract; any sort of feedback is welcome!



Bio4j: a high-performance cloud-enabled graph-based data platform

Pablo Pareja-Tobes, Raquel Tobes, Marina Manrique, Eduardo Pareja, Eduardo Pareja-Tobes
bioRxiv doi: 10.1101/016758

Background. Next Generation Sequencing and other high-throughput technologies have brought a revolution to the bioinformatics landscape, by offering sheer amounts of data about previously unaccessible domains in a cheap and scalable way. However, fast, reproducible, and cost-effective data analysis at such scale remains elusive. A key need for achieving it is being able to access and query the vast amount of publicly available data, specially so in the case of knowledge-intensive, semantically rich data: incredibly valuable information about proteins and their functions, genes, pathways, or all sort of biological knowledge encoded in ontologies remains scattered, semantically and physically fragmented.

Methods and Results. Guided by this, we have designed and developed Bio4j. It aims to offer a platform for the integration of semantically rich biological data using typed graph models. We have modeled and integrated most publicly available data linked with proteins into a set of interdependent graphs. Data querying is possible through a data model aware Domain Specific Language implemented in Java, letting the user write typed graph traversals over the integrated data. A ready to use cloud-based data distribution, based on the Titan graph database engine is provided; generic data import code can also be used for in-house deployment.

Conclusion. Bio4j represents a unique resource for the current Bioinformatician, providing at once a solution for several key problems: data integration; expressive, high performance data access; and a cost-effective scalable cloud deployment model.

INTERCROSSING cloud and NGS course

Last August we gave a course as part of the INTERCROSSING Initial Training Network, of which we are members. The course was titled Cloud Computing and NGS Data Analysis.

The audience composition was a mix of Biologists, Mathematicians, Bioinformaticians and Statisticians; we tried to have something for all of them.

We set up 3 working groups, each mixing people with different backgrounds: in every group you could find people from biology and from maths/CS. We made up these interdisciplinary teams on purpose. We wanted to simulate as closely as possible the kind of working environment they could find in the real world, with its pros and cons: the advantage of being able to learn from one another, but also the difficulty of understanding people with a completely different background.

From the beginning the students were given a problem to solve, outbreak pathogen identification, and were asked to design a system to tackle it. We wanted them to come up with a solution as realistic as possible. We picked a challenging problem with no defined solution to encourage their creativity and to make them work hard in a scenario similar to real-life problems. We were not so interested in any particular solution as in how they approached the problem and designed a solution from scratch. We really wanted them to think about a real-life problem, and we think we achieved it.

So we scheduled the talks so that during the first days they would get the general concepts they needed to tackle the problem; from then on they also had sessions for teamwork and for asking questions. On Friday each team explained its system to the others.

It was a pretty intensive week for them and for us! We had a great time supervising their work and watching from the background how they developed their projects. We hope they enjoyed the course as much as we did!

course slides

You can get the slides from the course here

A short report about the Conference Exploring Human Host-Microbiome Interactions in Health and Disease at Wellcome Trust, Hinxton on May 8 - 10, 2012

In my opinion, Curtis Huttenhower presented the newest perspectives for understanding the human microbiome. Some ideas from his talk:

  • Not who, but why: the IBD microbiome is defined by adaptation to oxidative stress.
  • Niche specialization is crucial in the human microbiome.
  • Human Microbiome Project Unified Metabolic Analysis Network: Who’s there vs what they’re doing.
  • Species-level resolution is critical for understanding microbiome function, strain-level is even better.

Some interesting ideas presented at the conference:

Ian Wilson:

  • Disease and microbiome: causal or casual?
  • Recolonization after antibiotics resulted in cage-dependent subgroups with different bacterial populations and metabotypes.

Karen Scott:

  • Anaerobic Roseburia/E. rectale group bacteria constitute 10% of gut microbiota in healthy individuals.
  • The main fermentation product of Roseburia species is butyrate. The genomic analysis of the enzymes involved is complex.
  • Dietary interventions (augmenting the amount of starch consumed) can dramatically affect Roseburia genus presence in gut.

Barbara Pachikian:

  • Fructooligosaccharides (FOS) supplementation reverses hepatic steatosis.

Francisco Guarner:

  • In IBD mucosal lesions are caused by immune response against colonic bacteria.

Paul Cotter:

  • Bacteriocins as tools to alter the composition of gut microbiota.

Ruth Ley:

  • In the third trimester of pregnancy gut bacterial composition shows a dramatic shift.
  • The third trimester microbiota could induce low-grade inflammation, adiposity gain and reduced insulin sensitivity beneficial in pregnancy.
  • Gut microbiota’s impact on metabolism: highly adaptive in pregnancy and detrimental in obese host.

Fredrik Backhed:

  • Apoe -/- germ-free male mice are resistant to diet-induced atherosclerosis.
  • Gut microbiota modulate bile acid synthesis.

Simon Murch:

  • There are differences between the flora from cesarean infants and from vaginally delivered infants.
  • Different bacteria induce different gene expression in the host.
  • Breastfeeding and reduction of antibiotic exposure are critical for microbial diversity in low birth weight neonates.

A. Walker:

  • CF infants have more pathogenic bacteria in their lower respiratory microbiota (P. aeruginosa, S. maltophilia, B. cepacia)

Elisa Noll:

  • Gut-microbial metabolism can alter host metabolism in the favour of obesity and insulin resistance.

Fergus Shanahan:

  • What is a healthy gut? Low biodiversity implies more pathogen colonization.

Douwe van Sinderen:

  • Bifidobacterial pili and surface polysaccharide allow competitive host colonization and persistence. Perhaps genes involved in colonization are not expressed in vitro.

Catarina Simoes:

  • Diet affects the gut microbiota in twins.

Kathleen Sim:

  • Missing bifidobacteria: undercounting in neonatal gut microbiota using classical universal primers.

Julian Marchesi:

  • Experiments colonizing the gut of mice with human microbiota.
  • Core functions of gut microbiota need to be useful to the host but at the same time to the bacteria.

Willem de Vos:

  • Gut microbiome with low diversity in colitis
  • Ileum microbiota is shaped by fast sugar uptake
  • Microbiota diversity likely to provide resilience
  • Diversity and stability characterize a healthy microbiome.

Jeremy K. Nicholson:

  • There are many more targets for cancer in the human microbiome than in the human genome.
  • The microbiome as a therapeutic target
  • Influence of maternal genome on early microbiome development.

Neo4j Server and AWS become good friends!

Hi everyone!

Christmas Eve is almost here and there’s still time for a last-minute present. Thanks to CloudFormation and this template I’m about to show you, Neo4j Server is now friends with AWS (Amazon Web Services) and together they bring you the opportunity of getting your own fresh Neo4j Server machine running in just a few clicks!

I created the GitHub repository Neo4jAWS, where you can find all the files needed for this (there are actually not many, but I’ll probably be adding more tools for Neo4j and AWS integration soon).

Ok, so what does this CloudFormation template actually do?

  1. It launches an instance in the availability zone you decide and with a type of your choice (you should also provide your key pair)
  2. It attaches the volume containing your Neo4j DB to the new instance (you must provide your volume ID)
  3. It downloads the latest Neo4j stable release (1.5) and overwrites the server properties file with your own file (you have to provide a public URL where it is available)
  4. Finally, it copies your DB folder under the server’s data folder and then starts the Neo4j Server (you have to provide the name of that folder as a parameter for the template)

And, what do you have to do?

  1. Go to the CloudFormation section of the AWS console
  2. Click on ‘Create New Stack’ button
  3. Download the template file from the github repository and then select the option ‘Upload a Template File’
  4. You should now be seeing the parameters window, where you should enter the values: KeyPairName, Neo4jDBFolder, AvailabilityZone, EBSVolumeID, ServerPropertiesFile (this should be a public URL), and InstanceType
  5. Once you’ve reviewed that everything’s OK, just click next and wait for the stack to change to the state CREATE_COMPLETE
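The parameter list above can also be written down as data before you fill in the console form. The following is only a minimal Python sketch: the six parameter names come from the template, but the helper `build_stack_parameters`, the URL check, and every example value are assumptions made up for illustration. The key/value pair layout follows the shape the CloudFormation CreateStack API uses for its `Parameters` argument.

```python
# Parameter names as listed in step 4 above.
REQUIRED_PARAMETERS = (
    "KeyPairName",
    "Neo4jDBFolder",
    "AvailabilityZone",
    "EBSVolumeID",
    "ServerPropertiesFile",
    "InstanceType",
)


def build_stack_parameters(values):
    """Turn a {name: value} dict into ParameterKey/ParameterValue pairs.

    Hypothetical helper: it only checks that nothing is missing and that
    the properties file looks like a URL.
    """
    missing = [name for name in REQUIRED_PARAMETERS if not values.get(name)]
    if missing:
        raise ValueError("missing template parameters: " + ", ".join(missing))
    if not values["ServerPropertiesFile"].startswith(("http://", "https://")):
        raise ValueError("ServerPropertiesFile should be a public URL")
    return [
        {"ParameterKey": name, "ParameterValue": values[name]}
        for name in REQUIRED_PARAMETERS
    ]


# Placeholder values for illustration only; none of them come from the post.
example_values = {
    "KeyPairName": "my-key-pair",
    "Neo4jDBFolder": "mydb",
    "AvailabilityZone": "eu-west-1a",
    "EBSVolumeID": "vol-00000000",
    "ServerPropertiesFile": "https://example.com/neo4j-server.properties",
    "InstanceType": "m1.large",
}
stack_parameters = build_stack_parameters(example_values)
```

Checking the values up front like this is just a convenience; the console performs its own validation when the stack is created.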

UPDATE –> Here you have the set of screenshots you should see in the process:

CloudFormation tab: click on Create new stack button.

Give a name to your stack and choose the option for uploading a file, browsing to the template file you previously downloaded from Neo4jAWS repository. Click ‘Continue’ then.

You should be seeing something like this by now. It’s time to provide all the parameters! When you’re done, click on ‘Continue’ after reviewing the values and just wait for it to change to state ‘CREATE_COMPLETE’ ;)

If nothing weird happens, you should be able to see the WebAdmin in your browser by entering as URL the public IP given as output of the stack, plus the port you specified in your neo4j-server.properties file.

Beware that by default the template opens port 7474 for communicating with the Server. If you want to use another port number for any reason, you should change the SecurityGroup manually.

As always, please don’t hesitate to give any kind of feedback or suggestion you may have, as well as pointing to possible issues/bugs (you can use github issues in the repository for that).

Happy Holidays!

@pablopareja

Gene Ontology (GO) graph visualizations for EHEC automatic annotation of BGI V4 assembly

Hi everyone!

A couple of days ago I published a post describing how to obtain cool GO annotation visualizations with Gephi + Bio4j. As an example I used data from one of the first assemblies of the EHEC genome, and today I was wondering: why not use the latest version of the BGI assembly, which we annotated with our great BG7 bacterial genome annotation pipeline, and put together the visualizations for the three sub-ontologies? Here you have the result:

Biological Process

(Please click on the image above to check the zoomable version)

Molecular Function

(Please click on the image above to check the zoomable version)

Cellular Component

(Please click on the image above to check the zoomable version)

Have a good weekend! ;)

@pablopareja

Playing with microsatellites (Simple Sequence Repeats), Java, and Neo4j

Hi!

I just finished this afternoon a small project I had to do about identification of microsatellites in DNA sequences. As with every new project I start, I think of something that:

  • I didn’t try before
  • is worth learning
  • is applicable in order to meet the needs of the specific project

These last few days I got the chance to get to know and try the visualization tool included in the latest version of the Neo4j Webadmin dashboard. I had already heard of it a couple of times from different sources but had not yet had the chance to play with it. So, after my first contact with it I have to say that, although it’s something Neo4j introduced only in recent versions, it already has a decent GUI and promising functionality.

Apart from GUI considerations, I created the repository MicrosatellitesNeo4jModel with a bunch of node and relationship wrappers as an API for performing traversals over all this data in an easy way. Here is the domain model I chose:

Microsatellites Neo4j domain model

On the programs side, I developed two Java classes: one dealing with the identification of the microsatellites and their subsequent storage in the Neo4j DB (CreateMicrosatellitesDB.java), and another (ExtractDataToCSV.java) for extracting statistical information for a set of specific parameters such as tuple length. Both classes are in the repository Microsatellites.
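To give an idea of what identifying microsatellites involves, here is a minimal Python sketch. It is not the code in CreateMicrosatellitesDB.java (that one is Java and also handles the Neo4j storage); the function name, its parameters and the regular-expression approach are assumptions chosen only for illustration. It looks for a tuple of a given length that is immediately repeated a minimum number of times.

```python
import re


def find_microsatellites(sequence, unit_length, min_repeats):
    """Return (start, tuple, repeat_count) for each tandem repeat found.

    Illustrative only: a tuple of `unit_length` bases followed by at least
    `min_repeats - 1` further copies of itself (matched with a backreference).
    """
    pattern = re.compile(
        r"([ACGT]{%d})\1{%d,}" % (unit_length, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        repeats = len(match.group(0)) // unit_length
        hits.append((match.start(), unit, repeats))
    return hits


# A dinucleotide repeated four times, starting at position 2.
print(find_microsatellites("TTACACACACGG", unit_length=2, min_repeats=3))
# prints [(2, 'AC', 4)]
```

Note that this simple pattern also reports runs of a single base as repeats of a longer tuple (for example `AAAAAA` as `AA` three times); a real pipeline would probably want to filter those out.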

Once the DB was created, I played a bit with the display profiles in the WebAdmin data browser so that different node types had a different aspect and this is what I got:

Microsatellites DB data browser screenshot

Here you can find blue circles (sequence IDs), orange boxes (tuples repeated in the microsatellites found), and greenish squares (tuple length nodes).

One of the features I was missing in the visualization was style rules for relationships as well as for nodes. This was especially important in my case, where I have relevant information stored as relationship attributes (I actually could not visualize the number of tuple repeats in the microsatellites found, just the name of the relationship ‘MICROSATELLITE_FOUND’ everywhere). However, I posted a question about this on the Neo4j user list and it seems they are already working on it, cool!

As always, everything here is open source and released under AGPLv3.

Cheers,

@pablopareja

Some thoughts about Neo4j

Today I found an interesting discussion on the Neo4j user list and found myself in the mood to write down a couple of related thoughts I have had in mind these last months.

Here they are: (the titles are taken from the guidelines for building your domain model)

Use reference and subreference nodes to organize entry points

In principle I don’t agree with this. Well, I do in theory (it actually is how I implemented things in the beginning). However, in one of my projects, where I’m dealing with huge numbers of relationships, reference and subreference nodes become supernodes, which in turn are a very problematic bottleneck in most traversals (this is due to the lack of a natively implemented system for discerning between relationships of different types; I opened an issue about this here). Until this is solved, I’m always tempted/forced to start using indexes instead of relationships, but then I wonder how different things are in the end compared to other, non-graph-based DB systems!?
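To make the supernode problem a bit more concrete, here is a toy Python model. It says nothing about how Neo4j actually stores relationships; the classes `FlatNode` and `TypedNode` and the relationship type names are made up for illustration. The point is only that when all relationships of a node sit in one undifferentiated collection, asking for the few relationships of one type means walking past all the others.

```python
from collections import defaultdict


class FlatNode:
    """Toy node keeping all relationships in one list, whatever their type."""

    def __init__(self):
        self.relationships = []  # list of (rel_type, target) pairs

    def add(self, rel_type, target):
        self.relationships.append((rel_type, target))

    def neighbours(self, rel_type):
        # Every relationship is inspected, even those of other types.
        return [t for (r, t) in self.relationships if r == rel_type]


class TypedNode:
    """Toy node grouping its relationships by type."""

    def __init__(self):
        self.by_type = defaultdict(list)

    def add(self, rel_type, target):
        self.by_type[rel_type].append(target)

    def neighbours(self, rel_type):
        # Only the relationships of the requested type are touched.
        return list(self.by_type.get(rel_type, []))


# A "reference node" with many relationships of one type and two of another.
flat, typed = FlatNode(), TypedNode()
for node in (flat, typed):
    for i in range(100000):
        node.add("PROTEIN", "p%d" % i)
    node.add("GENOME", "g0")
    node.add("GENOME", "g1")

# Same answer, but FlatNode scanned 100002 entries to find two of them.
assert flat.neighbours("GENOME") == typed.neighbours("GENOME")
```

An external index plays roughly the role of `by_type` here, which is why it is tempting, and also why it raises the question of what the graph itself is still contributing.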

Use relationships types appropriately

I’m gonna be frank about this: I never understood why relationship types are mandatory but node types are not!? It just doesn’t make any sense to me. I can understand that this could bring some trouble depending on implementation decisions taken at core level, but then why do this only halfway? It would have been better to have no restriction for either nodes or relationships than how things are now.

With all this having been said, I still find Neo4j a very promising DB for the near future and I’m really happy to use it in a lot of different projects/use-cases; however, I think the way for it to get better each day is not to keep saying how cool it is but to actually point at the weak spots it may have.

Pablo Pareja

Automatic annotation of E. coli H112180280 strain (sequence and assembly by HPA)

And right now we’ve finished the automatic annotation of the assembly of the new strain that HPA (Health Protection Agency UK) made available yesterday (get the assembly file here: http://www.hpa-bioinformatics.org.uk/lgp/resource/454Scaffolds.fna)

Once more we’ve used BG7 and the same set of reference proteins (137,063 proteins in total):

  • The representative Uniprot proteins corresponding to all Uniref90 clusters for all Escherichia coli proteins
  • All Uniprot proteins from organisms including in their name the terms “EHEC” or “EAEC”
  • All Uniprot proteins from bacteria that have in any Uniprot field the term “toxin”
  • All Uniprot proteins from bacteria that have in any Uniprot field the term “hemolysin”
  • All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae

Results

We’ve detected 5,916 genes

  • 5,792 protein encoding genes
  • 124 RNA genes

4,912 out of the 5,792 (84.80%) protein encoding genes have canonical start and stop codons and neither frameshifts nor intragenic stop codons.

615 out of the 5,792 (10.61%) protein encoding genes have frameshifts or intragenic stop codons in their sequences, probably caused by errors inherent to the sequencing technology.
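For anyone who wants to re-check these figures, here is a small Python sketch of the arithmetic. The counts are the ones reported above; the helper `truncated_percent` is made up for illustration. The published percentages match when the value is truncated to two decimals (plain rounding would give 84.81% and 10.62%).

```python
def truncated_percent(count, total):
    """Percentage of `count` in `total`, truncated (not rounded) to 2 decimals."""
    hundredths = count * 10000 // total  # integer arithmetic avoids float noise
    return "%d.%02d%%" % (hundredths // 100, hundredths % 100)


protein_genes = 5792
rna_genes = 124
clean_genes = 4912    # canonical start/stop, no frameshifts or internal stops
flawed_genes = 615    # frameshifts or intragenic stop codons

assert protein_genes + rna_genes == 5916          # total genes detected
print(truncated_percent(clean_genes, protein_genes))   # prints 84.80%
print(truncated_percent(flawed_genes, protein_genes))  # prints 10.61%

# Genes in neither of the two groups described above.
print(protein_genes - clean_genes - flawed_genes)      # prints 265
```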

You can get the results of the annotation here https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/tree/master/strains/H112180280/seqProject/HealthProtectionAgencyUK/annotations/era7bioinformatics/era7_HPA_H112180280_annotations

Automatic annotation of BGI V3 assembly of E. coli TY-2482 genome

We’ve finished the automatic annotation of the third BGI assembly of the E. coli TY-2482 strain genome (get the assembly file here ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482/Escherichia_coli_TY-2482.scaffold.20110610.fa.gz)

As in the other annotations we’ve done so far, we used the BG7 system to annotate the genome, with the same set of reference proteins (137,063 proteins in total):

  • The representative Uniprot proteins corresponding to all Uniref90 clusters for all Escherichia coli proteins
  • All Uniprot proteins from organisms including in their name the terms “EHEC” or “EAEC”
  • All Uniprot proteins from bacteria that have in any Uniprot field the term “toxin”
  • All Uniprot proteins from bacteria that have in any Uniprot field the term “hemolysin”
  • All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae

Results

We’ve detected 5,936 genes

  • 5,806 protein encoding genes
  • 130 RNA genes

4,881 out of the 5,806 (84.06%) protein encoding genes have canonical start and stop codons and neither frameshifts nor intragenic stop codons.

533 out of the 5,806 (9.18%) protein encoding genes have frameshifts or intragenic stop codons in their sequences, probably caused by errors inherent to the sequencing technology.

You can get the results of the annotation here https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/tree/master/strains/TY2482/seqProject/BGI/annotations/era7bioinformatics/BGI_V3

Missed region in EHEC

David Studholme detected a missed region in the EHEC TY-2482-v1 assembly (http://www.genomic.org.uk/blog/?p=523) that was also absent in TY-2482-v2.

This is the region, containing a type VI secretion system, that surrounds the missed region detected by Studholme:

contig  Era7 geneID  ini    Era7 tags            Protein name
106     108864       2776   secretion system VI  Putative uncharacterized protein
106     83591        6193   secretion system VI  Putative uncharacterized protein
106     85611        6792   secretion system VI  Putative uncharacterized protein
106     107028       7091   secretion system VI  Putative type VI secretion protein
106     79778        8549   secretion system VI  Putative uncharacterized protein
106     58451        9104   secretion system VI  Putative uncharacterized protein
106     108747       10185  secretion system VI  Putative uncharacterized protein
106     60210        10494  secretion system VI  Putative uncharacterized protein
106     106509       10965  secretion system VI  Putative uncharacterized protein
106     75277        12958  secretion system VI  Putative type VI secretion protein
106     99465        13930  secretion system VI  Putative uncharacterized protein
106     92338        15696  secretion system VI  Putative uncharacterized protein
106     94122        16111  secretion system VI  Putative uncharacterized protein
106     81926        16627  secretion system VI  Putative type VI secretion protein
106     95285        18115  secretion system VI  Putative type VI secretion protein
106     102409       18594  secretion system VI  Putative type VI secretion protein
106     11998        19032                       Transposase


These are the BLASTN results for that region of BGI_assembly_v2 vs EC 55989:

query id  subject id    % identity  align len  q. start  q. end  s. start  s. end   evalue
106       gi|218350208  99.88       18937      1         18937   3369005   3387941  0.0
106       gi|218350208  100.00      3964       18923     22886   3389265   3393228  0.0
106       gi|218350208  99.92       3832       22885     26716   3397877   3401707  0.0
106       gi|218350208  99.84       2455       26837     29291   3427391   3429843  0.0
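One way to read these hits is to look at which stretches of the EC 55989 sequence are not covered by any of them. The following Python sketch does that with the four hits above; the function `subject_gaps` and the whitespace-separated input layout are assumptions for illustration, not part of our pipeline.

```python
BLAST_HITS = """
106 gi|218350208 99.88  18937 1     18937 3369005 3387941 0.0
106 gi|218350208 100.00 3964  18923 22886 3389265 3393228 0.0
106 gi|218350208 99.92  3832  22885 26716 3397877 3401707 0.0
106 gi|218350208 99.84  2455  26837 29291 3427391 3429843 0.0
"""


def subject_gaps(hits_text):
    """Return the (start, end) subject intervals left uncovered between hits."""
    intervals = []
    for line in hits_text.strip().splitlines():
        fields = line.split()
        # Columns 7 and 8 (index 6 and 7) are s. start and s. end.
        s_start, s_end = int(fields[6]), int(fields[7])
        intervals.append((min(s_start, s_end), max(s_start, s_end)))
    intervals.sort()

    gaps = []
    covered_until = intervals[0][1]
    for start, end in intervals[1:]:
        if start > covered_until + 1:
            gaps.append((covered_until + 1, start - 1))
        covered_until = max(covered_until, end)
    return gaps


for gap_start, gap_end in subject_gaps(BLAST_HITS):
    print(gap_start, gap_end, gap_end - gap_start + 1)
# prints:
# 3387942 3389264 1323
# 3393229 3397876 4648
# 3401708 3427390 25683
```

The largest uncovered stretch, about 25.7 kb starting right after position 3401707, is where the contig has no counterpart for the EC 55989 sequence.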


These are the EC 55989 genes missing from BGI assembly v2:

Product                                                    Start    End      Locus tag
hypothetical protein EC55989_3318                          3401747  3402160  EC55989_3318
hypothetical protein EC55989_3319                          3402305  3402739  EC55989_3319
hypothetical protein EC55989_3320                          3402739  3403275  EC55989_3320
hypothetical protein EC55989_3321                          3403256  3404356  EC55989_3321
hypothetical protein EC55989_3322                          3404311  3406074  EC55989_3322
hypothetical protein; putative membrane protein            3406082  3406879  EC55989_3323
hypothetical protein EC55989_3324                          3406776  3408374  EC55989_3324
hypothetical protein EC55989_3325                          3408374  3411763  EC55989_3325
hypothetical protein EC55989_3326                          3411756  3412904  EC55989_3326
hypothetical protein EC55989_3327                          3412908  3413174  EC55989_3327
Conserved hypothetical protein. Putative exported protein  3413206  3413883  EC55989_3328
conserved hypothetical protein; putative exported protein  3414027  3414704  EC55989_3329
hypothetical protein EC55989_3330                          3414724  3416406  EC55989_3330
hypothetical protein EC55989_3331                          3416403  3418928  EC55989_3331
hypothetical protein EC55989_3333                          3419451  3420026  EC55989_3333
putative chaperone clpB                                    3420014  3422680  EC55989_3334
hypothetical protein EC55989_3335                          3422840  3423331  EC55989_3335
hypothetical protein EC55989_3336                          3423337  3425163  EC55989_3336
hypothetical protein EC55989_3337                          3425070  3425789  EC55989_3337
hypothetical protein EC55989_3338                          3425720  3427057  EC55989_3338
hypothetical protein EC55989_3339                          3427073  3428617  EC55989_3339


We will review this region in the new annotations that we will do for the two newly available genome sequences contributed by HPA and BGI.