Wednesday, July 16, 2008

Gmap on Github

Hi all,

Sorry for the blackout period. A couple of you asked me about the Gmap Ruby class I wrote some time ago, so I am pleased to tell you that the first version of this work is available on GitHub at the following address:

http://github.com/fstrozzi/gmap

If you want to contribute or help, you are welcome; otherwise you can download the gem directly and use it. Please keep in mind that I am not a Ruby guru, so the code probably isn't very elegant, and there are surely thousands of ways to do the same things I have done (and probably better). But it works and it is memory efficient (I need to parse gigabytes of data from Gmap). So if you have comments or suggestions, I will really appreciate them.

Bye

Thursday, February 28, 2008

Gmap

In this period I'm working on an EST project in which we have to map millions of sequences against the genome. We are currently using a program called Gmap, which is designed to deal with short expressed sequences. Gmap also works with "gene maps", using known gene positions on the genome to annotate the alignment results. Unfortunately this program does not seem to be widely used by the community, and there are no parsers available in any language (or at least I didn't find any).
So, this seemed a perfect situation for Ruby. I created a simple parser based on the default output of the program, which is very informative. Here is an example of the output... check it out:


>bt1_4_9053053
Paths (5):
Path 1: query 1--36 (36 bp) => chr chrMT:15,335--15,370 (36 bp)
cDNA direction: indeterminate
Genomic pos: bt3.1:2,334,349,549--2,334,349,584 (+ strand)
Accessions: chrMT:15,335--15,370 (out of 16338 bp)
Number of exons: 1
Coverage: 100.0 (query length: 36 bp)
Trimmed coverage: 100.0 (trimmed length: 36 bp, trimmed region: 1..36)
Percent identity: 100.0 (36 matches, 0 mismatches, 0 indels, 0 unknowns)
Translation: 1..36 (12 aa)
Amino acid changes:

Alignments:
Alignment for path 1:

+chrMT:15335-15370 (1-36) 100%

0 . : . : . : .
aa.g 1 L I C I R N L T I N P Q
+chrMT:15335 CTTATTTGCATACGCAATCTTACGATCAATCCCCAA
||||||||||||||||||||||||||||||||||||
1 CTTATTTGCATACGCAATCTTACGATCAATCCCCAA
aa.c 1 L I C I R N L T I N P Q

Maps:
Map hits for path 1 (1):
gene_maps chr10:14514..15653 3283889


The simple parser I created works well, retrieving all the important information such as the query sequence coordinates, the target sequence coordinates, the mismatches, the indels, the identity, the coverage, the whole alignment and the overlap with known genes. I've created a Ruby class for this and I had a lot of fun!
The class simply opens the file for reading and then parses the information for each result, creating a Gmap::Result object, which includes all of the above as attributes. Here is how a script using the Gmap class looks:


Gmap.open(filename) do |gmap|
  gmap.each_result do |r|
    # do something
  end
end
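
To give an idea of the kind of line-by-line parsing involved, here is a minimal sketch that extracts the coordinates from a single "Path" line of the output shown above. This is not the code of the Gmap class itself: the names PathHit, PATH_LINE, to_coord and parse_path_line are invented for this example, and the regular expression is only based on the sample output in this post.

```ruby
# Hypothetical helper, NOT part of the Gmap class described in this post.
# It only handles the "Path N:" summary line of the default Gmap output.
PathHit = Struct.new(:index, :query_start, :query_end,
                     :target, :target_start, :target_end)

# Pattern written against the sample line shown above (an assumption).
PATH_LINE = /\A\s*Path\s+(\d+):\s+query\s+(\d+)--(\d+)\s+\(\d+ bp\)\s+=>\s+chr\s+([^:\s]+):([\d,]+)--([\d,]+)/

# Remove the thousands separators Gmap prints ("15,335" -> 15335).
def to_coord(text)
  text.delete(",").to_i
end

# Return a PathHit for a "Path N:" line, or nil for any other line.
def parse_path_line(line)
  m = PATH_LINE.match(line)
  return nil unless m
  PathHit.new(m[1].to_i, m[2].to_i, m[3].to_i,
              m[4], to_coord(m[5]), to_coord(m[6]))
end

line = "Path 1: query 1--36 (36 bp) => chr chrMT:15,335--15,370 (36 bp)"
hit = parse_path_line(line)
puts "#{hit.target}:#{hit.target_start}-#{hit.target_end}"  # prints chrMT:15335-15370
```

Reading the file one line at a time and keeping only small objects like this one is what keeps the memory usage low, even on very large output files.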


If anyone knows Gmap and wants to provide help or suggestions, you are welcome. If someone else is desperate enough to be using the same program and wants to work with a Ruby-based OO parser (written by a newbie, of course), just let me know ;)

Thursday, February 21, 2008

New BioRuby Web Site

After the BioHackathon in Japan, the BioRuby people set up a new web site, with a wiki, tutorials and a very cool front page. Check it out!

http://bioruby.open-bio.org/

Thanks people, for the work you are doing with BioRuby!

Wednesday, February 20, 2008

RailsBeans

As I'm writing every web interface using Ruby on Rails under Linux, I needed to find a good editor to help me work on the Rails apps (normally I used TextMate under Mac OS X, but now I have a Linux box at work). I come from the VIM school and I have never really been interested in IDEs (Integrated Development Environments), as I preferred something simpler and faster to edit my programs and files. Also, I normally work on servers and clusters via SSH, so a shell editor like VIM was perfect for every situation. But VIM can be very unpleasant when you need to work on several files, moving between directories, as every Rails user normally needs to do.
So, basically I needed two things:
1) A cool graphical application to work with Rails under Linux
2) As I have all the web sites on a remote server, something to "mount" remote directories as local mount points on a Linux box

For point 2, I found SSHFS, part of the FUSE project. As the name indicates, it is based on the SSH protocol, so all the data you exchange between your computer and the server is encrypted. Using this utility and adding FUSE support to the kernel of my Linux box (just a modprobe), I was able to mount remote directories from the server as local mount points. After this, I just needed a good graphical editor to edit my files!

For point 1, I tested different IDEs and editors, and I report here just the few that caught my attention (all are open source and free):

- Kate: part of the KDE environment, but also usable under Gnome (my default DE); it is a very good editor and also includes a terminal in the same window (very useful).

- Anjuta Editor: this is designed for Gnome and is a complete IDE. It includes a terminal and some interesting features via installable plugins, but it did not seem very customizable.

- NetBeans: the famous Java IDE. Complete and very very well designed.

In the end I preferred NetBeans for many different reasons: the Ruby/Rails version is only around 22 MB to download and has full support for autocompletion of functions and commands, and it also includes all the Ruby and Rails API documentation. It also has support for version control systems like CVS and Subversion (the one I currently use), and you can manage direct connections to any database system using JDBC. It also has a very good window layout to manage and control all the aspects of a single project.

Definitely, in my opinion, the best choice under Linux for a complete and professional Ruby/Rails IDE.
Enjoy

Thursday, February 14, 2008

PostgreSQL and Bioinformatics

It is not the first time that I have run into the PostgreSQL (www.postgresql.org) database management system, but to be honest I have never really worked with it. For many different reasons I preferred to use MySQL, as it is easier and bioinfo compliant :))).
But now I have started to work on a very large project (from the data point of view), with millions of EST sequences to be mapped on the genome. So very soon I will have to deal with tables having several million entries, and obviously I will have to merge this data with the available annotation data, such as gene positions, SNPs and so on. Also, all this will probably be published on the web in the near future, and I think that Ruby on Rails will be my best friend for that task!

So in this period I started to use PostgreSQL, trying to understand the basics, to see if this RDBMS can be the best choice for big public databases. Scrolling through the documentation, I have one single impression: PostgreSQL is amazing and includes so many features (like its procedural languages, or PLs) that it can easily be compared with industrial closed-source RDBMSs like Oracle. It also includes procedural languages based on Perl and Python, which can be a very interesting point for bioinformatics people.
The questions in my mind now are: will all these features be really useful for bioinformatics projects and data management? And how much time will be required to understand the basics of PostgreSQL and apply them successfully to biological data? I believe that the answers to most of these questions will depend on the database design I use.

As a first step, I wanted to import all the data from the current release of Ensembl (a procedure that I normally do using MySQL) and try to run some analysis scripts on the PostgreSQL database: just a very rough benchmark to check that everything is OK and whether there are speed improvements or other changes when using another RDBMS. This first step was a bit complicated, as the Ensembl SQL data are prepared for MySQL databases, so I had to translate the database schema provided on the Ensembl FTP site into a form that PostgreSQL can understand, as the SQL dialects of the two systems differ. Fortunately I found this script, http://pgfoundry.org/projects/mysql2pgsql/ , which helped me do this without much effort. After this, I discovered that PostgreSQL has no equivalent of "mysqlimport", so I wrote a workaround in Ruby that uses the PostgreSQL COPY command and the "psql" client to accomplish the task.
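
As a rough illustration of that workaround (this is a sketch, not the script I actually used), the idea is to build one psql call per dump file and let the client-side \copy command load it. The database name, the file path and the helper names table_for and psql_copy_command are placeholders invented for this example, and it assumes one tab-delimited text file per table, named after the table.

```ruby
# Sketch only: placeholders and helper names are assumptions,
# not the original import script.

# Derive the table name from a dump file name ("/data/gene.txt" -> "gene").
# Assumes one uncompressed .txt file per table.
def table_for(path)
  File.basename(path, ".txt")
end

# Build the psql invocation that loads one file into its table.
# \copy runs COPY on the client side, so psql reads the file itself;
# the default text format expects tab-separated columns and \N for NULL.
# Note: a path containing a single quote would break this simple quoting.
def psql_copy_command(database, path)
  copy = "\\copy #{table_for(path)} FROM '#{path}'"
  ["psql", "-d", database, "-c", copy]
end

cmd = psql_copy_command("ensembl_core", "/data/ensembl/gene.txt")
p cmd
# To really load the file, pass the array to system, which avoids
# any extra shell quoting:
#   system(*cmd) or abort("import failed for #{cmd.last}")
```

Looping over all the dump files in a directory and calling this for each one gives something close to what "mysqlimport" does, at least for simple tables.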

So, in the end, I finally had the Ensembl Core database installed inside PostgreSQL and I started to use the Ensembl API (Perl) to check that everything was OK. I installed the DBD::Pg (http://search.cpan.org/dist/DBD-Pg/Pg.pm) driver for the Perl DBI interface without problems and set the correct driver for the DBSQL adaptor of the API (which currently uses Perl DBI). But the Ensembl API did not seem to work properly with PostgreSQL, and I received several SQL errors from the simple program I used. So I believe that the Ensembl API is written using SQL statements that are not fully compatible with RDBMSs other than MySQL. I didn't find anything about this in the API docs (http://www.ensembl.org/info/using/api/Pdoc/ensembl/index.html), and for the moment I have decided not to use the Ensembl database with PostgreSQL.

But I really WANT to use PostgreSQL for my projects and I will write down here my future experiences, hoping to be luckier the next time...

Wednesday, February 6, 2008

BioRuby Tutorial

Tutorials and HowTos are critical resources to help people start working with new packages and/or software. In this respect, BioPerl and other OpenBio projects have more information than BioRuby.
Now, thanks to the work of Pjotr Prins, the BioRuby project has a more complete tutorial!

Many thanks to the people who spent time writing this very useful document for BioRuby newbies (such as me).

Thursday, December 13, 2007

Rails 2.0 is out!

The most important release of Rails since its creation is finally out!

Check the full report from the developers

As soon as possible I would like to move my Rails site to this new version and see what I will have to change in my code!