GeCo++ library Documentation

Version:
0.1
Author:
Matteo Cereda
Uberto Pozzoli

Introduction

GeCo++ (Genomic Computation C++ Library) is a library developed to manage genomic element annotations, sequences and positional genomic features. It grew out of the idea of representing and managing the numeric results of computational algorithms while keeping them tied to the annotations of genomic elements (transcripts, binding sites, conserved regions, transposable elements, etc.), to their sequences and to genomic variations. Given this level of abstraction and its inherent complexity, we considered an object-oriented software model to be the most appropriate. The choice of ISO C++ provides speed, portability and, most importantly for users, access to a great number of other efficient and specialized computational biology libraries.
GeCo++ defines two fundamental classes: gArray and gElement. The first is a general-purpose template array class. Its development was motivated by the need to provide both tracking of undefined/invalid elements (NA: Not Available) and memory-efficient array subsetting. NA tracking is implemented through a speed-optimized bit array class (gBitsArray), while memory-efficient subsetting relies on an internal reference counting mechanism that allows array subsets to be instantiated without duplicating data. A number of optimized methods perform typical array operations needed in bioinformatics, such as sorting, finding values, counting, reversal and descriptive statistics. Windowed versions of the same methods are also provided. Furthermore, a gArray object can be indexed either by a scalar value or by another gArray for multiple indexing. Type casting between different template specializations is also supported.
Three additional classes are derived from gArray: gMatrix, still a template, with additional methods for handling matrices; gString, which specializes gArray to manage character strings; and gSequence, which inherits from gString and adds methods for basic sequence management.
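The two ideas behind gArray, NA tracking through a bit array and subsetting without data duplication, can be illustrated with a minimal sketch. The class below is hypothetical: NaArray and all of its members are names invented for this example and are not the GeCo++ interface. It assumes a C++11 compiler, uses std::vector<bool> as a stand-in for a packed bit array, and uses std::shared_ptr as a stand-in for an internal reference counting mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical illustration only: NaArray is NOT the GeCo++ gArray class.
template <typename T>
class NaArray {
public:
    // Allocate n value-initialized elements, all initially valid (not NA).
    explicit NaArray(std::size_t n)
        : data_(std::make_shared<std::vector<T> >(n)),
          na_(std::make_shared<std::vector<bool> >(n, false)),
          offset_(0), size_(n) {}

    std::size_t size() const { return size_; }

    // Indices are relative to the start of this view.
    T& operator[](std::size_t i) {
        assert(i < size_);
        return (*data_)[offset_ + i];
    }
    const T& operator[](std::size_t i) const {
        assert(i < size_);
        return (*data_)[offset_ + i];
    }

    // NA flags live in a bit-packed array (std::vector<bool>).
    void setNA(std::size_t i, bool na = true) {
        assert(i < size_);
        (*na_)[offset_ + i] = na;
    }
    bool isNA(std::size_t i) const {
        assert(i < size_);
        return (*na_)[offset_ + i];
    }

    // Return a view on [start, start + len) that shares storage with
    // this array: only the two shared pointers are copied, not the data.
    NaArray subset(std::size_t start, std::size_t len) const {
        assert(start <= size_ && len <= size_ - start);
        NaArray view(*this);
        view.offset_ = offset_ + start;
        view.size_ = len;
        return view;
    }

    // Mean of the valid elements; returns false if every element is NA.
    bool mean(double& out) const {
        double sum = 0.0;
        std::size_t n = 0;
        for (std::size_t i = 0; i < size_; ++i) {
            if (!isNA(i)) {
                sum += static_cast<double>((*this)[i]);
                ++n;
            }
        }
        if (n == 0) return false;
        out = sum / static_cast<double>(n);
        return true;
    }

    // Number of NaArray objects currently sharing the same storage.
    long owners() const { return data_.use_count(); }

private:
    std::shared_ptr<std::vector<T> > data_;
    std::shared_ptr<std::vector<bool> > na_;
    std::size_t offset_;
    std::size_t size_;
};
```

In this sketch, writing through a subset changes the parent array as well, because both refer to the same storage. Whether gArray subsets behave the same way on write is not stated here, so refer to the class documentation before relying on it.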
A genomic element is defined as an interval of a given reference sequence along a given strand. Reference positions are absolute (unsigned) positions along a reference, while element positions are relative (signed) to the beginning of an element along its strand. Sites are particular positions along the element (for example transcription start sites, splice sites or protein binding sites), while a connection represents a directed relation between two sites (introns, exons). Positional features (PFs) are properties that vary along the element. While no assumption is made about the biological meaning of sites, connections and features, this model is general enough to represent the majority of real-world genomic elements and their features.
The GeCo++ library defines the class gElement as an implementation of this model: it allows users to instantiate objects representing genomic elements, which can contain sequence as well as site, connection and feature information. Element positions can be converted to reference positions and vice versa, and the positions of one element can be mapped onto another. Sequence and features are maintained as gArray objects, avoiding unnecessary data duplication. Their retrieval or calculation is kept independent of the gElement object itself through a hierarchy of gArrayRetriever objects, from which users can easily derive new classes implementing sequence retrieval or feature calculation algorithms. This mechanism makes it easier to develop applications that are independent of the specific computational algorithms. The most important characteristic of gElement objects is that they can be instantiated as sub-intervals of another element, taking into account a strand and the presence of genomic variations. Sequence, sites, connections and features are inherited by the new object consistently with the interval, the strand and the variations.
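The relation between reference positions and element positions can be made concrete with a small sketch. It is not the gElement interface: Interval, toReference, toElement and mapPosition are hypothetical names, and the sketch assumes 0-based coordinates with a half-open interval [start, end), a convention chosen here only for illustration. Genomic variations are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; this is NOT the GeCo++ gElement interface.
// Assumed convention: 0-based reference coordinates, half-open [start, end).
struct Interval {
    std::uint64_t start;  // first reference position covered (inclusive)
    std::uint64_t end;    // one past the last reference position (exclusive)
    bool forward;         // true: forward strand, false: reverse strand
};

// Convert a signed element position to an absolute reference position.
// Element position 0 is the first base of the element along its strand;
// negative positions lie upstream of the element.
std::uint64_t toReference(const Interval& e, std::int64_t pos) {
    const std::int64_t r = e.forward
        ? static_cast<std::int64_t>(e.start) + pos
        : static_cast<std::int64_t>(e.end) - 1 - pos;
    assert(r >= 0);  // the result must be a valid reference position
    return static_cast<std::uint64_t>(r);
}

// Convert an absolute reference position to a signed element position.
std::int64_t toElement(const Interval& e, std::uint64_t ref) {
    return e.forward
        ? static_cast<std::int64_t>(ref) - static_cast<std::int64_t>(e.start)
        : static_cast<std::int64_t>(e.end) - 1 - static_cast<std::int64_t>(ref);
}

// Map a position of element a onto element b (both on the same reference)
// by going through reference coordinates.
std::int64_t mapPosition(const Interval& a, std::int64_t pos, const Interval& b) {
    return toElement(b, toReference(a, pos));
}
```

For example, with a reverse-strand element covering [1000, 2000), element position 0 corresponds to reference position 1999, and element position -1 (one base upstream) corresponds to reference position 2000.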
Feature recalculation and sequence retrieval are kept to a minimum, avoiding unnecessary recalculation at unaffected positions. This makes it straightforward to evaluate the effect of genomic variations on features, positions and sequence.
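The idea of separating feature calculation from the element, and of recalculating only the affected positions after a variation, can be sketched as follows. This is an illustration under stated assumptions, not the gArrayRetriever hierarchy: FeatureRetriever, GcContentRetriever and updateAfterSubstitution are hypothetical names, the feature (windowed GC fraction) is chosen arbitrarily, and the variation is limited to a single-base substitution. The sketch assumes a C++11 compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical interface; NOT the actual gArrayRetriever declaration.
class FeatureRetriever {
public:
    virtual ~FeatureRetriever() {}
    // Compute one feature value for each position in [start, end) of seq.
    virtual std::vector<double> calculate(const std::string& seq,
                                          std::size_t start,
                                          std::size_t end) const = 0;
};

// Example algorithm: GC fraction in a window of +/- halfWindow bases,
// truncated at the sequence boundaries.
class GcContentRetriever : public FeatureRetriever {
public:
    explicit GcContentRetriever(std::size_t halfWindow) : half_(halfWindow) {}

    std::vector<double> calculate(const std::string& seq,
                                  std::size_t start,
                                  std::size_t end) const override {
        assert(start <= end && end <= seq.size());
        std::vector<double> out;
        out.reserve(end - start);
        for (std::size_t i = start; i < end; ++i) {
            const std::size_t lo = (i >= half_) ? i - half_ : 0;
            const std::size_t hi = std::min(seq.size(), i + half_ + 1);
            std::size_t gc = 0;
            for (std::size_t j = lo; j < hi; ++j) {
                if (seq[j] == 'G' || seq[j] == 'C') ++gc;
            }
            out.push_back(static_cast<double>(gc) / static_cast<double>(hi - lo));
        }
        return out;
    }

private:
    std::size_t half_;
};

// After a substitution at position `changed`, recompute only the positions
// whose window contains that base and leave all other values untouched.
void updateAfterSubstitution(const FeatureRetriever& retriever,
                             const std::string& seq,
                             std::size_t changed,
                             std::size_t halfWindow,
                             std::vector<double>& feature) {
    assert(feature.size() == seq.size() && changed < seq.size());
    const std::size_t lo = (changed >= halfWindow) ? changed - halfWindow : 0;
    const std::size_t hi = std::min(seq.size(), changed + halfWindow + 1);
    const std::vector<double> patch = retriever.calculate(seq, lo, hi);
    std::copy(patch.begin(), patch.end(),
              feature.begin() + static_cast<std::ptrdiff_t>(lo));
}
```

Code that uses only the FeatureRetriever interface does not need to know which algorithm produces the values, so a different calculation can be swapped in by deriving another class. Insertions and deletions would also shift positions and are deliberately left out of this sketch.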

Requirements

The library is written in standard ISO C++ and should compile on most platforms, provided that an ISO C++ compliant compiler is available. It has no dependencies other than the C++ standard library (vector, string and ostream). It comes with a CMakeLists.txt file that requires CMake version 2.8 or higher.

Library installation

To install the library, download the package from http://bioinformatics.emedea.it/geco/geco-0.1.tar.gz, unpack it in a directory of your choice and follow the instructions in the INSTALL file.

Examples

At http://bioinformatics.emedea.it/geco you can find a complete tutorial with code examples that covers many of the features of this library.

License

Copyright (C) 2010 by Uberto Pozzoli and Matteo Cereda (uberto.pozzoli@bp.lnf.it)

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.