GeCo++ library tutorial

Auhtors:

  • Matteo Cereda
  • Uberto Pozzoli

Introduction

GeCo++ (Genomic Computation C++ Library) is a library developed to manage genomic elements annotation, sequences and positional genomic features and to help users to keep track of genomic variations. This tutorial is an introduction to both concepts and usage of the library. It is not a complete reference but covers quite completely the main features of the library. For a complete reference on classes, functions and other details please refer to the Doxygen generated reference manual.

Motivations

C/C++ (C++ can actually be considered an object-oriented extension of C language that keeps an high degree of source and link compatibility [1]) are among the most widely used languages in bioinformatics “core” applications. Blast, ClustalW, Blat, Phred, the entire EMBOSS Suite, the seqAn and libSequence libraries are a few but particularly suggestive examples of successful applications/algorithm libraries written in C/C++. Usually people use script languages such as Perl (BioPerl), Python (BioPython) or R (Bioconductor) to provide data in a suitable format to call external applications, parse the output and analyse the results, possibly integrating them with annotations that come from other sources. As the reviewer suggests, this is a quite long and repetitive task. Besides requiring a significant programming effort it greatly impacts on performances since I/O operations are both time and memory consuming. Furthermore, this approach poses severe limitations to the flexibility by which algorithms can be invoked.
One could argue that Perl and Python are particularly fitted to mangle data to/from different formats and to manipulate text information (i.e. genomic sequences) while R is an extremely powerful statistical analysis environment,so why not writing algorithms using these languages? The answer is that R, Perl and Python are interpreted languages and thus extremely slower and less memory efficient than compiled ones such as C and C++, as extensively reviewed in [2].
Memory and time have always been an issue in algorithm implementation and program development. In the past this has been partially compensated by hardware improvements. Nevertheless, with the advent of Next Generation Sequencing (NGS), and the exponential growth of biological data available for genome-wide studies, these are becoming big issues again.
At least for R and Perl, this memory/time inefficiency has been partially resolved by providing interfaces to C/C++ modules. In this way one can write the “core” routines (the most memory/computation intensive) in C or C++ and then access them from the engine by loading these modules (e.g. many R/Bioconductor packages are extensively written in C). So, ultimately: we are using C/C++ again. Moreover, we still have to deal with specific and often complex data structures by which the script engine exchange data with the C/C++ modules. Better would be to move anything possible to the C/C++ layer leaving to the engine what it is specialised for (e.g. statistical analysis, graphic representation, data format mangling).
GeCo++ provides a set of classes specifically designed to add those characteristics that, lacking in C++, are present in most of the mentioned packages and that makes them so widely used in bioinformatics. By using GeCo++, most of the post processing, data integration and retrieval can be done directly from C++ programs, fully exploiting the above-mentioned important advantages of compiled languages. Furthermore, the library presents some improvement over plain C++ introducing an implementation of non-duplicating array subsetting, that helps a lot in avoiding memory duplications. Beside the advantages of using C++, GeCo++ also provides the definition of an operative genomic-element model that tightly integrates information about genomic-ranges, positions, genomic variations (namely, substitutions, insertions and deletions) and computed/retrieved numerical features. The core idea of the model is to refer genomic ranges and positions to a reference sequence and to add a set of variations (substitutions, insertion and deletions) to represent actual elements (i.e. mutated ones, individual haplotypes and so on). Based on this model, the library provides a class (gElement) that has the following capabilities:
  • memory efficient representation of alignments in terms of reference/variations: this can save a lot of memory, especially when relatively few variations are involved;
  • automatic tracking of position changes introduced by insertions/deletions;
  • easy mapping of corresponding positions between different sequences;
  • automatic recalculation of numerical features only where variations make it necessary, thus avoiding redundant recalculations in unvaried regions.

To our knowledge, this is an approach that hasn't been implemented in any other package yet, irrespectively by the language used. We believe that it can greatly simplify the development of complex applications, especially when variations have to be considered. This latter is a condition that will become more and more frequent with the availability of NGS sequencing data for personalized genomics (e.g. 1000 Genomes project). It is also particularly suited for population genomics studies.

[1] Stroustrup B. (1999) An Overview of the C++ Programming Language, From The Handbook of Object Technology CRC Press, ISBN 0-8493-3135-8 (available at http://www.computer-books.us/cpp_2.php)
[2] Fourment M., Gillings M.R. (2008) A comparison of common programming languages used in bioinformatics. BMC Bioinformatics, 9, 82.

General description

The library has been developed starting from the idea to represent and manage the numeric results of computational algorithms, keeping them tied to annotations of genomic elements (transcripts, binding sites, conserved regions, transposable elements etc.), to their sequences and to genomic variations. Given this level of abstraction and the inherent complexity, we considered an object oriented software model to be the most appropriate. The choice of ISO C++ guaranteed speed, portability and ,most importantly for users, the access to a great number of other efficient and specialized computational biology libraries.
GeCo++ defines two fundamental classes: gArray and gElement. The first one is a general purpose template array class. The development of such a class was justified by the need to provide both tracking of undefined/invalid elements (NA: Not Available) and array subsetting in a memory efficient way. NA tracking is obtained through a speed optimized bits array class (gBitsArray), while memory efficient subsetting is accomplished by mean of an internal reference counting mechanism that allows instantiation of array subsets without data duplication. A number of optimized methods has been added to perform typical array operations needed in bioinformatics, such as sorting, values finding, counting, reversion, descriptive statistics and others. Windowed versions of the same methods are also provided. Furthermore, a gArray object can be indexed through both a scalar value or by another gArray for multiple indexing. Type casting capabilities (between different template specializations) are also provided. Three additional classes have been derived from gArray: gMatrix, still a template, but with additional methods to treat matrices; gString, that specializes gArray to manage character strings, and gSequence that inherits from gString adding specific sequence management methods.
A genomic element is defined as an interval of a given reference sequence along a given strand. Reference positions are defined as absolute (unsigned) positions along a reference while element ones are relative (signed) to the beginning of an element along its strand. Sites are defined as particular positions along the element (i.e. transcription start sites, splice sites or protein binding sites) while a connection represents a directed relation between two sites (introns, exons). A Positional Feature (PF) is defined as a property that varies along the element. While no assumption is made on the biological meaning of sites connection and features, this model is general enough to represent the majority of real-world genomic elements and their features.
The GeCo++ library defines the class gElement as an implementation of this model: it allows users to instantiate objects representing a genomic element, which can contain sequences, as well as sites, connections and features information. Element positions can be converted to reference ones and back as well as one element positions can be mapped onto another element. Sequence and features are maintained as gArray objects, avoiding unnecessary data-duplication. Their retrieval or calculation has been kept independent from the gElement object itself by defining a hierarchy of gArrayRetriever objects, from which users can easily derive new classes implementing sequence retrieval as well as feature calculation algorithms. This mechanism makes it easier to develop applications which are independent from specific computational algorithms.
The most important feature of gElement objects is that they can be instantiated as sub-intervals of another one, considering a strand and the presence of genomic variations. Sequence, sites, connections and features are inherited by the new object consistently with the interval, the strand and the variations. Features recalculation and sequence retrieval are kept to a minimum, avoiding unnecessary recalculation at unaffected positions. This makes it very straightforward to evaluate the effect of genomic variations on features, positions and sequence.

Requirements

The library is written in standard ISO C++ and should compile on most platforms, given an ISO C++ compliant compiler is provided. It doesn't depend on other libraries except than on stl (vector, string and ostream). It comes with a cmake CMakeList file that requires cmake version 2.8 or higher.
In order to be used, the library (as well as this tutorial) requires a fairly good knowledge of C++.

Installation

To install the library download the package from: http://bioinformatics.emedea.it/geco/. Unzip the tarball in a directory, enter the directory and:

mkdir build
cmake ../
make
sudo make install

What is GeCo++ for? An introductory example

To show some of the possible applications of the library, we now report a (relatively) short example of how an usual genomic task can be achieved. We want to evaluate the effect of some genomic variations on a transcript (NM_016404). The variations we are going to analyse are completely artificial: they do not represent real mutations or polymorphisms.
We will not go deep into internal details of the object we will use: details on objects will be provided in the following chapters of this tutorial.
To use the library you need to include the geco.h header file and state that we are using the geco namespace:
#include <geco.h>
using namespace geco;
Now we will instantiate some object that will be used later: gLocal2bitSequenceRetriever is the first retriever we encounter, it retrieves a sequence from the .2bit file specified in the constructor. As said before, we are for now not interested in specifying how a retriever can be written, but only explaining how and when it can be used. This particular one will not be described here, since there is quite a lot of details inherent the 2bits format which could confound the reader. In the example application chapter we will describe a simpler sequence retriever based on text files. gMySqlDb is an object used to access MySql databases.
Both these objects are not part of the GeCo++ library: they are included in a specialized utilities collection we currently use in our lab, and has many dependencies on other libraries (if interested, you can contact us to have more information).
//Let's instantiate a sequence retriever
gLocal2bitSequenceRetriever seq_retriever("/adat/database/hg18/hg18.2bit",true);
//And a Db Connection (actual values for the parameters have been hidden)
gMySqlDb UCSCDb(dbname,hostname,dbuser,dbpass);
//gMySqlDb UCSCDb("hg18","genome-mysql.cse.ucsc.edu","genome","");
We now instantiate an object of type gTranscript. To do this we make a call to a function (also contained in the utility lib) that fetches the transcript information from the "refGenes" table of the UCSC Genome Browser Database using the provided database connection. We will describe in details this function in the section "Transcripts", where we'll also see how to directly instantiate a gTranscript.
The function requires the user to pass the refSeq ID of the transcript of interest, a connection object, a sequence retriever and another parameter we'll explain later. The user can also specify start and end offsets. The function returns a gTranscript object.
gTranscript aTranscript=getUCSCRefseq("NM_016404",UCSCDb,seq_retriever,gRef,200,200);
gTranscripts derive from the class gElement. One thing to keep in mind is that there are two different types of coordinates, namely element coordinates and reference coordinates. The element coordinates represent the positions referring to the beginning of the element (0), while reference coordinates are related to the particular reference used to build the element (for instance, a chromosome in a given assembly). The class gElement has a group of functions which allow for converting coordinates from one type to the other. In this first example we get the transcript interval in element coordinates, and then we'll translate it into reference coordinates.
// we can for example obtain the genomic length of the transcript
cout << left << setw(48) << "Element length (including offsets): ";
cout << aTranscript.getLength() << endl;

//the element interval is returned as a 2 values gArray 
gElementInterval boundaries=aTranscript.getPremRNAInterval();  
cout << left << setw(48) << "Transcript in element coordinates: ";
cout << boundaries.getStart() << "-" << boundaries.getEnd() << endl;  

//we can convert it to reference coordiantes
gReferenceInterval refBoundaries=aTranscript.getReferenceIntervalFromElement(boundaries);
//getReferenceName returns the name of the reference used to build the object
cout << left << setw(48) << "Transcript in reference coordinates: ";
cout << aTranscript.getReferenceName() << ":" << refBoundaries.getStart() << "-" << refBoundaries.getEnd() << endl;
Element length (including offsets): 1269
Transcript in element coordinates: 200 - 1069
Transcript in reference coordinates: chr11: 63840740 - 63841609
As shown in the following example, many other information can be retrieved using gTranscript. Of particular interest is the opportunity to obtain exons (and introns) information, as well as obtaining a new gElement that describe an exon. These elements will share the same reference with the gTranscript they derive from.
//The strand relative to the reference
cout << left << setw(48) << "Strand relative to reference,(false = reverse):";
cout << aTranscript.isForward() << endl;

//The number of exons
cout << left << setw(48) << "Exon count:";
cout << aTranscript.getExonCount() << endl;

//The length of the mRNA
cout << left << setw(48) << "mRNA length:";
cout << aTranscript.getPremRNALength() << endl;

//The interval of one exon
boundaries=aTranscript.getExonInterval(1);
cout << left << setw(48) << "Second exon element boundaries:";
cout << boundaries.getStart() << "-" << boundaries.getEnd() << endl;  

//Its reference coordinates
refBoundaries=aTranscript.getReferenceIntervalFromElement(boundaries);
cout << left << setw(48) << "Second exon reference boundaries:";
cout <<  aTranscript.getReferenceName() << ":" << refBoundaries.getStart() << "-" << refBoundaries.getEnd() << endl;

//Its phase  
gArray<gShortUnsigned> phases=aTranscript.getExonPhase(1);
cout << left << setw(48) << "Second exon phases:";
cout <<  phases[0] << "-" << phases[1] << endl;  

//A new gElement object representing the exon
gElement anExon=aTranscript.getElement(aTranscript.getExonInterval(1));
//From which you can, for example, obtain the sequence
cout << left << setw(48) << "Second exon sequence:";
cout << anExon.getSequence() << endl << endl;
Strand relative to reference,(false = reverse): 0
Exon count: 4
mRNA length: 869
Second exon element boundaries: 396-498
Second exon reference boundaries: chr11:63841311-63841413
Second exon phases: 0-0
Second exon sequence: GCCACCGAGGTCCGTATCTGCCCTGTGGAATTCAACCCCAACTTCGTGGCGCGTATGATACCTAAAGTGGAGTGGTCGGCGTTCCTGGAGGCGGCCGATAAC
What makes this library useful is the ability to define genomic variations and to "apply" them to already instantiated objects. A variation object must contain a reference to some genomic region. This reference can be obtained from any gElement object (including gTranscripts). In the following example we construct such an object, passing to the constructor the transcript object we instantiated before. In this way, the variation object can contain any variation occurring in the genomic region shared with that object.
// Create gVariations object using the transcript reference
gVariations variations(aTranscript);
Variations can now be added. We defined three types of variation, namely: Insertions, Deletions and Substitutions. Obviously one has to specify where the variation occurs. In order to do this, the member functions used to add variations require you to pass an object which the variation positions will be referred to.
To clarify this concept, we will add three variations in the following example. The first one is an insertion of 10 bases, whose position is referred to the same reference used to initialize the variations object. The second variation is a 94 bp deletion occurring between (element) positions 400 and 494, referred to the transcript object. The third one is a substitution of one base occurring between positions 60 and 61 relative to the third exon of our transcript.
//Add some variations to the object using different set of coordinates
variations.addInsertion(aTranscript.getReferenceName(),63841709,gSequence("TTTTTTTTTT"));  
variations.addDeletion(aTranscript,gElementInterval(400,493));
variations.addSubstitution(aTranscript.getElement(aTranscript.getExonInterval(2)),gElementInterval(60,61),gSequence("T"));
gTranscript objects, like any other gElement, can contain "features". Features are quantitative values depending on the position along the object, providing a convenient way to keep track of biologically relevant positions and of the values of the feature itself. As well as any other characteristic of the object, features are efficiently recalculated in sub-elements even if variations are applied. To illustrate this, we add now a feature to our aTranscript object. As an example, we chose to test for SF2/ASF (a well known splicing enhancer) binding sites. We use a Positional Weight Matrix (PWM) approach to accomplish this task. It consists in evaluating how well a given sequence of length n matches a frequency (or weight) matrix (of size 4 x n), that assigns a score to each nucleotide for each position. The sum of these values (normalized in various ways) is then used as the score for the given sequence. If we repeat this procedure for each sub-sequence of length n along a certain sequence, we obtain a series of values representing how well the sequence matches the consensus for that element at any given position. A threshold is usually provided to identify putative binding sites.
We will now add a feature to represent the scores obtained of a PWM specific for SF2/ASF (Smith, P. J., Zhang, C., Wang, J. Chew, S. L., Zhang, M. Q. and Krainer, A. R. 2006. An increased specificity score matrix for the prediction of SF2/ASF-specific exonic splicing enhancers. Hum. Mol. Genet. 15(16): 2490-2508.) In order to add a feature, we need a retriever, the object that actually makes the calculations. In this example we won't describe in details the algorithm, but we will use an object derived from gFeatureRetriever,described later in the chapter about positional features. We will only say that the retriever returns the scores that pass the threshold, or 0 otherwise.
//The matrix raw data for SF2/ASF 
static const gScore SF2ASF_data[28]={
-1.139669, 0.621677,-1.584962, 1.323616,-1.584962,-1.584962, 0.621677, 
 1.365525,-1.103778, 0.725053, 0.330068, 0.938765,-1.584962,-1.584962,
-0.211782, 0.174455, 0.478823,-1.584962, 0.334650, 0.992775,-0.105105, 
-1.584962,-0.500847,-1.584962,-1.127254,-1.584962,-1.127254, 0.268437
};
//A gMatrix object representing the PWM for SF2/ASF
static const gMatrix<gScore> SF2ASF(4,7,SF2ASF_data);
//the threshold to identify putative elements
gScore SF2ASF_threshold=1.956;
//Retriever instantiation
gFeatureRetriever<gScore> sf2retriever(PWMScoreRetriever(SF2ASF,SF2ASF_threshold));
//Now we can add the feature
aTranscript.addFeature<gScore>("SF2/ASF",sf2retriever,0);
We can now instantiate a new gTranscript object, applying the variations to the transcript we instantiated before. To this purpose we can use a "copy with variations" constructor:
gTranscript vTranscript(aTranscript,variations);
Using this new object we can replicate the steps we have seen before to obtain transcript information. By comparing these results, we can appreciate the effects of the variations applied. Note that to facilitate comparisons we also reported in the following output the result obtained before without reporting the code (for clarity).
cout << "Varied Transcript Results" << endl;
cout << "---------------------------------------------------------------------------------------------" << endl;
boundaries=vTranscript.getPremRNAInterval();
refBoundaries=vTranscript.getReferenceIntervalFromElement(boundaries);
cout << left << setw(48) << "Element length (including offsets): ";
cout << vTranscript.getLength() << endl;  
cout << left << setw(48) << "Transcript in element coordinates: ";
cout << boundaries.getStart() << "-" << boundaries.getEnd() << endl;
cout << left << setw(48) << "Transcript in reference coordinates: ";
cout << vTranscript.getReferenceName() << ":" << refBoundaries.getStart() << "-" << refBoundaries.getEnd() << endl;
cout << left << setw(48) << "Strand relative to reference,(false = reverse):";
cout << vTranscript.isForward() << endl;
cout << left << setw(48) << "Exon count:";
cout << vTranscript.getExonCount() << endl;
cout << left << setw(48) << "mRNA length:";
cout << vTranscript.getmRNALength() << endl;
boundaries=vTranscript.getExonInterval(1);
refBoundaries=vTranscript.getReferenceIntervalFromElement(boundaries);
cout << left << setw(48) << "Second exon element boundaries:";
cout << boundaries.getStart() << "-" << boundaries.getEnd() << endl;
cout << left << setw(48) << "Second exon reference boundaries:";
cout <<  vTranscript.getReferenceName() << ":" << refBoundaries.getStart() << "-" << refBoundaries.getEnd() << endl;
phases=vTranscript.getExonPhase(1);
cout << left << setw(48) << "Second exon phases:";
cout <<  phases[0] << "-" << phases[1] << endl;  
//Here we do not instantiate a new gElement but obtain the sequence directly    
cout << left << setw(48) << "Second exon sequence:";
cout << vTranscript.getSequence(vTranscript.getExonInterval(1)) << endl << endl;
"Wild-type" Transcript Results
---------------------------------------------------------------------------------------------
Element length (including offsets): 1269
Transcript in element coordinates: 200-1069
Transcript in reference coordinates: chr11:63840740-63841609
Strand relative to reference,(false = reverse): 0
Exon count: 4
mRNA length: 581
Second exon element boundaries: 396-498
Second exon reference boundaries: chr11:63841311-63841413
Second exon phases: 0-0
Second exon sequence: GCCACCGAGGTCCGTATCTGCCCTGTGGAATTCAACCCCAACTTCGTGGCGCGTATGATACCTAAAGTGGAGTGGTCGGCGTTCCTGGAGGCGGCCGATAAC

Varied Transcript Results
---------------------------------------------------------------------------------------------
Element length (including offsets): 1186
Transcript in element coordinates: 210-986
Transcript in reference coordinates: chr11:63840740-63841609
Strand relative to reference,(false = reverse): 0
Exon count: 4
mRNA length: 488
Second exon element boundaries: 406-415
Second exon reference boundaries: chr11:63841311-63841413
Second exon phases: 0-0
Second exon sequence: GCCAATAAC
The total element length decreases of 83 bp due to the insertion of 10 bp and the deletion of 93 bp. These are also reflected in the transcript boundaries in element coordinates. Note that transcript boundaries in reference coordinates are unaffected. Variations affects elements and not references. Strand e exon counts are unaffected while mRNA length is 93 bp shorter (deletion). The second exon is also affected in its boundaries and sequence that is 93 bp shorter maintaining the first 4 and the last 5 bps. Again, its reference mapped boundaries are unaffected.
The substitution we introduced is at the 60th bp of the third exon. We can appreciate its effect (C->T) by printing the corresponding sequences:
//We get the sequence corresponding to the third exon in both the transcripts:
gSequence e3s=aTranscript.getSequence(aTranscript.getExonInterval(2));
gSequence ve3s=vTranscript.getSequence(vTranscript.getExonInterval(2));
//then we construct string of the same length with an asterisk where the sequences are not equal.
gString mark(e3s.getLength(),'|');
gArray<gPos> diffpos=which(e3s!=ve3s);
mark.setValues(diffpos,'*');
cout << "Third exon sequence comparison" << endl;
cout << "---------------------------------------------------------------------------------------------" << endl; 
cout << "Wild-Type:\t" << e3s << endl;
cout << "          \t" << mark << endl;
cout << "Varied:   \t" << ve3s << endl;
Third exon sequence comparison
---------------------------------------------------------------------------------------------
Wild-Type: TTGCGTCTGATCCAGGTGCCGAAAGGGCCGGTTGAGGGATATGAGGAGAATGAGGAGTTTCTGAGGACCATGCACCACCTGCTGCTGGAG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||*|||||||||||||||||||||||||||||
Varied: TTGCGTCTGATCCAGGTGCCGAAAGGGCCGGTTGAGGGATATGAGGAGAATGAGGAGTTTTTGAGGACCATGCACCACCTGCTGCTGGAG
Now we want to verify whether this substitution has an effect on the product encoded by this gene. First we want to check if something happens in the translation:
cout << "Third exon Translated sequences comparison" << endl;
cout << "---------------------------------------------------------------------------------------------" << endl;  
//we map the third exon interval onto the Protein sequence
gElementInterval mintv=aTranscript.mapElementIntervalOnProtein(aTranscript.getExonInterval(2));
//get the corresponding protein sequence
cout << "Wild-Type:\t" << aTranscript.getProteinSequence(mintv) << endl;
//then we do the same for the varied transcript
mintv=vTranscript.mapElementIntervalOnProtein(vTranscript.getExonInterval(2));  
cout << "Varied:   \t" << vTranscript.getProteinSequence(mintv) << endl;
Third exon Translated sequences comparison
---------------------------------------------------------------------------------------------
Wild-Type: LRLIQVPKGPVEGYEENEEFLRTMHHLLLE
Varied: LRLIQVPKGPVEGYEENEEFLRTMHHLLLE
The protein portion corresponding to the third exon is the same in both version of our transcript, thus the substitution we are analysing is silent. Now we want to verify if it might have an effect on the gene by disrupting some binding sites for splicing regulatory elements. To this purpose, we use the feature we added before that represents an estimate of the probability that a given position along the transcript sequence can serve as a binding site for SF2/ASF proteins.
Since we are interested in what happens in the third exon, where the mutation occurs, we will instantiate a gElement object representing the 3rd exon. The new gElement, being obtained as an interval copy of the transcript object, already contains the feature we added before, allowing us to retrieve the scores we are interested in. In particular we will search for positions corresponding to non null scores.
//We get the element representing the 3rd exon
gElement wtExon3=aTranscript.getElement(aTranscript.getExonInterval(2));
//then we retrieve the feature values for this element
gArray<gScore> wtScores=wtExon3.getFeature<gScore>(0,gElementInterval(0,wtExon3.getLength()));
//Finally we search for non null values
gArray<gPos> wtHits=which(wtScores>(gScore)0);

cout << "\"Wild-type\" third exon SF2/ASF elements:\nPosition\tScore" << endl;
cout << "------------------------" << endl;
for(gSize i=0;i<wtHits.getSize();i++){
 cout << wtHits[i] << "\t\t" << wtScores[wtHits[i]] << endl;
}
cout << endl << endl;
"Wild-type" third exon SF2/ASF elements:
Position Score
------------------------
15 2.64948
24 2.598
25 2.61366
40 2.11102
49 2.11102
60 4.61622
85 1.98779
Note that a binding site have been identified at position 60 (where the substitution occurs)
We are now interested in seeing what happens in the third exon of the varied version of our transcript. To do so we can exactly repeat the steps we have seen before. We then check if the two objects are equal and we output our new results.
gElement vExon3=vTranscript.getElement(vTranscript.getExonInterval(2));
gArray<gScore> vScores=vExon3.getFeature<gScore>(0,gElementInterval(0,vExon3.getLength()));
gArray<gPos> vHits=which(vScores>(gScore)0);

//Output the results
cout << "\"Mutated\" third exon SF2/ASF genomic positions:\nPosition\tScore" << endl;
cout << "------------------------" << endl;
for(gSize i=0;i<vHits.getSize();i++){
 cout << vHits[i] << "\t\t" << vScores[vHits[i]] << endl;
}
cout << endl << endl;

//Check if they are equal
if(vScores.equals(wtScores)){
 cout << "vScores and wtScores are the same!" << endl << endl;
}else{
 cout << "vScores and wtScores differ!" << endl << endl;
}
"Mutated" third exon SF2/ASF genomic positions:
Position Score
------------------------
15 2.64948
24 2.598
25 2.61366
40 2.11102
49 2.11102
85 1.98779

vScores and wtScores differ!
As you may appreciate comparing the last two outputs, the effect of the substitution at position 60 in the third exon is to disrupt the putative binding site. We want to check how the binding site has been modified by the substitution, and to do this we instantiate two new gElement objects representing the binding site in each case, and then output their sequences. Furthermore we want the genomic reference boundaries of the element and so we call the interval conversion member on the wt element. These positions can for example be used in a genome browser to check for other features of interest(conservation regulatory potential and so on...).
gElement wtESE(wtExon3,wtHits[5],wtHits[5]+7);
gElement vESE(vExon3,wtHits[5],wtHits[5]+7);
cout << "The wt binding site:    \t" << wtESE.getSequence() << endl;
cout << "The varied binding site:\t" << vESE.getSequence() << endl;  
  
refBoundaries=wtESE.getReferenceInterval();
cout << "ESE Reference coordinates:\t";
cout << wtESE.getReferenceName() << ":" << refBoundaries.getStart() << "-" << refBoundaries.getEnd() << endl;
The wt binding site: CTGAGGA
The varied binding site: TTGAGGA
ESE Reference coordinates: chr11:63841128-63841135

Arrays

gArray is a template class used throughout the library to represent arrays of different types. It has been developed for two main reasons: to make subsetting efficient in terms of memory usage, and to keep trace of Not Available values (NAs). Efficient subsetting is obtained through a reference counting mechanism: gArray maintains an internal reference to a gArrayInternal, which actually carries out the desired task. A gArrayInternal object has a reference counter that is increased when the object is referenced by a gArray object and decreased when a gArray object referencing it is destroyed. gArrays also contains the starting and ending positions relative to the gArrayInternal object, in order to allow for subsetting. In other words, a number of different gArray objects could reference the same gArrayInternal object with different starting and/or ending positions. Only if one of the values of a gArray instance is changed, the original gArrayElement is dereferenced and a copy of the relevant data (i.e. between start and end positions) is made in a new gArrayElement object. NAs tracing is achieved by mean of a specialized gBitsArray class that manages bits arrays. gBitsArray maintains an internal pointer to a gBytes allocated array (typedefined in geco_define.h) and performs a fast calculation to map array position onto the bits array (gBytes is defined by default as unsigned char, but can be modified recompiling the library).

Accessing gArray elements

In this tutorial we will mainly use the gArray template instantiation with T = gShortUnsigned (actually a short integer). Let's imagine we have a gArray instantiation containing 10 elements. You can individually access array values by position, using the bracket operator as well as the isNA() member function that will return the NA status at the given position. You can also obtain the array size and the number of NA values by calling the getSize() and NACount() member functions respectively. Please note also that positions are always interpreted as 0 based, except in cases where they represent the end of some interval (this will be discussed later).
cout << "Array size: " << AnArray.getSize() << endl;
cout << "Number of NAs in the array: " << AnArray.NACount() << endl;
cout << "Array value at position 0:  " << AnArray[0] << endl;
cout << "NA status at position 0:" << AnArray.isNA(0) << endl;
cout << "Array value at position 8:  " << AnArray[8] << endl;
cout << "NA status at position 8:" << AnArray.isNA(8) << endl;
Array size: 10
Number of NAs in the array: 1
Array value at position 0: 10
NA status at position 0: 0
Array value at position 8: 90
NA status at position 8: 1
For our convenience, we define an overloaded output streaming operator that will be used in the following examples to show array values and NA status:
template <class T> 
ostream & operator << (ostream & out,const gArray<T> & array) {
 size_t lfield;
 if(typeid(T) == typeid(gScore) ){
  lfield=7;
  cout.precision(1);
  cout << fixed;
 }else{
  lfield=5;
 }
 string sep(4+lfield*array.getSize(),'-');
 out << "Pos:";
 for (gPos i=0;i<array.getSize();i++) {
  out << setw(lfield) << i;
 }
 out << endl << sep.c_str() << endl;
 out << "Val:";
 for (gPos i=0;i<array.getSize();i++) {
  if(array.isNA(i)) out << setw(lfield) << "*";
  else out << setw(lfield) << array[i];
 }
 out << endl;
 return out;
}
If you want to retrieve multiple array values at given positions, you can use the multiple values brackets operator, which has been implemented as a template and accepts a gArray reference as parameter. The values contained in the input array are interpreted as (cast to) positions, and the corresponding element values are returned in a new gArray object. This allows to use any gArray object for indexing purposes. For instance, if you have an array of characters, the values of the characters are cast to gPos type and used for indexing. It's worth noting that if any position is marked as NA, the resulting value will also be marked as NA. Moreover, a NA value will be returned for any position exceeding the array size. A find member function has also been defined to find occurrence positions of a given value inside a gArray. It returns a gPos array.
cout << "Elements of the array:\n" << AnArray << endl;
cout << "Positions of the elements to retrieve:\n" << AnArrayOfPositions << endl;
cout << "Array values at the given positions:\n" << AnArray[AnArrayOfPositions] << endl;
cout << "An Array of characters:\n" << AnArrayOfCharacters << endl;
cout << "Array values indexed by an array of characters:\n" << AnArray[AnArrayOfCharacters] << endl;
Elements of the array:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 10 20 30 40 50 60 70 80 * 100

Positions of the elements to retreive:
Pos: 0 1 2 3
------------------------
Val: 2 * 6 8

Array values at the given positions:
Pos: 0 1 2 3
------------------------
Val: 30 * 70 *

An Array of characters:
Pos: 0 1 2 3 4
-----------------------------
Val: a b c d e

Array values indexed by an array of characters:
Pos: 0 1 2 3 4
-----------------------------
Val: * * * * *
Note that values of the character are all greater than the array size (10).

How to instantiate gArrays

Arrays can be instantiated in a variety of ways. In this tutorial as an example we will use the gShortUnsigned specialization of the template class gArray. To instantiate a single value array you can for example use the Single Value constructor, by specifying the single value you want to assign. The single value will be considered by default as valid (not NA). Please note that this constructor will be automatically invoked whenever gArray is required and you pass a value of type T or of some type that can be cast to T.
gArray<gShortUnsigned> A(15);
By using the Single Value Constructor, arrays can be instantiated with a given size and initialized to default values and NA status. You can also specify that you don't want any initialization by setting the initialize parameter to false (default is true). In this latter case, values and NA status will be random:
gArray<gShortUnsigned> A(45,10,false);
cout << "Array directly instantiated with a given length (10)" << endl;
cout << "and initialized to a default value (45) and NA status (false):\n"<< A << endl;
gArray<gShortUnsigned> B(45,10,false,false);
cout << "Array directly instantiated with a given length (10)" << endl;
cout << "without initialization:\n"<< B << endl;
Array directly instatiated with a given length (10)
and initilized to a default value (45) and NA status (false):
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 45 45 45 45 45 45 45 45 45 45

Array directly instatiated with a given length (10)
without initialization:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val:32552368 0 4 0 6 0 8 0 0 0
In the last example since the array is instatiated without initialization values are random and depend only by the status of the memory being allocated.
Here we initialize some data, to be used in the following examples
//raw data initialization
gShortUnsigned raw_data[10];
for (int i=0;i<10;i++) raw_data[i]=rand() % 1000 + 1;
By using the raw data constructor, you can instantiate and initialize a gArray using a pointer to raw data and its size. Obviously the pointer has to be of the same type of the template instantiation type. In this case, the values are considered as not NAs by default. Please note that the data provided are internal so you don't need to preserve the provided raw data.
//Using raw (random in this case) data to initialize an array
gArray<gShortUnsigned> A(raw_data,10);
cout << "An array initialized with 10 random values:\n" << A << endl;
An array initialized with 10 random values:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 794 336 387 493 650 422
Optionally you can assign to the constructor a third parameter: a positions array, which specifies the positions that have to be marked as NAs. You can construct this array (actually a gArray) using any of the methods provided for array instantiation.
gPos raw_positions[3]={0,3,7};
gArray<gShortUnsigned> A(raw_data,10,gArray<gPos>(raw_positions,3));
cout << "An array initialized with 10 random values and 3 NA positions:\n" << A << endl;
gArray<gShortUnsigned> B(raw_data,10,5);
cout << "An array initialized with 10 random values and 1 NA position:\n" << B << endl;
An array initialized with 10 random values and 3 NA positions:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: * 887 778 * 794 336 387 * 650 422

An array initialized with 10 random values and 1 NA position:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 794 * 387 493 650 422
Notice that in the second case the NA position array is constructed implicitly from the integer provided. This is possible thanks to the existence of the first constructor we've previously seen.
The Range Copy constructor allows to instantiate a new array from an existing one, optionally specifying the starting and ending positions. The new array does not make a copy of the data. It references the data already allocated by the first one, specifying the range of boundaries. This reference will remain valid as long as none of the values in the new array changes. In this case the data comprised between the range boundaries are copied in a newly allocated slot. When specifying the range it's important to remember that, throughout the library, while positions are always interpreted as 0 based, end positions are always 1 based.
gArray<gShortUnsigned> A(raw_data,10,5);
cout << "The array to copy from:\n" << A << endl;
//Simple copy
gArray<gShortUnsigned> B(A);
cout << "A copied array:\n" << B << endl;
//Range copy
gArray<gShortUnsigned> C(A,3,8);
cout << "A range copied array:\n" << C << endl;
The array to copy from:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 794 * 387 493 650 422

A copied array:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 794 * 387 493 650 422

A range copied array:
Pos: 0 1 2 3 4
-----------------------------
Val: 916 794 * 387 493
Finally you can instantiate an empty array, then assign value(s) as we'll see in the next section:
gArray<gShortUnsigned> A;
cout << "The size of an empty array:\t" << A.getSize() << endl;
The size of an empty array: 0

How to assign values to gArrays

Array values can be assigned in different ways. The member function setValues allows to assign a value at any position regardless of the current size, the array will be indeed resized accordingly. Added positions (except the one you are assigning the value to) are initialized by default to 0 with NA set to true. You can however specify the fillvalue and fillNA parameters:
gArray<gShortUnsigned> A;
A.setValue(0,15);
cout << "After assigning a value at position 0:\n" << A << endl;
A.setValue(4,25,false);
cout << "After assigning a value at position specifying NA status:\n" << A << endl;
A.setValue(9,35,true,3,false);
cout << "After assigning a value at position 9 specifying fillvalue and fillNA parameters:\n"<< A << endl;
The size of an empty array: 0
After assigning a value at position 0:
Pos: 0
---------
Val: 15

After assigning a value at position spcifing NA status:
Pos: 0 1 2 3 4
-----------------------------
Val: 15 * * * 25

After assigning a value at position 9 specifing fillvalue and fillNA parameters:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 15 * * * 25 3 3 3 3 *
notice how previously assigned values and status are maintained
Using setValue requires to specify both the value and the NA status. If the value you want to assign is from another gArray, you can use another version of setValue. If using this other version, you are required to specify the position which the value must be assigned to, the gArray and the relative position of the value you want to copy. For what concerns any other aspect, this version behaves exactly like the other one. The example below takes a value from the last example's A array and copies it to the newly instantiated one.
gArray<gShortUnsigned> B;
B.setValue(9,A,5,4,false);
cout << "Source array:\n" << A << endl;
cout << "Destination Array:\n" << B << endl;
Source array:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 15 * * * 25 3 3 3 3 *

Destination Array:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 4 4 4 4 4 4 4 4 4 3
To set multiple values at the same time you can use the raw data setter setData. It behaves like the Raw Data constructor accepting the same parameters. Another function, ownData, does the same job (and therefore we will not write an example) with the only difference that the data provided are owned by the gArray, so you don't have to (and you should not) delete them. In the example below we will reuse the raw data instantiated before.
gArray<gShortUnsigned> A;
A.setData(10,raw_data,3);
cout << "After setting values using setData:\n" << A << endl;
After setting values using setData:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 * 794 336 387 493 650 422
Using the member function setValues you can copy values (and NA status) from another gArray specifying the positions you want to assign the values to. Positions can exceed the array length, causing the array to be resized (using fillvalue and fillNA to initialize new potions). If a position is a marked as NA it will be skipped as well as the corresponding value. Value arrays can be either longer or shorter than the size n of the positions array. In the first case the first n positions will be used while, in the latter case, values will be recycled. The following example illustrates the above mentioned conditions.
// Positions array initialization
gArray<gPos> positions;
positions.setValue(0,0);
positions.setValue(1,2,true);
positions.setValue(2,4);
positions.setValue(3,6);
positions.setValue(4,8);
positions.setValue(5,10);
cout << "The positions:\n" << positions << endl;
// Values array initialization
gArray<gShortUnsigned> srcValues;
srcValues.setValue(0,10);
srcValues.setValue(1,20);
srcValues.setValue(2,30,true);
cout << "The values:\n" << srcValues << endl;
// An array is instantiated with seven values initialized to the value 5
gArray<gShortUnsigned> dstValues(5,7);
cout << "The array before copying:\n" << dstValues << endl;
dstValues.setValues(positions,srcValues,0,false);
cout << "The array after copying:\n" << dstValues << endl;
The positions:
Pos: 0 1 2 3 4 5
----------------------------------
Val: 0 * 4 6 8 10

The values:
Pos: 0 1 2
-------------------
Val: 10 20 *

The array before copying:
Pos: 0 1 2 3 4 5 6
---------------------------------------
Val: 5 5 5 5 5 5 5

The array after copying:
Pos: 0 1 2 3 4 5 6 7 8 9 10
-----------------------------------------------------------
Val: 10 5 5 5 * 5 10 0 20 0 *
Value assignment to already instantiated gArray objects can be done using assignment operators. The single value assignment operator is the counterpart of the single value constructor. The array assignment operator makes the current gArray a copy of the other one in the same way described for the copy range constructor (i.e. without data duplication).
gArray<gShortUnsigned> A,B;
A=57;
B=A;
Compound assignment arithmetic atomic operators are another way to assign values to arrays. Let's say we have two arrays A, of size n, and B, of size m. Assigning the values of B to A by a compound assignment operator means that the elements of A are operated with the first n elements from B and that the results are assigned back to A. If m
//Let's instantiate an empty array
gArray<gShortUnsigned> A;
//and an array of values (it's woth noting the implicit use of a single value constructor to specify the NApositions parameter
gArray<gShortUnsigned> B(raw_data,5,4);
cout << "The values of B:\n" << B << endl;
A+=B;
cout << "A+=B is empty if A is empty:\n" << A << endl;
//now let's set some values of A
A.setValue(9,5,false,5,false);
cout << "The initial values of A:\n" << A << endl;
// and apply the same operator as before
A+=B;
cout << "Values of A after A+=B:\n" << A << endl;
//The Array to be assigned can also be specified via an implicit single value constructor
A/=2;
cout << "Values of A after A/=2:\n" << A << endl;
The values of B:
Pos: 0 1 2 3 4
-----------------------------
Val: 384 887 778 916 *

A+=B is empty if A is empty:
Pos:
----
Val:

The initial values of A:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 5 5 5 5 5 5 5 5 5 5

Values of A after A+=B:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 389 892 783 921 * 389 892 783 921 *

Values of A after A/=2:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 194 446 391 460 * 194 446 391 460 *

Other ways to modify an array

Arrays can be modified using a number of function members like in the following examples. It's worth noting that these member functions modify the instantiation upon which they are invoked. (and that's why in all these examples we will re-assign the values of C)
gArray<gShortUnsigned> A(raw_data,5);
gArray<gShortUnsigned> B(raw_data+5,5,3);
gArray<gShortUnsigned> C=A;

cout << "The array to be modified:\n" << A << endl;
cout << "The array used for modifications:\n" << B << endl;
//Array concatenation
A=C;
A.concatenate(B);
cout << "A concatenated with B:\n" << A << endl;

//Array insertion
A=C;
A.insert(3,B);
cout << "A with B inserted at position 3:\n" << A << endl;

//Array values replacement
A=C;
A.replace(2,B,1,3);
cout << "The values of A replaced starting from position 2 with the values" << endl;
cout << "of B between 1 and 3(1 based):\n" << A << endl;

//Array values removal
A=C;
A.remove(1,2);
cout << "After removing two values starting from position 1:\n" << A << endl;

//Array values reversion
A=C;
A.revert(1,4);
cout << "A after reverting elements in the range [1,4):\n" << A << endl;

//Array sorting
A=C;
gArray<gPos> D=A.sort(true);
cout << "A after sorting:\n" << A << endl;
cout << "the original positions of the sorted elements:\n" << D << endl;
The array to be modified:
Pos: 0 1 2 3 4
-----------------------------
Val: 384 887 778 916 794

The array used for modifications:
Pos: 0 1 2 3 4
-----------------------------
Val: 336 387 493 * 422

A concatenated with B:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 794 336 387 493 * 422

A with B inserted at position 3:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 336 387 493 * 422 794

The values of A replaced starting from position 2 with the values
of B between 1 and 3 (1 based):
Pos: 0 1 2 3 4
-----------------------------
Val: 384 887 387 493 794

A after removing two values starting from position 1:
Pos: 0 1 2
-------------------
Val: 384 916 794

A after reverting elements in the range [1,4):
Pos: 0 1 2 3 4
-----------------------------
Val: 384 916 778 887 794

A after sorting:
Pos: 0 1 2 3 4
-----------------------------
Val: 384 778 794 887 916

The original positions of the sorted elements:
Pos: 0 1 2 3 4
-----------------------------
Val: 0 2 4 1 3

Finding and counting values

Values can be searched inside a gArray by using the find() member function. It returns a gPos array containing the positions where the value occurs.
gArray<gShortUnsigned> A(raw_data,10,4);
A.setValue(8,384,false);
cout << "An array A:\n" << A << endl;
cout << "Positions where the value 384 occurs:\n" << A.find(384) << endl;
cout << "Positions where the value 794 occurs:\n" << A.find(794) << endl;
An array A:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 * 336 387 493 384 422

Positions where the value 384 occurs:
Pos: 0 1
--------------
Val: 0 8

Positions where the value 794 occurs:
Pos:
----
Val:
Note that the element at position 4 is ignored because it is marked as NA.
You can count values using one of the getCounts member functions. The first form of this function simply counts all the occurrences of a given value inside the array and returns a single valued gArray with the count. You can also specify a subset of the array in which the counts (either windowed or not) should be performed. Counts in a subset of the array can be obtained by specifying the subset boundaries. You can also obtain counts along the array by using a sliding window of length winlength (if winlength=0, the default, no windowing is performed). In this case, the returned value is a gArray of the same length of the (subset of the) input array. Counts are saved with an offset (specified by the pos parameter) relative to the starting position of each window. Please note the if the parameter skipnan is passed as true (default) only non NA elements will be counted. If skipnan is set to false then the return value will be NA if in the array (or in the subset, or in a window) there are NA values. There is also another version of the same function, whose first parameter must be an array (of the correct size) in which the results will be saved. This is useful when you have this function called in a cycle and you don't want a new array to be instantiated at each call. This last version will return a reference to the array passed as an argument.
//We are using the same A array instantiated in the previous example
cout << "Counting occurrences of 384 in A:\n" << A.getCounts(384) << endl;

//With the following we counts occurrences of the specified value in 4
//elements windows with offset set to 0.
cout << "Counting occurrences of 384 in windows of length 4:\n" << A.getCounts(384,4,0) << endl;

//Now we do the same but specifying a subset and skipnan set to false
cout << "Counting occurrences of 384 in windows of length 4" << endl;
cout << "in the subset [0,6) and with skipnan==false:\n" << A.getCounts(384,4,0,0,6,false) << endl;

//Let's instantiate a result array
gArray<gSize> counts(0,A.getSize(),true);
A.getCounts(counts,384,4,2);
cout << "Counting occurrences of 384 in windows of length 5 (passing the result array):\n" << counts << endl;
Counting occurrences of 384 in A:
Pos: 0
---------
Val: 2

Counting occurrences of 384 in windows of length 4:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 1 0 0 0 0 1 1 * * *

Counting occurrences of 384 in windows of length 4
in the subset [0,6) and with skipnan==false:
Pos: 0 1 2 3 4 5
----------------------------------
Val: 1 * * * * *

Counting occurrences of 384 in windows of length 5 (passing the result array):
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: * * 1 0 0 0 0 1 1 *
Counting occurrences of multiple values at the same time is possible using the multiple value version of getCounts(). This requires you to pass a gArray containing the values you want to count. It is possible, as in the previous one, to specify a subset of the array and skipnan behaviour. It is not allowed to obtain counts in windows along the array. Instead, you can specify that you want separate counts for each value (countall==true, default) or a- unique count of the occurrence of any of the value passed. It will return a gArray containing counts for each value or containing a single overall count. This function too has a version where you can pass the result array as a parameter. Another member function is getUnique(), which returns an array containing a single instance of all the non NA values in the array.
gArray<gShortUnsigned> B(0,2,false);
B.setValue(0,778);
B.setValue(1,384);
cout << "Counting occurrences of 778 and 384:\n" << A.getCounts(B) << endl;
cout << "Counting occurrences of 778 or 384:\n" << A.getCounts(B,true) << endl;
gArray<gShortUnsigned> C=A.getUnique();
cout << "Non NA values in A:\n" << C << endl;
cout << "Counting occurrences of values in A:\n" << A.getCounts(C) << endl;
Counting occurrences of 778 and 384:
Pos: 0 1
--------------
Val: 1 2

Counting occurrences of 778 or 384:
Pos: 0
---------
Val: 3

Non NA values in A:
Pos: 0 1 2 3 4 5 6 7
--------------------------------------------
Val: 384 887 778 916 336 387 493 422

Counting occurrences of values in A:
Pos: 0 1 2 3 4 5 6 7
--------------------------------------------
Val: 2 1 1 1 1 1 1 1

Calculation using gArrays

We have already seen the usage of compound assignment arithmetic atomic operators. There are non compound assignment versions for these same operators. In this case, the output is a new gArray object. The following examples show possible usages for these operators. Remember that in an expression what gives the size of the result is always the first array. For instance, if A has a size=5 the result of the expression (A+B)/C will have size=5 independently from the size of B and C (which will be recycled if needed). You can also use atomic constants instead of single value arrays that will be automatically instantiated by the compiler.
gArray<gShortUnsigned> A(raw_data,5);
cout << "The array A:\n" << A << endl;

//Increment of array elements
gArray<gShortUnsigned> B=A+1;
cout << "B=A+1:\n" << B << endl;

//Division
gArray<gShortUnsigned> C(raw_data+5,5);
cout << "The array C:\n" << C << endl;
gArray<gShortUnsigned> D=B/C;
cout << "D=B/C:\n" << D << endl;

//that is equivalent to (in a single expression)
cout << "(A+1)/C:\n" << (A+1)/C << endl;
The array A:
Pos: 0 1 2 3 4
-----------------------------
Val: 384 887 778 916 794

B=A+1:
Pos: 0 1 2 3 4
-----------------------------
Val: 385 888 779 917 795

The array C:
Pos: 0 1 2 3 4
-----------------------------
Val: 336 387 493 650 422

D=B/C:
Pos: 0 1 2 3 4
-----------------------------
Val: 1 2 1 1 1

(A+1)/C:
Pos: 0 1 2 3 4
-----------------------------
Val: 1 2 1 1 1
Logical atomic operators have also been defined. They act like the arithmetic ones with the main difference being that they return gShortUnsigned arrays containing {0,1} to represent boolean values. The library provides the which() function, able to return a gPos array containing the positions of the "1" elements of a gShortUnsigned array. This can help in finding positions at which a given "condition" array is true. Let's see, as an example, how to find the positions of values that are not comprised in the interval [300,400) into the array "A":
gArray<gShortUnsigned> A(raw_data,5);
cout << "The array A:\n" << A << endl;

gArray<gShortUnsigned> B=A>=300;
cout << "B = A>=300:\n" << B << endl;

gArray<gShortUnsigned> C=A<400;
cout << "C = A<400:\n" << C << endl;

gArray<gShortUnsigned> D= B && C;
cout << "D = B && C:\n" << D << endl;

gArray<gPos> result=which(!D);
cout << "result = which(!D):\n" << result << endl;

//that is equivalent to (in a single expression)
cout << "which( !( (A>=300) && (A<400) ) ):\n" << which( !( (A>=300) && (A<400) ) ) << endl;
The array A:
Pos: 0 1 2 3 4
-----------------------------
Val: 384 887 778 916 794

B = A>=300:
Pos: 0 1 2 3 4
-----------------------------
Val: 1 1 1 1 1

C = A<400:
Pos: 0 1 2 3 4
-----------------------------
Val: 1 0 0 0 0

D = B && C:
Pos: 0 1 2 3 4
-----------------------------
Val: 1 0 0 0 0

result = which(!D):
Pos: 0 1 2 3
------------------------
Val: 1 2 3 4

which( !( (A>=300) && (A<400) ) ):
Pos: 0 1 2 3
------------------------
Val: 1 2 3 4
To apply a function to all elements of an array you can use the template function "apply". It requires as a parameter the pointer to the function you want to be applied. The function can be anything pointed by T (*)(T val). An atomic cast operator on arrays has also been defined. It returns a cast copy of the original array (in this case data are newly allocated). In the following example we initialize a short integer array (using the usual raw data) with a NA value, and then a negative value is assigned to an element. Since the log10 function accepts double values as input we need to cast the array to double. We then call the apply function specifying a pointer to the log10 function. The apply function returns a double gArray that is cast back to integer.
gArray<_short> A((_short *) raw_data,10,4);
A.setValue(5,-100,false);
cout << "The Array A:\n" << A << endl;
gArray<_short> B=apply((gArray<gScore>)A,&log10)*10;
cout << "B containing A elements base 10 logs:\n" << B << endl;
The Array A:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 0 887 0 * -100 916 0 794 0

B containing A elements base 10 logs:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 25 0 29 0 * * 29 0 28 0
Note how the NAs are considered. Calculation on elements marked as NA in the input array is not performed and their NA status is returned coherently. The apply function tries to identify nan return values from the provided function setting their NA status to true. Exceptions are also caught so that if the called function fails the result is marked as NA.
There are function members to make non atomic calculation on arrays. The minimum, maximum, sum, squared sum, mean and standard deviation of the elements of an array can be calculated using the corresponding "get" function
gArray<gShortUnsigned> A(raw_data,10,4);

cout << "The Array A:\n" << A << endl;

cout << "The minimum of the elements in A:\n" << A.getMin() << endl;
cout << "The minimum of the elements in A" << endl;
cout << "specifing a range [0,5):\n" << A.getMin(0,5) << endl;

cout << "The maximum of the elements in A:\n" << A.getMax() << endl;
cout << "The maximum of the elements in A" << endl;
cout << "with skipnan set to false:\n" << A.getMax(0,0,false) << endl;
The Array A:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 * 336 387 493 650 422

The minimum of the elements in A:
Pos: 0
---------
Val: 336

The minimum of the elements in A
specifing a range [0,5):
Pos: 0
---------
Val: 384

The maximum of the elements in A:
Pos: 0
---------
Val: 916

The maximum of the elements in A
with skipnan set to false:
Pos: 0
---------
Val: *
As described for the getCounts methods, these also have a winlength parameter, which allows to perform calculations on sliding (step 1) windows rather than on the whole array. The function will return a single value (gArray) if winlength=0 (default) otherwise it will return an array of the same length containing results for each window at a precise offset (pos parameter, default 0) relative to the starting position of each window.
//Windowed calculations
cout << "The Array A:\n" << A << endl;

cout << "The sum of the elements in sliding windows" << endl;
cout << "of length 3 with an offset of 1, skipping NAs:\n" << A.getSum(0,0,true,3,1) << endl;

cout << "The mean of the elements in sliding windows" << endl;
cout << "of length 3 and offset of 1, skipping NAs:\n" << A.getMean(0,0,true,3,1) << endl;

cout << "The mean of the elements sliding windows" << endl;
cout << "of length 3 and offset of 1, without skipping NAs:\n" << A.getMean(0,0,false,3,1) << endl;
}
The Array A:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 384 887 778 916 * 336 387 493 650 422

The sum of the elements in sliding windows
of length 3 with an offset of 1, skipping NAs:
Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: * 2049 2581 1694 1252 723 1216 1530 1565 *

The mean of the elements in sliding windows
of length 3 and offset of 1, skipping NAs:
Pos: 0 1 2 3 4 5 6 7 8 9
--------------------------------------------------------------------------
Val: * 683.0 860.3 847.0 626.0 361.5 405.3 510.0 521.7 *

The mean of the elements sliding windows
of length 3 and offset of 1, without skipping NAs:
Pos: 0 1 2 3 4 5 6 7 8 9
--------------------------------------------------------------------------
Val: * 683.0 860.3 * * * 405.3 510.0 521.7 *

gArray derived objects

In this chapter we will...

gMatrix

A gMatrix template class has been derived from gArray to represent data organized in matricial form. The gMatrix class has methods to set and retrieve rows and columns as well as a method that allows to use couple of indices to access matrix specific elements. ...to be completed with examples...

gString

gString is derived from a gChar specialization of gArray. It is used both to represent strings and as a base for the gSequence class. gString has specific constructors that allow for instantiating from std:string ,char * and other gString objects. It also defines string operators such as string comparison (==) and concatenation (+). Since it is derived from gArray member function and operators to access single characters are still available. ...to be completed with examples...

gPositionInterval

Another template class derived from gArray is gPositionInterval, used to represent positions intervals. In the library, interval starting positions are always zero based while ending positions are 1 based. This class is used in other objects to represent intervals. It has methods to perform intersections and unions between intervals. It has been defined as a template to specialize it in order to represent both relative (signed) and absolute (unsigned) positions. ...to be completed with examples...

Other objects

In this chapter we will...

gBitsArray

Invalid values (NA) in gArray objects are maintained by a gBitsArray object. This object has been implemented to represent bits array. Nevertheless this class can be used independently to represent binary values. ...to be completed with examples...

gException

gException is a simple class derived from std::exception. Error conditions can be trapped through the try... catch mechanism. Exception can be thrown using the gException constructor that accepts a string describing what happened and optionally a code. Currently this code is not used, the class is mainly a stub for future developments. ...to be completed with examples...

Retrievers

Array retrievers are objects used to calculate/retrieve array values. The library provides a base gArrayretriever template class from which the actual retrievers can be derived. This class holds a pointer to a gArrayRetreiverImplementation object. This is the object that actually performs calculations/retrievals and that must be implemented. These two classes actually provide a way to instantiate a retriever providing its implementation. The instantiated retriever can be copied and the internal implementation object is used again without its duplication thanks to a ref-counting mechanism. The gArrayretrieverImplementation class has no particular method implemented, except for two protected functions to access array reference count and internal raw data. These allow any class derived from gArrayretrieverImplementation to operate directly on the array memory. Particular care should be taken since value changes will be reflected by all the objects that share this pointer at a given time (it is completely safe when refCount is 1). ...to be completed... Examples of gArrayRetriever derived objects are reported for sequence retriever and feature retrievers.

Sequence

The Geco++ library is not a sequence analysis tool. Its goal, instead, is to provide a way to develop sequence analysis software (in its wider acception) that can efficiently and easily cope with genomic annotation and sequence variations. We'll see in the next chapter how this is achieved through the use of the gElement class. Sequence information is therefore simply stored in gSequence objects. The gSequence class is derived from gString which, in turn, inherits from a specialization of the gArray template class (with T=gChar, unsigned char in the current implementation). As we have seen for gArrays, this allows for subsetting without duplication and positions being marked as NA. The class itself doesn't add much to the gString capabilities. The only specific member functions which have been added are relative to sequence reverse-complementing and translation. gSequence objects can be instantiated by providing sequence information directly to the constructor.
gSequence seq("ACGATCGACTAGCATCGACA");
cout << "Sequence: "<< seq << endl;

Sequence: ACGATCGACTAGCATCGACA

Sequence operations

gSequence member functions, as well as those derived from its ancestors (gString and gArray) can be used to perform simple operations on sequences.
Using gArray member functions you can consider sequences as arrays of characters; in this view you can, for example, obtain information about specific character occurrences, positions and counts.
gSequence seq("ACGATCGACTAGCATCGACA");
cout << "Sequence: "<< seq << endl;  
cout << "Total number of C and A occurrences: " << endl << seq.getCounts(gString("CA")) << endl;
cout << "Number of C occurrences in a sliding 4bp window: " << endl << seq.getCounts('C',4) << endl;
cout << "Positions of C occurrences:" << endl << which( ((gArray<gChar>)seq)=='C') << endl;
Total number of C and A occurrences:
Pos: 0 1
--------------
Val: 6 7

Number of C occurrences in a sliding 4bp window:
Pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
--------------------------------------------------------------------------------------------------------
Val: 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1[ 0][ 0][ 0]

Positions of C occurrences:
Pos: 0 1 2 3 4 5
----------------------------------
Val: 1 5 8 12 15 18
Sequences can be modified using gArray setValues functions (not shown) or the transformation member functions.
gSequence seq("AAACCC");
cout << "The original sequence:                     " << seq << endl;
seq.insert(2,gSequence("TTT"));
cout << "After inserting TTT at position 2:         " << seq << endl;
seq.remove(5,2);
cout << "After removing 2 bases from position 5:    " << seq << endl;
seq.replace(1,gSequence("GGG"));
cout << "After replacing positions from 1 with GGG: " << seq << endl;

The original sequence: AAACCC
After inserting TTT at position 2: AAATTTCCC
After removing 2 bases from position 5: AAATTCC
After replacing positions from 1 with GGG: AGGGTCC
By using gString member functions you can assign std::string objects or const char pointers to gSequences, you can also concatenate different gSequence objects. Lower and upper functions change the letter case of the string. The corresponding getLower and getUpper return a new object with changed case.
gSequence seq1="AAAA";
gSequence seq2=string("CCCC");
gSequence seq3=seq1+seq2+"GGGG"+string("TTTT");
cout << seq3.getLower() << endl; 
cout << seq3 << endl;

aaaaccccggggtttt
AAAACCCCGGGGTTTT
Two specific methods have been added to the gSequence class to perform reverse complement and translation (in a given reading frame) operations.
gSequence seq("TGATGTGTTGAGCATCGACA");
cout << "A sequence: " << seq << endl; 
cout << "Its reverese-complement:     " << seq.getReverseComplement() << endl; 
cout << "Its AA translation, frame=0: " << seq.getTranslated(0) << endl; 
cout << "Its AA translation, frame=1: " << seq.getTranslated(1) << endl; 
cout << "Its AA translation, frame=2: " << seq.getTranslated(2) << endl;

A sequence: TGATGTGTTGAGCATCGACA
Its reverse-complement: TGTCGATGCTCAACACATCA
Its AA translation, frame=0: *CVEHR
Its AA translation, frame=1: DVLSID
Its AA translation, frame=2: MC*AST

Sequence retrievers

Sequence retrievers are objects derived from the gChar specialization of the gArrayRetrievers template class. To use a retriever, an implementation object must be specified in the constructor. It also has a public member, called getSequence, that will be the one to call in order to actually obtain the sequence (and that is used by other library classes when sequence information is needed). Implementation objects must be derived from the gSequenceRetrieverImplementation class. gSequenceRetriverImplementation is an abstract virtual class since it declares a virtual destructor and two pure virtual methods, which always need to be implemented when deriving classes. To derive an actual sequence retriever implementation (to be used by sequence retrievers) you can thus derive from gSequenceRetrieverImplementation overloading its virtual methods.
The first one is the function that should return the required sequence. It takes as input the reference name, and the reference starting and ending positions. We will see in the next chapter how "references" and positions along them are used in this library. For now, consider a reference as a given known sequence (i.e. a chromosome, a scaffold or anything that makes sense to you).
The other method that must be implemented is the "clone" function, which allows objects using the retriever to obtain a pointer to a new copy of this object (without knowing which object type actually is passed).
For what concerns the virtual destructor (which is not a pure virtual method) one has to re-implement it only needed (i.e. if you allocated some memory in your constructor that must be explicitly set free). Let's see how a sequence retriever implementation can be defined. Let's imagine a situation where you have chromosome sequences stored somewhere in your local filesystem. Each chromosome sequence is contained in a a plain text file. Files are located in at given path in your filesystem and are named upon chromosome name with a suffix (.txt for example). Imagine you want the path and the suffix to be parametrized. Let's first declare the derived class:
class gSequenceRetrieverExampleImplementation:public gSequenceRetrieverImplementation{
  private:
   //String containing the path to the files
   gString fileprefix;
   //String containing the files suffix
   gString filepostfix;
   
   virtual gArrayRetrieverImplementation<gChar> * clone() const;
   virtual gSequence getSequence_Internal(const gString & reference,gPos start,gPos end) const;

  public:
   gSequenceRetrieverExampleImplementation(const gString & prefix,const gString &  postfix);
   virtual ~gSequenceRetrieverExampleImplementation();
};
We have to implement the constructor, and the virtual methods. The constructor will simply copy the postfix and suffix parameters into the corresponding data members The virtual destructor in this example has nothing to do, (we have reported it for the sake of completeness). You must also implement the clone member function. This one is always implemented instantiating a new instance of the current object and returning its pointer. Other objects in the library will call this method to obtain a copy of the retriever implementation.
gSequenceRetrieverExampleImplementation::gSequenceRetrieverExampleImplementation(const gString & prefix,const gString &  postfix):gSequenceRetrieverImplementation(){
 fileprefix=prefix;
 filepostfix=postfix;
}

gSequenceRetrieverExampleImplementation::~gSequenceRetrieverExampleImplementation(){
}

gArrayRetrieverImplementation<gChar> * gSequenceRetrieverExampleImplementation::clone() const{
  return new gSequenceRetrieverExampleImplementation( ((string) fileprefix).c_str() ,((string) filepostfix).c_str() );
}
Lastly we must implement the method that actually retrieve sequence information from the correct file returning a gSequence object. In doing so we will construct the filename based on the reference parameter (expected to be the chromosome name) and to prefix ad suffix data members. We will first instantiate a gSequence of the desired length and then, using the getArrayDataPtr member function, we use the actual memory buffer of the gSequence to store the data retrieved from the file, finally returning the sequence object.
As usual, starting positions are 0 based while ending ones are 1 based. This function will be called by the getSequence member function that upon return will check if the sequence is compatible with the requested one; if not it will throw an exception. You can both return an empty gSequence or throw an exception to report some error condition in your retriever.
#include <fstream> //include this for file I/O operations
gSequence gSequenceRetrieverExampleImplementation::getSequence_Internal(const gString & reference,gPos start,gPos end) const{
 string filename=fileprefix+reference+filepostfix; // obtains the complete filename
 ifstream filein(filename.c_str());                // and open it for reading
 if(filein.is_open()){                             // check if it's actually opened
                                                   // if opened
  gSequence tmp(end-start,'N');                    // instantiates a gSequence of the required length
  filein.seekg(start,ios_base::beg);               // offsets the file pointer to the starting position
  filein.get(getArrayDataPtr(tmp),end-start+1);    // gets the data directly into the object memory
  return tmp;                                      // returns the gSequence
 }else{
                                                   // otherwise
  return gSequence();                              // returns an empty gSequence;
                                                   // alternatively we could throw a gException
 }
}
To test our new retriever we will construct a gSequenceRetriever object providing an instance of our new implementation, then we'll get the sequence corresponding to a portion of chromosome X starting at 1789568 (0 based) and ending at 1789618 (1 based)
gSequenceRetriever retriever(gSequenceRetrieverExampleImplementation("/adat/database/hg18/",".txt"));
gSequence seq=retriever.getSequence("chrX",1789568,1789618);
cout << "chrX:1789568-1789618\t" << seq << endl;
chrX:1789568-1789618 ATATATACTAGTATAAACTCAGTATCTGTATATTAGTATATATTATAAA
We now want to show the differences in using retrievers instead of sequence information when instantiating a gElement object. Let's instantiate two gElements objects, corresponding to the same chromosome X interval we referred above. The first one will use the sequence we've retrieved, while the second one will be constructed passing the retriever.
gElement elementA("chrX",1789568,1789618,true,seq,gRef);
gElement elementB("chrX",1789568,1789618,true,retriever,gRef);

cout << "Element A: " << elementA << endl;
cout << "Element B: " << elementB << endl;
Element A:
-------------------------------------
|Reference name | chrX|
|Reference Interval| [ 1789568, 1789618)|
|Strand | Forward|
|Element Interval | [ 0, 50)|
|Element length | 50|
-------------------------------------
sequence: ATATATACTAGTATAAACTCAGTATCTGTATATTAGTATATATTATAAAC
-------------------------------------

Element B:
-------------------------------------
|Reference name | chrX|
|Reference Interval| [ 1789568, 1789618)|
|Strand | Forward|
|Element Interval | [ 0, 50)|
|Element length | 50|
-------------------------------------
sequence: ATATATACTAGTATAAACTCAGTATCTGTATATTAGTATATATTATAAAC
-------------------------------------
As you have seen, the result is the same. Nonetheless, if we obtain two sub elements from the one we have instantiated, the situation changes: since the two sub-elements extend the boundaries 10bp upstream you see that the one instantiated with the retriever is able to retrieve the missing sequence information, while the other one returns NAs where the sequence information is not available. Please note also that the sub-element obtained from element B inherits the retriever object. This is done without making a copy of the object, but instead incrementing its reference counter. The object will be automatically deallocated when its reference counter becomes zero.
gElement subelementA(elementA,gElementInterval(-10,10),true);
gElement subelementB(elementB,gElementInterval(-10,10),true);

cout << "Sub-element A:" << subelementA << endl;
cout << "Sub-element B:" << subelementB << endl;
Sub-element A:
-------------------------------------
|Reference name | chrX|
|Reference Interval| [ 1789558, 1789578)|
|Strand | Forward|
|Element Interval | [ 0, 20)|
|Element length | 20|
-------------------------------------
sequence: NNNNNNNNNNATATATACTA
-------------------------------------

Sub-element B:
-------------------------------------
|Reference name | chrX|
|Reference Interval| [ 1789558, 1789578)|
|Strand | Forward|
|Element Interval | [ 0, 20)|
|Element length | 20|
-------------------------------------
sequence: GTATAAACTCATATATACTA
-------------------------------------

Genomic elements

In the GeCo++ library, a genomic reference is defined as a portion of DNA identified by a name. A genomic element is then defined as an interval with a given strand along a genomic reference. Positions along a genomic reference (referencePositions) are defined as zero based unsigned values. Element positions are defined relatively to the beginning of a given element and therefore are represented by signed zero based values. Intervals for both references and elements are considered as right open ones. A genomic element instance is defined by its reference boundaries (interval), its strand and by one ore more variations (insertions, deletions and substitutions) relative to the reference.

One of the goals of the library is to allow for conversion between element and reference positions and vice-versa. The presence of insertions and deletions makes this not straightforward, therefore some simple rules need to be applied. When converting from reference to element, a reference position falling inside a deletion maps to the last valid element position before the deletion (along the element strand) and it is marked as NA. Reciprocally, when converting from element to reference, an element position falling inside an insertion maps to the last reference position before the insertion. When converting intervals, a slightly different rule applies: starting and ending positions are exchanged if the element strand is reverse and invalid positions are mapped to the position outside the variation that falls inside the interval. Some examples of conversion from reference to element positions and intervals are reported in the above figure. Throughout the library, reference positions are represented by gPos (unsigned long) values while element positions by gRelativePos (signed long) values. Reference and element intervals are represented by the gElementInterval and gReferenceInterval classes respectively. Both these classes are specializations of the gPositionInterval template class.
The class gElement is designed to represent genomic elements. It maintains information about the element interval, its reference, variations and sequence. Some of the information are maintained in a separate gRegion object. gRegions are intended as instances of a given reference to which variations have been applied. For instance, if a reference represents a chromosome the corresponding gRegion object could represent the same chromosome in an individual. Users will never have to deal with this object, but its presence allows for the instantiation of new gElements from existing one sharing common information about reference, variations and sequence. In this tutorial we will refer to gElement objects obtained from existing one as "sub-elements", regardless of the fact that they are contained or not in the original one.
"Sites" can be added to gElement objects. Sites are positions along the element that have a particular biological meaning depending on the application one is developing. As well as sites, "Connections" can also be added, representing meaningful directed links between two sites. Another important feature of gElement is the opportunity to add "Features". Features are numerical properties whose value varies along the element.

Starting with gElements

gElement objects can be instantiated in different ways. With the exception of the copy constructors, one always has to provide a reference name, a reference interval, and the strand. The reference interval can be referred either instantiating a gReferenceInterval or by specifying starting and ending positions (these latter must be 1 based). Later we'll see that there are other constructors that can be used to add sequence information to the object. If you do not pass sequence information (as in the following examples) the sequence will be considered as unknown and the object's sequence member functions will return strings of "N".
gElement element1("chr1",5,45);
gElement element2("chr1",gReferenceInterval(5,45));
gElement element3("chr1",5,45,false);
Once you have instantiated a gElement you can use its member functions to obtain element information. To illustrate the usage of these functions we define an overloaded streaming operator to output element information:
ostream & operator << (ostream & out,const gElement & element) {
 //You can obtain the reference name...    
 gString refName=element.getReferenceName();
 //the reference interval where this element maps...
 gReferenceInterval refIntv=element.getReferenceInterval();
 //the element interval (in element coordinates)...
 gElementInterval elmIntv=element.getElementInterval();
 //the strand relative to the reference...
 bool strand=element.isForward();
 //the length of the element...
 gSize elementLen=element.getLength();
 //and the element's sequence
 gSequence eseq=element.getSequence();
 //details about output formatting are omitted.
 //...formatting output...
Now we can apply the streaming operator to the gElements object we instantiated before:
cout << "element1:" << element1 << endl;
cout << "element2:" << element2 << endl;
cout << "element3:" << element3 << endl;
element1:
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 5, 45)|
|Strand | Forward|
|Element Interval | [ 0, 40)|
|Element length | 40|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------

element2:
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 5, 45)|
|Strand | Forward|
|Element Interval | [ 0, 40)|
|Element length | 40|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------

element3:
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 5, 45)|
|Strand | Reverse|
|Element Interval | [ 0, 40)|
|Element length | 40|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------
gElements can also be instantiated as sub-elements of other gElements using the range copy constructor. You must provide an element, the element interval and the strand relative to the element. As for the previous case, you can specify the interval either using a gElementInterval object or providing starting and ending positions in element coordinates. Element coordinates are relative to the beginning of the element. This means that you can use negative values or values greater than the element length in order to extend the sub-element from the original element boundaries. Another way to obtain gElement objects from an existing one is to use the getElement member function. You must specify the element interval, the strand and, optionally, offsets for starting and ending positions. The function will return a new gElement object corresponding to the strand and boundaries specified.
//some example of elements instantiated as sub-elements.
gElement element4(element1,10,20);
gElement element5(element1,gElementInterval(-4,46));
gElement element6(element3,gElementInterval(15,25),false);
//obtaining a new gElement using the getElement function (with offsets)
gElement element7=element1.getElement(gElementInterval(3,8),false,-1,1);

//You can do the same for the other three instances we have
cout << "element4:" << element4 << endl;
cout << "element5:" << element5 << endl;
cout << "element6:" << element6 << endl;
cout << "element7:" << element7 << endl;
element4:
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 15, 25)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: NNNNNNNNNN
-------------------------------------

element5:
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 1, 51)|
|Strand | Forward|
|Element Interval | [ 0, 50)|
|Element length | 50|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------

element6:
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 20, 30)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: NNNNNNNNNN
-------------------------------------

element7:
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 7, 14)|
|Strand | Reverse|
|Element Interval | [ 0, 7)|
|Element length | 7|
-------------------------------------
sequence: NNNNNNN
-------------------------------------
the gElement class also has a default constructor and an assignment operator. Thus you can use the assignment operator to obtain an exact copy of another element. Furthermore, this makes it insertable in std::containers.
//Make a copy of a gElement
gElement element8=element5;
//Instantiate and populate a std::vector of gElement objects
vector<gElement> elementVector; 
elementVector.push_back(element1);
elementVector.push_back(element2);
elementVector.push_back(element3);
elementVector.push_back(element4);
elementVector.push_back(element5);
elementVector.push_back(element6);
elementVector.push_back(element7);
elementVector.push_back(element8);
 
//output the elements we've instantiated 
for(size_t i=0;i<elementVector.size();i++){
 cout << "Element" << i+1 << elementVector[i] << endl;
}

Element1
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 5, 45)|
|Strand | Forward|
|Element Interval | [ 0, 40)|
|Element length | 40|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------

Element2
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 5, 45)|
|Strand | Forward|
|Element Interval | [ 0, 40)|
|Element length | 40|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------

Element3
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 5, 45)|
|Strand | Reverse|
|Element Interval | [ 0, 40)|
|Element length | 40|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------

Element4
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 15, 25)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: NNNNNNNNNN
-------------------------------------

Element5
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 1, 51)|
|Strand | Forward|
|Element Interval | [ 0, 50)|
|Element length | 50|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------

Element6
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 20, 30)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: NNNNNNNNNN
-------------------------------------

Element7
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 7, 14)|
|Strand | Reverse|
|Element Interval | [ 0, 7)|
|Element length | 7|
-------------------------------------
sequence: NNNNNNN
-------------------------------------

Element8
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 1, 51)|
|Strand | Forward|
|Element Interval | [ 0, 50)|
|Element length | 50|
-------------------------------------
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
-------------------------------------

Adding sequence information

gElement objects have been defined with the purpose to make it easier to instantiate sub-elements from already instantiated ones. This also implies that the sequence, as well as other information about the new genomic element, will be obtained from the existing one. There are two ways to provide sequence information to gElement objects: by either a sequence or a sequence retriever. Using the first approach makes it possible to instantiate elements with any sequence you need. If you'll obtain sub-elements from objects whose sequence have been specified this way, the sequence will be reused for the new one, possibly padded with NAs if the new object extends the interval of the original one. Using retrievers this limitation is removed; moreover, sub-element objects will fetch any additional needed sequence from the retriever. Nonetheless, in this case you need to have a suitable retriever object: sequence retrievers will be discussed in detail later; for now, let's consider them as objects that are able to retrieve the sequence of a given reference interval.
//An element instantiated providing sequence information
gElement element1("myRef",10,20,true,"ACGTGACGTA",gElm);
//A sub element 
gElement element2(element1,-10,10,false);
//An element instantiated using a retriever (suppose we've instantiated it before)
gElement element3("chr1",10100,10150,true,retriever,gElm);
//A Sub-Element
gElement element4(element3,-10,10,false);

cout << "Element with sequence:" << element1 << endl;
cout << "Sub element [-10,10), reverse" << element2 << endl;
cout << "Element with retriever:" << element3 << endl;
cout << "Sub-element [-10,10], reverse" <<  element4 << endl;

Element with sequence:
-------------------------------------
|Reference name | myRef|
|Reference Interval| [ 10, 20)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: ACGTGACGTA
-------------------------------------

Sub element [-10,10), reverse
-------------------------------------
|Reference name | myRef|
|Reference Interval| [ 0, 20)|
|Strand | Reverse|
|Element Interval | [ 0, 20)|
|Element length | 20|
-------------------------------------
sequence: TACGTCACGTNNNNNNNNNN
-------------------------------------

Element with retriever:
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 10100, 10150)|
|Strand | Forward|
|Element Interval | [ 0, 50)|
|Element length | 50|
-------------------------------------
sequence: TCAGCAGGATTTTGACGGCTTCTAACAAAATCTTGTAGACAAGATGGAGC
-------------------------------------

Sub-element [-10,10], reverse
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 10090, 10110)|
|Strand | Reverse|
|Element Interval | [ 0, 20)|
|Element length | 20|
-------------------------------------
sequence: ATCCTGCTGACTCGAGATGC
-------------------------------------
There is another parameter you may want to specify for the constructor: the sequence mode. Sequence mode defines the way the object will maintain sequence information. You can pass one of the following four values: gElm,gReg,gRef or gRet.
gElm mode is the simplest (and the default one, if you do not specify the mode). Sequence information is maintained at the element level: reference sequence (either obtained from the retriever or specified directly) is transformed according to the strand and possibly to the variations. Sub-Elements will copy the sequence from the original element and will maintain their own copy. This is the fastest way to keep sequence information. It doesn't need any transformation when the sequence is required by one of the sequence methods of the gElement class. With the gReg mode, the sequence is transformed according to the variations and it is kept into the gRegion object associated to the gElement. When a sequence is required from the original object (or from one of its sub-elements that share the gRegion object) the sequence is reverted according to the strand of the object. This method allows for a certain degree of memory saving but requires further computation to revert the sequence when needed. If a sub-element is obtained from another one adding variations a new copy of the sequence will be constructed keeping into account variations. To prevent this from happening, and thus saving a greater amount of memory, one could use the gRef or gRet modes. In these cases the getSequence methods will take care of transforming the reference sequence, according to the strand and variations of any gElement. The first one still keeps in memory a copy of the reference sequence while the second one doesn't at all. In this latter case whenever a sequence is required it is retrieved by the retriever and then transformed. Since retrievers are usually reading sequences from files, this can be very time consuming. The choice of the sequence mode depends on the specific application one is developing.

Introducing variations

We refer to a variation as to a difference between a region from a reference sequence (i.e. a chromosome assembly) and the corresponding region in another sequence (i.e. the same chromosome in an individual). To describe these differences we can use three type of variations, namely: Insertions, Deletions and Substitutions. The class gVariations can hold the information about variations relative to a reference. A gVariation instance can contain any number of variations. There is no requirement that these variations completely describe the difference between the given reference and an actual sequence.
gVariations objects need to be instantiated providing reference information. This can be done by passing the reference name, a gElement object (from which the reference will be obtained) or another gVariations object respectively. A default null constructor and the assignment operator make this object assignable and insertable (in std::containers).
//A gVariation from reference name;
gVariations var1("chr1");
//Another one whose reference is obtained from a gElement
gElement element("chr1",5,45);
gVariations var2(element);
//gVariations objects can be copied using the copy constructor...
gVariations var4(var2);
//or using the assignment operator
gVariations var3=var1;
Once the object is instantiated, variations can be added. Insertions require a position and the inserted sequence, deletions an interval and substitutions an interval and the varied sequence. You can specify both positions and intervals using either reference or element positions. In the latter case a gElement with the same reference must be provided to the referred positions. Variations are maintained in ascending order of starting position, so their order inside the object doesn't reflect the one by which they've been added. In the following example we imagine a situation where we have a reference sequence called "ref1" which aligns, as shown in figure, with a region sequence. We can describe this alignment by considering a 10bp deletion between positions 10 and 20, the insertion of "ACGTTGTG" at position 30 and a substitution C->A at position 40.

In this example we add variations using the function that operates on reference coordinates. Other versions exist, and they can be used when variations positions are relative to elements.
gVariations var("ref1");
var.addInsertion("ref1",30,"ACGTTGTG");
var.addDeletion("ref1",gReferenceInterval(10,20));
var.addSubstitution("ref1",gReferenceInterval(40,41),"A");
Information about variations can be obtained by a gVariations object using the appropriate member functions.
cout << "variation reference:" << var.getReferenceName() << endl;
cout << "Number of variations: " << setw(10) << var.getCount() << endl << endl;
cout << "N#" << setw(10) << "Start" << setw(10) << "End" << setw(10) << "Type" << setw(14) << "Sequence" << endl;
for(gSize i=0;i<var.getCount();i++){
 cout << i;
 cout << setw(10) << var.getVariationStart(i);
 cout << setw(10) << var.getVariationEnd(i);
 cout << setw(10) << var.getVariationType(i);
 cout << setw(15) << var.getVariationSequence(i) << endl; 
}
cout << endl << "type legend: 0=Substitution, 1=Insertion, 2=Deletion" << endl;
variation reference:ref1
Number of variations: 3

N# Start End Type Sequence
0 30 31 1 ACGTTGTG
1 10 20 2
2 40 41 0 A

type legend: 0=Substitution, 1=Insertion, 2=Deletion
Now we have a gVariation object containing three variations. We can use it to instantiate a new gElement that will reflect the effects of the variations. For example we can use the copy range constructor from the previously instantiated element. An important limitation is that you cannot apply variations to an object that already contains variation. You can use the copy constructor passing an already varied object but in this case you cannot pass new variations. Variations can also be passed to the constructor-from-reference. It's worth noting that reference positions are always relative to the reference sequence without any variation applied. In any case, when positions become invalid due to variations, they are marked as NA and their values are assigned following the rules explained at the beginning of this chapter.
gElement Element("ref1",5,45,true,"ATATCTGCAGCTTCTCCTTCATCTTCATCTTCTTCCTCTT",gRef);
gElement vElement1(Element,Element.getElementInterval(),true,var);
gElement vElement2(vElement1,gElementInterval(0,20),true);
gElement vElement3("ref1",gReferenceInterval(15,35),false,var);
  
cout << "Original element" << Element << endl;
cout << "Varied element 1" << vElement1 << endl;
cout << "Varied element 2" << vElement2 << endl;
cout << "Varied element 3" << vElement3 << endl;
Original element
-------------------------------------
|Reference name | ref1|
|Reference Interval| [ 5, 45)|
|Strand | Forward|
|Element Interval | [ 0, 40)|
|Element length | 40|
-------------------------------------
sequence: ATATCTGCAGCTTCTCCTTCATCTTCATCTTCTTCCTCTT
-------------------------------------

Varied element 1
-------------------------------------
|Reference name | ref1|
|Reference Interval| [ 5, 45)|
|Strand | Forward|
|Element Interval | [ 0, 38)|
|Element length | 38|
-------------------------------------
sequence: ATATCCCTTCATCTTCACGTTGTGATCTTCTTCATCTT
-------------------------------------

Varied element 2
-------------------------------------
|Reference name | ref1|
|Reference Interval| [ 5,*31)|
|Strand | Forward|
|Element Interval | [ 0, 20)|
|Element length | 20|
-------------------------------------
sequence: ATATCCCTTCATCTTCACGT
-------------------------------------

Varied element 3
-------------------------------------
|Reference name | ref1|
|Reference Interval| [ 15, 35)|
|Strand | Reverse|
|Element Interval | [ 0,*23)|
|Element length | 23|
-------------------------------------
sequence: NNNNCACAACGTNNNNNNNNNNN
-------------------------------------

Intervals and positions mapping and conversion

Both element intervals and positions can be converted to/from reference ones and to/from other elements. The rules that apply when positions cannot be exactly mapped due to the presence or variations have been explained in the introduction to this chapter. Now we'll see some examples to better understand how these rules are applied and which member functions can be used. Let's start with reference positions mapping onto elements. The following figure depicts some of the elements defined in the previous section as well as their reference.

This picture also reports four reference positions as well as their corresponding element positions for each one of the elements we have defined. Positions that are not valid are marked with an asterisk. Since variations are always defined relative to the reference, reference positions can map to invalid element positions when they fall inside a deletion. In this case, the converted positions values are reported with the NA flag set to true and their value correspond to the last valid position along the element before the deletion (reported in the output between brackets).
//An array of reference positions
gArray<gPos> referencePositions;
referencePositions.setValue(0,2,false);
referencePositions.setValue(1,15,false);
referencePositions.setValue(2,26,false);
referencePositions.setValue(3,47,false);
cout << "An array of reference positions" << endl;
cout << referencePositions << endl;

//Mapping of these positions onto different elements
cout << "The corresponding positions on Element," << endl;
cout << Element.getElementPositionsFromReference(referencePositions) << endl;
cout << "on vElement1," << endl; 
cout << vElement1.getElementPositionsFromReference(referencePositions) << endl;
cout << "and on vElement3" << endl;  
cout << vElement3.getElementPositionsFromReference(referencePositions) << endl;
An array of reference positions
Pos: 0 1 2 3
------------------------
Val: 2 15 26 47

The corresponding positions on Element,
Pos: 0 1 2 3
------------------------
Val: -3 10 21 42

on vElement1,
Pos: 0 1 2 3
------------------------
Val: -3[ 4] 11 40

and on vElement3
Pos: 0 1 2 3
------------------------
Val: 30[ 22] 16 -13
Mapping reference intervals onto elements can be done in a similar way. As said before, the rules to deal with invalid positions are slightly different. In particular, reference interval starting positions that fall inside a deletion are mapped to the first valid element position after the deletion while ending positions are mapped to the last valid element position before the deletion.

As you might appreciate from the figure, two intervals (red and green) are both defined on the reference and mapped to the elements. You can compare them with the following example output. Remember that intervals ending positions are always 1 based, even if they are marked as invalid!.
//definition of two reference intervals.
gReferenceInterval rintv1(2,15);
gReferenceInterval rintv2(15,26);
 
cout << "Reference interval: " << rintv1 << endl;
cout << "-----------------------------------------------------------" << endl;
cout << "maps to  Element  interval: " << Element.getElementIntervalFromReference(rintv1) << endl;
cout << "maps to vElement1 interval: " << vElement1.getElementIntervalFromReference(rintv1) << endl;
cout << "maps to vElement3 interval: " << vElement3.getElementIntervalFromReference(rintv1) << endl << endl;

cout << "Reference interval: " << rintv2 << endl;
cout << "-----------------------------------------------------------" << endl; 
cout << "maps to  Element  interval: " << Element.getElementIntervalFromReference(rintv2) << endl; 
cout << "maps to vElement1 interval: " << vElement1.getElementIntervalFromReference(rintv2) << endl;
cout << "maps to vElement3 interval: " << vElement3.getElementIntervalFromReference(rintv2) << endl;

Reference interval: 2 - 15
-----------------------------------------------------------
maps to Element interval: -3 - 10
maps to vElement1 interval: -3 -[ 5]
maps to vElement3 interval: [ 23] - 31

Reference interval: 15 - 26
-----------------------------------------------------------
maps to Element interval: 10 - 21
maps to vElement1 interval: [ 5] - 11
maps to vElement3 interval: 17 -[ 23]
To illustrate conversions from either element positions and intervals to reference ones or between elements, consider the situation illustrated in the following figure. We have reported two of the elements we already instantiated before and the reference sequence at the bottom.

On the vElement 3 are reported three positions as well as two intervals, one starting and one ending inside a deletion (the green and red one respectively). When converting from element to reference positions (and intervals), the resulting invalid positions will be the ones which fall inside a deletion. The rules applied in this case correspond to the ones for the conversion from reference to element positions.
In the example, the three vElement3 positions are converted to the corresponding vElement1 ones and then to the corresponding reference using the appropriate member functions. The two vElement3 intervals are then defined and mapped onto the vElement1 element. The resulting intervals are finally mapped to the reference (showing that corresponding element intervals map to the same reference ones). Again you can refer the figure to the reported output.
//An array of element positions;
gArray<gRelativePos> elementPositions;
elementPositions.setValue(0,-13);
elementPositions.setValue(1,8);
elementPositions.setValue(2,27);
 
cout << "vElement3 Positions: " << endl;
cout << elementPositions << endl;
cout << "map to the following vElement1 positions: " << endl; 
cout << vElement3.mapElementPositions(vElement1,elementPositions) << endl << endl;
cout << "and to the following reference positions: " << endl;  
cout << vElement3.getReferencePositionsFromElement(elementPositions) << endl; 
 
//two element intervals
gElementInterval eintv1(8,27);
gElementInterval eintv2(0,8);

//obtaining corresponding intervals on another element
gElementInterval eintv3=vElement3.mapElementInterval(vElement1,eintv1);
gElementInterval eintv4=vElement3.mapElementInterval(vElement1,eintv2);

cout << "vElement3 interval: " << eintv1;
cout << "maps to vElement1 interval: " << eintv3 << endl;
cout << "vElement3 interval: " << eintv2;
cout << "maps to vElement1 interval: " << eintv4 << endl << endl; 
 
cout << "vElement3 interval: " << eintv1;
cout << "maps to reference interval: " << vElement3.getReferenceIntervalFromElement(eintv1) << endl;
cout << "vElement3 interval: " << eintv2;
cout << "maps to reference interval: " << vElement3.getReferenceIntervalFromElement(eintv2) << endl << endl;

cout << "vElement1 interval: " << eintv3;
cout << "maps to reference interval: " << vElement1.getReferenceIntervalFromElement(eintv3) << endl;
cout << "vElement1 interval: " << eintv4;
cout << "maps to reference interval: " << vElement1.getReferenceIntervalFromElement(eintv4) << endl;
vElement3 Positions:
Pos: 0 1 2
-------------------
Val: -13 8 27

map to the following vElement1 positions:
Pos: 0 1 2
-------------------
Val: 40 19 0


and to the following reference psoitions:
Pos: 0 1 2
-------------------
Val: 47[ 30] 5

vElement3 interval: 8 - 27 maps to vElement1 interval: 1 - 20
vElement3 interval: 0 - 8 maps to vElement1 interval: 20 - 28

vElement3 interval: 8 - 27 maps to reference interval: 6 -[ 31]
vElement3 interval: 0 - 8 maps to reference interval: [ 31] - 35

vElement1 interval: 1 - 20 maps to reference interval: 6 -[ 31]
vElement1 interval: 20 - 28 maps to reference interval: [ 31] - 35

Sites and connections

We have seen how elements can be constructed starting from already existing ones by adding variations or simply getting sub-elements. One could desire to keep track of some particularly significant positions along one element in a way that, when a sub-element (varied or not) is obtained, it contains information about these positions. This could happen especially when variations are involved. We refer to these positions as "Sites". As usual, no assumption is done about their biological meaning.
Sites can be added to an existing element by using the appropriate member functions. Their positions can be specified in terms of element or reference positions. Obviously there are also member functions to retrieve them.
//Instantiate an element and add two sites
gElement element("chr1",10,20,true,"ACGTTTACGT");
gArray<gArrayIndex> A=element.insertElementSites(3);
gArray<gArrayIndex> B=element.insertReferenceSites(18);
cout << "Element" << element << endl;
  
//Some variations
gVariations variations(element);
variations.addDeletion(element,gElementInterval(1,5));
variations.addInsertion("chr1",16,"AAAA");

//A copy of the element adding variations
gElement vElement(element,0,10,true,variations);
cout << "vElement" << vElement << endl;

//Fetch site A positions
gArray<gRelativePos> pos1=element.getSitesElementPosition(A);
gArray<gRelativePos> pos2=vElement.getSitesElementPosition(A);
 
cout << "Site A:\t";
cout << pos1[0] << " -> ";
if(pos2.isNA(0)){
 cout << "(" << pos2[0] << ")" << endl;
}else{
 cout << pos2[0] << endl;
}
  
//Fetch site B positions
pos1=element.getSitesElementPosition(B);
pos2=vElement.getSitesElementPosition(B);
  
cout << "Site B:\t";
cout << pos1[0] << " -> ";
if(pos2.isNA(0)){
 cout << "(" << pos2[0] << ")" << endl;
}else{
 cout << pos2[0] << endl;
  }
Element
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 10, 20)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: ACGTTTACGT
-------------------------------------

vElement
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 10, 20)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: ATAAAAACGT
-------------------------------------

Site A: 3 -> (0)
Site B: 8 -> 8
The site insertion functions accept as input a gArray of positions and return a gArray of array indexes with the position of each site.
You can also add directed connections between two sites. As for sites, connections are also copied into sub-elements. There are member functions to query objects about the connections they contain, the sites these connections link, and so on.
//Using the same element of the previous example we add a third site
gArray<gArrayIndex> C=element.insertElementSites(6);  
//Yhen we insert a connection specifying the sites
gArray<gArrayIndex> connAC=element.connectSites(A,C);
//And also specifying an interval and its orientation (sites are added if not already present)
gArray<gArrayIndex> connBC=element.insertElementConnection(gElementInterval(6,9),false);

//You can know the number of sites and connections
cout << "element contains " << element.getSitesCount() << " sites and ";
cout << element.getConnectionsCount() << " connections" << endl << endl;

//For each connection you can obtain the corresponding element or reference interval 
//as well as the corresponding orientation 
cout << "Element interval for connection AC:";
cout << element.getConnectionElementInterval(connAC[0]);
cout << ((element.getConnectionElementStrand(connAC[0]))?(" Forward"):(" Reverse")) << endl;
cout << "Element interval for connection BC:";
cout << element.getConnectionElementInterval(connBC[0]);
cout << ((element.getConnectionElementStrand(connBC[0]))?(" Forward"):(" Reverse")) << endl;
cout << "Reference interval for connection AC:";
cout << element.getConnectionReferenceInterval(connAC[0]);
cout << ((element.getConnectionReferenceStrand(connAC[0]))?(" Forward"):(" Reverse")) << endl;
cout << "Reference interval for connection BC:";
cout << element.getConnectionReferenceInterval(connBC[0]);
cout << ((element.getConnectionReferenceStrand(connBC[0]))?(" Forward"):(" Reverse")) << endl;
cout << endl;

//You can also know which sites a given site is connected to/from:
cout << "Site C connects from the following sites:" << endl << element.connectsFrom(C[0]) << endl;
cout << "Site C doesn't connects to any site:" << endl << element.connectsTo(C[0]) << endl;
cout << "Site A connects to the following sites:" << endl << element.connectsTo(A[0]) << endl;
cout << "Site B connects to the following sites:" << endl << element.connectsTo(B[0]) << endl;
element contains 3 sites and 2 connections

Element interval for connection AC: 3 - 7 Forward
Element interval for connection BC: 6 - 9 Reverse
Reference interval for connection AC: 13 - 17 Forward
Reference interval for connection BC: 16 - 19 Reverse

Site C connects from the following sites:
Pos: 0 1
--------------
Val: 0 1

Site C doesn't connects to any site:
Pos:
----
Val:

Site A connects to the following sites:
Pos: 0
---------
Val: 2

Site B connects to the following sites:
Pos: 0
---------
Val: 2
As expected, we can now instantiate a new gElement from the previous one. Sites and connections will be copied in the new one. Note that, since we are interested in maintaining a directed relation between two sites, if the new object has opposite strand with respect to the one we are copying connections from, it will maintain its origin and destination sites unvaried. Have a look at the output...
gElement element2(element,0,10,false,variations);
cout << element2 << endl;
// Now we can reproduce the steps done before to analyse the connections for this new element
//.. code omitted, same as previous snippet except that we are using element2 instead of element
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 10, 20)|
|Strand | Reverse|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: ACGTTTTTAT
-------------------------------------

element contains 3 sites and 2 connections

Element interval for connection AC: 7 -[ 9] Reverse
Element interval for connection BC: 1 - 8 Forward
Reference interval for connection AC: 13 - 17 Forward
Reference interval for connection BC: 16 - 19 Reverse

Site C connects from the following sites:
Pos: 0 1
--------------
Val: 0 1

Site C doesn't connects to any site:
Pos:
----
Val:

Site A connects to the following sites:
Pos: 0
---------
Val: 2

Site B connects to the following sites:
Pos: 0
---------
Val: 2
Lastly we can obtain sub-elements from a connection. Elements obtained in this way have a strand that depends on the connection orientation. Optionally, you can specify starting and ending offsets.
gElement elementAC=element.getElement(connAC[0]);
cout << "elementAC" << elementAC << endl;
gElement elementBC=element.getElement(connBC[0]);
cout << "elementBC" << elementBC << endl;

gElement elementAC2=element2.getElement(connAC[0]);
cout << "elementAC2" << elementAC2 << endl;
gElement elementBC2=element2.getElement(connBC[0]);
cout << "elementBC2" << elementBC2 << endl;
elementAC
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 13, 17)|
|Strand | Forward|
|Element Interval | [ 0, 4)|
|Element length | 4|
-------------------------------------
sequence: TTTA
-------------------------------------

elementBC
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 16, 19)|
|Strand | Reverse|
|Element Interval | [ 0, 3)|
|Element length | 3|
-------------------------------------
sequence: CGT
-------------------------------------

elementAC2
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 15, 17)|
|Strand | Forward|
|Element Interval | [ 0, 2)|
|Element length | 2|
-------------------------------------
sequence: TA
-------------------------------------

elementBC2
-------------------------------------
|Reference name | chr1|
|Reference Interval| [ 16, 19)|
|Strand | Reverse|
|Element Interval | [ 0, 7)|
|Element length | 7|
-------------------------------------
sequence: CGTTTTT
-------------------------------------

Positional features

We define as Positional Feature (PF) a property whose value changes along the positions of a genomic element. Positional features are not necessarily connected to sequence information. PFs can be represented using a gArray specialization. The GeCo++ library provides a way to implement algorithms for calculating PFs that can be used with gElements in order to make connect them to annotations more easily, and to keep track of the applied variations. Positional features are often calculated in a sliding sequence window. Let's consider the following example.

For each position X along the sequence we want to calculate the number of G/C in a window spanning [ X+shift-startOffset, X+shift-endOffset ]. The situation depicted in the above figure corresponds to startOffset=1, endOffset=2 and shift=-1. One could consider that either using startOffset=2, endOffset=1 or no shift at all could have lead to the same result. This remains true for this case, but this way of window parametrization is more general and allows for a wider range of cases. It also allows for features that do not need to be calculated in a window (startOffset and endOffset both equal to 0).
Calculation of PFs requires the implementation of Feature retrievers. In the following example we see how a feature retriever can be instantiated, assuming we already defined a GCRetriever as a gShortUnsigned specialization of the gFeatureRetrieverImplementation template (we will describe in the following section how this can be done). We then instantiate a gElement corresponding to the situation in the figure and use it to retrieve the feature.
In GeCo++, positional features are always calculated relative to an element interval, requiring you to provide an element as well as an element interval to the getSequence retriever method. As you can see, details about the window definition are passed to the constructor of the retriever implementation, while shift is a parameter of the getFeature function.
gFeatureRetriever<gShortUnsigned> retriever(GCRetriever(1,2));
gElement element("chr3",113743339,113743370,true,"CTTCCATATCTGCAGCTTCTCCTTCATCTTC");
cout << retriever.getFeature(element,element.getElementInterval(),-1) << endl;
Pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Val: [ 0] [ 0] 2 2 2 2 1 0 1 1 2 3 2 3 3 2 2 2 1 2 3 2 2 2 1 1 2 1 1 2 [ 0]
What is happening behind the scene is that the retriever determines which element interval is needed to calculate the required feature in the chosen interval. In this case, the interval where the feature should be calculated is [0,31). To take into account start and end offsets as well as shift, the needed interval will be [0-shift-startOffset,31-shift+endOffset) that is [-2 32). This interval is then passed to the virtual member "calculateFeature" of the retriever implementation (GCRetriever) as well as the element upon which calculations should be done. The definition of this virtual function is the main task required when programming feature retriever implementation.

Feature retrievers

Like sequence Retrievers, feature retrievers inherit from gArrayRetriever and need an implementation object to be instantiated. Unlike gSequenceRetriever, gFeatureRetriever is a template class and its implementation objects must be derived from the gFeatureRetrieverImplementation abstract base template class, which is also a template. Thus, when instantiating a gFeatureRetriever, one has to define its template type and to use an implementation object that specializes gFeaureRetrieverImplemenatation with the same type. gFeatureRetriever has a member function, getFeaure, which accepts as input parameters an element, an element interval and a shift. This function returns a gArray containing the feature values relative to the element in the specified interval. It calculates which element interval is needed to evaluate the feature values keeping into account the shift and the windows boundaries specified by the implementation object. Window definition is delegated to the feature retriever implementation, which has to specify startOffset and endOffset.
Like the gSequenceRetrieverImplementation we have seen before, the template class gFeatureRetrieverImplementation has the virtual destructor and clone functions that must be overloaded in derived objects. It also defines the virtual member function "calculateFeature", which is where the feature calculation takes place. As an example, we now introduce an implementation corresponding to the above figure.
class GCRetriever:public gFeatureRetrieverImplementation<gShortUnsigned>{
 private:
  //The virtual function that calculates the feature
  virtual bool calculateFeature(gArray<gShortUnsigned> & values,const gElement & element,const gElementInterval & interval) const;
  //The virtual function used to obtain a copy of this object
  virtual gArrayRetrieverImplementation<gShortUnsigned> * clone() const;

 public:
  //The constructor
  GCRetriever(gRelativePos startOffset,gRelativePos endOffset);
  //The virtual destructor
  virtual ~GCRetriever();
};
In this case, the constructor calls the base class specialization constructor by specifying the offsets. It is worth noting that the constructor of the base class requires three parameters, namely the two offsets and an isDoubleStrand parameter. This latter indicates to the retriever using this implementation whether the values of the features are valid on both strands or not. We will see in the next section how this information is used to avoid recalculations when features are used associated to gElements. The clone function returns a copy of this object, while the virtual destructor in this particular implementation does nothing.
GCRetriever::GCRetriever(gRelativePos  startOffset,gRelativePos endOffset):gFeatureRetrieverImplementation<gShortUnsigned>(startOffset,endOffset,true){
}

GCRetriever::~GCRetriever(){
}

gArrayRetrieverImplementation<gShortUnsigned> * GCRetriever::clone() const{
 return new GCRetriever(getStartOffset(),getEndOffset());
}
The calculateFeature function definition is more interesting. It accepts three parameters:
  • an array where results will be saved
  • the element upon which calculations will be performed
  • the element interval
To understand how this function works, remember that it is called by the gFeatureRetriever object. The interval, as well as the length of the values array, takes into account the offsets and the shift. Therefore, what this function is expected to do is to calculate the feature in the provided interval, according to the start and end offsets. In other words, the first startOffset positions as well as the last endOffset ones have been added to the interval by the gFeatureRetriever object in order to obtain valid values in the originally requested interval. This means that the first startOffset positions as well as the last endOffset ones should be marked as NA (actually values have already initialized to 0 NA values). The function receives as input the element which sequence information relative to the passed interval can be obtained from, as well as sites,connections, and possibly other features. Finally the function should return false if some problem occurred. In our example we obtain the sequence relative to the interval using the element getSequence function. Then we construct an array with values set to 1 at positions where either a C or a G is present. Finally we use the gArray getSum function, passing the window length and the start offset as "winlen" and "position" parameters respectively (see the Array chapter for details). getWindowLength() as well as getStartOffset() and getEndOffset() are defined in the gFeatureRetrieverImplementation base template class.
bool GCRetriever::calculateFeature(gArray<gShortUnsigned> & values,const gElement & element,const gElementInterval & interval) const{
 //Obtain the sequence corresponding to the specified interval
 gArray<gChar> seq=element.getSequence(interval);
 //Build an array that has 1 at G or C positions.
 gArray<gShortUnsigned> g_or_c=((seq=='C')||(seq=='G'));
 //Use the getSum function to calculate the number of "1" in a sliding window
 g_or_c.getSum(values,0,seq.getSize(),false,getWindowLenght(),getStartOffset());
 return true;
}
The mechanism of passing to the calculateFeature function an "extended" interval, allowing for complete calculation in the required one, is particularly useful when the sequence information can be obtained for the extended regions. In the previous example we instantiated a gElement providing direct sequence information. Since we requested GC calculation for the whole element interval, we were not able to calculate the values near the boundaries. If we instantiate the gElement using a sequence retriever, this doesn't happen. In the following example we perform the same task using a sequence retriever. As you may appreciate the result is now complete (i.e. no NA values are present in the results).
gElement element("chr3",113743339,113743370,true,gSequenceRetriever(gSequenceRetrieverExampleImplementation("/adat/database/hg18/",".txt"))); 
gFeatureRetriever<gShortUnsigned> retriever(GCRetriever(1,2));
cout << retriever.getFeature(element,element.getElementInterval(),-1) << endl;
Pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Val: 2 2 2 2 2 2 1 0 1 1 2 3 2 3 3 2 2 2 1 2 3 2 2 2 1 1 2 1 1 2 1

Features and genomic elements

As we have seen, PFs are calculated relative to element intervals. In many circumstances one could be interested in obtaining PF values relative to sub-elements (both varied and not) of another element. A way to obtain this result is to use the gFeatureRetriever getFeature member function we have seen. In the following example we instantiate two sub-element of an original gElement object, one with and one without variations. Then we calculate the GC content feature for both of them.
gElement element("chr3",113743339,113743370,true,gSequenceRetriever(gSequenceRetrieverExampleImplementation("/adat/database/hg18/",".txt")));
gFeatureRetriever<gShortUnsigned> retriever(GCRetriever(1,2));

cout << "The original element:" << element << endl;
cout << retriever.getFeature(element,element.getElementInterval(),-1) << endl;  

gElement sElement(element,0,10,true);
cout << "A sub-element:" << sElement << endl;
cout << retriever.getFeature(sElement,sElement.getElementInterval(),-1) << endl;

gVariations var(element);
var.addDeletion(element,gElementInterval(13,18));

gElement vElement(element,0,31,true,var);

cout << "A Varied sub-element:" << vElement << endl;
cout << retriever.getFeature(vElement,vElement.getElementInterval(),-1) << endl;

The original element:
-------------------------------------
|Reference name | chr3|
|Reference Interval| [ 113743339, 113743370)|
|Strand | Forward|
|Element Interval | [ 0, 31)|
|Element length | 31|
-------------------------------------
sequence: CTTCCATATCTGCAGCTTCTCCTTCATCTTC
-------------------------------------

Pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Val: 2 2 2 2 2 2 1 0 1 1 2 3 2 3 3 2 2 2 1 2 3 2 2 2 1 1 2 1 1 2 1

A sub-element:
-------------------------------------
|Reference name | chr3|
|Reference Interval| [ 113743339, 113743349)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: CTTCCATATC
-------------------------------------

Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 2 2 2 2 2 2 1 0 1 1

A Varied sub-element:
-------------------------------------
|Reference name | chr3|
|Reference Interval| [ 113743339, 113743370)|
|Strand | Forward|
|Element Interval | [ 0, 26)|
|Element length | 26|
-------------------------------------
sequence: CTTCCATATCTGCCTCCTTCATCTTC
-------------------------------------

Pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
--------------------------------------------------------------------------------------------------------------------------------------
Val: 2 2 2 2 2 2 1 0 1 1 2 3 3 3 3 3 2 2 2 1 1 2 1 1 2 1
In the above example, each time we call the getFeature function we recalculate the feature value in each position. This could be avoided, considering that the first sub-element is a portion of the element for which calculation have been already done and that, for the second one, the values corresponding to intervals [0,12) and [20,31) are not affected by the deletion. Re-using these values would save a significant amount of calculation. In order to avoid unnecessary recalculation, gElement objects provide a mechanism that binds together a feature and an element. Instead of using the retriever getFeature function, you can add a feature to the element by using the addFeature member function. In this way, whenever a new element is obtained as a sub-element, the feature is recalculated only where needed and the values that are still valid are copied in the new object.
gElement element("chr3",113743339,113743370,true,gSequenceRetriever(gSequenceRetrieverExampleImplementation("/adat/database/hg18/",".txt")));
gFeatureRetriever<gShortUnsigned> retriever(GCRetriever(1,2));
gArrayIndex featnum=element.addFeature<gShortUnsigned>("GC",retriever,-1);

cout << "The original element:" << element << endl;
cout << element.getFeature<gShortUnsigned>(featnum,element.getElementInterval()) << endl;

gElement sElement(element,0,10,true);
cout << "A sub-element:" << sElement << endl;
cout << sElement.getFeature<gShortUnsigned>(featnum,sElement.getElementInterval()) << endl;

gVariations var(element);
var.addDeletion(element,gElementInterval(13,18));

gElement vElement(element,0,31,true,var);

cout << "A Varied sub-element:" << vElement << endl;
cout << vElement.getFeature<gShortUnsigned>(featnum,vElement.getElementInterval()) << endl;
The original element:
-------------------------------------
|Reference name | chr3|
|Reference Interval| [ 113743339, 113743370)|
|Strand | Forward|
|Element Interval | [ 0, 31)|
|Element length | 31|
-------------------------------------
sequence: CTTCCATATCTGCAGCTTCTCCTTCATCTTC
-------------------------------------

Pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Val: 2 2 2 2 2 2 1 0 1 1 2 3 2 3 3 2 2 2 1 2 3 2 2 2 1 1 2 1 1 2 1

A sub-element:
-------------------------------------
|Reference name | chr3|
|Reference Interval| [ 113743339, 113743349)|
|Strand | Forward|
|Element Interval | [ 0, 10)|
|Element length | 10|
-------------------------------------
sequence: CTTCCATATC
-------------------------------------

Pos: 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------
Val: 2 2 2 2 2 2 1 0 1 1

A Varied sub-element:
-------------------------------------
|Reference name | chr3|
|Reference Interval| [ 113743339, 113743370)|
|Strand | Forward|
|Element Interval | [ 0, 26)|
|Element length | 26|
-------------------------------------
sequence: CTTCCATATCTGCCTCCTTCATCTTC
-------------------------------------

Pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
--------------------------------------------------------------------------------------------------------------------------------------
Val: 2 2 2 2 2 2 1 0 1 1 2 3 3 3 3 3 2 2 2 1 1 2 1 1 2 1

Transcripts

gTranscript objects represent annotated genomic alignments of known transcripts. These objects inherit from the gElement ones and take full advantage of element sites and connections to represent transcription start and polyadenilation sites, cds start and end, donor and acceptor sites as well as the corresponding intervals (5' and 3' utr, exons and introns). Some specialized methods have been added to access this information. Like for any other gElement inherited object, it is possible to:
  • add numerical features, tightly bound to element positions and calculated/retrieved by easy-to-develop objects (gFeatureRetriever)
  • derive new objects from already instantiated ones by applying a set of variations (a gVariations object).Objects obtained in this way will consistently map the annotated positions onto the new (variated) sequence keeping track of the possible positions shift effect of insertions and deletions. Added features will also be recalculated but only where variations make it necessary.
gTranscript is a specialization of the gElement class: therefore it doesn't introduce any new concept deserving thorough explanations. We demand to the reference manual the description of the methods implemented to obtain transcript specific information. In the next section of this tutorial, we are going to present a complete application which also uses gTranscript objects.

A complete example application

This last chapter will describe a complete application written using the GeCo++ library. In the introductory example we have seen how the GeCo++ library features make it possible to study the effect of genomic variations on a transcript. In that example we generated arbitrary variations, now we want to apply the same approach to the study of real world data.
With Next Generation Sequencing (NGS) it is now easy and affordable to obtain complete sequencing of individual genomic regions. Furthermore complete genomes are available (1000 Genomes Project) allowing the identification of genome-wide variations as well as genotypes. Multiple genotypes of the same region make haplotype imputation possible, using specialized tools (e.g. impute2: Howie BN, Donnelly P, Marchini J, 2009 A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genet 5(6)). This means that, for a given genomic region, it is now possible to obtain, with reasonable certainty, the exact sequence on each individual chromosome. This is definitely different and more interesting than analyzing the effect of one single variant at a time.
Large scale NGS experiments produce huge amounts of data, one possible strategy to store them in a compact form is to represent a resulting sequence reporting its differences from a reference one. The 1000 Genomes Project pilot paper has been published in October, 2010. Associated datasets are mainly constituted by vcf files. These files actually contains genotype information about any detected variant using different algorithms in different groups of individuals. By using the GeCo++ library, and in particular the gElement and gVariations classes, one can take full advantage of such kind of information by constructing objects representing fully annotated genomic elements. Variants can then be "injected" producing new objects, still coherently annotated.
In this application we want to study how different haplotypes affect the number of putative exonic binding sites for four SR splicing enhancer proteins. These proteins participate in the splicing of constitutive exons as well as in the regulation of alternative splicing (Wang J, Smith PJ, Krainer AR, Zhang MQ Nucleic Acids Res. Distribution of SR protein exonic splicing enhancer motifs in human protein-coding genes. 2005 Sep 7;33(16):5053-62. Print 2005.). These binding sites will be referred to as Exonic Splicing Enhancers (ESE). The program will take as input a transcript refSeq ID as well as the name of a vcf file, containing genotype information in the same region, relative to one or more individuals. We named the application "varese" from VARiant and ESE.

Header inclusion

At the beginning we need to include all the necessary headers. In this case we need:
  • File streaming capabilities to read from vcf files, for which we will use the STL file stream object
  • MySQL database connectivity to query a database for transcript annotations, in this case we will use the Mysql++ library
  • Our library
We declare we are using the corresponding namespaces.
#include <fstream>
#include <mysql++/mysql++.h>
#include <geco.h>

using namespace std;
using namespace mysqlpp;
using namespace geco;

Retrieving haplotype information

The Variant Call Format (Vcf 4.0 specification) is a text file format to hold variants information, with or without genotypes. Usually it is the result of re-sequencing experiments. Vcf is the format used by the 1000 Genomes Project. Note that a vcfTools suite exists, providing a C++ class that can handle any aspect of reading and writing vcf files.
In fact vcf files also contain information about variants frequency and sequence quality, which we are not interested to, for the purpose of this application. Since we only need genotype information and this is an example application for which we want to keep to a minimum the need for external software, we decided for the sake of clarity to read vcf files from the scratch. So, in order to extract haplotype information from .vcf files, we wrote a simple class that illustrate the usage of some library features. The class contains the following data members:
  • A vector of gString objects, which will be filled with subjects identifiers;
  • A vector of gVariation objects that will contain, for each subject, the variations between its first haplotype and the reference sequence.
  • A second vector of gVariation objects that will contain, for each subject, the variations between its second haplotype and the reference sequence.
Please note that here "first" and "second" are used solely to distinguish between the haplotypes of the two chromosomes. The class has no method other than a constructor.
Here is the class declaration:
class gVcf {
public:
    vector<gString> sub_names;
    vector<gVariations> v1;
    vector<gVariations> v2;
    gVcf(const gString & filename,const gElement & element);
};
The constructor is where the data is read from the vcf file. It accepts as a parameter the name of the file as well as a gElement object, used to select the relevant the vcf entries. Vcf files can contain metadata lines at the beginning: these lines are identified by a double ## at the beginning; the first line which is relevant to us is the header, starting with a single '#'. Data lines start after the header: each line represents a variant. They have 8 mandatory fields, namely chrom, pos, Id, Ref, Alt, Qual, Filter and Info. These fields contains information about the variant position (Chrom and pos), its alleles (Ref and Alt) and about the quality and methodology of the variant calling process. It also contains some information about its allele frequencies. If genotype data are present a further column labelled Format must be present in the header line, and it describes the genotypes format that follows. In this case the data line continues with a genotype description for each subject, whose identifiers must be present in the header line as column names. For a detailed description of the format please refer to the above link. Our function requires genotype information to be present. Please observe that, being this an example application, we do not perform extensive format checking. (We would instead use vcfTools in a real-world application).
gVcf::gVcf(const gString & filename,const gElement & element) {
    static char buffer[1000];
    gString chrom,Id,Alt,Qual,Filter,Info,Format;
    gPos pos;
    gSequence Ref;
    gReferenceInterval intv=element.getReferenceInterval();
    gString nref(element.getReferenceName(),3);

    ifstream fin(((string) filename).c_str());

    if (fin.is_open()) {
        //Skipping metadata
        do {
            fin.getline(buffer,1000,'\n');
        } while ((!fin.eof())&&(buffer[1]=='#'));
        //if header found
        if (!fin.eof()) {
            //construnct a gString skipping the first '#' character
            gString line=buffer+1;
            //split the header string to obtain a vector with field names
            vector<gString> header;
            line.split('\t',header);
            //fill sub_names with subject names and instatiate two gVariations each
            for (gSize i=9;i<header.size();i++) {
                sub_names.push_back(header[i]);
                v1.push_back(gVariations(element));
                v2.push_back(gVariations(element));
            }
            //reads data lines until an eof is found
            while (!fin.eof()) {
                //read the first 9 fields (we're interested here in chrom and pos)
                fin >> chrom >> pos >> Id >> Ref >> Alt >> Qual >> Filter >> Info >> Format;
                
                //stop the cycle if we reach the eof
                if(fin.eof()) break;
                //positions are 1 based in vcf files so we make them 0 based
                pos-=1;

                //check if the variant is inside our region otherwise skip the line
                if ((chrom==nref)&&(pos>=intv.getStart())&&(pos<intv.getEnd())) {
                    //construct a vector containing alleles for the current variant
                    vector<gString> alleles;
                    //add the referenfce allele (its index will be 0)
                    alleles.push_back(Ref);
                    //and the alternative ones (could be more than one)
                    Alt.split(',',alleles);
                    
                    //read the rest of the line and split it into subject's genotypes
                    fin.getline(buffer,1000,'\n');
                    line=buffer;
                    vector<gString> genotypes;
                    line.split('\t',genotypes);
                    // for each genotype (i.e. subject):
                    for (size_t i=1;i<genotypes.size();i++) {
                        // get the first allele (here we used the ASCII code subtracting the code 
                        // for '0' (48), this works because alleles are less than 10)
                        size_t g1=genotypes[i][0]-48;
                        // and the second one
                        size_t g2=genotypes[i][2]-48;
                        // if the first allele is not reference add the proper variation
                        if (g1>0) {
                            //ref and alt alleles have the same length: Substitution
                            if (alleles[g1].getSize()==alleles[0].getSize()) {
                                v1[i-1].addSubstitution(element.getReferenceName(),
                                                        gReferenceInterval(pos,pos+alleles[g1].getLength()),
                                                        alleles[g1]
                                                       );
                            //alt allele is longer than referernce one: Insertion
                            } else if (alleles[g1].getSize()>alleles[0].getSize()) {
                                v1[i-1].addInsertion(element.getReferenceName(),
                                                     pos,
                                                     gString(alleles[g1],1)
                                                    );
                            //alt allele is shorter than referernce one: Deletion
                            } else {
                                v1[i-1].addDeletion(element.getReferenceName(),
                                                    gReferenceInterval(pos+1,pos+alleles[0].getLength())
                                                   );
                            }
                        }
                        // do the same for the second allele
                        if (g2>0) {
                            if (alleles[g2].getSize()==alleles[0].getSize()) {
                                v2[i-1].addSubstitution(element.getReferenceName(),gReferenceInterval(pos,pos+alleles[g2].getLength()),alleles[g2]);
                            } else if (alleles[g2].getSize()>alleles[0].getSize()) {
                                v2[i-1].addInsertion(element.getReferenceName(),pos,gString(alleles[g2],1));
                            } else {
                                v2[i-1].addDeletion(element.getReferenceName(),gReferenceInterval(pos+1,pos+alleles[0].getLength()));
                            }
                        }
                    }
                } else fin.getline(buffer,1000,'\n');
            }
        }
    }
}

Retrieving sequence information

In order to obtain the sequence information, genomic elements (i.e. transcripts) need to be instantiated by passing a gSequenceRetriever object. In turn, sequence retrievers need an implementation to be instantiated. In this way you can write software that is independent from the way you store the sequence information. You can instantiate different implementations of the retriever without the need to change your code.
In this section we are going to see how to write a simple retriever implementation which fetches sequence information from a set of plain text files, stored somewhere in your filesystem (path) and named after chromosomes. Remember that you have to inherit your implementation object from the gSequenceRetrieverImplementation abstract base class by implementing:
  • At least a constructor to which you can pass the information relevant for your object
  • A virtual destructor: you must implement this one, even you have nothing to destroy (because of the virtual)
  • The clone method which must return an exact copy of your object
  • The getSequence_Internal member function, where you actually do the job
Here is the class declaration:
class gLocalTxtSequenceRetriever:public gSequenceRetrieverImplementation {
private:
    gString i_path;
    gString i_extension;
    virtual gArrayRetrieverImplementation<gChar> * clone() const;
    virtual gSequence getSequence_Internal(const gString & reference,gPos start,gPos end) const;
protected:
public:
    gLocalTxtSequenceRetriever(const gString & path,const gString & extension=".txt");
    virtual ~gLocalTxtSequenceRetriever();
};
The constructor of our retriever accepts as parameters the path to the files and their extension. This information is stored in two gString objects, which will be used to build the complete file name when a sequence is required. Note that, in order to use this retriever, one must use reference names (for example when instantiating a gElement) matching the chromosome names. In this example we use names constructed as "chr", followed by the chromosome number (or X or Y).
gLocalTxtSequenceRetriever::gLocalTxtSequenceRetriever(const gString & path,const gString & extension):gSequenceRetrieverImplementation() {
    i_path=path;
    i_extension=extension;
}
In this case the destructor has nothing to do but it still needs to be defined for C++ technical reasons (which is to allow the compiler to generate the correct entry in the virtual table).
The "clone" member function simply returns a pointer to a new instance of this same object. To do so (and this can be done for every object you implement), simply call the constructor passing the same values for the path and extension parameters.
gLocalTxtSequenceRetriever::~gLocalTxtSequenceRetriever() {
}

gArrayRetrieverImplementation<gChar> * gLocalTxtSequenceRetriever::clone() const {
    return new gLocalTxtSequenceRetriever(i_path,i_extension);
}
The member function getSequence_Internal is where you actually write the code to retrieve the sequence. It accepts three parameters as input: the sequence reference name (i.e. the name of the chromosome), the starting and ending positions of the interval [start,end) for which the sequence is required. It returns a gSequence object, filled with the required sequence.
In our example it simply tries to open the file whose name is matching the concatenation of path, name and extension. When the file is opened, it seeks for the correct position and reads in the correct number of characters. Then it instantiates a gSequence object providing the read data.
If the file is not opened or if doesn't contain the correct number of characters (i.e. you are reading beyond its eof) you might want to generate an exception. Actually, the library checks for the correct length of the returned sequence when it calls this method. If it doesn't match, a gException is automatically thrown.
gSequence gLocalTxtSequenceRetriever::getSequence_Internal(const gString & reference,gPos start,gPos end) const {
    gSequence ret;
    string filename=i_path+reference+i_extension;
    ifstream filein(filename.c_str());
    if (filein.is_open()) {
        gChar *Buffer=new gChar[end-start+1];
        Buffer[end-start]=0;
        filein.seekg(start,ios_base::beg);
        filein.get(Buffer,end-start+1);
        ret=gSequence(gString(Buffer));
        delete [] Buffer;
    }
    return ret;
}

Retreiving transcript annotation

There are various sources where transcript annotations can be extracted from. In this example we chose to use human refSeq transcripts as reported in the refGene table of the UCSC Genome Browser hg18 Annotation database. This table contains information about each refSeq transcript sequence alignements to the entire human genome. UCSC provides public access to their MySQL server, alternatively one can download it and setup a local database. What we need to do here is to extract from the database the information needed to invoke the constructor of the gTranscript object. What we need is the reference name (chromosome), two arrays containing exon start and end positions, the transcript strand relative to the reference and, if the transcript is coding, the positions where the coding sequence starts and ends. Therefore, we want to write a function which retrieves these data from the UCSC database for a given refSeq accession id, instantiates a gTranscript object and returns it to us.

To access MySQL databases we use an open Source Library (Mysql++) which is a C++ interface to the MySQL api. We are not going to explain in detail how the library works. To connect to a database one has to provide the MySQL server host and port, the desired database name and, when required, a username and a password. We define the following structure to pass this information to our object.
struct dbInfo {
    string host;
    int port;    
    string dbname;    
    string user;
    string passwd;

};
We define the function to accept the following parameters:
  • refSeqId: a gString reference containing the refSeq transcript id (NM_XXXXXXXX)
  • dbInfo: a dbInfo struct reference containing connection information
  • retriever: a sequence retriever
  • mode: the sequence handling mode
  • startOffset,endOffset: offsets to be included in the returned object
The last four parameters are required by the gTranscript constructor. The function catches exceptions mainly to have an opportunity to trap errors that can occur, and close and destroy any opened connection.
gTranscript getUCSCRefseq(const gString & refseqID,const dbInfo & info,const gSequenceRetriever & retriever,gElementSequenceMode mode,gPos startOffset,gPos endOffset) {
    //The object we will return
    gTranscript ret;

    //a Database conncetion
    Connection * conn=NULL;
    
    try {
        //Connects to the database
        conn= new Connection(info.dbname.c_str(),info.host.c_str(),info.user.c_str(),info.passwd.c_str(),info.port);
        
        //if succeeded
        if (conn) {
            //setup and run the query
            Query myquery = conn->query();
            myquery << "select ";
            myquery << " name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,exonCount,exonStarts,exonEnds ";
            myquery << "from ";
            myquery << " refGene ";
            myquery << "where ";
            myquery << " refGene.name='" << refseqID << "'";
            StoreQueryResult res = myquery.store();

            //if we have one and only one result
            if (res.size()==1) {
                //get the first row in the results
                Row row = *res.begin();
                
                //get the cds start,end, number of exons
                gPos cdsStart = atol(row.at(5).data());
                gPos cdsEnd = atol(row.at(6).data());
                gSize ecount = atoi(row.at(7).data());
                
                //declare exon start and end position arrays
                gArray<gPos> estart,eend;
                //exonStarts and exonEnds are given as text fields
                //with comma separated positions
                //copy them into two strings for further processing
                string es = row.at(8).data();
                string ee = row.at(9).data();
                
                //close and destroy the connection (we have extracted all we need)
                conn->disconnect();
                delete conn;
                //split the two position lists filling the estart and eend position arrays 
                for (int i=0;i<ecount;i++) {
                    size_t p1 = es.find_first_of(',');
                    size_t p2 = ee.find_first_of(',');
                    estart.setValue(i,atol(es.substr(0,p1).c_str()),false);
                    eend.setValue(i,atol(ee.substr(0,p2).c_str()),false);
                    es = es.substr(p1+1);
                    ee = ee.substr(p2+1);
                }

                //now we can instantiate a gTranscript
                ret=gTranscript(row.at(1).data(),estart,eend,strcmp(row.at(2).data(),"+")==0,retriever,mode,startOffset,endOffset,cdsStart,cdsEnd);
            } else {
                // if no results or more than one
                // close and destroy the connection
                conn->disconnect();
                delete conn;
                //throw appropriate exceptions
                if (res.size()==0) throw gException("Invalid UCSC refSeq identifier");
                else throw gException("Multiple match for refSeq identifier");
            }            
        } else {
          // the connection was not ok, we throw an exception to signal this condition
          throw gException("Unable to connect to the database");
        }
    } catch (exception & e) {
        //if something unexpected occurred, throwing a known exception object:
        //close and destroy the connection
        if (conn) {
            conn->disconnect();
            delete conn;
        }
        //throw a gException to signal what happened
        throw gException(e.what());
    } catch (...) {
        //if something unexpected occurred, throwing an unknown exception object:
        //close and destroy the connection
        if (conn) {
            conn->disconnect();
            delete conn;
        }
        //throw a gException to signal an unknown error
        throw gException("unknown error");
    }
    //return the object if no exception occurred
    return ret;
}

Calculating ESE scores

Exonic Splicing Enhancers (ESEs) are binding sites for SR proteins, which participate in both constitutive and alternative splicing. To identify putative binding sites along our transcript genomic sequence, we use a Positional Weight Matrix (PWM) approach. It consists in evaluating how well a given sequence of length n matches a frequency (or weight) matrix (of size 4 x n) which gives a score to each nucleotide for each position. The sum of these values (possibly normalized) is then used as the score for the given sequence. If we repeat this procedure for each sub-sequence of length n we obtain a series of values representing how well the sequence matches the consensus for that motif at any given position. A threshold is usually provided to identify putative binding sites.
Here we derive a class from the template specialization of gFeatureRetrieverImplementation to obtain a double valued (gScore is a typedef for double) positional feature which reports the score value if it is above the threshold, or 0 otherwise. Data members of the derived class are the matrix, the threshold and a "i_positions" array used internally by the algorithm. Similarly to what we have seen for the sequence retriever, in order to derive from gFeatureRetrieverImplementation one needs to define:
  • at least one constructor
  • a virtual destructor
  • a clone method
  • the calculateFeature method, where the algorithm will be actually implemented.
This is the class declaration:
class PWMScoreRetriever:public gFeatureRetrieverImplementation<gScore> {
private:
    gMatrix<gScore>       i_mat;
    gScore                i_threshold;
    gArray<gPos>          i_positions;
    virtual bool calculateFeature(gArray<gScore> & values,const gElement & element,const gElementInterval & interval) const;
    virtual gArrayRetrieverImplementation<gScore> * clone() const;

public:
    PWMScoreRetriever(const gMatrix<gScore> & mat,double threshold);
    virtual ~PWMScoreRetriever();
};
In the constructor we simply initialize the matrix, the threshold and the positions. This last is an array used by the algorithm for matrix columns indexing. Since it only depends on the matrix number of columns, it is instantiated only once here to save time,instead of in the calculateFeature method where it would be re-instantiated at each call.
PWMScoreRetriever::PWMScoreRetriever(const gMatrix<gScore> & mat,double threshold):gFeatureRetrieverImplementation<gScore>(0,mat.getColsNum()){
   i_mat=mat; 
   i_threshold=threshold;
   i_positions=getArray<gPos>(0,mat.getColsNum()-1,1);   
}
The virtual destructor has nothing to do, while as usual the clone function must return an exact copy of our object.
PWMScoreRetriever::~PWMScoreRetriever() {
}

gArrayRetrieverImplementation<gScore> * PWMScoreRetriever::clone() const {
    return new PWMScoreRetriever(i_mat,i_threshold);
}
The calculateFeature is a virtual member function that has to be defined in each object derived from gFeatureImplemantation, and it's where the actual algorithm must be implemented. It receives as input parameter a gArray, which is where results will be stored. It is passed already instantiated with the correct length and all its elements set to NA. Other input parameters are a reference to the element and one elementInterval for which calculations should be done. We do not pass the sequence corresponding to the interval since it can be retrieved using the element object which, in turn, could hold data needed by the algorithm (this is not the case here). Firstly we need to retrieve from the element the sequence corresponding to the interval. To do this, we use the getSequence() function. Let's say our PWM has N=7 columns (since the motif we are searching for is 7 bp long) we then need to select the first 7 bps from the sequence. Once we obtained the first sub-sequence (e.g. ACGTGTG), we should extract from the PWM the following values:

A->i_mat(0,0)
C->i_mat(1,1)
G->i_mat(2,2)
T->i_mat(3,3)
G->i_mat(2,4)
T->i_mat(3,5)
G->i_mat(2,6)

The sub-sequence score will be the sum of these values. The process must be repeated for each sub sequence obtained starting from position 0, 1, 2... and so on.
To obtain the correct row index from the sequence data we use the following strategy. Since gSequence objects are ultimately arrays of chars, we use our sequence object for indexing into a 256 elements gPos array (the number of values a char can assume). The values of the array will be 0,1,2 and 3 for the positions corresponding to the ASCII codes of A,C,G and T (or a,c,g and t) respectively. All the other values will be marked as NA. In this way we can easily obtain the row index array by indexing this array with the sequence itself. We will construct this index sequence by indexing the array with the whole sequence and then subsetting the resulting index array to obtain sub-sequence indices. Thus we can obtain the score with a single line of code, taking advantage of the gMatrix multiple values indexing member function.
// here we construct the "ast" array containing row positions 0,1,2,3 corresponding to A,C,G T or a,c,g,t ASCII codes
// this is global and static
static const gPos char_map[256]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
static const gPos char_map_NA[248]={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,66,68,69,70,72,73,74,75,76,77,78,79,80,81,82,83,85,86,87,88,89,90,91,92,93,94,95,96,98,100,101,102,104,105,106,107,108,109,110,111,112,113,114,115,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255};
static const gArray<gPos> ast(char_map,256,gArray<gPos>(char_map_NA,248));

bool PWMScoreRetriever::calculateFeature(gArray<gScore> & values,const gElement & element,const gElementInterval & interval) const {
    //get the sequence in the required interval
    gSequence seq=element.getSequence(interval);
    
    //build the whole sequence row indices array
    gArray<gPos> iseq=ast[seq];
    
    //the number of positions
    gSize mlength=i_mat.getColsNum();    
    
    //pre-allocate (to save time) the array that will contain the score values for each matrix column position
    gArray<gScore> vals(0,mlength,false,false);
    
    //pre-allocate (to save time) the single value array which will contain the (possibly NA) sum of the columns values.
    gArray<gScore> sum(0,1,false,false);

    //cycle through sequence positions
    for (gPos i=0;i<seq.getLength()-mlength;i++) {
      //get the sum of columns score for the current sub-sequence
      i_mat(vals,gArray<gPos>(iseq,i,i+mlength),i_positions).getSum(sum);
      if(!sum.isNA(0)){ // if the sum is not NA (please note that sum can be NA if the iseq 
                        // sub-interval contais some NA value: for example if the sequence 
                        // contais some character different from a,c,g or t)
        //set the value at position i
        if (sum[0]>i_threshold) values.setValue(i,sum[0],false); //to the score if it's above threshold
        else values.setValue(i,0,false); //to 0 otherwise
      } //we don't need to set the value to NA here, since values come already 
        //instantiated with anything set to NA
    }
    return true;
}
This is one of the possible implementations of the PWM algorithm. It has the advantage to be concise and to show some of the advanced features of the gArray hierarchy of classes. We also made every effort to make it efficient. Nevertheless, concision has a cost. In some conditions, especially when extensive computing and/or memory resources are needed it is certainly better to work at a lower level of abstraction using simple native C/C++ types and pointers.
For this reason, from any gArrayRetrieverImplementation derived object you have access to gArrays (arrays, matrices, strings and sequences) raw data through the protected member template function getArrayDataPtr. This means that, when needed, one could use raw data to implement algorithms in the most efficient way.

Putting all together

At this point we have everything we need to write our example program. We will provide as command line parameters a refSeq accession id and the name of a vcf file. After reading these parameters the program fetches the transcript information instantiating a gTranscript object, it then adds to the transcript a feature for each ESE, reads the haplotype data from the provided file and calculates for each haplotype the number of ESEs found in the exons. Finally, if the ESE counts differ from the "reference" transcript, it outputs the information to standard out. We need to instantiate some global static variables, containing ESEs matrices and threshold information (this could be generalized by reading the required data from an external file)
//  SF2/ASF
static const double valEse1[28]={
    -1.139660, 0.621677, -1.584962, 1.323616, -1.584962, -1.584962,  0.621677,
    1.365525,-1.103778,  0.725053, 0.330068,  0.938765, -1.584962, -1.584962,
    -0.211782, 0.174455,  0.478823,-1.584962,  0.334650,  0.992775, -0.105105,
    -1.584962,-0.500847, -1.584962,-1.127254, -1.584962, -1.127254,  0.268437
};
const double threshold_ese1 = 1.956;    
    
// SRp40
static const double valEse2[28]={
    -0.130243, -1.584962,  1.276539, -0.326247,  0.969599, -0.130243, -1.584962,
     0.562681,  0.681752, -1.119013,  1.242778, -0.767378,  0.131965, -0.045863,
    -1.584962, -0.139795, -1.332258, -0.481546, -1.584962,  0.437586,  0.796774,
     0.919376,  0.369696,  0.229264, -1.141854,  0.723873, -3.169924, -0.130243
};
const double threshold_ese2 = 2.670;
    
// SRp55
static const double valEse3[24]={
    -0.662469,  0.110350, -0.662469,  0.110350, -1.584962,  0.610874,
     0.393557, -1.584962,  1.477196, -1.584962, -1.584962,  0.976113,
    -1.584962,  0.719010, -1.584962,  0.719010,  0.206371, -0.787198,
     1.224306, -1.584962, -0.074500, -1.584962,  1.020243, -1.584962
};
const double threshold_ese3 = 2.676;    
    
// SRp40
static const double valEse4[32]={
    -0.882950,  0.090266, -0.058735, -1.584962,  0.090266, -0.412767, -0.058735,  0.225310,
    -1.159392, -1.584962,  0.951947,  1.113179,  0.562681,  0.864029,  0.319936, -1.584962,
     0.869122,  0.451118, -1.355627, -1.584962, -0.334998, -0.051280, -1.355627,  0.675206,
    -1.180520, -0.196560,  0.383015,  0.882338, -0.196560, -0.864986,  0.964376, -1.584962
};
const double threshold_ese4 = 2.383;
Here is the completely annotated code of the main function of our program; you can also see a part of the output.
int main(int argc, char **argv) {
  
  // Declare variables to hold command line parameters
  gString transcriptId,vcfFile;

  // Command line parmater parsing
  if(argc==3){ //If the correct number of parameters have been specified

    //read the transcript ID
    transcriptId=argv[1];
    
    //and the vcf file name
    vcfFile=argv[2];
    
  }else{ //if the number of parameters were not correct

    //signal the error condition and exit 
    throw gException("Usage: varese transcriptId vcf_file");
  }
  
  // Instantiate a sequence retriever
  // (please consider that we have not parametrized this, although it could be done)
  gSequenceRetriever retriever(gLocalTxtSequenceRetriever("/adat/database/hg18/",".txt"));
  
  // UCSC MySQL server connection information
  // (this could be parametrized, too)
  dbInfo info;
  info.dbname="hg18";
  info.host="genome-mysql.cse.ucsc.edu";
  info.user="genome";
  info.passwd="";
  info.port=0;
  
  // Get a gTranscript object matching the ID 
  gTranscript refTtranscript=getUCSCRefseq(transcriptId,info,retriever,gRef,0,0);
  
  // Now we construct a retriever for ESE1
  PWMScoreRetriever ese1Retriever(gMatrix<gScore>(4,7,valEse1),threshold_ese1);
  
  // and add to the transcript a corresponding feature
  refTtranscript.addFeature<gScore>("ese1",ese1Retriever);
  
  //this could be done with a single line like we do for the rest of the ESEs...
  refTtranscript.addFeature<gScore>("ese2",PWMScoreRetriever(gMatrix<gScore>(4,7,valEse2),threshold_ese2));
  refTtranscript.addFeature<gScore>("ese3",PWMScoreRetriever(gMatrix<gScore>(4,6,valEse3),threshold_ese3));
  refTtranscript.addFeature<gScore>("ese1",PWMScoreRetriever(gMatrix<gScore>(4,8,valEse4),threshold_ese4));

  // To count the number of ESEs in the reference transcript exons
  // we instatinate a matrix with n# exons rows and 4 (ESE) columns
  gMatrix<int> ref_ese_counts(refTtranscript.getExonCount(),4);
  
  // then for each exon
  for (gArrayIndex e=0;e<refTtranscript.getExonCount();e++) {
    
    // and for each ese type
    for (gArrayIndex m=0;m<4;m++) {
      
      // we get the exon interval... 
      gElementInterval ref_eintv=refTtranscript.getExonInterval(e);
      
      //the ese scores in the exon interval...
      gArray<gScore> ref_ese=refTtranscript.getFeature<gScore>(m,ref_eintv);
      
      // the positions relative to the exon interval of the non-zero scores... 
      gArray<gPos> ref_ese_pos= which(ref_ese > 0);
      
      // and finally we save their value in our matrix
      ref_ese_counts.setValue(e,m,ref_ese_pos.getSize(),false); 
    }
  }

  // Now we extract haplotype information from the vcf file
  gVcf haplotypes(vcfFile,refTtranscript);

  // We declare an auxiliary variable to store the number of ESE
  int var_ese_count;
  
  
  // For each subject
  for (size_t i=0; i<haplotypes.sub_names.size();i++) {

    // Output the subject's ID
    cout << haplotypes.sub_names[i] << endl;
    cout << "-------------------" << endl;
    
    // For the first haplotype
    cout << "H1" << endl;
    
    // we contruct the "varied" gTranscript: it will inherit all the refTranscript features 
    // which will be automatically recalculated where needed (i.e. where the variations occur)
    gTranscript v1(refTtranscript,haplotypes.v1[i]);
    
    // Then, using the varied transcript, for each exon and ESE type we do the same we've done 
    // before for the reference transcript (this time, we write the code in an equivalent but more 
    // compact form)
    for (gArrayIndex e=0;e<v1.getExonCount();e++) {
      
      for (gArrayIndex m=0;m<4;m++) {

        // Get the ESE count for the H1 transcript
        // (note that we do not have to bother about indels effect on exon positions: 
        // this is automatically handled by the gElement object)
        var_ese_count=which(v1.getFeature<gScore>(m,v1.getExonInterval(e))>0).getSize();

        // if the count differs from the reference...
        if (var_ese_count!=ref_ese_counts(e,m)) { 
          
          // output the curren exon number
          cout << "\tE" << e+1 << " ";
          
          // the difference in ESE count / the reference ESE count
          cout << var_ese_count-ref_ese_counts(e,m) << "/" << ref_ese_counts(e,m);
          
          // the ESE type
          cout << " ESE" << m+1 << endl;
        }
      }
    }
    
    // Same for the other haplotype
    cout << "H2" << endl;
    gTranscript v2(refTtranscript,haplotypes.v2[i]);
    for (gArrayIndex e=0;e<v2.getExonCount();e++) {
      for (gArrayIndex m=0;m<4;m++) {
        var_ese_count=which(v2.getFeature<gScore>(m,v2.getExonInterval(e))>0).getSize();
        if (var_ese_count!=ref_ese_counts(e,m)) {
          cout << "\tE" << e+1 << " ";
          cout << var_ese_count-ref_ese_counts(e,m) << "/" << ref_ese_counts(e,m);
          cout << " ESE" << m+1 << endl;
        }
      }
    }
    
    cout << "-------------------" << endl << endl;
  }

  //We're done
  return 0;
}
> varese NM_003594 example.vcf
NA18486
-------------------
H1
H2
E1 -1/2 ESE4
E5 2/35 ESE1
E5 1/53 ESE2
E5 1/17 ESE3
-------------------

NA18489
-------------------
H1
H2
E2 -1/6 ESE3
E2 -1/6 ESE4
-------------------

NA18498
-------------------
H1
E2 -1/6 ESE3
E2 -1/6 ESE4
H2
E1 -1/2 ESE4
-------------------

NA18499
-------------------
H1
E1 -1/2 ESE4
E23 -1/30 ESE1
E23 -1/76 ESE2
E23 -1/31 ESE3
H2
E1 -1/2 ESE4
E19 1/1 ESE1
-------------------

NA18501
-------------------
H1
E2 -1/6 ESE3
E2 -1/6 ESE4
H2
E5 2/35 ESE1
E5 1/53 ESE2
E5 1/17 ESE3
-------------------

...output truncated...
In this example we wrote an application which retrieves transcript annotations from one of the most used annotation database (UCSC), and is able to fetch the corresponding sequence, read individual haplotypes information from a widely used data format (vcf) and implement an ESE putative binding sites identification algorithm. The application then integrates these information to show how different haplotypes may affect ESE binding sites.
This is definitely not the most complex application one can imagine in the genomic field, but it is complex enough to demonstrate how the use of the GeCo++ library can make things much faster and easier.

R/bioconductor comparison

The same results obtained by running our example C++ application can be obtained using other packages. In particular we wanted to compare our program with an R script. To implement this script we used two packages available from Bioconductor to retrieve transcript annotations and get genomic sequences. You can find the source of this script here. For a better comparison we analyzed separately five sub-tasks of our program to point out the main differences.
Vcf parsing. We decided to use an R function to read tabular data from text files and to automatically build an in memory table. This is a function that is lacking natively in C++. The table is then analyzed to extract both subjects and haplotypes. This step could be dangerous. If the vcf file is a big one (e.g. 1000 Genomes Project huge files, containing haplotypes for all chromosomes) you could easily run out of memory. Our C++ program reads the file line by line. This is a potentially slower approach, which conversely has the advantage to allow you to retain only relevant information.
Sequence retrieving. In this case we used the Bioconductor package BSgenome.Hsapiens.UCSC.hg18 that contains the whole UCSC human genome hg18 assembly. It can be easily used to get the required sequences. In this case we didn't need to write any particular code other than the call of the appropriate function.
Transcript annotation retrieving. This is the module that differs more from the C++ implementation. We used the Bioconductor package GenomicFeatures. First of all by using this package you need to download the whole refGene table from the UCSC database server. The package provides functions to do this and save the table in a local sqlite database. Since we only needed one transcript information in our script, we used the package to download and save the table in alocal file. In the script we simply load the sqlite table from the file system and query it to obtain our transcript annotation. Conversely, the C++ program directly queries the UCSC Server to retrieve only the required ID information. In the R version we get the transcript coordinates with exon boundaries. The result of the corresponding C++ function is a gTranscript object. Such an object doesn't exist in the R/Bioconductor package.
ESE score calculation. There is no particular observation on the differences between the two implementations of this module.
Main program. In the R environment (bioconductor), the lack of the gTranscript methods to manage variations made it necessary to perform some more complex steps. We had to apply variations to the transcript sequence. Since insertions/deletions introduce changes in exons boundaries positions we decided to limit the application of variations to the exon sequences only. This is not necessary using the gTranscript object which takes care of this, providing an easy way to map annotation positions between the original and the "varied" objects.
In the following table you can find the number of code lines written for the two implementations

Task C++R
Vcf parsing: 6777
Sequence retrieving: 342
Transcript annotation retrieving: 647
ESE score calculation:40 35
Static data (i.e. ESE matrices): 2819
Main program:70 141
Total:303 281

As you may appreciate, there are no significant differences, except for what concerns both sequence and annotation retrieving modules, for which Bioconductor already provided modules. Nevertheless, in these cases too, the number of C++ lines is not particularly high, especially if you consider that we created two new modules, also useful for other applications, which produce a much more feature rich result (a gTranscript instead of the raw list of exons boundary positions). Furthermore, the resulting application in the C++ case is more flexible. For instance, you could easily evaluate ESE count variations in introns. It's also worth nothing that the library takes care of recalculating ESE scores only where variations occur, while the R version simply recalculates everything for each haplotype. If you wanted to have either the exact position of where an ESE has been disrupted, or introduced by a variation, or which variation would be responsible for such a difference, you should write a lot of extra R code. Finally, as an example, we measured the running time of both applications to analyze the transcript NM_003594 using an example vcf file which contains genotypes coming from the 1000 Genomes Project SNPs and Indels data. It took 6 and 166 seconds to run the C++ and R version, respectively.

Performance tests

As reported in the Array section of this tutorial, gArrays have been implemented with two important features. Memory efficient array subsetting allows for the instantiation of array subsets without memory duplication. Moreover, an optimized bits array class (gBitsArray) makes it possible to mark any array element with a boolean flag, usually used to represent NA values. Since this second mechanism could increment memory usage, we wanted to have an estimate of how gArray objects can actually help to save memory and, at the same time, verify the impact of the time overheads introduced by both features. To this purpose, we report the following simple application. It takes four parameters: the length of the array to allocate, the degree of duplication (0=no duplication, 1=complete duplication), the degree of fragmentation by which duplication is achieved (0=no fragmentation, 1=maximum fragmentation). To make an example: if you pass 100 as the array length and 0.5 as duplication a gArray of length 100 will be instantiated with some NA values and 50 positions will be duplicated. The way this is achieved depends on the frag parameter. For frag equal to 0 one array of length 50 will be instatiated. Conversely for frag equal to 1 50 arrays of length 1 will be instantiated. In both cases, the first 50 elements from the original array will be used.
The last parameter determines the implementation used.
Mode=1 - Standard template vector are used to instantiate the arrays with no NA tracking.
Mode=2 - Standard template vector are still used but NA values will also be considered.
Mode=3 - gArrays will be used
The application code is reported in the following box.
#include <cstdlib>
#include <iostream>
#include <geco.h>

using namespace std;
using namespace geco;

int main(int argc, char **argv) {
  bool dummy;
  //Parameters parsing
  if(argc==5){
    srand(1);
    unsigned len=atol(argv[1]);
    float dup=atof(argv[2]);
    float frag=atof(argv[3]);
    int mode=atol(argv[4]);
    
    unsigned wlen,nwin;
    
    //calcuate number and length of sub-arrays (windows)
    if(dup==0){
      nwin=0;
      wlen=0;
    }else{
      nwin=max((unsigned) ((float) len*dup*frag),(unsigned) 1);
      wlen=(float) len * dup / (float) nwin;
    }
    
    //for mode 1
    if(mode==1){
      //Instantiate and populate the array
      vector<unsigned> array(len,0);
      for(unsigned i=0;i<len;i++){
        unsigned val=rand() % 1000 +1;
        array[i]=val;
      }
      
      // declare a vector to hold copies
      vector< vector<unsigned> > sub_arrays;
      // instantiate the sub-arrays
      for(size_t i=0;i<nwin;i++){
        sub_arrays.push_back(vector<unsigned>(array.begin()+i,array.begin()+i+wlen));
      }
      
   
    }else if(mode==2){
      
      vector<unsigned> array(len,0);
      vector<bool> Nas(len,0);
      for(unsigned i=0;i<len;i++){
        unsigned val=rand() % 1000 +1;
        bool isNA= (rand() % 1000)<5;
        array[i]=val;
        Nas[i]=isNA;
      }
      
      vector< vector<unsigned> > sub_arrays;
      vector< vector<bool> > sub_nas;
      for(size_t i=0;i<nwin;i++){
        sub_arrays.push_back(vector<unsigned>(array.begin()+i,array.begin()+i+wlen));
        sub_nas.push_back(vector<bool>(Nas.begin()+i,Nas.begin()+i+wlen));
      }

    }else if(mode==3){
      
     gArray<unsigned> array(0,len,false);
     for(unsigned i=0;i<len;i++){
       unsigned val=rand() % 1000 +1;
       bool isNA = (rand() % 1000)<5;
       array.setValue(i,val,isNA);
     }
    
     vector< gArray<unsigned> > sub_arrays;
     for(gPos i=0;i<nwin;i++){
       sub_arrays.push_back(gArray<unsigned>(array,i,i+wlen));
     }
      
    }
    return 0;    
  }else{
    cout << "Usage: window_test len %win %dup mode" << endl;
    return 1;
  }
}
The application has been executed in the three modes for lengths corresponding to 10, 100, 1000, 10000 and 100000 array elements. For each length duplication was set to 0, 0.25, 0.5, 0.75 and 1.00. For each duplication degree fragmentation was set to 0, 0.001, 0.01, 0.1, 0.25, 0.5 and 1.00. Execution time and peak memory were measured for each execution using time and valgrind|massif, respectively. To measure time, the application was run 10 times for each parameter configuration and the mean of execution time was calculated.
The results are reported in the following figure:
The blue lines represent the space/time ratios at different lengths between implementation 1 (vector, no NA) and 3 (gArray) while the red ones between implementations 2 (vector, NA) and 3 (gArray). Duplication values (pink shaded) increase horizontally while fragmentation values (gree shaded) increase vertically.



For what concerns memory, you can see that if no duplication occurs the ratio is very close to 1 in both comparisons. For increasing degree of both duplication and fragmentation there is a clear increase of the ratios. This indicates a better performance of gArray over the two vector implementations with and without NAs that reach 3 and 1.5 times more memory used than the gArray one, respectively. Execution times are comparable except than for high values of both duplication and fragmentation, for which gArray performs better.