GRyCAP - GNU Backward Search Aligner

An inexact mapping algorithm compatible with any backward search approach

Introduction

New high-throughput sequencers are able to produce data at an unprecedented scale, while sequencing costs are in free fall. Primary data processing, which often includes mapping short reads onto a reference genome, is computationally very expensive. Recently, a variety of programs based on backward search methods have been developed for such task. Despite the efficiency of the FM-Index, the general view is that new approaches will be necessary to cope with the increasing flood of sequencing data.

GBSA proposes three approaches to backward search sequence mapping:

Datasets

The BWT indexes and datasets employed to measure the performance of this tool are availabe here.

API Documentation

Work in progress.

System Requirements

Compile instructions

When compiling the tools do not forget to check in the makefile that the preprocessor variables -DFM_COMP_64 and -DSA_32 are present, in order to enable 32/64 bits compression of the FM-Index and 32/64 bits variables to store the positions of the Suffix Array. To enable the csalib backend compile with the preprocessor variable -DCSALIB_SEARCH. The preprocess program needs to activate OpenMP with -fopenmp also.

Both preprocess and search must have been compiled with the same options in order to use the generated index with the search tool.

Command line options

preprocess

./preprocess reference_file output_directory ratio duplicate_reverse nucleotides

reference_file, input file, in FASTA format, containing the reference to be indexed (up to 6GB references are supported)

output_directory, existing directory where the index files will be stored

ratio, the compression ratio of suffix array and its inverse (S and R).

duplicate_reverse, if 1 concatenates de reverse strand genome and computes the double sized index to search reverse and strand at the same time (for the human genome check that you compile the code with the flag -DSA_64, at the moment this takes a lot of memory). Csalib indexes need this option enabled.

nucleotides, the nucleotides present in the reference, in order.

For example,

./preprocess genome.fa index/ 8 0 ACGT

will index genome.fa and store the output files in directory "index", vectors S and R will be 8x compressed, the duplicate reverse will not be appended and the nucleotides encoding in this reference will be ACGT - 0123. Currently there exist two versions of this tool old for our BWT implementation and new for the csalib implementation

inexact_search

./inexact_search mappings_file index_dir output_file notfound_file duplicate_reverse search_tree_size num_errors min_fragment nucleotides

mappings_file The file with the reads

index_dir The directory with the index files

output_file, file where the mappings found will be stored

notfound_file, file where the mappings not found will be stored

duplicate_reverse, if 1 the index of the reference and the concatenated duplicate reverse was stored in index_dir.

search_tree_size, the search tree size used for the Breadth First Search exploration (an small value increases performance but decreases the sensitivity).

num_errors, the maximum number of errors allowed during the search.

min_fragment, the minimum fragment size used by the exploration bounding techniques (does not affect sensitivity but an small value decreases performance).

nucleotides, the nucleotides present in the reference, in order.

For example,

./inexact_search drosophila.fa drosophila_dbwt found 0 10000 3 17 ACGT

will map reads in drosophila.fa using the preprocessed index in folder drosophila_dbwt, which does not include the reverse. The reads found will be stored in the file "found", with a search_tree_size of 10000 elements, allowing up to 3 errors and a minimum segment size of 17 nucleotides (for the human genome 30 is a good value). The nucleotides encoding in this reference will be ACGT - 0123.

search_gpu

./search_gpu mappings_file index_dir output_file notfound_file num_errors nucleotides

mappings_file The file with the reads

index_dir The directory with the index files

output_file, file where the mappings found will be stored

notfound_file, file where the mappings not found will be stored

num_errors, the maximum number of errors allowed during the search (only 0 or 1).

nucleotides, the nucleotides present in the reference, in order.

For example,

./search_gpu drosophila.fa drosophila_dbwt found notfound 1 ACGT

will map reads in drosophila.fa using the preprocessed index in folder drosophila_dbwt. The reads found will be stored in the file "found", the reads not found in file "notfound", allowing up to 1 errors. The nucleotides encoding in this reference will be ACGT - 0123.

Currently this is only a test program and it needs a configuration with two nvidia graphic cards in order to work. It does not support duplicate reference indexes at the moment.

Notice also that in all the tools the encoding for the four bases should be indicated. For genomes with less bases please check the reverse strand functions in file "search/csafm.c", and please do not forget to call function init_replace_table when embedding the source code.

Contact information

GBSA has been developed at "Instituto de instrumentación de la imagen molecular" (UPV) Valencia, Spain in collaboration with "Centro de Investigación Príncipe Felipe" Valencia Spain and the support of Bull Inc. An early publication on the implementation of this project in GPU has been already published. José Salavert Torres is the main developer, Ignacio Blanquer Espert is the Project supervisor.

The code from csalib library included in this project is provided by "National Institute of Informatics" Tokyo, Japan under the MIT license. Please contact professor Kunihiko Sadakane.

The preprocess command employs the algorithm and code proposed in "Daisuke Okanohara, Kunihiko Sadakane. A Linear-Time Burrows-Wheeler Transform Using Induced Sorting. In Proc. of SPIRE, LNCS 5721, pp. 90-101, 2009"

.

Licenses

GBSA is released under the GNU/LGPLv3 license terms, csalib and dbwt are released under the MIT license terms.