Introduction

New high-throughput sequencers are able to produce data at an unprecedented scale, while sequencing costs are in free fall. Primary data processing, which often includes mapping short reads onto a reference genome, is computationally very expensive. Recently, a variety of programs based on backward search methods have been developed for such task. Despite the efficiency of the FM-Index, the general view is that new approaches will be necessary to cope with the increasing flood of sequencing data.

GBSA proposes three approaches to backward search sequence mapping:

The first one takes advantage of CUDA GPGPU processors. We have implemented BTW on GPU, allowing alignments of up to one error. Also, we parallelised the input/output operations on the CPU and the GPU execution of the mapping process. Our experiments show that this heterogeneous and really cheap solution outperforms by 3x the conventional algorithms based only on CPU.
The second one is a CPU implementation of an inexact mapping algorithm compatible with any backward search method. By using optimised search tree exploration techniques we achieve unprecedented results. Our solution outperforms by a factor of 7x similar algorithms allowing 3 errors on 250bps reads. Also, when dealing with 400bps reads it can be used as a preprocessing that accelerates modern aligners by 20-40%.
The third one uses csalib to provide an out-of-core implementation of the reference index, being able to map reads using only 100MB of RAM. The interfaces used to provide compatibility with csalib can be used to employ other index implementations.

Datasets

The BWT indexes and datasets employed to measure the performance of this tool are availabe here.

API Documentation

Work in progress.

System Requirements

Hardware: 7GB of RAM for human genome index calculation. The GPU test program requires two graphic cards, a more complete version is in development. The RAM required for searching against the human genome depends on the suffix array compression ratio and the number of errors, with compression ratio 32x and support for more than one error it needs about 7GB. This compression does not affect the search algorithm but can affect the speed when mapping great quantities of small seeds. GPUs with at least 3GB of RAM are needed for the human genome. If compiled to run with the csalib backend the total memory requirements are below 200MB, but GPU are still not supported in this operation mode
Software: 64-bit POSIX System

Compile instructions

When compiling the tools do not forget to check in the makefile that the preprocessor variables -DFM_COMP_64 and -DSA_32 are present, in order to enable 32/64 bits compression of the FM-Index and 32/64 bits variables to store the positions of the Suffix Array. To enable the csalib backend compile with the preprocessor variable -DCSALIB_SEARCH. The preprocess program needs to activate OpenMP with -fopenmp also.

Both preprocess and search must have been compiled with the same options in order to use the generated index with the search tool.

Command line options

preprocess

./preprocess reference_file output_directory ratio duplicate_reverse nucleotides

reference_file, input file, in FASTA format, containing the reference to be indexed (up to 6GB references are supported)

output_directory, existing directory where the index files will be stored

ratio, the compression ratio of suffix array and its inverse (S and R).

duplicate_reverse, if 1 concatenates de reverse strand genome and computes the double sized index to search reverse and strand at the same time (for the human genome check that you compile the code with the flag -DSA_64, at the moment this takes a lot of memory). Csalib indexes need this option enabled.

nucleotides, the nucleotides present in the reference, in order.

For example,

./preprocess genome.fa index/ 8 0 ACGT

will index genome.fa and store the output files in directory "index", vectors S and R will be 8x compressed, the duplicate reverse will not be appended and the nucleotides encoding in this reference will be ACGT - 0123. Currently there exist two versions of this tool old for our BWT implementation and new for the csalib implementation

inexact_search

./inexact_search mappings_file index_dir output_file notfound_file duplicate_reverse search_tree_size num_errors min_fragment nucleotides

mappings_file The file with the reads

index_dir The directory with the index files

output_file, file where the mappings found will be stored

notfound_file, file where the mappings not found will be stored

duplicate_reverse, if 1 the index of the reference and the concatenated duplicate reverse was stored in index_dir.

search_tree_size, the search tree size used for the Breadth First Search exploration (an small value increases performance but decreases the sensitivity).

num_errors, the maximum number of errors allowed during the search.

min_fragment, the minimum fragment size used by the exploration bounding techniques (does not affect sensitivity but an small value decreases performance).

nucleotides, the nucleotides present in the reference, in order.

For example,

./inexact_search drosophila.fa drosophila_dbwt found 0 10000 3 17 ACGT

will map reads in drosophila.fa using the preprocessed index in folder drosophila_dbwt, which does not include the reverse. The reads found will be stored in the file "found", with a search_tree_size of 10000 elements, allowing up to 3 errors and a minimum segment size of 17 nucleotides (for the human genome 30 is a good value). The nucleotides encoding in this reference will be ACGT - 0123.

search_gpu

./search_gpu mappings_file index_dir output_file notfound_file num_errors nucleotides

mappings_file The file with the reads

index_dir The directory with the index files

output_file, file where the mappings found will be stored

notfound_file, file where the mappings not found will be stored

num_errors, the maximum number of errors allowed during the search (only 0 or 1).

nucleotides, the nucleotides present in the reference, in order.

For example,

./search_gpu drosophila.fa drosophila_dbwt found notfound 1 ACGT

will map reads in drosophila.fa using the preprocessed index in folder drosophila_dbwt. The reads found will be stored in the file "found", the reads not found in file "notfound", allowing up to 1 errors. The nucleotides encoding in this reference will be ACGT - 0123.

Currently this is only a test program and it needs a configuration with two nvidia graphic cards in order to work. It does not support duplicate reference indexes at the moment.

Notice also that in all the tools the encoding for the four bases should be indicated. For genomes with less bases please check the reverse strand functions in file "search/csafm.c", and please do not forget to call function init_replace_table when embedding the source code.

Contact information

GBSA has been developed at "Instituto de instrumentación de la imagen molecular" (UPV) Valencia, Spain in collaboration with "Centro de Investigación Príncipe Felipe" Valencia Spain and the support of Bull Inc. An early publication on the implementation of this project in GPU has been already published. José Salavert Torres is the main developer, Ignacio Blanquer Espert is the Project supervisor.

The code from csalib library included in this project is provided by "National Institute of Informatics" Tokyo, Japan under the MIT license. Please contact professor Kunihiko Sadakane.

The preprocess command employs the algorithm and code proposed in "Daisuke Okanohara, Kunihiko Sadakane. A Linear-Time Burrows-Wheeler Transform Using Induced Sorting. In Proc. of SPIRE, LNCS 5721, pp. 90-101, 2009"

Licenses

GBSA is released under the GNU/LGPLv3 license terms, csalib and dbwt are released under the MIT license terms.