MarDRe
[NEW] 2019/01/23: MarDRe v1.4 released! Check out the News section
MarDRe [1,2] is a de novo MapReduce-based parallel tool to remove duplicate and near-duplicate DNA reads through the clustering of single-end and paired-end sequences from FASTQ/FASTA datasets. Duplicate reads can be seen as identical sequences, or nearly identical sequences with some mismatches. Depending on the application scenario, duplicate or near-duplicate reads do not provide any interesting biological information but can increase the memory requirements and computational time of downstream analysis. This tool allows researchers and bioinformaticians to avoid analyzing unnecessary reads, reducing the time of subsequent procedures with the dataset (e.g., assemblies and mappings).
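To make the notion of duplicate and near-duplicate reads concrete, the following is a minimal Python sketch, not MarDRe's actual implementation. It assumes that two reads of equal length count as near-duplicates when they differ in at most `max_mismatches` positions; the function names and that parameter are hypothetical and chosen only for illustration.

```python
from typing import List


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length reads differ."""
    if len(a) != len(b):
        raise ValueError("reads must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def remove_near_duplicates(reads: List[str], max_mismatches: int = 0) -> List[str]:
    """Keep the first read of each group of (near-)duplicates.

    Assumption for this sketch: a read is dropped when an already-kept read
    of the same length differs from it in at most `max_mismatches` positions.
    """
    kept: List[str] = []
    for read in reads:
        is_duplicate = any(
            len(read) == len(k) and count_mismatches(read, k) <= max_mismatches
            for k in kept
        )
        if not is_duplicate:
            kept.append(read)
    return kept


reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "TTTTACGT"]
print(remove_near_duplicates(reads, max_mismatches=0))
print(remove_near_duplicates(reads, max_mismatches=1))
```

With `max_mismatches=0` only the exact copy is removed; with `max_mismatches=1` the read `ACGTACGA` is also dropped because it differs from `ACGTACGT` in a single base. This all-against-all comparison is quadratic in the number of reads, which is why practical tools cluster reads first and parallelize the work.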
MarDRe is the Big Data counterpart of ParDRe [3,4], which employs HPC technologies (i.e., hybrid MPI/multithreading) to reduce runtime on multicore systems. MarDRe instead takes advantage of the MapReduce programming model originally developed by Google [5] to significantly improve on ParDRe's performance on distributed systems, especially on cloud-based infrastructures. Written in pure Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project [6], one of the most popular distributed computing frameworks for scalable Big Data processing. More recently, MarDRe has been redesigned to use the Hadoop Sequence Parser (HSP) library [7] for improved performance, especially in paired-end scenarios.
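The MapReduce model mentioned above can be illustrated with a small single-process simulation in plain Python. This is a sketch under stated assumptions and does not reproduce MarDRe's Java/Hadoop code: it assumes reads are clustered by a fixed-length prefix used as the key, and that each cluster is then deduplicated independently. `PREFIX_LENGTH`, `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

PREFIX_LENGTH = 4  # hypothetical key length, chosen only for this example


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length reads."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_phase(reads: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Map step: emit (prefix, read) so reads sharing a prefix meet later."""
    for read in reads:
        yield read[:PREFIX_LENGTH], read


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key, as a MapReduce framework does between phases."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(group: List[str], max_mismatches: int) -> List[str]:
    """Reduce step: keep one representative per set of near-duplicates."""
    kept: List[str] = []
    for read in group:
        if not any(
            len(read) == len(k) and mismatches(read, k) <= max_mismatches
            for k in kept
        ):
            kept.append(read)
    return kept


def run_job(reads: Iterable[str], max_mismatches: int = 1) -> List[str]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    groups = shuffle(map_phase(reads))
    result: List[str] = []
    for key in sorted(groups):  # sorted keys give a deterministic output order
        result.extend(reduce_phase(groups[key], max_mismatches))
    return result


reads = ["ACGTAAAA", "ACGTAAAT", "ACGTCCCC", "TTTTAAAA", "TTTTAAAA"]
print(run_job(reads, max_mismatches=1))
```

Because each key's group is reduced independently, a real framework can process the groups in parallel on different nodes. The trade-off in this sketch is that two reads whose mismatches fall inside the prefix land in different groups and are never compared.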
This tool is distributed as free software under the GPLv3 license [8] and is publicly available from the Downloads section.
Citation
If you have used MarDRe in your research, please cite our work using the following reference:
References
- [2] MarDRe SourceForge webpage
- [3] Jorge González-Domínguez, Bertil Schmidt. ParDRe: faster parallel duplicated reads removal tool for sequencing studies. Bioinformatics 32(10): 1562-1564 (2016)
- [4] ParDRe SourceForge webpage
- [5] Jeffrey Dean, Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1): 107-113 (2008)
- [6] Apache Hadoop project
- [7] Hadoop Sequence Parser (HSP) library for FASTQ/FASTA datasets
- [8] GNU General Public License version 3 (GPLv3)