Home > Manual

SPLITREAD is a novel method for detecting INDELs (small insertions and deletion with size less than 50bp) as well as large deletions that are within the coding regions from the exome sequencing data. It also can be applied ot the whole genome sequencing data.

Basics & Definitions

  • mrsFAST for read mapping stage.
    • SPLITREAD method depends on the mappings on short reads with substitutions only. The current version of SPLITREAD is integrated with the mrsFAST. It is possible to use other short read mappers such as bwa, however the alignment has to specified to include no indels. A version using bwa will be released soon.
  • zlib for the ability to read compressed SAM files. 
  • samtools for converting the results of the mappings to bam to save space.
  • C compiler (SPLITREAD is developed with gcc version 4.1.2)

  • It is possible to use SPLITREAD stand-alone, however it is advised to use this program on a cluster to take advantage of the parallelization of the input files. There is a script included for SGE that can be easily modified. 
  • The reference file for the coding region with the processed pseudogenes of the HG18 and HG19 can be found at the downloads page for exome sequencing. However user can use any reference file for the SPLITREAD method.

Before using SPLITREAD

The whole reference is used for the mapping, so do not mask any repeats and mobile elements in the reference sequence. Soft masking also works. The reference should be indexed for the mapping algorithm that is used for the SPLITREAD in the default case using mrsFAST. The fasta file of the reference should also be indexed using the samtools faidx tool.

The input file for the reads are FASTQ files. SPLITREAD can work with both single-end and paired-end data. The released version is using paired-end data and the input FASTQ file should contain both reads in the proper order. If the mapping is done is clusters, the FASTQ files should be split into smaller input files. Based on 2GB memory cluster note it is suggested to use 200000 reads per input file.

Reference Sequence for the exome dataset:

SPLITREAD is mostly applied to the whole-exome data. In order to increase the performance of the mapping process we define a smaller reference sequence that is consistent of the captured coding regions and non-coding RNAs and the paralogs of the coding regions such as processed pseudogenes. This results in a smaller reference sequence and an index that speeds up the loading time for the reference as well as the whole mapping process. The definition of the captured regions are based on the SeqCap EZ Human Exome Library v2.0 that captures ~30000 coding genes (total size of 36.5 Mbp) and captures about a total of 44.1Mbp. These regions are merged with all the gene definitions from RefSeq and UCSC and CCDS genes that are extracted from the UCSC browser. In order to identify the processed pseudogenes that are already represented in the Human Reference(hg18/hg19) sequence but not annotated, we blast all the exons agains the Human Reference Sequence. The blast results are parsed for sequences with similarity larger than 80% and span of the query within the 50% of the real size of the exon. These regions are inserted in to the coding/captured regions and merged.

The Human Reference Sequence(hg18/hg19) is masked with character 'N' except the coordinates of the regions described above. This eliminates the possibility of reads mapping places that are not in the definiton of the captured regions. The regions that are N are not indexed by the mapping methods so the index for the reference sequence will be smaller and the mapping to the reference will be much faster compared to the whole genome. This new reference is appened by the consensus sequences for the Alus, L1s and SVAs that is used for detecting the mobile element insertions to the coding regions. The references for the whole-exome data can be download from hg18 and hg19.

Read Mapping:

Please use mrsFAST in single-read mapping mode (There is a program included to perform the pairing of the reads). Recommended hamming distance threshold is 5% of the read length (i.e. 3 for 50 bp reads, 4 for 76 bp reads, etc.). Using other aligners might be possible if all map locations are reported with only substitutions, however, we strongly recommend mrsFAST that was specifically developed for SV/CNV analysis using Hamming distance criteria.

  1. Prepare the reference whole-exome reference/whole-genome reference/custom reference.
  2. Build mrsFAST search index using the same reference and build samtools FASTA index using the same reference.
  3. Run SPLITREAD_map to map the whole reads and the split reads.
  4. Merge the output file if it is run in parallel. Transform the balanced/unbalanced files for event calling program.
  5. Run SPLITREAD_call to get the events using the balanced/unbalanced reads.
  6. The final set of events that are reported as a bed file.



SplitReadAll_lite is used to map the reads to the reference sequence. This part of the program maps the whole reads to the reference sequence and identifies the concordant/discordant and OEA(One end anchored) reads. The OEA reads are later split and mapped back to the reference sequence.

Required files and parameters:

  • The FASTQ file for the reads with the proper pairing. It is necessary to have an even number of read length. If it is not the reads should be trimmed to be an even number.
  • The Reference sequence and the index for mrsFAST and samtools fasta index.    
The command for running the mapping program is as follows:

SplitReadAll_lite   <split_reference>                                          // The reference sequence for mapping of the split reads.
    <whole_reference>                                      // The reference sequence for the mapping of the whole reads. It is usually the same as the previous one.
<The target definition>                                // The bed file for the regions to be captured. This is not necessary unless read depth is used.
                                     <read_len>                                                    // The read length for the whole reads.
<whole read edit_distance>                      // Edit distance for the whole reads. 6% of the read length is recommended.
<split read edit_distance>                         // Edit distance for the split reads. It is usually half of the whole read edit_distance.
<lower insert size>                                      // The lower limit for the span size. The read length is recommended.   
<upper insert size>                        // The upper limit for the span size. It is recommended as 500 but can be adjusted more accuratly for better results
<maximum size for discordant mappings>     //The upper bound for detecting deletions. Recommended setting is 10000000.
<reference index of the fasta file for samtools>  // The .fai file for the reference sequence. 
<Input FASTQ>                         //FASTQ file for the whole reads.   
<Whole read mapping output = match>                               // The output filename for the mappings of the whole reads.
<Whole read unmapped reads = unmapped>                            // The unmapped whole reads
<Split read output prefix = output>                                        // The prefix for the output files from the split mappings.


  • match.sorted.bam                                               // The concordant mapping of the whole paired-end reads
  • match.single.sorted.bam                                   // All possible mappings of the OEA reads. (Anchors) 
  • match.inv.evert.sorted.bam                              // All possible mappings of with the inverted and everted directions.
  • match.transchr.sorted.bam                               // All possible mappings of the translocations mappings.
  • match.split.fq                                                       // Reads that are split from OEA unmatched ends.
  • match.split.match.sorted.bam                          // The paired mapping of the splits. (Balanced)
  • match.split.match.single.sorted.bam              // The OEA split reads. (Unbalanced reads)
  • match.pair                                                            // The split reads used for the event detection
  • match.pair.concordand.txt                                // The concordant balanced split read mappings - empty not needed anymore  
  • match.pair.discordand.txt                                 // The discordant mappings of the balanced splits (inversions)
  • match.pair.maxevert.txt                                     // The discordant mappings of the balanced splits (everted or larger than the maximum size, 10Mbp)  
  • match.pair.trans.txt                                            // The discordant mappings of the balanced splits (translocations)
  • match.all.oea.match                                          // The mappings of the unbalanced splits. This is used for event detection.
  • match.all.oea.match.maxevert.txt                   // The mappings of the unbalanced splits (inversion/everted/maximum size)
  • match.all.oea.match.trans.txt                           // The mappings of the unbalanced splits (translocations)
  • match.all.oea.match.discordand.txt               // The mappings of the unbalanced splits (inversion) - empty not needed anymore. 
There are two output files that are going to be used for calling events in the next step: match.pair and match.all.oea.match. These files has to be converted another format. This is conversions is done using the getSC.sh. The command is as follows:

           ./getSC.sh match.pair match.pair.format match.all.oea.match match.all.oea.match.format 

It is vey important to make sure that each line is in the correct form and the chromosomes are mathcing between the balanced and unbalanced mappings files, respectively match.pair.format and match.all.oea.match.format.


events5 is used for clustering the split reads and reporting the final set of events. There are two input files for the program the match.pair.format and match.all.oea.match,format.

The command for running the program is as follows:

./events5                    <match.pair.format>
                                   <Output_file for events = outfile>   //The output file for the events
                                   <Whole read length>        // The read length of the original complete reads.

  • outfile.events  // The events predicted in a preliminary format.
  • outfile.cluster // The reads that are supporting the event.
  • outfile.fasta    // The read sequence supporting the events
The outfile.events is processed using getFinalevents.sh using the command line

    ./getFilnalevents outfile.events outfile.event.final

The sample output is as follows:

C1        C2                   C3                 C4        C5           C6     C7        C8     C9  C10-11-12-13
chr11    51428765    51435883    7118    D7118    262    1462    4        4      51436108    51436275    51435851    51436217
chr11    51426389    51435891    9502    D9502    249    2           5        6      51436109    51436266    51426576    51436217

The first three columns are the <chr> <start> <end> for the events.
Column 4 is the size of the event. + for deletions and - for insertion.
Column 5 is the symbol of the event.
Column 6 is the balanced split read support.
Column 7 is the unbalanced split read support.
Column 8 is the edit distance for the anchor read.
Column 9 is the edit distance for the split read.
Column 10-13 is used for the anchor reads and can be removed for the final file.

Basic Pipeline
Assume that the reads are in Sample_1.fastq(First reads)  and Sampe_2.fastq(Second reads) where read length is 100bp. For 100 bp we will use 6 hamming distance for the whole read and 3 for the split read. The minimum insert size is 105 and the maximum is 500 bp for this example. The maximum size of event we can find is 10Mb. Here is the pipeline for this example
1) Merge the first and second reads together where first reads is followed by its pair(second read). 

     Sample_1.fastq + 
Sample_2.fastq -> Sample.fastq
2) Seperate into files that contains 200000 reads each; Sample1.fastq, Sample2.fastq, ... , Samplen.fastq
3) ./SplitReadAll_lite HG19_exome.fa HG19_exome.fa HG19_exome.fa 100 6 3 105 500 10000000 HG19_exome.fa.fai Sample1.fastq Sample1.out Sample1.unmapped Sample1.match
4) For the event calling there are only two important files, the rest can be removed.
Cat Sample1.match.pair, Sample2.match.pair, Sample3.match.pair, ... -> Sample.pair
Cat Sample1.all.oea.match.pair., Sample2.,all.oea.match.pair, Sample3.all.oea.match.pair, ... -> Sample.single
5)  ./getSC.sh Sample.pair Sample.pair.format Sample.single Sample.single.format
6) ./events5 Sample.pair.format Sample.single.format 100 Sample.events
10 ./getFilnalevents Sample.events Sample.events.final


Latest Releases




Other publications that use SPLITREAD

Related Projects

mrFAST: Our default Illumina read mapper that finds both indels and mismatches and performs iterative search to increase mapping sensitivity. Specifically designed for copy number variation and structural variation analysis.

mrsFAST: Our alternative Illumina read mapper that finds only mismatches to increase mapping speed. Also supports bisulfite mapping.

drFAST: Read mapper for  di-base color-space reads generated with the SOLiD platform.

VariationHunter: Structural variation calling algorithm using read pair mapping information including suboptimal alignments.

NovelSeq: Novel sequence insertion discovery framework.

mrCaNaVaR: Detection of structural variants and indels from genome and exome sequencing data


Bug Report
Our Public Forums

Eichler Lab, University of Washington
Valid XHTML 1.0 Transitional