SPLITREAD is a novel method for detecting INDELs (small insertions and deletion with size less than 50bp) as well as large deletions that are within the coding regions from the exome sequencing data. It also can be applied ot the whole genome sequencing data.
Basics & Definitions
- mrsFAST for read mapping stage.
- SPLITREAD method depends on the mappings on short reads with substitutions only. The current version of SPLITREAD is integrated with the mrsFAST. It is possible to use other short read mappers such as bwa, however the alignment has to specified to include no indels. A version using bwa will be released soon.
- zlib for the ability to read compressed SAM files.
- samtools for converting the results of the mappings to bam to save space.
- C compiler (SPLITREAD is developed with gcc version 4.1.2)
- It is possible to use SPLITREAD stand-alone, however it is advised to use this program on a cluster to take advantage of the parallelization of the input files. There is a script included for SGE that can be easily modified.
- The reference file for the coding region with the processed pseudogenes of the HG18 and HG19 can be found at the downloads page for exome sequencing. However user can use any reference file for the SPLITREAD method.
Before using SPLITREAD
The whole reference is used for the mapping, so do not mask any repeats and mobile elements in the reference sequence. Soft masking also works. The reference should be indexed for the mapping algorithm that is used for the SPLITREAD in the default case using mrsFAST. The fasta file of the reference should also be indexed using the samtools faidx tool.
The input file for the reads are FASTQ
files. SPLITREAD can work with both single-end and paired-end data.
The released version is using paired-end data and the input FASTQ
file should contain both reads in the proper order. If the mapping is
done is clusters, the FASTQ files should be split into smaller input
files. Based on 2GB memory cluster note it is suggested to use 200000
reads per input file.
SPLITREAD is mostly applied to the
whole-exome data. In order to increase the performance of the mapping
process we define a smaller reference sequence that is consistent of
the captured coding regions and non-coding RNAs and the paralogs of the
coding regions such as processed pseudogenes. This results in a smaller
reference sequence and an index that speeds up the loading time for the
reference as well as the whole mapping process. The definition of the
captured regions are based on the SeqCap
v2.0 that captures ~30000 coding genes
(total size of 36.5 Mbp) and captures about a total of 44.1Mbp. These
regions are merged with all the gene definitions from RefSeq and UCSC
and CCDS genes that are extracted from the UCSC browser. In order to
identify the processed pseudogenes that are already represented in the
Human Reference(hg18/hg19) sequence but not annotated, we blast all the
exons agains the Human Reference Sequence. The blast results are parsed
for sequences with similarity larger than 80% and span of the query
within the 50% of the real size of the exon. These regions are inserted
in to the coding/captured regions and merged.
The Human Reference Sequence(hg18/hg19)
is masked with character 'N' except the coordinates of the regions
described above. This eliminates the possibility of reads mapping
places that are not in the definiton of the captured regions. The
regions that are N are not indexed by the mapping methods so the index
for the reference sequence will be smaller and the mapping to the
reference will be much faster compared to the whole genome. This new
reference is appened by the consensus sequences for the Alus, L1s and
SVAs that is used for detecting the mobile element insertions to the
coding regions. The references for the whole-exome data can be download
Please use mrsFAST in single-read mapping mode (There is a program included to perform the pairing of the reads). Recommended hamming distance threshold is 5% of the read length (i.e. 3 for 50 bp reads, 4 for 76 bp reads, etc.). Using other aligners might be possible if all map locations are reported with only substitutions, however, we strongly recommend mrsFAST that was specifically developed for SV/CNV analysis using Hamming distance criteria.
- Prepare the reference whole-exome reference/whole-genome
- Build mrsFAST search index using the same reference and build samtools FASTA index using the same reference.
- Run SPLITREAD_map to map the whole reads and the split reads.
- Merge the output file if it is run in parallel. Transform the
balanced/unbalanced files for event calling program.
- Run SPLITREAD_call to get the events using the
- The final set of events that are reported as a bed file.
SplitReadAll_lite is used to map the reads to the reference sequence. This part of the program maps the whole reads to the reference sequence and identifies the concordant/discordant and OEA(One end anchored) reads. The OEA reads are later split and mapped back to the reference sequence.
Required files and parameters:
- The FASTQ file for the reads with the proper pairing. It is
necessary to have an even number of read length. If it is not the reads
should be trimmed to be an even number.
- The Reference sequence and the index for mrsFAST and samtools fasta index.
./SplitReadAll_lite <split_reference> // The reference sequence for mapping of the split reads.
<whole_reference> // The reference sequence for the mapping of the whole reads. It is usually the same as the previous one.
<The target definition> // The bed file for the regions to be captured. This is not necessary unless read depth is used.
<read_len> // The read length for the whole reads.
<whole read edit_distance> // Edit distance for the whole reads. 6% of the read length is recommended.
<split read edit_distance> // Edit distance for the split reads. It is usually half of the whole read edit_distance.
<lower insert size> // The lower limit for the span size. The read length is recommended.
<upper insert size> // The upper limit for the span size. It is recommended as 500 but can be adjusted more accuratly for better results
<maximum size for discordant mappings> //The upper bound for detecting deletions. Recommended setting is 10000000.
<reference index of the fasta file for samtools> // The .fai file for the reference sequence.
<Input FASTQ> //FASTQ file for the whole reads.
<Whole read mapping output = match> // The output filename for the mappings of the whole reads.
<Whole read unmapped reads = unmapped> // The unmapped whole reads
<Split read output prefix = output> // The prefix for the output files from the split mappings.
the whole paired-end reads
the OEA reads. (Anchors)
with the inverted and everted directions.
the translocations mappings.
from OEA unmatched ends.
the splits. (Balanced)
for the event detection
read mappings - empty not needed
the balanced splits (inversions)
the balanced splits (everted or larger
than the maximum size, 10Mbp)
the balanced splits (translocations)
unbalanced splits. This is used for event
unbalanced splits (inversion/everted/maximum
unbalanced splits (translocations)
unbalanced splits (inversion) - empty not needed
./getSC.sh match.pair match.pair.format match.all.oea.match match.all.oea.match.format
It is vey important to make sure that each line is in the correct form and the chromosomes are mathcing between the balanced and unbalanced mappings files, respectively match.pair.format and match.all.oea.match.format.
events5 is used for clustering the split reads and reporting the final set of events. There are two input files for the program the match.pair.format and match.all.oea.match,format.
The command for running the program is as follows:
<Output_file for events = outfile> //The output file for the events
<Whole read length> // The read length of the original complete reads.
- outfile.events // The events predicted in a preliminary format.
- outfile.cluster // The reads that are supporting the event.
- outfile.fasta // The read sequence supporting
./getFilnalevents outfile.events outfile.event.final
The sample output is as follows:
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10-11-12-13
chr11 51428765 51435883 7118 D7118 262 1462 4 4 51436108 51436275 51435851 51436217
chr11 51426389 51435891 9502 D9502 249 2 5 6 51436109 51436266 51426576 51436217
The first three columns are the <chr> <start> <end> for the events.
Column 4 is the size of the event. + for deletions and - for insertion.
Column 5 is the symbol of the event.
Column 6 is the balanced split read support.
Column 7 is the unbalanced split read support.
Column 8 is the edit distance for the anchor read.
Column 9 is the edit distance for the split read.
Column 10-13 is used for the anchor reads and can be removed for the final file.
Assume that the reads are in Sample_1.fastq(First reads) and Sampe_2.fastq(Second reads) where read length is 100bp. For 100 bp we will use 6 hamming distance for the whole read and 3 for the split read. The minimum insert size is 105 and the maximum is 500 bp for this example. The maximum size of event we can find is 10Mb. Here is the pipeline for this example
1) Merge the first and second reads together where first reads is followed by its pair(second read).
Sample_1.fastq + Sample_2.fastq -> Sample.fastq
2) Seperate into files that contains 200000 reads each; Sample1.fastq, Sample2.fastq, ... , Samplen.fastq
3) ./SplitReadAll_lite HG19_exome.fa HG19_exome.fa HG19_exome.fa 100 6 3 105 500 10000000 HG19_exome.fa.fai Sample1.fastq Sample1.out Sample1.unmapped Sample1.match
4) For the event calling there are only two important files, the rest can be removed.
Cat Sample1.match.pair, Sample2.match.pair, Sample3.match.pair, ... -> Sample.pair
Cat Sample1.all.oea.match.pair., Sample2.,all.oea.match.pair, Sample3.all.oea.match.pair, ... -> Sample.single
5) ./getSC.sh Sample.pair Sample.pair.format Sample.single Sample.single.format
6) ./events5 Sample.pair.format Sample.single.format 100 Sample.events
10 ./getFilnalevents Sample.events Sample.events.final