BASIL-ANISE
Introduction
BASIL is a method to detect breakpoints for structural variants (including insertion breakpoints) from aligned paired HTS reads in BAM format. ANISE is a method for the assembly of large insertions from paired reads in BAM format and a list of candidate insertion breakpoints as generated by BASIL.
Installation
Install via conda
module load miniconda3
conda create -n mypy # create the environment
source activate mypy # activate the environment
conda install -c bioconda anise_basil # install BASIL and ANISE
Install via git
# download
git clone https://github.com/seqan/anise_basil.git
cd anise_basil
git checkout master
git submodule init
git submodule update --recursive
# compile the program
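mkdir -p build # create the build directory if it does not exist yet
cd build
cmake ..
make -j 4 anise basil # -j sets the number of parallel build jobs
# show the help message with -h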
./bin/basil -h
./bin/anise -h
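Before moving on, it can help to confirm that the commands used in the following steps are on the PATH. This is a minimal sketch: `check_tools` is a hypothetical helper name, not part of BASIL or ANISE, and the tool list is simply the set of programs this page calls.

```shell
# check_tools NAME...: report which of the given commands are missing from PATH.
# Hypothetical helper for illustration only.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Tools called in the steps of this page; only reports what is missing
check_tools bwa samtools basil anise blat || echo "install the tools listed above first" >&2
```

If BASIL and ANISE were built from source, their `bin` directory has to be added to the PATH (or the commands called as `./bin/basil` and `./bin/anise`) for this check to find them.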
Usage and examples
Download the test data: https://github.com/seqan/anise_basil/tree/master/example
Step 1: Align the reads to the reference genome
bwa index ref.fa
bwa aln -f left.fq.gz.sai ref.fa left.fq.gz
bwa aln -f right.fq.gz.sai ref.fa right.fq.gz
bwa sampe ref.fa left.fq.gz.sai right.fq.gz.sai left.fq.gz right.fq.gz | samtools view -Sb - | samtools sort -o simulated.bam - # with samtools older than 1.3 use: samtools sort - simulated
samtools index simulated.bam
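A quick check of the alignment output can catch a truncated BAM file or a missing index before BASIL is run. The sketch below assumes the index is written next to the BAM file as `simulated.bam.bai`; `check_bam` is a hypothetical helper name, and the samtools checks are skipped when samtools is not on the PATH.

```shell
# check_bam FILE: basic checks on a sorted, indexed BAM file.
# Hypothetical helper for illustration only.
check_bam() {
  bam=$1
  [ -s "$bam" ] || { echo "missing or empty: $bam" >&2; return 1; }
  [ -s "$bam.bai" ] || { echo "missing index: $bam.bai" >&2; return 1; }
  if command -v samtools >/dev/null 2>&1; then
    # quickcheck verifies the header and the end-of-file marker
    samtools quickcheck "$bam" || { echo "truncated or corrupt: $bam" >&2; return 1; }
    # flagstat prints totals such as mapped and properly paired reads
    samtools flagstat "$bam"
  fi
}

# Usage: check_bam simulated.bam
```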
Step 2: Detect tentative insertion sites in the BAM file with BASIL
basil -ir ref.fa -im simulated.bam -ov basil.vcf
View the contents of the output VCF file:
cat basil.vcf
##fileformat=VCFv4.1
##source=BASIL
##reference=ref.fa
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural va...
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural varia...
##INFO=<ID=OEA_ONLY,Number=0,Type=Flag,Description="Breakpoint support by OE...
##ALT=<ID=INS,Description="Insertion of novel sequence">
##FORMAT=<ID=GSCORE,Number=1,Type=String,Description="Sum of Geometric score...
##FORMAT=<ID=CLEFT,Number=1,Type=String,Description="Clipped alignments supp...
##FORMAT=<ID=CRIGHT,Number=1,Type=String,Description="Clipped alignments sup...
##FORMAT=<ID=OEALEFT,Number=1,Type=String,Description="One-end anchored alig...
##FORMAT=<ID=OEARIGHT,Number=1,Type=String,Description="One-end anchored ali...
##contig=<ID=1,length=10000>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT indi...
1 5001 site_0 T <INS> . PASS IMPRECISE;SVTYPE=INS...
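For larger data sets the VCF file is easier to scan as a short table of sites. The following sketch lists CHROM, POS, ID and SVTYPE for every record whose FILTER column is PASS. It assumes a standard tab-separated VCF; `list_pass_sites` is a hypothetical helper name, and the sample record is an abbreviated copy of the one shown above (its INFO and FORMAT columns are cut short).

```shell
# list_pass_sites FILE: print CHROM, POS, ID and SVTYPE for PASS records.
# Hypothetical helper for illustration only.
list_pass_sites() {
  awk -F'\t' '
    /^#/ { next }                       # skip meta and header lines
    $7 == "PASS" {
      svtype = "."
      n = split($8, info, ";")          # INFO entries are separated by ";"
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^SVTYPE=/) svtype = substr(info[i], 8)
      print $1 "\t" $2 "\t" $3 "\t" svtype
    }' "$1"
}

# Abbreviated sample record based on the output above
sample=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$sample"
printf '1\t5001\tsite_0\tT\t<INS>\t.\tPASS\tIMPRECISE;SVTYPE=INS\n' >> "$sample"
list_pass_sites "$sample"   # prints: 1  5001  site_0  INS (tab-separated)
rm -f "$sample"
```

On real output the call would be `list_pass_sites basil.vcf`.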
Step 3: Assemble the insertions at the breakpoints with ANISE
Filter the VCF file to generate the input file for ANISE (the threshold shown is for 30X data):
filter_basil.py -i basil.vcf -o basil.filtered.vcf --min-oea-each-side 10
Run ANISE:
anise -ir ref.fa -im simulated.bam -iv basil.filtered.vcf -of anise.fa
Inspect the output FASTA file. It contains one assembled insertion, and the header line (the line starting with >) describes the contig.
cat anise.fa
>site_0_contig_0 REF=1 POS=5001 STEPS=6 ANCHORED_LEFT=yes ANCHORED_RIGHT=yes SPANNING=yes STOPPED=no_more_reads
GGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTCCAGGGCATGG
TCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTACCTGAAGAGCCCCCGTAAGCC
GTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAAGTGTCTTGGCAGACCGTATCTTCACAGGGC
CGAAGCACTTTTGGCAGGCTTATAAACGCCCAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGA
CGCGGGGTGTGTCGGCCCTATCCCTTGCGCTTACAGACTTTATTTCTTCGTGAGGGAGTTGACCCATGCA
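When ANISE assembles many sites, a one-line summary per contig is easier to read than the raw FASTA. The sketch below prints the contig name, its sequence length and the value of the SPANNING attribute from each header. It assumes the `KEY=value` header layout shown above; `contig_summary` is a hypothetical helper name, and the sample sequence is shortened for illustration.

```shell
# contig_summary FILE: print name, length and SPANNING value for each contig.
# Hypothetical helper for illustration only.
contig_summary() {
  awk '
    /^>/ {
      if (name != "") print name "\t" len "\t" span
      name = substr($1, 2); len = 0; span = "unknown"
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SPANNING=/) span = substr($i, 10)
      next
    }
    { len += length($0) }               # sequence may wrap over several lines
    END { if (name != "") print name "\t" len "\t" span }
  ' "$1"
}

# Shortened sample record (the real contig is much longer)
sample=$(mktemp)
printf '>site_0_contig_0 REF=1 POS=5001 STEPS=6 ANCHORED_LEFT=yes ANCHORED_RIGHT=yes SPANNING=yes STOPPED=no_more_reads\n' > "$sample"
printf 'GGGCTTCGCC\nTAGG\n' >> "$sample"
contig_summary "$sample"   # prints: site_0_contig_0  14  yes (tab-separated)
rm -f "$sample"
```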
Extract the contigs that ANISE marked as "spanning":
extract_spanning anise.fa > anise.filtered.fa
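To make the selection criterion concrete, the sketch below keeps only FASTA records whose header carries `SPANNING=yes`. It is an illustration written with awk, not the implementation of `extract_spanning`; whether that tool uses exactly this criterion is an assumption, and `keep_spanning` is a hypothetical helper name.

```shell
# keep_spanning FILE: print only records whose header contains SPANNING=yes.
# Hypothetical helper for illustration only.
keep_spanning() {
  awk '
    /^>/ { keep = ($0 ~ / SPANNING=yes( |$)/) }   # decide once per record
    keep                                          # print header and sequence lines
  ' "$1"
}

# Usage: keep_spanning anise.fa > anise.filtered.fa
```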
Step 4: Realignment
Use BLAT to align the contigs back to the reference genome:
blat ref.fa anise.filtered.fa matches.psl
Use pslPretty to view the alignments in matches.psl:
>site_0_contig_0:0+2592 of 2592 1:4716+5308 of 10000
GGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTC
CAGGGCATGGTCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTAC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CAGGGCATGGTCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTAC
CTGAAGAGCCCCCGTAAGCCGTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CTGAAGAGCCCCCGTAAGCCGTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAA
GTGTCTTGGCAGACCGTATCTTCACAGGGCCGAAGCACTTTTGGCAGGCTTATAAACGCC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GTGTCTTGGCAGACCGTATCTTCACAGGGCCGAAGCACTTTTGGCAGGCTTATAAACGCC
CAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGACGCGGGGTGT...TCAGGTT
|||||||||||||||||||||||||||||||||||||||||||| |
CAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGACGCG-----2000------T
TGGGTCCGCGCAGCGCCAACGATTTCAACCGGGAGACGTTCGTTCATGATGAGAAGACGG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TGGGTCCGCGCAGCGCCAACGATTTCAACCGGGAGACGTTCGTTCATGATGAGAAGACGG
......
Step 5: Select the best match
Use the BLAT score to select the best match:
best_blat.py -b matches.psl | column -t
#identity query_coverage target_coverage blat_score matches mismatches...
95.9 22.8 5.9 591 592 0 ...
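The score and coverage columns can be related to the raw PSL fields. The sketch below assumes the usual BLAT score definition (matches + repMatches/2 - misMatches - qNumInsert - tNumInsert) and takes coverage as matches divided by the sequence size. `psl_summary` is a hypothetical helper name, the sample PSL record is reconstructed from the pslPretty header in Step 4 for illustration only, and `best_blat.py` may compute its columns differently.

```shell
# psl_summary FILE: print qName, BLAT-style score, query and target coverage (%).
# Hypothetical helper for illustration only.
psl_summary() {
  awk -F'\t' '
    $1 ~ /^[0-9]+$/ {                   # skip the PSL header lines
      # columns: 1 matches, 2 misMatches, 3 repMatches, 5 qNumInsert,
      #          7 tNumInsert, 10 qName, 11 qSize, 15 tSize
      score = $1 + int($3 / 2) - $2 - $5 - $7
      printf "%s\t%d\t%.1f\t%.1f\n", $10, score, 100 * $1 / $11, 100 * $1 / $15
    }' "$1"
}

# Reconstructed sample record: 592 matches, one 2000-base insertion in the query
sample=$(mktemp)
printf '592\t0\t0\t0\t1\t2000\t0\t0\t+\tsite_0_contig_0\t2592\t0\t2592\t1\t10000\t4716\t5308\t2\t284,308,\t0,2284,\t4716,5000,\n' > "$sample"
psl_summary "$sample"   # prints: site_0_contig_0  591  22.8  5.9 (tab-separated)
rm -f "$sample"
```

Under these assumptions the result agrees with the blat_score, query_coverage and target_coverage values in the table above.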
References
GitHub homepage: https://github.com/seqan/anise_basil
Anaconda package: https://anaconda.org/bioconda/anise_basil
Holtgrewe, M., Kuchenbecker, L., & Reinert, K. (2015). Methods for the Detection and Assembly of Novel Sequence in High-Throughput Sequencing Data. Bioinformatics, btv051.