BASIL-ANISE

Introduction

BASIL is a method for detecting structural variant breakpoints (including insertion breakpoints) from aligned paired HTS reads in BAM format. ANISE is a method for assembling large insertions from paired reads in BAM format and a list of candidate insertion breakpoints as generated by BASIL.

Installation

Install with conda

module load miniconda3
conda create -n mypy # create the environment
source activate mypy # activate the environment
conda install -c bioconda anise_basil # install

Install from source with git

# download
git clone https://github.com/seqan/anise_basil.git
cd anise_basil
git checkout master
git submodule init
git submodule update --recursive

# compile the program
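mkdir -p build   # harmless if the directory already exists
cd build
cmake ..
make -j 4 anise basil # -j sets the number of cores used for compilation

# use -h to show the help message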
./bin/basil -h
./bin/anise -h

Usage and example

Download the test data: https://github.com/seqan/anise_basil/tree/master/example

Step 1: Align the reads to the reference genome

bwa index ref.fa
bwa aln -f left.fq.gz.sai ref.fa left.fq.gz
bwa aln -f right.fq.gz.sai ref.fa right.fq.gz
bwa sampe ref.fa left.fq.gz.sai right.fq.gz.sai left.fq.gz right.fq.gz | samtools view -Sb - | samtools sort - simulated
samtools index simulated.bam

Step 2: Use BASIL to find tentative insertion sites in the BAM file

basil -ir ref.fa -im simulated.bam -ov basil.vcf

Inspect the contents of the output VCF file

cat basil.vcf
##fileformat=VCFv4.1
##source=BASIL
##reference=ref.fa
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural va...
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural varia...
##INFO=<ID=OEA_ONLY,Number=0,Type=Flag,Description="Breakpoint support by OE...
##ALT=<ID=INS,Description="Insertion of novel sequence">
##FORMAT=<ID=GSCORE,Number=1,Type=String,Description="Sum of Geometric score...
##FORMAT=<ID=CLEFT,Number=1,Type=String,Description="Clipped alignments supp...
##FORMAT=<ID=CRIGHT,Number=1,Type=String,Description="Clipped alignments sup...
##FORMAT=<ID=OEALEFT,Number=1,Type=String,Description="One-end anchored alig...
##FORMAT=<ID=OEARIGHT,Number=1,Type=String,Description="One-end anchored ali...
##contig=<ID=1,length=10000>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  indi...
1       5001    site_0  T       <INS>   .       PASS    IMPRECISE;SVTYPE=INS...
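
For a quick list of the candidate sites, the record lines can be pulled out of the VCF with awk. This is a minimal sketch: the function name list_sites is made up for illustration, and it relies only on the standard VCF column order (CHROM, POS and ID in columns 1 to 3).

```shell
# list_sites is a hypothetical helper, not part of BASIL.
# Print CHROM, POS and ID for every record line of a VCF file.
list_sites() {
    # Skip header lines (they start with '#'); fields are split on whitespace.
    awk '!/^#/ && NF >= 3 { print $1 "\t" $2 "\t" $3 }' "$1"
}

# Usage: list_sites basil.vcf
```

For the example output above this would report a single site, site_0, at position 5001 on contig 1.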

Step 3: Assemble the insertions at the breakpoints with ANISE

Filter the VCF file to produce the input for ANISE (the threshold below is an example for 30x data)

filter_basil.py -i basil.vcf -o basil.filtered.vcf --min-oea-each-side 10
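
To illustrate the idea behind a per-side OEA threshold, here is a rough awk sketch. It is not the real filter_basil.py and may differ from it in detail. It assumes a single-sample VCF in which the FORMAT column lists the OEALEFT and OEARIGHT keys declared in the BASIL header, and in which the sample column holds integer counts for them. The name filter_by_oea is hypothetical.

```shell
# filter_by_oea is a hypothetical stand-in, NOT the real filter_basil.py.
# Assumes a single-sample VCF whose FORMAT column (9) names OEALEFT and
# OEARIGHT and whose sample column (10) holds integer counts for them.
filter_by_oea() {
    # $1: input VCF, $2: minimum OEA count required on each side
    awk -v min="$2" '
        /^#/ { print; next }                 # keep header lines unchanged
        {
            n = split($9, keys, ":")         # FORMAT keys
            split($10, vals, ":")            # matching sample values
            left = right = -1
            for (i = 1; i <= n; i++) {
                if (keys[i] == "OEALEFT")  left  = vals[i]
                if (keys[i] == "OEARIGHT") right = vals[i]
            }
            if (left + 0 >= min + 0 && right + 0 >= min + 0) print
        }' "$1"
}

# Usage: filter_by_oea basil.vcf 10 > basil.filtered.vcf
```

A record is kept only when both sides reach the threshold, so a site with strong support on one side alone is dropped.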

Run ANISE

anise -ir ref.fa -im simulated.bam -iv basil.filtered.vcf -of anise.fa

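Inspect the output FASTA file. It contains one assembled insertion, and the header line (the line starting with >) carries some information about the contig.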

cat anise.fa
>site_0_contig_0 REF=1 POS=5001 STEPS=6  ANCHORED_LEFT=yes ANCHORED_RIGHT=yes SPANNING=yes STOPPED=no_more_reads
GGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTCCAGGGCATGG
TCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTACCTGAAGAGCCCCCGTAAGCC
GTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAAGTGTCTTGGCAGACCGTATCTTCACAGGGC
CGAAGCACTTTTGGCAGGCTTATAAACGCCCAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGA
CGCGGGGTGTGTCGGCCCTATCCCTTGCGCTTACAGACTTTATTTCTTCGTGAGGGAGTTGACCCATGCA
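
The KEY=value pairs on the header line are easy to read programmatically. The sketch below prints one chosen field per contig. The helper name header_field is hypothetical, and it assumes the header layout shown above (contig name first, then whitespace-separated KEY=value pairs).

```shell
# header_field is a hypothetical helper for illustration.
# Print the contig name and the value of KEY for each FASTA header line.
header_field() {
    # $1: FASTA file, $2: key name such as POS or SPANNING
    awk -v key="$2" '
        /^>/ {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print substr($1, 2) "\t" kv[2]
            }
        }' "$1"
}

# Usage: header_field anise.fa POS
```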

Extract the contigs that ANISE marked as "spanning"

extract_spanning anise.fa > anise.filtered.fa
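
If the helper is not at hand, a similar selection can be approximated with awk. The following is only a sketch under the assumption that spanning contigs are exactly those whose header carries SPANNING=yes, as in the example output above. The real extract_spanning may apply other rules, and the name keep_spanning is hypothetical.

```shell
# keep_spanning is a hypothetical stand-in, NOT the real extract_spanning.
# Copy only the FASTA records whose header line contains SPANNING=yes.
keep_spanning() {
    awk '
        /^>/ { keep = ($0 ~ /(^| )SPANNING=yes( |$)/) }   # decide once per record
        keep                                               # print header and sequence lines
    ' "$1"
}

# Usage: keep_spanning anise.fa > anise.filtered.fa
```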

Step 4: Realign the contigs

Use BLAT to align the contigs back to the reference genome

blat ref.fa anise.filtered.fa matches.psl

View the result matches.psl with pslPretty

>site_0_contig_0:0+2592 of 2592 1:4716+5308 of 10000
GGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTC

CAGGGCATGGTCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTAC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CAGGGCATGGTCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTAC

CTGAAGAGCCCCCGTAAGCCGTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CTGAAGAGCCCCCGTAAGCCGTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAA

GTGTCTTGGCAGACCGTATCTTCACAGGGCCGAAGCACTTTTGGCAGGCTTATAAACGCC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GTGTCTTGGCAGACCGTATCTTCACAGGGCCGAAGCACTTTTGGCAGGCTTATAAACGCC

CAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGACGCGGGGTGT...TCAGGTT
||||||||||||||||||||||||||||||||||||||||||||               |
CAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGACGCG-----2000------T

TGGGTCCGCGCAGCGCCAACGATTTCAACCGGGAGACGTTCGTTCATGATGAGAAGACGG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TGGGTCCGCGCAGCGCCAACGATTTCAACCGGGAGACGTTCGTTCATGATGAGAAGACGG
......
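
The size of the inserted sequence can be estimated from the header line of this alignment. The arithmetic below reads the two numbers around each + as start and end coordinates. That reading is an assumption, but it agrees with the sequence sizes and with the 2000-base gap shown in the alignment.

```shell
# Values copied from the pslPretty header line above.
q_start=0;    q_end=2592    # aligned span on the contig
t_start=4716; t_end=5308    # aligned span on the reference

q_span=$((q_end - q_start))
t_span=$((t_end - t_start))

# Contig bases with no counterpart in the reference span.
echo "unaligned contig bases: $((q_span - t_span))"
```

The contig covers 2592 bases while the reference span covers 592, which leaves 2000 bases for the insertion.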

Step 5: Select the best match by BLAT score

best_blat.py -b matches.psl | column -t
#identity  query_coverage  target_coverage  blat_score  matches  mismatches...
95.9       22.8            5.9              591         592      0         ...
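
The two coverage columns appear to be the number of matching bases divided by the query length and by the target length. That interpretation is inferred from the numbers, not taken from the best_blat.py source. It can be checked with a small helper, here given the hypothetical name coverage_pct.

```shell
# coverage_pct is a hypothetical helper: matches / length as a percentage.
coverage_pct() {
    # $1: number of matching bases, $2: sequence length
    LC_ALL=C awk -v m="$1" -v len="$2" 'BEGIN { printf "%.1f\n", 100 * m / len }'
}

coverage_pct 592 2592     # contig length from the pslPretty header
coverage_pct 592 10000    # reference length from the pslPretty header
```

With 592 matches this gives 22.8 for the 2592-base contig and 5.9 for the 10000-base reference, the same values as in the table.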


Last updated: November 22, 2024