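1. Preparation before tree building
1.1 Obtaining similar sequences: BLAST
BLAST, short for Basic Local Alignment Search Tool, is a widely used database search program (Altschul et al., 1990[62]; 1997[63]). The major international bioinformatics centers all provide Web-based BLAST servers. The basic idea of the algorithm is to first find the most similar segments shared by the query sequence and a target sequence, then use each such segment as a seed and extend it in both directions to obtain the longest possible similar segment.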
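The seed-and-extend idea described above can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: the function name `seed_and_extend` and the `word` parameter are assumptions made for this example, and the extension here stops at the first mismatch, whereas BLAST scores the extension and tolerates mismatches.

```python
def seed_and_extend(query, subject, word=4):
    """Return (query_start, subject_start, length) of the longest exact,
    ungapped match grown outward from a shared word of length `word`."""
    # Index every word of length `word` in the subject sequence.
    index = {}
    for j in range(len(subject) - word + 1):
        index.setdefault(subject[j:j + word], []).append(j)

    best = (0, 0, 0)
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            lo_q, lo_s = i, j
            # Extend to the left while the characters keep matching.
            while lo_q > 0 and lo_s > 0 and query[lo_q - 1] == subject[lo_s - 1]:
                lo_q -= 1
                lo_s -= 1
            hi_q, hi_s = i + word, j + word
            # Extend to the right while the characters keep matching.
            while hi_q < len(query) and hi_s < len(subject) and query[hi_q] == subject[hi_s]:
                hi_q += 1
                hi_s += 1
            if hi_q - lo_q > best[2]:
                best = (lo_q, lo_s, hi_q - lo_q)
    return best


print(seed_and_extend("ttgacgaacgcaa", "ccgacgaacgctt"))
```

The two toy sequences share the segment gacgaacgc, so the call reports a match of length 9 starting at offset 2 in both.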
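First, go to one of the sites that offer a BLAST service, such as CBI in China, NCBI in the United States, EBI in Europe, or DDBJ in Japan. Their BLAST pages look much alike, although the underlying programs differ somewhat. Each has a large text box for pasting the query sequence. Paste the sequence into that box in FASTA format, choose a suitable BLAST program and database, and start the search. In FASTA format the first line is a description line that begins with ">" followed by the sequence name and any description; the ">" is mandatory, the name and description are free-form, and the sequence itself starts on the next line. For a DNA sequence, BLASTN against a DNA database is the usual choice.
NCBI is used as the example here: open the NCBI home page - click BLAST - click Nucleotide-nucleotide BLAST (blastn) - paste the query sequence into the Search text box - click BLAST! - click Format - obtain the BLAST results.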
>gi|28171832|gb|AY155203.1| Nocardia sp. ATCC 49872 16S ribosomal RNA gene, complete sequence
Score = 2020 bits (1019), Expect = 0.0
Identities = 1382/1497 (92%), Gaps = 8/1497 (0%)
Strand = Plus / Plus
Query: 1  gacgaacgctggcggcgtgcttaacacatgcaagtcgagcggaaaggccctttcgggggt 60
          |||||||||||||||||||||||||||||||||||||||||| |||||||||   |||||
Sbjct: 1  gacgaacgctggcggcgtgcttaacacatgcaagtcgagcggtaaggcccttc--ggggt 58

Query: 61 actcgagcggcgaacgggtgagtaacacgtgggtaacctgccttcagctctgggataagc 120
          || ||||||||||||||||||||||||||||||| | ||||||    |||||||||||||
Sbjct: 59 acacgagcggcgaacgggtgagtaacacgtgggtgatctgcctcgtactctgggataagc 118
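Score: the alignment score between the submitted sequence and the retrieved sequence; the higher the score, the more similar the two sequences.
Strand: the orientation of the strands. Plus / Minus means that the submitted sequence and the reference sequence are reverse complements of each other; Plus / Plus means both are in the forward orientation.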
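To make the Identities, Gaps and Strand fields concrete, the following sketch recomputes the counts for the first alignment block shown above and shows how a reverse complement is formed. The function names are illustrative assumptions; only the two aligned strings come from the report.

```python
def alignment_stats(query, sbjct):
    """Count identical columns and gap columns in two aligned strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have the same length")
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, sbjct) if q == "-" or s == "-")
    return identities, gaps, len(query)


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (a/c/g/t, either case)."""
    table = str.maketrans("acgtACGT", "tgcaTGCA")
    return seq.translate(table)[::-1]


# First 60 alignment columns of the BLAST report above.
q = "gacgaacgctggcggcgtgcttaacacatgcaagtcgagcggaaaggccctttcgggggt"
s = "gacgaacgctggcggcgtgcttaacacatgcaagtcgagcggtaaggcccttc--ggggt"
identities, gaps, length = alignment_stats(q, s)
print(f"Identities = {identities}/{length}, Gaps = {gaps}/{length}")
# prints: Identities = 56/60, Gaps = 2/60
```

A Plus / Minus hit means the subject matches `reverse_complement(query)` rather than the query as typed.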
1.2 Sequence format: FASTA
1 aaattgaaga gtttgatcat ggctcagatt gaacgctggc ggcaggccta acacatgcaa
61 gtcgaacggt aacaggaaga agcttgcttc tttgctgacg agtggcggac ……
>AY631071 Jiangella gansuensis YIM 002
1 gacgaacgct ggcggcgtgc ttaacacatg caagtcgagc ggaaaggccc tttcgggggt
61 actcgagcgg cgaacgggtg agtaacacgt gggtaacctg ccttcagctc tgggataagc
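The leading '>' is mandatory: it marks the FASTA format, which Clustal X reads by default. It can be followed by the species name, by the sequence's GenBank accession number (Accession No.), or by a label of your own. Keep the name short, build it from English letters and digits, and preferably make the first few characters differ between sequences, because Clustal X sometimes keeps only the first few characters as the sequence name. The sequence follows after a line break. Save the query sequence and the homologous sequences found by the search together in FASTA format as one text file (e.g. C:\temp\jc.txt); this file can then be loaded into Clustal X or similar programs for alignment and tree building.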
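The naming advice above can be checked automatically before the file is loaded. The sketch below is a minimal illustration: the helper names and the default `prefix_len=10` are assumptions (the text only says that "the first few characters" may be kept), so adjust the length to whatever your program actually retains.

```python
def check_names(names, prefix_len=10):
    """Return the names whose first `prefix_len` characters repeat an earlier name."""
    seen = set()
    clashes = []
    for name in names:
        key = name[:prefix_len]
        if key in seen:
            clashes.append(name)
        else:
            seen.add(key)
    return clashes


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)  # the '>' on the description line is mandatory
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


print(check_names(["AY631071", "AY155203", "AY631071_b"], prefix_len=8))
print(to_fasta([("AY631071", "gacgaacgctggcggcgtgc")]), end="")
```

Writing the returned text to a file such as jc.txt gives an input that follows the layout described above.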
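2. Software and procedures for building phylogenetic trees
The main steps in building a phylogenetic tree are alignment, choosing a substitution model, building the tree, and evaluating the tree. Given the assessment of tree-building methods above and the actual conditions in our laboratory, the following mainly covers the software and procedures for building N-J trees.
2.1 Building an N-J tree with Clustal X
(1) Open Clustal X and load the source file.
File-Load sequences- C:\temp\jc.txt.
(2) Align the sequences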
Alignment - Output format options - √ Clustal format; CLUSTALW sequence numbers: ON
Alignment - Do complete alignment
(Output Guide Tree file, C:\temp\jc.dnd; Output Alignment file, C:\temp\jc.aln)
Align → waiting……
(3) Trim both ends
File-Save Sequence as…
Format: ⊙ CLUSTAL
GDE output case: Lower
CLUSTALW sequence numbers: ON
Save from residue: 39 to 1504 (set the range by the shortest sequence at each end, so that every sequence covers it)
Save sequence as: C:\temp\jc-a.aln
(4) File-Load sequences-Replace existing sequences?-Yes- C:\temp\jc-a.aln
(5) Trees-Output Format Options
Output Files : √ CLUSTAL format tree √ Phylip format tree √ Phylip distance matrix
Bootstrap labels on: NODE
Trees-Exclude positions with gaps
Trees-Bootstrap N-J Tree :
Random number generator seed(1-1000) : 111
Number of bootstrap trials(1-1000): 1000
SAVE CLUSTAL TREE AS: C:\temp\jc-a.njb
SAVE PHYLIP TREE AS: C:\temp\jc-a.njbphb
OK → waiting……
(6) Trees-Draw N-J Trees
SAVE CLUSTAL TREE AS: C:\temp\jc-a.nj
SAVE PHYLIP TREE AS: C:\temp\jc-a.njph
SAVE DISTANCE MATRIX AS: C:\temp\jc-a.njphdst
(7) TreeView
Tree- phylogram (other tree styles: unrooted, slanted cladogram, rectangular cladogram)
Tree- Show internal edge labels (Bootstrap value) (displays the values)
Tree- Define outgroup… → ingroup >> outgroup → OK (defines the outgroup)
Tree- Root with outgroup
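The "Save from residue" range in step (3) is the stretch of alignment columns that every sequence covers. A minimal sketch of how that range could be derived from an alignment held as equal-length strings with "-" for gaps (the function names and the toy alignment are illustrative assumptions, not Clustal X internals):

```python
def shared_range(aligned):
    """Return the 1-based (first, last) columns covered by every sequence."""
    first, last = 1, len(aligned[0])
    for seq in aligned:
        # The column where this sequence starts is one past its leading gaps.
        first = max(first, len(seq) - len(seq.lstrip("-")) + 1)
        # The column where this sequence ends is the length without trailing gaps.
        last = min(last, len(seq.rstrip("-")))
    return first, last


def trim(aligned):
    """Cut every sequence down to the shared column range."""
    first, last = shared_range(aligned)
    return [seq[first - 1:last] for seq in aligned]


toy = ["--acgtacgt", "acgtacgt--", "-cgtacgta-"]
print(shared_range(toy))  # prints: (3, 8)
print(trim(toy))
```

The pair returned by `shared_range` corresponds to the two numbers typed into the "Save from residue ... to ..." fields.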
2.2 Building a tree with MEGA
Clustal X can build phylogenetic trees, but the result is rather crude and it is now rarely used for that purpose. Many researchers choose MEGA instead because it is simple to operate and produces attractive output.
(1) Align the sequences with Clustal X and trim them to produce C:\temp\jc-a.aln (as above);
(2) Open BioEdit and convert the file to FASTA format:
File-Open- C:\temp\jc-a.aln,
File-Save As- C:\temp\jc-b.fas;
(3) Open MEGA, convert the file to MEGA format, and activate it:
File-Convert To MEGA Format- C:\temp\jc-b.fas → C:\temp\jc-b.meg,
close the Text Editor window (Do you want to save your changes before closing?-Yes);
Click me to activate a data file- C:\temp\jc-b.meg-OK-
(Protein-coding nucleotide sequence data?-No);
Distance Options-Models-Nucleotide: Kimura 2-parameter;
√d: Transitions+Transversions;
Include Sites-⊙Pairwise Deletion
Test of Phylogeny-⊙Bootstrap; Replications 1000; Random Seed 64238
(4) Image-Copy to Clipboard, then paste into a Word document for editing.
-Tree/Branch Style: switch between tree styles;
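The distance settings chosen in step (3) (Kimura 2-parameter, Transitions+Transversions, Pairwise Deletion) correspond to a well-known closed-form formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The sketch below applies it to one pair of aligned sequences; the function name and the test strings are assumptions for illustration, and MEGA's own code may handle edge cases differently.

```python
import math

PURINES = {"a", "g"}


def kimura_2p(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA strings.

    Columns where either sequence has a gap or ambiguous symbol are skipped,
    which is what pairwise deletion means for a single pair."""
    sites = [(x, y) for x, y in zip(seq1.lower(), seq2.lower())
             if x in "acgt" and y in "acgt"]
    # Transition: purine<->purine or pyrimidine<->pyrimidine change.
    transitions = sum(1 for x, y in sites
                      if x != y and (x in PURINES) == (y in PURINES))
    # Transversion: purine<->pyrimidine change.
    transversions = sum(1 for x, y in sites
                        if x != y and (x in PURINES) != (y in PURINES))
    p = transitions / len(sites)
    q = transversions / len(sites)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


ref = "acgtacgtacgtacgtacgt"
alt = "gcgtaagtacgtacgtacgt"  # one transition (a->g) and one transversion (c->a)
print(round(kimura_2p(ref, alt), 4))
```

With 2 differences in 20 sites the uncorrected proportion is 0.10; the corrected distance comes out slightly larger because the model allows for multiple substitutions at the same site.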
2.3 TREECON
Open Clustal X, File-Load sequences-jc-a.aln, File-Save Sequence as… (Format-PHYLIP; Save from residue-1 to the end; Save sequence as: C:\temp\jc.phy);
(1) Distance estimation
Click Distance estimation-Start distance estimation, open the jc.phy file saved above, Sequence Type-Nucleic Acid Sequence, Sequence format-PHYLIP interleaved, Select ALL, OK;
Distance Estimation-JukesCantor (or Kimura), Alignment positions-All, Bootstrap analysis-Yes, InsertionsDeletions-Not taken into account, OK;
Bootstrap samples-1000, OK; the calculation runs, wait……
(2) Infer tree topology
Click Infer tree topology-Start inferring tree topology, Method-Neighbor-joining, Bootstrap analysis-Yes, OK; the calculation runs, wait……
(3) Root unrooted trees
Click Root unrooted trees-Start rooting unrooted trees, Outgroup option-single sequence (forced), Bootstrap analysis-Yes, OK;
Select Root-X89947, OK; the calculation runs, wait……
(4) Draw phylogenetic tree
Click Draw phylogenetic tree, File-Open-(new) tree, Show-Bootstrap values/ Distance scale.
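For the Jukes-Cantor choice in step (1), the distance between two sequences follows from the proportion p of differing sites as d = -3/4 ln(1 - 4p/3). A minimal sketch, with gap columns ignored to mirror the "Not taken into account" setting; the function name and example strings are illustrative assumptions, not the program's own code.

```python
import math


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance between two aligned DNA strings, ignoring gap columns."""
    sites = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    p = sum(1 for x, y in sites if x != y) / len(sites)
    if p >= 0.75:
        # The logarithm is undefined once the sequences look saturated.
        raise ValueError("p >= 0.75: Jukes-Cantor distance is undefined")
    return -0.75 * math.log(1 - 4 * p / 3)


print(round(jukes_cantor("acgtacgtac", "acgtacgtaa"), 4))
```

One difference in ten sites (p = 0.1) gives a corrected distance a little above 0.1.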
2.4 PHYLIP
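PHYLIP is distributed as a compressed archive containing many programs; double-clicking the downloaded file extracts it automatically. Once it is unpacked you will find that PHYLIP is extremely powerful. Its programs fall into six groups: i, programs for analyzing DNA and protein sequence data; ii, programs that analyze distance data derived from sequence data; iii, programs for gene frequencies and continuous characters; iv, programs that treat each base or amino acid of a sequence as an independent character with only the states 0 and 1; v, programs that analyze sequences under the Dollo parsimony algorithm; vi, programs for drawing and editing trees. Only the programs for DNA sequence analysis and tree building are described here.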
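(1) Generate a PHY-format file
First open the trimmed alignment C:\temp\jc-a.aln in Clustal X or a similar program and save it as C:\temp\jc.phy (use File-Save Sequences As and set Format to "PHY"). The file can be opened in BioEdit or Notepad.
(2) Open SEQBOOT from the PHYLIP package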
seqboot.exe: can't find input file "infile"
Please enter a new file name> C:\temp\jc.phy
Enter the path of the *.PHY file just generated; the following is displayed:
Bootstrapping algorithm, version 3.6a3
Settings for this run:
D Sequence, Morph, Rest., Gene Freqs? Molecular sequences
J Bootstrap, Jackknife, Permute, Rewrite? Bootstrap
B Block size for block-bootstrapping? 1
R How many replicates? 100
W Read weights of characters? No
C Read categories of sites? No
F Write out data sets or just weights? Data sets
I Input sequences interleaved? Yes
0 Terminal type none
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
Number of replicates?
Settings for this run:
D Sequence, Morph, Rest., Gene Freqs? Molecular sequences
J Bootstrap, Jackknife, Permute, Rewrite? Bootstrap
B Block size for block-bootstrapping? 1
R How many replicates? 1000
W Read weights of characters? No
C Read categories of sites? No
F Write out data sets or just weights? Data sets
I Input sequences interleaved? Yes
0 Terminal type IBM PC
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
Random number seed (must be odd)?
5(any odd number)
completed replicate number 100
completed replicate number 200
completed replicate number 300
completed replicate number 400
completed replicate number 500
completed replicate number 600
completed replicate number 700
completed replicate number 800
completed replicate number 900
completed replicate number 1000
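In the menu above, D, J, R, I, 0, 1 and 2 are options that can be changed: type the character and press Enter, and the corresponding setting changes. Option D does not need to be changed. Option J offers three resampling methods, Bootstrap, Jackknife and Permute (the menu also lists Rewrite). Option R asks for the number of replicates; a replicate is one multiple-sequence data set generated by the bootstrap method, and the number of replicates can be chosen according to the number of sequences in the alignment. Once the settings are in place, type Y and press Enter. This produces the file outfile: C:\Program Files\Phylip\exe\outfile.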
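What a bootstrap replicate is can be shown in a few lines: each replicate is a new alignment of the same width, built by drawing columns from the original alignment at random with replacement. The sketch below illustrates that idea only; it uses Python's own random generator, so it will not reproduce SEQBOOT's output for a given seed, and the function name and toy alignment are assumptions.

```python
import random


def bootstrap_replicates(aligned, n_replicates, seed):
    """Yield resampled alignments, each built by drawing columns with replacement."""
    rng = random.Random(seed)
    n_cols = len(aligned[0])
    for _ in range(n_replicates):
        # Pick n_cols column indices at random; the same column may repeat.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        # Apply the same column choice to every sequence so columns stay intact.
        yield ["".join(seq[c] for c in cols) for seq in aligned]


toy_alignment = ["acgt", "aggt", "acga"]
for replicate in bootstrap_replicates(toy_alignment, 3, seed=5):
    print(replicate)
```

Each replicate is then analyzed exactly like the original data, which is why the later programs are told to expect 1000 data sets.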
(3) Open dnadist.exe
Nucleic acid sequence Distance Matrix program, version 3.6a3
Settings for this run:
D Distance ? F84
G Gamma distributed rates across sites? No
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
W Use weights for sites? No
F Use empirical base frequencies? Yes
L Form of distance matrix? Square
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type ?
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
D Distance ? Kimura 2-parameter
Multiple data sets or multiple weights? (type D or W)
How many data sets?
Settings for this run:
D Distance ? Kimura 2-parameter
G Gamma distributed rates across sites? No
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
W Use weights for sites? No
F Use empirical base frequencies? Yes
L Form of distance matrix? Square
M Analyze multiple data sets? Yes, 1000 data sets
I Input sequences interleaved? Yes
0 Terminal type ? IBM PC
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
Option D offers four distance models: Kimura 2-parameter, Jin/Nei, Maximum-likelihood and Jukes-Cantor. For option T, a value between 1.5 and 3.0 is usually entered. For option M, enter 1000. The run produces the file C:\Program Files\Phylip\exe\outfile.
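The outfile written with the "Square" form is a plain-text distance matrix: a first line with the number of sequences, then one line per sequence holding a name field followed by its distances to every sequence. The sketch below writes such a matrix under the assumption of a 10-character name field; the function name and the distance values are made up for illustration, and the exact column widths in dnadist's own output may differ.

```python
def format_square_matrix(names, matrix):
    """Return a square distance matrix as text: count line, then name + distances."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # name truncated or padded to 10 characters
        lines.append(label + "".join(f"{d:10.6f}" for d in row))
    return "\n".join(lines) + "\n"


names = ["AY631071", "AY155203", "X89947"]
matrix = [
    [0.0, 0.083, 0.121],   # hypothetical distances, not computed from real data
    [0.083, 0.0, 0.134],
    [0.121, 0.134, 0.0],
]
print(format_square_matrix(names, matrix), end="")
```

With 1000 data sets, the real outfile simply contains 1000 such matrices one after another.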
(4) Open neighbor.exe
Neighbor-Joining/UPGMA method version 3.6a3
Settings for this run:
N Neighbor-Joining or UPGMA tree? Neighbor-Joining
O Outgroup root? No, Use as outgroup species 1
L Lower-triangular data matrix? No
R Upper-triangular data matrix? No
S Subreplication? No
J Randomize input order of species? No, Use input order
M Analyze multiple data sets? No
0 Terminal type ?
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
How many data sets?
Random number seed (must be odd)?
Settings for this run:
N Neighbor-Joining or UPGMA tree? Neighbor-Joining
O Outgroup root? No, Use as outgroup species 1
L Lower-triangular data matrix? No
R Upper-triangular data matrix? No
S Subreplication? No
J Randomize input order of species? Yes
M Analyze multiple data sets? Yes, 1000 sets
0 Terminal type ? IBM PC
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
This produces the files outtree and outfile in C:\Program Files\Phylip\exe\.
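The neighbor-joining method selected above can be sketched compactly. At each round it picks the pair (a, b) that minimizes Q(a, b) = (n - 2) d(a, b) - r(a) - r(b), where r is the row sum of the current distance matrix, joins the pair under a new internal node, and repeats until two nodes remain. This is an illustrative implementation, not PHYLIP's code; the function name, the `nodeN` labels and the five-taxon matrix are assumptions made for the example.

```python
def neighbor_joining(names, dist):
    """Return (child, parent, branch_length) edges of an unrooted NJ tree."""
    # Working distances as a dict of dicts keyed by node label.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair with the smallest Q value.
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        counter += 1
        u = f"node{counter}"
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        edges.append((a, u, len_a))
        edges.append((b, u, d[a][b] - len_a))
        # Distances from the new node to every remaining node.
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical distances between five taxa, chosen to be tree-like.
taxa = ["a", "b", "c", "d", "e"]
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for child, parent, length in neighbor_joining(taxa, example):
    print(child, parent, length)
```

With 1000 bootstrap data sets the procedure is simply run once per distance matrix, which is why outtree ends up holding 1000 trees.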
(5) Open consense.exe
Consensus tree program, version 3.6a3
Settings for this run:
C Consensus type ? Majority rule (extended)
O Outgroup root? No, use as outgroup species 1
R Trees to be treated as Rooted? No
T Terminal type ?
1 Print out the sets of the species Yes
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Are these settings correct?
Settings for this run:
C Consensus type ? Majority rule (extended)
R Trees to be treated as Rooted? Yes
T Terminal type ? IBM PC
1 Print out the sets of the species Yes
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
This produces the file C:\Program Files\Phylip\exe\outtree.
Rename outtree → jc.tre.
Open C:\Program Files\Phylip\exe\jc.tre and proceed as described in detail above.
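The numbers shown on the consensus tree are the fractions of bootstrap trees that contain each group. A minimal sketch of plain majority-rule counting follows, assuming each rooted tree has already been reduced to a set of clades (each clade a frozenset of taxon names). The extended rule selected in the menu above additionally admits compatible groups below 50%, which this sketch omits; the function name and the three toy trees are illustrative assumptions.

```python
from collections import Counter


def majority_rule(trees):
    """Return {clade: support} for clades found in more than half of the trees."""
    # Count each clade at most once per tree.
    counts = Counter(clade for tree in trees for clade in set(tree))
    n = len(trees)
    return {clade: c / n for clade, c in counts.items() if c / n > 0.5}


toy_trees = [
    {frozenset("ab"), frozenset("abc")},
    {frozenset("ab"), frozenset("abd")},
    {frozenset("ac"), frozenset("abc")},
]
for clade, support in majority_rule(toy_trees).items():
    print(sorted(clade), round(support, 2))
```

Multiplying the support by 100 gives the percentage bootstrap values usually printed at the nodes.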