Building a local NCBI BLAST database (NR, NT, etc.) | How to use blastx/diamond | Building a BLAST index | makeblastdb...
Published: 2019-06-21

How do I download the NCBI NR and NT databases?
Download BLAST:
First, get to know the BLAST databases:
 

How to download the NCBI BLAST databases

NCBI provides a handy script, update_blastdb.pl, that downloads any of the BLAST databases automatically.
Usage:
perl update_blastdb.pl nr

Which BLAST databases are available for download?

perl update_blastdb.pl --showall
This command lists every BLAST database available for download; pick the one you need:
16SMicrobial
cdd_delta
env_nr
env_nt
est
est_human
est_mouse
est_others
gss
gss_annot
htgs
human_genomic
landmark
nr
nt
other_genomic
pataa
patnt
pdbaa
pdbnt
ref_prok_rep_genomes
ref_viroids_rep_genomes
ref_viruses_rep_genomes
refseq_genomic
refseq_protein
refseq_rna
refseqgene
sts
swissprot
taxdb
tsa_nr
tsa_nt
vector
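Here I chose the nr database.
nohup perl update_blastdb.pl --decompress nr >out.log 2>&1 &
This downloads in the background and then decompresses automatically. (If the connection drops partway through, rerunning the command resumes the download without overwriting files that have already been downloaded.)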

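How to use BLAST

Only blastx is demonstrated here.

The nr database downloaded above is a protein database, and blastx aligns nucleotide sequences against a protein database. (nt is the nucleotide database.)
Because the downloaded database is already indexed, the makeblastdb step can be skipped.
The most commonly used options are:
-query the nucleotide sequences to search with
-db the database name
-out the output file
-evalue the E-value threshold
-outfmt the output format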
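
As a minimal sketch of how the five options above fit together, the snippet below assembles one blastx command and prints it for review. The output file name test.blastx.tsv and the 1e-5 cutoff are assumptions chosen for illustration, not values from this post; outfmt 6 is the tabular format. The search itself only runs if blastx is installed and the query file exists.

```shell
# Illustrative values; adjust them to your own data.
query="test.merged.transcript.fasta"   # nucleotide query sequences (FASTA)
db="nr"                                # base name of the pre-formatted database
out="test.blastx.tsv"                  # hypothetical output file name
evalue="1e-5"                          # assumed cutoff; the blastx default is 10
outfmt="6"                             # 6 = tabular output

# Build the command as text so it can be inspected before running.
cmd="blastx -query $query -db $db -out $out -evalue $evalue -outfmt $outfmt"
echo "$cmd"

# Run the search only when blastx is available and the query file is present.
if command -v blastx >/dev/null 2>&1 && [ -f "$query" ]; then
    blastx -query "$query" -db "$db" -out "$out" -evalue "$evalue" -outfmt "$outfmt"
fi
```

Printing the command first is a small safeguard, since a full nr search can run for hours.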

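Building a BLAST index | makeblastdb

makeblastdb -in mature.fa -input_type fasta -dbtype nucl -title miRBase -parse_seqids -out miRBase -logfile File_Name

-in the input file: the FASTA sequences to format

-dbtype the sequence type: nucl for nucleotide, prot for protein
-title a title for the database; purely cosmetic (it cannot be used as the -db argument in later searches)
-parse_seqids recommended; it parses the sequence IDs in the FASTA deflines so that individual sequences can later be retrieved by ID with blastdbcmd
-out the database name; pick something meaningful, because this is what you pass to -db in later BLAST+ searches
-logfile the log file; if omitted, messages are printed to the screen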

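Resource usage

blastx -query test.merged.transcript.fasta -db nr -out test.blastx.out

The FASTA file has only 19938 lines.

Yet the run consumed a lot of resources:

Average memory usage: 51.45G; peak: 115.37G

CPU: 1

Run time: 06:00:24 (can you believe it? And this is just a small test)

So I strongly recommend using diamond instead of BLAST for database searches.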

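Interpreting BLAST results

Each alignment that passes the thresholds is reported in a block like this (a query sequence that matches several subjects produces several blocks):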

>AAB70410.1 Similar to Schizosaccharomyces CCAAT-binding factor (gb|U88525).
EST gb|T04310 comes from this gene [Arabidopsis thaliana]
Length=208

 Score = 238 bits (607),  Expect = 7e-76, Method: Compositional matrix adjust.
 Identities = 116/145 (80%), Positives = 127/145 (88%), Gaps = 2/145 (1%)
 Frame = +1

Query  253  FWASQYQEIEQTSDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR  432
            FW +Q++EIE+T+DFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR
Sbjct  39   FWENQFKEIEKTTDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR  98

Query  433  SWNHTEENKRRTLQKNDIAAAITRNEIFDFLVDIVPREDLKDEVLASIPRGTLPMGAPTE  612
            SWNHTEENKRRTLQKNDIAAA+TR +IFDFLVDIVPREDL+DEVL SIPRGT+P  A
Sbjct  99   SWNHTEENKRRTLQKNDIAAAVTRTDIFDFLVDIVPREDLRDEVLGSIPRGTVPEAA-AA  157

Query  613  GLPYYYMQPQHAPQVGAPGMFMGKP  687
            G PY Y+    AP +G PGM MG P
Sbjct  158  GYPYGYLPAGTAP-IGNPGMVMGNP  181

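There are plenty of guides to reading this output online, so I won't repeat them here.

Here is diamond tested under the same conditions:

Average memory usage: 11.01G; peak: 12.44G

CPU: 1 requested (571.17% utilization), meaning it automatically used 5-6 CPUs

Run time: 00:26:15

diamond also states that its strength is handling >1M queries: the larger the input, the bigger the speed advantage.

Basic diamond usage: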

diamond makedb --in nr.fa -d nr
diamond blastx -d nr -q test.merged.transcript.fasta -o test.matches.m8
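
By default diamond writes the same 12-column tabular layout as BLAST outfmt 6 (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sketch below shows one possible way to post-process such a file with awk, keeping the highest-bitscore hit per query under an E-value cutoff. The sample rows, the helper name best_hits and the 1e-5 cutoff are all made up for illustration.

```shell
# Write a tiny, made-up m8 file so the example runs without diamond installed.
tmp=$(mktemp -d)
m8="$tmp/test.matches.m8"
printf 'q1\tprotA\t80.0\t145\t27\t2\t253\t687\t39\t181\t7e-76\t238\n'  > "$m8"
printf 'q1\tprotB\t45.0\t120\t60\t3\t260\t620\t10\t130\t2e-20\t95.1\n' >> "$m8"
printf 'q2\tprotC\t30.0\t50\t35\t0\t1\t150\t5\t54\t0.5\t28.0\n'        >> "$m8"
printf 'q3\tprotD\t60.0\t200\t80\t0\t1\t600\t1\t200\t1e-50\t180\n'     >> "$m8"
printf 'q3\tprotE\t62.0\t210\t80\t0\t1\t630\t1\t210\t1e-55\t195\n'     >> "$m8"

# best_hits FILE MAX_EVALUE (hypothetical helper)
# Column 11 is the E-value and column 12 the bit score.
# Keeps, for each query, the row with the highest bit score among rows
# whose E-value is at or below the cutoff.
best_hits() {
    awk -F'\t' -v max_e="$2" '
        $11 + 0 <= max_e + 0 && (!($1 in best) || $12 + 0 > best[$1]) {
            best[$1] = $12 + 0
            line[$1] = $0
        }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}

best=$(best_hits "$m8" 1e-5)
printf '%s\n' "$best"
```

With the sample data, q2 is dropped because its only hit has an E-value of 0.5, while q1 and q3 each keep a single row.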

However, diamond has one limitation: it can only search against protein databases.

These are the parameter settings recommended by OrfPredictor:

To minimize the file size of BLASTX output for loading, the following parameters are recommended if the BLASTX in the 'NCBI-blastall' package is used: "-v 1 -b 1 -e 1e-5" (Note: we used version 2.2.19 - earlier or later versions may not work properly).
 

The full blastx help text is reproduced below for reference:

$ blastx -help
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-line_length line_length] [-html]
    [-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
    [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
   Translated Query-Protein Subject BLAST 2.7.1+

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number;  ignore other arguments

 *** Input query options
 -query
   Input file name
   Default = `-'
 -query_loc
   Location on the query sequence in 1-based offsets (Format: start-stop)
 -strand
   Query strand(s) to search against database/subject
   Default = `both'
 -query_gencode
   Genetic code to use to translate query (see user manual for details)
   Default = `1'

 *** General search options
 -task
   Task to execute
   Default = `blastx'
 -db
   BLAST database name
    * Incompatible with:  subject, subject_loc
 -out
   Output file name
   Default = `-'
 -evalue
   Expectation value (E) threshold for saving hits
   Default = `10'
 -word_size (>=2)
   Word size for wordfinder algorithm
 -gapopen
   Cost to open a gap
 -gapextend
   Cost to extend a gap
 -max_intron_length (>=0)
   Length of the largest intron allowed in a translated nucleotide sequence
   when linking multiple distinct alignments
   Default = `0'
 -matrix
   Scoring matrix name (normally BLOSUM62)
 -threshold (>=0)
   Minimum word score such that the word is added to the BLAST lookup table
 -comp_based_stats
   Use composition-based statistics:
       D or d: default (equivalent to 2 )
       0 or F or f: No composition-based statistics
       1: Composition-based statistics as in NAR 29:2994-3005, 2001
       2 or T or t : Composition-based score adjustment as in Bioinformatics
   21:902-911, 2005, conditioned on sequence properties
       3: Composition-based score adjustment as in Bioinformatics 21:902-911,
   2005, unconditionally
   Default = `2'

 *** BLAST-2-Sequences options
 -subject
   Subject sequence(s) to search
    * Incompatible with:  db, gilist, seqidlist, negative_gilist,
   negative_seqidlist, db_soft_mask, db_hard_mask
 -subject_loc
   Location on the subject sequence in 1-based offsets (Format: start-stop)
    * Incompatible with:  db, gilist, seqidlist, negative_gilist,
   negative_seqidlist, db_soft_mask, db_hard_mask, remote

 *** Formatting options
 -outfmt
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    18 = Organism Report

   Options 6, 7 and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
            qseqid means Query Seq-id
               qgi means Query GI
              qacc means Query accesion
           qaccver means Query accesion.version
              qlen means Query sequence length
            sseqid means Subject Seq-id
         sallseqid means All subject Seq-id(s), separated by a ';'
               sgi means Subject GI
            sallgi means All subject GIs
              sacc means Subject accession
           saccver means Subject accession.version
           sallacc means All subject accessions
              slen means Subject sequence length
            qstart means Start of alignment in query
              qend means End of alignment in query
            sstart means Start of alignment in subject
              send means End of alignment in subject
              qseq means Aligned part of query sequence
              sseq means Aligned part of subject sequence
            evalue means Expect value
          bitscore means Bit score
             score means Raw score
            length means Alignment length
            pident means Percentage of identical matches
            nident means Number of identical matches
          mismatch means Number of mismatches
          positive means Number of positive-scoring matches
           gapopen means Number of gap openings
              gaps means Total number of gaps
              ppos means Percentage of positive-scoring matches
            frames means Query and subject frames separated by a '/'
            qframe means Query frame
            sframe means Subject frame
              btop means Blast traceback operations (BTOP)
            staxid means Subject Taxonomy ID
          ssciname means Subject Scientific Name
          scomname means Subject Common Name
        sblastname means Subject Blast Name
         sskingdom means Subject Super Kingdom
           staxids means unique Subject Taxonomy ID(s), separated by a ';'
                         (in numerical order)
         sscinames means unique Subject Scientific Name(s), separated by a ';'
         scomnames means unique Subject Common Name(s), separated by a ';'
       sblastnames means unique Subject Blast Name(s), separated by a ';'
                         (in alphabetical order)
        sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                         (in alphabetical order)
            stitle means Subject Title
        salltitles means All Subject Title(s), separated by a '<>'
           sstrand means Subject Strand
             qcovs means Query Coverage Per Subject
           qcovhsp means Query Coverage Per HSP
            qcovus means Query Coverage Per Unique Subject (blastn only)
   When not provided, the default value is:
   'qaccver saccver pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'
 -show_gis
   Show NCBI GIs in deflines?
 -num_descriptions (>=0)
   Number of database sequences to show one-line descriptions for
   Not applicable for outfmt > 4
   Default = `500'
    * Incompatible with:  max_target_seqs
 -num_alignments (>=0)
   Number of database sequences to show alignments for
   Default = `250'
    * Incompatible with:  max_target_seqs
 -line_length (>=1)
   Line length for formatting alignments
   Not applicable for outfmt > 4
   Default = `60'
 -html
   Produce HTML output?

 *** Query filtering options
 -seg
   Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or
   'no' to disable)
   Default = `12 2.2 2.5'
 -soft_masking
   Apply filtering locations as soft masks
   Default = `false'
 -lcase_masking
   Use lower case filtering in query and subject sequence(s)?

 *** Restrict search or results
 -gilist
   Restrict search of database to list of GI's
    * Incompatible with:  negative_gilist, seqidlist, negative_seqidlist,
   remote, subject, subject_loc
 -seqidlist
   Restrict search of database to list of SeqId's
    * Incompatible with:  gilist, negative_gilist, negative_seqidlist, remote,
   subject, subject_loc
 -negative_gilist
   Restrict search of database to everything except the listed GIs
    * Incompatible with:  gilist, seqidlist, remote, subject, subject_loc
 -negative_seqidlist
   Restrict search of database to everything except the listed SeqIDs
    * Incompatible with:  gilist, seqidlist, remote, subject, subject_loc
 -entrez_query
   Restrict search with the given Entrez query
    * Requires:  remote
 -db_soft_mask
   Filtering algorithm ID to apply to the BLAST database as soft masking
    * Incompatible with:  db_hard_mask, subject, subject_loc
 -db_hard_mask
   Filtering algorithm ID to apply to the BLAST database as hard masking
    * Incompatible with:  db_soft_mask, subject, subject_loc
 -qcov_hsp_perc
   Percent query coverage per hsp
 -max_hsps (>=1)
   Set maximum number of HSPs per subject sequence to save for each query
 -culling_limit (>=0)
   If the query range of a hit is enveloped by that of at least this many
   higher-scoring hits, delete the hit
    * Incompatible with:  best_hit_overhang, best_hit_score_edge
 -best_hit_overhang (>0 and <0.5)
   Best Hit algorithm overhang value (recommended value: 0.1)
    * Incompatible with:  culling_limit
 -best_hit_score_edge (>0 and <0.5)
   Best Hit algorithm score edge value (recommended value: 0.1)
    * Incompatible with:  culling_limit
 -max_target_seqs (>=1)
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'
    * Incompatible with:  num_descriptions, num_alignments

 *** Statistical options
 -dbsize
   Effective length of the database
 -searchsp (>=0)
   Effective length of the search space
 -sum_stats
   Use sum statistics

 *** Search strategy options
 -import_search_strategy
   Search strategy to use
    * Incompatible with:  export_search_strategy
 -export_search_strategy
   File name to record the search strategy used
    * Incompatible with:  import_search_strategy

 *** Extension options
 -xdrop_ungap
   X-dropoff value (in bits) for ungapped extensions
 -xdrop_gap
   X-dropoff value (in bits) for preliminary gapped extensions
 -xdrop_gap_final
   X-dropoff value (in bits) for final gapped alignment
 -window_size (>=0)
   Multiple hits window size, use 0 to specify 1-hit algorithm
 -ungapped
   Perform ungapped alignment only?

 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads (>=1 and =<24)
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
 -remote
   Execute search remotely?
    * Incompatible with:  gilist, seqidlist, negative_gilist,
   negative_seqidlist, subject_loc, num_threads
 -use_sw_tback
   Compute locally optimal Smith-Waterman alignments?
 

Below is a detailed guide in English that I copied for reference:

1. Quick Start

  • Get all numbered files for a database with the same base name: Each of these files represents a subset (volume) of that database, and all of them are needed to reconstitute the database.
  • After extraction, there is no need to concatenate the resulting files: call the database with the base name; for the nr database files, use "-db nr". (These databases have already been run through makeblastdb, so they can be used directly after download.)
  • For easy download, use the update_blastdb.pl script from the blast+ package.
  • Incremental update is not available.

2. General Introduction

BLAST search pages under the Basic BLAST section of the NCBI BLAST home page use a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases are made available as compressed archives in pre-formatted form and can be downloaded from the /db directory of the BLAST ftp site. The FASTA files reside under the /FASTA subdirectory.

The pre-formatted databases offer the following advantages:

  • Pre-formatting removes the need to run makeblastdb (no database-building command has to be run);
  • Species-level taxonomy ids are included for each database entry;
  • Databases are broken into smaller-sized volumes and are therefore easier to download;
  • Sequences in FASTA format can be generated from the pre-formatted databases by using the blastdbcmd utility (that is, FASTA files can be exported from these database files);
  • A convenient script (update_blastdb.pl) is available in the blast+ package to download the pre-formatted databases (the same script can be used to update them).

Pre-formatted databases must be downloaded using the update_blastdb.pl script or via FTP in binary mode. Documentation for this script can be obtained by running the script without any arguments; Perl installation is required.

The compressed files downloaded must be inflated with gzip or other decompress tools. The BLAST database files can then be extracted out of the resulting tar file using the tar utility on Unix/Linux, or WinZip and StuffIt Expander on Windows and Macintosh platforms, respectively. (The downloaded databases are compressed archives and must be decompressed.)

Large databases are formatted in multiple one-gigabyte volumes, which are named using the basename.##.tar.gz convention. All volumes with the same base name are required. An alias file is provided to tie individual volumes together so that the database can be called using the base name (without the .nal or .pal extension). For example, to call the est database, simply use "-db est" option in the command line (without the quotes). (Large databases are usually split into several archives; the nr database, for example, has 11. Every archive belonging to the database must be downloaded and extracted. Extraction produces the corresponding database files together with an nr.pal file. To search nr, just pass -db nr.)
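
For archives downloaded by hand rather than with update_blastdb.pl --decompress, a loop along these lines could extract every volume that shares a base name. This is only a sketch: extract_volumes is a hypothetical helper, and the two archives built here are dummies (real nr volumes contain files such as .phr, .pin and .psq) so that the example runs on its own.

```shell
# Build a throwaway directory with two fake volumes to demonstrate on.
work=$(mktemp -d)
cd "$work" || exit 1
for i in 00 01; do
    echo "dummy volume $i" > "nr.$i.phr"
    tar -czf "nr.$i.tar.gz" "nr.$i.phr"
    rm "nr.$i.phr"
done

# extract_volumes BASENAME (hypothetical helper)
# Extracts every BASENAME.*.tar.gz archive found in the current directory.
extract_volumes() {
    base=$1
    for archive in "$base".*.tar.gz; do
        # If the glob matched nothing, the pattern itself is returned.
        [ -e "$archive" ] || { echo "no archives found for $base" >&2; return 1; }
        tar -xzf "$archive" || return 1
    done
}

extract_volumes nr
ls nr.*.phr
```

No concatenation step is needed afterwards, because the alias file ties the extracted volumes together under the base name.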

Additional BLAST databases that are not provided in pre-formatted formats may be available in the FASTA subdirectory. For other genomic BLAST databases, please check the genomes ftp directory at:   

3. Contents of the /blast/db/ directory

The pre-formatted BLAST databases are archived in this directory. The names of these databases and their contents are listed below.

+-------------------------------+------------------------------------------------+
File Name                       #  Content Description
+-------------------------------+------------------------------------------------+
16SMicrobial.tar.gz             #  Bacterial and Archaeal 16S rRNA sequences from
                                   BioProjects 33175 and 33117
FASTA/                          #  Subdirectory for FASTA formatted sequences
README                          #  README for this subdirectory (this file)
Representative_Genomes.*tar.gz  #  Representative bacterial/archaeal genomes database
cdd_delta.tar.gz                #  Conserved Domain Database sequences for use with
                                   stand alone deltablast
cloud/                          #  Subdirectory of databases for BLAST AMI; see
                                   http://1.usa.gov/TJAnEt
env_nr.*tar.gz                  #  Protein sequences for metagenomes
env_nt.*tar.gz                  #  Nucleotide sequences for metagenomes
est.tar.gz                      #  This file requires est_human.*.tar.gz,
                                   est_mouse.*.tar.gz, and est_others.*.tar.gz files
                                   to function. It contains the est.nal alias so that
                                   searches against est (-db est) will include
                                   est_human, est_mouse and est_others.
est_human.*.tar.gz              #  Human subset of the est database from the est
                                   division of GenBank, EMBL and DDBJ.
est_mouse.*.tar.gz              #  Mouse subset of the est database
est_others.*.tar.gz             #  Non-human and non-mouse subset of the est database
gss.*tar.gz                     #  Sequences from the GSS division of GenBank, EMBL,
                                   and DDBJ
htgs.*tar.gz                    #  Sequences from the HTG division of GenBank, EMBL,
                                   and DDBJ
human_genomic.*tar.gz           #  Human RefSeq (NC_) chromosome records with gap
                                   adjusted concatenated NT_ contigs
nr.*tar.gz                      #  Non-redundant protein sequences from GenPept,
                                   Swissprot, PIR, PRF, PDB, and NCBI RefSeq
nt.*tar.gz                      #  Partially non-redundant nucleotide sequences from
                                   all traditional divisions of GenBank, EMBL, and
                                   DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS.
other_genomic.*tar.gz           #  RefSeq chromosome records (NC_) for non-human
                                   organisms
pataa.*tar.gz                   #  Patent protein sequences
patnt.*tar.gz                   #  Patent nucleotide sequences. Both patent databases
                                   are directly from the USPTO, or from the EPO/JPO
                                   via EMBL/DDBJ
pdbaa.*tar.gz                   #  Sequences for the protein structure from the
                                   Protein Data Bank
pdbnt.*tar.gz                   #  Sequences for the nucleotide structure from the
                                   Protein Data Bank. They are NOT the protein coding
                                   sequences for the corresponding pdbaa entries.
refseq_genomic.*tar.gz          #  NCBI genomic reference sequences
refseq_protein.*tar.gz          #  NCBI protein reference sequences
refseq_rna.*tar.gz              #  NCBI Transcript reference sequences
sts.*tar.gz                     #  Sequences from the STS division of GenBank, EMBL,
                                   and DDBJ
swissprot.tar.gz                #  Swiss-Prot sequence database (last major update)
taxdb.tar.gz                    #  Additional taxonomy information for the databases
                                   listed here providing common and scientific names
tsa_nt.*tar.gz                  #  Sequences from the TSA division of GenBank, EMBL,
                                   and DDBJ
vector.tar.gz                   #  Vector sequences from 2010, see Note 2 in section 4.
wgs.*tar.gz                     #  Sequences from Whole Genome Shotgun assemblies
+-------------------------------+------------------------------------------------+

4. Contents of the /blast/db/FASTA directory

This directory contains FASTA formatted sequence files. The file names and database contents are listed below. These files must be unpacked and processed through makeblastdb before they can be used by the BLAST programs.

+-------------------+-----------------------------------------------------+
File Name           #  Content Description
+-------------------+-----------------------------------------------------+
alu.a.gz            #  translation of alu.n repeats
alu.n.gz            #  alu repeat elements (from 2003)
drosoph.aa.gz       #  CDS translations from drosophila.nt
drosoph.nt.gz       #  genomic sequences for drosophila (from 2003)
env_nr.gz*          #  Protein sequences for metagenomes, taxid 408169
env_nt.gz*          #  Nucleotide sequences for metagenomes, taxid 408169
est_human.gz*       #  human subset of the est database (see Note 1)
est_mouse.gz*       #  mouse subset of the est database
est_others.gz*      #  non-human and non-mouse subset of the est database
gss.gz*             #  sequences from the GSS division of GenBank, EMBL,
                       and DDBJ
htgs.gz*            #  sequences from the HTG division of GenBank, EMBL,
                       and DDBJ
human_genomic.gz*   #  human RefSeq (NC_) chromosome records with gap
                       adjusted concatenated NT_ contigs
igSeqNt.gz          #  human and mouse immunoglobulin variable region
                       nucleotide sequences
igSeqProt.gz        #  human and mouse immunoglobulin variable region
                       protein sequences
mito.aa.gz          #  CDS translations of complete mitochondrial genomes
mito.nt.gz          #  complete mitochondrial genomes
nr.gz*              #  non-redundant protein sequence database with entries
                       from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq
nt.gz*              #  nucleotide sequence database, with entries from all
                       traditional divisions of GenBank, EMBL, and DDBJ;
                       excluding bulk divisions (gss, sts, pat, est, htg)
                       and wgs entries. Partially non-redundant.
other_genomic.gz*   #  RefSeq chromosome records (NC_) for organisms other
                       than human
pataa.gz*           #  patent protein sequences
patnt.gz*           #  patent nucleotide sequences. Both patent sequence
                       files are from the USPTO, or EPO/JPO via EMBL/DDBJ
pdbaa.gz*           #  protein sequences from pdb protein structures
pdbnt.gz*           #  nucleotide sequences from pdb nucleic acid
                       structures. They are NOT the protein coding
                       sequences for the corresponding pdbaa entries.
sts.gz*             #  database for sequence tag site entries
swissprot.gz*       #  swiss-prot database (last major release)
vector.gz           #  vector sequences from 2010. (See Note 2)
wgs.gz*             #  whole genome shotgun genome assemblies
yeast.aa.gz         #  protein translations from yeast.nt
yeast.nt.gz         #  yeast genomes (from 2003)
+-------------------+-----------------------------------------------------+

NOTE:
(1) NCBI does not provide the complete est database in FASTA format. One needs to get all three subsets (est_human, est_mouse, and est_others) and concatenate them into the complete est fasta database.
(2) For screening for vector contamination, use the UniVec database:
*  marked files have pre-formatted counterparts.
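
Note (1) can be sketched as follows: decompress the three est subsets and concatenate them into a single FASTA file. The records written here are made-up placeholders so the example runs without the real downloads, and est.fa is an assumed output name.

```shell
# Create three tiny gzipped FASTA files standing in for the real est subsets.
work=$(mktemp -d)
for part in est_human est_mouse est_others; do
    printf '>%s_1 made-up record\nACGT\n' "$part" | gzip > "$work/$part.gz"
done

# gzip -dc writes each decompressed file to stdout in the order given,
# so a single redirect produces the concatenated FASTA.
gzip -dc "$work/est_human.gz" "$work/est_mouse.gz" "$work/est_others.gz" > "$work/est.fa"

# Count the FASTA records (lines starting with '>') as a sanity check.
n_records=$(grep -c '^>' "$work/est.fa")
echo "$n_records"
```

The combined file would then still need to be formatted with makeblastdb (using -dbtype nucl) before BLAST can search it.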

5. Database updates

The BLAST databases are updated regularly. There is no established incremental update scheme. We recommend downloading the complete databases regularly to keep their content current.

6. Non-redundant defline syntax

The non-redundant databases are nr, nt (partially) and pataa. In them, identical sequences are merged into one entry. To be merged two sequences must have identical lengths and every residue at every position must be the same. The FASTA deflines for the different entries that belong to one record are separated by control-A characters invisible to most programs. In the example below both entries gi|1469284 and gi|1477453 have the same sequence, in every respect:

>gi|3023276|sp|Q57293|AFUC_ACTPL   Ferric transport ATP-binding protein afuC ^Agi|1469284|gb|AAB05030.1|   afuC gene product ^Agi|1477453|gb|AAB17216.1|   afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
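
Because the control-A separators (byte 0x01, shown as ^A above) are invisible in most viewers, it can help to split a merged defline into one entry per line. The following is a small sketch using tr; the defline is a shortened version of the example above and the variable names are only illustrative.

```shell
# Rebuild a merged defline with real control-A bytes (octal \001) between entries.
defline=$(printf '>gi|3023276|sp|Q57293|AFUC_ACTPL Ferric transport ATP-binding protein afuC\001gi|1469284|gb|AAB05030.1| afuC gene product\001gi|1477453|gb|AAB17216.1| afuC [Actinobacillus pleuropneumoniae]')

# Drop the leading '>' and turn every control-A into a newline.
entries=$(printf '%s\n' "$defline" | sed 's/^>//' | tr '\001' '\n')
printf '%s\n' "$entries"

# Count how many database entries share this one sequence.
n_entries=$(printf '%s\n' "$entries" | grep -c '')
echo "$n_entries"
```

Each output line then corresponds to one of the original records that were merged into the non-redundant entry.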

The syntax of sequence header lines used by the NCBI BLAST server depends on the database from which each sequence was obtained.  The table at lists the supported FASTA identifiers. (Some BLAST databases are not provided in pre-formatted form; those can be downloaded from the FASTA directory.)

For databases whose entries are not from official NCBI sequence databases, such as Trace database, the gnl| convention is used. For custom databases, this convention should be followed and the id for each sequence must be unique, if one would like to take the advantage of indexed database, which enables specific sequence retrieval using blastdbcmd program included in the blast executable package.  One should refer to documents distributed in the standalone BLAST package for more details.

7. Formatting a FASTA file into a BLASTable database

FASTA files need to be formatted with makeblastdb before they can be used in local blast search. For those from NCBI, the following makeblastdb commands are recommended:

For nucleotide fasta file:  

makeblastdb -in input_db -dbtype nucl -parse_seqids

For protein fasta file:     

makeblastdb -in input_db -dbtype prot -parse_seqids

 

                        

Reposted from: http://nmcka.baihongyu.com/
