WU-BLAST Reference - BLAST (Basic Local Alignment Search Tool)

Chapter 14. WU-BLAST Reference

WU-BLAST was developed and is maintained entirely by Warren Gish. He was one of the original authors of BLAST while at the NCBI but is now at Washington University in St. Louis (where the WU comes from). Development began in 1994 at Version 1.4, before BLAST had gapped alignments. Quite a lot has changed since then. Paradoxically, WU-BLAST is more similar to the original BLAST than the current NCBI version.

WU-BLAST is useful because it has more command-line parameters that allow advanced users to control the program with more precision. It is also faster. Table 14-1 displays features unique to WU-BLAST or significantly different from NCBI-BLAST.

Table 14-1. WU- and NCBI-BLAST feature differences
Feature	WU-BLAST	NCBI-BLAST
Word size	Any word size for any program mode. Neighborhood words are turned off for word sizes of 5 or greater, but may be activated by setting an explicit value for T.	blastn has a minimum word size of 7. blastp, blastx, tblastn, and tblastx have word sizes of 2 or 3. Neighborhood words are never used for blastn.
Nucleotide scoring	Choice of match/mismatch or scoring matrix.	Only match/mismatch scoring.
Nucleotide statistics	Karlin-Altschul parameters are available for several match/mismatch values and gap costs.	Karlin-Altschul parameters are always computed without respect to gap costs. Reported E-values may greatly overestimate significance.
altscore	Allows score modification for any matrix (e.g., to set stop scores lower).	Nothing similar.
H, K, L, gapH, gapK, gapL	Allow you to supply values for the Karlin-Altschul parameters; especially useful with unsupported scoring schemes.	Nothing similar. Unsupported scoring schemes are fatal errors.
Alias databases	No, but virtual databases offer similar functionality.	Yes, both alias and virtual databases are supported.
Gapped alignment	All programs.	All programs except tblastx.
/etc/sysblast	Allows systems administrators to set system-wide resource restrictions.	Nothing similar.
Database subset selection	Yes, via dbrecmin and dbrecmax.	No, but alias databases can be used for static splitting.
Restricted region of query	The nwstart and nwlen parameters restrict seeding but not alignment.	-L restricts both seeding and alignment.
links	Displays the order of alignments in a group.	Nothing similar.
topcomboN	Allows restriction of the number of alignment groups. Groups are clearly labeled.	Nothing similar.
kap	Computes significance without sum statistics.	Nothing similar.
olf, golf, olmax, golmax	Allows setting of overlap rules for HSP consistency.	Fixed internally.
notes, warnings, errors	Descriptive messages at various levels of caution.	Most error messages are terse and not user friendly.
Output formats	Only the standard format.	Multiple output report formats including HTML, ASN.1, XML, tabular, and anchored multiple alignments. See Appendix A.

To use the most recent version of WU-BLAST, you must have a site license from Washington University in St. Louis. The product is free for academic use, but commercial users must pay a fee. Unlike NCBI-BLAST, the source code isn't freely available. For the latest information on WU-BLAST, visit the official site at http://blast.wustl.edu. If you want to try WU-BLAST, an early version is available without a license.

14.1 Usage Statements

All WU-BLAST programs provide usage statements if they are executed without any arguments. They are sometimes lengthy, so it's best to pipe them through a pager such as less or more.

blastn | more

xdformat | less

xdget | less

14.2 Command-Line Syntax

WU-BLAST command-line syntax isn't uniform across programs. The BLAST programs blastn, blastp, blastx, tblastn, and tblastx use a slightly different syntax than xdformat and xdget do.

The BLAST program options come after the mandatory arguments of database and query sequence. The command-line structure is as follows:

[program name] [blast database] [query sequence] [parameters]

The parameter names in the BLAST programs and their arguments have some flexibility. The following command lines are all equivalent:

blastn db query E=10

blastn db query -E 10

blastn db query E 10

blastn db query -E=10

This book uses the first form to avoid confusion with NCBI-BLAST.
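As a rough illustration of this flexibility, the following Python sketch rewrites the four spellings above into the name=value form. The helper normalize_params is hypothetical and is not part of WU-BLAST. It needs to be told which parameters take a value (the valued argument, also an assumption of this sketch), because "E 10" and a bare switch such as top look alike on the command line.

```python
def normalize_params(tokens, valued=("E",)):
    """Rewrite WU-BLAST-style option tokens into the name=value form.

    `valued` lists the parameter names that take a value.
    """
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i].lstrip("-")              # "-E" and "E" name the same parameter
        if "=" in tok:                           # "E=10" or "-E=10"
            out.append(tok)
        elif tok in valued and i + 1 < len(tokens):
            out.append(f"{tok}={tokens[i + 1]}")  # "E 10" or "-E 10"
            i += 1                               # skip the value just consumed
        else:
            out.append(tok)                      # bare switch such as "top"
        i += 1
    return out


for form in ("E=10", "-E 10", "E 10", "-E=10"):
    print(form, "->", normalize_params(form.split()))
```

All four forms come out the same, which is the point of the paragraph above.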

xdformat and xdget use the traditional Unix syntax where the parameters precede the mandatory arguments:

[program name] [parameters] [mandatory arguments]

The xdformat and xdget options are all single letters preceded by a single dash. For parameters that require a value, a space between the parameter and its value is optional. As is typical for Unix programs, a double dash indicates the end of command-line options and a single dash signifies stdin.

xdformat -p protein_db

xdformat -n -I nucleotide_db

zcat fasta.*.gz | xdformat -n -o my_db -- -

14.3 WU-BLAST Parameters

WU-BLAST has many control parameters, some of which are esoteric and rarely useful. The most important parameters are listed here.

altscore=[string]

Default: Off

Defines an alternate scoring system for any pair of letters. For example, altscore="M M -3" changes the score of M-M pairs to -3, and altscore="A C 4" gives a score of 4 if the query is A and the subject is C. Letters may be designated as any to change an entire row or column. The score can be given as min or max for the minimum and maximum scores in the matrix or na to make the score infinitely low. To set the score of all rows and columns containing stop codons to negative infinity, set altscore="* any na" and altscore="any * na". If you change the scoring parameters, you may also want to adjust gapL, gapH, and gapK.

See also

nogap, gapL, gapH, gapK
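To make the altscore rules concrete, here is a small Python sketch that applies altscore strings to a toy scoring matrix held in a dictionary. The function apply_altscore and the dictionary layout are illustrative assumptions; WU-BLAST's internal representation isn't described here. The value na is modeled as negative infinity, and min and max are taken over the finite scores currently in the matrix.

```python
import math


def apply_altscore(matrix, spec):
    """Apply one altscore string such as "M M -3" or "* any na".

    `matrix` maps (query_letter, subject_letter) to a score and is
    modified in place.
    """
    q, s, value = spec.split()
    finite = [v for v in matrix.values() if v != -math.inf]
    if value == "na":
        score = -math.inf                # "infinitely low"
    elif value == "min":
        score = min(finite)
    elif value == "max":
        score = max(finite)
    else:
        score = int(value)
    for (a, b) in matrix:
        # "any" matches every letter in that row or column position
        if q in ("any", a) and s in ("any", b):
            matrix[(a, b)] = score
    return matrix


# Toy three-letter alphabet: +4 on the diagonal, -1 elsewhere.
toy = {(a, b): (4 if a == b else -1) for a in "MA*" for b in "MA*"}
apply_altscore(toy, "M M -3")
apply_altscore(toy, "* any na")
apply_altscore(toy, "any * na")
```

After these three calls, the M-M score is -3 and every pair involving the stop symbol is negative infinity, while the other scores are unchanged.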

B=[integer]

Default: 250

Sets the number of database hits to report. A warning is issued if this number is exceeded. It is typical to set this parameter to a very high value, such as B=100000, to ensure that no alignments are missed.

bottom

Default: Off

Programs: blastn, tblastx, blastx

Search only the bottom strand of the query.

See also

top

cpus=[integer]

Default: 4 for blastn; all for blastp, blastx, tblastn, and tblastx

Sets the number of processors to use. If cpus isn't set, every program except blastn may use all processors on the system; blastn limits itself to 4. See Chapter 10 for information on the /etc/sysblast file used for setting systemwide resource limitations.

dbrecmax=[integer]

Default: Last database record

Last database record number to search.

See also

dbrecmin, qrecmin, qrecmax

dbrecmin=[integer]

Default: 1

First database record number to search. For example, by setting dbrecmin=1 dbrecmax=10, only the first 10 database sequences are searched.

See also

dbrecmax, qrecmin, qrecmax
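One possible use of dbrecmin and dbrecmax is to divide a database among several jobs. The Python sketch below computes contiguous record ranges and prints one command line per range. The function record_ranges and the database and query names are placeholders, and splitting by record count is only an assumption about how you might divide the work; it doesn't balance the jobs by sequence length.

```python
def record_ranges(n_records, n_jobs):
    """Yield (dbrecmin, dbrecmax) pairs that together cover 1..n_records."""
    base, extra = divmod(n_records, n_jobs)
    start = 1
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)   # spread the remainder
        if size == 0:
            continue                              # more jobs than records
        yield start, start + size - 1
        start += size


for lo, hi in record_ranges(10, 3):
    print(f"blastn db query dbrecmin={lo} dbrecmax={hi}")
```

Because E-values depend on the size of the search space, you may also want to look at the Z parameter when searching subsets.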

E=[number]

Default: 10

This is the E from the Karlin-Altschul equation. Database hits whose E-value is greater than this threshold will not be reported. If both E and S are set, the more restrictive parameter is used.

See also

S

E2=[number]

Default: Variable; calculated from scoring parameters

Sets the alignment threshold for ungapped alignments. When E2 and S2 are set, the more restrictive parameter is used.

See also

S2, gapE2, gapS2

echofilter

Default: Off

Prints out the query sequence after all filtering is performed. This is useful for troubleshooting when there are no database hits, and you suspect the filtering is too aggressive.

See also

filter, wordmask, maskextra

errors

Default: Off

Suppresses nonfatal error messages. It is generally a good idea to pay attention to the error messages, but at times it is useful to block them.

See also

nonnegok, novalidctxok

filter=[string]

Default: Off

Processes the query sequence with the specified filtering method. Letters are replaced with X and N for proteins and nucleotides, respectively.

seg

Identifies low-complexity regions in both nucleotide and amino acid sequences.

dust

The standard low-complexity filter for nucleotide sequences. Generally less sensitive than seg.

xnu

Finds short repeats in protein sequences.

seg+xnu

Combines both seg and xnu.

ccp

Coiled-coil filter for proteins.

Multiple filtering methods may be specified on the same command line; for example:

blastp nr query filter=seg filter=ccp filter=xnu

See also

echofilter, maskextra, wordmask

gapE2=[number]

Default: Variable; calculated from scoring parameters

Expectation threshold for saving individual gapped alignments. When gapE2 and gapS2 are set, the more restrictive parameter is used.

See also

gapS2, E2, S2

gapH=[number]

Default: Variable; depends on scoring parameters

Sets the value of H (information per aligned letter) for gapped alignments. If a particular combination of scoring matrix (or match/mismatch scores) and gap values doesn't already have precomputed values for gapH, gapK, and gapL, WU-BLAST uses ungapped statistics. In this case, the resulting E-values may be much too low. A warning is issued when this is the case. Computing proper values for gapped Karlin-Altschul parameters requires simulations with random sequences that determine what ungapped scoring scheme is most similar to the gapped scoring scheme.

See also

H, K, gapK, L, gapL, warnings

gapK=[number]

Default: Variable; depends on scoring parameters

Sets the value of the Karlin-Altschul K parameter for gapped alignments. See the description for gapH.

See also

H, gapH, K, L, gapL

gapL=[number]

Default: Variable; depends on scoring parameters

Sets the value of the Karlin-Altschul parameter lambda (information per unit score) used for gapped alignments. See the description for gapH.

See also

H, gapH, K, gapK, L

gapS2=[integer]

Default: Variable; calculated from scoring parameters

Score threshold for saving individual gapped alignments. Alignments below the threshold aren't reported.

See also

gapE2

gapsepqmax=[integer]

Default: Unlimited

Maximum separation allowed between gapped alignments along the query.

See also

gapsepsmax, hspsepqmax, hspsepsmax

gapsepsmax=[integer]

Default: Unlimited

Maximum separation allowed between gapped alignments along the subject.

See also

gapsepqmax, hspsepqmax, hspsepsmax

gapX=[integer]

Default: Variable; depends on scoring parameters

Sets the alignment extension cutoff for gapped alignment.

See also

X

gi

Default: Off

Displays the GenInfo identifiers of database hits, if present.

golf=[number]

Default: 0.1

Maximum fractional length overlap for gapped alignment consistency. See the description for olf.

golmax=[integer]

Default: Unlimited

Maximum absolute length of overlap for gapped alignment consistency. See the description for olf.

gspmax=[integer]

Default: 1000

Sets the maximum number of gapped alignments per subject sequence. gspmax is bounded by hspmax. A value of 0 implies no limit.

See also

hspmax

H=[number]

Default: Variable; depends on scoring parameters

Sets the value of the Karlin-Altschul parameter H.

See also

gapH, K, gapK, L, gapL

hspmax=[integer]

Default: 1000

Sets the maximum number of ungapped alignments per subject sequence. A warning is issued if this limit is exceeded. A value of 0 implies no limit.

See also

gspmax

hitdist=[integer]

Default: 0, off

Maximum distance between word hits for the two-hit seeding algorithm. WU-BLAST uses one-hit seeding by default.

hspsepqmax=[integer]

Default: Unlimited

Maximum separation allowed between alignments along the query.

hspsepsmax=[integer]

Default: Unlimited

Maximum separation allowed between alignments along the subject.

K=[number]

Default: Variable; depends on scoring parameters

Sets the value for K from the Karlin-Altschul equation.

See also

gapK, H, gapH, L, gapL

kap

Default: Off

Assesses individual alignment scores with Karlin-Altschul statistics rather than using sum statistics on groups of alignments.

L=[number]

Default: Variable; depends on scoring parameters

Sets lambda (nats per unit score) from the Karlin-Altschul equation.

See also

gapL, H, gapH, K, gapK

lcfilter

Default: Off

Filters lowercase letters in the query sequence. The lowercase letters are treated as if they had been filtered out by one of the filtering programs.

See also

echofilter, filter, wordmask, lcmask
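The effect of lcfilter on a query can be sketched in a few lines of Python. This assumes that lowercase letters are replaced in the same way the filter entry describes for the filtering programs (X for proteins, N for nucleotides); the function below is illustrative and is not WU-BLAST code.

```python
def lcfilter(query, protein=False):
    """Replace every lowercase letter with the masking letter."""
    mask = "X" if protein else "N"
    return "".join(mask if ch.islower() else ch for ch in query)


print(lcfilter("ACGTacgtACGT"))
print(lcfilter("MKVllaGT", protein=True))
```

With lcmask, by contrast, the lowercase letters would be skipped only during seeding and would still take part in the extension stage.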

lcmask

Default: Off

Masks lowercase letters in the query sequence for seeding only. Lowercase letters in the query sequence aren't used in the initial word search but are available for alignment during the extension stage. This is known as soft masking.

See also

echofilter, filter, wordmask, lcfilter

links

Default: Off

Displays group information. Parentheses mark the position of the current alignment within its group. In the following example, the group contains three alignments; the alignment shown, which scores 159, is the second one reported and the last one in the chain.

Score = 159 (61.0 bits), Sum P(3) = 2.7e-38

Identities = 26/39 (66%), Positives = 32/39 (82%)

Links = 1-3-(2)

See also

topcomboN

M=[integer]

Default: +5 blastn

Sets the match score. This parameter is usually used for blastn only but may be used for other programs.

See also

N

maskextra=[integer]

Default: Off

Extends masking an extra distance of [integer] letters.

See also

echofilter, filter, wordmask, lcfilter, lcmask

matrix=[file]

Default: BLOSUM62

Programs: blastp, blastx, tblastn, tblastx

Specifies a scoring matrix file. The default is BLOSUM62. A large number of scoring matrices are distributed with WU-BLAST in the matrix/aa directory. Nucleotide matrices for use with blastn are in matrix/nt.

N=[integer]

Default: -4 blastn

Sets the mismatch score. This parameter is usually used for blastn only but may be used for other programs.

See also

M

nogap

Default: Off

Turns off gapped alignment. This parameter is useful in conjunction with altscore to prevent alignments from containing stop codons.

See also

altscore

nonnegok

Default: Off

Under Karlin-Altschul statistics, the expected score must be negative. WU-BLAST normally exits with a fatal error if this isn't the case. Sometimes scoring schemes with positive expected scores are useful, and setting nonnegok silences the error condition.

See also

novalidctxok, errors

nosegs

Default: Off

WU-BLAST doesn't allow alignments to cross hyphen characters that act as query segment boundaries (e.g., for draft sequence). nosegs effectively converts hyphens to Ns.

notes

Default: Off

Suppresses informational messages. For example, if you are intentionally searching for a low-complexity sequence, you may wish to disable the message that suggests that a low-complexity filter would help remove meaningless alignments.

See also

errors, warnings

novalidctxok

Default: Off

If a sequence can't generate any significant HSPs, WU-BLAST normally exits with an error that says there are no valid contexts. You may encounter such an error when searching with a collection of sequencing reads, some of which are mostly (or completely) Ns. Setting novalidctxok allows you to continue without error.

See also

nonnegok, errors

nwlen=[integer]

Default: End of sequence

Sets the length of region for seeding.

See also

nwstart

nwstart=[integer]

Default: 1

Sets the starting position for seeding alignments. nwstart and nwlen indicate that a specific region of the query should be seeded. Alignments may extend outside of this region. For example, nwstart=500 nwlen=200 seeds positions 500 to 700 of the query sequence.

See also

nwlen

o=[file]

Default: stdout

Write results to this file instead of to stdout (the screen).

olf=[number]

Default: 0.125

Maximum fractional length of overlap for alignment consistency.

Consistent alignments must be ordered and have minimal overlap (see Chapter 5). The amount of permitted overlap is expressed as both a relative fraction and an absolute number. For example, a setting of 0.1 prevents alignments whose overlap length is more than 10 percent of the length of either alignment from being in the same group. The golf parameter plays the same role for gapped alignments. The olmax and golmax parameters control the absolute length of the overlap.
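The overlap rule can be written out as a short Python sketch. It assumes alignments are represented as inclusive (start, end) coordinate pairs on one sequence, and it applies the fractional limit to the shorter alignment, which is equivalent to testing against either alignment. The function name and the olf=0.1 value in the signature are chosen for illustration only.

```python
def overlap_consistent(a, b, olf=0.1, olmax=None):
    """Return True if intervals a and b overlap little enough to be grouped.

    a and b are (start, end) pairs, inclusive, on the same sequence.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if overlap <= 0:
        return True                      # the intervals don't overlap at all
    if olmax is not None and overlap > olmax:
        return False                     # absolute limit exceeded
    shorter = min(a[1] - a[0], b[1] - b[0]) + 1
    return overlap <= olf * shorter      # fractional limit


print(overlap_consistent((1, 100), (95, 200)))   # 6 letters of overlap
print(overlap_consistent((1, 100), (80, 200)))   # 21 letters of overlap
```

In a real search the same test would have to hold on both the query and the subject coordinates.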

olmax=[integer]

Default: Unlimited

Maximum absolute length of overlap for alignment consistency. See the description for olf.

postsw

Default: Off

Programs: blastp

Performs Smith-Waterman alignment after initial BLAST alignment to return the single maximum-scoring pair rather than several high-scoring pairs.

Q=[integer]

Default: 10 blastn, 9 blastp, blastx, tblastn, tblastx

Sets the cost for the first gap character.

See also

R

qoffset=[integer]

Default: 0

Adjusts the query numbering by this amount. For example, if your query sequence is known to have vector sequence in its first 25 bases, you can set this parameter to 25 so that the numbering is based on the insert sequence.

qrecmax=[integer]

Default: 1

Last query sequence to search. See the description for qrecmin.

qrecmin=[integer]

Default: 1

By default, WU-BLAST produces one BLAST report for each query sequence in a FASTA file with multiple sequences. Setting qrecmin and qrecmax allows you to select a subset of query sequences in much the same way as dbrecmin and dbrecmax select database records.

See also

qrecmax, dbrecmin, dbrecmax

R=[integer]

Default: 10 blastn, 2 blastp, blastx, tblastn, tblastx

Sets the cost for the second and remaining gap characters.

See also

Q
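Taken together, Q and R describe an affine gap cost: the first gap character costs Q and each additional one costs R. A minimal Python sketch, using the protein defaults listed above as the default arguments:

```python
def gap_cost(length, Q=9, R=2):
    """Cost of a single gap of `length` letters."""
    if length < 1:
        return 0                     # no gap, no cost
    return Q + R * (length - 1)      # Q for the first letter, R for the rest


for n in (1, 2, 3):
    print(n, gap_cost(n))
```

With the blastn defaults of Q=10 and R=10, the cost is simply 10 per gap character.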

restest

Default: Off

blastp and blastx statistical tests are based on the number of residues (letters) in the database. If Z is set in conjunction with restest, blastn, tblastn, and tblastx will also be based on the number of letters.

See also

seqtest, Z

S=[integer]

Default: Variable; calculated from E

Sets the final score threshold. Since S and E are interconvertible through the Karlin-Altschul equation, setting S effectively sets E, and vice versa. When both are set, the more restrictive one is used.

See also

E
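The relationship between S and E can be sketched with the basic Karlin-Altschul equation, E = K * m * n * exp(-lambda * S), where m and n are the sizes of the query and the database. The Python below is only a sketch: the numeric values are illustrative, and a real search applies further adjustments (for example, sum statistics) that are not modeled here.

```python
import math


def e_from_score(S, K, lam, m, n):
    """Expected number of chance alignments scoring at least S."""
    return K * m * n * math.exp(-lam * S)


def score_from_e(E, K, lam, m, n):
    """Real-valued score at which the expectation equals E."""
    return math.log(K * m * n / E) / lam


# Illustrative parameter values only.
K, lam = 0.134, 0.318
m, n = 300, 100_000_000
S = score_from_e(10, K, lam, m, n)
print(round(S, 1), e_from_score(S, K, lam, m, n))
```

Converting E to S and back returns the original E, which is why setting one of the two effectively sets the other.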

S2=[integer]

Default: Variable; depends on scoring parameters

Score threshold for individual ungapped alignments. If both S2 and E2 are set, the more restrictive one is used.

See also

E2, gapS2, gapE2

seqtest

blastn, tblastn, and tblastx statistical tests are based on the number of sequences in the database. If Z is set in conjunction with seqtest, blastp and blastx will also be based on the number of sequences.

See also

restest, Z

span, span1, span2

Default: span2

WU-BLAST normally discards HSPs that are contained completely within a larger, higher-scoring HSP. This behavior is called span2. If span1 is set, an alignment is discarded if it is contained within another on either the query or the subject (unlike span2, both conditions aren't required). This is useful if the sequences contain many repeats. To keep all alignments, set span, but be aware that the output may become very large.

T=[integer]

Default: 11 blastp, 12 blastx, 13 tblastn, 13 tblastx

Sets the neighborhood word threshold score. Setting this value extremely high removes neighborhood words and makes seeding require matching words. T, W, and hitdist are the most effective parameters for controlling the sensitivity and speed of BLAST searches.

See also

W, hitdist
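The idea of a neighborhood word threshold can be illustrated with a brute-force Python sketch: enumerate every word of the same length and keep those that score at least T against the query word. To keep the example small it uses a four-letter alphabet with simple match/mismatch scores as a stand-in for a real scoring matrix, and the exhaustive enumeration is for clarity only; it is not how BLAST builds its word lists.

```python
from itertools import product


def neighborhood(word, T, alphabet="ACGT", match=5, mismatch=-4):
    """Return every word of the same length scoring at least T against `word`."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        score = sum(match if a == b else mismatch for a, b in zip(word, cand))
        if score >= T:
            hits.append("".join(cand))
    return hits


print(len(neighborhood("ACG", 6)))    # exact word plus one-mismatch neighbors
print(neighborhood("ACG", 15))        # threshold equal to the exact-match score
```

Raising T shrinks the neighborhood until only the exact word remains, which is the speed-for-sensitivity trade described above.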

top

Default: Off

Programs: blastn, tblastx, blastx

Searches only the top strand of the query.

See also

bottom

topcomboN=[integer]

Default: Off

Limits the report to the specified number of consistent, or collinear, HSP combinations (alignment groups).

V=[integer]

Default: 500

Controls the number of one-line summaries.

See also

B

warnings

Default: Off

WU-BLAST reports various warning conditions. This parameter turns them off.

See also

notes, errors

wink=[integer]

Default: 1

Words are created by sliding a window of width W by wink letters at a time. If W equals wink, words don't overlap.

See also

W, T, hitdist
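A short Python sketch shows how W and wink interact when words are generated from a sequence. The function name is illustrative.

```python
def query_words(seq, W, wink=1):
    """Return the words of width W taken every `wink` letters along seq."""
    return [seq[i:i + W] for i in range(0, len(seq) - W + 1, wink)]


print(query_words("ACGTAC", 3))           # wink=1: overlapping words
print(query_words("ACGTAC", 3, wink=3))   # wink equals W: no overlap
```

Larger values of wink produce fewer words, which should reduce the number of word hits to process.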

wordmask=[method]

Default: Off

Applies the named filtering method to the query sequence for seeding only. Masked regions of the query aren't used in the initial word search but are available for alignment during the extension stage, an approach called soft masking.

See also

filter, lcfilter, lcmask, echofilter, maskextra

W=[integer]

Default: 11

Sets the word size for seeding alignments.

See also

T, hitdist, wink

X=[integer]

Default: Variable; depends on scoring parameters

Controls the alignment extension cutoff for ungapped alignments.

See also

gapX

Y=[number]

Default: Variable; depends on scoring parameters

Sets the size of the query sequence.

See also

Z

Z=[number]

Default: Variable; depends on scoring parameters

Sets the size of the database in letters (restest is assumed), but Z may also be used to mean the number of sequences if seqtest is set.

See also

Y, seqtest, restest

14.4 xdformat Parameters

xdformat formats BLAST databases from FASTA files. It also reports descriptive information about the database and dumps the entire content to FASTA format.

Here are some examples:

xdformat -n files

xdformat -p files

zcat fasta.*.gz | xdformat -o my_db -n -- -

xdformat -n -i database

xdformat -n -r database > fasta_file

-A [0..2]

Default: 2

When indexing accession.version identifiers, you have three indexing options:

0

Accession only; version isn't stored

1

Stored as accession.version

2

Stored as both accession only and accession.version

-a [database]

Appends sequences to the named database. If the database is indexed, the appended sequences will also be indexed.

-c [character]

Default: Off

If an invalid letter is encountered, xdformat terminates and reports an error message. If this occurs, check the sequence file for errors. After checking, you may either skip illegal characters with -k or change them to a legal character with -c. The typical operation for nucleotides is to set -c N, and for proteins -c X.

See also

-k

-D [integer]

Default: Unlimited

Sets the maximum length for definition lines.

-d [string]

Default: None

Sets a user-defined release date for the database. The date may have 63 characters at most.

See also

-v

-e [file]

Default: stderr

Appends information and errors to the named file.

-G

Default: Off

Prefaces each sequence with the database record number in the format of gnl|xdf|#.

-i

Default: Off

Reports descriptive information about a BLAST database. This is useful for determining when a database was created, how many sequences it contains, and if it is indexed.

-K [integer]

Default: Unlimited

Sets the maximum number of identifiers with Control-A separators. This is useful for trimming highly redundant sequences created with nrdb or another redundancy purifier that uses Control-A separators.

-k

Default: Off

If an invalid letter is encountered, xdformat terminates. If this occurs, you can either skip illegal characters with -k or change them to a legal letter with -c. Check the errors to ensure the input file is formatted properly.

See also

-c

-L [number]

Default: 100000000 (100 million letters)

Sets the maximum sequence length. For optimal performance, break up large sequences into smaller fragments no larger than 1 million letters.

-l [number]

Default: 0

Sets the minimum sequence length.

-M [number]

Default: 96m

Sets the cache size for indexing. For faster indexing, the size may be increased (for example, -M 512m).

-O [4..8]

Default: 4

Sets the number of bytes of precision. The default value allows databases of up to 4 billion amino acids or 16 billion nucleotides. If you expect a database to contain more than this limit, increasing precision by one level multiplies the limit by 256. Setting -O is necessary only if you append to the database because the precision automatically increases appropriately when databases are created.
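The arithmetic behind these limits can be checked with a few lines of Python. The sketch assumes that the limit is 256 raised to the number of bytes, and that nucleotides are stored four to a byte; the second assumption is inferred from the fourfold difference quoted above rather than stated in the text.

```python
def max_letters(n_bytes, letters_per_byte=1):
    """Largest database size addressable with an n_bytes-wide value."""
    return (256 ** n_bytes) * letters_per_byte


print(max_letters(4))        # amino acids at the default precision
print(max_letters(4, 4))     # nucleotides at the default precision
print(max_letters(5) // max_letters(4))   # gain from one more level
```

The first two results are roughly 4.3 billion and 17 billion, which the text rounds to 4 and 16 billion.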

-P [integer]

Default: 60

This option applies only when dumping the entire content of a database with -r. -P controls the length of the sequence lines; -P 0 puts the whole sequence on one line.

See also

-r

-q [0..3]

Default: 0

Certain files may contain numerous nonfatal errors in their identifier format. -q quiets these errors.

0

No silencing

1

Silences field 1 errors

2

Silences field 2 errors

3

Silences all fields

-r

Default: Off

Reports (dumps) the entire database content to stdout in FASTA format.

-T [string]

Default: Off

This option lets you restrict indexing of identifiers to a particular database name or tag. The [string] has two parts: part 1 is the name of the database (e.g., gb for GenBank or emb for EMBL—see Chapter 10), and part 2 is either blank or a number.

blank

Index all identifiers.

0

Don't index.

1

Index only field 1.

2

Index only field 2.

Here are some examples:

-T emb0 doesn't index EMBL records.

-T gb1 indexes GenBank accession but not locus.

-T gb2 indexes GenBank locus but not accession.

-T gb indexes both accession and locus of GenBank records.
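The structure of a -T value can be sketched as a small Python parser that splits the database tag from the optional digit. The function and the wording of the returned rules are illustrative only, and the sketch assumes that tags consist of letters.

```python
import re


def parse_tag_option(value):
    """Split a -T value such as "gb1" into (database_tag, indexing_rule)."""
    m = re.fullmatch(r"([A-Za-z]+)([012]?)", value)
    if m is None:
        raise ValueError(f"unrecognized -T value: {value!r}")
    tag, digit = m.groups()
    rules = {
        "": "all identifiers",   # blank: index everything for this tag
        "0": "none",             # don't index this tag
        "1": "field 1 only",
        "2": "field 2 only",
    }
    return tag, rules[digit]


for value in ("emb0", "gb1", "gb2", "gb"):
    print(value, parse_tag_option(value))
```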

-v

Default: Off

Sets a user-defined version string for the database (a maximum of 63 characters).

See also

-d

-X

Default: Off

Databases that are formatted but not indexed may be indexed or re-indexed (e.g., with a different indexing scheme) with -X. In the following examples, the two commands on the first line are equivalent to the single command on the second.

xdformat -n nt_db ; xdformat -n -X nt_db

xdformat -n -I nt_db

14.5 xdget Parameters

xdget retrieves sequences in FASTA format from databases formatted with xdformat (not formatdb, pressdb, or setdb). The database must have been indexed prior to using xdget (see -I and -X in Section 14.4).

Here are a few example command lines. If identifiers contain vertical bars, as in the second example, you have to enclose the string in quotes to prevent the shell from interpreting them as pipes. This isn't required for identifier files.

xdget -n db 12345

xdget -p nr 'gi|11611819|gb|AAG39070.1|'

xdget -n -f db files_of_ids

-A [n, 0]

Default: n

Given an accession number without a version, xdget retrieves the latest version number. This parameter is set explicitly with -A n. If -A 0 is set, the earliest version number is retrieved.

See also

-d, -N

-a [integer]

Default: 1

The -a and -b parameters retrieve a subsequence. For example, if you want to retrieve just nucleotides 1 to 100, include -a 1 -b 100. For nucleotide sequences, if -a is greater than -b, the sequence is returned as its reverse complement.

See also

-b, -r, -t

-b [integer]

Default: 0, end of sequence

See -a above.

-d

Default: Off

Ordinarily, when duplicate identifiers are present, only one is retrieved. With -d, all duplicates are reported. Having duplicate identifiers is generally not a good idea.

See also

-A, -N

-D [integer]

Default: Unlimited

Sets the maximum definition line length. Using definition lines to store arbitrary sequence data is common. This option is useful when you don't need the whole definition line.

-e [file]

Default: stderr

Appends messages and errors to log file.

-F

Default: Off

Flushes the output stream after each request. This is useful for preventing I/O deadlocks between communicating processes.

-f

Default: Off

Indicates that files of identifiers are given on the command line. The file format is one identifier per line.

-G

Default: Off

Prefaces each definition line with its record number using the gnl namespace. The format is gnl|xdf|#.

-o [file]

Default: stdout

Writes FASTA output to the named file rather than to stdout.

-N [0, n]

Default: 0

For sequences with duplicate identifiers, the first one is retrieved by default. It is set explicitly with -N 0. Setting -N n retrieves the last one. Accession numbers with version numbers have different rules.

See also

-A, -d

-P [integer]

Default: 60

Sets the maximum line length for sequence data. Setting -P 0 puts the entire sequence on one line.
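The effect of -P on the output can be mimicked with a short Python sketch that wraps sequence data at a fixed width. The function is illustrative and only reproduces the line-wrapping behavior described above.

```python
def format_fasta(defline, seq, P=60):
    """Return a FASTA record with sequence lines of at most P letters."""
    if P == 0:
        lines = [seq]                    # whole sequence on one line
    else:
        lines = [seq[i:i + P] for i in range(0, len(seq), P)]
    return "\n".join([">" + defline] + lines) + "\n"


print(format_fasta("example", "ACGTACGTAC", P=4), end="")
print(format_fasta("example", "ACGTACGTAC", P=0), end="")
```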

-r

Default: Off

Returns the reverse complement for nucleotide sequences.

-T [string]

Default: Off

This option lets you restrict the lookup of identifiers to a particular database name or tag. For example, to look only in GenBank sequences, use -T gb. For only local, use -T lcl. For tags with multiple identifiers, a numeric suffix identifies which one to select. For example, -T gb1 selects accessions and -T gb2 selects loci. To prevent lookups in a database name, use zero. For example, -T gb0 omits GenBank records.

-t

Default: Off

Translates the nucleotide sequence into protein.
