BLASTClust is a program within the standalone BLAST package used to cluster either protein or nucleotide sequences. The program begins with pairwise matches and places a sequence in a cluster if the sequence matches at least one sequence already in the cluster. In the case of proteins, the blastp algorithm is used to compute the pairwise matches; in the case of nucleotide sequences, the Megablast algorithm is used.
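The grouping rule described above (a sequence joins a cluster as soon as it matches any member) can be illustrated with a short Python sketch. This is not BLASTClust's own code; the function name single_linkage and the sample identifiers are hypothetical, and the pairwise matches are assumed to have been computed already.

```python
from collections import defaultdict


def single_linkage(ids, matches):
    """Group ids so that ids connected by a chain of pairwise matches share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    # Report the largest cluster first.
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical example: seqA matches seqB, and seqB matches seqC.
ids = ["seqA", "seqB", "seqC", "seqD"]
matches = [("seqA", "seqB"), ("seqB", "seqC")]
print(single_linkage(ids, matches))
```

Note that seqA and seqC end up in the same cluster even though they were never matched directly; one shared neighbor is enough under this rule.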
In the simplest case, BLASTClust takes as input a file containing concatenated FASTA-format sequences, each with a unique identifier at the start of the definition line. BLASTClust formats the input sequences to produce a temporary BLAST database, performs the clustering, and removes the database at completion. Hence, there is no need to run formatdb in advance to use BLASTClust. The output of BLASTClust consists of a file, one cluster to a line, of sequence identifiers separated by spaces. The clusters are sorted from the largest cluster to the smallest.
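Because the output is one cluster per line with space-separated identifiers, it is easy to read back into a script. The following Python sketch assumes exactly that layout; the function name parse_clusters and the sample text are hypothetical.

```python
def parse_clusters(text):
    """Return a list of clusters, each a list of sequence identifiers."""
    # Each non-empty line is one cluster; identifiers are separated by whitespace.
    return [line.split() for line in text.splitlines() if line.strip()]


# Hypothetical output text, largest cluster first.
sample = "read7 read2 read9\nread4 read1\nread3\n"
clusters = parse_clusters(sample)
sizes = [len(c) for c in clusters]
print(clusters)
print(sizes)
```

For a real run, the text would be read from the file named with -o, for example with open("outfile").read().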
BLASTClust accepts a number of parameters that can be used to control the stringency of clustering, including thresholds for score density, percent identity, and alignment length. The BLASTClust program has a number of applications, the simplest of which is to create a non-redundant set of sequences from a source database. As an example, one might have a library of a few thousand short nucleotide sequence reads and wish to replace these with a non-redundant set. To produce the non-redundant set, one might use:
blastclust -i infile -o outfile -p F -L .9 -b T -S 95
The sequences in "infile" will be clustered and the results will be written to "outfile". The input sequences are identified as nucleotide (-p F); "-p T", or protein, is the default. To register a pairwise match, two sequences will need to be 95% identical (-S 95) over an area covering 90% of the length (-L .9) of each sequence (-b T). Using "-b F" instead of "-b T" would enforce the alignment length threshold on only one member of a sequence pair. The parameter "S", used here to specify the percent identity, can also be used to specify, instead, a "score density." The latter is equivalent to the BLAST score divided by the alignment length. If "S" is given as a number between 0 and 3, it is interpreted as a score density threshold; otherwise it is interpreted as a percent identity threshold.
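The way -S, -L, and -b combine into a single pairwise test can be sketched in Python. This is an illustration of the rules stated above, not BLASTClust's implementation: the function name is_match and its arguments are hypothetical, the comparisons are assumed to be "greater than or equal", values of S strictly between 0 and 3 are assumed to mean score density, and one alignment length is assumed to apply to both sequences (gaps are ignored).

```python
def is_match(identity_pct, score, aln_len, len_a, len_b, S=95.0, L=0.9, both=True):
    """Decide whether one alignment registers as a pairwise match (simplified)."""
    # Fraction of each sequence covered by the alignment.
    cov_a = aln_len / len_a
    cov_b = aln_len / len_b
    if both:
        # "-b T": the length threshold must hold for both sequences.
        covered = cov_a >= L and cov_b >= L
    else:
        # "-b F": it is enough for one member of the pair to meet it.
        covered = cov_a >= L or cov_b >= L

    if 0 < S < 3:
        # Small S values are read as a score density (score / alignment length).
        similar = score / aln_len >= S
    else:
        # Otherwise S is read as a percent identity.
        similar = identity_pct >= S
    return covered and similar


# Hypothetical alignment: 96% identity, score 180, 95 aligned positions.
print(is_match(96, 180, 95, 100, 100))              # both sequences 100 long
print(is_match(96, 180, 95, 100, 200))              # second sequence only 47.5% covered
print(is_match(96, 180, 95, 100, 200, both=False))  # coverage checked on one member
print(is_match(90, 180, 95, 100, 100, S=1.75))      # S read as score density
```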
To create a stringent non-redundant protein sequence set, use the following command line:
blastclust -i infile -o outfile -p T -L 1 -b T -S 100
In this case, only sequences which are identical will be clustered together. The "blastclust.txt" file in the standalone BLAST package details the full range of BLASTClust parameters.
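BLASTClust reports clusters of identifiers rather than a reduced FASTA file, so building the non-redundant set itself takes one more step. The Python sketch below keeps one sequence per cluster. It rests on two assumptions not stated above: that the first identifier on each output line is an acceptable representative, and that the identifiers in the output equal the first whitespace-delimited token of each definition line. The function names first_members and filter_fasta are hypothetical.

```python
def first_members(cluster_text):
    """Collect the first identifier from each cluster line."""
    return {line.split()[0] for line in cluster_text.splitlines() if line.strip()}


def filter_fasta(fasta_text, keep_ids):
    """Keep only the FASTA records whose identifier is in keep_ids."""
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The identifier is taken as the first token after '>'.
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines)


# Hypothetical input and cluster output.
fasta = ">s1 first read\nACGT\n>s2\nACGA\n>s3\nTTTT\n"
cluster_output = "s1 s2\ns3\n"
nonredundant = filter_fasta(fasta, first_members(cluster_output))
print(nonredundant)
```

Other choices of representative, such as the longest sequence in each cluster, would only require changing first_members.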