My multipurpose sequence aligner tool of choice for many years has been blat. This is a short post on the basics of blat. To use blat, download the 64bit Linux version of blat (or a version that matches your operating system) here. When aligning sequences to the genome, make sure you use the 64 bit version or else you will receive an out of memory error (needHugeMen: Out of huge memory - request size).
To get started below is a slide I made couple of years ago:
First blat splits the reference sequence up into "tiles". The manner in which it is split depends on two parameters, -tileSize and -stepSize, where -tileSize is the size of the tile and -stepSize specifies when to start the next tile. The default setting of both for DNA sequences is 11, which also means the tiles do not overlap.
For blat to report an alignment, your query sequence must match at least two tiles (set via -minMatch) with no mismatches (you can allow up to one mismatch in the tile by using -oneOff). So if you're trying to align a 21 bp sequence to a reference using the default setting, blat will never report an alignment.
To illustrate, imagine this reference sequence (44bp):
>database
AAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGTTTTTTTTTTT
AAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGTTTTTTTTTTT
and this query sequence (12bp)
>test
GGGGGGGGGGGT:
GGGGGGGGGGGT:
The only way an alignment will be reported is if the tileSize is set to 1 and the minScore set to less than 12.
1
2
3
4
| #returns no hitblat -minScore=0 -stepSize=2 database.fa test.fa output.psl#returns 2 hitsblat -minScore=0 -stepSize=1 database.fa test.fa output.psl |
RUNNING BLAT IN PARALLEL
Before running blat in parallel, check how many processors are on your machine (32 are available in this example):
1
2
| cat /proc/cpuinfo | grep "^processor" | wc 32 96 470 |
To use blat in parallel, download GNU parallel and compile it. Download the hg19 chromosomes. Let's align a bunch of long non coding RNAs (lncRNA) to hg19. Download the human lncRNA dataset from NONCODE and unzip it. There should be 33,829 sequences in the human.fa file.
1
2
3
4
| unzip lncRNA_human.zipcd humancat human.fa | grep "^>" | wc 33829 692926 4407633 |
Now untar and gunzip the hg19 file. I'm only interested in assembled chromosomes, so I will delete the rest.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
| tar -xzf chromFa.tar.gz#ls to check whether I'm deleting the right filesls *random.fachr11_gl000202_random.fa chr17_gl000206_random.fa chr1_gl000191_random.fa chr4_gl000194_random.fa chr9_gl000198_random.fachr17_gl000203_random.fa chr18_gl000207_random.fa chr1_gl000192_random.fa chr7_gl000195_random.fa chr9_gl000199_random.fachr17_gl000204_random.fa chr19_gl000208_random.fa chr21_gl000210_random.fa chr8_gl000196_random.fa chr9_gl000200_random.fachr17_gl000205_random.fa chr19_gl000209_random.fa chr4_gl000193_random.fa chr8_gl000197_random.fa chr9_gl000201_random.farm *random.fals chrUn*.fa#ls to check againchrUn_gl000211.fa chrUn_gl000216.fa chrUn_gl000221.fa chrUn_gl000226.fa chrUn_gl000231.fa chrUn_gl000236.fa chrUn_gl000241.fa chrUn_gl000246.fachrUn_gl000212.fa chrUn_gl000217.fa chrUn_gl000222.fa chrUn_gl000227.fa chrUn_gl000232.fa chrUn_gl000237.fa chrUn_gl000242.fa chrUn_gl000247.fachrUn_gl000213.fa chrUn_gl000218.fa chrUn_gl000223.fa chrUn_gl000228.fa chrUn_gl000233.fa chrUn_gl000238.fa chrUn_gl000243.fa chrUn_gl000248.fachrUn_gl000214.fa chrUn_gl000219.fa chrUn_gl000224.fa chrUn_gl000229.fa chrUn_gl000234.fa chrUn_gl000239.fa chrUn_gl000244.fa chrUn_gl000249.fachrUn_gl000215.fa chrUn_gl000220.fa chrUn_gl000225.fa chrUn_gl000230.fa chrUn_gl000235.fa chrUn_gl000240.fa chrUn_gl000245.farm chrUn*.fals *hap*.fachr17_ctg5_hap1.fa chr6_apd_hap1.fa chr6_dbb_hap3.fa chr6_mcf_hap5.fa chr6_ssto_hap7.fachr4_ctg9_hap1.fa chr6_cox_hap2.fa chr6_mann_hap4.fa chr6_qbl_hap6.farm *hap*.fa#22 somatic chromosomes + chrY, chrM and chrXls *.fa | wc 25 25 213 |
Create a test fasta file containing the sequence for n37. I checked the human.bed file for n37 (since NONCODE also provides you with the mapping location as a bed file) to see that it maps to chr11.
1
2
3
4
5
6
7
8
9
10
11
12
13
| cat human/human.fa | grep -v "^#" | head -2 > human/test.facat human/human.bed | awk '$4 == "n37" {print}'chr11 2161746 2169894 n37 992 + 2161746 2169894 0 7 431,181,100,293,38,682,331, 0,5728,5911,6801,7095,7134,7817,#run a test blat job using default parametersblat chr11.fa human/test.fa test.pslcat test.psl#results are the samepsLayout version 3match mis- rep. N's Q gap Q gap T gap T gap strand Q Q Q Q T T T T block blockSizes qStarts tStarts match match count bases count bases name size start end name size start end count---------------------------------------------------------------------------------------------------------------------------------------------------------------2046 10 0 0 0 0 6 6092 + n37 2056 0 2056 chr11 135006516 2161746 2169894 7 431,181,100,293,38,682,331, 0,431,612,712,1005,1043,1725, 2161746,2167474,2167657,2168547,2168841,2168880,2169563, |
Now to do this in parallel and convert results from psl to bed. I wrote a script calledpsl_to_bed_best_score.pl, a while ago and we can use it to parse the psl files.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
| parallel blat chr{}.fa human/test.fa test_{}.psl ::: {1..22} X Y Mls *.psltest_1.psl test_11.psl test_13.psl test_15.psl test_17.psl test_19.psl test_20.psl test_22.psl test_4.psl test_6.psl test_8.psl test_M.psl test_Y.psltest_10.psl test_12.psl test_14.psl test_16.psl test_18.psl test_2.psl test_21.psl test_3.psl test_5.psl test_7.psl test_9.psl test_X.pslcat *.psl > all.pslpsl_to_bed_best_score.pl all.psl all.bedcat all.bedchr11 2161746 2169894 n37 2030 + 2161746 2169894 255,0,0 7 431,181,100,293,38,682,331, 0,5728,5911,6801,7095,7134,7817#almost identical to the bed file provided from NONCODE apart from the score columncat human/human.bed | awk '$4 == "n37" {print}'chr11 2161746 2169894 n37 992 + 2161746 2169894 0 7 431,181,100,293,38,682,331, 0,5728,5911,6801,7095,7134,7817,#try it on all lncRNA but first remove the comments from the human.fa filecat human/human.fa | grep -v "^#" > tmpmv tmp human/human.fatime parallel blat chr{}.fa human/human.fa test_{}.psl ::: {1..22} X Y M#well that took too long!real 16996m51.851suser 301319m54.423ssys 50m14.070s |
BLAT PARAMETERS
For convenience.
blat - Standalone BLAT v. 34 fast sequence search command line tool
usage:
blat database query [-ooc=11.ooc] output.psl
where:
database and query are each either a .fa , .nib or .2bit file,
or a list these files one file name per line.
-ooc=11.ooc tells the program to load over-occurring 11-mers from
and external file. This will increase the speed
by a factor of 40 in many cases, but is not required
output.psl is where to put the output.
Subranges of nib and .2bit files may specified using the syntax:
/path/file.nib:seqid:start-end
or
/path/file.2bit:seqid:start-end
or
/path/file.nib:start-end
With the second form, a sequence id of file:start-end will be used.
options:
-t=type Database type. Type is one of:
dna - DNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
The default is dna
-q=type Query type. Type is one of:
dna - DNA sequence
rna - RNA sequence
prot - protein sequence
dnax - DNA sequence translated in six frames to protein
rnax - DNA sequence translated in three frames to protein
The default is dna
-prot Synonymous with -t=prot -q=prot
-ooc=N.ooc Use overused tile file N.ooc. N should correspond to
the tileSize
-tileSize=N sets the size of match that triggers an alignment.
Usually between 8 and 12
Default is 11 for DNA and 5 for protein.
-stepSize=N spacing between tiles. Default is tileSize.
-oneOff=N If set to 1 this allows one mismatch in tile and still
triggers an alignments. Default is 0.
-minMatch=N sets the number of tile matches. Usually set from 2 to 4
Default is 2 for nucleotide, 1 for protein.
-minScore=N sets minimum score. This is the matches minus the
mismatches minus some sort of gap penalty. Default is 30
-minIdentity=N Sets minimum sequence identity (in percent). Default is
90 for nucleotide searches, 25 for protein or translated
protein searches.
-maxGap=N sets the size of maximum gap between tiles in a clump. Usually
set from 0 to 3. Default is 2. Only relevent for minMatch > 1.
-noHead suppress .psl header (so it's just a tab-separated file)
-makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
-repMatch=N sets the number of repetitions of a tile allowed before
it is marked as overused. Typically this is 256 for tileSize
12, 1024 for tile size 11, 4096 for tile size 10.
Default is 1024. Typically only comes into play with makeOoc.
Also affected by stepSize. When stepSize is halved repMatch is
doubled to compensate.
-mask=type Mask out repeats. Alignments won't be started in masked region
but may extend through it in nucleotide searches. Masked areas
are ignored entirely in protein or translated searches. Types are
lower - mask out lower cased sequence
upper - mask out upper cased sequence
out - mask according to database.out RepeatMasker .out file
file.out - mask database according to RepeatMasker file.out
-qMask=type Mask out repeats in query sequence. Similar to -mask above but
for query rather than target sequence.
-repeats=type Type is same as mask types above. Repeat bases will not be
masked in any way, but matches in repeat areas will be reported
separately from matches in other areas in the psl output.
-minRepDivergence=NN - minimum percent divergence of repeats to allow
them to be unmasked. Default is 15. Only relevant for
masking using RepeatMasker .out files.
-dots=N Output dot every N sequences to show program's progress
-trimT Trim leading poly-T
-noTrimA Don't trim trailing poly-A
-trimHardA Remove poly-A tail from qSize as well as alignments in
psl output
-fastMap Run for fast DNA/DNA remapping - not allowing introns,
requiring high %ID
-out=type Controls output file format. Type is one of:
psl - Default. Tab separated format, no sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8- NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
-fine For high quality mRNAs look harder for small initial and
terminal exons. Not recommended for ESTs
-maxIntron=N Sets maximum intron size. Default is 750000
-extendThroughN - Allows extension of alignment through large blocks of N's
Really very happy to say, your post is very interesting to read. I never stop myself to say something about it. You’re doing a great job. Keep it up
ReplyDeleteCollect here some real minecraft redeem code free 2019
Wang Zheng Yuan: Using Blat >>>>> Download Now
ReplyDelete>>>>> Download Full
Wang Zheng Yuan: Using Blat >>>>> Download LINK
>>>>> Download Now
Wang Zheng Yuan: Using Blat >>>>> Download Full
>>>>> Download LINK Eh