Monday, August 24, 2015

Retrieve gene location info from UCSC hg18/19 using MySQL script

I have a list of gene symbols.

APOB, TTC39B, ATF3, RGS1, LIPG,
I am trying to get the genomic coordinates (as in bp) with +/-5KB position of these genes via UCSC TableBrowser / MySQL server using NCBI 36/hg18 build. I have tried to get this information via TableBrowser, but I can't find the chromStart and chromEnd field in the given table. Am I missing something ?
Also, please share your favorite tutorial / docs that explain the schema/tables in UCSC MySQL server.
Thanks in advance.


 Answer:


from the table browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start , select group= Genes , track= UCSC gene , table=knownGene and then 'describe table schema'
You'll see that knownGene is linked to kgXref :

hg18.kgXref.kgID (via knownGene.name)
and kgXref contains a column named 'geneSymbol'.
all in one, you can get the positions of the transcripts for those genes:

mysql  -h  genome-mysql.cse.ucsc.edu -A -u genome -D hg18 \
-e 'select distinct X.geneSymbol,G.chrom,G.txStart-5000,G.txEnd+5000 \
from knownGene as G, kgXref as X \
where X.geneSymbol in ("APOB", "TTC39B", "ATF3", "RGS1", "LIPG") \
and X.kgId=G.name'
 
+------------+-------+----------------+--------------+
| geneSymbol | chrom | G.txStart-5000 | G.txEnd+5000 |
+------------+-------+----------------+--------------+
| APOB       | chr2  |       21072805 |     21125450 |
| ATF3       | chr1  |      210843616 |    210865739 |
| ATF3       | chr1  |      210800319 |    210865739 |
| ATF3       | chr1  |      210843616 |    210865704 |
| ATF3       | chr1  |      210849982 |    210864212 |
| LIPG       | chr18 |       45337424 |     45378276 |
| LIPG       | chr18 |       45337424 |     45367217 |
| RGS1       | chr1  |      190806479 |    190820782 |
| TTC39B     | chr9  |       15156560 |     15302244 |
| TTC39B     | chr9  |       15156560 |     15227442 |
| TTC39B     | chr9  |       15171584 |     15302244 |
| TTC39B     | chr9  |       15172968 |     15268702 |
| TTC39B     | chr9  |       15172968 |     15227442 |
+------------+-------+----------------+--------------+
I agree that the information is hard to find: I only knew where to find it because I use to play regularly with those tables.
The UCSC mailing list is a good place to find this kind of information. I also did a lot of reverse engineering by just 'greping' the flat files available from the UCSC.
 


 The same script works for hg19 as well.

No comments:

Post a Comment