Bio::ToolBox - ucsc_table2gff3

ucsc_table2gff3.pl

A program to convert UCSC gene tables to GFF3 or GTF annotation.

SYNOPSIS

 ucsc_table2gff3.pl --ftp <text> --db <text>
 
 ucsc_table2gff3.pl [--options] --table <filename>

UCSC database options:
-f --ftp [refgene|known|all]          specify what tables to retrieve from UCSC
          
-d --db <text>                        UCSC database name: hg19,hg38,danRer7, etc
-h --host <text>                      specify UCSC hostname

Input file options:
-t --table <filename>                 name of table, repeat or comma list
-k --kgxref <filename>                kgXref file
-c --chromo <filename>                chromosome file

Conversion options:
--source <text>                       source text, default UCSC
--chr   | --nochr         (true)      include chromosomes in output
--gene  | --nogene        (true)      assemble into genes
--cds   | --nocds         (true)      include CDS subfeatures
--utr   | --noutr         (false)     include UTR subfeatures
--codon | --nocodon       (false)     include start and stop codons
--share | --noshare       (true)      share subfeatures
--name  | --noname        (false)     include name
-g --gtf                              convert to GTF instead of GFF3

General options:
-z --gz                               compress output
-v --version                          print version and exit
-h --help                             show extended documentation

OPTIONS

The command line flags and descriptions:

UCSC database options

Input file options

Conversion options

General options

DESCRIPTION

This program will convert a UCSC gene or gene prediction table file into a GFF3 (or optionally GTF) format file. It will build canonical gene->transcript->[exon, CDS, UTR] hierarchical structures. It will attempt to classify non-coding genes by type, inferring the type from the gene name. Various additional informational attributes may also be included with the gene and transcript features; these are derived from supporting table files.
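To make the hierarchy concrete, here is a minimal Python sketch of how gene, transcript, and exon features can be linked in GFF3 through ID and Parent attributes. It is only an illustration and not the program's own code: the function names, feature IDs, and coordinates are invented, and the ID naming scheme in the program's real output may differ.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attrs):
    """Format one nine-column GFF3 line; attrs is a dict of attribute pairs."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    return "\t".join(
        [seqid, source, ftype, str(start), str(end), ".", strand, ".", attr_text]
    )


def build_gene(seqid, gene_id, strand, transcripts, source="UCSC"):
    """Return GFF3 lines for one gene.

    transcripts maps a transcript ID to a list of 1-based (start, end) exons.
    """
    all_exons = [exon for exons in transcripts.values() for exon in exons]
    gene_start = min(start for start, _ in all_exons)
    gene_end = max(end for _, end in all_exons)
    lines = [
        gff3_line(seqid, source, "gene", gene_start, gene_end, strand, {"ID": gene_id})
    ]
    for tx_id, exons in transcripts.items():
        tx_start = min(start for start, _ in exons)
        tx_end = max(end for _, end in exons)
        # The transcript points at its gene through the Parent attribute.
        lines.append(
            gff3_line(seqid, source, "mRNA", tx_start, tx_end, strand,
                      {"ID": tx_id, "Parent": gene_id})
        )
        for number, (start, end) in enumerate(sorted(exons), 1):
            # Each exon points at its transcript in the same way.
            lines.append(
                gff3_line(seqid, source, "exon", start, end, strand,
                          {"ID": f"{tx_id}.exon{number}", "Parent": tx_id})
            )
    return lines


# Hypothetical gene with one transcript and two exons.
for line in build_gene("chr1", "geneA", "+", {"txA.1": [(100, 200), (300, 400)]}):
    print(line)
```

CDS and UTR subfeatures would attach to the transcript in the same way as exons, each carrying a Parent attribute that names the transcript.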

Two kinds of table files are currently supported: gene prediction tables, including refGene and UCSC knownGene, and supporting tables, such as kgXref.

Tables obtained from UCSC are typically in the extended GenePrediction format, although simple genePrediction and refFlat formats are also supported. See http://genome.ucsc.edu/FAQ/FAQformat.html#format9 regarding UCSC gene prediction table formats.
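As a rough illustration of the extended gene prediction layout, the Python sketch below splits one tab-delimited line into named fields and converts the exon coordinates from UCSC's 0-based, half-open convention to the 1-based, inclusive convention used by GFF3 and GTF. The field names follow the UCSC format description linked above. The function name and the example record are made up, and treating a 16-column line as having a leading bin column is an assumption about database dump files. This is not how the program itself parses tables.

```python
GENEPRED_EXT_FIELDS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]


def parse_genepred_ext(line):
    """Parse one extended gene prediction line into a dict (illustrative only)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) == len(GENEPRED_EXT_FIELDS) + 1:
        # Assumption: the extra first column is the bin index found in table dumps.
        columns = columns[1:]
    if len(columns) != len(GENEPRED_EXT_FIELDS):
        raise ValueError(f"expected 15 or 16 columns, found {len(columns)}")
    record = dict(zip(GENEPRED_EXT_FIELDS, columns))
    # Exon lists are comma-separated with a trailing comma, so skip empty items.
    starts = [int(x) for x in record["exonStarts"].split(",") if x]
    ends = [int(x) for x in record["exonEnds"].split(",") if x]
    # UCSC starts are 0-based; adding 1 gives 1-based inclusive coordinates.
    record["exons"] = [(start + 1, end) for start, end in zip(starts, ends)]
    record["start"] = int(record["txStart"]) + 1
    record["end"] = int(record["txEnd"])
    return record


# A made-up record with a leading bin column.
example = "\t".join([
    "585", "tx1", "chr1", "+", "999", "2000", "1099", "1900", "2",
    "999,1499,", "1200,2000,", "0", "GENE1", "cmpl", "cmpl", "0,1,",
])
record = parse_genepred_ext(example)
print(record["name2"], record["chrom"], record["exons"])
```

The end coordinates need no adjustment because a half-open end and an inclusive 1-based end refer to the same base.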

The latest table files may be automatically downloaded using FTP from UCSC or another host. Since these files are periodically updated, this may be the best option. Alternatively, individual files may be specified through command line options. Files may be obtained manually through FTP, HTTP, or the UCSC Table Browser.

If provided, chromosome and/or scaffold features will be written as GFF3-style sequence-region pragmas (even for GTF files, just in case).
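For reference, a GFF3 sequence-region pragma has the form `##sequence-region seqid start end`. The Python sketch below shows how such lines could be generated from chromosome names and lengths. It assumes the input is tab-delimited with the name in the first column and the length in the second, which is an assumption about the chromosome file rather than something stated here. The function name is hypothetical and the code is not taken from the program.

```python
def sequence_region_pragmas(lines):
    """Turn 'name<TAB>length' lines into GFF3 sequence-region pragmas."""
    pragmas = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, size = line.split("\t")[:2]
        # A whole chromosome or scaffold spans position 1 through its length.
        pragmas.append(f"##sequence-region {name} 1 {int(size)}")
    return pragmas


# Made-up chromosome lengths; any extra columns are ignored.
chromosome_lines = ["chr1\t1000\textra", "", "chrM\t250"]
for pragma in sequence_region_pragmas(chromosome_lines):
    print(pragma)
```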

If you need to set up a database using UCSC annotation, you should first take a look at the BioToolBox script db_setup.pl, which provides a convenient automated database setup based on UCSC annotation.

AUTHOR

Timothy J. Parnell, PhD
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.