Bio::ToolBox

get_gene_regions.pl

A program to collect specific, often un-annotated, regions from genes.

SYNOPSIS

get_gene_regions.pl [--options...] --in <filename> --out <filename>

get_gene_regions.pl [--options...] --db <text> --out <filename>

Source data:
-i --in <filename>            input annotation: GFF3, GTF, genePred, etc
-d --db <name | filename>     database: name, file.db, or file.sqlite

Feature selection:
-f --feature <type>           optionally specify gene type or type:source
-t --transcript               specify the transcript type
     [all|mRNA|ncRNA|snRNA|
     snoRNA|tRNA|rRNA|miRNA|
     lincRNA|misc_RNA]
-r --region                   specify the gene region to collect
     [tss|tts|cdsStart|cdsStop|
     splice|UTR|exon|
     collapsedExon|altExon|
     uncommonExon|commonExon|
     firstExon|lastExon|intron|
     collapsedIntron|altIntron|
     uncommonIntron|commonIntron|
     firstIntron|lastIntron]
--gencode                     include only GENCODE tagged genes
--biotype <regex>             include only specific biotype
--tsl                         select transcript support level
     [best|best1|best2|best3|
     best4|best5|1|2|3|4|5|NA]
-u --unique                   select only unique regions
-l --slop <integer>           treat regions within X bp as duplicates
-K --chrskip <regex>          skip features from certain chromosomes

Adjustments:
-b --begin --start <integer>  specify adjustment to start coordinate
-e --end --stop <integer>     specify adjustment to stop coordinate

General options:
--bed                         output in BED6 format
-o --out <filename>           specify output name
-z --gz                       compress output
-v --version                  print version and exit
-h --help

OPTIONS

The command line flags and descriptions:

Source data

Feature selection

Adjustments

General options

DESCRIPTION

This program collects specific regions from annotated genes and/or transcripts. These regions are often not explicitly defined in the source GFF3 annotation, so a script is needed to derive them. They include the start and stop sites of transcription, introns, splice sites (both 5’ and 3’), exons, the first (5’) or last (3’) exons, and the alternate or common exons of genes with multiple transcripts. The program can optionally report only unique regions, which is especially useful when a single gene has multiple alternative transcripts. A slop factor is available to accommodate imprecise annotation.
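To make the idea of deriving un-annotated regions concrete, here is a minimal Python sketch of two of the steps described above: computing introns as the gaps between a transcript's exons, and collapsing near-identical regions using a slop value. It is an illustration only and not the script's actual implementation. The function names, the 1-based inclusive coordinates, the example transcripts, and the rule that two regions are duplicates when both ends lie within the slop distance are all assumptions.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, stop), assumed 1-based and inclusive


def introns_from_exons(exons: List[Region]) -> List[Region]:
    """Return the gaps between consecutive exons of a single transcript."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_stop), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_stop > 1:  # skip abutting or overlapping exons
            introns.append((prev_stop + 1, next_start - 1))
    return introns


def unique_regions(regions: List[Region], slop: int = 0) -> List[Region]:
    """Keep the first of any regions whose start and stop both agree within slop bp."""
    kept: List[Region] = []
    for start, stop in sorted(regions):
        is_duplicate = any(
            abs(start - s) <= slop and abs(stop - e) <= slop for s, e in kept
        )
        if not is_duplicate:
            kept.append((start, stop))
    return kept


# Two hypothetical transcripts of one gene that differ only in the last exon start
transcript_a = [(100, 200), (301, 400), (501, 600)]
transcript_b = [(100, 200), (301, 400), (520, 600)]

all_introns = introns_from_exons(transcript_a) + introns_from_exons(transcript_b)
print(unique_regions(all_introns))           # [(201, 300), (401, 500), (401, 519)]
print(unique_regions(all_introns, slop=20))  # [(201, 300), (401, 500)]
```

With no slop, the shared first intron is reported once and the two variants of the second intron are both kept. With a slop of 20 bp, the two variants are treated as the same region.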

The program reports the chromosome, start and stop coordinates, strand, name, and parent and transcript names for each region identified. The reported start and stop coordinates may be shifted with the adjustment options. Output is written as a standard biotoolbox-formatted text file, which may be converted into a standard BED or GFF file using the appropriate biotoolbox scripts, or used directly in data collection.
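The following Python sketch shows one way coordinate adjustment and BED conversion could work for a reported region. It is not taken from the script. It assumes that adjustments are applied relative to strand (the start adjustment moves the 5’ end, the stop adjustment moves the 3’ end, with negative values pointing upstream), that input coordinates are 1-based and inclusive, and that the BED score column is a placeholder of 0. The function names and the example region are hypothetical.

```python
from typing import Tuple


def adjust_region(start: int, stop: int, strand: str,
                  begin_adj: int = 0, end_adj: int = 0) -> Tuple[int, int]:
    """Shift the 5' end by begin_adj and the 3' end by end_adj.

    Assumed convention: positive values move downstream, negative upstream.
    """
    if strand == "-":
        # On the reverse strand the 5' end is the larger coordinate,
        # and downstream means decreasing position.
        new_start, new_stop = start - end_adj, stop - begin_adj
    else:
        new_start, new_stop = start + begin_adj, stop + end_adj
    new_start = max(new_start, 1)  # do not run off the chromosome start
    if new_start > new_stop:
        raise ValueError("adjustments produce an empty region")
    return new_start, new_stop


def to_bed6(chrom: str, start: int, stop: int, name: str,
            strand: str, score: int = 0) -> str:
    """Format a 1-based inclusive region as a BED6 line (0-based, half-open)."""
    fields = (chrom, start - 1, stop, name, score, strand)
    return "\t".join(str(f) for f in fields)


# A hypothetical TSS at position 1000, widened 100 bp upstream and 50 bp downstream
print(adjust_region(1000, 1000, "+", begin_adj=-100, end_adj=50))  # (900, 1050)
print(adjust_region(1000, 1000, "-", begin_adj=-100, end_adj=50))  # (950, 1100)
print(to_bed6("chr1", 900, 1050, "geneA_tss", "+"))
```

Applying the same adjustments to forward and reverse strand features yields mirrored windows, which is the usual reason for making adjustments strand-aware.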

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.