Bio::ToolBox - get_intersecting_features
get_intersecting_features.pl
A program to pull out overlapping features from the database.
SYNOPSIS
get_intersecting_features.pl [--options] <filename>
File options:
-i --in <filename> input file
-o --out <filename> optional output file
Database options:
-d --db <database> database to search: name or sqlite
-f --feature <text> db feature to search
Modify search range:
-b --begin --start <integer> adjust relative search start coordinate
-e --end --stop <integer> adjust relative search stop coordinate
-p --pos [5 | m | 3] relative position of search coordinate
-x --extend <integer> extend search in both directions
-r --ref [start | mid] measure distance from which coordinate
General options:
-z --gz compress output
-v --version print version and exit
-h --help show extended documentation
OPTIONS
The command line flags and descriptions:
File options:
- --in <filename>
Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. The first row should be column headers. Bed files are acceptable, as are text files generated by other BioToolBox scripts. Files may be gzip compressed.
- --out <filename>
Optionally specify a new filename. A standard tim data text file is written. The default is to rewrite the input file.
Database options
- --db <database>
Specify the name of a Bio::DB::SeqFeature::Store annotation database from which gene or feature annotation may be derived. A database is required for generating new data files with features. This option may be skipped when using coordinate information from an input file (e.g. BED file), or when using an existing input file with the database indicated in the metadata.
- --feature <text>
Specify the name of the target features to search for in the database that intersect with the list of reference features. The type may be either a GFF “type” or a “type:method” string. If not specified, the database will be queried for potential GFF types and a list presented to the user to select one.
Modify search range
- --start <integer>
- --stop <integer>
- --begin <integer>
- --end <integer>
Optionally specify the relative start and stop positions, measured from the 5’ end (default) or from the position given by the “--pos” option, with which to restrict the search region for target features. For example, specify “--start=-200 --stop=0” to restrict the search to the promoter region of genes. Both positions must be specified. The default is to take the entire region of the reference feature.
- --pos [5 | m | 3]
Indicate the relative position from which to make the adjustments to the search window. Both start and stop adjustments may be made from the respective 5 prime, 3 prime, or middle position as dictated by the feature’s strand value.
- --extend <integer>
Optionally specify the number of bp to extend the reference feature’s region on each side. Useful when you have small reference regions and you want to include a larger search region.
- --ref [start | mid]
Indicate the reference point from which to calculate the distance between the reference and target features. The same reference point is used for both features. Valid options include “start” (or 5’ end for stranded features) and “mid” (for midpoint). Default is “start”.
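The effect of the --ref option can be illustrated with a short Python sketch. This is not the program's own code (the program is written in Perl); the function names `reference_point` and `feature_distance` are hypothetical, and the sign convention (target point minus reference point, in plain genomic coordinates) and the integer midpoint are assumptions made only for illustration.

```python
def reference_point(start, stop, strand=1, ref="start"):
    """Return the coordinate used as the measuring point of one feature.

    Hypothetical helper: "start" means the 5' end, which is the stop
    coordinate for reverse-strand features; "mid" means the midpoint.
    """
    if ref == "mid":
        return (start + stop) // 2  # integer midpoint (assumption)
    if ref == "start":
        return stop if strand < 0 else start
    raise ValueError("ref must be 'start' or 'mid'")


def feature_distance(reference, target, ref="start"):
    """Distance between two (start, stop, strand) features.

    The same kind of reference point is used for both features.
    The sign convention (target minus reference) is an assumption.
    """
    ref_point = reference_point(*reference, ref=ref)
    tgt_point = reference_point(*target, ref=ref)
    return tgt_point - ref_point


reference = (1000, 2000, 1)   # forward-strand reference feature
target = (1500, 1800, -1)     # reverse-strand target feature
d_start = feature_distance(reference, target, ref="start")
d_mid = feature_distance(reference, target, ref="mid")
```

With “start”, the reverse-strand target is measured from its 5’ end (coordinate 1800), while “mid” ignores strand entirely.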
General options
- --gz
Specify whether the output file should (not) be compressed with gzip.
- --version
Print the version number.
- --help
Display the POD documentation.
DESCRIPTION
This program will take a list of reference features and identify target features which intersect them. The reference features may be either named features (name and type) or genomic regions (chromosome, start, stop). By default, the search region for each reference feature is the entire feature, but it may be restricted or expanded in size with the appropriate modifiers (--start, --stop, --extend). The target features are specified as specific types.
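How the search region might be derived from a reference feature can be sketched in Python. This is an illustrative sketch only, not the program's Perl implementation: the function name `search_window` is hypothetical, coordinates are assumed to be 1-based and inclusive, and applying the extension after the start/stop adjustment is an assumption (the program may treat the two modifiers as alternatives).

```python
def search_window(start, stop, strand=1, begin=None, end=None,
                  pos="5", extend=0):
    """Return (window_start, window_stop) for one reference feature.

    begin/end mimic --start/--stop, pos mimics --pos, extend mimics
    --extend. All behavior here is an assumption for illustration.
    """
    if (begin is None) != (end is None):
        raise ValueError("both begin and end adjustments must be given")

    if begin is not None:
        # Pick the anchor coordinate according to position and strand.
        if pos == "m":
            anchor = (start + stop) // 2
        elif (pos == "5") == (strand >= 0):
            anchor = start   # 5' end on forward, or 3' end on reverse
        else:
            anchor = stop    # 5' end on reverse, or 3' end on forward
        if strand >= 0:
            start, stop = anchor + begin, anchor + end
        else:
            # Relative offsets run right-to-left on the reverse strand.
            start, stop = anchor - end, anchor - begin

    # Extend symmetrically, never going below coordinate 1.
    return max(1, start - extend), stop + extend


# Promoter region (200 bp upstream of the 5' end) on each strand.
forward_promoter = search_window(1000, 2000, 1, begin=-200, end=0)
reverse_promoter = search_window(1000, 2000, -1, begin=-200, end=0)
extended = search_window(1000, 2000, extend=500)
```

On the reverse strand the promoter window lies to the right of the feature, since “upstream” follows the feature’s orientation rather than the genomic coordinate order.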
Several attributes of the found features are appended to the original input file data. First, the number of target features found is reported. If more than one is found, the feature with the most overlap with the reference feature is preferentially listed. The name, type, and strand of the selected target feature are reported. Finally, the distance from the reference feature to the target feature is reported. The reference point for measuring the distance is by default the start or 5’ end of each feature, or optionally the midpoint. Note that the distance measurement is relative to the coordinates after adjustment with the --start, --stop, and --extend options.
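The selection of a single target when several intersect can be sketched as follows. This Python example is a minimal illustration under stated assumptions, not the program's actual code: the names `overlap_length` and `pick_best_target` and the dictionary layout of a target are hypothetical, coordinates are assumed to be 1-based and inclusive, and ties are resolved by taking the first target encountered.

```python
def overlap_length(a_start, a_stop, b_start, b_stop):
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_stop, b_stop) - max(a_start, b_start) + 1)


def pick_best_target(region, targets):
    """Return (count, best) for the targets intersecting a region.

    region is a (start, stop) tuple; each target is a dict with
    "name", "start", and "stop" keys (a hypothetical layout).
    best is None when nothing intersects.
    """
    r_start, r_stop = region
    hits = [t for t in targets
            if overlap_length(r_start, r_stop, t["start"], t["stop"]) > 0]
    if not hits:
        return 0, None
    # max() keeps the first of several equally overlapping targets.
    best = max(hits, key=lambda t: overlap_length(
        r_start, r_stop, t["start"], t["stop"]))
    return len(hits), best


targets = [
    {"name": "A", "start": 900, "stop": 1100},   # overlaps 101 bp
    {"name": "B", "start": 1500, "stop": 1900},  # overlaps 401 bp
    {"name": "C", "start": 3000, "stop": 3100},  # no overlap
]
count, best = pick_best_target((1000, 2000), targets)
```

Here two targets intersect the region, and the one sharing the most bases with it is the one reported.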
AUTHOR
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.