Bio::ToolBox

get_datasets.pl

A program to collect data for a list of features

SYNOPSIS

get_datasets.pl [--options...] <filename>

get_datasets.pl [--options...] --in <filename> <data1> <data2...>

Options for data files:
-i --in <filename>                  input file: txt bed gff gtf refFlat ucsc
-o --out <filename>                 optional output file, default overwrites input

Options for new files:
-d --db <name>                      annotation database: mysql sqlite
-f --feature <type>                 one or more feature types from db or gff

Options for feature "genome":
--win <integer>                     size of windows across genome (500 bp)
--step <integer>                    step size of windows across genome
--chrskip <regex>                   regular expression to skip chromosomes
--blacklist <filename>              file of intervals to skip (bed, gff, txt)
--prefix <text>                     prefix text for naming windows

Options for data collection:
-D --ddb <name|file>                data or BigWigSet database
-a --data <dataset|filename>        data from which to collect: bw bam etc
-m --method [mean|median|stddev|    statistical method for collecting data
          min|max|range|sum|count|   default mean
          pcount|ncount]
-t --strand [all|sense|antisense]   strand of data relative to feature (all)
-u --subfeature [exon|cds|          collect over gene subfeatures 
      5p_utr|3p_utr|intron] 
--force_strand                      use the specified strand in input file
--fpkm [region|genome]              calculate FPKM using which total count
--tpm                               calculate TPM values
-r --format <integer>               number of decimal places for numbers
--discard <number>                  discard features whose sum below threshold

Adjustments to features:
-x --extend <integer>               extend the feature in both directions
-b --begin --start <integer>        adjust relative start coordinate
-e --end --stop <integer>           adjust relative stop coordinate
-p --pos [5|m|3|53|p]               relative position to adjust (default 5')
--fstart=<decimal>                  adjust fractional start
--fstop=<decimal>                   adjust fractional stop
--limit <integer>                   minimum size to take fractional window

General options:
-z --gz                             compress output file
-c --cpu <integer>                  number of threads, default 4
--noparse                           do not parse input file into SeqFeatures
-v --version                        print version and exit
-h --help                           show extended documentation

OPTIONS

The command line flags and descriptions:

Options for data files

Options for new files

Options for feature “genome”

Options for data collection

Adjustments to features

General options

DESCRIPTION

This program collects dataset values from a variety of sources, including features in a BioPerl Bio::DB::SeqFeature::Store database, binary wig (.wib) files loaded in a database using Bio::Graphics::Wiggle, bigWig files, bigBed files, Bam alignment files, or a Bio::DB::BigWigSet database.

The values are collected for a list of known database features (genes, transcripts, etc.) or genomic regions (defined by chromosome, start, and stop). The list may be provided as an input file or generated as a new list from a database. Output data files may be reloaded for additional data collection.
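When no input list is supplied, the "genome" feature options in the synopsis (--win and --step) describe windows tiled across the genome. The following Python sketch illustrates that idea only. The function name, the 1-based inclusive coordinates, the step defaulting to the window size, and the truncation of the last window at the chromosome end are all assumptions made for illustration, not a description of the program's internals.

```python
def genome_windows(chrom, length, win=500, step=None):
    """Tile one chromosome with windows of size `win`, advancing by `step`.

    Assumptions (not taken from the program): coordinates are 1-based and
    inclusive, `step` defaults to `win`, and the final window is truncated
    at the end of the chromosome.
    """
    if step is None:
        step = win
    windows = []
    for start in range(1, length + 1, step):
        end = min(start + win - 1, length)
        windows.append((chrom, start, end))
    return windows


if __name__ == "__main__":
    # A hypothetical 1200 bp chromosome split into 500 bp windows.
    for window in genome_windows("chr1", 1200):
        print(*window, sep="\t")
```

Each tuple (chromosome, start, stop) corresponds to one genomic region as described above.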

At each feature or interval, multiple data points within the genomic segment are combined statistically and reported as a single value for the feature. The method for combining data points may be specified; the default is the mean of all data points.
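As a rough illustration of this reduction step, the Python sketch below collapses the data points found within one feature into a single value using the method names listed for --method. It is not the program's code. The function name is hypothetical, returning None for an empty feature is an assumption, the use of the population standard deviation for "stddev" is an assumption, and the pcount and ncount methods are omitted.

```python
import statistics


def combine(values, method="mean"):
    """Reduce the data points of one feature to a single reported value."""
    if not values:
        return None  # assumption: no data points means no value
    reducers = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "stddev": statistics.pstdev,  # assumption: population standard deviation
        "min": min,
        "max": max,
        "range": lambda v: max(v) - min(v),
        "sum": sum,
        "count": len,
    }
    return reducers[method](values)


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0, 6.0]  # hypothetical data points in one feature
    for name in ("mean", "median", "range", "count"):
        print(name, combine(scores, name))
```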

The coordinates of the features may be adjusted in several ways: by setting a relative start and stop, a fractional start and stop, an extension to both start and stop, or the relative position (5’, 3’, or midpoint) from which the adjustment is made.
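The Python sketch below shows one plausible reading of two of these adjustments: extension in both directions, and a relative start and stop measured from a reference position. It assumes 1-based inclusive coordinates, strands written as "+" and "-", and negative offsets pointing upstream of the reference. The function names are hypothetical, and the program's exact arithmetic may differ.

```python
def extend(start, end, amount):
    """Extend a feature by `amount` bp in both directions (floor at 1)."""
    return max(1, start - amount), end + amount


def adjust_relative(start, end, strand, begin, stop, pos="5"):
    """Return (new_start, new_end) for offsets relative to a reference point.

    Assumptions: `pos` is "5", "3", or "m"; negative offsets are upstream
    with respect to the strand of the feature.
    """
    forward = strand != "-"
    if pos == "m":
        ref = (start + end) // 2
    elif (pos == "5") == forward:
        ref = start  # 5' end of a forward feature, or 3' end of a reverse one
    else:
        ref = end    # 3' end of a forward feature, or 5' end of a reverse one
    if forward:
        return ref + begin, ref + stop
    # On the reverse strand, upstream lies at higher coordinates.
    return ref - stop, ref - begin


if __name__ == "__main__":
    # Hypothetical gene at 1000..2000; take 500 bp on either side of the 5' end.
    print(adjust_relative(1000, 2000, "+", -500, 500))
    print(adjust_relative(1000, 2000, "-", -500, 500))
    print(extend(1000, 2000, 100))
```

Under these assumptions the same offsets select a window around the 5’ end regardless of strand, which is why the reverse-strand case mirrors the arithmetic.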

Stranded data may be collected if the dataset supports stranded information, although doing so may significantly slow down data collection. In addition, two or more datasets may be combined and treated as one.
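To make the sense and antisense distinction concrete, the Python sketch below filters data points by their strand relative to the feature, and pools two datasets before selection. The data layout (value and strand pairs), the function name, and the simple concatenation used to combine datasets are assumptions for illustration; they do not describe how the program stores or merges data.

```python
def select_by_strand(points, feature_strand, mode="all"):
    """Keep data point values according to strand relative to the feature.

    `points` is a list of (value, strand) pairs with strand "+" or "-";
    `mode` is "all", "sense", or "antisense".
    """
    if mode == "all":
        return [value for value, _ in points]
    want_same = mode == "sense"
    return [value for value, strand in points
            if (strand == feature_strand) == want_same]


if __name__ == "__main__":
    # Two hypothetical datasets overlapping one forward-strand feature.
    dataset_a = [(1.0, "+"), (2.0, "-")]
    dataset_b = [(3.0, "+")]
    pooled = dataset_a + dataset_b  # assumption: combining means pooling points
    print(select_by_strand(pooled, "+", "sense"))
    print(select_by_strand(pooled, "+", "antisense"))
```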

EXAMPLES

These are examples of common scenarios for collecting data.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.