Bio::ToolBox - data2wig

data2wig.pl

A program to convert a generic data file into a wig file.

SYNOPSIS

data2wig.pl [--options…] <filename>

File options:
-i --in <filename>                    input file: txt, gff, bed, vcf, etc
-o --out <filename>                   output file name
-H --noheader                         input file has no header row
-0 --zero                             file is in 0-based coordinate system

Column indices:
-a --ask                              interactive selection of columns
-s --score <index>                    score column, may be comma list
-c --chr <index>                      chromosome column
-b --begin --start <index>            start coordinate column
-e --end --stop <index>               stop coordinate column
--attrib <name>                       GFF or VCF attribute name of score

Wig options:
-p --step [fixed|variable|bed]        type of wig file 
--bed --bdg                           alternative shortcut for bedGraph
--size <integer>                      step size for fixedStep
--span <integer>                      span size for fixed and variable

Conversion options:
-f --fast                             fast mode, no error checking
--name <text>                         optional track name
--(no)track                           generate a track line
--mid                                 use the midpoint of feature intervals
--format <integer>                    format decimal points of scores
-m --method [mean | median | sum | max] combine multiple score columns

BigWig options:
-B --bw --bigwig                      generate a bigWig file
-d --db <database>                    database to collect chromosome lengths
--chromof <filename>                  specify a chromosome file
--bwapp </path/to/wigToBigWig>        specify path to wigToBigWig

General options:
-z --gz                               compress output text files
-v --version                          print version and exit
-h --help                             show extended documentation

OPTIONS

The command line flags and descriptions:

File options

--in <filename>

Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. Genome coordinates are required. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip compressed.
--out <filename>

Optionally specify the name of the output file. The track name is used as default. The ‘.wig’ extension is automatically added if required.
--noheader

The input file does not have column headers, often found with UCSC derived annotation data tables.
--zero

Source data is in the interbase (0-based) coordinate system. Shift the start position to the base (1-based) coordinate system. Wig files are by definition 1-based. This is automatically handled for most input files. Default is false.
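
The shift described for this option can be sketched in a few lines of Python. This is only an illustration of the coordinate arithmetic, not the program's own code; the function name to_one_based is hypothetical, and it assumes the input interval is 0-based and half-open, as in BED files.

```python
def to_one_based(start0, end0):
    """Convert a 0-based, half-open interval to a 1-based, inclusive one.

    Only the start coordinate shifts; the end keeps its numeric value.
    """
    return start0 + 1, end0


# An interbase interval covering the first 100 bases of a sequence
print(to_one_based(0, 100))
```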

Column indices

--ask

Indicate that the program should interactively ask for column indices or text strings for the GFF attributes, including coordinates, source, type, etc. It will present a list of the column names to choose from. Enter nothing for non-relevant columns or to accept default values.
--score <column_index or list of column indices>

Indicate the column index of the dataset in the data table to be used for the score. If a GFF file is used as input, the score column is automatically selected. If not defined as an option, then the program will interactively ask the user for the column index from a list of available columns. More than one column may be specified, in which case the scores are combined using the method specified.
--chr <column_index>

Optionally specify the column index of the chromosome or sequence identifier. This is required to generate the wig file. It may be identified automatically from the column header names.
--start <column_index>
--begin <column_index>

Optionally specify the column index of the start or chromosome position. This is required to generate the wig file. It may be identified automatically from the column header names.
--stop <column_index>
--end <column_index>

Optionally specify the column index of the stop or end position. It may be identified automatically from the column header names.
--attrib <attribute_name>

Optionally provide the name of the attribute key which represents the score value to put into the wig file. Both GFF and VCF attributes are supported. GFF attributes are automatically taken from the attribute column (index 9). For VCF columns, provide the index number of the sample column from which to take the value (usually 10 or higher) using the --index option. INFO field attributes can also be taken, if desired (use --index 8).

Wig options

--step [fixed|variable|bed]

The type of step progression for the wig file. Three wig formats are available:

- fixedStep: where data points are positioned at equal distances along the chromosome.
- variableStep: where data points are variably positioned along the chromosome.
- bed (bedGraph): where scores are associated with intervals defined by start and stop coordinates.

The fixedStep wig file has one column of data (score), the variableStep wig file has two columns (position and score), and the bedGraph has four columns of data (chromosome, start, stop, score). If the option is not defined, then the format is automatically determined from the metadata of the file.
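
To make the column layouts concrete, the following Python sketch renders made-up scores in each of the three formats. It is an illustration only, not the program's own code: the function names and the example values are hypothetical, the declaration lines follow the UCSC wiggle format description, and track lines and the optional span are left out.

```python
def fixed_step(chrom, start, step, scores):
    """One declaration line, then one score per line."""
    lines = [f"fixedStep chrom={chrom} start={start} step={step}"]
    lines.extend(str(score) for score in scores)
    return lines


def variable_step(chrom, points):
    """One declaration line, then 'position score' per line."""
    lines = [f"variableStep chrom={chrom}"]
    lines.extend(f"{pos} {score}" for pos, score in points)
    return lines


def bed_graph(intervals):
    """Four columns per line: chromosome, start, stop, score."""
    return [f"{chrom}\t{start}\t{stop}\t{score}"
            for chrom, start, stop, score in intervals]


print("\n".join(fixed_step("chr1", 1, 10, [0.5, 1.2])))
print("\n".join(variable_step("chr1", [(1, 0.5), (37, 1.2)])))
print("\n".join(bed_graph([("chr1", 0, 10, 0.5), ("chr1", 36, 50, 1.2)])))
```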
--bed
--bdg

Convenience option to specify that a bedGraph file should be written. Same as specifying --step=bed.
--size <integer>

Optionally define the step size in bp for ‘fixedStep’ wig file. This value is automatically determined from the table’s metadata, if available. If the --step option is explicitly defined as ‘fixed’, then the step size may also be explicitly defined. If this value is not explicitly defined or automatically determined, the variableStep format is used by default.
--span <integer>

Optionally indicate the size of the region in bp to which the data value should be assigned. The same size is assigned to all data values in the wig file. This is useful, for example, with microarray data where all of the oligo probes are the same length and you wish to assign the value across the oligo rather than the midpoint. The default is inherently 1 bp.

Conversion options

--fast

Disable checks for overlapping or duplicated intervals, unsorted data, valid score values, and calculated midpoint positions. Requires setting the chromosome, start, end (for bedGraph files only), and score column indices. WARNING: Use only if you trust your input file format and content.
--name <text>

The name of the track defined in the wig file. The default is to use the name of the chosen score column, or, if the input file is a GFF file, the base name of the input file.
--(no)track

Do (not) include the track line at the beginning of the wig file. Wig files normally require a track line, but if you will be converting to the binary bigwig format, the converter requires no track line. Why it can’t simply ignore the line is beyond me. This option is automatically set to false when the --bigwig option is enabled.
--mid

A boolean value to indicate whether the midpoint between the actual ‘start’ and ‘stop’ values should be used. The default is to use only the ‘start’ position.
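
The choice between start and midpoint amounts to the following, sketched here in Python. The function name wig_position is hypothetical, and rounding the midpoint down to a whole base is an assumption; the program may round differently.

```python
def wig_position(start, stop, use_mid=False):
    """Return the single position that represents a feature in the wig file."""
    if use_mid:
        # Assumption: round the midpoint down to a whole base position
        return (start + stop) // 2
    return start


print(wig_position(101, 150))                # default: the start position
print(wig_position(101, 150, use_mid=True))  # midpoint of the interval
```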
--format <integer>

Indicate the number of decimal places to which the score value should be formatted. The default is to not format the score value.
--method [mean|median|sum|max]

Define the method used to combine multiple data values at a single position. Wig files do not tolerate multiple identical positions. Default is mean.
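
As an illustration of what combining values at a single position means, the Python sketch below groups scores by chromosome and position and reduces each group with the chosen method. It is not the program's own code; combine_by_position and COMBINERS are hypothetical names, and the data points are made up.

```python
from collections import defaultdict
from statistics import mean, median

# The four methods listed for this option
COMBINERS = {"mean": mean, "median": median, "sum": sum, "max": max}


def combine_by_position(points, method="mean"):
    """Collapse (chromosome, position, score) tuples to one score per position."""
    grouped = defaultdict(list)
    for chrom, pos, score in points:
        grouped[(chrom, pos)].append(score)
    combine = COMBINERS[method]
    return [(chrom, pos, combine(scores))
            for (chrom, pos), scores in sorted(grouped.items())]


points = [("chr1", 100, 2.0), ("chr1", 100, 4.0), ("chr1", 200, 1.0)]
print(combine_by_position(points))          # default method is mean
print(combine_by_position(points, "max"))
```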

BigWig options

--bigwig
--bw

Indicate that a binary BigWig file should be generated instead of a text wiggle file.
--db <database>

Specify the name of a Bio::DB::SeqFeature::Store annotation database or other indexed data file, e.g. Bam or bigWig file, from which chromosome length information may be obtained. It may be supplied from the input file metadata.
--chromof <filename>

When converting to a BigWig file, provide a two-column tab-delimited text file containing the chromosome names and their lengths in bp. Alternatively, provide the name of a database with the --db option.
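
The layout of such a chromosome file can be sketched in Python as follows. The function name format_chrom_sizes is hypothetical, and the chromosome names and lengths are placeholders, not real genome data.

```python
def format_chrom_sizes(lengths):
    """Render chromosome names and lengths as two tab-delimited columns."""
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())


# Placeholder names and lengths, for illustration only
text = format_chrom_sizes({"chrA": 1000, "chrB": 2500})
print(text, end="")
```

Writing the returned text to a file gives something that could be passed to --chromof.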
--bwapp </path/to/wigToBigWig>

Specify the path to the UCSC wigToBigWig conversion utility. The default is to first check the BioToolBox configuration file biotoolbox.cfg for the application path. Failing that, it will search the default environment path for the utility. If found, it will automatically execute the utility to convert the wig file.

General options

--gz

A boolean value to indicate whether the output wiggle file should be compressed with gzip.
--version

Print the version number.
--help

Display the POD documentation.

DESCRIPTION

This program will convert any tab-delimited data text file into a wiggle formatted text file. This requires that the file contains not only the scores but also chromosomal coordinates, i.e. chromosome, start, and (optionally) stop. The program should automatically detect these columns (if appropriately labeled) or they can be specified. An option exists to use the midpoint of a region, e.g. microarray probe.

The wig file format is specified by documentation supporting the UCSC Genome Browser and detailed here: http://genome.ucsc.edu/goldenPath/help/wiggle.html. Three formats are supported, ‘fixedStep’, ‘variableStep’, and ‘bedGraph’. The format may be requested or determined empirically from the input file metadata. Genomic bin files generated with BioToolBox scripts record the window and step values in the metadata, which are used to determine the span and step wig values, respectively. The variableStep format is otherwise generated by default. The span is, by default, 1 bp.

Wiggle files cannot tolerate multiple data points at the same position, e.g. multiple microarray probes matching a repetitive sequence. An option exists to mathematically combine these data points into one value.

Strand is not inherently supported in wig files. If you have stranded data, they should be split into separate files. The BioToolBox script split_data_file.pl can be used for this purpose.

A binary BigWig file may additionally be generated from the text wiggle file. The binary format is preferable to the text version for a variety of reasons, including fast, random access and no loss in data value precision. More information can be found at this location: http://genome.ucsc.edu/goldenPath/help/bigWig.html. Conversion requires BigWig file support, supplied by the external wigToBigWig or bedGraphToBigWig utility available from UCSC.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.