Bio::ToolBox

manipulate_datasets.pl

A program to manipulate tab-delimited data files.

SYNOPSIS

manipulate_datasets.pl [--options ...] <filename>

File options:
-i --in <filename>                input data file
-o --out <filename>               output file, default overwrite
-H --noheader                     input file has no header row

Non-interactive functions:
-f --func [ reorder | delete | rename | new | number | concatenate | 
            split | coordinate | sort | gsort | null | duplicate | 
            above | below | specific | keep | addname | cnull | 
            absolute | minimum | maximum | log | delog | format | pr | 
            add | subtract | multiply | divide | combine | scale | 
            zscore | ratio | diff | normdiff | center | rewrite | 
            export | treeview | summary | stat ]
-x --index <integers>             column index to work on

Operation options:
-n --exp --num <integer>          numerator column index for ratio
-d --con --den <integer>          denominator column index for ratio
-t --target <text> or <number>    target value for certain functions
--place [r | n]                   replace column contents or new column
--(no)zero                        include zero in certain functions
--dir [i | d]                     sort order: increase or decrease
--name <text>                     name of new column
--log                             values are in log scale

General Options:
-z --gz                           compress output file
-Z --bgz                          bgzip compress output file
-v --version                      print version and exit
-h --help                         show extended documentation

OPTIONS

The command line flags and descriptions:

File options

Non-interactive functions

Operation options

General options

DESCRIPTION

This program performs common mathematical and other manipulations on one or more columns in a data file. It is designed as a simple, convenient command-line replacement for common manipulations performed in a full-featured spreadsheet program (e.g. Excel), particularly for datasets too large to be loaded in one. The program is designed to be operated primarily as an interactive program, allowing multiple manipulations to be performed in one session. Alternatively, a single manipulation may be specified using command line options, so the program can also be called from shell scripts.
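A single non-interactive manipulation called from a shell script might look like the following sketch. It uses only flags listed in the synopsis; the file names and column indices are hypothetical placeholders, and whether column indices count from 0 or 1 is an assumption to check against the installed version.

```shell
#!/bin/sh
# Sketch: compute a ratio column non-interactively.
# File names and column indices below are hypothetical placeholders.
infile="data.txt"
outfile="data_ratio.txt"

# Collect the arguments first so they can be inspected or logged.
set -- --in "$infile" --func ratio --num 4 --den 5 --out "$outfile"

if command -v manipulate_datasets.pl >/dev/null 2>&1; then
    # Writing to a separate --out file leaves the input untouched.
    manipulate_datasets.pl "$@"
else
    echo "manipulate_datasets.pl not found on PATH" >&2
fi
```

Giving `--out` explicitly avoids the default behavior of overwriting the input file.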

Note that the datafile is loaded entirely in memory. For extremely large datafiles, e.g. binned genomic data, it may be best to first split the file into chunks (use split_data_file.pl), perform the manipulations, and recombine the file (use join_data_file.pl). This could be done through a simple shell script.
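The middle step of such a script could be sketched as below. The helper name `process_chunks`, the chunk file names, and the choice of `zscore` on column 3 are all hypothetical. The split and join steps appear only as comments, since the options of `split_data_file.pl` and `join_data_file.pl` are not described here.

```shell
#!/bin/sh
# Sketch: apply one non-interactive manipulation to every chunk file.
# process_chunks is a hypothetical helper, not part of Bio::ToolBox.
process_chunks() {
    for chunk in "$@"; do
        # No --out is given, so each chunk is overwritten in place.
        manipulate_datasets.pl --in "$chunk" --func zscore --index 3 || return 1
    done
}

# Outline of the full workflow:
#   1. split_data_file.pl   (produces the chunk files)
#   2. process_chunks chunk_1.txt chunk_2.txt ...
#   3. join_data_file.pl    (recombines the chunks)
```

Note that a per-chunk function such as `zscore` would be computed within each chunk rather than across the whole dataset, so this pattern suits only manipulations that act on rows independently.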

The program keeps track of the number of manipulations performed and, if any were performed, writes the changed data to file. Unless an output file name is provided, it overwrites the input file (NO backup is made!).
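Because no backup is made, a script that modifies a file in place may want to copy it first. The following is a minimal sketch; `with_backup` is a hypothetical helper, not something the program provides, and the file name and column index in the example call are placeholders.

```shell
#!/bin/sh
# Sketch: keep a copy of the input before an in-place manipulation.
# with_backup is a hypothetical helper, not part of Bio::ToolBox.
with_backup() {
    file=$1
    shift
    # Preserve timestamps and permissions on the copy; stop if the copy fails.
    cp -p -- "$file" "$file.bak" || return 1
    # Run the remaining arguments as the command.
    "$@"
}

# Example call (hypothetical file name and column index):
#   with_backup data.txt manipulate_datasets.pl --in data.txt --func delete --index 3
```

The alternative is to pass `--out` with a different file name, which leaves the input file unchanged.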

FUNCTIONS

This is a list of the functions available for manipulating columns. These may be selected interactively from the main menu (note the case sensitivity!), or specified on the command line using the --func option.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.