Bio::ToolBox - split_data_file

split_data_file.pl

A program to split a data file by rows based on common data values.

SYNOPSIS

split_data_file.pl [--options] <filename>

File options:
-i --in <filename>                (txt bed gff gtf vcf refFlat ucsc etc)
-p --prefix <text>                output file prefix (input basename)
-H --noheader                     input file has no headers

Splitting options:
-x --index <column_index>         column with values to split upon
-t --tag <text>                   use VCF/GFF attribute
-m --max <integer>                maximum number of items per output file

General options:
-z --gz                           compress output file
-v --version                      print version and exit
-h --help                         show extended documentation

OPTIONS

The command line flags and descriptions:

File options

Splitting options

General options

DESCRIPTION

This program will split a data file into multiple files based on common values in the data table. All rows with the same value will be written into the same file. A good example is chromosome, where all data points for a given chromosome will be written to a separate file, resulting in one file for each chromosome found in the original file. The column containing the values to split and group by should be indicated; if the column is not specified, it may be selected interactively from a list of column headers.
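The grouping behavior described above can be illustrated with a minimal Python sketch. This is not the program's actual implementation (which is written in Perl); the function name `group_rows` and the sample rows are hypothetical and only show the idea of collecting rows that share a value in one column.

```python
from collections import defaultdict


def group_rows(rows, index):
    """Group rows (lists of fields) by the value found in column `index`.

    Hypothetical helper for illustration only; not part of Bio::ToolBox.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row)
    return dict(groups)


# Made-up example rows: chromosome, start, end
rows = [
    ["chr1", "100", "200"],
    ["chr2", "150", "250"],
    ["chr1", "300", "400"],
]

# Split on column 0 (the chromosome); each group would go to its own file
groups = group_rows(rows, 0)
for value, members in groups.items():
    print(value, len(members))
```

Each key of `groups` corresponds to one output file, and row order within a group follows the order in the input.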

This program can also split files based on an attribute tag in GFF or VCF files. Attributes are often specially formatted, delimited key-value pairs associated with each feature in the file. Provide the name of the attribute tag on which to split the file. Because attributes may vary by feature type, no interactive list of attributes is offered.
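As a rough illustration of what splitting on an attribute tag involves, the sketch below extracts one tag's value from a GFF3-style attribute string, where pairs are written as `key=value` and separated by semicolons. The function name `get_tag` and the sample strings are hypothetical. It does not handle GTF-style attributes (`key "value";`), and it is only an assumption that the program parses attributes in a comparable way.

```python
def get_tag(attributes, tag):
    """Return the value of `tag` in a GFF3-style attribute string, or None.

    Hypothetical helper for illustration; assumes 'key=value' pairs
    separated by semicolons.
    """
    for pair in attributes.split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key == tag:
            return value
    return None


# Made-up attribute strings
print(get_tag("ID=gene1;Name=abc;biotype=protein_coding", "biotype"))
print(get_tag("ID=gene2;Name=xyz", "biotype"))
```

A feature that lacks the requested tag yields `None` here; how the program itself treats such rows is not stated in this document.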

If the max argument is set, then each group will be written to one or more files, with each file having no more than the indicated maximum number of data lines. This is useful to keep the file size reasonable, especially when processing the files further and free memory is constrained. A reasonable limit may be 100K or 1M lines.
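The effect of a maximum line limit can be sketched as cutting each group into consecutive parts of at most that many rows. The function name `chunk` is hypothetical, and the sketch only illustrates the idea rather than the program's own code.

```python
def chunk(rows, max_rows):
    """Split a list of rows into consecutive parts of at most `max_rows` rows.

    Hypothetical helper for illustration only.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be a positive integer")
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


# Five rows with a limit of two rows per part give three parts
parts = chunk(["r1", "r2", "r3", "r4", "r5"], 2)
print([len(p) for p in parts])
```

Only the last part of a group can hold fewer rows than the limit.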

The resulting files will be named using the basename of the input file, appended with the unique group value (for example, the chromosome name) demarcated with a #. If a maximum line limit is set, then the file part number is appended to the basename, padded with zeros to three digits (to assist in sorting). Each file will have duplicated and preserved metadata. The original file is preserved.
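The naming scheme can be approximated with the sketch below. The `#` delimiter and the three-digit zero padding come from the description above. The underscore before the part number, the retention of the input file's extension, and the function name `output_name` are assumptions made for illustration, so the names the program actually writes may differ.

```python
import os


def output_name(infile, value, part=None):
    """Build an illustrative output file name for one group.

    Hypothetical helper; the "_" before the part number and the handling
    of the file extension are assumptions, not documented behavior.
    """
    base, ext = os.path.splitext(os.path.basename(infile))
    name = f"{base}#{value}"
    if part is not None:
        # Zero-pad the part number to three digits so files sort correctly
        name += f"_{part:03d}"
    return name + ext


print(output_name("data.bed", "chr1"))
print(output_name("data.bed", "chr1", part=2))
```

Zero padding keeps the parts in order under a plain lexical sort, so part 002 sorts before part 010.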

This program is intended as the complement to ‘join_data_files.pl’.

AUTHOR

Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.