Bio::ToolBox

data2fasta.pl

A program to retrieve sequences from a list of features

SYNOPSIS

data2fasta.pl [--options…] <filename>

File Options:
-i --in <filename>                input file: txt, gff, bed, ucsc, vcf, etc
-o --out <filename>               output file name

Database:
-d --db <name|fasta>              annotation database with sequence or fasta

Feature selection:
-f --feature <text>               feature when parsing gff3, gtf, or ucsc input
-u --subfeature [exon|cds|        collect over subfeatures 
      5p_utr|3p_utr] 

Column indices:
-n --name --id <index>            name or ID column
-s --seq <index>                  column with sequence
-c --chr <index>                  chromosome column
-b --begin --start <index>        start coordinate column
-e --end --stop <index>           stop coordinate column
-t --strand <index>               strand column
-x --extend <integer>             extend coordinates in both directions
--desc <index>                    description column

Fasta output options:
--cat                             concatenate all sequences into one
--pad <integer>                   pad concatenated sequences with Ns

General options:
-z --gz                           compress output fasta file
-v --version                      print version and exit
-h --help                         show extended documentation

OPTIONS

The command line flags and descriptions:

File options

Database

Feature selection

Column indices

Fasta output options

General options

DESCRIPTION

This program will take a tab-delimited text file (BED file, for example) and generate either a multi-sequence fasta file containing the sequences of each feature defined in the input file, or optionally a single concatenated fasta file. If concatenating, the individual sequences may be padded with the given number of ‘N’ bases.
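The concatenation behavior can be illustrated with a minimal Python sketch. It is not the program's own code: the function names `concatenate` and `to_fasta` are hypothetical, and it assumes the pad is inserted between adjacent sequences, which the description above does not spell out.

```python
def concatenate(sequences, pad=0):
    """Join sequences into one string, separated by `pad` N bases.

    Assumption: padding goes between adjacent sequences only.
    """
    spacer = "N" * pad
    return spacer.join(sequences)


def to_fasta(name, sequence, width=60):
    """Format one fasta record, wrapping the sequence at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Two short features joined with a pad of three Ns.
joined = concatenate(["ACGT", "GG"], pad=3)
record = to_fasta("concatenated", joined)
```

With a pad of zero the sequences are simply joined end to end.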

This program has two modes. If the name and sequence are already present in the file, it will generate the fasta file directly from the file content.
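As a rough illustration of this first mode, the following Python sketch writes one fasta record per row of a tab-delimited table. The function name `rows_to_fasta` is hypothetical, the input is assumed to have no header line, and the 0-based column indices are a convenience of the sketch rather than a statement about how the program numbers its columns.

```python
import csv
import io


def rows_to_fasta(handle, name_col, seq_col):
    """Yield one fasta record per data row of a tab-delimited file.

    Column indices are 0-based in this sketch; blank lines and lines
    starting with '#' are skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        yield f">{row[name_col]}\n{row[seq_col]}\n"


# Hypothetical two-column input: name, then sequence.
example = io.StringIO("# comment\nsite1\tACGTAC\nsite2\tGGCCTA\n")
records = list(rows_to_fasta(example, name_col=0, seq_col=1))
```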

Alternatively, if only genomic position information (chromosome, start, stop, and optionally strand) is present in the file, then the sequence will be retrieved from a database. Multiple database adapters are supported for indexing genomic fastas, including the Bio::DB::HTS package, the Bio::DB::Sam package, or the BioPerl Bio::DB::Fasta adapter. Annotation databases such as Bio::DB::SeqFeature::Store are also supported. If strand information is provided, then the reverse complement of the sequence is returned for reverse strand coordinates.
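The coordinate lookup and strand handling can be sketched in Python as follows. This is an illustration only, not how the program or its database adapters are implemented: the in-memory `genome` dictionary stands in for an indexed fasta, the function names are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
# Translation table for complementing DNA bases, preserving case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def fetch(genome, chrom, start, stop, strand="+"):
    """Return the sequence of chrom from start to stop.

    Assumption: start and stop are 1-based and inclusive.
    Reverse strand features are returned as the reverse complement.
    """
    seq = genome[chrom][start - 1:stop]
    if strand in ("-", "-1", -1):
        seq = reverse_complement(seq)
    return seq


# Toy stand-in for an indexed genomic fasta.
genome = {"chr1": "AACCGGTTAC"}
forward = fetch(genome, "chr1", 7, 10, "+")
reverse = fetch(genome, "chr1", 1, 4, "-")
```

Here the forward strand lookup returns the bases as stored, while the reverse strand lookup of `AACC` is complemented to `TTGG` and then reversed.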

AUTHOR

Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.