Bio::ToolBox

correlate_position_data.pl

A script to calculate correlations between two datasets along the length of a feature.

SYNOPSIS

correlate_position_data.pl [--options] <filename>

Options for data files:
-i --in <filename>               input file: txt bed etc
-o --out <filename>              optional output file, default overwrites input
-d --db <name>                   alternate annotation database

Options for data sources:
-D --ddb <name|file>             data or BigWigSet database
-r --ref <dataset|filename>      reference data: bw, name, etc
-t --test <dataset|filename>     test data: bw, name, etc

Options for correlating data:
--pval                           calculate P-value by ANOVA
--shift                          determine optimal shift to match datasets
--radius <integer>               radius in bp around reference point to calculate
-p --pos [5|m|3]                 reference point to measure correlation (m)
--norm [rank|sum]                normalization method between datasets
--force_strand                   force an alternate strand

General options:
-c --cpu <integer>               number of threads (4)
-z --gz                          compress output with gz
-v --version                     print version and exit
-h --help                        show extended documentation

OPTIONS

The command line flags and descriptions:

Options for data files

Options for data sources

Options for correlating data

General options

DESCRIPTION

This program will calculate statistics between the positioned scores of two different datasets over a window from an annotated feature or chromosomal segment. These statistics help determine whether the positions or distribution of scores across the window vary, or have undergone a positional shift, between a test and a reference dataset. For example, the enrichment of nucleosome signal from a ChIP experiment may shift in genomic position, indicating a change in nucleosome position.

Two statistics may be calculated. First, by default, the program calculates a Pearson linear correlation coefficient (r value) between the datasets. Additionally, an ANOVA analysis may be performed between the datasets to generate a P-value.
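
To illustrate the first statistic, here is a minimal Python sketch of a Pearson r calculation over two score profiles. It is not the program's own code: the function name pearson_r and the example scores are hypothetical, and the ANOVA step is not shown.

```python
import math

def pearson_r(ref, test):
    """Pearson linear correlation between two equal-length score lists.

    Returns None when either profile is flat, since r is undefined there.
    """
    if len(ref) != len(test) or len(ref) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    n = len(ref)
    mean_ref = sum(ref) / n
    mean_test = sum(test) / n
    cov = sum((x - mean_ref) * (y - mean_test) for x, y in zip(ref, test))
    var_ref = sum((x - mean_ref) ** 2 for x in ref)
    var_test = sum((y - mean_test) ** 2 for y in test)
    if var_ref == 0 or var_test == 0:
        return None
    return cov / math.sqrt(var_ref * var_test)

# Hypothetical per-position scores collected across one window
reference = [0, 1, 4, 9, 4, 1, 0]
same_shape = [0, 2, 8, 18, 8, 2, 0]  # same profile at twice the depth
shifted = [0, 0, 1, 4, 9, 4, 1]      # same profile moved 1 bp to the right

print(pearson_r(reference, same_shape))  # scaling alone does not lower r
print(pearson_r(reference, shifted))     # a positional shift does
```

Because r is insensitive to a uniform scaling of the scores, a drop in r points to a change in the shape or position of the signal rather than in its overall depth.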

By default, the correlation is determined between the data points collected over the entire length of the feature. Alternatively, a radius and reference point (default is midpoint) may be provided that sets the window for collecting scores and calculating a correlation.
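
The window selection can be sketched as follows. This is an illustration only: the function name window_for, the 1-based inclusive coordinates, the strand-aware handling of the 5' and 3' ends, and the clipping at position 1 are all assumptions, not details taken from the program.

```python
def window_for(start, stop, strand="+", pos="m", radius=None):
    """Return (window_start, window_stop) for collecting scores.

    Without a radius the whole feature is used. With a radius, the window
    spans `radius` bp on either side of the chosen reference point.
    """
    if radius is None:
        return start, stop
    if pos == "m":
        point = (start + stop) // 2
    elif pos == "5":
        # assumed: the 5' end is the stop coordinate on the reverse strand
        point = stop if strand == "-" else start
    elif pos == "3":
        point = start if strand == "-" else stop
    else:
        raise ValueError("pos must be one of '5', 'm', or '3'")
    # assumed: never extend the window before the first base
    return max(1, point - radius), point + radius

print(window_for(1000, 2000))                   # entire feature
print(window_for(1000, 2000, radius=100))       # around the midpoint
print(window_for(1000, 2000, "-", "5", 50))     # around the 5' end, reverse strand
```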

In general, fragment ChIP or nucleosome coverage should be used rather than point (start or midpoint) data to obtain a more reliable Pearson value. Fragment coverage is more akin to smoothed data and gives better results than interpolated point data.

Normalized read-depth data should be used when possible. If necessary, values can be normalized using one of two methods. The values may be converted to rank positions (compare to Kendall’s tau), or scaled such that the absolute sum values are equal (for example, when working with sequence tag read counts).
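
The two normalization methods might look roughly like this in Python. The function names are hypothetical, and two details are assumptions rather than documented behavior: tied values share their average rank, and the test profile is scaled to match the reference profile's absolute sum.

```python
def to_ranks(values):
    """Replace each value by its 1-based rank; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the end of the run of values tied with order[i]
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def scale_to_sum(values, target_sum):
    """Scale values so that their absolute sum equals target_sum."""
    total = sum(abs(v) for v in values)
    if total == 0:
        return list(values)  # nothing to scale in an all-zero profile
    factor = target_sum / total
    return [v * factor for v in values]

# Hypothetical read counts from two libraries of different depth
reference = [2, 4, 6]
test = [10, 20, 30]

print(to_ranks([10, 30, 20, 30]))
print(scale_to_sum(test, sum(abs(v) for v in reference)))
```

Rank conversion discards magnitudes entirely and keeps only the ordering of scores, while sum scaling preserves the shape of each profile and only equalizes total signal.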

In addition to calculating a correlation coefficient, an optimal shift may also be calculated. This essentially shifts the data, 1 bp at a time, in order to identify a shift that would produce a higher correlation. In other words, what amount of movement to the left or right would make the test data look like the reference data? The window is shifted from -2 radius to +2 radius relative to the reference point, and the highest correlation is reported along with the shift value that generated it.
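
The shift search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the program's implementation: the names pearson_r and best_shift are hypothetical, the test scores are assumed to be collected over the window extended by max_shift positions on each side, and a positive shift here means the test signal lies to the right of the reference.

```python
import math

def pearson_r(a, b):
    """Pearson r between two equal-length lists, or None if either is flat."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)

def best_shift(ref, test_ext, max_shift):
    """Slide a window over test_ext 1 bp at a time; return (shift, r) of the best match.

    test_ext must cover the reference window plus max_shift positions per side.
    """
    if len(test_ext) != len(ref) + 2 * max_shift:
        raise ValueError("test_ext must extend ref by max_shift on each side")
    best_r, best_s = None, 0
    for shift in range(-max_shift, max_shift + 1):
        begin = max_shift + shift
        r = pearson_r(ref, test_ext[begin:begin + len(ref)])
        if r is not None and (best_r is None or r > best_r):
            best_r, best_s = r, shift
    return best_s, best_r

# Hypothetical scores: the test peak sits 2 bp to the right of the reference peak
reference = [0, 1, 4, 9, 4, 1, 0]
test_extended = [0, 0, 0, 0, 0, 0, 1, 4, 9, 4, 1, 0, 0]

shift, r = best_shift(reference, test_extended, max_shift=3)
print(shift, r)
```

An exhaustive scan is used because the number of candidate shifts is small; each candidate costs one correlation over the window.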

AUTHOR

Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.