Analyze statistical significance of TF co-binding at enhanceosome regions

Tests every pair of transcription factors (TFs) for co-occurrence at enhanceosome regions, using either Fisher's exact test or a permutation test. For each pair it reports an odds ratio and Pointwise Mutual Information (PMI), and it clusters the TFs hierarchically by co-occurrence.

Usage

analyze_tf_cobinding(
  enhanceosome,
  database,
  fdr_threshold = 0.05,
  min_regions = 5L,
  method = c("fisher", "permutation"),
  n_permutations = 1000L
)

Arguments

enhanceosome: epiRomics class database containing enhanceosome calls
database: epiRomics class database containing all data initially loaded
fdr_threshold: numeric, FDR threshold for significance (default: 0.05)
min_regions: integer, minimum number of co-bound regions to report a pair (default: 5)
method: character, statistical method: "fisher" (default) for Fisher's exact test or "permutation" for permutation-based testing that accounts for spatial autocorrelation.
n_permutations: integer, number of permutations when method = "permutation" (default: 1000). Ignored for Fisher's test.

Value

list with components:

pairwise: data.frame with columns: tf1, tf2, n_both, n_tf1_only, n_tf2_only, n_neither, odds_ratio, pvalue, fdr, pmi, significant
presence_matrix: logical matrix (regions x TFs) of binding presence
clustering: hclust object from hierarchical clustering of TF co-occurrence (Jaccard distance, "ward.D2" linkage)
tf_names: character vector of TF names analyzed
n_regions: integer, total number of enhanceosome regions
method: character, statistical method used
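
As an illustration of how a clustering component of this kind can be derived from a presence matrix, the sketch below uses only base R. The toy matrix and its values are invented for the example, and the exact steps the function performs internally are an assumption; only the Jaccard distance and "ward.D2" linkage come from the description above.

```r
# Toy presence matrix (regions x TFs); values are made up for illustration.
presence <- cbind(
  TF1 = c(TRUE,  TRUE,  TRUE,  TRUE,  FALSE, FALSE),
  TF2 = c(TRUE,  TRUE,  TRUE,  FALSE, FALSE, FALSE),
  TF3 = c(FALSE, FALSE, FALSE, FALSE, TRUE,  TRUE),
  TF4 = c(FALSE, FALSE, FALSE, TRUE,  TRUE,  TRUE)
)

# Put TFs in rows and convert to numeric 0/1 so that dist() compares TFs.
tf_by_region <- t(presence) * 1

# For binary data, dist(method = "binary") is the Jaccard distance:
# 1 - |regions bound by both| / |regions bound by either|.
jaccard <- dist(tf_by_region, method = "binary")

# Ward linkage on the Jaccard distances, as named in the Value section.
clustering <- hclust(jaccard, method = "ward.D2")

# Cut the tree into two groups of TFs that tend to bind the same regions.
groups <- cutree(clustering, k = 2)
```

Here TF1 and TF2 share three of four regions, so they fall into one group, and TF3 and TF4 into the other. The returned hclust object can be passed to plot() to draw the dendrogram.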

Details

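This replaces the previous decision-tree approach (epiRomics_predictors) with statistically rigorous co-binding analysis. For each pair of TFs, a 2x2 contingency table (both bound, only the first bound, only the second bound, neither bound) is constructed from the enhanceosome presence matrix and tested with the chosen method. The resulting p-values are adjusted across all tested pairs with the Benjamini-Hochberg procedure to control the false discovery rate (FDR).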
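
The per-pair computation described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `test_pair` is a hypothetical helper, the presence matrix is invented, and the odds ratio shown is the conditional maximum likelihood estimate returned by `fisher.test()`, which may differ from the estimator the function reports.

```r
# Toy presence matrix (regions x TFs); values are made up for illustration.
presence <- cbind(
  TF1 = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  TF2 = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
  TF3 = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Hypothetical helper: contingency counts, Fisher's test, and PMI for one pair.
test_pair <- function(presence, tf1, tf2) {
  a <- presence[, tf1]
  b <- presence[, tf2]
  n <- length(a)
  n_both     <- sum(a & b)
  n_tf1_only <- sum(a & !b)
  n_tf2_only <- sum(!a & b)
  n_neither  <- sum(!a & !b)
  # 2x2 table: columns = tf1 bound / not bound, rows = tf2 bound / not bound.
  tab <- matrix(c(n_both, n_tf1_only, n_tf2_only, n_neither), nrow = 2)
  ft <- fisher.test(tab)
  # PMI(A,B) = log2(P(A,B) / (P(A) * P(B))), with probabilities as region fractions.
  pmi <- log2((n_both / n) / (mean(a) * mean(b)))
  data.frame(tf1, tf2, n_both, n_tf1_only, n_tf2_only, n_neither,
             odds_ratio = unname(ft$estimate), pvalue = ft$p.value, pmi = pmi)
}

# Test every unordered pair of TFs.
pairs <- combn(colnames(presence), 2)
pairwise <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(i) {
  test_pair(presence, pairs[1, i], pairs[2, i])
}))

# Benjamini-Hochberg adjustment across all pairs, then apply the FDR threshold.
pairwise$fdr <- p.adjust(pairwise$pvalue, method = "BH")
pairwise$significant <- pairwise$fdr < 0.05
```

With so few regions no pair reaches significance here; the point is only to show how the columns of the pairwise data.frame relate to one another.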

Statistical methods

Fisher's exact test (method = "fisher"): Tests whether two TFs co-occur at enhanceosome regions more (or less) often than expected by chance. Assumes independence between regions. This is the default and is appropriate when regions are largely non-overlapping. Reference: Fisher, R.A. (1922) J Royal Stat Soc.
Permutation test (method = "permutation"): Shuffles the binding labels of the second TF in each pair across regions to generate a null distribution of co-occurrence, accounting for spatial autocorrelation between nearby genomic regions. It is more conservative than Fisher's exact test but robust to violations of independence. Reference: Gel et al. (2016) Bioinformatics 32(2):289-291. "regioneR: an R/Bioconductor package for the association analysis of genomic regions."
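Odds ratio (OR): Measures strength of association as the odds that one TF is bound at regions where the other is bound, relative to regions where the other is absent. OR > 1 indicates co-occurrence; OR < 1 indicates mutual exclusion.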
PMI: Pointwise Mutual Information quantifies the degree of association between two TFs: PMI(A,B) = log2(P(A,B) / (P(A)*P(B))). PMI > 0 indicates co-occurrence; PMI < 0 indicates avoidance. Reference: Church & Hanks (1990) Computational Linguistics.
BH-FDR: Benjamini-Hochberg correction controls the false discovery rate across all pairwise tests. Reference: Benjamini & Hochberg (1995) J Royal Stat Soc B.
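
To make the label-shuffling idea behind the permutation test concrete, here is a minimal base R sketch. `perm_test_pair` is a hypothetical helper; the test statistic (number of co-bound regions), the one-sided comparison, and the add-one correction are assumptions made for the example. This simple version shuffles labels freely across regions, and any handling of genomic position in the package's own routine is not shown.

```r
set.seed(1)  # make the shuffles reproducible

# Hypothetical helper: one-sided permutation p-value for co-occurrence.
perm_test_pair <- function(a, b, n_permutations = 1000L) {
  observed <- sum(a & b)
  # Null distribution: co-bound counts after shuffling the second TF's labels.
  null_counts <- replicate(n_permutations, sum(a & sample(b)))
  # The add-one correction keeps the estimated p-value strictly above zero.
  pvalue <- (sum(null_counts >= observed) + 1) / (n_permutations + 1)
  list(observed = observed, null_mean = mean(null_counts), pvalue = pvalue)
}

# Toy data: 100 regions; the first TF binds 30, the second 25, all 25 shared.
a <- rep(c(TRUE, FALSE), times = c(30, 70))
b <- rep(c(TRUE, FALSE), times = c(25, 75))

res <- perm_test_pair(a, b)
```

In this toy case the observed overlap of 25 regions lies far above the shuffled counts, whose mean is close to the 7.5 expected under independence, so the p-value is near its minimum of 1 / (n_permutations + 1). That floor is also why a larger n_permutations is needed to resolve very small p-values.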

Note on spatial autocorrelation

Fisher's exact test assumes independence between observations (regions). Nearby genomic regions may be spatially correlated (e.g., broad TF binding domains), which can inflate significance. If your enhanceosome regions contain many closely spaced or overlapping intervals, consider using method = "permutation" for more conservative p-values.

Examples

db <- make_example_database()
eso <- make_example_enhanceosome(db)
cobinding <- analyze_tf_cobinding(eso, db)
cobinding$pairwise[, c("tf1", "tf2", "odds_ratio", "fdr")]
#>   tf1 tf2 odds_ratio fdr
#> 1 TF1 TF2          0   1