Application of EpiClass on public epigenomic data

Author

Joanny Raby

Results section 3: Figures and their code

The formatting of the figures may differ slightly from those in the paper, but they display the same data points.

All code cells are folded by default. To view any cell, click “Code” to expand it, or use the code options near the main title above to unfold all at once.

Some figure-generation code is not included here because the interactive outputs would be too large (e.g. PCA figures). If a figure appears without accompanying code or explanation, it was created manually.

Setup

Imports and paths setups.

Code
from __future__ import annotations

from pathlib import Path
from typing import List

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import skops.io as skio
import upsetplot
from IPython.display import display
import numpy as np

from epiclass.utils.notebooks.paper.paper_utilities import (
    ASSAY,
    ASSAY_ORDER,
    IHECColorMap,
    MetadataHandler,
    SplitResultsHandler,
)
Code
CANCER = "harmonized_sample_cancer_high"
CORE_ASSAYS = ASSAY_ORDER[0:7]
Code
base_dir = Path.home() / "Projects/epiclass/output/paper"
paper_dir = base_dir
if not paper_dir.exists():
    raise FileNotFoundError(f"Directory {paper_dir} does not exist.")

base_data_dir = base_dir / "data"
base_fig_dir = base_dir / "figures"
tables_dir = base_dir / "tables"

base_metadata_dir = base_data_dir / "metadata"
Code
IHECColorMap = IHECColorMap(base_fig_dir)
assay_colors = IHECColorMap.assay_color_map
cell_type_colors = IHECColorMap.cell_type_color_map
sex_colors = IHECColorMap.sex_color_map
Code
split_results_handler = SplitResultsHandler()

metadata_handler = MetadataHandler(paper_dir)
metadata_v2 = metadata_handler.load_metadata("v2")
metadata_v2_df = metadata_handler.load_metadata_df("v2")
Code
base_pred_dir = base_data_dir / "training_results" / "dfreeze_v2" / "predictions"
if not base_pred_dir.exists():
    raise FileNotFoundError(f"Directory {base_pred_dir} does not exist.")

Results context

Predictions on external databases (ENCODE, ChIP-Atlas and Recount3) were all performed using the same classifiers trained on the labeled EpiAtlas samples.

Metadata category Nb classes Experiment Key (comet.com) Training Files
assay_epiclass 7 69488630801b4a05a53b5d9e572f0aaa 16788
assay_epiclass 11 0f8e5eb996114868a17057bebe64f87c 20922
harmonized_donor_sex 3 4b908b83e0ec45c3ab991e65aa27af0c 18299
harmonized_donor_life_stage 5 91214ed0b1664395b1826dc69a495ed4 15372
harmonized_sample_cancer_high 2 15da476b92f140eab818ece369248f4c 20922
harmonized_biomaterial_type 4 f42b6f4e147c4f1bbe378f3eed415099 20922

Classes:

  • assay 7c: 6 h3k* histone marks + input
  • assay 11c: assay7c + rna_seq + mrna_seq + wgbs_standard + wgbs_pbat
  • harmonized_donor_sex: male, female, mixed
  • harmonized_donor_life_stage: adult, child, newborn, fetal, embryonic
  • harmonized_sample_cancer_high (modification of harmonized_sample_disease_high): cancer, non-cancer (healthy/None+disease)
  • harmonized_biomaterial_type (biomat): cell line, primary cell, primary cell culture, primary tissue

Training metadata: Approximately IHEC_sample_metadata_harmonization.v1.1.extended.csv.

See data/metadata/epiatlas/README.md for metadata details, and training_metadata_vs_official_v1.1.json for exact difference of our training data and v1.1.

Additonally, for ENCODE, the harmonized_sample_ontology_intermediate model was used on a subset of files with known EpiATLAS biospecimens.

Metadata category Nb classes Experiment Key (comet.com) Training Files
harmonized_sample_ontology_intermediate 16 bb66b72ae83645d587e50b34aebb39c3 16379

The metadata for ENCODE predictions was created using the FILE + EXPERIMENT + BIOSAMPLE accessions, starting from filenames. For the initial extraction process, see: src/python/epiclass/utils/notebooks/paper/encode_metadata_creation.ipynb (permalink). Final metadata file: encode_full_metadata_2025-02_no_revoked.freeze1.csv(.xz)

Figure 3 and Supp Fig 8A,D - Public databases metrics

ChIP-Atlas, ENCODE and recount3 metrics at various prediction score thresholds were generated using the MetricsPerAssay class, from src/python/epiclass/utils/notebooks/paper/metrics_per_assay.py (permalink).

(Folder permalink)

  • ChIP-Atlas: src/python/epiclass/utils/notebooks/paper/c-a_pred_analysis.ipynb (section Summary metrics by assay)
  • ENCODE: src/python/epiclass/utils/notebooks/paper/encode_pred_analysis.ipynb (section All 5 tasks metrics summary, run Setup section beforehand)
  • recount3: src/python/epiclass/utils/notebooks/paper/recount3_pred_analysis.ipynb

Results were manually merged into Supplementary Table 9, and graphs were manually created using values from that table.

Figure 3A - ENCODE predictions

Load public DBs prediction metrics summary.

Code
gen_path = tables_dir / "dfreeze_v2" / "predictions"
table_path = gen_path / "ST9-Public_DBs_predictions_summary.csv"
df_encode = pd.read_csv(table_path, low_memory=False, skiprows=1)

Filter for high-confidence (prediction score >= 0.6) ENCODE metrics

Code
mask1 = df_encode["Dataset source"] == "ENCODE"
mask2 = df_encode["min_predScore"].astype(str) == "0.6" # 'high-confidence' threshold
mask3 = df_encode["Fig3A"] == "Y" # others might be redundant, but this is explicit

df_encode = df_encode[mask1 & mask2 & mask3]

df_encode["assay"].replace("avg-non-core", "non-core", inplace=True)

Define graphing function.

Code
def reproduce_encode_plot(df: pd.DataFrame, assay_colors: dict):
    """
    Reproduces the ENCODE Core and non-core data boxplot with default right-side legend.

    Expected columns in df: 'task_name', 'assay', 'acc', 'nb_samples'
    Note: It assumes 'acc' is on a 0-100 scale. If it's 0-1, it scales it automatically.
    """
    all_assays_ordered = ASSAY_ORDER + ["non-core"]
    assay_colors["non-core"] = "rgb(255,0,255)"

    # Scale accuracy to 0-100% if it seems to be in 0-1 range
    if df["acc"].max() <= 1.0:
        df = df.copy()
        df["acc"] = df["acc"] * 100

    # Order of categories on the x-axis
    category_order = [
        "assay_epiclass_9c",
        "biospecimen_intermediate",
        "sex",
        "cancer_status",
        "biomaterial_type",
        "lifestage_merged",
    ]
    cat_to_int = {cat: i for i, cat in enumerate(category_order)}

    # Predefine a fixed horizontal offset for each assay
    # Spread them evenly between -0.35 and 0.35 relative to the integer box center
    offsets = np.linspace(-0.35, 0.35, num=len(all_assays_ordered))
    assay_to_offset = {assay: offsets[i] for i, assay in enumerate(all_assays_ordered)}

    fig = go.Figure()

    # --- LAYER 1: The summary boxes (White fill, black borders, no points) ---
    # We use mapped integer X values so it aligns with our custom float scatter coordinates later
    fig.add_trace(
        go.Box(
            x=df["task_name"].map(cat_to_int),
            y=df["acc"],
            boxpoints="outliers",  # Force 1.5x IQR whiskers
            marker=dict(opacity=0, size=0.00001),  # invisible dots
            line=dict(color="black", width=1.5),
            fillcolor="white",  # White box over grey background
            boxmean=True,  # Adds dashed line for the mean
            showlegend=False,  # Hide this underlying trace from the legend
            name="Summary",
            hoverinfo="skip",  # We will handle hover info with the scatter points instead
        )
    )

    # --- LAYER 2: The colored scatter points ---
    # Iterate using the ordered list to preserve legend order
    for assay in all_assays_ordered:
        df_assay = df[df["assay"] == assay].copy()
        if df_assay.empty:
            print(f"Warning: No data for assay '{assay}'. Skipping.")
            continue

        # Calculate precise X positions: Integer category center + Assay-specific offset
        base_x = df_assay["task_name"].map(cat_to_int)
        exact_x = base_x + assay_to_offset[assay]

        # Calculate dynamic point sizes based on log10 of sample size
        point_sizes = 2 + 4 * np.log10(df_assay["nb_samples"])

        fig.add_trace(
            go.Scatter(
                x=exact_x,
                y=df_assay["acc"],
                mode="markers",
                name=assay,
                marker=dict(
                    color=assay_colors[assay],
                    size=point_sizes,
                    opacity=1,  # Non unique size removes default 1 opacity, gotta set it back to 1
                    line=dict(color="black", width=1),
                ),
                showlegend=True,
                customdata=df_assay[["nb_samples"]],
                hovertemplate=(
                    "<b>Assay:</b> " + assay + "<br>"
                    "<b>Accuracy:</b> %{y:.1f}%<br>"
                    "<b>Samples:</b> %{customdata[0]}<br>"
                    "<extra></extra>"
                ),
            )
        )

    # --- LAYER 3: Aesthetics and Layout ---
    def calculate_marker_size(n):
        return 2 + 4 * np.log10(np.clip(n, a_min=1, a_max=None))

    reference_sizes = [50, 100, 500, 1000, 5000]

    for n in reference_sizes:
        fig.add_trace(
            go.Scatter(
                x=[None],
                y=[None],  # Invisible coordinates, only renders in legend
                mode="markers",
                name=f"{n:,}",  # e.g., "1,000"
                legend="legend2",  # Assign to the SECOND legend
                marker=dict(
                    color="black",  # Fully black reference circles
                    size=calculate_marker_size(n),
                    opacity=1,
                    line=dict(color="black", width=1),
                ),
                showlegend=True,
                hoverinfo="skip",  # No tooltip needed for fake traces
            )
        )

    fig.update_layout(
        boxmode="overlay",  # Crucial: ensures the invisible point boxes sit directly on top of the boxes
        title_font_size=20,
        yaxis_title="Accuracy (%)",
        xaxis=dict(
            title="Classifier",
            title_font_size=18,
            tickfont_size=14,
            # Map the integers back to the original task_name values!
            tickmode="array",
            tickvals=list(range(len(category_order))),
            ticktext=category_order,
            range=[-0.6, len(category_order) - 0.4],
        ),
        legend=dict(
            title_text="Assay",
            itemsizing="constant",
        ),
        legend2=dict(
            title_text="Samples",
            itemsizing="trace",
            yanchor="top",
            y=0.2,  # Position Middle-Right (below main legend)
            xanchor="left",
            x=1.02,
        ),
    )

    fig.show()
Code
reproduce_encode_plot(df_encode, assay_colors)

Figure 3A: Accuracy of each metadata classifier breakdown per assay (dots) of the high-confidence predictions on the ENCODE data for which metadata was available, excluding those in EpiATLAS (Supplementary Table 9). Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.

Figure 3B - ChIP-Atlas assay (7classes) prediction

The following code shows how the various fractions presented in the graph are obtained.

Read prediction file.

Code
ca_preds_dir = tables_dir / "dfreeze_v2" / "predictions"
ca_preds_path = (
    ca_preds_dir / "ChIP-Atlas_predictions_20240606_merge_metadata_freeze1.csv.xz"
)
ca_preds_df = pd.read_csv(ca_preds_path, sep=",", low_memory=False, compression="xz")

print(f"ChIP-Atlas: {ca_preds_df.shape[0]} total files")
ChIP-Atlas: 48669 total files

Ignoring potentially non-core files.

Code
DB_COLS = ["GEO_target_mod", "C-A_target", "Cistrome_target", "NGS_target_mod"]
acceptable_targets = ['h3k4me3', 'h3k36me3', 'h3k27ac', 'h3k27me3', 'h3k9me3', 'h3k4me1', 'input', "ctrl", "----", "unclassified"]

# Uniformize just in case
ca_preds_df[DB_COLS] = ca_preds_df[DB_COLS].fillna("----").apply(lambda x: x.str.lower())

ca_core7_df = ca_preds_df[
    ca_preds_df[DB_COLS]
        .isin(acceptable_targets)
        .all(axis=1)
]

diff = ca_preds_df.shape[0] - ca_core7_df.shape[0]
print(f"{diff} potentially non-core removed.")
print(f"{ca_core7_df.shape[0]} files remaining.")

display(ca_core7_df["target_majority_consensus"].value_counts(dropna=False))
1604 potentially non-core removed.
47065 files remaining.
target_majority_consensus
input           17828
h3k27ac         11234
h3k4me3          6386
h3k27me3         4295
h3k4me1          3143
h3k9me3          2212
h3k36me3         1564
no_consensus      403
Name: count, dtype: int64

Removing ChIP-Atlas experiment overlap with EpiATLAS

Code
print(f"ChIP-Atlas: Initial {ca_core7_df.shape[0]} files")
no_epiatlas_df = ca_core7_df[ca_core7_df["is_EpiAtlas_EpiRR"] == "0"]

diff = ca_core7_df.shape[0] - no_epiatlas_df.shape[0]
print(f"ChIP-Atlas: {diff} files with EpiAtlas EpiRR removed")
print(f"ChIP-Atlas: {no_epiatlas_df.shape[0]} files without EpiAtlas EpiRR")
ChIP-Atlas: Initial 47065 files
ChIP-Atlas: 1047 files with EpiAtlas EpiRR removed
ChIP-Atlas: 46018 files without EpiAtlas EpiRR

Figure 3B: Inference of the Assay classifier on the ChIP-Atlas datasets represented as a multi-layer donut chart where the internal layer is depicting the availability status of the metadata as being provided/extracted (yellow) vs missing (grey), the central layer depicting the classifier prediction matching (green) or not (red) the expected metadata label, and the external layer containing the confidence level where prediction scores above 0.6 correspond to high-confidence predictions (light to dark blue >0.6, >0.8 and >0.9) vs low confidence predictions (brown). The high-confidence mismatching predictions were further divided between datasets predicted as Input (light orange) vs potential mislabels (dark red).

High-confidence predictions

Code
total_N = no_epiatlas_df.shape[0]
high_confidence_pred_df = no_epiatlas_df[no_epiatlas_df["Max_pred_assay7"] >= 0.6]

high_confidence_N = high_confidence_pred_df.shape[0]
low_confidence_N = total_N - high_confidence_N
N_percent = high_confidence_N / total_N
print(
    f"{high_confidence_N}/{total_N} files ({high_confidence_N/total_N:.2%}) with high confidence assay prediction\n"
    f"{low_confidence_N}/{total_N} files ({low_confidence_N/total_N:.2%}) with low confidence assay prediction"
)
41203/46018 files (89.54%) with high confidence assay prediction
4815/46018 files (10.46%) with low confidence assay prediction

Match between manual target consensus and MLP prediction

Code
total_N = high_confidence_pred_df.shape[0]

match_rule = (
    high_confidence_pred_df[DB_COLS].eq(
    high_confidence_pred_df["Predicted_class_assay7"], axis=0
).any(axis=1)
)
match_df = high_confidence_pred_df[match_rule]
mismatch_df = high_confidence_pred_df[~match_rule]

agreement_N = match_df.shape[0]

print(
    f"ChIP-Atlas: {agreement_N}/{total_N} files ({agreement_N / total_N:.2%}) with agreement between predicted assay and any of the 4 sources"
)
ChIP-Atlas: 39537/41203 files (95.96%) with agreement between predicted assay and any of the 4 sources

Mismatch breakdown

Code
total_mismatch = mismatch_df.shape[0]
input_rule = mismatch_df["Predicted_class_assay7"] == "input"
input_pred_N = input_rule.sum()

print(
    f"ChIP-Atlas: {input_pred_N}/{total_mismatch} files ({input_pred_N / total_mismatch:.2%}) with mismatch predicted as input\n"
    f"ChIP-Atlas: {total_mismatch-input_pred_N}/{total_mismatch} files ({(total_mismatch-input_pred_N) / total_mismatch:.2%}) potential mislabels\n"
)

print("ChIP-Atlas: Distribution of consensus assay labels among the mismatches:")
true_mismatch_df = mismatch_df[~input_rule]
display(true_mismatch_df["core7_DBs_consensus"].value_counts(dropna=False))
ChIP-Atlas: 1146/1666 files (68.79%) with mismatch predicted as input
ChIP-Atlas: 520/1666 files (31.21%) potential mislabels

ChIP-Atlas: Distribution of consensus assay labels among the mismatches:
core7_DBs_consensus
Identical    468
1 source      49
Different      3
Name: count, dtype: int64

Figure 3C-J - ChIP-Atlas and recount3 other predictions

Figure 3C-J: Inference of the donor Sex (C,G), sample Cancer status (D,H), Biomaterial type (E,I) and donor Life stage (F,J) classifiers on the ChIP-Atlas (C-F) and Recount3 (G-J) sources represented as multi-layer donut charts as in panel B.

Supplementary Figure 6 - Prediction score thresholds on predictions from various DBs

See Additional Supplementary Figures page.

Supplementary figures 5 and 7 are relevant to section 2 of results and not included here.

PCAs - Supp Fig8-9

The following applies to supplementary figures 8B, 9B and 9E.

PCAs were computed via src/python/epiclass/utils/compute_pca.py (permalink), using IncrementalPCA from scikit-learn.

Graphing was done using code similar to src/python/epiclass/utils/notebooks/paper/pca_plot.ipynb (permalink), which uses the output of compute_pca.py.

Supplementary Figure 8 - ENCODE predictions

EpiClass results on ENCODE datasets.

A,D - Core and non-core accuracy

See public databases metrics section for details.

Supplementary Figure 8A,D: Inference of the Assay, Biospecimen source, donor Sex, sample Cancer status, Biomaterial type and donor Life stage classifiers on the ENCODE datasets from core assays (histones ChIP-Seq, RNA-Seq, WGBS) represented as multi-layer donut charts as in Fig. 3B. The Biospecimen source classifier was only applied on datasets from the 13 major biospecimen sources out of the 16 found in EpiATLAS training data (N = 1215 datasets). Panel D is as panel A but for non-core ENCODE datasets.

B - All assays PCA

See PCA section for details on the methodology.

Supplementary Figure 8B: Distribution of the EpiATLAS and ENCODE data over their two first principal components.

C - Non-core predictions with assay category mapping

Load ENCODE predictions.

Code
encode_preds_dir = base_pred_dir / "encode"
encode_preds_path = (
    encode_preds_dir / "complete_encode_predictions_augmented_2025-02_metadata.csv.gz"
)
encode_preds_df = pd.read_csv(
    encode_preds_path, sep=",", low_memory=False, compression="gzip"
)
print(f"ENCODE: {encode_preds_df.shape[0]} total files")
ENCODE: 11643 total files

Filter out overlap with EpiATLAS EpiRRs.

Code
encode_preds_df = encode_preds_df[encode_preds_df["in_epiatlas"].astype(str) == "False"]
print(f"ENCODE: {encode_preds_df.shape[0]} total files with no EpiAtlas overlap")
ENCODE: 8777 total files with no EpiAtlas overlap

Load ENCODE manually created assay categories, and merge them with ENCODE predictions data.

Code
encode_meta_dir = base_data_dir / "metadata" / "encode"

non_core_categories_path = (
    encode_meta_dir / "non-core_encode_assay_category_2024-08-29.csv"
)

non_core_categories_df = pd.read_csv(non_core_categories_path, sep=",", low_memory=False)
print(f"ENCODE: {non_core_categories_df.shape[0]} non-core categories")

non_core_map = non_core_categories_df.set_index("target").to_dict()["Assay category"]

N_mapped = len([val for val in non_core_map.values() if val != "not_looked"])
print(f"ENCODE: {N_mapped} non-core categories mapped to functional categories.")


# Select non-core samples
encode_non_core_df = encode_preds_df[
    encode_preds_df[ASSAY].isin(["non-core", "ctcf"])
].copy()

# Map assays to categories
encode_non_core_df["assay_category"] = (
    encode_non_core_df["assay"].str.lower().replace(non_core_map)
)
ENCODE: 1170 non-core categories
ENCODE: 238 non-core categories mapped to functional categories.

Graph.

Code
assay_categories_order = [
    "trx_reg",
    "heterochrom",
    "polycomb",
    "splicing",
    "insulator",
    "other/mixed",
    "not_looked",
]

assay_epiclass_order = [
    "h3k27ac",
    "h3k4me3",
    "h3k4me1",
    "h3k9me3",
    "h3k27me3",
    "h3k36me3",
    "input",
]
assay_epiclass_order = {assay: i for i, assay in enumerate(assay_epiclass_order)}
pred_col = f"Predicted class ({ASSAY}_7c)"
max_pred_col = f"Max pred ({ASSAY}_7c)"

min_pred = 0.6
sub_df = encode_non_core_df[encode_non_core_df[max_pred_col] >= min_pred]
groupby = (
    sub_df.groupby(["assay_category", pred_col])
    .size()
    .reset_index(name="Count")
    .sort_values(["assay_category", "Count"], ascending=[True, False])
)
groupby["Percentage"] = groupby.groupby("assay_category")["Count"].transform(
    lambda x: (x / x.sum()) * 100
)

# Add order for plotting
groupby["assay_order"] = groupby[pred_col].map(assay_epiclass_order)
groupby = groupby.sort_values(
    ["assay_category", "assay_order"], ascending=[False, True]
)

# Main plot
fig = px.bar(
    groupby,
    x="assay_category",
    y="Percentage",
    color=pred_col,
    barmode="stack",
    category_orders={"assay_category": assay_categories_order},
    color_discrete_map=assay_colors,
    title=f"core7 predictions for non-core assays, predScore >= {min_pred:.2f}",
    labels={"Percentage": "Percentage (%)", "assay_category": "Assay Category"},
)

# Modify x-axis labels
total_counts = groupby.groupby("assay_category")["Count"].sum()
ticktext = [
    f"{assay_category} (N={total_counts[assay_category]})"
    for assay_category in assay_categories_order
]
fig.update_xaxes(tickvals=assay_categories_order, ticktext=ticktext)
fig.show()

Supplementary Figure 8C: Proportion of non-core ChIP-Seq datasets from ENCODE classified with high-confidence (>0.6) in the different functional categories predicted as one of the 7 core ChIP assays. Since none of the six histone modifications used for training are specifically associated with insulator factors, they were instead mostly predicted as Input.

E-F - Input-class probability and QC metrics

For full code, since the processing is more complex, see src/python/epiclass/utils/notebooks/paper/encode_chip_QC.ipynb (permalink).

For non-core targets, the target name and number of samples are visible on hover.

Supplementary Figure 8E-F: Spearman rank correlations per-target between EpiClass Input-class probability (from the ChIP-only classifier, 7 classes) and four ENCODE ChIP-seq quality control (QC) metrics: fraction of reads in peaks (FRiP), number of reproducible peaks, Jensen-Shannon distance from Input control (JSD), and normalized strand cross-correlation coefficient (NSC) for core histone marks (E) and non-core ChIP targets (F) including transcription factors, chromatin regulators, and RNA polymerase subunits. Each point represents one ChIP target, the boxplots summarize the distribution. Negative ρ indicates that higher Input-class probability is associated with lower QC metric values. The dashed line at ρ = 0 indicates no correlation. Boxplot dashed lines represent means, solid lines the medians, boxes the interquartile range, and whiskers the farthest points within 1.5× the interquartile range. H3K4me3 was excluded in panel E due to insufficient sample size after filtering (see Supplementary Methods).

Supplementary Figure 9 - Results on public datasets

EpiClass results on public datasets.

A - UpsetPlot of assay labels provided in 4 DBs

The following function gives a consensus label to each ChIP-Atlas sample, using the target values given by ChIP-Atlas, cistromeDB, NGS-QC and GEO.

The consensus description is based on the following rules:

  • “Identical” if all labels are the same
  • “Different” if at least one label is different
  • “1 source” if only one DB has a label
  • “Ignored - Potential non-core” if any label is not in the core assays
Code
def create_4DB_consensus_description(
    ca_core_df: pd.DataFrame, db_cols: List[str]
) -> pd.Series:
    """Create a description of the 4DB assay consensus labels.

    Treat "Unclassified" from Chip-Atlas as absent samples for the target consensus evaluation.

    The consensus description is based on the following rules:
    - "Identical" if all labels are the same
    - "Different" if at least one label is different
    - "1 source" if only one DB has a label
    - "Ignored - Potential non-core" if any label is not in the core assays

    Args:
        ca_core_df: ChIP-Atlas core7 DataFrame

    Returns:
        Series with the target consensus description
    """
    id_db_target = []
    tmp_df = ca_core_df.loc[:, db_cols].copy()
    tmp_df["C-A"].replace("unclassified", "----", inplace=True)

    for labels in tmp_df.values:
        missing_N = sum(label == "----" for label in labels)
        db_labels = set(labels)

        try:
            db_labels.remove("----")
        except KeyError:
            pass
        if any(label not in CORE_ASSAYS + ["ctrl"] for label in db_labels):
            id_db_target.append("Ignored - Potential non-core")
        elif missing_N == 3:
            id_db_target.append("1 source")
        elif len(db_labels) == 1:
            id_db_target.append("Identical")
        else:
            id_db_target.append("Different")

    return pd.Series(id_db_target, index=ca_core_df.index)

Define graphing function make_db_upsetplot.

Code
def make_db_upsetplot(
    df: pd.DataFrame, consensus_col: str, db_cols: List[str], title: str
) -> upsetplot.UpSet:
    """Make an upsetplot of the sample presence in the different databases."""
    df = df.copy()

    # Create a new DataFrame with boolean columns for each database
    upset_df = pd.DataFrame()
    for col in db_cols:
        upset_df[col] = df[col] != "----"
    upset_df[consensus_col] = df[consensus_col]

    # Set the index for the UpSet plot
    upset_df = upset_df.set_index(db_cols)

    # Create the UpSet plot
    upset = upsetplot.UpSet(
        upset_df,
        intersection_plot_elements=0,  # disable the default bar chart
        sort_by="cardinality",
        show_counts=True,  # type: ignore
        orientation="horizontal",
    )

    # Add stacked bars
    upset.add_stacked_bars(by=consensus_col, elements=15)

    # Plot and set title
    axes = upset.plot()
    plt.suptitle(title)
    axes["totals"].set_title("Total")
    plt.legend(loc="center right")
    plt.show()
    return upset

Graph.

Code
DB_COLS = ["GEO_target_mod", "C-A_target", "Cistrome_target", "NGS_target_mod"]
consensus_col = "core7_DBs_consensus"
title = "ChIP-Atlas core 7 samples presence in used DBs\nTarget Consensus - No EpiAtlas overlap"
make_db_upsetplot(
    df=no_epiatlas_df, consensus_col=consensus_col, db_cols=DB_COLS, title=title
)

print("Total for each category:")
display(no_epiatlas_df[consensus_col].value_counts(dropna=False))

Total for each category:
core7_DBs_consensus
Identical    41992
1 source      3320
Different      706
Name: count, dtype: int64

Supplementary Figure 9A: Comparison of the four public sources used to extract assay metadata after excluding ChIP-Atlas datasets present in EpiATLAS. The 1 source category corresponds to 3,320 datasets where the assay label was only available from GEO. A total of 706 datasets were labeled differently.

B - Core ChIP-Seq PCA

See PCA section for details on the methodology.

Supplementary Figure 9B: Distribution of the ChIP-Seq data from EpiATLAS and ENCODE over their two first principal components.

C - ChIP-Atlas mislabeled datasets

Supplementary Figure 9C: Genome browser representation of a few high-confidence potential mislabeled candidates from ChIP-AtlasGSE106660. This study includes WT and Knock-out (KO) conditions that were conducted in triplicates for a total of 78 ChIP-Seq experiments, including 6 datasets for H3K9me3 (WT1 and KO1) and 12 for H3K27ac (WT1-2 and KO1-2). The authors also generated and submitted to GEO 26 signal files combining the 3 biological replicates per condition. To discard potential errors during ChIP-Atlas reprocessing, we downloaded the raw data from GEO and reprocessed them on hg38 with the ChIP-Seq GenPipes version 6.1.049[PMC6559338]. The image is showing the combined signal generated by the authors (dark colors), the rep1 of each condition from ChIP-Atlas data, as well as reprocessed rep1 data (hatched colors) from GSE106660 (rep2 and 3 are not shown but are almost identical to rep1), as well as two representatives H3K9me3 from a different study at the bottom. This is demonstrating that the swap between H3K27ac and H3K9me3 occurred during submission of the raw data on GEO/SRA and only involved WT1 and KO1 as WT2 and KO2 signals are as expected.

D - Effect of EpiATLAS imputation on assay training and inference

Read metadata for both ChIP-Atlas and imputed samples.

Code
imputed_metadata_path = (
    base_metadata_dir
    / "epiatlas"
    / "imputed"
    / "hg38_epiatlas_imputed_pval_chip_2024-02.json"
)
metadata_imputed: pd.DataFrame = metadata_handler.load_any_metadata(imputed_metadata_path, as_dataframe=True)  # type: ignore

ca_metadata_path = base_metadata_dir / "chip_atlas" / "CA.full_info_metadata.freeze1.tsv"
metadata_ca = pd.read_csv(ca_metadata_path, sep="\t", low_memory=False)

Gather predictions from classifier training on observed core6 data (all pval, no input).

Code
data_dir = base_data_dir / "training_results" / "dfreeze_v2"
pred_dfs = {}  # Gather all 4 cases

observed_dir = (
    data_dir
    / "hg38_100kb_all_none"
    / "assay_epiclass_1l_3000n"
    / "chip-seq-only"
    / "complete_no_valid_oversample"
)
observed_inf_imputed_path = next((observed_dir / "predict_imputed").glob("*.csv"))
observed_inf_CA_path = next((observed_dir / "predict_C-A").glob("*.csv"))

basename = "observed_core6_pval_inf"

# Imputed preds
df = pd.read_csv(observed_inf_imputed_path, header=0, index_col=0, low_memory=False)
df = pd.merge(df, metadata_imputed, left_index=True, right_on="md5sum")
df["True class"] = df[ASSAY]

print(f"Imputed: {df.shape[0]} total files")
pred_dfs[f"{basename}_imputed"] = df

# C-A preds
df = pd.read_csv(observed_inf_CA_path, header=0, index_col=0, low_memory=False)
df = pd.merge(df, metadata_ca, left_index=True, right_on="ID")
df["True class"] = df["expected_assay"]
print(f"CA: {df.shape[0]} total files")

display(df["expected_assay"].value_counts(dropna=False))
pred_dfs[f"{basename}_C-A"] = df
Imputed: 9570 total files
CA: 29105 total files
expected_assay
h3k27ac     11316
h3k4me3      6501
h3k27me3     4319
h3k4me1      3177
h3k9me3      2228
h3k36me3     1564
Name: count, dtype: int64

Gather predictions from classifier training on imputed data (all pval, no input).

Code
imputed_dir = (
    data_dir
    / "hg38_100kb_all_none_imputed"
    / "assay_epiclass_1l_3000n"
    / "chip-seq-only"
    / "complete_no_valid_oversample"
)
imputed_inf_observed_path = next(
    (imputed_dir / "predict_epiatlas_pval_chip-seq").glob("*.csv")
)
imputed_inf_CA_path = next((imputed_dir / "predict_C-A").glob("*.csv"))

basename = "imputed_core6_pval_inf"

# Observed (non-imputed) data preds
df = pd.read_csv(imputed_inf_observed_path, header=0, index_col=0, low_memory=False)
df = pd.merge(df, metadata_v2_df, left_index=True, right_on="md5sum")
df["True class"] = df[ASSAY]

print(f"EpiATLAS pval ChIP: {df.shape[0]} total files")
pred_dfs[f"{basename}_obs_core6_pval"] = df

# C-A preds
df = pd.read_csv(imputed_inf_CA_path, header=0, index_col=0, low_memory=False)
df = pd.merge(df, metadata_ca, left_index=True, right_on="ID")
df["True class"] = df["expected_assay"]

print(f"CA: {df.shape[0]} total files")
pred_dfs[f"{basename}_C-A"] = df
EpiATLAS pval ChIP: 5337 total files
CA: 29105 total files

Compute accuracy per assay.

Code
core6_assays = ASSAY_ORDER[0:6]
rows = []

for name, df in pred_dfs.items():
    if "Max pred" not in df.columns:
        df["Max pred"] = df[core6_assays].max(axis=1)

    task_name = f"train_{name}"

    for label in core6_assays:
        assay_df = df[df["True class"] == label]

        for min_pred in ["0.0", "0.6", "0.9"]:
            sub_df = assay_df[assay_df["Max pred"] > float(min_pred)]
            acc = (sub_df["True class"] == sub_df["Predicted class"]).mean()
            rows.append([task_name, label, min_pred, acc, len(sub_df)])

df_acc_per_assay = pd.DataFrame(
    rows, columns=["task_name", "assay", "min_predScore", "acc", "nb_samples"]
)

Graphing results

Code
# Prepare data from graphing
df_acc_per_assay["scatter_name"] = (
    df_acc_per_assay["task_name"]
    .replace("train_", "", regex=True)
    .replace("imputed", "imp", regex=True)
    .replace("observed", "obs", regex=True)
)
df_acc_per_assay["inf_target"] = df_acc_per_assay["scatter_name"].str.split("_").str[-1]

df_acc_per_assay = df_acc_per_assay.sort_values(
    ["assay", "min_predScore", "scatter_name"]
)

graph_df = df_acc_per_assay.copy()
graph_df = graph_df.sort_values(["inf_target", "scatter_name"])

# Prepare boxplot data
tick_group = [
    "obs_core6_pval_inf_imp",
    "imp_core6_pval_inf_obs_core6_pval",
    "obs_core6_pval_inf_C-A",
    "imp_core6_pval_inf_C-A",
]
scatter_name_to_position = {name: i for i, name in enumerate(tick_group)}

min_pred_values = ["0.0", "0.6", "0.9"]
offset = [-0.25, 0, 0.25]  # Offset for each min_pred within a tick group

# Define jitter magnitude (like 0.05 for left/right spacing)
jitter = 0.05
jitter_offsets = [-jitter, 0, jitter]

min_predScore_color_map = {"0.0": "blue", "0.6": "orange", "0.9": "red"}

minY = 0.7
maxY = 1.005

# Plot each trace
fig = go.Figure()
for name in tick_group:
    group = graph_df[graph_df["scatter_name"] == name]

    for i, min_pred in enumerate(min_pred_values):
        df_subset = group[group["min_predScore"] == min_pred]

        x_position = scatter_name_to_position[name] + offset[i]
        x_positions = [x_position] * len(df_subset)
        y_values = df_subset["acc"]
        hover_texts = [
            f"{row['assay']}<br>Samples: {row['nb_samples']}"
            for _, row in df_subset.iterrows()
        ]
        colors = [assay_colors[assay] for assay in df_subset["assay"]]

        # Add box plot without points
        fig.add_trace(
            go.Box(
                x=x_positions,
                y=y_values,
                name=f"{name} - Min Pred Score: {min_pred}",
                line=dict(
                    color=min_predScore_color_map[min_pred],
                ),
                boxpoints="all",
                marker=dict(
                    opacity=0, size=1e-5
                ),  # hide points, so whiskers don't go to min/max
                boxmean=True,
                showlegend=False,
                hoverinfo="none",
            )
        )
        # Add scatter plot for individual points
        x_jittered = [
            x + jitter_offsets[i % len(jitter_offsets)] for i, x in enumerate(x_positions)
        ]

        # sort x/y together
        x_jittered, y_values, hover_texts = zip(
            *sorted(zip(x_jittered, y_values, hover_texts))
        )
        fig.add_trace(
            go.Scatter(
                x=x_jittered,
                y=y_values,
                mode="markers",
                marker=dict(color=colors, size=8, line=dict(color="Black", width=1)),
                name=f"{name} - Min Pred Score: {min_pred}",
                showlegend=False,
                text=hover_texts,
                hoverinfo="text+y",
            )
        )

# Update x-axis tick labels
ticktext = []
for tick in tick_group:
    train, inf = tick.split("_inf_")
    ticktext.append(f"<b>{train}</b> \u2192 <b>{inf}</b>")

fig.update_xaxes(tickmode="array", ticktext=ticktext, tickvals=list(range(len(ticktext))))

# Update layout
fig.update_layout(
    title="Accuracy per Task (6 core assays)",
    xaxis_title="Task (training data \u2192 inference data)",
    yaxis_title="Accuracy",
    showlegend=True,
    height=600,
    yaxis=dict(tickformat=".2%", range=[minY, maxY]),
)

# Add a legend for minPred colors
for val, color in min_predScore_color_map.items():
    fig.add_trace(
        go.Scatter(
            x=[None],
            y=[None],
            mode="markers",
            marker=dict(size=10, color=color, symbol="square"),
            name=f"Min Pred Score: {val}",
            showlegend=True,
        )
    )

# Add a legend for assay colors
for assay in sorted(core6_assays):
    color = assay_colors[assay]
    fig.add_trace(
        go.Scatter(
            x=[None],
            y=[None],
            mode="markers",
            marker=dict(size=10, color=color),
            name=assay,
            legendgroup="assays",
            showlegend=True,
        )
    )

# Add legend for obs and imp
fig.add_annotation(
    x=1.2,
    y=0.2,
    yref="paper",
    xref="paper",
    text="obs = observed<br>imp = imputed",
    showarrow=False,
    font=dict(size=14),
)

fig.show()

The number of samples associated with each point is visible on hover.

Supplementary Figure 9D: Accuracy comparison per assay (dots) between Assay classifiers trained on observed vs imputed data from EpiATLAS and applied to either EpiATLAS or ChIP-Atlas without prediction score threshold (brown), or with a threshold of 0.6 (light blue) or 0.9 (dark blue). These classifiers were both trained only using the p-value datasets as this is the only track type available for imputed data (therefore no Input assay) (Supplementary Table 14). Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.

E - RNA-Seq PCA

See PCA section for details on the methodology.

Supplementary Figure 9E: Distribution of a random selection of ~15% datasets from Recount3 along with all RNA-seq datasets from EpiATLAS and ENCODE data over their two first principal components.