---
title: "Application of EpiClass on public epigenomic data"
author: "Joanny Raby"
format:
html:
code-fold: true
code-tools: true
toc: true
toc-location: right
toc-expand: 2
embed-resources: true
engine: jupyter
execute:
echo: true
warning: false
eval: true
error: false
---
# Results section 3: Figures and their code
The formatting of the figures may differ slightly from those in the paper, but they display the same data points.
All code cells are folded by default. To view any cell, click **"Code"** to expand it, or use the code options near the main title above to unfold all at once.
Some figure-generation code is not included here because the interactive outputs would be too large (e.g. PCA figures). If a figure appears without accompanying code or explanation, it was created manually.
## Setup
Imports and paths setups.
```{python}
#| label: setup-imports
from __future__ import annotations
from pathlib import Path
from typing import List
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import skops.io as skio
import upsetplot
from IPython.display import display
import numpy as np
from epiclass.utils.notebooks.paper.paper_utilities import (
ASSAY,
ASSAY_ORDER,
IHECColorMap,
MetadataHandler,
SplitResultsHandler,
)
```
```{python}
#| label: setup-constants
CANCER = "harmonized_sample_cancer_high"
CORE_ASSAYS = ASSAY_ORDER[0:7]
```
```{python}
#| label: setup-paths
base_dir = Path.home() / "Projects/epiclass/output/paper"
paper_dir = base_dir
if not paper_dir.exists():
raise FileNotFoundError(f"Directory {paper_dir} does not exist.")
base_data_dir = base_dir / "data"
base_fig_dir = base_dir / "figures"
tables_dir = base_dir / "tables"
base_metadata_dir = base_data_dir / "metadata"
```
```{python}
#| label: setup-colors
IHECColorMap = IHECColorMap(base_fig_dir)
assay_colors = IHECColorMap.assay_color_map
cell_type_colors = IHECColorMap.cell_type_color_map
sex_colors = IHECColorMap.sex_color_map
```
```{python}
#| label: setup-handlers
split_results_handler = SplitResultsHandler()
metadata_handler = MetadataHandler(paper_dir)
metadata_v2 = metadata_handler.load_metadata("v2")
metadata_v2_df = metadata_handler.load_metadata_df("v2")
```
```{python}
#| label: setup-pred-dir
base_pred_dir = base_data_dir / "training_results" / "dfreeze_v2" / "predictions"
if not base_pred_dir.exists():
raise FileNotFoundError(f"Directory {base_pred_dir} does not exist.")
```
## Results context
Predictions on external databases (ENCODE, ChIP-Atlas and Recount3) were all performed using the same classifiers trained on the labeled EpiAtlas samples.
| Metadata category| Nb classes | Experiment Key (comet.com) | Training Files |
|------------------|--------------|--------------------------------------|----------|
| assay_epiclass | 7 | 69488630801b4a05a53b5d9e572f0aaa | 16788 |
| assay_epiclass | 11 | 0f8e5eb996114868a17057bebe64f87c | 20922 |
| harmonized_donor_sex | 3 | 4b908b83e0ec45c3ab991e65aa27af0c | 18299 |
| harmonized_donor_life_stage | 5 | 91214ed0b1664395b1826dc69a495ed4 | 15372 |
| harmonized_sample_cancer_high | 2 | 15da476b92f140eab818ece369248f4c | 20922 |
| harmonized_biomaterial_type | 4 | f42b6f4e147c4f1bbe378f3eed415099 | 20922 |
Classes:
- assay 7c: 6 h3k* histone marks + input
- assay 11c: assay7c + rna_seq + mrna_seq + wgbs_standard + wgbs_pbat
- harmonized_donor_sex: male, female, mixed
- harmonized_donor_life_stage: adult, child, newborn, fetal, embryonic
- harmonized_sample_cancer_high (modification of harmonized_sample_disease_high): cancer, non-cancer (healthy/None+disease)
- harmonized_biomaterial_type (biomat): cell line, primary cell, primary cell culture, primary tissue
Training metadata: Approximately `IHEC_sample_metadata_harmonization.v1.1.extended.csv`.
See `data/metadata/epiatlas/README.md` for metadata details, and `training_metadata_vs_official_v1.1.json` for exact difference of our training data and v1.1.
Additonally, for ENCODE, the `harmonized_sample_ontology_intermediate` model was used on a subset of files with known EpiATLAS biospecimens.
| Metadata category| Nb classes | Experiment Key (comet.com) | Training Files |
|-------------------------------------------|-------------|---------------------------------------|----------|
| harmonized_sample_ontology_intermediate | 16 | bb66b72ae83645d587e50b34aebb39c3 | 16379 |
The metadata for ENCODE predictions was created using the FILE + EXPERIMENT + BIOSAMPLE accessions, starting from filenames.
For the initial extraction process, see: `src/python/epiclass/utils/notebooks/paper/encode_metadata_creation.ipynb` ([permalink](https://github.com/labjacquespe/EpiClass/blob/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/notebooks/paper/encode_metadata_creation.ipynb)).
Final metadata file: `encode_full_metadata_2025-02_no_revoked.freeze1.csv(.xz)`
## Figure 3 and Supp Fig 8A,D - Public databases metrics {#sec-all-DB-metrics}
ChIP-Atlas, ENCODE and recount3 metrics at various prediction score thresholds were generated using the `MetricsPerAssay` class, from `src/python/epiclass/utils/notebooks/paper/metrics_per_assay.py` ([permalink](https://github.com/labjacquespe/EpiClass/blob/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/notebooks/paper/metrics_per_assay.py)).
([Folder permalink](https://github.com/labjacquespe/EpiClass/tree/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/notebooks/paper))
- ChIP-Atlas: `src/python/epiclass/utils/notebooks/paper/c-a_pred_analysis.ipynb` (section `Summary metrics by assay`)
- ENCODE: `src/python/epiclass/utils/notebooks/paper/encode_pred_analysis.ipynb` (section `All 5 tasks metrics summary`, run `Setup` section beforehand)
- recount3: `src/python/epiclass/utils/notebooks/paper/recount3_pred_analysis.ipynb`
Results were manually merged into Supplementary Table 9, and graphs were manually created using values from that table.
### Figure 3A - ENCODE predictions
Load public DBs prediction metrics summary.
```{python}
#| label: fig3a-read-summary
gen_path = tables_dir / "dfreeze_v2" / "predictions"
table_path = gen_path / "ST9-Public_DBs_predictions_summary.csv"
df_encode = pd.read_csv(table_path, low_memory=False, skiprows=1)
```
Filter for high-confidence (prediction score >= 0.6) ENCODE metrics
```{python}
#| label: fig3a-filter-summary
mask1 = df_encode["Dataset source"] == "ENCODE"
mask2 = df_encode["min_predScore"].astype(str) == "0.6" # 'high-confidence' threshold
mask3 = df_encode["Fig3A"] == "Y" # others might be redundant, but this is explicit
df_encode = df_encode[mask1 & mask2 & mask3]
df_encode["assay"].replace("avg-non-core", "non-core", inplace=True)
```
Define graphing function.
```{python}
#| label: fig3a-define-graph-function
def reproduce_encode_plot(df: pd.DataFrame, assay_colors: dict):
"""
Reproduces the ENCODE Core and non-core data boxplot with default right-side legend.
Expected columns in df: 'task_name', 'assay', 'acc', 'nb_samples'
Note: It assumes 'acc' is on a 0-100 scale. If it's 0-1, it scales it automatically.
"""
all_assays_ordered = ASSAY_ORDER + ["non-core"]
assay_colors["non-core"] = "rgb(255,0,255)"
# Scale accuracy to 0-100% if it seems to be in 0-1 range
if df["acc"].max() <= 1.0:
df = df.copy()
df["acc"] = df["acc"] * 100
# Order of categories on the x-axis
category_order = [
"assay_epiclass_9c",
"biospecimen_intermediate",
"sex",
"cancer_status",
"biomaterial_type",
"lifestage_merged",
]
cat_to_int = {cat: i for i, cat in enumerate(category_order)}
# Predefine a fixed horizontal offset for each assay
# Spread them evenly between -0.35 and 0.35 relative to the integer box center
offsets = np.linspace(-0.35, 0.35, num=len(all_assays_ordered))
assay_to_offset = {assay: offsets[i] for i, assay in enumerate(all_assays_ordered)}
fig = go.Figure()
# --- LAYER 1: The summary boxes (White fill, black borders, no points) ---
# We use mapped integer X values so it aligns with our custom float scatter coordinates later
fig.add_trace(
go.Box(
x=df["task_name"].map(cat_to_int),
y=df["acc"],
boxpoints="outliers", # Force 1.5x IQR whiskers
marker=dict(opacity=0, size=0.00001), # invisible dots
line=dict(color="black", width=1.5),
fillcolor="white", # White box over grey background
boxmean=True, # Adds dashed line for the mean
showlegend=False, # Hide this underlying trace from the legend
name="Summary",
hoverinfo="skip", # We will handle hover info with the scatter points instead
)
)
# --- LAYER 2: The colored scatter points ---
# Iterate using the ordered list to preserve legend order
for assay in all_assays_ordered:
df_assay = df[df["assay"] == assay].copy()
if df_assay.empty:
print(f"Warning: No data for assay '{assay}'. Skipping.")
continue
# Calculate precise X positions: Integer category center + Assay-specific offset
base_x = df_assay["task_name"].map(cat_to_int)
exact_x = base_x + assay_to_offset[assay]
# Calculate dynamic point sizes based on log10 of sample size
point_sizes = 2 + 4 * np.log10(df_assay["nb_samples"])
fig.add_trace(
go.Scatter(
x=exact_x,
y=df_assay["acc"],
mode="markers",
name=assay,
marker=dict(
color=assay_colors[assay],
size=point_sizes,
opacity=1, # Non unique size removes default 1 opacity, gotta set it back to 1
line=dict(color="black", width=1),
),
showlegend=True,
customdata=df_assay[["nb_samples"]],
hovertemplate=(
"<b>Assay:</b> " + assay + "<br>"
"<b>Accuracy:</b> %{y:.1f}%<br>"
"<b>Samples:</b> %{customdata[0]}<br>"
"<extra></extra>"
),
)
)
# --- LAYER 3: Aesthetics and Layout ---
def calculate_marker_size(n):
return 2 + 4 * np.log10(np.clip(n, a_min=1, a_max=None))
reference_sizes = [50, 100, 500, 1000, 5000]
for n in reference_sizes:
fig.add_trace(
go.Scatter(
x=[None],
y=[None], # Invisible coordinates, only renders in legend
mode="markers",
name=f"{n:,}", # e.g., "1,000"
legend="legend2", # Assign to the SECOND legend
marker=dict(
color="black", # Fully black reference circles
size=calculate_marker_size(n),
opacity=1,
line=dict(color="black", width=1),
),
showlegend=True,
hoverinfo="skip", # No tooltip needed for fake traces
)
)
fig.update_layout(
boxmode="overlay", # Crucial: ensures the invisible point boxes sit directly on top of the boxes
title_font_size=20,
yaxis_title="Accuracy (%)",
xaxis=dict(
title="Classifier",
title_font_size=18,
tickfont_size=14,
# Map the integers back to the original task_name values!
tickmode="array",
tickvals=list(range(len(category_order))),
ticktext=category_order,
range=[-0.6, len(category_order) - 0.4],
),
legend=dict(
title_text="Assay",
itemsizing="constant",
),
legend2=dict(
title_text="Samples",
itemsizing="trace",
yanchor="top",
y=0.2, # Position Middle-Right (below main legend)
xanchor="left",
x=1.02,
),
)
fig.show()
```
```{python}
#| label: fig3a-generate-plot
reproduce_encode_plot(df_encode, assay_colors)
```
Figure 3A: Accuracy of each metadata classifier breakdown per assay (dots) of the high-confidence predictions on the ENCODE data for which metadata was available, excluding those in EpiATLAS (Supplementary Table 9). Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.
### Figure 3B - ChIP-Atlas assay (7classes) prediction
The following code shows how the various fractions presented in the graph are obtained.
\
\
Read prediction file.
```{python}
#| label: fig3b-read-chipatlas-preds
ca_preds_dir = tables_dir / "dfreeze_v2" / "predictions"
ca_preds_path = (
ca_preds_dir / "ChIP-Atlas_predictions_20240606_merge_metadata_freeze1.csv.xz"
)
ca_preds_df = pd.read_csv(ca_preds_path, sep=",", low_memory=False, compression="xz")
print(f"ChIP-Atlas: {ca_preds_df.shape[0]} total files")
```
Ignoring potentially non-core files.
```{python}
#| label: fig3b-filter-non-core
DB_COLS = ["GEO_target_mod", "C-A_target", "Cistrome_target", "NGS_target_mod"]
acceptable_targets = ['h3k4me3', 'h3k36me3', 'h3k27ac', 'h3k27me3', 'h3k9me3', 'h3k4me1', 'input', "ctrl", "----", "unclassified"]
# Uniformize just in case
ca_preds_df[DB_COLS] = ca_preds_df[DB_COLS].fillna("----").apply(lambda x: x.str.lower())
ca_core7_df = ca_preds_df[
ca_preds_df[DB_COLS]
.isin(acceptable_targets)
.all(axis=1)
]
diff = ca_preds_df.shape[0] - ca_core7_df.shape[0]
print(f"{diff} potentially non-core removed.")
print(f"{ca_core7_df.shape[0]} files remaining.")
display(ca_core7_df["target_majority_consensus"].value_counts(dropna=False))
```
Removing ChIP-Atlas experiment overlap with EpiATLAS
```{python}
#| label: fig3b-remove-epiatlas-overlap
print(f"ChIP-Atlas: Initial {ca_core7_df.shape[0]} files")
no_epiatlas_df = ca_core7_df[ca_core7_df["is_EpiAtlas_EpiRR"] == "0"]
diff = ca_core7_df.shape[0] - no_epiatlas_df.shape[0]
print(f"ChIP-Atlas: {diff} files with EpiAtlas EpiRR removed")
print(f"ChIP-Atlas: {no_epiatlas_df.shape[0]} files without EpiAtlas EpiRR")
```
{fig-align="center"}
Figure 3B: Inference of the Assay classifier on the ChIP-Atlas datasets represented as a multi-layer donut chart where the internal layer is depicting the availability status of the metadata as being provided/extracted (yellow) vs missing (grey), the central layer depicting the classifier prediction matching (green) or not (red) the expected metadata label, and the external layer containing the confidence level where prediction scores above 0.6 correspond to high-confidence predictions (light to dark blue >0.6, >0.8 and >0.9) vs low confidence predictions (brown). The high-confidence mismatching predictions were further divided between datasets predicted as Input (light orange) vs potential mislabels (dark red).
#### High-confidence predictions
```{python}
#| label: fig3b-high-confidence-preds
total_N = no_epiatlas_df.shape[0]
high_confidence_pred_df = no_epiatlas_df[no_epiatlas_df["Max_pred_assay7"] >= 0.6]
high_confidence_N = high_confidence_pred_df.shape[0]
low_confidence_N = total_N - high_confidence_N
N_percent = high_confidence_N / total_N
print(
f"{high_confidence_N}/{total_N} files ({high_confidence_N/total_N:.2%}) with high confidence assay prediction\n"
f"{low_confidence_N}/{total_N} files ({low_confidence_N/total_N:.2%}) with low confidence assay prediction"
)
```
#### Match between manual target consensus and MLP prediction
```{python}
#| label: fig3b-calculate-match
total_N = high_confidence_pred_df.shape[0]
match_rule = (
high_confidence_pred_df[DB_COLS].eq(
high_confidence_pred_df["Predicted_class_assay7"], axis=0
).any(axis=1)
)
match_df = high_confidence_pred_df[match_rule]
mismatch_df = high_confidence_pred_df[~match_rule]
agreement_N = match_df.shape[0]
print(
f"ChIP-Atlas: {agreement_N}/{total_N} files ({agreement_N / total_N:.2%}) with agreement between predicted assay and any of the 4 sources"
)
```
#### Mismatch breakdown
```{python}
#| label: fig3b-mismatch-breakdown
total_mismatch = mismatch_df.shape[0]
input_rule = mismatch_df["Predicted_class_assay7"] == "input"
input_pred_N = input_rule.sum()
print(
f"ChIP-Atlas: {input_pred_N}/{total_mismatch} files ({input_pred_N / total_mismatch:.2%}) with mismatch predicted as input\n"
f"ChIP-Atlas: {total_mismatch-input_pred_N}/{total_mismatch} files ({(total_mismatch-input_pred_N) / total_mismatch:.2%}) potential mislabels\n"
)
print("ChIP-Atlas: Distribution of consensus assay labels among the mismatches:")
true_mismatch_df = mismatch_df[~input_rule]
display(true_mismatch_df["core7_DBs_consensus"].value_counts(dropna=False))
```
### Figure 3C-J - ChIP-Atlas and recount3 other predictions
{width=100%}
Figure 3C-J: Inference of the donor Sex (C,G), sample Cancer status (D,H), Biomaterial type (E,I) and donor Life stage (F,J) classifiers on the ChIP-Atlas (C-F) and Recount3 (G-J) sources represented as multi-layer donut charts as in panel B.
## Supplementary Figure 6 - Prediction score thresholds on predictions from various DBs
See [Additional Supplementary Figures page](./sec2-supp.html#sec-supp6).
Supplementary figures 5 and 7 are relevant to section 2 of results and not included here.
## PCAs - Supp Fig8-9 {#sec-pca}
The following applies to supplementary figures 8B, 9B and 9E.
PCAs were computed via `src/python/epiclass/utils/compute_pca.py` ([permalink](https://github.com/labjacquespe/EpiClass/blob/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/compute_pca.py)), using `IncrementalPCA` from `scikit-learn`.
Graphing was done using code similar to `src/python/epiclass/utils/notebooks/paper/pca_plot.ipynb` ([permalink](https://github.com/labjacquespe/EpiClass/blob/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/notebooks/paper/pca_plot.ipynb)), which uses the output of `compute_pca.py`.
## Supplementary Figure 8 - ENCODE predictions
EpiClass results on ENCODE datasets.
### A,D - Core and non-core accuracy
See [public databases metrics section](./sec3.html#sec-all-DB-metrics) for details.
{width=100%}
Supplementary Figure 8A,D: Inference of the Assay, Biospecimen source, donor Sex, sample Cancer status, Biomaterial type and donor Life stage classifiers on the ENCODE datasets from core assays (histones ChIP-Seq, RNA-Seq, WGBS) represented as multi-layer donut charts as in Fig. 3B. The Biospecimen source classifier was only applied on datasets from the 13 major biospecimen sources out of the 16 found in EpiATLAS training data (N = 1215 datasets). Panel D is as panel A but for non-core ENCODE datasets.
### B - All assays PCA
See [PCA section](./sec3.html#sec-pca) for details on the methodology.
{fig-align="center"}
Supplementary Figure 8B: Distribution of the EpiATLAS and ENCODE data over their two first principal components.
### C - Non-core predictions with assay category mapping
Load ENCODE predictions.
```{python}
#| label: supp-fig8c-read-encode-preds
encode_preds_dir = base_pred_dir / "encode"
encode_preds_path = (
encode_preds_dir / "complete_encode_predictions_augmented_2025-02_metadata.csv.gz"
)
encode_preds_df = pd.read_csv(
encode_preds_path, sep=",", low_memory=False, compression="gzip"
)
print(f"ENCODE: {encode_preds_df.shape[0]} total files")
```
Filter out overlap with EpiATLAS EpiRRs.
```{python}
#| label: supp-fig8c-filter-no-epiatlas
encode_preds_df = encode_preds_df[encode_preds_df["in_epiatlas"].astype(str) == "False"]
print(f"ENCODE: {encode_preds_df.shape[0]} total files with no EpiAtlas overlap")
```
Load ENCODE manually created assay categories, and merge them with ENCODE predictions data.
```{python}
#| label: supp-fig8c-load-categories
encode_meta_dir = base_data_dir / "metadata" / "encode"
non_core_categories_path = (
encode_meta_dir / "non-core_encode_assay_category_2024-08-29.csv"
)
non_core_categories_df = pd.read_csv(non_core_categories_path, sep=",", low_memory=False)
print(f"ENCODE: {non_core_categories_df.shape[0]} non-core categories")
non_core_map = non_core_categories_df.set_index("target").to_dict()["Assay category"]
N_mapped = len([val for val in non_core_map.values() if val != "not_looked"])
print(f"ENCODE: {N_mapped} non-core categories mapped to functional categories.")
# Select non-core samples
encode_non_core_df = encode_preds_df[
encode_preds_df[ASSAY].isin(["non-core", "ctcf"])
].copy()
# Map assays to categories
encode_non_core_df["assay_category"] = (
encode_non_core_df["assay"].str.lower().replace(non_core_map)
)
```
Graph.
```{python}
#| label: supp-fig8c-generate-plot
#| column: body
assay_categories_order = [
"trx_reg",
"heterochrom",
"polycomb",
"splicing",
"insulator",
"other/mixed",
"not_looked",
]
assay_epiclass_order = [
"h3k27ac",
"h3k4me3",
"h3k4me1",
"h3k9me3",
"h3k27me3",
"h3k36me3",
"input",
]
assay_epiclass_order = {assay: i for i, assay in enumerate(assay_epiclass_order)}
pred_col = f"Predicted class ({ASSAY}_7c)"
max_pred_col = f"Max pred ({ASSAY}_7c)"
min_pred = 0.6
sub_df = encode_non_core_df[encode_non_core_df[max_pred_col] >= min_pred]
groupby = (
sub_df.groupby(["assay_category", pred_col])
.size()
.reset_index(name="Count")
.sort_values(["assay_category", "Count"], ascending=[True, False])
)
groupby["Percentage"] = groupby.groupby("assay_category")["Count"].transform(
lambda x: (x / x.sum()) * 100
)
# Add order for plotting
groupby["assay_order"] = groupby[pred_col].map(assay_epiclass_order)
groupby = groupby.sort_values(
["assay_category", "assay_order"], ascending=[False, True]
)
# Main plot
fig = px.bar(
groupby,
x="assay_category",
y="Percentage",
color=pred_col,
barmode="stack",
category_orders={"assay_category": assay_categories_order},
color_discrete_map=assay_colors,
title=f"core7 predictions for non-core assays, predScore >= {min_pred:.2f}",
labels={"Percentage": "Percentage (%)", "assay_category": "Assay Category"},
)
# Modify x-axis labels
total_counts = groupby.groupby("assay_category")["Count"].sum()
ticktext = [
f"{assay_category} (N={total_counts[assay_category]})"
for assay_category in assay_categories_order
]
fig.update_xaxes(tickvals=assay_categories_order, ticktext=ticktext)
fig.show()
```
Supplementary Figure 8C: Proportion of non-core ChIP-Seq datasets from ENCODE classified with high-confidence (>0.6) in the different functional categories predicted as one of the 7 core ChIP assays. Since none of the six histone modifications used for training are specifically associated with insulator factors, they were instead mostly predicted as Input.
### E-F - Input-class probability and QC metrics
For full code, since the processing is more complex, see `src/python/epiclass/utils/notebooks/paper/encode_chip_QC.ipynb` ([permalink](https://github.com/labjacquespe/EpiClass/blob/974e78cfdacd24bd6045fa595ace8981a0431385/src/python/epiclass/utils/notebooks/paper/encode_chip_QC.ipynb)).
<iframe src="../resources/encode_qc_correlation_boxplots_cdn.html"
scrolling="no"
style="display: block; width: 100%; aspect-ratio: 2 / 1; border: 0; overflow: hidden;">
</iframe>
For non-core targets, the target name and number of samples are visible on hover.
Supplementary Figure 8E-F: Spearman rank correlations per-target between EpiClass Input-class probability (from the ChIP-only classifier, 7 classes) and four ENCODE ChIP-seq quality control (QC) metrics: fraction of reads in peaks (FRiP), number of reproducible peaks, Jensen-Shannon distance from Input control (JSD), and normalized strand cross-correlation coefficient (NSC) for core histone marks (E) and non-core ChIP targets (F) including transcription factors, chromatin regulators, and RNA polymerase subunits. Each point represents one ChIP target, the boxplots summarize the distribution. Negative ρ indicates that higher Input-class probability is associated with lower QC metric values. The dashed line at ρ = 0 indicates no correlation. Boxplot dashed lines represent means, solid lines the medians, boxes the interquartile range, and whiskers the farthest points within 1.5× the interquartile range. H3K4me3 was excluded in panel E due to insufficient sample size after filtering (see Supplementary Methods).
## Supplementary Figure 9 - Results on public datasets
EpiClass results on public datasets.
### A - UpsetPlot of assay labels provided in 4 DBs
The following function gives a consensus label to each ChIP-Atlas sample, using the target values given by ChIP-Atlas, cistromeDB, NGS-QC and GEO.
The consensus description is based on the following rules:
- "Identical" if all labels are the same
- "Different" if at least one label is different
- "1 source" if only one DB has a label
- "Ignored - Potential non-core" if any label is not in the core assays
```{python}
#| label: supp-fig9a-define-consensus-function
def create_4DB_consensus_description(
ca_core_df: pd.DataFrame, db_cols: List[str]
) -> pd.Series:
"""Create a description of the 4DB assay consensus labels.
Treat "Unclassified" from Chip-Atlas as absent samples for the target consensus evaluation.
The consensus description is based on the following rules:
- "Identical" if all labels are the same
- "Different" if at least one label is different
- "1 source" if only one DB has a label
- "Ignored - Potential non-core" if any label is not in the core assays
Args:
ca_core_df: ChIP-Atlas core7 DataFrame
Returns:
Series with the target consensus description
"""
id_db_target = []
tmp_df = ca_core_df.loc[:, db_cols].copy()
tmp_df["C-A"].replace("unclassified", "----", inplace=True)
for labels in tmp_df.values:
missing_N = sum(label == "----" for label in labels)
db_labels = set(labels)
try:
db_labels.remove("----")
except KeyError:
pass
if any(label not in CORE_ASSAYS + ["ctrl"] for label in db_labels):
id_db_target.append("Ignored - Potential non-core")
elif missing_N == 3:
id_db_target.append("1 source")
elif len(db_labels) == 1:
id_db_target.append("Identical")
else:
id_db_target.append("Different")
return pd.Series(id_db_target, index=ca_core_df.index)
```
Define graphing function `make_db_upsetplot`.
```{python}
#| label: supp-fig9a-define-upsetplot-function
def make_db_upsetplot(
df: pd.DataFrame, consensus_col: str, db_cols: List[str], title: str
) -> upsetplot.UpSet:
"""Make an upsetplot of the sample presence in the different databases."""
df = df.copy()
# Create a new DataFrame with boolean columns for each database
upset_df = pd.DataFrame()
for col in db_cols:
upset_df[col] = df[col] != "----"
upset_df[consensus_col] = df[consensus_col]
# Set the index for the UpSet plot
upset_df = upset_df.set_index(db_cols)
# Create the UpSet plot
upset = upsetplot.UpSet(
upset_df,
intersection_plot_elements=0, # disable the default bar chart
sort_by="cardinality",
show_counts=True, # type: ignore
orientation="horizontal",
)
# Add stacked bars
upset.add_stacked_bars(by=consensus_col, elements=15)
# Plot and set title
axes = upset.plot()
plt.suptitle(title)
axes["totals"].set_title("Total")
plt.legend(loc="center right")
plt.show()
return upset
```
Graph.
```{python}
#| label: supp-fig9a-generate-upsetplot
#| fig-align: center
DB_COLS = ["GEO_target_mod", "C-A_target", "Cistrome_target", "NGS_target_mod"]
consensus_col = "core7_DBs_consensus"
title = "ChIP-Atlas core 7 samples presence in used DBs\nTarget Consensus - No EpiAtlas overlap"
make_db_upsetplot(
df=no_epiatlas_df, consensus_col=consensus_col, db_cols=DB_COLS, title=title
)
print("Total for each category:")
display(no_epiatlas_df[consensus_col].value_counts(dropna=False))
```
Supplementary Figure 9A: Comparison of the four public sources used to extract assay metadata after excluding ChIP-Atlas datasets present in EpiATLAS. The 1 source category corresponds to 3,320 datasets where the assay label was only available from GEO. A total of 706 datasets were labeled differently.
### B - Core ChIP-Seq PCA
See [PCA section](./sec3.html#sec-pca) for details on the methodology.
{fig-align="center"}
Supplementary Figure 9B: Distribution of the ChIP-Seq data from EpiATLAS and ENCODE over their two first principal components.
### C - ChIP-Atlas mislabeled datasets
{height="1000px" fig-align="center"}
Supplementary Figure 9C: Genome browser representation of a few high-confidence potential mislabeled candidates from ChIP-AtlasGSE106660. This study includes WT and Knock-out (KO) conditions that were conducted in triplicates for a total of 78 ChIP-Seq experiments, including 6 datasets for H3K9me3 (WT1 and KO1) and 12 for H3K27ac (WT1-2 and KO1-2). The authors also generated and submitted to GEO 26 signal files combining the 3 biological replicates per condition. To discard potential errors during ChIP-Atlas reprocessing, we downloaded the raw data from GEO and reprocessed them on hg38 with the ChIP-Seq GenPipes version 6.1.049[PMC6559338]. The image is showing the combined signal generated by the authors (dark colors), the rep1 of each condition from ChIP-Atlas data, as well as reprocessed rep1 data (hatched colors) from GSE106660 (rep2 and 3 are not shown but are almost identical to rep1), as well as two representatives H3K9me3 from a different study at the bottom. This is demonstrating that the swap between H3K27ac and H3K9me3 occurred during submission of the raw data on GEO/SRA and only involved WT1 and KO1 as WT2 and KO2 signals are as expected.
### D - Effect of EpiATLAS imputation on assay training and inference
Read metadata for both ChIP-Atlas and imputed samples.
```{python}
#| label: supp-fig9d-setup-imputed-paths
imputed_metadata_path = (
base_metadata_dir
/ "epiatlas"
/ "imputed"
/ "hg38_epiatlas_imputed_pval_chip_2024-02.json"
)
metadata_imputed: pd.DataFrame = metadata_handler.load_any_metadata(imputed_metadata_path, as_dataframe=True) # type: ignore
ca_metadata_path = base_metadata_dir / "chip_atlas" / "CA.full_info_metadata.freeze1.tsv"
metadata_ca = pd.read_csv(ca_metadata_path, sep="\t", low_memory=False)
```
Gather predictions from classifier training on observed core6 data (all pval, no input).
```{python}
#| label: supp-fig9d-gather-observed-preds
data_dir = base_data_dir / "training_results" / "dfreeze_v2"
pred_dfs = {} # Gather all 4 cases
observed_dir = (
data_dir
/ "hg38_100kb_all_none"
/ "assay_epiclass_1l_3000n"
/ "chip-seq-only"
/ "complete_no_valid_oversample"
)
observed_inf_imputed_path = next((observed_dir / "predict_imputed").glob("*.csv"))
observed_inf_CA_path = next((observed_dir / "predict_C-A").glob("*.csv"))
basename = "observed_core6_pval_inf"
# Imputed preds
df = pd.read_csv(observed_inf_imputed_path, header=0, index_col=0, low_memory=False)
df = pd.merge(df, metadata_imputed, left_index=True, right_on="md5sum")
df["True class"] = df[ASSAY]
print(f"Imputed: {df.shape[0]} total files")
pred_dfs[f"{basename}_imputed"] = df
# C-A preds
df = pd.read_csv(observed_inf_CA_path, header=0, index_col=0, low_memory=False)
df = pd.merge(df, metadata_ca, left_index=True, right_on="ID")
df["True class"] = df["expected_assay"]
print(f"CA: {df.shape[0]} total files")
display(df["expected_assay"].value_counts(dropna=False))
pred_dfs[f"{basename}_C-A"] = df
```
Gather predictions from classifier training on imputed data (all pval, no input).
```{python}
#| label: supp-fig9d-gather-imputed-training-preds
imputed_dir = (
data_dir
/ "hg38_100kb_all_none_imputed"
/ "assay_epiclass_1l_3000n"
/ "chip-seq-only"
/ "complete_no_valid_oversample"
)
imputed_inf_observed_path = next(
(imputed_dir / "predict_epiatlas_pval_chip-seq").glob("*.csv")
)
imputed_inf_CA_path = next((imputed_dir / "predict_C-A").glob("*.csv"))
basename = "imputed_core6_pval_inf"
# Observed (non-imputed) data preds
df = pd.read_csv(imputed_inf_observed_path, header=0, index_col=0, low_memory=False)
df = pd.merge(df, metadata_v2_df, left_index=True, right_on="md5sum")
df["True class"] = df[ASSAY]
print(f"EpiATLAS pval ChIP: {df.shape[0]} total files")
pred_dfs[f"{basename}_obs_core6_pval"] = df
# C-A preds
df = pd.read_csv(imputed_inf_CA_path, header=0, index_col=0, low_memory=False)
df = pd.merge(df, metadata_ca, left_index=True, right_on="ID")
df["True class"] = df["expected_assay"]
print(f"CA: {df.shape[0]} total files")
pred_dfs[f"{basename}_C-A"] = df
```
Compute accuracy per assay.
```{python}
#| label: supp-fig9d-compute-accuracy
core6_assays = ASSAY_ORDER[0:6]
rows = []
for name, df in pred_dfs.items():
if "Max pred" not in df.columns:
df["Max pred"] = df[core6_assays].max(axis=1)
task_name = f"train_{name}"
for label in core6_assays:
assay_df = df[df["True class"] == label]
for min_pred in ["0.0", "0.6", "0.9"]:
sub_df = assay_df[assay_df["Max pred"] > float(min_pred)]
acc = (sub_df["True class"] == sub_df["Predicted class"]).mean()
rows.append([task_name, label, min_pred, acc, len(sub_df)])
df_acc_per_assay = pd.DataFrame(
rows, columns=["task_name", "assay", "min_predScore", "acc", "nb_samples"]
)
```
Graphing results
```{python}
#| label: supp-fig9d-generate-imputation-plot
#| column: body
# Prepare data from graphing
df_acc_per_assay["scatter_name"] = (
df_acc_per_assay["task_name"]
.replace("train_", "", regex=True)
.replace("imputed", "imp", regex=True)
.replace("observed", "obs", regex=True)
)
df_acc_per_assay["inf_target"] = df_acc_per_assay["scatter_name"].str.split("_").str[-1]
df_acc_per_assay = df_acc_per_assay.sort_values(
["assay", "min_predScore", "scatter_name"]
)
graph_df = df_acc_per_assay.copy()
graph_df = graph_df.sort_values(["inf_target", "scatter_name"])
# Prepare boxplot data
tick_group = [
"obs_core6_pval_inf_imp",
"imp_core6_pval_inf_obs_core6_pval",
"obs_core6_pval_inf_C-A",
"imp_core6_pval_inf_C-A",
]
scatter_name_to_position = {name: i for i, name in enumerate(tick_group)}
min_pred_values = ["0.0", "0.6", "0.9"]
offset = [-0.25, 0, 0.25] # Offset for each min_pred within a tick group
# Define jitter magnitude (like 0.05 for left/right spacing)
jitter = 0.05
jitter_offsets = [-jitter, 0, jitter]
min_predScore_color_map = {"0.0": "blue", "0.6": "orange", "0.9": "red"}
minY = 0.7
maxY = 1.005
# Plot each trace
fig = go.Figure()
for name in tick_group:
group = graph_df[graph_df["scatter_name"] == name]
for i, min_pred in enumerate(min_pred_values):
df_subset = group[group["min_predScore"] == min_pred]
x_position = scatter_name_to_position[name] + offset[i]
x_positions = [x_position] * len(df_subset)
y_values = df_subset["acc"]
hover_texts = [
f"{row['assay']}<br>Samples: {row['nb_samples']}"
for _, row in df_subset.iterrows()
]
colors = [assay_colors[assay] for assay in df_subset["assay"]]
# Add box plot without points
fig.add_trace(
go.Box(
x=x_positions,
y=y_values,
name=f"{name} - Min Pred Score: {min_pred}",
line=dict(
color=min_predScore_color_map[min_pred],
),
boxpoints="all",
marker=dict(
opacity=0, size=1e-5
), # hide points, so whiskers don't go to min/max
boxmean=True,
showlegend=False,
hoverinfo="none",
)
)
# Add scatter plot for individual points
x_jittered = [
x + jitter_offsets[i % len(jitter_offsets)] for i, x in enumerate(x_positions)
]
# sort x/y together
x_jittered, y_values, hover_texts = zip(
*sorted(zip(x_jittered, y_values, hover_texts))
)
fig.add_trace(
go.Scatter(
x=x_jittered,
y=y_values,
mode="markers",
marker=dict(color=colors, size=8, line=dict(color="Black", width=1)),
name=f"{name} - Min Pred Score: {min_pred}",
showlegend=False,
text=hover_texts,
hoverinfo="text+y",
)
)
# Update x-axis tick labels
ticktext = []
for tick in tick_group:
train, inf = tick.split("_inf_")
ticktext.append(f"<b>{train}</b> \u2192 <b>{inf}</b>")
fig.update_xaxes(tickmode="array", ticktext=ticktext, tickvals=list(range(len(ticktext))))
# Update layout
fig.update_layout(
title="Accuracy per Task (6 core assays)",
xaxis_title="Task (training data \u2192 inference data)",
yaxis_title="Accuracy",
showlegend=True,
height=600,
yaxis=dict(tickformat=".2%", range=[minY, maxY]),
)
# Add a legend for minPred colors
for val, color in min_predScore_color_map.items():
fig.add_trace(
go.Scatter(
x=[None],
y=[None],
mode="markers",
marker=dict(size=10, color=color, symbol="square"),
name=f"Min Pred Score: {val}",
showlegend=True,
)
)
# Add a legend for assay colors
for assay in sorted(core6_assays):
color = assay_colors[assay]
fig.add_trace(
go.Scatter(
x=[None],
y=[None],
mode="markers",
marker=dict(size=10, color=color),
name=assay,
legendgroup="assays",
showlegend=True,
)
)
# Add legend for obs and imp
fig.add_annotation(
x=1.2,
y=0.2,
yref="paper",
xref="paper",
text="obs = observed<br>imp = imputed",
showarrow=False,
font=dict(size=14),
)
fig.show()
```
The number of samples associated with each point is visible on hover.
Supplementary Figure 9D: Accuracy comparison per assay (dots) between Assay classifiers trained on observed vs imputed data from EpiATLAS and applied to either EpiATLAS or ChIP-Atlas without prediction score threshold (brown), or with a threshold of 0.6 (light blue) or 0.9 (dark blue). These classifiers were both trained only using the p-value datasets as this is the only track type available for imputed data (therefore no Input assay) (Supplementary Table 14). Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.
### E - RNA-Seq PCA
See [PCA section](./sec3.html#sec-pca) for details on the methodology.
{fig-align="center"}
Supplementary Figure 9E: Distribution of a random selection of ~15% datasets from Recount3 along with all RNA-seq datasets from EpiATLAS and ENCODE data over their two first principal components.