Add ModelSEED-retrained dGPredictor as an additive reaction-energy source #264

Open
freiburgermsu wants to merge 1 commit into ModelSEED:dev from freiburgermsu:dgpredictor-modelseed-retrained-energies

@freiburgermsu

Member

Summary

Adds the ModelSEED-retrained dGPredictor as its own additive per-method reaction-energy source, dGPredictor-ModelSEED, alongside the existing Group contribution / eQuilibrator / dGPredictor records. The change is purely additive: the original KEGG-based dGPredictor record is left untouched, the canonical deltag/deltagerr/reversibility values do not change, and no .tsv or compound files change (thermodynamics is a JSON-only field). This continues the additive per-source philosophy of #263.

What this is

dGPredictor (Wang et al. 2021) as shipped under Biochemistry/Thermodynamics/dGPredictor/ was trained on KEGG compound structures. dGPredictor-ModelSEED is the same model retrained on the ModelSEED compound structures: every ModelSEED compound carrying a complete structure is re-decomposed into atom-centered fragments (radius 1 and 2), which expands the group vocabulary, and the BayesianRidge model is refit on the same 4,001 experimental measurements remapped into ModelSEED ID space. It predicts dG for 31,924 reactions (pH 7, ionic strength 0.25 M, 298.15 K), including ~11,400 reactions the original KEGG-based model could not reach because their compounds have no KEGG cross-reference.

Each reaction now carries both dGPredictor estimates side-by-side, e.g. rxn00001:

"thermodynamics": {
    "Group contribution":    [4.15, 1.22, "="],
    "eQuilibrator":          [-3.46, 0.05, ">"],
    "dGPredictor":           [-3.82, 0.02, ">"],
    "dGPredictor-ModelSEED": [-3.77, 0.87, ">"]
}

New-coverage reactions (e.g. rxn00013) carry a dGPredictor-ModelSEED record but no original dGPredictor record, and their canonical deltag is still left untouched.
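As a minimal sketch of how a consumer might read these side-by-side records, the snippet below compares the two dGPredictor estimates and flags new-coverage reactions. The helper names `is_new_coverage` and `dgpredictor_gap` are hypothetical, and the second record (its ID and values) is a placeholder rather than real repository data. Stored energies are assumed to be in kcal/mol, per the conversion described in this PR.

```python
# Records shaped like the `thermodynamics` example above:
# each source name maps to [energy, error, operator].
reactions = {
    "rxn00001": {
        "Group contribution": [4.15, 1.22, "="],
        "eQuilibrator": [-3.46, 0.05, ">"],
        "dGPredictor": [-3.82, 0.02, ">"],
        "dGPredictor-ModelSEED": [-3.77, 0.87, ">"],
    },
    # Hypothetical new-coverage record; the ID and values are placeholders.
    "rxn_placeholder": {
        "dGPredictor-ModelSEED": [-1.0, 0.5, ">"],
    },
}


def is_new_coverage(thermo):
    """True when only the retrained model has an estimate for the reaction."""
    return "dGPredictor-ModelSEED" in thermo and "dGPredictor" not in thermo


def dgpredictor_gap(thermo):
    """Retrained minus original energy, or None when either record is missing."""
    if "dGPredictor" not in thermo or "dGPredictor-ModelSEED" not in thermo:
        return None
    return round(thermo["dGPredictor-ModelSEED"][0] - thermo["dGPredictor"][0], 2)


for rxn_id, thermo in reactions.items():
    print(rxn_id, is_new_coverage(thermo), dgpredictor_gap(thermo))
```

Keeping the two records under separate keys is what makes this kind of comparison possible without touching the canonical deltag.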

How

  • New staged predictions: Biochemistry/Thermodynamics/dGPredictor/modelseed_retrained_dG.json, shaped {rxn: {dG_mean, dG_uncer}}, in kJ/mol, 31,924 reactions.
  • New writer: Scripts/Thermodynamics/Update_Reaction_dGPredictor_ModelSEED_Energies.py stores dGPredictor-ModelSEED as [energy, error, operator], converting kJ/mol to kcal/mol by dividing by 4.184. The operator is this estimate's own thermodynamic direction, computed by the shared reversibility_from_energy(). The script is added to Rerun_Thermodynamics.sh.
  • All predictions are recorded as-is; the heavy-tailed extrapolations (macromolecule biosynthesis) carry correspondingly large error bars and never affect the canonical served value.
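The writer step described above can be sketched roughly as follows. This is not the actual script: `toy_direction` is a hypothetical stand-in for the shared `reversibility_from_energy()`, whose real rule and signature are not shown in this PR, and it is not expected to reproduce the repository's operators. Rounding to two decimals, the `id` field name, and the input kJ/mol values are likewise assumptions made for illustration.

```python
KJ_PER_KCAL = 4.184
SOURCE_KEY = "dGPredictor-ModelSEED"


def toy_direction(energy, error):
    """Hypothetical placeholder, NOT the repository's reversibility rule."""
    if energy + error < 0:
        return ">"
    if energy - error > 0:
        return "<"
    return "="


def add_records(reactions, predictions, direction_fn=toy_direction):
    """Attach a [energy, error, operator] record to each predicted reaction."""
    updated = 0
    for rxn in reactions:
        pred = predictions.get(rxn["id"])
        if pred is None:
            continue  # no prediction: leave the reaction untouched
        # Convert kJ/mol to kcal/mol; two-decimal rounding is an assumption.
        energy = round(pred["dG_mean"] / KJ_PER_KCAL, 2)
        error = round(pred["dG_uncer"] / KJ_PER_KCAL, 2)
        thermo = rxn.setdefault("thermodynamics", {})
        thermo[SOURCE_KEY] = [energy, error, direction_fn(energy, error)]
        updated += 1
    return updated


# Made-up inputs in the staged-file shape {rxn: {dG_mean, dG_uncer}}.
predictions = {"rxn_a": {"dG_mean": -15.77, "dG_uncer": 3.64}}
reactions = [
    {"id": "rxn_a", "deltag": 1.0, "thermodynamics": {"dGPredictor": [-3.82, 0.02, ">"]}},
    {"id": "rxn_b", "deltag": 2.0, "thermodynamics": {}},
]
count = add_records(reactions, predictions)
```

Because the function only ever writes under its own key, running it twice on the same inputs leaves the data unchanged, which is the property the idempotence check in this PR relies on.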

Data changed

  • 31,924 reactions gained a dGPredictor-ModelSEED record; 24,088 reactions unchanged.
  • Verified: every modified reaction differs from dev only by the added dGPredictor-ModelSEED key (deep per-reaction JSON equality across all 56,012 reactions; 0 other-field changes), every added value equals dG_mean/4.184, and re-running the writer is idempotent (byte-identical output). Zero changes to canonical deltag/deltagerr/reversibility/notes, the original dGPredictor record, any other field, or any .tsv.
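A check in the spirit of the verification above might look like the sketch below. It is an illustration only, not the check that was actually run: the function names are hypothetical, reactions are assumed to be lists of dicts keyed by `id`, and every reaction in the new data is assumed to exist in the old data. The reported counts are at least self-consistent: 31,924 + 24,088 = 56,012.

```python
import copy

SOURCE_KEY = "dGPredictor-ModelSEED"


def strip_source(rxn):
    """Copy of a reaction with the added per-source record removed."""
    clean = copy.deepcopy(rxn)
    clean.get("thermodynamics", {}).pop(SOURCE_KEY, None)
    return clean


def changed_outside_source(before, after):
    """IDs of reactions that differ in anything other than SOURCE_KEY."""
    before_by_id = {rxn["id"]: rxn for rxn in before}
    return [
        rxn["id"]
        for rxn in after
        if strip_source(rxn) != strip_source(before_by_id[rxn["id"]])
    ]


# Made-up data: one reaction gains the record, one is unchanged.
before = [
    {"id": "rxn_a", "deltag": 1.0, "thermodynamics": {"dGPredictor": [-3.82, 0.02, ">"]}},
    {"id": "rxn_b", "deltag": 2.0, "thermodynamics": {}},
]
after = copy.deepcopy(before)
after[0]["thermodynamics"][SOURCE_KEY] = [-3.77, 0.87, ">"]

clean_result = changed_outside_source(before, after)

# Tampering with a canonical field should be reported.
tampered = copy.deepcopy(after)
tampered[1]["deltag"] = 9.9
tampered_result = changed_outside_source(before, tampered)
```

Comparing deep copies with the new key removed keeps the check independent of key order and of which reactions happened to gain a record.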

Docs updated in Scripts/Thermodynamics/README.md and Biochemistry/Structures/sources.yaml.

🤖 Generated with Claude Code

…urce

Records the dGPredictor group-contribution model retrained on the ModelSEED
compound structures as its own per-method entry, "dGPredictor-ModelSEED", in
each reaction's `thermodynamics` dict. Purely additive: it sits next to the
Group contribution / eQuilibrator / (original KEGG-based) dGPredictor records,
and the original "dGPredictor" entry is left untouched. The canonical
deltag / deltagerr / reversibility are not changed, and no .tsv or compound
files change.

- New staged predictions: Biochemistry/Thermodynamics/dGPredictor/
  modelseed_retrained_dG.json (31,924 reactions, kJ/mol).
- New writer: Scripts/Thermodynamics/Update_Reaction_dGPredictor_ModelSEED_
  Energies.py (kJ->kcal /4.184; operator via reversibility_from_energy).
- 31,924 reactions gain a dGPredictor-ModelSEED record (incl. ~11,400 the
  original KEGG-based dGPredictor could not reach); 24,088 reactions unchanged.
- Verified: every modified reaction differs from dev ONLY by the added
  dGPredictor-ModelSEED key; added values equal dG_mean/4.184; the writer is
  idempotent.
- Docs: sources.yaml, Scripts/Thermodynamics/README.md, Rerun_Thermodynamics.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>