Data Concept#

SynRXN treats reaction data as versioned benchmark assets. Curated tables are stored as compressed CSV files and loaded through synrxn.data.DataLoader. Each task folder represents a benchmark family; each dataset file is a named record within that family.

Core model#

  • Task family: A benchmark problem type such as atom mapping, classification, property prediction, rebalancing, or synthesis.

  • Dataset record: A named compressed CSV table with stable columns, source metadata, and task-specific supervision.

  • Source snapshot: A Zenodo release, GitHub tag, exact commit, or development branch used to retrieve the data.

  • Split policy: Published split columns are preserved; new repeated splits can be generated with deterministic seeds.

The package separates concerns across four components:

  • Data/ contains curated benchmark tables grouped by task.

  • synrxn.data resolves, downloads, caches, verifies, lists, and loads those tables.

  • synrxn.split provides deterministic split helpers when a dataset does not already include the split needed for an experiment.

  • Builder scripts under script/ rebuild curated artifacts from upstream raw sources for developer workflows.
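The idea of deterministic repeated splits can be sketched with the standard library alone. This is a minimal illustration, not the synrxn.split API: the function name `repeated_splits` and its parameters are hypothetical, and the real helpers may differ in signature and behavior.

```python
import random


def repeated_splits(row_ids, n_repeats=3, test_fraction=0.2, base_seed=42):
    """Return one {"seed", "train", "test"} dict per repeat (hypothetical helper)."""
    # Sort first so the result does not depend on the input order.
    ordered = sorted(row_ids)
    n_test = max(1, round(len(ordered) * test_fraction))
    splits = []
    for repeat in range(n_repeats):
        seed = base_seed + repeat  # one distinct, recordable seed per repeat
        shuffled = ordered[:]
        # A local Random instance leaves the global random state untouched.
        random.Random(seed).shuffle(shuffled)
        splits.append(
            {"seed": seed, "test": shuffled[:n_test], "train": shuffled[n_test:]}
        )
    return splits


ids = [f"r{i}" for i in range(10)]
first = repeated_splits(ids)
second = repeated_splits(list(reversed(ids)))  # same rows, different input order
```

Because the rows are sorted and each repeat uses its own seed, `first` and `second` are identical, so recording the seed is enough to regenerate a split.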

Storage layout#

A typical source checkout or data archive follows this structure:

Data/
├── aam/
│   ├── golden.csv.gz
│   └── uspto_3k.csv.gz
├── classification/
│   ├── schneider_b.csv.gz
│   └── uspto_50k_u.csv.gz
├── property/
│   ├── b97xd3.csv.gz
│   └── rgd1.csv.gz
├── rbl/
│   └── mnc.csv.gz
└── synthesis/
    └── uspto_50k.csv.gz

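The folder name gives the task and the file name without the .csv.gz extension gives the dataset name. For example, Data/classification/schneider_b.csv.gz is loaded as dataset "schneider_b" with task="classification".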
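The mapping from a file path to a task and dataset name can be sketched as follows. The helper `parse_dataset_path` is hypothetical and only illustrates the layout convention; it is not part of synrxn.data.

```python
from pathlib import PurePosixPath

SUFFIX = ".csv.gz"


def parse_dataset_path(path):
    """Split 'Data/<task>/<name>.csv.gz' into (task, name) (hypothetical helper)."""
    p = PurePosixPath(path)
    if not p.name.endswith(SUFFIX):
        raise ValueError(f"expected a {SUFFIX} file, got {p.name!r}")
    name = p.name[: -len(SUFFIX)]  # strip the double extension
    task = p.parent.name  # the enclosing folder is the task family
    return task, name


task, name = parse_dataset_path("Data/classification/schneider_b.csv.gz")
```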

Task families#

| Task | Goal | Typical targets | Example datasets |
| --- | --- | --- | --- |
| aam | Evaluate atom-to-atom mapping quality. | reference ground_truth mappings and mapper outputs | golden, enzyme_map, uspto_3k |
| classification | Predict reaction class, template, or enzyme hierarchy. | label, label_0, label_1, label_2, ec1ec3 | schneider_b, tpl_b, ecreact |
| property | Predict continuous or probabilistic reaction properties. | ea, dh, G_act, G_r, lograte, Conversion | b97xd3, rgd1, cycloadd |
| rbl | Restore missing species and recover balanced reactions. | balanced ground_truth reaction strings | mnc, mos, mbs, complex |
| synthesis | Support forward synthesis, retrosynthesis, reagent inference, or reaction-center prediction. | aam, rxn, reagent, rc, source | da, uspto_50k, uspto_mit, uspto_500 |

Naming conventions#

Dataset names are short, lowercase identifiers. Some names include suffixes that clarify the reaction representation:

  • *_b denotes balanced or more complete reaction records, which may include auxiliary species.

  • *_u denotes unbalanced or product-focused records, often useful for template and reaction-class tasks.

  • task-specific names such as e2, sn2, rgd1, or cycloadd keep the upstream benchmark identity visible.
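The suffix convention above can be read programmatically. The helper below is a hypothetical sketch; it assumes that a trailing `_b` or `_u` always carries the meaning described here, which may not hold for every upstream name.

```python
def representation_hint(name):
    """Map a dataset name to the representation its suffix suggests (hypothetical)."""
    if name.endswith("_b"):
        return "balanced"
    if name.endswith("_u"):
        return "unbalanced"
    return None  # no suffix: the name keeps the upstream benchmark identity


hints = {n: representation_hint(n) for n in ["schneider_b", "uspto_50k_u", "rgd1"]}
```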

Common columns#

Column availability depends on the task family and upstream source. The most common conventions are:

  • r_id: stable row identifier used by most curated datasets.

  • rxn: reaction SMILES without atom mapping unless otherwise documented.

  • aam: atom-mapped reaction SMILES.

  • ground_truth: reference answer, such as the correct atom mapping or the balanced reaction string.

  • label, label_0, label_1, label_2: flat or hierarchical class labels.

  • split: published split assignment when available.

  • numeric target columns such as ea, dh, G_act, G_r, lograte, or Conversion.

  • source-tracking columns such as original_id, orig_index, source, or code.
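Since column availability varies, it can help to inspect a table's header before writing task code. The sketch below uses only the standard library and a toy in-memory table; the grouping into roles, the helper names, and the toy row are illustrative assumptions, not part of the package or its data.

```python
import csv
import gzip
import io

# Column conventions listed above, grouped by role (the grouping is an assumption).
CONVENTIONAL_COLUMNS = {
    "identifier": ["r_id"],
    "reaction": ["rxn", "aam"],
    "reference": ["ground_truth"],
    "labels": ["label", "label_0", "label_1", "label_2"],
    "split": ["split"],
    "targets": ["ea", "dh", "G_act", "G_r", "lograte", "Conversion"],
    "provenance": ["original_id", "orig_index", "source", "code"],
}


def read_header(source):
    """Return the column names of a gzip-compressed CSV (path or binary file object)."""
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        return next(csv.reader(handle))


def present_columns(header):
    """Report which conventional columns appear in the header, keyed by role."""
    found = {}
    for role, names in CONVENTIONAL_COLUMNS.items():
        hits = [n for n in names if n in header]
        if hits:
            found[role] = hits
    return found


# Toy table for illustration only; not taken from any SynRXN dataset.
toy_csv = "r_id,rxn,label,split\nr1,CCO>>CC=O,0,train\n"
toy_file = io.BytesIO(gzip.compress(toy_csv.encode("utf-8")))

header = read_header(toy_file)
roles = present_columns(header)
```

The same `read_header` call accepts a path to a real .csv.gz file in place of the in-memory buffer.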

Source resolution#

synrxn.data.DataLoader lets the same user code target different data sources.

| Source | What it resolves | Recommended use | Stability |
| --- | --- | --- | --- |
| zenodo | Archived data release assets. | Published experiments and benchmark reports. | Citable and stable. |
| github | GitHub tags, releases, or branches. | Synchronizing data with repository tags. | Stable for tags, mutable for branches. |
| commit | Exact repository commit SHA. | Reproducible development snapshots. | Stable if the commit remains available. |
| latest | Current default branch tip. | Exploration and development. | Mutable; record the resolved commit. |
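The stability notes above can be turned into a small guard in experiment code. This is a sketch under assumptions: the function `needs_resolved_commit` and the `ref_is_branch` flag are hypothetical, and treating GitHub branches like `latest` is an inference from the table rather than documented package behavior.

```python
# Stability per source mode, as described in this section.
SOURCE_STABILITY = {
    "zenodo": "citable and stable",
    "github": "stable for tags, mutable for branches",
    "commit": "stable if the commit remains available",
    "latest": "mutable",
}


def needs_resolved_commit(source, ref_is_branch=False):
    """Return True when the source can change and the resolved commit should be logged."""
    if source not in SOURCE_STABILITY:
        raise ValueError(f"unknown source mode: {source!r}")
    # 'latest' always tracks a moving branch tip; 'github' does so only for branches.
    return source == "latest" or (source == "github" and ref_is_branch)


checks = {
    "zenodo": needs_resolved_commit("zenodo"),
    "github tag": needs_resolved_commit("github"),
    "github branch": needs_resolved_commit("github", ref_is_branch=True),
    "latest": needs_resolved_commit("latest"),
}
```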

Cache behavior#

Use a persistent cache directory for performance and reproducibility.

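from pathlib import Path
from synrxn.data import DataLoader

# Pin the task, source, and data version, and keep downloads in a
# persistent cache directory so later runs reuse the same files.
loader = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)

# Load the dataset table by name.
df = loader.load("schneider_b")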

The cache avoids repeated downloads, keeps resolved assets local, and makes it easier to archive the exact data used in an experiment.
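One way to archive the exact data used in an experiment is to store a checksum manifest of the cached files next to the results. The sketch below is independent of synrxn: the helper names are hypothetical, the `*.csv.gz` glob and the directory layout inside the cache are assumptions, and the demo uses a temporary directory with placeholder bytes instead of real data.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tables do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_manifest(cache_dir):
    """Map each cached .csv.gz file (relative path) to its SHA-256 digest."""
    root = Path(cache_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*.csv.gz"))
    }


# Demo on a throwaway directory; the file content is a placeholder.
payload = b"placeholder bytes, not real data"
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "classification" / "schneider_b.csv.gz"
    sample.parent.mkdir(parents=True)
    sample.write_bytes(payload)
    manifest = cache_manifest(tmp)
```

Comparing two manifests shows whether two runs read byte-identical files, even when the source mode is mutable.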

Reproducibility checklist#

Record at least the following with every benchmark result: package version, task family, dataset name, source mode, data version or commit SHA, split column or split seed, preprocessing steps, and evaluation script commit.

For example:

package: synrxn==1.0.0
task: classification
dataset: schneider_b
source: zenodo
version: 1.0.0
split: published split column
cache: ~/.cache/synrxn
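A record like this can be checked against the minimum field list before results are saved. In the sketch below, the key names `preprocessing` and `evaluation_commit` and the helper `missing_fields` are hypothetical; the other keys follow the example above.

```python
import json

# Minimum record from the checklist; the last two key names are assumptions.
REQUIRED_FIELDS = (
    "package",
    "task",
    "dataset",
    "source",
    "version",
    "split",
    "preprocessing",
    "evaluation_commit",
)


def missing_fields(record):
    """List the required fields that the record does not contain."""
    return [field for field in REQUIRED_FIELDS if field not in record]


record = {
    "package": "synrxn==1.0.0",
    "task": "classification",
    "dataset": "schneider_b",
    "source": "zenodo",
    "version": "1.0.0",
    "split": "published split column",
    "cache": "~/.cache/synrxn",
}

missing = missing_fields(record)
serialized = json.dumps(record, indent=2, sort_keys=True)  # ready to store with results
```

Here `missing` lists `preprocessing` and `evaluation_commit`, which a complete record would still need to add.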

See also#