Data Concept#
SynRXN treats reaction data as versioned benchmark assets. Curated tables are
stored as compressed CSV files and loaded through synrxn.data.DataLoader.
Each task folder represents a benchmark family; each dataset file is a named
record within that family.
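As a minimal sketch of that convention, assuming files are laid out as `Data/<task>/<name>.csv.gz`, the helper below splits a path into its task family and dataset name. `parse_data_path` is a hypothetical name used only for illustration; it is not part of the SynRXN API.

```python
from pathlib import PurePosixPath

SUFFIX = ".csv.gz"


def parse_data_path(path: str) -> tuple[str, str]:
    """Split 'Data/<task>/<name>.csv.gz' into (task, dataset_name)."""
    p = PurePosixPath(path)
    if not p.name.endswith(SUFFIX):
        raise ValueError(f"expected a {SUFFIX} file, got {p.name!r}")
    # The parent folder is the task family; the file name minus the
    # compound suffix is the dataset name.
    task = p.parent.name
    name = p.name[: -len(SUFFIX)]
    return task, name


task, name = parse_data_path("Data/classification/schneider_b.csv.gz")
print(task, name)  # prints: classification schneider_b
```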
Core model#
The package separates concerns clearly:
- Data/ contains curated benchmark tables grouped by task.
- synrxn.data resolves, downloads, caches, verifies, lists, and loads those tables.
- synrxn.split provides deterministic split helpers when a dataset does not already include the split needed for an experiment.
- Builder scripts under script/ rebuild curated artifacts from upstream raw sources for developer workflows.
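To illustrate what "deterministic" means for a split, here is a minimal sketch that assigns each row to a split by hashing its identifier together with a seed. This is not the `synrxn.split` API; `assign_split` and its parameters are hypothetical names chosen for the example, and the package's own helpers may work differently.

```python
import hashlib


def assign_split(row_id: str, seed: int = 0, test_fraction: float = 0.2) -> str:
    """Map a row identifier to 'train' or 'test' deterministically."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).digest()
    # Interpret the first 6 bytes as an integer in [0, 2**48) and scale it
    # to a float in [0, 1); 48 bits are represented exactly in a double.
    u = int.from_bytes(digest[:6], "big") / 2**48
    return "test" if u < test_fraction else "train"


# Hypothetical row identifiers; the same seed always yields the same assignment.
row_ids = [f"rxn_{i}" for i in range(10)]
splits = {rid: assign_split(rid, seed=42) for rid in row_ids}
```

Because the assignment depends only on the identifier and the seed, it does not change when rows are reordered or when new rows are appended.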
Storage layout#
A typical source checkout or data archive follows this structure:
Data/
├── aam/
│   ├── golden.csv.gz
│   └── uspto_3k.csv.gz
├── classification/
│   ├── schneider_b.csv.gz
│   └── uspto_50k_u.csv.gz
├── property/
│   ├── b97xd3.csv.gz
│   └── rgd1.csv.gz
├── rbl/
│   └── mnc.csv.gz
└── synthesis/
    └── uspto_50k.csv.gz
Data/classification/schneider_b.csv.gz is loaded as dataset
"schneider_b" with task="classification".
Task families#
| Task | Goal | Typical targets | Example datasets |
|---|---|---|---|
| aam | Evaluate atom-to-atom mapping quality. | reference | golden, uspto_3k |
| classification | Predict reaction class, template, or enzyme hierarchy. | | schneider_b, uspto_50k_u |
| property | Predict continuous or probabilistic reaction properties. | | b97xd3, rgd1 |
| rbl | Restore missing species and recover balanced reactions. | balanced | mnc |
| synthesis | Support forward synthesis, retrosynthesis, reagent inference, or reaction-center prediction. | | uspto_50k |
Naming conventions#
Dataset names are short, lowercase identifiers. Some names include suffixes that clarify the reaction representation:
- *_b denotes balanced or fuller reaction records where auxiliary species can be included.
- *_u denotes unbalanced or product-focused records, often useful for template and reaction-class tasks.
- Task-specific names such as e2, sn2, rgd1, or cycloadd keep the upstream benchmark identity visible.
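The suffix rules above can be sketched as a small lookup. `describe_suffix` is a hypothetical helper for illustration only, and it assumes the suffix is always the final `_b` or `_u` of the dataset name.

```python
def describe_suffix(name: str) -> str:
    """Classify a dataset name by its representation suffix."""
    if name.endswith("_b"):
        return "balanced"
    if name.endswith("_u"):
        return "unbalanced"
    # Names without a suffix carry no representation hint.
    return "unspecified"


for dataset in ("schneider_b", "uspto_50k_u", "rgd1"):
    print(dataset, "->", describe_suffix(dataset))
```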
Common columns#
Column availability depends on the task family and upstream source. The most common conventions are:
- r_id: stable row identifier used by most curated datasets.
- rxn: reaction SMILES without atom mapping unless otherwise documented.
- aam: atom-mapped reaction SMILES.
- ground_truth: reference answer, such as the correct atom mapping or the balanced reaction string.
- label, label_0, label_1, label_2: flat or hierarchical class labels.
- split: published split assignment when available.
- Numeric target columns such as ea, dh, G_act, G_r, lograte, or Conversion.
- Source-tracking columns such as original_id, orig_index, source, or code.
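Because column availability varies, it can help to inspect a table's header before writing task code. The sketch below groups column names by the conventions listed above; `summarize_columns` is a hypothetical helper, and it takes a plain list of names (for example `list(df.columns)`, assuming the loaded table exposes its columns that way).

```python
REACTION_COLUMNS = ("rxn", "aam")
LABEL_COLUMNS = ("label", "label_0", "label_1", "label_2")


def summarize_columns(columns):
    """Group column names by the common conventions."""
    cols = list(columns)
    known = {"r_id", "split", "ground_truth", *REACTION_COLUMNS, *LABEL_COLUMNS}
    return {
        "has_row_id": "r_id" in cols,
        "reaction": [c for c in cols if c in REACTION_COLUMNS],
        "labels": [c for c in cols if c in LABEL_COLUMNS],
        "has_ground_truth": "ground_truth" in cols,
        "has_published_split": "split" in cols,
        # Anything else: numeric targets, source-tracking columns, and so on.
        "other": [c for c in cols if c not in known],
    }


# Hypothetical header for a property-style table.
summary = summarize_columns(["r_id", "rxn", "ea", "dh", "split"])
```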
Source resolution#
synrxn.data.DataLoader lets the same user code target different data
sources.
| Source | What it resolves | Recommended use | Stability |
|---|---|---|---|
| zenodo | Archived data release assets. | Published experiments and benchmark reports. | Citable and stable. |
| | GitHub tags, releases, or branches. | Synchronizing data with repository tags. | Stable for tags, mutable for branches. |
| | Exact repository commit SHA. | Reproducible development snapshots. | Stable if the commit remains available. |
| | Current default branch tip. | Exploration and development. | Mutable; record the resolved commit. |
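When a mutable source is used, one way to check that a recorded reference is actually pinned is to test whether it looks like a full commit hash. The sketch below assumes 40-character hexadecimal SHA-1 identifiers, as used by GitHub repositories; `is_pinned_commit` is a hypothetical helper and not part of SynRXN.

```python
import re

# A full Git SHA-1 object name is 40 hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_commit(ref: str) -> bool:
    """Return True if ref looks like a full 40-character commit SHA."""
    return _FULL_SHA.fullmatch(ref.lower()) is not None


for ref in ("main", "v1.0.0", "0123456789abcdef0123456789abcdef01234567"):
    print(ref, "->", "pinned" if is_pinned_commit(ref) else "not pinned")
```

Tags and branch names fail this check by design: a tag may be stable in practice, but only a commit SHA identifies the snapshot unambiguously.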
Cache behavior#
Use a persistent cache directory for performance and reproducibility.
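from pathlib import Path

from synrxn.data import DataLoader

# Pin the source and version so every run resolves the same table.
loader = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    # A persistent directory lets later runs reuse the downloaded files.
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
df = loader.load("schneider_b")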
The cache avoids repeated downloads, keeps resolved assets local, and makes it easier to archive the exact data used in an experiment.
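To archive the exact data used in an experiment, one option is to store a checksum manifest of the cached files next to the results. The sketch below is not a SynRXN feature: `cache_manifest` is a hypothetical helper, and it assumes the cache directory keeps tables as `.csv.gz` files, which may not match the loader's actual cache layout.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large tables need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_manifest(cache_dir: Path) -> dict[str, str]:
    """Map each cached .csv.gz file (path relative to cache_dir) to its SHA-256."""
    return {
        p.relative_to(cache_dir).as_posix(): sha256_of_file(p)
        for p in sorted(cache_dir.rglob("*.csv.gz"))
    }
```

Calling `cache_manifest(Path("~/.cache/synrxn").expanduser())` after a run and saving the result alongside the metrics gives a later reader a way to confirm that they are working from byte-identical files.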
Reproducibility checklist#
Record at least the following with every benchmark result: package version, task family, dataset name, source mode, data version or commit SHA, split column or split seed, preprocessing steps, and evaluation script commit.
For example:
package: synrxn==1.0.0
task: classification
dataset: schneider_b
source: zenodo
version: 1.0.0
split: published split column
cache: ~/.cache/synrxn
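A record like this is easier to keep consistent when it is written by code rather than by hand. The following is a minimal sketch, assuming a JSON file stored next to the results is acceptable; `BenchmarkRecord` is a hypothetical class whose fields simply mirror the example above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchmarkRecord:
    """Provenance fields to store with a benchmark result."""
    package: str
    task: str
    dataset: str
    source: str
    version: str
    split: str
    cache: str


record = BenchmarkRecord(
    package="synrxn==1.0.0",
    task="classification",
    dataset="schneider_b",
    source="zenodo",
    version="1.0.0",
    split="published split column",
    cache="~/.cache/synrxn",
)

# Sorted keys keep the serialized record stable across runs and easy to diff.
payload = json.dumps(asdict(record), indent=2, sort_keys=True)
print(payload)
```

Fields such as preprocessing steps and the evaluation script commit can be added to the dataclass in the same way.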
See also#
Data Records for the complete dataset inventory.
Tutorials and Examples for source-mode and split workflows.
API Reference for generated API documentation.