Tutorials and Examples#
This page collects practical workflows for loading SynRXN datasets, switching between data sources, caching assets, creating reproducible splits, and rebuilding curated records during development.
Canonical imports#
from pathlib import Path
from synrxn.data import DataLoader
from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter
A reusable cache path keeps downloaded archives local:
CACHE = Path("~/.cache/synrxn").expanduser()
Task aliases#
DataLoader normalizes common task aliases.
Alias |
Canonical task |
Typical use |
|---|---|---|
|
|
reaction class, template, and enzyme labels |
|
|
reaction property prediction |
|
|
synthesis and retrosynthesis records |
Load a released dataset from Zenodo#
Use Zenodo for a citable release snapshot. This is the recommended mode for published experiments.
from pathlib import Path
from synrxn.data import DataLoader
loader = DataLoader(
task="classification",
source="zenodo",
version="1.0.0",
cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print("Available datasets:", loader.available_names())
df = loader.load("schneider_b")
print("Rows:", len(df))
print("Columns:", df.columns.tolist())
print(df.head(3))
Expected workflow:
Resolve the versioned release.
Download the archive if it is not already cached.
Load the selected compressed CSV as a pandas DataFrame.
Use the documented columns for training or evaluation.
Pin a GitHub release or commit#
GitHub release tag#
Use a release tag when you want data and code aligned with a repository tag.
loader = DataLoader(
task="classification",
source="github",
version="v1.0.0",
cache_dir=Path("~/.cache/synrxn").expanduser(),
gh_enable=True,
)
print(loader.available_names())
df = loader.load("schneider_b")
print(df.shape)
Exact commit SHA#
Use a full commit SHA for the strongest development-snapshot reproducibility.
loader = DataLoader(
task="property",
source="commit",
version="3e1612e2199e8b0e369fce3ed9aff3dda68e4c32",
cache_dir=Path("~/.cache/synrxn").expanduser(),
gh_enable=True,
)
df = loader.load("b97xd3")
print(df[["r_id", "ea", "dh"]].head())
Development latest#
latest is convenient during exploration but should not be the only version
record in a formal benchmark.
loader = DataLoader(
task="classification",
source="github",
version="latest",
cache_dir=Path("~/.cache/synrxn").expanduser(),
gh_enable=True,
)
df = loader.load("schneider_b")
print(df.shape)
Best practice: If exploratory results from latest become important, rerun the experiment with the resolved commit SHA or an archived release before reporting the result.
Explore dataset schemas#
List available datasets, load one table, and summarize column types.
loader = DataLoader(
task="property",
source="zenodo",
version="1.0.0",
cache_dir=CACHE,
)
for name in loader.available_names():
print(name)
df = loader.load("rgd1")
print(df.dtypes)
print(df.describe(include="all").T.head(10))
A compact helper for quick inspection:
def inspect_dataset(task: str, name: str, version: str = "1.0.0"):
loader = DataLoader(task=task, source="zenodo", version=version, cache_dir=CACHE)
df = loader.load(name)
return {
"task": task,
"name": name,
"shape": df.shape,
"columns": df.columns.tolist(),
"has_split": "split" in df.columns,
}
print(inspect_dataset("classification", "schneider_b"))
print(inspect_dataset("property", "b97xd3"))
Make reproducible splits#
Use published splits when available#
Some benchmark records contain a split column. Keep it for direct comparison
with upstream or previously published results.
df = loader.load("schneider_b")
if "split" in df.columns:
print(df["split"].value_counts(dropna=False))
Generate repeated k-fold splits#
When a dataset does not include the split you need, use a deterministic splitter.
loader = DataLoader(
task="property",
source="zenodo",
version="1.0.0",
cache_dir=CACHE,
)
df = loader.load("b97xd3")
splitter = RepeatedKFoldsSplitter(
n_splits=5,
n_repeats=3,
random_state=2026,
val_ratio=0.1,
)
split_indices = splitter.split(df)
print(split_indices)
Export split indices#
Persist generated split indices next to model outputs.
import json
from pathlib import Path
out = Path("runs/b97xd3_splits.json")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w") as fh:
json.dump(split_indices.to_dict(), fh, indent=2)
print("Wrote", out)
Train/evaluate loop sketch#
The exact training code depends on your model, but the pattern is stable:
for split_name, indices in split_indices.items():
train_df = df.iloc[indices["train"]]
valid_df = df.iloc[indices["valid"]]
test_df = df.iloc[indices["test"]]
# model = fit_model(train_df, valid_df)
# metrics = evaluate_model(model, test_df)
# save_metrics(split_name, metrics)
print(split_name, len(train_df), len(valid_df), len(test_df))
Rebuild datasets from source#
Dataset rebuilding is intended for maintainers and advanced users who need to verify a release, add a dataset, or regenerate manifests after modifying build scripts.
When to rebuild#
You want to verify that published archives are reproducible.
You are developing new curated records.
You modified preprocessing, schema normalization, or split-generation logic.
You need to regenerate manifests and checksums.
CLI smoke test#
Inspect the available command-line interface:
python -m synrxn --help
python -m synrxn build --help
Typical rebuild command#
python -m synrxn build \
--task classification \
--dataset schneider_b \
--output Data/classification
A rebuild workflow should record:
the raw upstream source and retrieval date,
the SynRXN Git commit,
the builder command and configuration,
generated row counts and checksums,
schema changes compared with the previous release.
Caching and reproducibility tips#
Use a persistent cache such as
~/.cache/synrxn.Prefer
source="zenodo"for published experiments.Prefer exact commit SHAs over
latestfor development snapshots.Keep generated split indices under version control or alongside run outputs.
Record the package version and the data version in each experiment log.
Minimal experiment log#
package: synrxn==1.0.0
task: property
dataset: b97xd3
source: zenodo
version: 1.0.0
cache_dir: ~/.cache/synrxn
split: repeated k-fold, n_splits=5, n_repeats=3, random_state=2026, val_ratio=0.1
model_commit: <your-method-commit>
See also#
Getting Started for the first installation and load.
Data Concept for task folders and schema conventions.
Data Records for dataset counts and citations.
API Reference for generated class and module documentation.