Tutorials and Examples#

This page collects practical workflows for loading SynRXN datasets, switching between data sources, caching assets, creating reproducible splits, and rebuilding curated records during development.

Released data: Use Zenodo when you need citable, stable benchmark snapshots.
Git snapshots: Use tags or exact commit SHAs to match source code and data state.
Splits: Create repeated k-fold splits or train/validation/test partitions.
Rebuilds: Regenerate curated artifacts and manifests from upstream raw sources.

Canonical imports#

from pathlib import Path
from synrxn.data import DataLoader
from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter

A reusable cache path keeps downloaded archives local:

CACHE = Path("~/.cache/synrxn").expanduser()

Task aliases#

DataLoader normalizes common task aliases.

Alias	Canonical task	Typical use
`class`	`classification`	reaction class, template, and enzyme labels
`prop`	`property`	reaction property prediction
`syn`	`synthesis`	synthesis and retrosynthesis records
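
The alias table amounts to a small lookup. The sketch below illustrates that normalization on its own. It makes no claim about how DataLoader implements it internally, and the names `TASK_ALIASES` and `normalize_task` are hypothetical.

```python
# Hypothetical illustration of alias normalization; not SynRXN's actual code.
TASK_ALIASES = {
    "class": "classification",
    "prop": "property",
    "syn": "synthesis",
}
CANONICAL_TASKS = set(TASK_ALIASES.values())


def normalize_task(task: str) -> str:
    """Map an alias or a canonical name to the canonical task name."""
    key = task.strip().lower()
    canonical = TASK_ALIASES.get(key, key)
    if canonical not in CANONICAL_TASKS:
        raise ValueError(f"Unknown task: {task!r}")
    return canonical


print(normalize_task("class"))
print(normalize_task("property"))
```

Rejecting unknown names early, as this sketch does, tends to surface typos before any download starts.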

Load a released dataset from Zenodo#

Use Zenodo for a citable release snapshot. This is the recommended mode for published experiments.

from pathlib import Path
from synrxn.data import DataLoader

loader = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)

print("Available datasets:", loader.available_names())

df = loader.load("schneider_b")
print("Rows:", len(df))
print("Columns:", df.columns.tolist())
print(df.head(3))

Expected workflow:

Resolve the versioned release.
Download the archive if it is not already cached.
Load the selected compressed CSV as a pandas DataFrame.
Use the documented columns for training or evaluation.
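
The "download only if not already cached" step above follows a common pattern that can be sketched with the standard library alone. This is an illustration under stated assumptions, not SynRXN's implementation: `fetch_cached`, `fake_download`, and the file name `archive.zip` are all hypothetical, and the real downloader is replaced by a stand-in callable.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_cached(cache_dir: Path, filename: str,
                 download: Callable[[Path], None]) -> Path:
    """Return the cached file, calling `download` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        # Write to a temporary name first so an interrupted transfer
        # never leaves a truncated file under the final name.
        partial = target.with_name(target.name + ".part")
        download(partial)
        partial.replace(target)
    return target


# Stand-in downloader (hypothetical) that writes placeholder bytes.
def fake_download(dest: Path) -> None:
    dest.write_bytes(b"placeholder archive bytes")


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_cached(Path(tmp), "archive.zip", fake_download)   # cache miss
    second = fetch_cached(Path(tmp), "archive.zip", fake_download)  # cache hit
    print(first == second, first.exists())
```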

Pin a GitHub release or commit#

GitHub release tag#

Use a release tag when you want data and code aligned with a repository tag.

loader = DataLoader(
    task="classification",
    source="github",
    version="v1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
    gh_enable=True,
)

print(loader.available_names())
df = loader.load("schneider_b")
print(df.shape)

Exact commit SHA#

Use a full commit SHA when a development snapshot must be exactly reproducible: unlike a branch name or latest, a SHA always identifies the same repository state.

loader = DataLoader(
    task="property",
    source="commit",
    version="3e1612e2199e8b0e369fce3ed9aff3dda68e4c32",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
    gh_enable=True,
)

df = loader.load("b97xd3")
print(df[["r_id", "ea", "dh"]].head())

Development latest#

The latest version is convenient during exploration, but a formal benchmark should also record the commit or release it resolved to.

loader = DataLoader(
    task="classification",
    source="github",
    version="latest",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
    gh_enable=True,
)

df = loader.load("schneider_b")
print(df.shape)

Best practice: If exploratory results from latest become important, rerun the experiment with the resolved commit SHA or an archived release before reporting the result.
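
A small guard can enforce this before results are written up. The sketch below only checks the shape of the version string. It assumes 40-character SHA-1 commit names, as in the example above, and the helper names `is_full_commit_sha` and `is_pinned` are hypothetical rather than part of SynRXN.

```python
import re

# A full Git SHA-1 commit name is exactly 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_full_commit_sha(version: str) -> bool:
    """True if `version` looks like a full SHA-1 commit name."""
    return _FULL_SHA.fullmatch(version) is not None


def is_pinned(source: str, version: str) -> bool:
    """True if the (source, version) pair names a fixed snapshot.

    Assumption: "latest" is never pinned, and source="commit" is pinned
    only when the version is a full SHA rather than an abbreviation.
    """
    if version == "latest":
        return False
    if source == "commit":
        return is_full_commit_sha(version)
    return True


print(is_pinned("commit", "3e1612e2199e8b0e369fce3ed9aff3dda68e4c32"))
print(is_pinned("github", "latest"))
```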

Explore dataset schemas#

List available datasets, load one table, and summarize column types.

loader = DataLoader(
    task="property",
    source="zenodo",
    version="1.0.0",
    cache_dir=CACHE,
)

for name in loader.available_names():
    print(name)

df = loader.load("rgd1")
print(df.dtypes)
print(df.describe(include="all").T.head(10))

A compact helper for quick inspection:

def inspect_dataset(task: str, name: str, version: str = "1.0.0"):
    loader = DataLoader(task=task, source="zenodo", version=version, cache_dir=CACHE)
    df = loader.load(name)
    return {
        "task": task,
        "name": name,
        "shape": df.shape,
        "columns": df.columns.tolist(),
        "has_split": "split" in df.columns,
    }

print(inspect_dataset("classification", "schneider_b"))
print(inspect_dataset("property", "b97xd3"))

Make reproducible splits#

Use published splits when available#

Some benchmark records contain a split column. Keep it for direct comparison with upstream or previously published results.

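# Build a classification loader; the previous `loader` was created for task="property".
loader = DataLoader(task="classification", source="zenodo", version="1.0.0", cache_dir=CACHE)
df = loader.load("schneider_b")

# Count rows per published partition, including rows with no split label.
if "split" in df.columns:
    print(df["split"].value_counts(dropna=False))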

Generate repeated k-fold splits#

When a dataset does not include the split you need, use a deterministic splitter.

loader = DataLoader(
    task="property",
    source="zenodo",
    version="1.0.0",
    cache_dir=CACHE,
)
df = loader.load("b97xd3")

splitter = RepeatedKFoldsSplitter(
    n_splits=5,
    n_repeats=3,
    random_state=2026,
    val_ratio=0.1,
)

split_indices = splitter.split(df)
print(split_indices)
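
To make the idea of a deterministic repeated k-fold split concrete, here is a standard-library sketch of the general technique. It is not the RepeatedKFoldsSplitter implementation. The function name, the `repeat{r}_fold{k}` key format, and the reading of `val_ratio` as a fraction of the non-test rows are all assumptions made for illustration.

```python
import random


def repeated_kfold_indices(n_rows, n_splits=5, n_repeats=3,
                           random_state=2026, val_ratio=0.1):
    """Return {"repeat{r}_fold{k}": {"train": [...], "valid": [...], "test": [...]}}."""
    rng = random.Random(random_state)  # one seeded generator drives every repeat
    splits = {}
    for r in range(n_repeats):
        order = list(range(n_rows))
        rng.shuffle(order)  # a fresh permutation for each repeat
        folds = [order[k::n_splits] for k in range(n_splits)]
        for k in range(n_splits):
            # Fold k is held out for testing; the other folds are pooled.
            rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
            # Assumption: validation rows are carved out of the non-test rows.
            n_valid = int(round(val_ratio * len(rest)))
            splits[f"repeat{r}_fold{k}"] = {
                "train": sorted(rest[n_valid:]),
                "valid": sorted(rest[:n_valid]),
                "test": sorted(folds[k]),
            }
    return splits


splits = repeated_kfold_indices(100)
first = splits["repeat0_fold0"]
print(len(splits), len(first["train"]), len(first["valid"]), len(first["test"]))
```

Because every permutation comes from a single seeded generator, calling the function again with the same arguments returns the same partitions.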

Export split indices#

Persist generated split indices next to model outputs.

import json
from pathlib import Path

out = Path("runs/b97xd3_splits.json")
out.parent.mkdir(parents=True, exist_ok=True)

with out.open("w") as fh:
    json.dump(split_indices.to_dict(), fh, indent=2)

print("Wrote", out)

Train/evaluate loop sketch#

The exact training code depends on your model, but the pattern is stable:

for split_name, indices in split_indices.items():
    train_df = df.iloc[indices["train"]]
    valid_df = df.iloc[indices["valid"]]
    test_df = df.iloc[indices["test"]]

    # model = fit_model(train_df, valid_df)
    # metrics = evaluate_model(model, test_df)
    # save_metrics(split_name, metrics)

    print(split_name, len(train_df), len(valid_df), len(test_df))
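
Once each split has produced a metric, results are usually reported as a mean and spread across splits. A minimal sketch follows; the `summarize_metric` helper, the split names, and the metric values are hypothetical placeholders, not results from any SynRXN dataset.

```python
from statistics import mean, stdev


def summarize_metric(per_split: dict) -> dict:
    """Aggregate one scalar metric over splits into count, mean, and sample std."""
    values = list(per_split.values())
    return {
        "n": len(values),
        "mean": mean(values),
        # Sample standard deviation needs at least two values.
        "std": stdev(values) if len(values) > 1 else 0.0,
    }


# Placeholder numbers purely for illustration.
example = {"repeat0_fold0": 1.0, "repeat0_fold1": 2.0, "repeat0_fold2": 3.0}
print(summarize_metric(example))
```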

Rebuild datasets from source#

Dataset rebuilding is intended for maintainers and advanced users who need to verify a release, add a dataset, or regenerate manifests after modifying build scripts.

When to rebuild#

You want to verify that published archives are reproducible.
You are developing new curated records.
You modified preprocessing, schema normalization, or split-generation logic.
You need to regenerate manifests and checksums.

CLI smoke test#

Inspect the available command-line interface:

python -m synrxn --help
python -m synrxn build --help

Typical rebuild command#

python -m synrxn build \
  --task classification \
  --dataset schneider_b \
  --output Data/classification

A rebuild workflow should record:

the raw upstream source and retrieval date,
the SynRXN Git commit,
the builder command and configuration,
generated row counts and checksums,
schema changes compared with the previous release.
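
Row counts and checksums from the list above can be gathered with the standard library. The sketch below assumes the artifact is a gzip-compressed CSV with a single header row. The compression format, the `describe_artifact` helper, and the demo file name are assumptions, not documented SynRXN behavior.

```python
import csv
import gzip
import hashlib
import tempfile
from pathlib import Path


def describe_artifact(path: Path) -> dict:
    """Return the SHA-256 digest and data-row count of a gzip-compressed CSV."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash the compressed bytes in 1 MiB chunks to keep memory use flat.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row
        n_rows = sum(1 for _ in reader)
    return {"file": path.name, "sha256": digest.hexdigest(), "rows": n_rows}


# Demonstration on a tiny throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv.gz"
    with gzip.open(demo, "wt", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows([["r_id", "value"], ["1", "a"], ["2", "b"]])
    print(describe_artifact(demo))
```

The digest covers the compressed bytes, so it changes if the same table is recompressed with different settings. Recording the row count alongside it helps tell a content change from a packaging change.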

Caching and reproducibility tips#

Use a persistent cache such as ~/.cache/synrxn.
Prefer source="zenodo" for published experiments.
Prefer exact commit SHAs over latest for development snapshots.
Keep generated split indices under version control or alongside run outputs.
Record the package version and the data version in each experiment log.

Minimal experiment log#

package: synrxn==1.0.0
task: property
dataset: b97xd3
source: zenodo
version: 1.0.0
cache_dir: ~/.cache/synrxn
split: repeated k-fold, n_splits=5, n_repeats=3, random_state=2026, val_ratio=0.1
model_commit: <your-method-commit>
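
A log like the one above can also be produced programmatically so it is never filled in by hand. The sketch below is one possible approach: `experiment_record` and `package_version` are hypothetical helpers, and the installed package version is looked up with the standard-library `importlib.metadata` rather than through any SynRXN API.

```python
import json
from importlib.metadata import PackageNotFoundError, version


def package_version(name: str) -> str:
    """Return the installed version of `name`, or a marker if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not-installed"


def experiment_record(task, dataset, source, data_version, split, model_commit):
    """Collect the fields of the minimal experiment log into one dict."""
    return {
        "package": f"synrxn=={package_version('synrxn')}",
        "task": task,
        "dataset": dataset,
        "source": source,
        "version": data_version,
        "split": split,
        "model_commit": model_commit,
    }


record = experiment_record(
    task="property",
    dataset="b97xd3",
    source="zenodo",
    data_version="1.0.0",
    split={"n_splits": 5, "n_repeats": 3, "random_state": 2026, "val_ratio": 0.1},
    model_commit="<your-method-commit>",
)
print(json.dumps(record, indent=2))
```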