Getting Started#

This page takes you from installation to a working SynRXN data-loading workflow. The examples use released data for reproducibility and exact Git commits for development snapshots.

InstallCreate a Python environment and install SynRXN from PyPI or source.

LoadUse DataLoader to resolve, cache, and read a curated benchmark table.

SplitReuse published split columns or generate deterministic repeated k-fold splits.

Requirements#

Python: 3.11 or newer is recommended.
Operating system: Linux, macOS, or Windows with WSL.
Network access: required the first time you download data from Zenodo or GitHub.
Persistent cache: recommended for repeated experiments, for example ~/.cache/synrxn.

Installation#

From PyPI#

Use this route for released package builds and published data snapshots.

python -m pip install --upgrade pip
python -m pip install synrxn

Install the broader optional dependency stack when you need all extras:

python -m pip install "synrxn[all]"

From source#

Use this route when you are developing SynRXN, editing documentation, or rebuilding curated datasets.

git clone https://github.com/TieuLongPhan/SynRXN.git
cd SynRXN

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

Build the documentation locally#

The enhanced documentation uses the PyData Sphinx theme.

cd doc
python -m pip install -r requirements.txt
sphinx-build -b html . _build/html

Open doc/_build/html/index.html in a browser.

Verify the installation#

Run a minimal import and version check:

python - <<'PY'
import importlib.metadata as metadata
import synrxn

print("synrxn", metadata.version("synrxn"))
print("module", synrxn.__file__)
PY

Load your first dataset#

The example below loads the balanced Schneider classification benchmark from a versioned Zenodo release and inspects the resulting pandas DataFrame.

from pathlib import Path
from synrxn.data import DataLoader

cache_dir = Path("~/.cache/synrxn").expanduser()

loader = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    cache_dir=cache_dir,
)

print("Available datasets:", loader.available_names())

df = loader.load("schneider_b")
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print(df.head(3))

Use a pinned commit during development#

For development workflows, pinning a commit gives an exact source snapshot. This is stronger than latest because it prevents silent changes in future runs.

from pathlib import Path
from synrxn.data import DataLoader

loader = DataLoader(
    task="property",
    source="commit",
    version="3e1612e2199e8b0e369fce3ed9aff3dda68e4c32",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
    gh_enable=True,
)

df = loader.load("b97xd3")
print(df[["r_id", "ea", "dh"]].head())

Work with train/validation/test splits#

Some datasets already include a split column from the original benchmark. When a fresh deterministic split is needed, use synrxn.split.repeated_kfold.RepeatedKFoldsSplitter.

from pathlib import Path
from synrxn.data import DataLoader
from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter

loader = DataLoader(
    task="property",
    source="zenodo",
    version="1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
df = loader.load("b97xd3")

splitter = RepeatedKFoldsSplitter(
    n_splits=5,
    n_repeats=2,
    random_state=42,
    val_ratio=0.1,
)

split_indices = splitter.split(df)
print(split_indices)

Reproducibility tip: For a paper or benchmark report, record the dataset name, task family, source mode, version or commit SHA, cache manifest if available, split seed, and package version.

Source modes at a glance#

Source mode	Best use case	Example version	Reproducibility level
`zenodo`	Published results and long-term archiving.	`"1.0.0"`	High, citable release snapshot.
`github`	Aligning data with a GitHub release tag or branch.	`"v1.0.0"`	High for tags, lower for branches.
`commit`	Exact development snapshot.	full commit SHA	Highest for source-state reproducibility.
`latest`	Exploration during active development.	`"latest"`	Low unless the resolved commit is recorded.

Troubleshooting#

Installation cannot find optional packages#

Upgrade packaging tools first:

python -m pip install --upgrade pip setuptools wheel

Zenodo downloads fail or are rate-limited#

Prefer GitHub-hosted release assets for routine experiments. For Zenodo-based loading, check network access and use a persistent cache so assets are not repeatedly requested.

Autodoc cannot import `synrxn` during documentation builds#

Install the package in editable mode from the project root before building docs:

python -m pip install -e ".[dev]"
cd doc
sphinx-build -b html . _build/html

Next steps#

Data Concept explains task folders, schema conventions, and source modes.
Data Records lists curated benchmark records and row counts.
Tutorials and Examples gives complete workflows for released data, pinned commits, splitting, and rebuilds.
API Reference documents DataLoader and split utilities.