API Reference#

This section documents the public Python API exposed by SynRXN. For practical usage examples, start with Getting Started, then see Tutorials and Examples.

Data access: Load curated benchmark tables and discover available records.
Splitting: Create reproducible repeated k-fold and train/validation/test splits.
CLI: Inspect developer commands for rebuilding datasets and manifests.

Data access: `synrxn.data`#

The synrxn.data module is the main entry point for accessing curated datasets and their manifests.

Typical responsibilities of synrxn.data.DataLoader include:

listing datasets for a task family,
resolving data sources such as Zenodo, GitHub tags, exact commits, or latest,
managing local cache directories,
verifying or tracking downloaded assets when metadata is available,
returning records as pandas DataFrames.

synrxn.data: Data access utilities for SynRXN datasets.

class synrxn.data.DataLoader(task, version=None, cache_dir=PosixPath('/home/docs/.cache/synrxn'), timeout=20, user_agent='SynRXN-DataLoader/2.0', max_workers=6, gh_ref=None, gh_enable=False, source='zenodo', resolve_on_init=False, verify_checksum=True, cache_record_index=True, force_record_id=None)

Bases: object

DataLoader for SynRXN data stored under Data/<task>/<name>.csv(.gz).

The loader supports three sources:

'zenodo': pulls files from the Zenodo record for the project’s concept DOI
'github': pulls files from a GitHub release tag or branch
'commit': pulls files from a specific commit SHA (version='latest' resolves the tip SHA)

Parameters:

task (str) – Subfolder name under the repository’s Data/ directory (e.g. "class", "aam").
version (Optional[str]) –
Source-dependent version identifier:
- If source=='zenodo': Zenodo version string (e.g. "1.0.0") or None for the latest record.
- If source=='github': GitHub release tag (e.g. "v1.0.0" or "1.0.0"). 'latest' can be used and will resolve to the latest release tag (when available) and otherwise fall back to a branch ref.
- If source=='commit': commit SHA (40-char) or the string 'latest' to resolve the tip SHA of a branch.
cache_dir (Optional[pathlib.Path]) – Directory used to cache downloaded gz payloads, Zenodo record indices, and GitHub latest lookups. Defaults to ~/.cache/synrxn.
timeout (int) – HTTP timeout in seconds for requests. Default is 20.
user_agent (str) – User-Agent string for outbound HTTP requests.
max_workers (int) – Maximum worker threads for load_many(). Default 6.
gh_ref (Optional[str]) – Optional explicit GitHub ref (branch name) used when resolving latest (commit or release fallback). If omitted, the repo default branch is used.
gh_enable (bool) – If True, enables GitHub-based retrieval. Required when source is 'github' or 'commit'.
source (str) – One of {'zenodo', 'github', 'commit'}.
resolve_on_init (bool) – If True and source=='zenodo', resolves the Zenodo record and file index during initialization.
verify_checksum (bool) – If True, verifies Zenodo-provided checksums for downloads and cached payloads.
cache_record_index (bool) – If True, the Zenodo record file index will be cached in cache_dir.
force_record_id (Optional[int]) – If provided, this numeric Zenodo record id will be used directly (bypassing version lookup).

Raises:

ValueError – If source is not one of the supported values, or if GitHub retrieval is requested but gh_enable is False.
RuntimeError – If resolving version='latest' for source='commit' fails to return a commit SHA.
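The verify_checksum option is described above only at the parameter level. As a rough illustration of what verifying a cached payload against a published checksum can look like, the sketch below assumes the checksum arrives as an "algorithm:hexdigest" string. The helper name verify_payload and that string format are assumptions made for illustration; they are not part of the SynRXN API.

```python
import hashlib
from pathlib import Path


def verify_payload(path: Path, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` matches an 'algo:hexdigest' checksum.

    Hypothetical helper: the 'algo:hexdigest' format is an assumption.
    """
    algo, _, digest = expected.partition(":")
    hasher = hashlib.new(algo)
    with path.open("rb") as fh:
        # Read in chunks so large payloads are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == digest.lower()
```

A mismatch would normally be treated as a corrupted or stale cache entry and trigger a fresh download.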

Examples

Zenodo (specific version):

from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())

GitHub (release tag):

from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="github",
    version="v1.0.0",
    gh_enable=True,
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())

Commit (explicit commit SHA):

from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="commit",
    version="3e1612e2199e8b0e369fce3ed9aff3dda68e4c32",
    gh_enable=True,
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())

Commit (resolve latest tip of a branch):

from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="commit",
    version="latest",
    gh_enable=True,
    gh_ref="main",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
# dl.version will be replaced with the resolved 40-char SHA
print("resolved sha:", dl.version)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())

Notes

version='latest' (commit or release) is non-deterministic; record the resolved value (dl.version) if you need reproducible results.
For heavy GitHub API usage, provide an authenticated session (add an Authorization token to dl._session.headers).
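One way to act on the first note is to write the resolved version into the experiment log next to the requested one. The sketch below is only an illustration: provenance_record and its field names are hypothetical, and in real use the resolved value would be read from dl.version after the loader is constructed.

```python
import json
from datetime import datetime, timezone


def provenance_record(source: str, requested: str, resolved: str) -> dict:
    """Build a small, JSON-serializable provenance entry (hypothetical helper)."""
    return {
        "source": source,
        "requested_version": requested,
        "resolved_version": resolved,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# In real use, the last argument would be dl.version rather than a literal.
entry = provenance_record(
    "commit", "latest", "3e1612e2199e8b0e369fce3ed9aff3dda68e4c32"
)
print(json.dumps(entry, indent=2))
```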

find_zenodo_keys(term)

Public helper to search keys in the currently-loaded Zenodo file index.

Resolves the record index if not already loaded. Returns an empty list if no client/index is available or if an error occurs.

Parameters: term (str)
Return type: List[str]

available_names(refresh=False)

Parameters: refresh (bool)
Return type: List[str]

load(name, use_cache=True, dtype=None, **pd_kw)

Parameters:

name (str)
use_cache (bool)
dtype (Dict[str, object] | None)

Return type:

DataFrame

load_many(names, use_cache=True, dtype=None, parallel=True, **pd_kw)

Parameters:

names (Iterable[str])
use_cache (bool)
dtype (Dict[str, object] | None)
parallel (bool)

Return type:

Dict[str, DataFrame]
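load_many() takes a parallel flag, and the constructor exposes max_workers. The general pattern of fetching several named tables through a thread pool can be sketched as follows. This is not SynRXN's implementation: load_many_sketch and fetch_one are stand-ins, and the built-in len is used as a dummy fetcher so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def load_many_sketch(
    names: Iterable[str],
    fetch_one: Callable[[str], T],
    max_workers: int = 6,
    parallel: bool = True,
) -> Dict[str, T]:
    """Fetch every name and return a {name: result} mapping."""
    names = list(names)
    if not parallel or len(names) <= 1:
        return {name: fetch_one(name) for name in names}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each name with its result.
        return dict(zip(names, pool.map(fetch_one, names)))


# Stand-in fetcher; a real one would download and parse a table.
result = load_many_sketch(["a", "bb", "ccc"], fetch_one=len)
print(result)  # {'a': 1, 'bb': 2, 'ccc': 3}
```

Threads suit this kind of workload because downloads are I/O-bound.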

Minimal usage#

from pathlib import Path
from synrxn.data import DataLoader

loader = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)

print(loader.available_names())
df = loader.load("schneider_b")

Splitting utilities: `synrxn.split.repeated_kfold`#

The synrxn.split.repeated_kfold module provides tools for reproducible repeated k-fold splitting of datasets.

Typical use cases include:

generating repeated k-fold splits for property and classification tasks,
deriving train/validation/test partitions via a user-specified (train, val, test) ratio,
preserving label distributions through stratified splitting when supported,
exporting/importing split indices for exact reproducibility.

class synrxn.split.repeated_kfold.SplitIndices(repeat, fold, train_idx, val_idx, test_idx)

Bases: object

Container for indices of a single (repeat, fold) split.

Parameters:

repeat (int) – Repeat index (0-based).
fold (int) – Fold index within the repeat (0-based).
train_idx (ndarray) – Numpy array of training row indices.
val_idx (ndarray) – Numpy array of validation row indices.
test_idx (ndarray) – Numpy array of test row indices.

repeat: int

fold: int

train_idx: ndarray

val_idx: ndarray

test_idx: ndarray

class synrxn.split.repeated_kfold.RepeatedKFoldsSplitter(n_splits=5, n_repeats=1, ratio=(8, 1, 1), shuffle=True, random_state=None)

Bases: object

Repeated K-Fold splitter producing (train, val, test) for each outer fold.

sklearn-compatible split(X, y=None, groups=None, stratify=None) -> yields (train_idx, holdout_idx) where holdout_idx == val + test (useful for sklearn cross_validate).
Use split_with_val(X, y=None, groups=None, stratify=None) to receive (train, val, test) triples as SplitIndices objects (repeat, fold, train_idx, val_idx, test_idx).

The ratio argument is a (train, val, test) tuple that controls the proportion used when splitting the outer holdout into validation and test sets. For example, ratio=(8,1,1) means the holdout is split val:test = 1:1.
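The arithmetic implied by this description can be worked through with a small sketch. It assumes the sample count divides evenly by n_splits and that the validation share is rounded down; the helper split_sizes is hypothetical and the library's own rounding may differ.

```python
from typing import Tuple


def split_sizes(
    n_samples: int, n_splits: int, ratio: Tuple[int, int, int]
) -> Tuple[int, int, int]:
    """Approximate (train, val, test) sizes for one outer fold.

    Assumes n_samples divides evenly by n_splits and that the validation
    share is rounded down; the library's rounding may differ.
    """
    _, val_part, test_part = ratio
    holdout = n_samples // n_splits  # one outer fold is held out
    val = holdout * val_part // (val_part + test_part)
    test = holdout - val
    train = n_samples - holdout
    return train, val, test


print(split_sizes(500, 5, (8, 1, 1)))  # (400, 50, 50)
print(split_sizes(500, 5, (8, 1, 3)))  # (400, 25, 75)
```

Under this reading, the size of the training set follows from n_splits, and only the val and test entries of ratio change how the holdout is divided.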

Notes on stratification:

Pass y (array-like) to stratify by labels (sklearn-conventional).
Alternatively pass stratify to split(…) or split_with_val(…). stratify may be:
- a column name (str) when X is a pandas.DataFrame, or
- an array-like of the same length as X.
If both y and stratify are provided, stratify takes precedence.
If stratification is requested but not possible (e.g., a class has fewer than n_splits members), the splitter falls back to non-stratified KFold and emits a warning.
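The fallback condition in the last note can be checked ahead of time. The sketch below is a minimal pre-check built on that description (every class needs at least n_splits members); can_stratify is a hypothetical helper, not a SynRXN function, and the splitter may apply further conditions.

```python
from collections import Counter
from typing import Hashable, Sequence


def can_stratify(labels: Sequence[Hashable], n_splits: int) -> bool:
    """True if every class has at least n_splits members."""
    counts = Counter(labels)
    return bool(counts) and min(counts.values()) >= n_splits


labels = ["a"] * 10 + ["b"] * 3
print(can_stratify(labels, n_splits=3))  # True
print(can_stratify(labels, n_splits=5))  # False: class "b" has only 3 members
```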

Example (Sphinx-style):

>>> from sklearn.datasets import make_classification
>>> from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter
>>> X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)
>>> splitter = RepeatedKFoldsSplitter(n_splits=5, n_repeats=2, ratio=(8, 1, 1), shuffle=True, random_state=1)
>>> # sklearn-style outer cross-validation (y is used for stratification)
>>> for train_idx, hold_idx in splitter.split(X, y):
...     print(len(train_idx), len(hold_idx))
>>> # explicit train/val/test
>>> for s in splitter.split_with_val(X, stratify=y):
...     X_train, X_val, X_test = X[s.train_idx], X[s.val_idx], X[s.test_idx]

Parameters:

n_splits (int) – Number of outer folds (k).
n_repeats (int) – Number of repeats (how many times to reshuffle-and-split).
ratio (Tuple[int, int, int]) – Tuple of three ints (train, val, test) like (8,1,1).
shuffle (bool) – Whether to shuffle before splitting each repeat.
random_state (Optional[int]) – Base random state for reproducible repeats.

get_n_splits(X=None, y=None, groups=None)

Return how many (train, holdout) splits will be produced.

Parameters:

X – Feature matrix or dataframe (ignored for counting).
y – Labels (ignored for counting).
groups – Groups (ignored for counting).

Returns:

Total number of outer splits (n_splits * n_repeats).

Return type:

int

split(X, y=None, groups=None, stratify=None)

sklearn-compatible generator yielding (train_idx, holdout_idx) where holdout_idx == val + test.

The stratify argument may be:

a column name (str) if X is a pandas.DataFrame, or
an array-like of length n_samples.

If stratify is provided, it is used in preference to y.

Parameters:

X (Any) – Feature matrix or pandas.DataFrame.
y (Any | None) – Labels (array-like). Used for stratification if stratify is None.
groups (Any | None) – Group labels for GroupKFold (optional).
stratify (str | Any | None) – Column name or array-like used to stratify folds (optional).

Yields:

Tuples (train_idx, holdout_idx) for each repeat/fold.

Return type:

Iterable[Tuple[ndarray, ndarray]]

split_with_val(X, y=None, groups=None, stratify=None)

Yield SplitIndices objects containing (train, val, test) indices.

Parameters:

X (Any) – Feature matrix or pandas.DataFrame.
y (Any | None) – Labels (array-like). Used for stratification if stratify is None.
groups (Any | None) – Group labels for GroupKFold (optional).
stratify (str | Any | None) – Column name or array-like used to stratify folds (optional).

Yields:

SplitIndices objects for each repeat and fold.

Return type:

Iterable[SplitIndices]

prepare_splits(X, y=None, groups=None, stratify=None)

Compute and store all splits immediately (equivalent to iterating split(…) fully). After calling this, self._splits is populated and get_split(…) may be used.

Parameters:

X (Any)
y (Any | None)
groups (Any | None)
stratify (str | Any | None)

Return type:

None

get_split(repeat=0, fold=0, as_frame=False)

Retrieve either index arrays (train_idx, val_idx, test_idx) or slices of the originally provided X (if it was a DataFrame or array-like) when as_frame=True.

Parameters:

repeat (int) – Repeat index (0-based).
fold (int) – Fold index within the repeat (0-based).
as_frame (bool) – If True, return slices of the original X (DataFrame or ndarray) rather than indices.

Returns:

Tuple of (train, val, test) either as index arrays or as slices of X.

Raises:

RuntimeError – If no splits have been computed yet.
IndexError – If the requested (repeat, fold) does not exist.

iter_splits()

Iterate over computed splits in order (repeat major, fold minor).

Returns: Iterator of SplitIndices objects.
Return type: Iterator[SplitIndices]
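"Repeat major, fold minor" means the fold index varies fastest. The short sketch below spells out that ordering for a small configuration and shows how a (repeat, fold) pair maps to a flat position; flat_index is a hypothetical helper, not part of the splitter.

```python
from itertools import product

n_repeats, n_splits = 2, 3

# Repeat-major, fold-minor: the fold index varies fastest.
order = list(product(range(n_repeats), range(n_splits)))
print(order)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]


def flat_index(repeat: int, fold: int, n_splits: int) -> int:
    """Position of (repeat, fold) in that ordering."""
    return repeat * n_splits + fold


print(flat_index(1, 2, n_splits))  # 5
```

The length of this ordering is n_splits * n_repeats, which matches the count reported by get_n_splits().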

property splits: List[SplitIndices] – Return a copy of computed splits.

property n_generated_splits: int – Number of generated (repeat, fold) splits.

Minimal usage#

from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter

splitter = RepeatedKFoldsSplitter(
    n_splits=5,
    n_repeats=3,
    ratio=(8, 1, 1),
    random_state=2026,
)

# df is a pandas DataFrame, e.g. one returned by DataLoader.load()
split_indices = list(splitter.split_with_val(df))

Command-line interface: `synrxn.__main__`#

The synrxn.__main__ module exposes the command-line interface used when invoking SynRXN as a module.

python -m synrxn --help
python -m synrxn build --help

The build subcommand is intended for maintainers and advanced users who need to rebuild datasets or manifests from original sources.

SynRXN CLI (synrxn/__main__.py)

Typical usage inside the repository (no install):

# Show help
python -m synrxn.__main__ --help

# Build RBL dataset
python -m synrxn.__main__ build --rbl

# Build property dataset
python -m synrxn.__main__ build --property

# Run multiple builders in sequence
python -m synrxn.__main__ build --aam --classification --property

# Forward extra args to builders (after --)
python -m synrxn.__main__ build --rbl -- --out-dir Data/rbl

# CLI-level dry-run: only print commands, do not execute
python -m synrxn.__main__ build --rbl --dry-run

# Ask builders themselves not to save (they must support --dry-run)
python -m synrxn.__main__ build --rbl --no-save
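Forwarding extra arguments after a bare separator is a common CLI convention. A minimal sketch of splitting an argument list at that separator is shown below; split_forwarded is a hypothetical helper, and how the SynRXN CLI actually parses forwarded arguments is not documented here.

```python
from typing import List, Sequence, Tuple


def split_forwarded(argv: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first bare '--' into (cli_args, forwarded_args)."""
    argv = list(argv)
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return argv, []


cli, forwarded = split_forwarded(["build", "--rbl", "--", "--out-dir", "Data/rbl"])
print(cli)        # ['build', '--rbl']
print(forwarded)  # ['--out-dir', 'Data/rbl']
```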

synrxn.__main__.configure_logging(verbose)

Configure a simple stream logger.

Parameters: verbose (bool)
Return type: None

synrxn.__main__.find_repo_root(start)

Try to detect the repository root starting from the given path.

Strategy:
- Walk upwards until we find a directory that contains a “script” subdir.
- Fall back to “start” if nothing is found.

Parameters: start (Path)
Return type: Path
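The documented strategy (walk upwards until a directory with a "script" subdirectory is found, otherwise fall back to the start path) can be sketched with pathlib. This is an illustration of the described behaviour, not the function's source: find_root_sketch is a hypothetical name, and resolving the start path first is an assumption.

```python
from pathlib import Path


def find_root_sketch(start: Path, marker: str = "script") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()  # assumption: work with an absolute path
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    # Nothing found: fall back to the starting path.
    return start
```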

synrxn.__main__.find_script_path(repo_root, override, target)

Resolve the script path for a given target (possibly overridden).

Parameters:

repo_root (Path)
override (str | None)
target (str)

Return type:

Path

synrxn.__main__.build_command(python_exe, script_path, forwarded_args)

Build the command list for subprocess.

Parameters:

python_exe (str)
script_path (Path)
forwarded_args (Sequence[str])

Return type:

List[str]

synrxn.__main__.run_subprocess(cmd, cwd, dry_run=False)

Run subprocess and return exit code.

If dry_run is True, only print the command that would be run.

Parameters:

cmd (List[str])
cwd (Path)
dry_run (bool)

Return type:

int
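A dry-run wrapper of this shape is straightforward to sketch with the standard library. The version below illustrates the documented contract (return the exit code, or only print the command when dry_run is set). The name run_sketch and the choice to return 0 for a dry run are assumptions.

```python
import shlex
import subprocess
import sys
from pathlib import Path
from typing import List


def run_sketch(cmd: List[str], cwd: Path, dry_run: bool = False) -> int:
    """Run `cmd` in `cwd` and return its exit code; only print it if dry_run."""
    if dry_run:
        print("[dry-run]", shlex.join(cmd))
        return 0  # assumption: a dry run reports success
    completed = subprocess.run(cmd, cwd=str(cwd), check=False)
    return completed.returncode


# A child process that exits with status 3, to show the code is passed through.
code = run_sketch([sys.executable, "-c", "raise SystemExit(3)"], Path.cwd())
print(code)  # 3
```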

synrxn.__main__.parse_args(argv=None)

Parameters: argv (List[str] | None)
Return type: Namespace

synrxn.__main__.main(argv=None)

Parameters: argv (List[str] | None)
Return type: int

API usage guidance#

Use synrxn.data.DataLoader for all dataset access instead of hardcoding local paths.
Prefer source="zenodo" for published experiments.
Prefer exact commit SHAs for development snapshots.
Export split indices whenever generated splits are part of a benchmark.
Keep package version, data version, and split settings in experiment logs.
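For the last two points, one simple option is to store split indices together with the settings that produced them in a single JSON file. The sketch below assumes indices have already been converted to plain lists (for example with ndarray.tolist()). The function names and the file layout are hypothetical, not a SynRXN format.

```python
import json
from pathlib import Path
from typing import Dict, List


def export_splits(splits: List[Dict], path: Path, meta: Dict) -> None:
    """Write split indices plus experiment metadata to a JSON file."""
    payload = {"meta": meta, "splits": splits}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def import_splits(path: Path) -> Dict:
    """Read back a file written by export_splits."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Each entry in splits could carry the repeat, fold, train_idx, val_idx, and test_idx fields of a SplitIndices object, and meta could hold the package version, data version, and splitter settings.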
