API Reference#

This section documents the public Python API exposed by SynRXN. For practical usage examples, start with the Getting Started and the Tutorials and Examples pages.

Data access: synrxn.data#

The synrxn.data module is the main entry point for accessing curated datasets and their manifests.

Typical responsibilities of synrxn.data.DataLoader include:

  • listing datasets for a task family,

  • resolving data sources such as Zenodo records, GitHub tags, exact commits, or the latest available version,

  • managing local cache directories,

  • verifying or tracking downloaded assets when metadata is available,

  • returning records as pandas DataFrames.

synrxn.data: Data access utilities for SynRXN datasets.

class synrxn.data.DataLoader(task, version=None, cache_dir=PosixPath('/home/docs/.cache/synrxn'), timeout=20, user_agent='SynRXN-DataLoader/2.0', max_workers=6, gh_ref=None, gh_enable=False, source='zenodo', resolve_on_init=False, verify_checksum=True, cache_record_index=True, force_record_id=None)[source]#

Bases: object

DataLoader for SynRXN data stored under Data/<task>/<name>.csv(.gz).

The loader supports three sources:
  • 'zenodo': pulls files from the Zenodo record for the project’s concept DOI

  • 'github': pulls files from a GitHub release tag or branch

  • 'commit': pulls files from a specific commit SHA (version='latest' resolves the tip SHA)

Parameters:
  • task (str) – Subfolder name under the repository’s Data/ directory (e.g. "class", "aam").

  • version (Optional[str]) –

    Source-dependent version identifier:

    • If source=='zenodo': Zenodo version string (e.g. "1.0.0") or None for the latest record.

    • If source=='github': GitHub release tag (e.g. "v1.0.0" or "1.0.0"). 'latest' can be used and will resolve to the latest release tag (when available) and otherwise fall back to a branch ref.

    • If source=='commit': commit SHA (40-char) or the string 'latest' to resolve the tip SHA of a branch.

  • cache_dir (Optional[pathlib.Path]) – Directory used to cache downloaded gz payloads, Zenodo record indices, and GitHub latest lookups. Defaults to ~/.cache/synrxn.

  • timeout (int) – HTTP timeout in seconds for requests. Default is 20.

  • user_agent (str) – User-Agent string for outbound HTTP requests.

  • max_workers (int) – Maximum worker threads for load_many(). Default is 6.

  • gh_ref (Optional[str]) – Optional explicit GitHub ref (branch name) used when resolving latest (commit or release fallback). If omitted, the repo default branch is used.

  • gh_enable (bool) – If True, enables GitHub-based retrieval. Required when source is 'github' or 'commit'.

  • source (str) – One of {'zenodo', 'github', 'commit'}.

  • resolve_on_init (bool) – If True and source=='zenodo', resolves the Zenodo record and file index during initialization.

  • verify_checksum (bool) – If True, verifies Zenodo-provided checksums for downloads and cached payloads.

  • cache_record_index (bool) – If True, the Zenodo record file index will be cached in cache_dir.

  • force_record_id (Optional[int]) – If provided, this numeric Zenodo record id will be used directly (bypassing version lookup).
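
The verify_checksum behaviour can be pictured with the sketch below. It is not SynRXN's implementation: the helper name checksum_matches and the "algo:hexdigest" string format are assumptions made for illustration (Zenodo file metadata commonly reports checksums in the form "md5:<hex>").

```python
import hashlib


def checksum_matches(payload: bytes, expected: str) -> bool:
    """Return True if payload hashes to the expected 'algo:hexdigest' string.

    Hypothetical helper for illustration only.
    """
    algo, _, hexdigest = expected.partition(":")
    if not hexdigest:
        # No "algo:" prefix: treat the value as a bare MD5 digest (assumption).
        algo, hexdigest = "md5", expected
    digest = hashlib.new(algo, payload).hexdigest()
    return digest == hexdigest.lower()


# Made-up payload standing in for a downloaded file.
payload = b"rxn,label\nCCO>>CC=O,1\n"
expected = "md5:" + hashlib.md5(payload).hexdigest()
print(checksum_matches(payload, expected))
```

A cached payload could be re-checked the same way before it is reused, which is presumably why the option covers both downloads and cached files.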

Raises:
  • ValueError – If source is not one of the supported values, or if GitHub retrieval is requested but gh_enable is False.

  • RuntimeError – If resolving version='latest' for source='commit' fails to return a commit SHA.
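
The ValueError conditions above amount to two checks, sketched here under stated assumptions. The function name check_source and the error messages are hypothetical; only the rules themselves (three valid sources, gh_enable required for 'github' and 'commit') come from the documentation.

```python
VALID_SOURCES = {"zenodo", "github", "commit"}
GITHUB_SOURCES = {"github", "commit"}


def check_source(source: str, gh_enable: bool) -> None:
    """Raise ValueError for an unsupported or not-enabled source (sketch)."""
    if source not in VALID_SOURCES:
        raise ValueError(
            f"source must be one of {sorted(VALID_SOURCES)}, got {source!r}"
        )
    if source in GITHUB_SOURCES and not gh_enable:
        raise ValueError(f"source={source!r} requires gh_enable=True")


check_source("zenodo", gh_enable=False)  # passes silently
check_source("commit", gh_enable=True)   # passes silently
```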

Examples

Zenodo (specific version):

from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())

GitHub (release tag):

from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="github",
    version="v1.0.0",
    gh_enable=True,
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())

Commit (explicit commit SHA):

from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="commit",
    version="3e1612e2199e8b0e369fce3ed9aff3dda68e4c32",
    gh_enable=True,
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())

Commit (resolve latest tip of a branch):

from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="commit",
    version="latest",
    gh_enable=True,
    gh_ref="main",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
# dl.version will be replaced with the resolved 40-char SHA
print("resolved sha:", dl.version)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())

Notes

  • version='latest' (commit or release) is non-deterministic; record the resolved value (dl.version) if you need reproducible results.

  • For heavy GitHub API usage, provide an authenticated session (add an Authorization token to dl._session.headers).
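
One way to follow the first note is to write the resolved version to a small provenance file next to the experiment outputs. The sketch below is an illustration, not part of SynRXN: write_provenance and its field names are hypothetical, and in real use the resolved value would be read from dl.version.

```python
import json
import tempfile
from pathlib import Path


def write_provenance(path: Path, *, source: str, requested: str, resolved: str) -> dict:
    """Store which data version was requested and what it resolved to."""
    record = {
        "source": source,
        "requested_version": requested,
        "resolved_version": resolved,
    }
    path.write_text(json.dumps(record, indent=2))
    return record


# The SHA is the one used in the commit example above; with a live loader
# it would come from dl.version after version="latest" has been resolved.
log_path = Path(tempfile.mkdtemp()) / "data_provenance.json"
record = write_provenance(
    log_path,
    source="commit",
    requested="latest",
    resolved="3e1612e2199e8b0e369fce3ed9aff3dda68e4c32",
)
print(log_path.read_text())
```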

find_zenodo_keys(term)[source]#

Public helper to search keys in the currently-loaded Zenodo file index.

Resolves the record index if not already loaded. Returns an empty list if no client/index is available or if an error occurs.

Parameters:

term (str)

Return type:

List[str]
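
The matching rule used by find_zenodo_keys is not documented here. The sketch below assumes a case-insensitive substring match over a list of file keys and mirrors the documented "empty list on error" behaviour; find_keys and the example keys are hypothetical.

```python
from typing import Iterable, List, Optional


def find_keys(index: Optional[Iterable[str]], term: str) -> List[str]:
    """Return keys containing term, or [] if the index is missing or broken."""
    if index is None:
        return []
    try:
        needle = term.lower()
        return [key for key in index if needle in key.lower()]
    except Exception:
        # Mirror the documented behaviour: swallow errors, return nothing.
        return []


# Hypothetical index following the Data/<task>/<name>.csv(.gz) layout.
example_index = [
    "Data/classification/schneider_b.csv.gz",
    "Data/aam/example.csv.gz",
]
hits = find_keys(example_index, "Schneider")
print(hits)
```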

available_names(refresh=False)[source]#
Parameters:

refresh (bool)

Return type:

List[str]

load(name, use_cache=True, dtype=None, **pd_kw)[source]#
Parameters:
  • name

  • use_cache (default True)

  • dtype (default None)

  • **pd_kw

Return type:

DataFrame

load_many(names, use_cache=True, dtype=None, parallel=True, **pd_kw)[source]#
Parameters:
  • names

  • use_cache (default True)

  • dtype (default None)

  • parallel (default True)

  • **pd_kw

Return type:

Dict[str, DataFrame]
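
The relationship between load_many(), parallel, and max_workers can be sketched with a thread pool. This is a guess at the general shape, not the library's code: load_many_sketch and fake_load are hypothetical, and the stand-in loader returns strings instead of DataFrames so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def load_many_sketch(
    names: Iterable[str],
    load_one: Callable[[str], object],
    max_workers: int = 6,
    parallel: bool = True,
) -> Dict[str, object]:
    """Load several named datasets, optionally with worker threads."""
    names = list(names)
    if not parallel or len(names) <= 1:
        return {name: load_one(name) for name in names}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input names.
        return dict(zip(names, pool.map(load_one, names)))


def fake_load(name: str) -> str:
    """Stand-in for a single-dataset load; a real loader would fetch a CSV."""
    return f"<table for {name}>"


# "another_set" is a made-up dataset name.
tables = load_many_sketch(["schneider_b", "another_set"], fake_load)
print(tables)
```

Threads suit this workload because each load is dominated by HTTP and disk I/O rather than CPU work.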

Minimal usage#

from pathlib import Path
from synrxn.data import DataLoader

loader = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)

print(loader.available_names())
df = loader.load("schneider_b")

Splitting utilities: synrxn.split.repeated_kfold#

The synrxn.split.repeated_kfold module provides tools for reproducible repeated k-fold splitting of datasets.

Typical use cases include:

  • generating repeated k-fold splits for property and classification tasks,

  • deriving train/validation/test partitions via a user-specified (train, val, test) ratio,

  • preserving label distributions through stratified splitting when supported,

  • exporting/importing split indices for exact reproducibility.

class synrxn.split.repeated_kfold.SplitIndices(repeat, fold, train_idx, val_idx, test_idx)[source]#

Bases: object

Container for indices of a single (repeat, fold) split.

Parameters:
  • repeat (int) – Repeat index (0-based).

  • fold (int) – Fold index within the repeat (0-based).

  • train_idx (ndarray) – Numpy array of training row indices.

  • val_idx (ndarray) – Numpy array of validation row indices.

  • test_idx (ndarray) – Numpy array of test row indices.

repeat: int#
fold: int#
train_idx: ndarray#
val_idx: ndarray#
test_idx: ndarray#
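
Because each split comes from a k-fold scheme, the three index arrays of one SplitIndices would be expected to be disjoint and to cover every row together. The check below is a sketch under that assumption; is_clean_partition is a hypothetical helper that accepts any sequences of integers, including NumPy arrays.

```python
from typing import Iterable


def is_clean_partition(
    train_idx: Iterable[int],
    val_idx: Iterable[int],
    test_idx: Iterable[int],
    n_samples: int,
) -> bool:
    """True if the three index sets are disjoint and cover range(n_samples)."""
    flat = [int(i) for part in (train_idx, val_idx, test_idx) for i in part]
    # Equal length plus equal sets rules out both duplicates and gaps.
    return len(flat) == n_samples and set(flat) == set(range(n_samples))


print(is_clean_partition([0, 1, 2, 3, 4, 5, 6, 7], [8], [9], n_samples=10))
```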
class synrxn.split.repeated_kfold.RepeatedKFoldsSplitter(n_splits=5, n_repeats=1, ratio=(8, 1, 1), shuffle=True, random_state=None)[source]#

Bases: object

Repeated K-Fold splitter producing (train, val, test) for each outer fold.

  • sklearn-compatible split(X, y=None, groups=None, stratify=None) -> yields (train_idx, holdout_idx) where holdout_idx == val + test (useful for sklearn cross_validate).

  • Use split_with_val(X, y=None, groups=None, stratify=None) to receive (train, val, test) triples as SplitIndices objects (repeat, fold, train_idx, val_idx, test_idx).

The ratio argument is a (train, val, test) tuple that controls the proportion used when splitting the outer holdout into validation and test sets. For example, ratio=(8,1,1) means the holdout is split val:test = 1:1.
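
As a worked example of the ratio rule: with 500 samples and n_splits=5, each outer holdout has 100 rows, and ratio=(8, 1, 1) divides it into 50 validation and 50 test rows. The sketch below shows that arithmetic; split_holdout is hypothetical, and the rounding rule is an assumption rather than the library's documented behaviour.

```python
from typing import List, Sequence, Tuple


def split_holdout(
    holdout_idx: Sequence[int], ratio: Tuple[int, int, int]
) -> Tuple[List[int], List[int]]:
    """Divide holdout indices into (val, test) using the val:test weights."""
    _, val_w, test_w = ratio
    if val_w + test_w <= 0:
        raise ValueError("val and test weights must sum to a positive number")
    # Rounding to the nearest integer is an assumption for this sketch.
    n_val = round(len(holdout_idx) * val_w / (val_w + test_w))
    holdout = list(holdout_idx)
    return holdout[:n_val], holdout[n_val:]


holdout = list(range(400, 500))  # one outer fold of a 500-row dataset
val, test = split_holdout(holdout, (8, 1, 1))
print(len(val), len(test))
```

Note that only the val and test weights matter at this step; the train weight is already fixed by the outer k-fold.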

Notes on stratification:

  • Pass y (array-like) to stratify by labels (sklearn-conventional).

  • Alternatively pass stratify to split(…) or split_with_val(…). stratify may be:

    • a column name (str) when X is a pandas.DataFrame, or

    • an array-like of the same length as X.

  • If both y and stratify are provided, stratify takes precedence.

  • If stratification is requested but not possible (e.g., a class has fewer than n_splits members), the splitter falls back to non-stratified KFold and emits a warning.
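
The fallback condition in the last note can be checked before splitting. The sketch below only illustrates the rule "every class needs at least n_splits members"; can_stratify and the warning text are hypothetical and not the splitter's own code.

```python
import warnings
from collections import Counter
from typing import Hashable, Sequence


def can_stratify(labels: Sequence[Hashable], n_splits: int) -> bool:
    """Return False (with a warning) if some class is too small to stratify."""
    counts = Counter(labels)
    if not counts:
        return False
    label, smallest = min(counts.items(), key=lambda item: item[1])
    if smallest < n_splits:
        warnings.warn(
            f"class {label!r} has {smallest} members (< n_splits={n_splits}); "
            "falling back to non-stratified splitting",
            UserWarning,
        )
        return False
    return True


labels = ["a"] * 10 + ["b"] * 3
print(can_stratify(labels, n_splits=3))
```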

Example (Sphinx-style):

>>> from sklearn.datasets import make_classification
>>> from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter
>>> X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)
>>> splitter = RepeatedKFoldsSplitter(n_splits=5, n_repeats=2, ratio=(8,1,1), shuffle=True, random_state=1)
>>> # sklearn-style outer cross-validation (y is used for stratification)
>>> for train_idx, hold_idx in splitter.split(X, y):
...     print(len(train_idx), len(hold_idx))
>>> # explicit train/val/test
>>> for s in splitter.split_with_val(X, stratify=y):
...     X_train, X_val, X_test = X[s.train_idx], X[s.val_idx], X[s.test_idx]
Parameters:
  • n_splits (int) – Number of outer folds (k).

  • n_repeats (int) – Number of repeats (how many times to reshuffle-and-split).

  • ratio (Tuple[int, int, int]) – Tuple of three ints (train, val, test) like (8,1,1).

  • shuffle (bool) – Whether to shuffle before splitting each repeat.

  • random_state (Optional[int]) – Base random state for reproducible repeats.

get_n_splits(X=None, y=None, groups=None)[source]#

Return how many (train, holdout) splits will be produced.

Parameters:
  • X – Feature matrix or dataframe (ignored for counting).

  • y – Labels (ignored for counting).

  • groups – Groups (ignored for counting).

Returns:

Total number of outer splits (n_splits * n_repeats).

Return type:

int

split(X, y=None, groups=None, stratify=None)[source]#

sklearn-compatible generator yielding (train_idx, holdout_idx) where holdout_idx == val + test.

The stratify argument may be:
  • a column name (str) if X is a pandas.DataFrame, or

  • an array-like of length n_samples.

If stratify is provided, it is used in preference to y.

Parameters:
  • X (Any) – Feature matrix or pandas.DataFrame.

  • y (Any | None) – Labels (array-like). Used for stratification if stratify is None.

  • groups (Any | None) – Group labels for GroupKFold (optional).

  • stratify (str | Any | None) – Column name or array-like used to stratify folds (optional).

Yields:

Tuples (train_idx, holdout_idx) for each repeat/fold.

Return type:

Iterable[Tuple[ndarray, ndarray]]

split_with_val(X, y=None, groups=None, stratify=None)[source]#

Yield SplitIndices objects containing (train, val, test) indices.

Parameters:
  • X (Any) – Feature matrix or pandas.DataFrame.

  • y (Any | None) – Labels (array-like). Used for stratification if stratify is None.

  • groups (Any | None) – Group labels for GroupKFold (optional).

  • stratify (str | Any | None) – Column name or array-like used to stratify folds (optional).

Yields:

SplitIndices objects for each repeat and fold.

Return type:

Iterable[SplitIndices]

prepare_splits(X, y=None, groups=None, stratify=None)[source]#

Compute and store all splits immediately (equivalent to iterating split(…) fully). After calling this, self._splits is populated and get_split(…) may be used.

Parameters:
  • X

  • y (default None)

  • groups (default None)

  • stratify (default None)

Return type:

None

get_split(repeat=0, fold=0, as_frame=False)[source]#

Retrieve either index arrays (train_idx, val_idx, test_idx) or slices of the originally provided X (if it was a DataFrame or array-like) when as_frame=True.

Parameters:
  • repeat (int) – Repeat index (0-based).

  • fold (int) – Fold index within the repeat (0-based).

  • as_frame (bool) – If True, return slices of the original X (DataFrame or ndarray) rather than indices.

Returns:

Tuple of (train, val, test) either as index arrays or as slices of X.

Raises:
  • RuntimeError – If no splits have been computed yet.

  • IndexError – If the requested (repeat, fold) does not exist.

iter_splits()[source]#

Iterate over computed splits in order (repeat major, fold minor).

Returns:

Iterator of SplitIndices objects.

Return type:

Iterator[SplitIndices]
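
"Repeat major, fold minor" means all folds of repeat 0 come first, then all folds of repeat 1, and so on, for n_splits * n_repeats entries in total. A small sketch of that ordering (split_order is a hypothetical helper):

```python
from itertools import product
from typing import List, Tuple


def split_order(n_repeats: int, n_splits: int) -> List[Tuple[int, int]]:
    """List (repeat, fold) pairs with the repeat index varying slowest."""
    return list(product(range(n_repeats), range(n_splits)))


order = split_order(n_repeats=2, n_splits=3)
print(order)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```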

property splits: List[SplitIndices]#

Return a copy of computed splits.

property n_generated_splits: int#

Number of generated (repeat, fold) splits.

Minimal usage#

from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter

splitter = RepeatedKFoldsSplitter(
    n_splits=5,
    n_repeats=3,
    random_state=2026,
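    ratio=(8, 1, 1),  # (train, val, test) proportions
)

# df is a pandas DataFrame, e.g. one returned by DataLoader.load(...)
split_indices = list(splitter.split_with_val(df))  # one SplitIndices per (repeat, fold)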

Command-line interface: synrxn.__main__#

The synrxn.__main__ module exposes the command-line interface used when invoking SynRXN as a module.

python -m synrxn --help
python -m synrxn build --help

The build subcommand is intended for maintainers and advanced users who need to rebuild datasets or manifests from original sources.

SynRXN CLI (synrxn/main.py)

Typical usage inside the repository (no install):

# Show help
python -m synrxn.main --help

# Build RBL dataset
python -m synrxn.main build --rbl

# Build property dataset
python -m synrxn.main build --property

# Run multiple builders in sequence
python -m synrxn.main build --aam --classification --property

# Forward extra args to builders (after --)
python -m synrxn.main build --rbl -- --out-dir Data/rbl

# CLI-level dry-run: only print commands, do not execute
python -m synrxn.main build --rbl --dry-run

# Ask builders themselves not to save (they must support --dry-run)
python -m synrxn.main build --rbl --no-save
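
Forwarding extra arguments after a bare -- separator is a common CLI convention. The sketch below shows one way to separate the CLI's own arguments from those passed on to a builder; split_forwarded is hypothetical and may differ from how the SynRXN parser handles it.

```python
from typing import List, Tuple


def split_forwarded(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first bare '--' into (own_args, forwarded_args)."""
    if "--" in argv:
        cut = argv.index("--")
        return argv[:cut], argv[cut + 1:]
    return list(argv), []


own, forwarded = split_forwarded(["build", "--rbl", "--", "--out-dir", "Data/rbl"])
print(own)        # ['build', '--rbl']
print(forwarded)  # ['--out-dir', 'Data/rbl']
```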

synrxn.__main__.configure_logging(verbose)[source]#

Configure a simple stream logger.

Parameters:

verbose (bool)

Return type:

None

synrxn.__main__.find_repo_root(start)[source]#

Try to detect the repository root starting from the given path.

Strategy:
  • Walk upwards until we find a directory that contains a “script” subdir.

  • Fall back to “start” if nothing is found.

Parameters:

start (Path)

Return type:

Path
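
The strategy described for find_repo_root can be sketched with pathlib. This is an illustrative reimplementation, not the module's source: find_repo_root_sketch is a hypothetical name, and whether the start directory itself is checked is an assumption.

```python
import tempfile
from pathlib import Path


def find_repo_root_sketch(start: Path) -> Path:
    """Return the nearest ancestor (or start) containing a 'script' subdir."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / "script").is_dir():
            return candidate
    # Nothing found: fall back to the starting directory.
    return start


# Demo on a throwaway directory tree: <root>/script and <root>/synrxn/data.
root = Path(tempfile.mkdtemp()).resolve()
(root / "script").mkdir()
nested = root / "synrxn" / "data"
nested.mkdir(parents=True)

found = find_repo_root_sketch(nested)
print(found == root)
```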

synrxn.__main__.find_script_path(repo_root, override, target)[source]#

Resolve the script path for a given target (possibly overridden).

Parameters:
  • repo_root (Path)

  • override (str | None)

  • target (str)

Return type:

Path

synrxn.__main__.build_command(python_exe, script_path, forwarded_args)[source]#

Build the command list for subprocess.

Parameters:
  • python_exe

  • script_path

  • forwarded_args

Return type:

List[str]

synrxn.__main__.run_subprocess(cmd, cwd, dry_run=False)[source]#

Run subprocess and return exit code.

If dry_run is True, only print the command that would be run.

Parameters:
  • cmd

  • cwd

  • dry_run (default False)

Return type:

int
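
Taken together, build_command and run_subprocess describe a "build a command list, then run or print it" pattern. The sketch below illustrates that pattern under assumptions: the function names, the script path script/build_rbl.py, the printed prefix, and the exit code returned for a dry run are all hypothetical.

```python
import shlex
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command_sketch(
    python_exe: str, script_path: Path, forwarded_args: List[str]
) -> List[str]:
    """Assemble the argument list passed to subprocess."""
    return [python_exe, str(script_path), *forwarded_args]


def run_sketch(cmd: List[str], cwd: Path, dry_run: bool = False) -> int:
    """Run cmd in cwd and return its exit code; only print it on dry runs."""
    if dry_run:
        print("[dry-run]", shlex.join(cmd))
        return 0  # assumed: a dry run counts as success
    return subprocess.run(cmd, cwd=cwd).returncode


cmd = build_command_sketch(
    sys.executable, Path("script/build_rbl.py"), ["--out-dir", "Data/rbl"]
)
exit_code = run_sketch(cmd, cwd=Path.cwd(), dry_run=True)
```

Passing the command as a list rather than a shell string avoids quoting problems when forwarded arguments contain spaces.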

synrxn.__main__.parse_args(argv=None)[source]#
Parameters:

argv (List[str] | None)

Return type:

Namespace

synrxn.__main__.main(argv=None)[source]#
Parameters:

argv (List[str] | None)

Return type:

int

API usage guidance#

  • Use synrxn.data.DataLoader for all dataset access instead of hardcoding local paths.

  • Prefer source="zenodo" for published experiments.

  • Prefer exact commit SHAs for development snapshots.

  • Export split indices whenever generated splits are part of a benchmark.

  • Keep package version, data version, and split settings in experiment logs.
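
The export API for split indices is not shown in this section, so the sketch below uses plain JSON as one possible approach. export_splits and import_splits are hypothetical helpers; they rely only on the documented SplitIndices attributes (repeat, fold, train_idx, val_idx, test_idx), with SimpleNamespace standing in for real split objects.

```python
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import Iterable, List

INDEX_FIELDS = ("train_idx", "val_idx", "test_idx")


def export_splits(splits: Iterable, path: Path) -> None:
    """Write one JSON row per (repeat, fold) split."""
    rows = []
    for s in splits:
        row = {"repeat": int(s.repeat), "fold": int(s.fold)}
        for name in INDEX_FIELDS:
            # int() also converts NumPy integers, which json cannot encode.
            row[name] = [int(i) for i in getattr(s, name)]
        rows.append(row)
    path.write_text(json.dumps(rows))


def import_splits(path: Path) -> List[dict]:
    """Read the rows written by export_splits."""
    return json.loads(path.read_text())


# Stand-in for a SplitIndices object with made-up indices.
demo = [SimpleNamespace(repeat=0, fold=0, train_idx=[0, 1, 2, 3], val_idx=[4], test_idx=[5])]
out_path = Path(tempfile.mkdtemp()) / "splits.json"
export_splits(demo, out_path)
restored = import_splits(out_path)
print(restored[0]["train_idx"])
```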