API Reference
This section documents the public Python API exposed by SynRXN. For practical usage examples, start with "Getting Started" and "Tutorials and Examples".
Data access: synrxn.data
The synrxn.data module is the main entry point for accessing curated
datasets and their manifests.
Typical responsibilities of synrxn.data.DataLoader include:
listing datasets for a task family,
resolving data sources such as Zenodo, GitHub tags, exact commits, or latest,
managing local cache directories,
verifying or tracking downloaded assets when metadata is available,
returning records as pandas DataFrames.
synrxn.data Data access utilities for SynRXN datasets.
- class synrxn.data.DataLoader(task, version=None, cache_dir=PosixPath('/home/docs/.cache/synrxn'), timeout=20, user_agent='SynRXN-DataLoader/2.0', max_workers=6, gh_ref=None, gh_enable=False, source='zenodo', resolve_on_init=False, verify_checksum=True, cache_record_index=True, force_record_id=None)
Bases: object
DataLoader for SynRXN data stored under Data/<task>/<name>.csv(.gz).
- The loader supports three sources:
'zenodo': pulls files from the Zenodo record for the project’s concept DOI
'github': pulls files from a GitHub release tag or branch
'commit': pulls files from a specific commit SHA (version='latest' resolves the tip SHA)
- Parameters:
task (str) – Subfolder name under the repository’s Data/ directory (e.g. "class", "aam").
version (Optional[str]) – Source-dependent version identifier:
If source=='zenodo': Zenodo version string (e.g. "1.0.0") or None for the latest record.
If source=='github': GitHub release tag (e.g. "v1.0.0" or "1.0.0"). 'latest' can be used and will resolve to the latest release tag (when available) and otherwise fall back to a branch ref.
If source=='commit': commit SHA (40-char) or the string 'latest' to resolve the tip SHA of a branch.
cache_dir (Optional[pathlib.Path]) – Directory used to cache downloaded gz payloads, Zenodo record indices, and GitHub latest lookups. Defaults to ~/.cache/synrxn.
timeout (int) – HTTP timeout in seconds for requests. Default is 20.
user_agent (str) – User-Agent string for outbound HTTP requests.
max_workers (int) – Maximum worker threads for load_many(). Default 6.
gh_ref (Optional[str]) – Optional explicit GitHub ref (branch name) used when resolving latest (commit or release fallback). If omitted, the repo default branch is used.
gh_enable (bool) – If True, enables GitHub-based retrieval. Required when source is 'github' or 'commit'.
source (str) – One of {'zenodo', 'github', 'commit'}.
resolve_on_init (bool) – If True and source=='zenodo', resolves the Zenodo record and file index during initialization.
verify_checksum (bool) – If True, verifies Zenodo-provided checksums for downloads and cached payloads.
cache_record_index (bool) – If True, the Zenodo record file index will be cached in cache_dir.
force_record_id (Optional[int]) – If provided, this numeric Zenodo record id will be used directly (bypassing version lookup).
- Raises:
ValueError – If source is not one of the supported values, or if GitHub retrieval is requested but gh_enable is False.
RuntimeError – If resolving version='latest' for source='commit' fails to return a commit SHA.
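As a rough picture of what verify_checksum=True implies, the sketch below checks a file against a checksum string. It is not SynRXN's implementation: the helper name matches_checksum and the "<algorithm>:<hexdigest>" layout of the checksum string are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def matches_checksum(path: Path, checksum: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file matches an "<algorithm>:<hexdigest>" string.

    Hypothetical helper; the checksum layout is an assumption.
    """
    algo, _, expected = checksum.partition(":")
    digest = hashlib.new(algo)
    with path.open("rb") as handle:
        # Read in chunks so large payloads do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()


# Demonstration on a temporary file standing in for a cached payload.
with tempfile.TemporaryDirectory() as tmp:
    payload = Path(tmp) / "example.csv.gz"
    payload.write_bytes(b"rxn,label\n")
    good = "md5:" + hashlib.md5(b"rxn,label\n").hexdigest()
    ok_result = matches_checksum(payload, good)
    bad_result = matches_checksum(payload, "md5:" + "0" * 32)
```

A loader that performs this kind of check on cached payloads as well as fresh downloads can detect a corrupted cache entry before it reaches a DataFrame.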
Examples
Zenodo (specific version):
from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="zenodo",
    version="1.0.0",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())
GitHub (release tag):
from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="github",
    version="v1.0.0",
    gh_enable=True,
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())
Commit (explicit commit SHA):
from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="commit",
    version="3e1612e2199e8b0e369fce3ed9aff3dda68e4c32",
    gh_enable=True,
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())
Commit (resolve latest tip of a branch):
from synrxn.data import DataLoader
from pathlib import Path

dl = DataLoader(
    task="classification",
    source="commit",
    version="latest",
    gh_enable=True,
    gh_ref="main",
    cache_dir=Path("~/.cache/synrxn").expanduser(),
)
# dl.version will be replaced with the resolved 40-char SHA
print("resolved sha:", dl.version)
print(dl.available_names())
df = dl.load('schneider_b')
print(df.head())
Notes
version='latest' (commit or release) is non-deterministic; record the resolved value (dl.version) if you need reproducible results.
For heavy GitHub API usage, provide an authenticated session (add an Authorization token to dl._session.headers).
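One way to act on the first note is to refuse to store a floating version. The sketch below is an illustration under assumptions: pinned_kwargs is a hypothetical helper, not part of SynRXN, and it only relies on the documented source, version, and gh_enable constructor arguments.

```python
import re

_SHA_RE = re.compile(r"[0-9a-f]{40}")


def pinned_kwargs(source: str, resolved_version: str) -> dict:
    """Build DataLoader keyword arguments that pin an already-resolved version.

    Hypothetical helper: pass the value read from dl.version after resolution.
    """
    if resolved_version == "latest":
        raise ValueError("resolve 'latest' first and pass the concrete value")
    if source == "commit" and not _SHA_RE.fullmatch(resolved_version):
        raise ValueError("commit sources need a full 40-character SHA")
    return {
        "source": source,
        "version": resolved_version,
        # gh_enable is documented as required for 'github' and 'commit'.
        "gh_enable": source in {"github", "commit"},
    }


pinned = pinned_kwargs("commit", "3e1612e2199e8b0e369fce3ed9aff3dda68e4c32")
```

The returned dictionary can be stored next to experiment results and passed back as DataLoader(task=..., **pinned) to reload the same snapshot.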
- find_zenodo_keys(term)
Public helper to search keys in the currently-loaded Zenodo file index.
Resolves the record index if not already loaded. Returns an empty list if no client/index is available or if an error occurs.
Minimal usage
from pathlib import Path
from synrxn.data import DataLoader
loader = DataLoader(
task="classification",
source="zenodo",
version="1.0.0",
cache_dir=Path("~/.cache/synrxn").expanduser(),
)
print(loader.available_names())
df = loader.load("schneider_b")
Splitting utilities: synrxn.split.repeated_kfold
The synrxn.split.repeated_kfold module provides tools for reproducible
repeated k-fold splitting of datasets.
Typical use cases include:
generating repeated k-fold splits for property and classification tasks,
deriving train/validation/test partitions via a user-specified (train, val, test) ratio,
preserving label distributions through stratified splitting when supported,
exporting/importing split indices for exact reproducibility.
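The last use case can be covered with plain JSON. The sketch below is not a SynRXN API: save_splits and load_splits are hypothetical helpers, and the records are plain dictionaries that mirror the fields of SplitIndices (repeat, fold, train_idx, val_idx, test_idx).

```python
import json
import tempfile
from pathlib import Path

FIELDS = ("train_idx", "val_idx", "test_idx")


def save_splits(splits, path: Path) -> None:
    """Write split records to JSON, converting index arrays to int lists."""
    rows = []
    for s in splits:
        row = {"repeat": int(s["repeat"]), "fold": int(s["fold"])}
        for name in FIELDS:
            # int() also accepts NumPy integer scalars, which json cannot encode.
            row[name] = [int(i) for i in s[name]]
        rows.append(row)
    path.write_text(json.dumps(rows))


def load_splits(path: Path):
    """Read the records back as a list of dictionaries."""
    return json.loads(path.read_text())


example = [
    {"repeat": 0, "fold": 0, "train_idx": [0, 1, 2, 3], "val_idx": [4], "test_idx": [5]},
]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "splits.json"
    save_splits(example, target)
    restored = load_splits(target)
```

With real SplitIndices objects, the dictionary lookups would become attribute reads such as s.train_idx.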
- class synrxn.split.repeated_kfold.SplitIndices(repeat, fold, train_idx, val_idx, test_idx)
Bases: object
Container for indices of a single (repeat, fold) split.
- Parameters:
- train_idx: ndarray#
- val_idx: ndarray#
- test_idx: ndarray#
- class synrxn.split.repeated_kfold.RepeatedKFoldsSplitter(n_splits=5, n_repeats=1, ratio=(8, 1, 1), shuffle=True, random_state=None)
Bases: object
Repeated K-Fold splitter producing (train, val, test) for each outer fold.
sklearn-compatible split(X, y=None, groups=None, stratify=None) -> yields (train_idx, holdout_idx) where holdout_idx == val + test (useful for sklearn cross_validate).
Use split_with_val(X, y=None, groups=None, stratify=None) to receive (train, val, test) triples as SplitIndices objects (repeat, fold, train_idx, val_idx, test_idx).
The ratio argument is a (train, val, test) tuple that controls the proportion used when splitting the outer holdout into validation and test sets. For example, ratio=(8,1,1) means the holdout is split val:test = 1:1.
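To make the ratio arithmetic concrete, the sketch below estimates partition sizes for one outer fold. It is a back-of-the-envelope illustration, not the library's code: partition_sizes is a hypothetical helper, and both the holdout size (one fold of n_samples // n_splits) and the rounding rule are assumptions.

```python
def partition_sizes(n_samples: int, n_splits: int, ratio=(8, 1, 1)):
    """Approximate (train, val, test) sizes for one outer fold.

    Assumes the holdout is one fold of size n_samples // n_splits and that
    only the val:test part of `ratio` divides it; rounding is an assumption.
    """
    _, val_w, test_w = ratio
    holdout = n_samples // n_splits
    val = round(holdout * val_w / (val_w + test_w))
    test = holdout - val
    train = n_samples - holdout
    return train, val, test


sizes = partition_sizes(500, 5, (8, 1, 1))
print(sizes)  # (400, 50, 50)
```

With 500 samples and 5 folds, each holdout has 100 samples, and the 1:1 val:test split yields 50 each.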
Notes on stratification:
Pass y (array-like) to stratify by labels (sklearn-conventional).
Alternatively pass stratify to split(…) or split_with_val(…). stratify may be:
a column name (str) when X is a pandas.DataFrame, or
an array-like of the same length as X.
If both y and stratify are provided, stratify takes precedence.
If stratification is requested but not possible (e.g., a class has fewer than n_splits members), the splitter falls back to non-stratified KFold and emits a warning.
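The fallback condition in the last note can be checked up front. The sketch below is an assumption about what "not possible" means, taken from the example given (a class with fewer than n_splits members); can_stratify is a hypothetical helper, not a SynRXN function.

```python
from collections import Counter


def can_stratify(labels, n_splits: int) -> bool:
    """Return True if every class has at least n_splits members."""
    counts = Counter(labels)
    return bool(counts) and min(counts.values()) >= n_splits


balanced = ["a"] * 10 + ["b"] * 10
rare = ["a"] * 18 + ["b"] * 2  # class "b" has fewer than 5 members
```

Running such a check before splitting makes it explicit in the experiment log whether a given dataset was split with or without stratification.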
Example (Sphinx-style):
>>> from sklearn.datasets import make_classification
>>> from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter
>>> X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)
>>> splitter = RepeatedKFoldsSplitter(n_splits=5, n_repeats=2, ratio=(8,1,1), shuffle=True, random_state=1)
>>> # sklearn-style outer cross-validation (y is used for stratification)
>>> for train_idx, hold_idx in splitter.split(X, y):
...     print(len(train_idx), len(hold_idx))
>>> # explicit train/val/test
>>> for s in splitter.split_with_val(X, stratify=y):
...     X_train, X_val, X_test = X[s.train_idx], X[s.val_idx], X[s.test_idx]
- Parameters:
n_splits (int) – Number of outer folds (k).
n_repeats (int) – Number of repeats (how many times to reshuffle-and-split).
ratio (Tuple[int, int, int]) – Tuple of three ints (train, val, test) like (8,1,1).
shuffle (bool) – Whether to shuffle before splitting each repeat.
random_state (Optional[int]) – Base random state for reproducible repeats.
- get_n_splits(X=None, y=None, groups=None)
Return how many (train, holdout) splits will be produced.
- Parameters:
X – Feature matrix or dataframe (ignored for counting).
y – Labels (ignored for counting).
groups – Groups (ignored for counting).
- Returns:
Total number of outer splits (n_splits * n_repeats).
- Return type:
int
- split(X, y=None, groups=None, stratify=None)
sklearn-compatible generator yielding (train_idx, holdout_idx) where holdout_idx == val + test.
- The stratify argument may be:
a column name (str) if X is a pandas.DataFrame, or
an array-like of length n_samples.
If stratify is provided, it is used in preference to y.
- Parameters:
- Yields:
Tuples (train_idx, holdout_idx) for each repeat/fold.
- Return type:
- split_with_val(X, y=None, groups=None, stratify=None)
Yield SplitIndices objects containing (train, val, test) indices.
- Parameters:
- Yields:
SplitIndices objects for each repeat and fold.
- Return type:
- prepare_splits(X, y=None, groups=None, stratify=None)
Compute and store all splits immediately (equivalent to iterating split(…) fully). After calling this, self._splits is populated and get_split(…) may be used.
- get_split(repeat=0, fold=0, as_frame=False)
Retrieve either index arrays (train_idx, val_idx, test_idx) or slices of the originally provided X (if it was a DataFrame or array-like) when as_frame=True.
- Parameters:
- Returns:
Tuple of (train, val, test) either as index arrays or as slices of X.
- Raises:
RuntimeError – If no splits have been computed yet.
IndexError – If the requested (repeat, fold) does not exist.
- iter_splits()
Iterate over computed splits in order (repeat major, fold minor).
- Returns:
Iterator of SplitIndices objects.
- Return type:
- property splits: List[SplitIndices]
Return a copy of computed splits.
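The "repeat major, fold minor" order used by iter_splits() and the total count reported by get_n_splits() can be pictured with a few lines of standard-library Python. This is an illustration only: split_keys is a hypothetical helper, and zero-based indices are assumed from the get_split(repeat=0, fold=0) defaults.

```python
from itertools import product


def split_keys(n_repeats: int, n_splits: int):
    """List (repeat, fold) pairs in repeat-major, fold-minor order."""
    return list(product(range(n_repeats), range(n_splits)))


keys = split_keys(n_repeats=2, n_splits=3)
print(keys)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

The list has n_splits * n_repeats entries, matching the count documented for get_n_splits().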
Minimal usage
from synrxn.split.repeated_kfold import RepeatedKFoldsSplitter
splitter = RepeatedKFoldsSplitter(
n_splits=5,
n_repeats=3,
random_state=2026,
ratio=(8, 1, 1),
)
# df is a pandas DataFrame, for example one returned by DataLoader.load()
split_indices = list(splitter.split_with_val(df))
Command-line interface: synrxn.__main__
The synrxn.__main__ module exposes the command-line interface used when
invoking SynRXN as a module.
python -m synrxn --help
python -m synrxn build --help
The build subcommand is intended for maintainers and advanced users who need
to rebuild datasets or manifests from original sources.
SynRXN CLI (synrxn/__main__.py)
Typical usage inside the repository (no install):
# Show help
python -m synrxn.__main__ --help
# Build RBL dataset
python -m synrxn.__main__ build --rbl
# Build property dataset
python -m synrxn.__main__ build --property
# Run multiple builders in sequence
python -m synrxn.__main__ build --aam --classification --property
# Forward extra args to builders (after --)
python -m synrxn.__main__ build --rbl -- --out-dir Data/rbl
# CLI-level dry-run: only print commands, do not execute
python -m synrxn.__main__ build --rbl --dry-run
# Ask builders themselves not to save (they must support --dry-run)
python -m synrxn.__main__ build --rbl --no-save
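Forwarding extra arguments after a bare double hyphen is a common CLI convention. The sketch below shows one way to separate the two groups; split_forwarded is a hypothetical helper and may differ from how the SynRXN CLI actually parses its arguments.

```python
def split_forwarded(argv):
    """Split argv at the first bare "--" into (cli_args, forwarded_args)."""
    if "--" in argv:
        cut = argv.index("--")
        return argv[:cut], argv[cut + 1:]
    return list(argv), []


own, extra = split_forwarded(["build", "--rbl", "--", "--out-dir", "Data/rbl"])
```

Here own holds the arguments meant for the CLI itself and extra holds what would be handed to the builder script.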
- synrxn.__main__.configure_logging(verbose)
Configure a simple stream logger.
- Parameters:
verbose (bool)
- Return type:
None
- synrxn.__main__.find_repo_root(start)
Try to detect the repository root starting from the given path.
Strategy:
Walk upwards until we find a directory that contains a “script” subdir.
Fall back to “start” if nothing is found.
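The strategy described above can be sketched with pathlib. This is a reading of the docstring, not the function's actual source; find_root_sketch is a hypothetical name.

```python
import tempfile
from pathlib import Path


def find_root_sketch(start: Path) -> Path:
    """Walk upwards from `start` until a directory containing "script" is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / "script").is_dir():
            return candidate
    return start  # fallback described in the docstring


# Build a throwaway tree: <root>/script and <root>/synrxn/sub
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "script").mkdir()
    nested = root / "synrxn" / "sub"
    nested.mkdir(parents=True)
    found_matches_root = find_root_sketch(nested) == root
```

Starting from the nested directory, the walk stops at the first ancestor that has a script subdirectory.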
- synrxn.__main__.find_script_path(repo_root, override, target)
Resolve the script path for a given target (possibly overridden).
- synrxn.__main__.build_command(python_exe, script_path, forwarded_args)
Build the command list for subprocess.
- synrxn.__main__.run_subprocess(cmd, cwd, dry_run=False)
Run subprocess and return exit code.
If dry_run is True, only print the command that would be run.
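A minimal sketch of how build_command and run_subprocess could fit together is shown below. The function names carry a _sketch suffix because the bodies are guesses based on the docstrings: the command layout, the dry-run exit code of 0, and the script path script/build_rbl.py are all assumptions.

```python
import subprocess
import sys
from pathlib import Path


def build_command_sketch(python_exe, script_path, forwarded_args):
    """Assumed layout: interpreter, script, then forwarded arguments."""
    return [str(python_exe), str(script_path), *forwarded_args]


def run_sketch(cmd, cwd, dry_run=False):
    """Run `cmd` in `cwd` and return its exit code; only print it on dry runs."""
    if dry_run:
        print("DRY-RUN:", " ".join(cmd))
        return 0  # assumption: a dry run reports success
    return subprocess.run(cmd, cwd=str(cwd)).returncode


# Hypothetical builder script path, used only to show the command layout.
cmd = build_command_sketch(sys.executable, "script/build_rbl.py", ["--out-dir", "Data/rbl"])
dry_code = run_sketch(cmd, Path.cwd(), dry_run=True)

# A real run: the child interpreter exits with status 3, which is returned as-is.
real_code = run_sketch([sys.executable, "-c", "raise SystemExit(3)"], Path.cwd())
```

Passing the command as a list avoids shell quoting issues when forwarded arguments contain spaces.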
API usage guidance
Use synrxn.data.DataLoader for all dataset access instead of hardcoding local paths.
Prefer source="zenodo" for published experiments.
Prefer exact commit SHAs for development snapshots.
Export split indices whenever generated splits are part of a benchmark.
Keep package version, data version, and split settings in experiment logs.
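The last recommendation can be as simple as one JSON line per run. The record layout below is a suggestion rather than a SynRXN format: experiment_record is a hypothetical helper, and the package version shown is a placeholder.

```python
import json
from datetime import datetime, timezone


def experiment_record(package_version, source, data_version, split_settings):
    """Assemble a JSON-serialisable provenance record (hypothetical layout)."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "synrxn_version": package_version,
        "data": {"source": source, "version": data_version},
        "split": dict(split_settings),
    }


record = experiment_record(
    package_version="0.0.0",  # placeholder, not a real release number
    source="zenodo",
    data_version="1.0.0",
    split_settings={"n_splits": 5, "n_repeats": 3, "ratio": [8, 1, 1], "random_state": 2026},
)
line = json.dumps(record, sort_keys=True)
```

Appending such lines to a log file keeps the data source, data version, and split settings next to the results they produced.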