Data Records

This page summarizes the curated SynRXN records by benchmark family. It is designed as a quick inventory for selecting datasets, checking target columns, and finding the source citation and reuse license for each benchmark table. Dataset-level source and literature citations are collected on the Reference page.

Note

The License column reports the license of the table as distributed in the SynRXN release. Upstream publications and external source repositories should still be cited through the Source column. Users who redistribute modified versions should also check the upstream source terms.

Task families

Reaction rebalancing, atom mapping, classification, property prediction, and synthesis.

Curated tables

Named benchmark records distributed as compressed CSV files.

Citable release

Use the Scientific Data paper and archived data release for reproducible reporting.

Reaction Rebalancing

Data folder: rbl

Reaction rebalancing datasets evaluate whether a method can restore missing species and recover a balanced reaction from an incomplete equation. The MNC, MOS, and MBS subsets are derived from USPTO-50K-style reaction records, whereas the Complex subset is based on manually validated complex-reaction mapping collections. See Reference for complete bibliographic records.

Dataset	Records	Benchmark role	Principal columns	Source	License
`complex`	1,748	Complex transformations and skeletal rearrangements.	`r_id`, input `rxn`, balanced `ground_truth`	[1, 2]	`CC BY 4.0`
`mbs`	491	Missing species on both sides of the reaction.	`r_id`, input `rxn`, balanced `ground_truth`	[3]	`CC BY 4.0`
`mnc`	33,147	Missing non-carbon species.	`r_id`, input `rxn`, balanced `ground_truth`	[3]	`CC BY 4.0`
`mos`	12,781	Missing species on one side of the reaction.	`r_id`, input `rxn`, balanced `ground_truth`	[3]	`CC BY 4.0`
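
As a minimal sketch of how a rebalancing table might be read, the snippet below parses gzip-compressed CSV bytes with the Python standard library and checks for the principal columns listed above. The helper name `read_rbl_records` and the in-memory sample row are illustrative assumptions; the sample reaction is not taken from the dataset, and the actual file names and extensions in the `rbl` folder are not specified on this page.

```python
import csv
import gzip
import io

# Principal columns listed in the rebalancing table above.
REQUIRED_COLUMNS = ("r_id", "rxn", "ground_truth")


def read_rbl_records(raw_bytes):
    """Parse gzip-compressed CSV bytes into a list of row dictionaries."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny in-memory stand-in for a compressed table (illustrative row only):
# the input omits HCl, the balanced ground truth restores it.
sample_csv = (
    "r_id,rxn,ground_truth\n"
    "demo_1,CC(=O)Cl.CO>>CC(=O)OC,CC(=O)Cl.CO>>CC(=O)OC.Cl\n"
)
records = read_rbl_records(gzip.compress(sample_csv.encode("utf-8")))
print(records[0]["ground_truth"])
```

With a real file, the same function could be called on the bytes returned by `open(path, "rb").read()`, where `path` points at the downloaded table.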

Atom-to-Atom Mapping

Data folder: aam

Atom-to-atom mapping records compare predicted mappings with curated or consensus reference mappings across synthetic and biochemical reaction domains. See Reference for source publications and mapper references.

Dataset	Records	Benchmark role	Principal columns	Source	License
`ecoli`	273	Biochemical mapping benchmark from E. coli reactions.	`r_id`, `ground_truth`, mapper outputs, `rxn`, `original_id`	[4]	`CC BY 4.0`
`enzyme_map`	47,974	Enzymatic atom-mapping benchmark from the EnzymeMap reaction collection.	`r_id`, `ground_truth`, `rxn`, `original_id`	[5]	`CC BY 4.0`
`golden`	1,758	Curated synthetic chemistry mapping benchmark.	`r_id`, `ground_truth`, mapper outputs, `rxn`, `original_id`	[2, 6]	`CC BY 4.0`
`natcomm`	491	Reference mapping set from literature-derived reactions.	`r_id`, `ground_truth`, mapper outputs, `rxn`, `original_id`	[1]	`CC BY 4.0`
`recon3d`	382	Biochemical mapping benchmark from Recon3D reactions.	`r_id`, `ground_truth`, mapper outputs, `rxn`, `original_id`	[7]	`CC BY 4.0`
`uspto_3k`	3,000	USPTO-derived synthetic chemistry mapping benchmark.	`r_id`, `ground_truth`, mapper outputs, `rxn`, `original_id`	[6]	`CC BY 4.0`
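
The comparison between mapper outputs and `ground_truth` can be sketched as a simple agreement score. The column name `mapper_a` below is hypothetical (this page lists mapper outputs without naming them), and the rows are invented for illustration. Exact string equality is also a deliberately crude criterion: chemically equivalent mappings that differ in atom numbering or symmetry would count as mismatches, so a real evaluation would likely need a canonicalization step first.

```python
def mapping_agreement(rows, mapper_column, reference_column="ground_truth"):
    """Fraction of rows whose mapper output equals the reference string exactly.

    Rows with an empty or missing mapper output are skipped, not counted as wrong.
    """
    scored = [row for row in rows if row.get(mapper_column)]
    if not scored:
        return 0.0
    hits = sum(1 for row in scored if row[mapper_column] == row[reference_column])
    return hits / len(scored)


# Illustrative rows; "mapper_a" is a hypothetical mapper-output column.
reference = "[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3].[Br-:2]"
swapped = "[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:2].[Br-:3]"
rows = [
    {"r_id": "demo_1", "ground_truth": reference, "mapper_a": reference},
    {"r_id": "demo_2", "ground_truth": reference, "mapper_a": reference},
    {"r_id": "demo_3", "ground_truth": reference, "mapper_a": swapped},
    {"r_id": "demo_4", "ground_truth": reference, "mapper_a": ""},
]
agreement = mapping_agreement(rows, "mapper_a")
print(f"exact-match agreement: {agreement:.3f}")
```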

Reaction Classification

Data folder: classification

Classification datasets evaluate reaction class, template, or enzyme-label prediction. Some corpora are provided in both balanced/full and unbalanced/product-focused variants. See Reference for label-source and benchmark citations.

Dataset	Records	Benchmark role	Principal columns	Source	License
`ecreact`	185,734	Enzyme Commission hierarchy classification.	`r_id`, `rxn`, `ec1`, `ec2`, `ec3`, `split`, `orig_index`	[8]	`CC BY 4.0`
`schneider_b`	50,000	Balanced Schneider reaction-class benchmark.	`r_id`, `rxn`, `label`, `split`	[9, 10]	`CC BY 4.0`
`schneider_u`	50,000	Unbalanced/product-focused Schneider variant.	`r_id`, `rxn`, `label`, `split`	[9]	`CC BY 4.0`
`syntemp`	43,441	Hierarchical SynTemp template labels.	`orig_index`, `r_id`, `rxn`, `label_0`, `label_1`, `label_2`	[3, 11]	`CC BY 4.0`
`tpl_b`	445,115	Balanced template-label benchmark.	`r_id`, `rxn`, `label`, `split`	[10, 12]	`CC BY 4.0`
`tpl_u`	445,115	Unbalanced/product-focused template-label benchmark.	`r_id`, `rxn`, `label`, `split`	[12]	`CC BY 4.0`
`uspto_50k_b`	50,016	Balanced USPTO-50K class-label benchmark.	`r_id`, `rxn`, `label`, `split`	[3, 10]	`CC BY 4.0`
`uspto_50k_u`	50,016	Unbalanced/product-focused USPTO-50K variant.	`r_id`, `rxn`, `label`, `split`	[3]	`CC BY 4.0`
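
Before training on a table that ships a `split` column, it can help to inspect how labels are distributed across the partitions. The sketch below groups rows by whatever values the `split` column contains, so it does not depend on particular partition names; the values `train` and `test` in the sample rows are assumptions, as are the labels.

```python
from collections import Counter, defaultdict


def label_counts_by_split(rows, label_column="label", split_column="split"):
    """Count label occurrences separately for each value of the split column."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[split_column]][row[label_column]] += 1
    return {split: dict(counter) for split, counter in counts.items()}


# Illustrative rows shaped like the `r_id`, `rxn`, `label`, `split` tables
# (the `rxn` column is omitted here because the function does not use it).
rows = [
    {"r_id": "demo_1", "label": "1", "split": "train"},
    {"r_id": "demo_2", "label": "1", "split": "train"},
    {"r_id": "demo_3", "label": "2", "split": "train"},
    {"r_id": "demo_4", "label": "2", "split": "test"},
]
summary = label_counts_by_split(rows)
for split_name, label_counts in summary.items():
    print(split_name, label_counts)
```

For `ecreact` or `syntemp`, the `label_column` argument could be pointed at one level of the hierarchy, such as `ec1` or `label_0`.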

Reaction Property Prediction

Data folder: property

Property datasets evaluate continuous or probabilistic reaction attributes such as activation energy, enthalpy, rate, conversion, or reaction free energies. See Reference for upstream corpus citations.

Dataset	Records	Benchmark role	Principal columns	Source	License
`b97xd3`	16,365	Quantum-chemistry property prediction.	`r_id`, `aam`, `ea`, `dh`	[13, 14]	`CC BY 4.0`
`cycloadd`	5,269	Cycloaddition free-energy prediction.	`r_id`, `aam`, `G_act`, `G_r`, `split`	[15]	`CC BY 4.0`
`e2`	1,264	E2 activation-energy prediction.	`r_id`, `aam`, `ea`, `split`	[15]	`CC BY 4.0`
`e2sn2`	3,625	E2/SN2 activation-energy prediction.	`r_id`, `aam`, `ea`	[16, 17]	`CC BY 4.0`
`lograte`	778	Log-rate prediction.	`r_id`, `aam`, `lograte`	[16, 18]	`CC BY 4.0`
`phosphatase`	33,354	Phosphatase conversion prediction.	`r_id`, `aam`, `Conversion`, `onehot`	[16, 19]	`CC BY 4.0`
`rad6re`	31,923	Reaction enthalpy prediction.	`r_id`, `aam`, `dh`	[16, 20]	`CC BY 4.0`
`rdb7`	23,852	Activation-energy prediction with predefined splits.	`r_id`, `aam`, `ea`, `split`	[15]	`CC BY 4.0`
`rgd1`	353,984	Large-scale activation-energy prediction.	`r_id`, `aam`, `ea`, `split`	[15]	`CC BY 4.0`
`sn2`	2,361	SN2 activation-energy prediction.	`r_id`, `aam`, `ea`, `split`	[15]	`CC BY 4.0`
`snar`	503	SNAr activation-energy prediction.	`r_id`, `rxn`, `ea`	[21]	`CC BY 3.0`
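
A quick sanity check for any regression target is a mean-value baseline. The sketch below computes the mean of the `ea` column on training rows and reports the mean absolute error on held-out rows. The numeric values are invented, the partition into training and held-out rows is assumed to have been done already, and units are left unspecified because this page does not state them.

```python
from statistics import mean


def mean_baseline_mae(train_rows, test_rows, target="ea"):
    """Return (baseline value, mean absolute error) for a constant-mean predictor."""
    baseline = mean(float(row[target]) for row in train_rows)
    errors = [abs(float(row[target]) - baseline) for row in test_rows]
    return baseline, mean(errors)


# Illustrative values only; CSV readers yield strings, hence the float() calls.
train_rows = [{"ea": "10.0"}, {"ea": "20.0"}, {"ea": "30.0"}]
test_rows = [{"ea": "18.0"}, {"ea": "25.0"}]
baseline, mae = mean_baseline_mae(train_rows, test_rows)
print(f"baseline={baseline}, mae={mae}")
```

Any learned model should beat this number on the same partition; the `target` argument can be switched to `dh`, `G_act`, or another column from the table above.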

Synthesis Prediction

Data folder: synthesis

Synthesis records support single-step forward synthesis, retrosynthesis, reagent/catalyst inference, and reaction-center-related workflows. The Scientific Data synthesis table explicitly lists USPTO-50K, USPTO-MIT, and USPTO-500 as the SynRXN synthesis-prediction corpora; the Diels–Alder record is documented here as an additional class-specific synthesis dataset. See Reference for USPTO-derived benchmark citations.

Dataset	Records	Benchmark role	Principal columns	Source	License
`da`	11,011	Diels–Alder reaction benchmark.	`r_id`, `code`, `reaction_original`, `reaction`, `rsmi`	[22]	`CC BY 3.0`
`uspto_500`	143,535	Reagent and catalyst inference benchmark.	`r_id`, `rxn`, `reagent`, `split`	[23]	`CC BY 4.0`
`uspto_50k`	50,016	USPTO-50K synthesis/retrosynthesis benchmark.	`r_id`, `aam`, `split`, `source`	[3, 6]	`CC BY 4.0`
`uspto_mit`	479,035	Large-scale USPTO-MIT synthesis prediction benchmark.	`r_id`, `aam`, `split`, `rc`	[24]	`CC BY 4.0`
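
The same record can serve forward synthesis or retrosynthesis depending on which side is treated as the input. The sketch below assumes that the reaction column holds a reaction SMILES string in the standard `reactants>agents>products` form; that this applies to the `aam` column is an assumption, and the helper names are hypothetical. Removing atom-map numbers from model inputs, which is usually needed to avoid leaking the answer, is omitted here.

```python
def split_reaction(rsmi):
    """Split a reaction SMILES into (reactants, agents, products)."""
    parts = rsmi.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>agents>products', got: {rsmi!r}")
    reactants, agents, products = parts
    return reactants, agents, products


def make_pair(rsmi, task):
    """Return (model input, model target) for a 'forward' or 'retro' task."""
    reactants, _agents, products = split_reaction(rsmi)
    if task == "forward":
        return reactants, products
    if task == "retro":
        return products, reactants
    raise ValueError(f"unknown task: {task!r}")


# Illustrative esterification, not taken from the dataset.
example = "CC(=O)O.OCC>>CC(=O)OCC.O"
forward_pair = make_pair(example, "forward")
retro_pair = make_pair(example, "retro")
print(forward_pair)
print(retro_pair)
```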

Choosing a dataset

Start with the task family, then choose the dataset based on target type, scale, and whether the benchmark already includes a published split. For example, `schneider_b` is a compact classification benchmark with a standard reaction-class label, `b97xd3` is a quantum-chemistry property dataset with activation-energy and enthalpy targets, and `uspto_50k` is a common synthesis or retrosynthesis benchmark.
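
The selection logic described here can be sketched as a filter over a small inventory. The entries below copy a handful of rows from the tables on this page (family, dataset name, record count, principal columns); the `INVENTORY` structure and the `candidates` helper are hypothetical and not part of any SynRXN API.

```python
# A few rows transcribed from the tables on this page.
INVENTORY = [
    {"family": "classification", "dataset": "schneider_b", "records": 50000,
     "columns": ["r_id", "rxn", "label", "split"]},
    {"family": "property", "dataset": "b97xd3", "records": 16365,
     "columns": ["r_id", "aam", "ea", "dh"]},
    {"family": "property", "dataset": "cycloadd", "records": 5269,
     "columns": ["r_id", "aam", "G_act", "G_r", "split"]},
    {"family": "property", "dataset": "rdb7", "records": 23852,
     "columns": ["r_id", "aam", "ea", "split"]},
]


def candidates(family, target, needs_split=False):
    """List datasets in a family that expose a target column (and a split)."""
    names = []
    for entry in INVENTORY:
        if entry["family"] != family or target not in entry["columns"]:
            continue
        if needs_split and "split" not in entry["columns"]:
            continue
        names.append(entry["dataset"])
    return names


print(candidates("property", "ea"))
print(candidates("property", "ea", needs_split=True))
```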

Practical selection guide

Goal	Start with	Check before training
Compare atom mappers	`aam/golden`, `aam/uspto_3k`, or biochemical AAM records.	Whether `ground_truth` and mapper-output columns match your metric.
Train reaction classifiers	`classification/schneider_b` or `classification/uspto_50k_b`.	Label granularity and whether balanced or unbalanced variants are needed.
Predict reaction energies	`property/b97xd3`, `property/rgd1`, or mechanism-specific subsets.	Unit conventions, target column name, and split availability.
Recover missing species	`rbl/mnc`, `rbl/mos`, `rbl/mbs`, or `rbl/complex`.	Whether the missing-species setting matches your method.
Benchmark synthesis tasks	`synthesis/uspto_50k`, `synthesis/uspto_mit`, or `synthesis/uspto_500`.	Whether the task is product prediction, retrosynthesis, reagent inference, or reaction-center prediction.

For a complete reproducibility record, pair the selected dataset with the source mode and version described in Data Concept.