Data Records
This page summarizes the curated SynRXN records by benchmark family. It is designed as a quick inventory for selecting datasets, checking target columns, and finding the source citation and reuse license for each benchmark table. Dataset-level source and literature citations are collected on the Reference page.
Note
The License column reports the license of the table as distributed in the
SynRXN release. Upstream publications and external source repositories should
still be cited through the Source column. Users who redistribute modified
versions should also check the upstream source terms.
Reaction Rebalancing
Data folder: rbl
Reaction rebalancing datasets evaluate whether a method can restore missing species and recover a balanced reaction from an incomplete equation. The MNC, MOS, and MBS subsets are derived from USPTO-50K-style reaction records, whereas the Complex subset is based on manually validated complex-reaction mapping collections. See Reference for complete bibliographic records.
| Dataset | Records | Benchmark role | Principal columns | Source | License |
|---|---|---|---|---|---|
| | 1,748 | Complex transformations and skeletal rearrangements. | | | |
| | 491 | Missing species on both sides of the reaction. | | [3] | |
| | 33,147 | Missing non-carbon species. | | [3] | |
| | 12,781 | Missing species on one side of the reaction. | | [3] | |
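As a toy illustration of what "balanced" and "missing species" mean in the rebalancing task described above, the sketch below counts atoms on each side of a reaction and reports the deficit. It is only an illustration under stated assumptions: it works on plain molecular formulas with optional integer coefficients, not on the reaction strings stored in the rbl tables, and the helper names (`count_atoms`, `side_atoms`, `missing_atoms`) are hypothetical and not part of SynRXN.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "C", "H4", "Cl2".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C2H6O' (no brackets, no charges)."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_atoms(side: str) -> Counter:
    """Sum atom counts over one reaction side, e.g. 'CH4 + 2 O2'."""
    total = Counter()
    for term in side.split("+"):
        parts = term.split()
        # A leading integer is read as a stoichiometric coefficient.
        coeff = int(parts[0]) if len(parts) == 2 else 1
        for element, n in count_atoms(parts[-1]).items():
            total[element] += coeff * n
    return total


def missing_atoms(reaction: str) -> tuple[Counter, Counter]:
    """Return (atoms missing from the reactants, atoms missing from the products)."""
    left, right = reaction.split(">>")
    lhs, rhs = side_atoms(left), side_atoms(right)
    # Counter subtraction keeps only positive differences.
    return rhs - lhs, lhs - rhs


need_left, need_right = missing_atoms("CH4 + 2 O2 >> CO2")
print(dict(need_left), dict(need_right))
```

In this example the product side lacks four hydrogen and two oxygen atoms (two water molecules), which corresponds to the "missing species on one side" setting. A reaction is balanced when both returned counters are empty.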
Atom-to-Atom Mapping
Data folder: aam
Atom-to-atom mapping records compare predicted mappings with curated or consensus reference mappings across synthetic and biochemical reaction domains. See Reference for source publications and mapper references.
| Dataset | Records | Benchmark role | Principal columns | Source | License |
|---|---|---|---|---|---|
| | 273 | Biochemical mapping benchmark from E. coli reactions. | | [4] | |
| | 47,974 | Enzymatic atom-mapping benchmark from the EnzymeMap reaction collection. | | [5] | |
| | 1,758 | Curated synthetic chemistry mapping benchmark. | | | |
| | 491 | Reference mapping set from literature-derived reactions. | | [1] | |
| | 382 | Biochemical mapping benchmark from Recon3D reactions. | | [7] | |
| | 3,000 | USPTO-derived synthetic chemistry mapping benchmark. | | [6] | |
Reaction Classification
Data folder: classification
Classification datasets evaluate reaction class, template, or enzyme-label prediction. Some corpora are provided in balanced/full and unbalanced/product-focused variants. See Reference for label-source and benchmark citations.
| Dataset | Records | Benchmark role | Principal columns | Source | License |
|---|---|---|---|---|---|
| | 185,734 | Enzyme Commission hierarchy classification. | | [8] | |
| | 50,000 | Balanced Schneider reaction-class benchmark. | | | |
| | 50,000 | Unbalanced/product-focused Schneider variant. | | [9] | |
| | 43,441 | Hierarchical SynTemp template labels. | | | |
| | 445,115 | Balanced template-label benchmark. | | | |
| | 445,115 | Unbalanced/product-focused template-label benchmark. | | [12] | |
| | 50,016 | Balanced USPTO-50K class-label benchmark. | | | |
| | 50,016 | Unbalanced/product-focused USPTO-50K variant. | | [3] | |
Reaction Property Prediction
Data folder: property
Property datasets evaluate continuous or probabilistic reaction attributes such as activation energy, enthalpy, rate, conversion, and reaction free energy. See Reference for upstream corpus citations.
| Dataset | Records | Benchmark role | Principal columns | Source | License |
|---|---|---|---|---|---|
| | 16,365 | Quantum-chemistry property prediction. | | | |
| | 5,269 | Cycloaddition free-energy prediction. | | [15] | |
| | 1,264 | E2 activation-energy prediction. | | [15] | |
| | 3,625 | E2/SN2 activation-energy prediction. | | | |
| | 778 | Log-rate prediction. | | | |
| | 33,354 | Phosphatase conversion prediction. | | | |
| | 31,923 | Reaction enthalpy prediction. | | | |
| | 23,852 | Activation-energy prediction with predefined splits. | | [15] | |
| | 353,984 | Large-scale activation-energy prediction. | | [15] | |
| | 2,361 | SN2 activation-energy prediction. | | [15] | |
| | 503 | SNAr activation-energy prediction. | | [21] | |
Synthesis Prediction
Data folder: synthesis
Synthesis records support single-step forward synthesis, retrosynthesis, reagent/catalyst inference, and reaction-center-related workflows. The Scientific Data synthesis table explicitly lists USPTO-50K, USPTO-MIT, and USPTO-500 as the SynRXN synthesis-prediction corpora; the Diels–Alder record is documented here as an additional class-specific synthesis dataset. See Reference for USPTO-derived benchmark citations.
| Dataset | Records | Benchmark role | Principal columns | Source | License |
|---|---|---|---|---|---|
| | 11,011 | Diels–Alder reaction benchmark. | | [22] | |
| | 143,535 | Reagent and catalyst inference benchmark. | | [23] | |
| | 50,016 | USPTO-50K synthesis/retrosynthesis benchmark. | | | |
| | 479,035 | Large-scale USPTO-MIT synthesis prediction benchmark. | | [24] | |
Choosing a dataset
Pick the task family first, then choose the dataset based on target type, scale,
and whether the benchmark already includes a published split. For example,
schneider_b is a compact classification benchmark with a standard reaction
class label, b97xd3 is a quantum-chemistry property dataset with activation
energy and enthalpy targets, and uspto_50k is a common synthesis or
retrosynthesis benchmark.
| Goal | Start with | Check before training |
|---|---|---|
| Compare atom mappers | | Whether |
| Train reaction classifiers | | Label granularity and whether balanced or unbalanced variants are needed. |
| Predict reaction energies | | Unit conventions, target column name, and split availability. |
| Recover missing species | | Whether the missing-species setting matches your method. |
| Benchmark synthesis tasks | | Whether the task is product prediction, retrosynthesis, reagent inference, or reaction-center prediction. |
For a complete reproducibility record, pair the selected dataset with the source mode and version described in Data Concept.
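One way to keep such a reproducibility record is a small structured object saved alongside experiment outputs. The sketch below is an assumption-laden illustration, not a SynRXN API: the `DatasetRecord` class and its field names are hypothetical, the folder names are the data folders listed on this page, and the `source_mode` and `version` values are placeholders to be filled in from Data Concept.

```python
import json
from dataclasses import asdict, dataclass

# Data folders as listed on this page, keyed by benchmark family.
TASK_FOLDERS = {
    "Reaction Rebalancing": "rbl",
    "Atom-to-Atom Mapping": "aam",
    "Reaction Classification": "classification",
    "Reaction Property Prediction": "property",
    "Synthesis Prediction": "synthesis",
}


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical record of which benchmark table an experiment used."""

    folder: str       # one of the data folders above
    dataset: str      # dataset name, e.g. "schneider_b"
    source_mode: str  # placeholder: take the value from Data Concept
    version: str      # placeholder: take the value from Data Concept

    def __post_init__(self):
        # Reject folders that are not documented on this page.
        if self.folder not in TASK_FOLDERS.values():
            raise ValueError(f"unknown data folder: {self.folder!r}")

    def to_json(self) -> str:
        """Serialize with sorted keys so records compare cleanly between runs."""
        return json.dumps(asdict(self), sort_keys=True)


record = DatasetRecord(
    folder=TASK_FOLDERS["Reaction Classification"],
    dataset="schneider_b",
    source_mode="<source mode>",  # placeholder, not a real value
    version="<version>",          # placeholder, not a real value
)
print(record.to_json())
```

Validating the folder at construction time catches typos early, and the sorted JSON output can be written next to model checkpoints or logs.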