How volume, velocity, variety, and veracity are reshaping discovery — and why FAIR data is now the price of admission
Materials science has always generated data, but in the last decade the scale, the speed, and the heterogeneity of that data have crossed a threshold that makes the term “big data” more than marketing. Computational repositories now hold tens of millions of entries, synthesis robots produce thousands of samples per day, and characterization instruments stream terabytes per hour. This article explains how big data is changing materials science, why the FAIR principles (Findable, Accessible, Interoperable, Reusable) have become the foundation of every serious program, and how Simreka helps organizations turn sprawling data estates into actionable sustainable-material decisions.
The Four V’s of Materials Big Data
Analysts describe big data in terms of four V’s (volume, variety, velocity, and veracity), and materials science is affected by all four simultaneously. Volume comes from the explosion of high-throughput computational screening and parallel synthesis: the Materials Project alone holds more than 150,000 calculated compounds and is used by more than 600,000 researchers worldwide. Variety reflects the radically different forms the data takes: structured property tables, spectra, crystal structures, electron microscope images, simulation trajectories, and free-text lab notebooks. Velocity captures the pace at which new data arrives from autonomous labs, synchrotrons, and compute clusters. Veracity is the sobering reminder that much of this data is noisy, inconsistently labeled, or unit-mismatched, and that cleaning it is typically the largest cost line in an informatics program.
Where the Data Comes From
Today’s materials data flows from three complementary sources. The first is first-principles computation: density functional theory (DFT) calculations run on high-performance clusters populate databases like AFLOW (with millions of calculated compounds), OQMD, and the Materials Project. The second is high-throughput experimentation: parallel synthesis, combinatorial libraries, and robotic characterization systems generate thousands of physical samples and their measured properties per week. The third is curated scientific literature: text-mining pipelines harvest decades of published papers, extracting compositions, processing conditions, and measured outcomes into machine-readable tables. Each of these pipelines has grown roughly an order of magnitude in throughput since 2015.
FAIR Data — From Slogan to Engineering Standard
The FAIR principles — Findable, Accessible, Interoperable, Reusable — were coined in 2016 and have become the de facto engineering standard for materials data infrastructure. The reason is practical: without FAIR, every team rebuilds the same data pipelines, and every collaboration stalls on unit mismatches and missing metadata. Platforms like Materials Cloud (paired with the AiiDA provenance framework) implement FAIR by storing not just the final results but also the inputs, intermediate steps, software versions, and authorship of every computation — so any downstream user can fully reproduce a simulation or extend it. Research Data Express (RDE), rolling out through 2026, further reduces the burden of routine processing and explicitly enforces findability, interoperability, reusability, and traceability.
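To make the provenance idea concrete, here is a minimal Python sketch of a record that keeps inputs, outputs, software version, and authorship together, and links each calculation to the ones it was derived from. The class name, field names, and placeholder values are assumptions chosen for illustration; this is not the AiiDA or Materials Cloud data model.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProvenanceRecord:
    """Illustrative record only; not the AiiDA or Materials Cloud schema."""
    code_name: str
    code_version: str
    inputs: dict
    outputs: dict
    author: str
    parent_ids: tuple = ()  # IDs of the calculations this one was derived from

    def record_id(self) -> str:
        # Serialize with sorted keys so identical content always hashes identically.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lineage(record_id: str, store: dict) -> list:
    """Walk from one record back through all of its ancestors."""
    chain, stack = [], [record_id]
    while stack:
        rid = stack.pop()
        chain.append(rid)
        stack.extend(store[rid].parent_ids)
    return chain


# Hypothetical two-step workflow with placeholder values, not real results.
relax = ProvenanceRecord(
    code_name="example-dft", code_version="1.2.0",
    inputs={"structure": "Si2", "kpoints": [8, 8, 8]},
    outputs={"energy_eV": -1.0},
    author="A. Researcher",
)
gap = ProvenanceRecord(
    code_name="example-dft", code_version="1.2.0",
    inputs={"structure": "Si2-relaxed", "kpoints": [12, 12, 12]},
    outputs={"band_gap_eV": 1.0},
    author="A. Researcher",
    parent_ids=(relax.record_id(),),
)
store = {r.record_id(): r for r in (relax, gap)}
print(lineage(gap.record_id(), store))
```

Deriving the identifier from the record's content is one possible design choice: any change to an input or a software version produces a new ID, so a downstream user can tell at a glance whether two results came from the same recipe.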
NOMAD Scale in 2026: 19M FAIR Entries and Growing
The NOMAD (Novel Materials Discovery) platform has become the largest materials-science FAIR repository in the world. Its public dashboard passed 19 million FAIR data entries in 2026, sourced from more than 50 atomistic codes and totalling more than 100 million individual total-energy calculations. Unlike the curated-outputs-only model of many databases, NOMAD stores inputs and outputs for every calculation, so downstream users can re-run, extend, or re-parameterise work that somebody else originally published. The FAIRmat NFDI consortium, operating across Germany and partner institutions, has added experimental-characterisation pipelines (XPS, XRD, electron microscopy) under the same FAIR framework, narrowing the gap between computational and experimental data. For a 2026 materials programme, treating NOMAD as a first-party data source, rather than a library to cite occasionally, is often what separates a slow programme from a fast one.
The Major Repositories and What They Contain
| Repository | What It Holds | Scale (approx.) | Primary Use |
|---|---|---|---|
| Materials Project | DFT-calculated crystals, phonons, elastic constants | ~150K+ compounds, 600K+ users | Screening for stability, band gaps, elastic properties |
| AFLOW | DFT crystal structures & derived properties | Millions of entries | High-throughput alloy & electronic screening |
| OQMD | Formation energies, phase diagrams | ~1M DFT entries | Thermodynamic stability, convex hull analysis |
| NOMAD | Raw computational data, FAIR-compliant | 19M+ entries, 100M+ calcs (2026) | Reproducibility, method benchmarking |
| Materials Cloud + AiiDA | Simulations with full provenance | Growing, provenance-rich | Reproducible workflows, FAIR by design |
| Citrination / Citrine | Curated experimental + computational data | Millions of records | Industrial ML training data |
| Matbench | Standardized ML benchmarks | 13 canonical tasks | Model comparison & reproducibility |
How Big Data Powers Modern ML Models
Big data is not valuable in itself — it is valuable because it enables models that were previously impossible. Graph neural networks that predict formation energies within chemical accuracy need hundreds of thousands of labeled crystal structures to train. Generative models that propose new stable compounds (MatterGen, GNoME) need even more. Foundation models for chemistry — pretrained on hundreds of millions of molecular graphs and fine-tuned for specific properties — would not exist without aggregated, FAIR-compliant corpora. The virtuous loop runs: more data → better models → better candidate suggestions → smarter experiments → more curated data.
The Data Infrastructure Challenges That Actually Matter
Most real-world problems with big data in materials science are not about ML algorithms but about plumbing. Schemas drift between labs. Units get confused (MPa vs. GPa, wt% vs. mol%). Process conditions are often absent from legacy records. Raw instrument files sit in proprietary formats. Permissions silos block cross-team access. Industry-grade programs invest as much in metadata standards, ontologies (EMMO, MatVoc), and ELN integrations as they do in models. The 2026 infrastructure narrative is dominated by three themes: ontology convergence, automated data quality scoring, and secure multi-party data sharing that lets competitors pool pre-competitive benchmarks without exposing IP.
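The unit confusions mentioned above (MPa vs. GPa, wt% vs. mol%) are the kind of plumbing that a few small, explicit conversion functions can remove. The sketch below is illustrative only: the function names, record layout, and sample values are assumptions, and a production pipeline would more likely rely on a dedicated units library and an agreed ontology.

```python
# Factors that convert each pressure unit to pascals.
PRESSURE_TO_PA = {"Pa": 1.0, "kPa": 1e3, "MPa": 1e6, "GPa": 1e9}


def to_gpa(value: float, unit: str) -> float:
    """Convert a pressure or modulus to GPa, rejecting unknown units."""
    if unit not in PRESSURE_TO_PA:
        raise ValueError(f"unknown pressure unit: {unit!r}")
    return value * PRESSURE_TO_PA[unit] / PRESSURE_TO_PA["GPa"]


def wt_to_mol_percent(wt_percent: dict, molar_mass: dict) -> dict:
    """Convert weight percent to mole percent.

    wt_percent maps component -> wt%, molar_mass maps component -> g/mol.
    """
    moles = {name: w / molar_mass[name] for name, w in wt_percent.items()}
    total = sum(moles.values())
    return {name: 100.0 * n / total for name, n in moles.items()}


# Two hypothetical legacy records reporting the same modulus in different units.
records = [
    {"sample": "A-001", "youngs_modulus": 210.0, "unit": "GPa"},
    {"sample": "A-002", "youngs_modulus": 210000.0, "unit": "MPa"},
]
for rec in records:
    rec["youngs_modulus_gpa"] = to_gpa(rec["youngs_modulus"], rec["unit"])

# A 70/30 wt% Cu-Zn mix, using standard atomic masses in g/mol.
brass = wt_to_mol_percent({"Cu": 70.0, "Zn": 30.0}, {"Cu": 63.546, "Zn": 65.38})
print(records)
print(brass)
```

Raising an error on an unrecognized unit, instead of passing the value through unchanged, is deliberate: a silent pass-through is how a factor-of-1000 error ends up in a training set.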
From Storage to Compute: Streaming Characterisation at Instrument Scale
Experimental instruments are now the dominant velocity source in materials big data. A modern 4D-STEM detector streams more than 1 TB/hour; a synchrotron XRD beamline running at 100 Hz can fill a petabyte a week. The traditional “save to NAS, analyse later” workflow is no longer viable. The 2025–2026 response is streaming analytics at the edge: ptychographic reconstruction on FPGA cards co-located with the detector, online dimensionality reduction via streaming PCA, and triage models that flag frames worth archiving versus frames that can be summarised statistically. Europe’s FAIRmat NFDI is piloting a reference architecture at DESY and MAX IV in which every streamed frame is hashed, tagged with instrument provenance, and written to a FAIR-compliant object store within seconds, so that the “characterisation exhaust” of a user facility becomes a searchable scientific asset rather than a backup-tape graveyard.
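The hash-tag-triage pattern can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the FAIRmat reference architecture: the function names, metadata fields, and the mean-intensity triage rule are all hypothetical stand-ins, and a real deployment would use a trained triage model and write to an object store.

```python
import hashlib
from datetime import datetime, timezone


def tag_frame(frame_bytes: bytes, instrument_id: str, frame_index: int) -> dict:
    """Build a provenance tag for one detector frame (hypothetical fields)."""
    return {
        "sha256": hashlib.sha256(frame_bytes).hexdigest(),
        "instrument_id": instrument_id,
        "frame_index": frame_index,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "n_bytes": len(frame_bytes),
    }


def triage(frame_bytes: bytes, threshold: float = 8.0) -> str:
    """Toy rule: archive frames whose mean byte value exceeds a threshold."""
    mean = sum(frame_bytes) / len(frame_bytes) if frame_bytes else 0.0
    return "archive" if mean > threshold else "summarise"


def process_stream(frames, instrument_id: str):
    """Yield (tag, decision) for each frame as it arrives."""
    for index, frame in enumerate(frames):
        yield tag_frame(frame, instrument_id, index), triage(frame)


# Two synthetic frames: one empty of signal, one bright.
frames = [bytes([0] * 64), bytes([200] * 64)]
results = list(process_stream(frames, "detector-01"))
for tag, decision in results:
    print(tag["frame_index"], tag["sha256"][:12], decision)
```

Because `process_stream` is a generator, each frame is tagged and triaged as it arrives instead of after the whole acquisition has been buffered, which is the property that matters at instrument data rates.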
How Simreka Operationalizes Big Data for Sustainability
Simreka’s AI-Powered Formulation Generator ingests curated property data from public repositories and an organization’s proprietary ELNs, harmonizes units and metadata, and trains tailored surrogate models that respect FAIR-style provenance. Simreka’s Virtual Experiment Platform layers ecoinvent and proprietary LCA datasets on top, so every property prediction comes with an embodied-impact estimate. Simreka’s MatIQ, the AI Co-Pilot for Material Innovation, joins chemical-substance data with live REACH, TSCA, and SVHC lists. Simreka’s Databank, the World’s Largest Material Informatics Platform, extends the data fabric to feedstocks, including PCR and bio-based sources. Together they let R&D teams leverage the best of the open big-data ecosystem without drowning in infrastructure work.
Benchmarking Big-Data Programme Maturity
One practical framework for mid-size manufacturers in 2026 is a five-level maturity model that maps cleanly to the data infrastructure typically observed in site visits. The table below summarises the levels and their operational signatures.
| Level | Name | Signature | Typical ML Capability | Time to Next Level |
|---|---|---|---|---|
| 1 | Spreadsheet Era | Excel, shared drives, tribal knowledge | None | 6–12 months |
| 2 | Consolidated | Single ELN/LIMS, one schema | Simple regression | 9–18 months |
| 3 | FAIR-aligned | Controlled vocab, units, IDs | Tree ensembles, transfer learning | 12–24 months |
| 4 | Integrated big-data | Public + internal harmonised, provenance tracked | GNNs, physics-informed ML | 18–36 months |
| 5 | Closed-loop / autonomous | Self-driving lab, continuous retraining | Generative + active learning | — |
Ethics, Licensing, and Open Data in 2026
Big data is also a legal artefact. The 2026 FAIRmat community guidelines have made CC-BY-4.0 the default licence for public computational materials data, with CC0 for pure factual tables; experimental data is increasingly published under CDLA-Permissive-2.0 to retain attribution without restricting downstream use. On the regulatory side, CSRD and ESPR disclosure rules are pushing more sustainability-related material data into the public domain, which has already started to rebalance the open-vs-proprietary ratio in favour of openness. Ethical questions are surfacing too: autonomous labs that churn out thousands of samples per week consume real reagents, energy, and skilled labour, and the environmental accounting of AI-driven materials discovery itself is now a peer-reviewed subject. The Simreka platform propagates licence and provenance metadata end-to-end so customers can defend every data point if asked.
Conclusion
Big data has moved from a buzzword to the operating substrate of materials science. The Materials Project, AFLOW, NOMAD, Materials Cloud, and their industrial counterparts hold more material information than any one lab could generate in ten lifetimes — and they are growing faster every year. What separates leading organizations from laggards is not access to this data (it is largely open) but the discipline of FAIR practices, the plumbing to harmonize it, and the workflows to turn it into decisions. The winners will be the teams that treat data infrastructure as a first-class product, not a side project.
Frequently Asked Questions
Q1. Is public data enough to train useful models?
For broad screening — yes, often. For highly specific formulations (a particular resin chemistry, a proprietary alloy family), public data must be augmented with internal experimental records. Hybrid datasets almost always outperform either source alone.
Q2. What does “FAIR-compliant” actually require?
Persistent identifiers; rich, machine-readable metadata; retrieval over open, standardized protocols; standardized units and ontologies; and clear licensing. In practice it means adopting practices and tools such as DOI registration, controlled vocabularies (EMMO), and storage formats (HDF5, NeXus) that other systems can consume.
Q3. How do I prevent “data swamp” problems?
Set schema standards before data is collected, enforce them with automated validation, and maintain a single curated “gold” layer for ML training separate from raw ingest. Data quality scoring and lineage tracking help teams trust what they pull.
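As a rough sketch of what automated validation and a separate “gold” layer might look like, the Python below checks each raw record against a small schema and routes failures to a quarantine list with the reasons attached. The field names, allowed units, and sample records are assumptions made for this example; a real program would encode its own schema, typically in a validation library or the ELN itself.

```python
# Hypothetical schema: required fields and the units allowed per property.
REQUIRED_FIELDS = {
    "sample_id": str,
    "property": str,
    "value": (int, float),
    "unit": str,
}
ALLOWED_UNITS = {
    "tensile_strength": {"MPa", "GPa"},
    "density": {"g/cm3", "kg/m3"},
}


def validate(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for field: {name}")
    prop, unit = record.get("property"), record.get("unit")
    if isinstance(prop, str) and isinstance(unit, str):
        if prop not in ALLOWED_UNITS:
            errors.append(f"unknown property: {prop}")
        elif unit not in ALLOWED_UNITS[prop]:
            errors.append(f"unit {unit!r} not allowed for {prop}")
    return errors


def split_layers(raw_records):
    """Route valid records to the gold layer and the rest to quarantine."""
    gold, quarantine = [], []
    for record in raw_records:
        errors = validate(record)
        if errors:
            quarantine.append({"record": record, "errors": errors})
        else:
            gold.append(record)
    return gold, quarantine


raw = [
    {"sample_id": "S1", "property": "tensile_strength", "value": 450.0, "unit": "MPa"},
    {"sample_id": "S2", "property": "tensile_strength", "value": 0.45, "unit": "psi"},
    {"sample_id": "S3", "property": "density", "unit": "g/cm3"},
]
gold, quarantine = split_layers(raw)
print(len(gold), "gold;", len(quarantine), "quarantined")
```

Keeping the rejection reasons next to each quarantined record gives curators a worklist to fix, and it doubles as a simple lineage trail for why a data point never reached the training set.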
Q4. How much does data curation typically cost?
Industry benchmarks suggest 60–80% of a materials informatics project budget goes to data collection, cleaning, and harmonization rather than modeling. Budgeting accordingly is one of the strongest predictors of project success.
Q5. What role do autonomous labs play?
Self-driving labs (A-Lab at LBNL, MIT’s MOAC, Toyota’s Tri-Labs) close the loop between ML models and experimental data generation, producing clean, structured, machine-consumable data at scale — exactly the kind of high-quality fuel modern models need.
Q6. How should mid-size companies start?
Begin by consolidating internal ELN and LIMS data into a harmonized schema. Then integrate selected public datasets for the property classes you care about. Only then layer on modeling. Starting with models before data is the most common reason projects stall.