The Role of Data Curation in Reliable Healthcare AI

Summary:
In today’s rapidly expanding healthcare AI landscape, the greatest challenge to widespread adoption is not technological sophistication but the quality of the data that powers it. Data volume alone does not necessarily translate into clinical value.
The second article in the editorial series “AI in Healthcare: Credibility, Safety, and Impact on Clinical Practice” demonstrates that the true foundation of trustworthy medical AI lies in data curation. Learn how transforming raw data into structured, scientifically validated assets is the only way to mitigate bias, anticipate model performance degradation, and ensure real safety and effectiveness in bedside decision-making.
Key Topics Covered:
- The three waves of healthcare data
- Raw data vs. clinically curated structured data
- The structural challenge of data curation
- Data curation as a dynamic and continuous process
- The Epimed Prediction Model case study
- The next driver of hospital efficiency
Content:
Over the past decade, the healthcare sector has experienced at least three distinct waves in its relationship with data.
The first, around 2015, came under the banner of big data. Hospitals, payers and electronic health record vendors rushed to accumulate data, build data lakes, and digitize records. The challenge was that few knew what to do with all that information. Clinical curation was lacking, as was the bridge between medicine, science and business.
The second wave, around 2018, introduced business intelligence and the first predictive algorithms. Dashboards proliferated, and some models reached production environments. Yet real-world adoption remained limited, often hindered by insufficient scientific validation, whether internal or external, and by implementation processes that were too slow for the realities of clinical care and business sustainability.
The third wave, which began in 2023, is the era of generative and applied artificial intelligence, particularly large language models (LLMs). Interest in AI has never been higher. Yet amid genuine needs and widespread enthusiasm, the pattern remains familiar: real-world adoption is still limited. This highlights an important reality: the core challenge is not technological. It is the same challenge healthcare has always faced, and it begins long before the algorithm.
Raw Data Is Not a Valuable Clinical Asset
In healthcare, the term “data” can be misleading. Hospital systems generate enormous volumes of data every day. However, access to large volumes of information does not automatically translate into clinically meaningful insights, nor does it create a dataset suitable for developing, training and validating AI models that are useful in real-world settings.
The key distinction lies between raw data and structured data supported by clinical curation (Figure 1). Raw data reflects how information is recorded within each institution: nomenclature varies, critical fields may be incomplete, workflows differ from one organization to another, and definitions often lack scientific validation or standardized clinical application.
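Two of the raw-data problems described above, inconsistent nomenclature and incomplete critical fields, can be made concrete with a minimal sketch. Everything in it is a hypothetical illustration: the term mapping, the list of critical fields, and the function names are assumptions made for this example, not the vocabulary or tooling of any real institution. In practice, such definitions would be set and validated by professionals with clinical and research expertise.

```python
from typing import Optional

# Hypothetical mapping from site-specific labels to one standard term.
# A real vocabulary would be defined and validated by clinical experts.
STANDARD_TERMS = {
    "mech vent": "mechanical_ventilation",
    "mv": "mechanical_ventilation",
    "mechanical ventilation": "mechanical_ventilation",
    "vasopressor": "vasopressor_use",
    "pressors": "vasopressor_use",
}

# Hypothetical list of fields treated as critical for a risk model.
CRITICAL_FIELDS = ("age", "admission_type", "diagnosis")


def normalize_term(raw_label: str) -> Optional[str]:
    """Return the standard term for a raw label, or None if it is unmapped."""
    # Lowercase and collapse whitespace so "  Mech   Vent " matches "mech vent".
    key = " ".join(raw_label.lower().split())
    return STANDARD_TERMS.get(key)


def completeness(records: list[dict]) -> dict[str, float]:
    """Fraction of records with a non-empty value for each critical field."""
    total = len(records)
    result = {}
    for field in CRITICAL_FIELDS:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = filled / total if total else 0.0
    return result
```

Labels that return None are the interesting ones: they mark local documentation habits that need expert review before they can be mapped to a standard variable, which is the part of curation that cannot be automated.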
Structured data with clinical curation is fundamentally different. Such datasets originate from research environments, meaning the variables they contain have been clearly defined, tested and validated through scientific studies and publications. Data are collected using standardized protocols, validated by professionals with expertise in both research and clinical practice, and organized in a way that preserves the clinical logic of each variable. The difference is not merely technical; it is, in many ways, epistemological.
The fragmentation of healthcare systems and the lack of standardized digital infrastructure mean that representativeness is rarely guaranteed. AI models trained under these conditions often learn patterns related to documentation practices rather than patterns related to patient history, clinical progression, disease presentation or the patient’s journey throughout hospitalization.
The direct consequence is embedded bias, arguably the most significant risk facing healthcare AI today, even more so than the hallucinations that frequently dominate public discussions.
Figure 1: From Raw Data to AI Asset: The Three Stages of Clinical Data Maturity and the Key Questions That Drive Progression.

Why Is This Asset So Difficult to Replicate?
There is a structural reason why widespread adoption of healthcare AI remains limited: high-quality clinical data is not the core business of the organizations that generate most of it.
Hospitals exist to care for patients, and their data primarily supports clinical documentation. Electronic health record companies exist to capture and integrate information. Payers exist to manage financial and population health risks. Many of these organizations perform their functions exceptionally well.
However, they often lack the clinical data engineering capabilities required for variable curation and scientific validation of algorithms in real-world hospital and ICU environments. Simply put, these activities are not part of their organizational DNA.
This is where a different type of organization becomes essential: one that was built around clinical data, rooted in clinical research and quality improvement, and whose core mission is the collection, structuring, validation and analysis of information.
When a company accumulates nearly two decades of data from millions of hospitalizations across hundreds of institutions and uses that data to develop risk models, quality benchmarks and clinically useful tools, it provides something far more valuable than software: it provides an intelligence infrastructure.
It is also important to recognize that strong algorithmic performance does not necessarily translate into demonstrated clinical benefit. Most healthcare AI tools are tested using historical datasets rather than real-world hospital environments, and many fail to consider how those tools fit into clinicians’ actual workflows.
Data Curation as a Continuous Process
One aspect that is often overlooked is the dynamic nature of data curation. Clinical data ages. Disease patterns change, as dramatically demonstrated during the COVID-19 pandemic. A reliable healthcare AI dataset is not a static repository. It requires ongoing clinical and scientific oversight, with continuous evaluation of emerging evidence and incorporation of new information as healthcare evolves.
This process has a direct impact on predictive models and AI applications, as it involves continuous updating, revalidation, and monitoring for model drift. Clinical curation does not end when data enters the database; it continues throughout the entire lifecycle of the model (Figure 2).
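One simple way to picture drift monitoring is to compare, for each time window, the number of observed events with the number the model expected. The sketch below computes this observed-to-expected (O/E) ratio and flags windows that fall outside a tolerance band. It is an illustration under stated assumptions, not a description of how any specific model is monitored: the function names, the window labels, and the 0.8 to 1.2 band are hypothetical choices made for the example.

```python
def observed_expected_ratio(outcomes, predicted_risks):
    """Observed events divided by the sum of predicted probabilities."""
    expected = sum(predicted_risks)
    if expected == 0:
        raise ValueError("expected event count is zero")
    return sum(outcomes) / expected


def flag_drift(windows, lower=0.8, upper=1.2):
    """Return labels of windows whose O/E ratio falls outside [lower, upper].

    `windows` maps a label (for example, a quarter) to a pair
    (outcomes, predicted_risks), where outcomes are 0/1 values.
    The tolerance band is an illustrative assumption, not a clinical standard.
    """
    flagged = []
    for label, (outcomes, risks) in windows.items():
        ratio = observed_expected_ratio(outcomes, risks)
        if not lower <= ratio <= upper:
            flagged.append(label)
    return flagged


# Toy data: the model expects one event per window in both periods,
# but twice as many events are observed in the second window.
windows = {
    "2019Q4": ([1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]),
    "2020Q2": ([1, 1, 0, 0], [0.25, 0.25, 0.25, 0.25]),
}
flagged_windows = flag_drift(windows)
```

A flagged window is a prompt for clinical and scientific review rather than an automatic verdict: the shift may reflect a change in case mix, in documentation, or in the disease itself, and deciding which requires the same expertise that built the dataset.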
This perspective carries a practical implication for healthcare leaders and decision-makers. When evaluating AI solutions, the most important questions are not simply “What is the model’s accuracy?” but also “How was this dataset built?”, “Who built it?”, and “How is it maintained over time?”
Figure 2: The Clinical Data Lifecycle in AI: Data Curation as a Continuous Process, From Standardized Collection to Model Drift Monitoring and Revalidation.

Trust Begins Before the Algorithm
The Epimed database has been built according to this philosophy since 2008. Today, it includes more than 9 million indexed hospital admissions from over 900 hospitals across 14 countries, collected through standardized protocols and supported by continuous scientific and technical curation. It is, for example, the world’s largest database of critically ill patients, specifically designed to generate clinical knowledge rather than simply record information.
It was upon this foundation that, in 2022, we developed and implemented the Epimed Prediction Model, the first machine-learning-based prognostic score to be validated and deployed at scale across thousands of Brazilian ICUs.
The extent to which AI will improve healthcare depends fundamentally on the existence of an ecosystem capable of generating rapid, robust, and generalizable knowledge about the effects of these tools. Such an ecosystem necessarily begins with high-quality, large-scale, carefully curated data.
The next major driver of hospital efficiency will not come from organizations that simply possess more data. It will come from those that know how to generate greater value from the data that already exists and that have spent years building the foundations required for AI to work effectively in clinical practice. This foundation makes it possible to develop scientifically robust analyses, models, and algorithms based on reliable data, ultimately translating into measurable improvements in clinical efficiency, patient safety, and outcomes.
___________________________________________________________________________________________________
This is the second article in the editorial series “AI in Healthcare: Credibility, Safety, and Impact on Clinical Practice,” produced by Epimed Solutions.
Author: Dr. Jorge Salluh, physician-scientist and Professor of Intensive Care Medicine at IDOR, Co-founder and Scientific Director of Epimed Solutions, Editor-in-Chief of Critical Care Science (2023–2027), and recognized among the world’s top 2% most influential scientists (Stanford–Elsevier, 2020–2025).