High Quality Data: The backbone of accurate, viable AI

 
A mesmerizing view of countless tiny dots arranged in horizontal lines stretching into the distance, meant to mimic high quality artificial intelligence training data.
 

Recent progress in AI continues to amaze us. More sophisticated and efficient machine learning algorithms, the proliferation of large language models, new ways to employ deep learning, and novel use cases appear daily. With such a spotlight on AI, organizations are appropriately shifting focus to the quality and provenance of the data used to train, validate, and verify their technology investments. The industry term for this is “data-centric AI” [1]. In a nutshell, it means that the emphasis is put on the data behind the algorithms, not just the algorithms themselves, since the data that drives the AI will ultimately determine its success in real-world scenarios. Why, you may ask, is data so important? First, bad data is costly. IBM estimates that poor-quality data costs the US $3 trillion per year [4]. More importantly, increasing data quality can boost AI model training efficiency and performance [5].

An example photo of a malaria RDT, activated to be positive

What does high quality data look like? Some AI development frameworks focus on data quantity, hoping that with a large enough data set, quality issues will shake out. However, that can introduce unintended bias and errors which are hard to identify and correct for later [9]. Large data sets also increase computational time and cost, and can ultimately be unrealistic to acquire. At Audere we build AI for multiple use cases, including interpreting rapid diagnostic test (RDT) results for illnesses such as malaria and HIV. Whether the AI is 1) enabling targeted supportive supervision of community health workers (CHWs) in the field, 2) remotely training health workers on proper RDT administration and interpretation, 3) being used in private sector pharmacies to cost-effectively scale and monitor incentive programs, or 4) backstopping self-testing programs, it is essential to create a dataset which takes into account a variety of environments and geographies.

Image data for computer vision

Photographer uses a custom-built app to capture RDT training & validation images

Throughout our AI journey, we continue to actively reduce the amount of data we need to support each unique RDT brand and type. Efficiencies are achieved with a focus on building a core foundation model which is then fine-tuned for unique RDTs and novel use cases. While big data has been tremendously useful for many domains, even outside of AI and data science, in our experience, more is not necessarily better. For example, if our goal is to determine whether there is an RDT present in a particular photo, an object detection task, we could use 500 images containing an RDT to train the model. However, if 300 out of the 500 photos have almost identical compositions or contain the exact same backgrounds, the true number of useful photos is only 200. An even more serious problem is that with repetitive data, the AI may inadvertently focus on factors other than the object we want to recognize, including backgrounds and image artifacts. This is referred to as selection bias [8]. To prevent these issues, we engineer specific conditions in our data generation to ensure proper coverage of a variety of edge cases such as low lighting, shadows, and blurriness. We find that by mixing adversarial conditions into our training data sets, the resulting AI model is more resilient in real-world scenarios.
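The repetitive-data problem described above can be caught mechanically before training. Below is a minimal, illustrative sketch (not Audere's actual pipeline) that computes a toy difference hash over small brightness grids, which stand in for downscaled grayscale photos, and flags near-duplicate pairs; a production system would typically apply a perceptual-hashing library to real image files:

```python
def dhash(pixels):
    # Difference hash over a grid of 0-255 brightness values: each bit
    # records whether a pixel is brighter than its right-hand neighbor.
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits

def hamming(a, b):
    # Number of bit positions where two hashes disagree.
    return sum(x != y for x, y in zip(a, b))

def near_duplicate_pairs(images, threshold=4):
    # Flag image pairs whose hashes differ in at most `threshold` bits;
    # such pairs add little new information to a training set.
    hashes = [dhash(img) for img in images]
    return [(i, j)
            for i in range(len(hashes))
            for j in range(i + 1, len(hashes))
            if hamming(hashes[i], hashes[j]) <= threshold]

# Toy 4x5 brightness grids: the first two differ only in the last row,
# the third has a completely reversed gradient.
ramp = [10, 20, 30, 40, 50]
img_a = [ramp] * 4
img_b = [ramp] * 3 + [[50, 40, 30, 20, 10]]
img_c = [list(reversed(ramp))] * 4
print(near_duplicate_pairs([img_a, img_b, img_c]))  # → [(0, 1)]
```

In the 500-photo example from the text, a pass like this would collapse the 300 near-identical compositions into a handful of representatives before any training time is spent on them.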

Another critical component of high quality training data lies in the ground truth, or metadata, associated with each record, which is used as labels during development. When building novel datasets for training and validation, well-trained human labelers are typically required to annotate data. As an organization that operates at the intersection of healthcare and technology, we take data integrity seriously, with procedures in place to ensure the metadata linked to each image is of the utmost quality. To prevent labeling errors and inconsistencies, we have developed a stringent training and validation process, whereby a labeler is not brought into production until they achieve a perfect score on a fit-for-purpose qualification test. For quality assurance, we employ a panel of labelers, in which each image used in training and validation is independently labeled by multiple individuals, enabling us to identify issues quickly and efficiently.
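A panel-of-labelers scheme like the one above can be expressed in a few lines. This hypothetical sketch (the image ids, labels, and unanimity rule are illustrative, not Audere's actual policy) accepts an image's label only when the panel agrees unanimously and flags any disagreement for review:

```python
from collections import Counter

def panel_consensus(panel_labels):
    # panel_labels maps an image id to the labels assigned independently
    # by each member of the labeling panel. Unanimous images are
    # accepted; any disagreement is flagged for human review.
    accepted, flagged = {}, []
    for image_id, labels in panel_labels.items():
        (label, count), = Counter(labels).most_common(1)
        if count == len(labels):
            accepted[image_id] = label
        else:
            flagged.append(image_id)
    return accepted, flagged

labels = {
    "rdt_001": ["positive", "positive", "positive"],
    "rdt_002": ["negative", "faint positive", "negative"],
}
accepted, flagged = panel_consensus(labels)
print(accepted)  # → {'rdt_001': 'positive'}
print(flagged)   # → ['rdt_002']
```

Disagreements surfaced this way are doubly useful: they catch labeling anomalies early, and they often point at exactly the ambiguous cases (such as faint lines) the model most needs more examples of.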

The last piece of the puzzle involves continuous evaluation of AI algorithms active in the field. In a rapidly changing world, we must take into account the dynamic data requirements of each project and monitor how user needs or environmental factors may evolve. Performance monitoring is accomplished by sampling and labeling live data to check for drift or anomalies, with insights fed directly back into the development cycle. Trusted AI includes a plan to care for, feed, and validate the AI algorithms over time. Without this diligence, the possibility of drift increases. 
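The drift check described above can be sketched with a simple distribution statistic. Below is a minimal, hypothetical example (not Audere's actual monitoring pipeline) using the Population Stability Index to compare the mix of RDT read outcomes seen at training time against a labeled sample of live data:

```python
import math

def psi(expected, observed, eps=1e-6):
    # Population Stability Index between two categorical distributions,
    # e.g. the share of negative / positive / invalid RDT reads at
    # training time vs. in a sample of live traffic. Common rule of
    # thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
    score = 0.0
    for category, e in expected.items():
        e = max(e, eps)
        o = max(observed.get(category, 0.0), eps)
        score += (o - e) * math.log(o / e)
    return score

# Illustrative proportions only.
training = {"negative": 0.78, "positive": 0.20, "invalid": 0.02}
live     = {"negative": 0.48, "positive": 0.50, "invalid": 0.02}
print(round(psi(training, training), 6))  # → 0.0 (identical distributions)
print(psi(training, live) > 0.25)         # → True, worth investigating
```

A spike in such a score does not by itself say whether the model or the world changed, which is why the sampled live data is also labeled and fed back into the development cycle.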

Image data production for a breadth of rapid diagnostic tests

RDT photo capture team

As one can imagine, supporting each additional RDT requires image data where the test is photographed in its expected environments and result states. An RDT photoset life cycle begins by procuring RDTs and reagents to properly activate positive and negative cassettes. Next, we create a photo capture plan which specifies the volume of cassettes to be activated and photographed within each environment and result state combination, ensuring that the plan optimizes for capturing multiple images of the same cassette under different conditions. To home in on difficult RDT cases, we capture images of RDTs at a variety of line intensities, purposefully over-sampling conditions such as faint positives, which are of particular interest: they can be missed by human readers, with big implications for high-stakes tests like HIV.
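A capture plan of this kind is essentially a weighted cross product of environments and result states. The sketch below is illustrative only; the environment names, result states, and volumes are invented, not Audere's actual plan:

```python
from itertools import product

def capture_plan(environments, result_states, base_volume=20, oversample=None):
    # One row per (environment, result-state) combination; conditions of
    # special interest (here, faint positives) get a volume multiplier.
    if oversample is None:
        oversample = {"faint positive": 2.0}
    return [
        {"environment": env,
         "result": state,
         "photos": int(base_volume * oversample.get(state, 1.0))}
        for env, state in product(environments, result_states)
    ]

plan = capture_plan(
    environments=["indoor low light", "outdoor sunlight"],
    result_states=["negative", "positive", "faint positive"],
)
print(len(plan))  # → 6 combinations
print(sum(row["photos"] for row in plan if row["result"] == "faint positive"))  # → 80
```

Enumerating the plan up front also makes the physical work auditable: the capture team can check off each combination as it is photographed, and gaps in coverage are visible before training ever starts.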

Unlike AI model development, data production deals directly with the physical world and is therefore constrained by many challenges, including import restrictions and healthcare regulations. Our quality assurance rigor extends to inventory management and RDT activation. Each brand and type of RDT requires experimentation to calibrate the reagent concentrations needed to produce a target line intensity distribution which mimics variability in the field. By collaborating with local laboratories and domain experts in South Africa, we co-develop and optimize data production specifications and workflows, allowing for a realistic representation of the user environments where the AI is deployed (stay tuned for a deeper dive in a future post). Much of the data generally used to build and train AI models is still US- and Western-centric, leading to inherent bias. Audere is part of a movement to establish balance, inclusivity, and equity in the AI ecosystem [10] through thoughtful dataset creation; we further explore this concept in previous posts [11, 12, 13].

Conclusion

When building AI for the real world, where data is not set in stone and is often messy, we emphasize the creation of systematically engineered [6] datasets to maximize generalizability. The development of effective and impactful solutions hinges on several key data generation factors:

To effectively mimic conditions under full sunlight, the capture team ventures outdoors to produce images with strong light

  • Data quality and intentionality supersede data quantity.

  • High quality data includes a variety of adversarial and non-adversarial conditions relevant to the use case.

  • Consistent and high quality data labels are essential, along with processes to detect and remove labeling anomalies.

  • Data must be captured in environments and geographies, and on devices, as close as possible to those where programs will be run.

  • Continuous performance monitoring in the field can protect against drift and validate accuracy by identifying potential gaps and addressing shortcomings iteratively.

Great datasets are not created by magic. It takes diligence, planning, effort, and intention to produce high quality datasets which can be used to build effective, accurate, and viable AI that solves real-world problems.


References

[1] “Why it’s time for ‘data-centric artificial intelligence,’” MIT Sloan. https://mitsloan.mit.edu/ideas-made-to-matter/why-its-time-data-centric-artificial-intelligence

[2] P. Villalobos, J. Sevilla, L. Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson, “Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning,” arXiv (Cornell University), Oct. 2022, doi: https://doi.org/10.48550/arxiv.2211.04325.

[3] M. Majam et al., “Usability assessment of seven HIV self-test devices conducted with lay-users in Johannesburg, South Africa,” PLOS ONE, vol. 15, no. 1, p. e0227198, Jan. 2020, doi: https://doi.org/10.1371/journal.pone.0227198.

[4] T. C. Redman, “Bad Data Costs the U.S. $3 Trillion Per Year,” Harvard Business Review, Sep. 22, 2016. https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year

[5] C. G. Northcutt, L. Jiang, and I. L. Chuang, “Confident Learning: Estimating Uncertainty in Dataset Labels,” J. Artif. Intell. Res., vol. 70, pp. 1373–1411, 2019.

[6] “Data-Centric AI vs. Model-Centric AI,” Introduction to Data-Centric AI. https://dcai.csail.mit.edu/2024/data-centric-model-centric/

[7] G. Press, “Andrew Ng Launches A Campaign For Data-Centric AI,” Forbes. https://www.forbes.com/sites/gilpress/2021/06/16/andrew-ng-launches-a-campaign-for-data-centric-ai/?sh=2306c06774f5

[8] “Selection bias,” Catalog of Bias, Mar. 28, 2017. http://www.catalogofbias.org/biases/selection-bias/

[9] C. G. Northcutt, A. Athalye, and J. W. Mueller, “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks,” ArXiv, vol. abs/2103.14749, 2021.

[10] “Introducing the Inclusive Images Competition,” blog.research.google, Sep. 06, 2018. https://blog.research.google/2018/09/introducing-inclusive-images-competition.html

[11] “Out of sight out of mind? A dangerous mindset for global health,” AudereNow.org, Shawna Cooper, Oct. 21, 2022. https://www.auderenow.org/what-were-thinking-about/reflections-from-africa-climate-change-summit-amp-health-sector-impact-3lmww-a9k62

[12] “Technology: The Next Step Up in HIV Testing Innovation,” AudereNow.org, Dino Rech and Tim Tucker, June 27, 2022. https://www.auderenow.org/what-were-thinking-about/technology-the-next-step-up-in-hiv-testing-innovation

[13] “My hope for the future: timely and efficacious care for all,” AudereNow.org, Rouella Mendonca, Nov 8, 2022. https://www.auderenow.org/what-were-thinking-about/reflections-from-africa-climate-change-summit-amp-health-sector-impact-3lmww

 


Before finding her passion in the intersection between artificial intelligence and digital health, Bronte was a senior medical laboratory scientist at Seattle Children's, developing and performing high complexity diagnostic testing, including one of the first COVID-19 PCR tests in Seattle. She is excited to contribute to building healthcare technology for those who need it the most with her interdisciplinary background. In her free time, she loves traveling, doing CrossFit with friends, and yoga.

 