This is best practice guidance

Although not legally required, it's an essential activity.

This Guide covers:

  • Great Britain (England, Scotland, Wales)

From:

  • AI and Digital Regulations Service

Developers - Avoiding algorithmic bias: data quality considerations for training and testing

Reviewed: 03 November 2023
Last updated: 06 November 2023

Successful digital technologies in health and social care are trained on high-quality machine learning datasets. To build healthcare technologies that adopters will buy, prioritise data quality.

Data quality for healthcare technologies

Think about data quality when developing plans for algorithm training, testing and validation. Also think about whether your data generalises appropriately to the UK market.

Generalisability means the degree to which you can apply research findings from a sample study to the whole population.

When adopters review your healthcare technology, they will consider many factors related to the quality of the data used to train, test and validate your algorithms.

The data you use should be:

  • representative of your target population in health and social care
  • representative of the intended use of your healthcare technology
  • technically high-quality (for example, the quality of images used to train an imaging technology)

If you train your algorithm on data that does not meet these requirements, it will not perform well.

Showing adopters that you have used high-quality data will build their trust and confidence in your healthcare technology. This makes it more likely that adopters will buy your technology.

Thinking about this early in the development process makes it less likely you will need to redo any steps related to algorithm development. This will save you time.

Data quality factors

Key data quality factors include:

Relevance of training and validation datasets:

  • Do they contain information on the relevant patient population?
  • Do they contain enough information on different population groups of interest?
  • Do the data sources have information on relevant exposures, outcomes and other covariates?
  • Does the dataset cover the range of intended settings in which the technology will be deployed?
  • Does it represent the full range of users or patients in the real-world setting? Have marginal cases been sufficiently represented?
  • Are the results likely to generalise to routine clinical practice? For example, are they representative of the population who would use the technology (according to your intended use)? Would the results be applicable to the health and care settings in which it would be deployed?
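
One way to approach the representativeness questions above is to compare the share of each population group in a dataset with its share in the target population. The following is a minimal sketch under stated assumptions: the record structure, the group labels, the reference shares and the 0.8 ratio threshold are all hypothetical values chosen for illustration, not figures from this guidance.

```python
from collections import Counter


def representation_gaps(records, group_key, reference_shares, min_ratio=0.8):
    """Flag groups whose share of the dataset falls below min_ratio times
    their share of the target population.

    records: list of dicts, one per patient (hypothetical structure).
    reference_shares: dict mapping group -> expected share (0 to 1),
        for example taken from population or service-use statistics.
    min_ratio: illustrative threshold only, not a regulatory value.
    """
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    flagged = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if observed < min_ratio * expected:
            flagged[group] = {"observed": round(observed, 3), "expected": expected}
    return flagged


# Hypothetical example: 10 patients labelled by age band.
patients = (
    [{"age_band": "18-39"}] * 6
    + [{"age_band": "40-64"}] * 3
    + [{"age_band": "65+"}] * 1
)
# Hypothetical target population shares.
target = {"18-39": 0.35, "40-64": 0.40, "65+": 0.25}

gaps = representation_gaps(patients, "age_band", target)
print(gaps)
```

A flagged group is a prompt to collect more data or to document the limitation. It does not by itself show that the algorithm will perform poorly for that group.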

Quality of training and validation datasets:

  • To what extent is data missing on key variables (exposures, outcomes and covariates)?
  • How valid are the measurements for key variables?
  • Are the key variables consistently recorded across patients and over time?
  • Is there enough data? 
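
The question about missing data on key variables can be checked with a simple per-variable summary. This is a minimal sketch, assuming records are held as dictionaries and that a value is missing when the key is absent, None or an empty string. The variable names and example records are hypothetical.

```python
def missing_rates(records, key_variables):
    """Return the fraction of records in which each key variable is missing.

    A value counts as missing if the key is absent, None, or an empty string.
    """
    total = len(records)
    rates = {}
    for var in key_variables:
        missing = sum(1 for r in records if r.get(var) in (None, ""))
        rates[var] = missing / total if total else 0.0
    return rates


# Hypothetical example records with an exposure, an outcome and one covariate.
records = [
    {"exposure": "drug_a", "outcome": 1, "smoking_status": "never"},
    {"exposure": "drug_a", "outcome": None, "smoking_status": ""},
    {"exposure": "drug_b", "outcome": 0},  # smoking_status not recorded
    {"exposure": "drug_b", "outcome": 1, "smoking_status": "current"},
]

rates = missing_rates(records, ["exposure", "outcome", "smoking_status"])
print(rates)
```

Running the same summary separately for each site or each year of collection is one way to see whether key variables are recorded consistently across patients and over time.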

External validation:

  • Does the technology perform well in datasets or systems that were not used for algorithm development?
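
External validation can be illustrated by computing the same performance measures on a held-out internal test set and on a dataset from a source that played no part in development, then comparing them. The sketch below assumes a binary classifier with labels coded 1 (condition present) and 0 (condition absent). The example labels and predictions are made up for illustration, and sensitivity and specificity are only two of many possible measures.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1 and 0.

    Either value is None when its denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical internal test set (same source as the development data).
internal = sensitivity_specificity(
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 1],
)

# Hypothetical external dataset (a source not used for development).
external = sensitivity_specificity(
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1],
)

print("internal:", internal)
print("external:", external)
```

A large drop from internal to external performance, as in this made-up example, suggests the algorithm may not generalise beyond the data it was developed on.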

Other considerations:

  • Does the technology introduce any ethical or equity issues?
  • Does your data-labelling process involve quality management, including addressing ways in which bias could be inadvertently introduced into the dataset?
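
One common check within quality management of data labelling is to have two annotators label the same items independently and measure how far they agree beyond chance. The sketch below computes Cohen's kappa for that purpose. It is offered as one possible check, not as a method prescribed by this guidance, and the labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if expected == 1.0:
        # Both annotators used a single identical label, so kappa is undefined.
        # Treating this as full agreement is an assumption of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same eight images.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Low agreement can point to unclear labelling instructions or to cases that are hard to classify. Comparing agreement across patient groups may show whether labelling quality differs between them.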

How to ensure data quality

It’s important to ensure data quality throughout the lifecycle of your healthcare technology. You need to consider the quality of your data when:

  • training your model on datasets for model development
  • testing your model on datasets for model evaluation and parameter variation
  • validating (internally and externally) that your technology meets user needs
  • updating your technology or expanding its intended use

Use these tips when training your algorithm:

Get more support

To discover how the regulatory organisations can assist you and for contact details, visit our 'Get Support' page.
