Amid talk of sixth COVID-19 wave, can we trust predictive data models?

The University of Hong Kong medical faculty dean, Gabriel Leung Cheuk-wai, has expressed concern that a sixth wave of the COVID-19 outbreak could strike the city. His warning has sparked controversy over the accuracy of the data models and statistical algorithms used to predict the virus’ spread.

Epidemiologists employ models to anticipate how an infectious disease will progress, a practice that can be traced back to John Snow’s renowned cholera maps of 1854. Because few at the time had a thorough understanding of the sickness, Snow’s knowledge and expertise were not appreciated. Today, some argue that data models aren’t always reliable and that we shouldn’t depend on them, citing instances in which their predictions proved wrong, such as the early days of the pandemic, when some models projected that the virus would swiftly fade away. Others counter that data models are merely tools for better understanding complex systems.

Dr Yvette To, a professor of international studies at the City University of Hong Kong, highlighted major concerns regarding the accuracy of the data models in a letter to the South China Morning Post.

“The daily number of COVID deaths in Hong Kong would peak at nearly 100 by late March, and the cumulative number of deaths by mid-May would be around 3,206,” said Dr To, citing projections published by the HKU team on Feb 21. Reality, however, told a very different story: “On May 1, the reality was that a total of 9,100 COVID-related deaths had been recorded since the beginning of the fifth wave, nearly 4,000 more than the experts had anticipated,” she noted.

In fact, data models have been criticized around the world, not solely in Hong Kong. Some US experts have disputed the accuracy of the data models used to anticipate the course of COVID-19 in the US. Algorithms for predicting COVID-19 “have become more complex, but are still only as good as the assumptions at their core and the data that feed them,” said science reporter Elizabeth Landau in Smithsonian Magazine.

Data models for infection spread are simplified representations of reality. They are designed to capture the primary characteristics of real-world disease propagation closely enough to generate predictions we can rely on. Based on a variety of real-world data, a COVID-19 model predicts a date (or a range of dates) for a city’s peak number of cases.
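To make this concrete, below is a minimal, purely illustrative sketch of the classic SIR (susceptible-infected-recovered) model, the textbook starting point for this kind of prediction. The population size, transmission rate and recovery rate are hypothetical placeholders, not fitted to Hong Kong data, and real forecasting models are far more detailed.

```python
# Minimal SIR (susceptible-infected-recovered) sketch with hypothetical parameters.
def sir_peak_day(population, initial_infected, beta, gamma, days):
    """Step a discrete SIR model forward and return the day infections peak."""
    s = population - initial_infected   # susceptible
    i = float(initial_infected)         # currently infected
    r = 0.0                             # recovered
    peak_day, peak_infected = 0, i
    for day in range(1, days + 1):
        new_infections = beta * s * i / population  # contacts that become cases
        new_recoveries = gamma * i                  # cases that resolve
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        if i > peak_infected:
            peak_day, peak_infected = day, i
    return peak_day, peak_infected

# Roughly Hong Kong-sized population; beta and gamma are made up for illustration.
day, cases = sir_peak_day(population=7_400_000, initial_infected=100,
                          beta=0.4, gamma=0.2, days=365)
print(f"Infections peak around day {day}, with about {cases:,.0f} people infected")
```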

For illnesses like influenza, these models function well because scientists have decades of data to help them understand how flu outbreaks spread through different types of communities. Every year, influenza models are used to determine vaccine formulations and other flu-season preparations.

So why do data models occasionally fail to predict the proliferation of COVID-19 accurately? One explanation is that the virus is still new and its origins remain a mystery. Scientists are continually learning how it behaves and spreads, so data models must be revised and modified on a regular basis. What complicates things further is that the virus is constantly evolving and mutating, making it extremely difficult to predict its behavior. Several variants have already emerged, with more likely to arise in the future.

Another reason is that human behavior is hard to anticipate. People don’t always do what they’re expected to, which can cause data models to fail. Hong Kong, for example, has introduced strong social-distancing restrictions and requires everyone to wear masks in public. However, some people continue to flout the regulations, thus increasing the risk of infection.

Additionally, historical data isn’t necessarily a reliable predictor of future behavior. Just because something happened previously doesn’t mean it will happen the same way again. The world is constantly changing, and data models must be updated regularly to reflect these changes.

Professors at Brown University recently employed a sophisticated machine-learning technique called physics-informed neural networks (PINNs) to construct a data model that could more precisely predict the spread of COVID-19. PINNs are artificial neural networks of the same family that powers image recognition and speech-to-text translation, but they are trained not only to fit the data but also to obey known equations describing how an epidemic unfolds.

Researchers fed real-world data from multiple US states and various regions of Italy into a PINN-based data model to project the number of COVID-19 cases in those areas six months in advance. They discovered that “the actual case rates from January to June 2021 fell within the uncertainty window predicted by the models”, according to a study article on Brown University’s website. This was true for all four data sets included in the research.
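The Brown team’s code is not reproduced here, but the core idea behind a PINN can be sketched in a few dozen lines. The toy example below, written with PyTorch and using a simplified SIR equation, invented observation points and made-up rates (none of it the researchers’ actual model), trains a small network so that its output both matches the observed infection levels and satisfies the epidemic equations:

```python
# Toy physics-informed neural network (PINN) for a simplified SIR model.
# A sketch of the general idea only, not the Brown University implementation;
# the network size, "observed" data and rates below are all hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)

beta, gamma = 0.4, 0.2   # assumed transmission and recovery rates (per day)
T = 60.0                 # time horizon in days, used to rescale inputs

# Small network mapping time t -> (S, I, R) population fractions.
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 3), nn.Softplus())

def model(t):
    return net(t / T)    # rescale time so the tanh layers stay well-behaved

def physics_loss(t):
    """Penalty for violating the SIR equations at the collocation times t."""
    t = t.clone().requires_grad_(True)
    S, I, R = model(t).split(1, dim=1)
    dS = torch.autograd.grad(S, t, torch.ones_like(S), create_graph=True)[0]
    dI = torch.autograd.grad(I, t, torch.ones_like(I), create_graph=True)[0]
    dR = torch.autograd.grad(R, t, torch.ones_like(R), create_graph=True)[0]
    rS = dS + beta * S * I
    rI = dI - beta * S * I + gamma * I
    rR = dR - gamma * I
    return (rS**2 + rI**2 + rR**2).mean()

# Invented "observed" infected fractions at a few days (placeholder data).
t_obs = torch.tensor([[0.0], [10.0], [20.0], [30.0]])
i_obs = torch.tensor([[0.001], [0.010], [0.050], [0.080]])
t_col = torch.linspace(0.0, T, 121).reshape(-1, 1)  # where the ODE is enforced

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    data_loss = ((model(t_obs)[:, 1:2] - i_obs) ** 2).mean()  # fit observations
    loss = data_loss + physics_loss(t_col)                    # ...and the physics
    loss.backward()
    opt.step()

print("final combined loss:", float(loss))
```

The key design choice is the second loss term, which penalizes the network whenever its predicted curves violate the governing equations, so the model is anchored by epidemiological structure even where data are sparse.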

Data models offer insight into what the future may hold in terms of COVID-19 infections, so how do we design policy in response to such information? Government officials could gauge how effective a policy is projected to be by running a baseline model that predicts future case numbers without the policy, running a second simulation that incorporates the policy’s effects, and comparing the two, as sketched below. The government can then estimate the efficacy and cost-effectiveness of the strategy.
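Sticking with the toy SIR sketch from earlier, such a comparison might look like the following. The assumption that the policy cuts transmission by 30 percent is purely illustrative, not an estimate of any real measure’s effect.

```python
# Compare a baseline epidemic projection with a hypothetical policy scenario.
def total_infections(population, initial_infected, beta, gamma, days):
    """Run a discrete SIR model and return the cumulative number ever infected."""
    s = population - initial_infected
    i = float(initial_infected)
    total = float(initial_infected)
    for _ in range(days):
        new_infections = beta * s * i / population
        s -= new_infections
        i += new_infections - gamma * i
        total += new_infections
    return total

# Illustrative assumption: the policy reduces the transmission rate by 30%.
baseline = total_infections(7_400_000, 100, beta=0.4, gamma=0.2, days=180)
with_policy = total_infections(7_400_000, 100, beta=0.4 * 0.7, gamma=0.2, days=180)

print(f"Baseline projection: {baseline:,.0f} cumulative infections")
print(f"With the policy:     {with_policy:,.0f} cumulative infections")
print(f"Infections averted:  {baseline - with_policy:,.0f}")
```

The difference between the two runs is the model’s estimate of the policy’s benefit, which can then be weighed against its cost.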

Data models have made great strides since the onset of the pandemic. They are not yet perfect, but they are steadily improving. With more research and refinement, data models will, in my opinion, become more accurate in predicting the prevalence of COVID-19 and other diseases. Until then, we should remain cautious: the information obtained from these models should be used as a guide, not as the primary foundation for decision-making.

The author is founder of Save HK and a Central Committee member of the New People’s Party.

The views do not necessarily reflect those of China Daily.