PADI-web, a platform for epidemiological surveillance in animal health

Coronavirus is the word on everyone’s lips, and details of progress in developing an automated animal health surveillance tool, PADI-web, have just been published. PADI-web searches articles from different online information sources to extract epidemiological information from them. This open-access platform for international health surveillance responds to the need of epidemiologists to monitor in real time the emergence and spread of animal infectious diseases. Explanations from Mathieu Roche and Sarah Valentin, a researcher and a doctoral student in Cirad’s TETIS joint research unit, who are developing the tool in collaboration with the ASTRE unit (Cirad – INRAE).

Global animal disease outbreak detection and monitoring rely on official intergovernmental organisations, such as FAO and OIE, as well as unofficial media sources, for which the manual extraction of relevant information is complex and time-consuming. In France, the development of the epidemic intelligence tool PADI-web (Platform for Automated extraction of Disease Information from the web) began in 2016. The International Health Monitoring Unit (VSI), part of the French Animal Health Epidemiological Surveillance Platform (ESA), is already using it to monitor infectious diseases that present risks of introduction into the French territory and harmful effects on animals and production chains.

Supplementing official information with relevant unofficial texts

PADI-web proposes supplementing the official information provided with unofficial sources. Its field of exploration: news articles selected by Google News. And since using robust media articles does not rule out incorrect information, the expertise of epidemiologists is essential to validate their relevance and to supplement the data of the tool’s learning algorithm.

Avian influenza, African swine fever, bluetongue disease, but also coronavirus: based on Google News searches, PADI-web collects articles, classifies them, translates them into English and extracts epidemiological information from them. These texts are collected by disease names and animal hosts, but also by symptoms, which may be the only element mentioned in some articles. The advantage of this is clear: not only does the system provide fresh information, which has not yet been published via the official channels, and identify it according to location and date, but it can also detect the signals of a disease before it is declared. Its multilingual media coverage also considerably enhances its database. To date, it includes more than 200 000 articles, in English, French, Chinese, Spanish and Arabic, among others.
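As a rough illustration of how collection by disease names, hosts and symptoms might translate into searches, here is a minimal Python sketch. It assumes the public Google News RSS search endpoint and an AND-joined query syntax. The function name, parameters and example terms are hypothetical; the article does not give PADI-web's actual feed URLs or query lists.

```python
from urllib.parse import urlencode

# Assumption: the public Google News RSS search endpoint.
# PADI-web's real feed configuration is not described in the article.
GOOGLE_NEWS_RSS = "https://news.google.com/rss/search"


def build_feed_url(terms, lang="en", country="US"):
    """Join search terms with AND and return an RSS search URL (hypothetical helper)."""
    # Multi-word terms are quoted so they are searched as a phrase.
    query = " AND ".join(f'"{t}"' if " " in t else t for t in terms)
    params = {"q": query, "hl": lang, "gl": country, "ceid": f"{country}:{lang}"}
    return f"{GOOGLE_NEWS_RSS}?{urlencode(params)}"


# Disease-based query: a disease name combined with a host.
disease_feed = build_feed_url(["African swine fever", "pigs"])

# Symptom-based query: a clinical sign and a host, with no disease term.
symptom_feed = build_feed_url(["abortions", "cows"])

print(disease_feed)
print(symptom_feed)
```

Because the symptom-based query contains no disease name, it can return articles about events that have not yet been attributed to a known disease.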

Depending on the health news, the number of articles on a disease varies from 0 to more than 40 per day when a disease emerges or new outbreaks are detected.
Two examples: in the week commencing 24 February 2020, PADI-web detected on Google News 25 relevant articles on avian influenza from unofficial sources. And since 31 January 2019, more than 2 600 articles on coronaviruses have been classified as relevant in PADI-web.

A query on the term “coronavirus” and an article classified as relevant in PADI-web (screenshot, April 2020)

The evolution of international health surveillance

Among existing media monitoring tools, PADI-web is the only one dedicated to the surveillance of infectious animal diseases and the detection of emerging and new diseases, whether transmitted between animals or between humans and animals. The research team would like to extend this monitoring to include social networks and, after Google, to be able to process data from Baidu, the Chinese search engine. It also hopes to share approaches so that the lessons learned for PADI-web can be applied in other fields, such as plant health.

“We can identify weak signals, now the challenge is to interpret them. International health surveillance could become very effective if PADI-web is combined with other existing tools. This is the goal of the H2020 MOOD project (MOnitoring Outbreak events for Disease surveillance in a data science context), which began in January, to improve epidemic intelligence tools and services”, says Mathieu Roche. Another challenge will be to identify weak signals from certain data sources and to then take the necessary measures, as some Asian countries are doing.

Learn more

The PADI-web pipeline

  1. Based on predefined queries, PADI-web collects articles daily from Google News RSS feeds configured to monitor diseases. A first type of RSS feed consists of an association of terms (disease names, clinical signs or hosts) extracted through an integrated approach combining text mining and validation by domain experts. A second type of feed is symptom-based: this combination of clinical signs and hosts (for example, abortions AND cows) does not include any disease term. This enables detection of diseases that are not explicitly monitored or are not confirmed at the time of article release, as well as unknown hazards.
  2. Translation: All news articles that are not in English are detected and translated into English, the pivot language for PADI-web, using dedicated Python libraries and Microsoft Azure. The languages covered include English, French, Spanish, Arabic and Chinese. Since February 2019, multilingual processing has increased the number of relevant news articles by 131 % for African swine fever (207 articles in English and 272 translated articles). The increase observed is 67 % for foot-and-mouth disease (104 in English, 174 translated) and 47 % for avian influenza (212 in English, 99 translated).
  3. The data classification step consists in selecting relevant news articles before extracting epidemiological information from them. Relevant articles are, for example, those reporting a current epidemic, its prevention and control measures, or its socioeconomic impacts. The classification module is based on supervised learning from labelled examples. A corpus of 600 articles, comprising 200 relevant articles and 400 irrelevant articles randomly retrieved from the PADI-web database, was manually labelled by an epidemiology expert. The accuracy score for the PADI-web classifier is 92 %. By reducing the number of irrelevant news articles retrieved, it thus saves considerable manual filtering time.
    Labelling: users can manually label the relevance of news articles. If different from the classifier label, the user label prevails. These contributions are added to the initial learning dataset, which thus increases rapidly and adapts. Users can detect significant classification biases or errors: this is essential with textual data that are prone to rapid change, as with the onset of a new disease, for example.
  4. The extraction of epidemiological data from news texts: the information extraction process relies on a combined method founded on rule-based systems and data. To extract names of diseases, hosts and symptoms, it relies on a vocabulary created using text mining methods and validated by domain experts. Locations are identified by matching the texts with location names from a gazetteer, and dates with the rule-based HeidelTime system. The number of cases is extracted using a list of regular expressions that match numerical or textual forms.
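
To make step 4 more concrete, here is a minimal Python sketch of vocabulary matching and pattern-based extraction of case numbers. The vocabularies, the pattern and the function names are all hypothetical, and it is not PADI-web's actual implementation. Location matching against a gazetteer and date detection with HeidelTime are left out.

```python
import re

# Hypothetical mini-vocabularies. According to the article, the real ones are
# built with text mining methods and validated by domain experts.
DISEASES = {"african swine fever", "avian influenza", "foot-and-mouth disease"}
HOSTS = {"pigs", "cows", "poultry", "wild boar"}

# Small illustrative mapping for case numbers written out in words.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Assumed pattern: a number (digits or a word), optional qualifiers,
# then "case(s)" or "outbreak(s)".
CASE_PATTERN = re.compile(
    r"\b(\d[\d,]*|" + "|".join(NUMBER_WORDS) + r")\s+"
    r"(?:new\s+|confirmed\s+|suspected\s+)*(?:cases?|outbreaks?)\b",
    re.IGNORECASE,
)


def find_terms(text, vocabulary):
    """Return the vocabulary terms that occur as whole words in the text."""
    lowered = text.lower()
    return sorted(
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def find_case_counts(text):
    """Return every case number matched by CASE_PATTERN, as integers."""
    counts = []
    for match in CASE_PATTERN.finditer(text):
        token = match.group(1).lower()
        if token in NUMBER_WORDS:
            counts.append(NUMBER_WORDS[token])
        else:
            counts.append(int(token.replace(",", "")))
    return counts


sample = (
    "Officials reported 1,200 confirmed cases of African swine fever "
    "in pigs, and three new outbreaks."
)
print(find_terms(sample, DISEASES), find_terms(sample, HOSTS))
print(find_case_counts(sample))
```

On the sample sentence, the sketch finds one disease term, one host term and two case numbers, one written in digits and one in words. A production system would need far larger vocabularies and pattern lists than this.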

To facilitate the comparison of official and unofficial data, the EpidNews visualisation tool was created by LIRMM in collaboration with the TETIS and ASTRE research units. It will shortly be fully integrated with the interface.