Outline of Infectious Disease WG Paper

Objective

Recommendation to the WHO - the URIs for each of the controlled vocabulary terms at http://apps.who.int/gho/data/node.metadata should be generalised across the various codesets and data.  A canonical URI is needed for each item in all of these vocabularies (e.g. see the SNOMED URI policy and SNOMED controlled terms service, or the Library of Congress authorities and vocabularies service).
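
A minimal sketch of what "one canonical URI per term" could look like in practice. The base address and the path pattern are assumptions made for illustration; they are not an existing WHO scheme, and the vocabulary name and code in the example are only sample inputs.

```python
from urllib.parse import quote

# Hypothetical base address; not a URI pattern published by the WHO.
BASE = "https://example.org/who/vocab"

def canonical_uri(vocabulary: str, code: str, base: str = BASE) -> str:
    """Build one stable URI per term: <base>/<vocabulary>/<code>."""
    # Lower-case the vocabulary segment so that each term has exactly one URI,
    # and percent-encode both segments so that unusual codes stay valid.
    vocab_part = quote(vocabulary.strip().lower(), safe="")
    code_part = quote(code.strip(), safe="")
    return f"{base.rstrip('/')}/{vocab_part}/{code_part}"

print(canonical_uri("COUNTRY", "SLE"))
```

The point of the sketch is only that the URI is derived deterministically from the vocabulary and the code, so two parties describing the same term arrive at the same identifier.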

Task for a subgroup is to go through the column heads on Fernando's spreadsheet and to redescribe them in DCAT or where necessary another specification (first cut is done - see row 4 - a few gaps, and some over-generalization; testing required)

Possible project: describe the design principles for a platform that 1) accesses species/vector occurrence data, land-use data, and population density data for outbreak prediction; 2) allows the rapid ingest of occurrence data into a platform that already has population and transit data integrated such that outbreaks can be predicted.  Describe the standards and the design of a platform that can harmonise the data in real time.  Can we make recommendations for the data types, and the upstream practices, such that the data can be integrated more effectively and rapidly?

For this platform, we would want a set of canned queries - a 'convenience API'.
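
One possible shape for such a convenience API, as a sketch only: a small registry of named queries over occurrence records. The record keys follow Darwin Core term names (scientificName, countryCode, eventDate); the records themselves, the query names, and the helper functions are made up for illustration.

```python
from datetime import date

# Made-up occurrence records, keyed by Darwin Core term names.
OCCURRENCES = [
    {"scientificName": "Aedes aegypti", "countryCode": "SL", "eventDate": date(2015, 3, 2)},
    {"scientificName": "Aedes aegypti", "countryCode": "GN", "eventDate": date(2015, 6, 9)},
    {"scientificName": "Anopheles gambiae", "countryCode": "SL", "eventDate": date(2016, 1, 20)},
]

CANNED_QUERIES = {}

def canned(name):
    """Register a function as a named, pre-defined query."""
    def register(func):
        CANNED_QUERIES[name] = func
        return func
    return register

@canned("occurrences_by_country")
def occurrences_by_country(records, country_code):
    return [r for r in records if r["countryCode"] == country_code]

@canned("occurrences_since")
def occurrences_since(records, since):
    return [r for r in records if r["eventDate"] >= since]

def run_query(name, records, **params):
    """Look up a canned query by name and run it with the given parameters."""
    return CANNED_QUERIES[name](records, **params)

for row in run_query("occurrences_by_country", OCCURRENCES, country_code="SL"):
    print(row["scientificName"], row["eventDate"].isoformat())
```

Users call queries by name and supply only parameters, so the underlying storage and query language can change without affecting them.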

Need to identify some metadata and standards that are important; data formats.

Need to map out how these can be applied to the data formats.


Structure

Description of the current situation with IDDO and other attempts to collect infectious disease data.

Vision of the platform and the type of questions we would like to ask and answer.  Give the specific examples identified around Ebola. Give examples of similar platforms in other applications.

In the vision there are two staged requirements:

1) outbreak prediction: requires species/vector occurrence data, land-use data, and population density data;

2) tracking and prediction of outbreak development: allows the rapid ingest of occurrence data into a platform that already has population and transit data integrated such that the progress of an outbreak can be predicted.

Identifying the requirements for achieving the vision, i.e. availability of data sources, accessibility, quality of data, speed of request, transformation, and analysis chain. Resolution options for addressing these requirements.

Vision should address assistance for discovery in instances when a researcher knows the sort of data needed but not where it can be obtained.

Identification and description of specific datasets needed for these questions.  Summary of the current limitations of the data sets and standards in relation to interdisciplinary data sharing and integration.

Example key datasets or services:

Services and visualizations

  • ProMED - notifiable health incidents, global, from reports
  • HealthMap
  • MRIIDS - Mapping the Risk of International Infectious Disease Spread (appears to deal only with the 2014-2016 Ebola outbreak)

Discussion of the access limitations and how these can be mitigated.

Difficulty in assessing the suitability of a dataset, particularly if it is in an unfamiliar representation/format - make the case for dataset previews? or CSV extracts? ('Assess' is the missing A from FAIR?)
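
A minimal sketch of the dataset preview idea, assuming the dataset has already been extracted to CSV: return the header and the first few rows so that a researcher can judge suitability without loading the whole file. The function name and the sample values are illustrative only.

```python
import csv
import io
from itertools import islice

def preview_csv(stream, max_rows=5):
    """Return (header, first max_rows data rows) from an open CSV text stream."""
    reader = csv.reader(stream)
    header = next(reader, [])          # empty list if the file has no rows
    rows = list(islice(reader, max_rows))
    return header, rows

# Made-up extract, used only to show the call.
sample = io.StringIO("country,year,cases\nSL,2014,10\nGN,2014,20\nLR,2014,30\n")
header, rows = preview_csv(sample, max_rows=2)
print(header)
print(rows)
```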

Discussion of the governance and sustainability / business models for the data resources (who is looking after them and how, with what remit and what resource).

Description of the set of standards which would assist the integration of this data.  

  • DCAT - for dataset description for discovery and selection
  • DarwinCore - for species occurrence data
  • Geo-Location standards: ISO 19115, GeoJSON, GeoSPARQL ...
  • LOINC - codes for identifying laboratory tests and clinical observations (would be applied internally to a dataset)
  • QB/Data Cube, if transformation from GHO structure (aggregated data) is required
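
As an illustration of the first item, a dataset description for discovery can be written as a small DCAT record in JSON-LD. This is a sketch under stated assumptions: the helper function, the title, the keywords, and the download address are invented, while the property names (dct:title, dcat:keyword, dcat:distribution, dcat:downloadURL, dcat:mediaType) come from the DCAT and Dublin Core Terms vocabularies.

```python
import json

def describe_dataset(title, description, keywords, download_url, media_type_uri):
    """Build a minimal DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": list(keywords),
        "dcat:distribution": [{
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": download_url},
            "dcat:mediaType": {"@id": media_type_uri},
        }],
    }

# Hypothetical dataset; the URL below is a placeholder.
record = describe_dataset(
    title="Vector occurrence records (example)",
    description="Illustrative species occurrence extract in Darwin Core columns.",
    keywords=["vector occurrence", "infectious disease"],
    download_url="https://example.org/data/occurrences.csv",
    media_type_uri="https://www.iana.org/assignments/media-types/text/csv",
)
print(json.dumps(record, indent=2))
```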

Strategies for how these standards can be applied to existing resources.  Strategies of how they can be applied to new datasets such that the data can be integrated in real time.  Explicitly includes business processes and upstream activity to improve the way datasets are annotated at collection.

  • What infrastructure might we put in place to make discovering, integrating, and visualizing data easier?
  • What pre-integrated datasets (e.g. databases) might be created/maintained that will help investigating future questions?
  • What recommendations (or tools) might we encourage data providers/collectors to use when creating new datasets?


If we appraise a dataset for its interoperability and potential for integration, what are the criteria against which we score it?  Need to distinguish between quality issues and issues that pertain to interoperability.
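
A sketch of how such an appraisal might keep the two kinds of issue apart: each dataset is checked against two separate checklists and receives two separate scores. The criteria named here are placeholders chosen for illustration, not an agreed list.

```python
# Placeholder criteria; the actual lists would need to be agreed by the group.
INTEROPERABILITY_CRITERIA = (
    "open_format",
    "standard_vocabulary",
    "persistent_identifiers",
    "machine_readable_metadata",
)
QUALITY_CRITERIA = (
    "complete_records",
    "documented_provenance",
)

def score(assessment, criteria):
    """Fraction of the given criteria that the assessment marks as met."""
    met = sum(1 for c in criteria if assessment.get(c, False))
    return met / len(criteria)

def appraise(assessment):
    """Report interoperability and quality as two separate scores."""
    return {
        "interoperability": score(assessment, INTEROPERABILITY_CRITERIA),
        "quality": score(assessment, QUALITY_CRITERIA),
    }

print(appraise({"open_format": True, "standard_vocabulary": True, "complete_records": True}))
```

Keeping the scores separate means a high-quality dataset in a closed format and a well-described but incomplete dataset are not given the same single number.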

Coverage

There is a rich battery of existing vocabularies maintained by WHO etc.  


Coffee and Cake Discussion

What is the story that we are trying to tell?  There are a couple of data integration stories: one is about assembling a more extensive and thorough coverage for a particular theme (quality of medicines).

Importance of this would not just be for infectious disease. Also want to be useful for other related research topics: e.g. quality of medicines. 

  • Case 1 - data on the same theme available from more than one source, none complete. Different organisations are doing similar work.  Their reports are useful but they are distributed across different data providers, formats, and data dictionaries.  Creating standards in this area would help to integrate data on this topic and so assemble a more complete dataset on a particular theme.
  • Case 2 - Multiple themes required to address a problem. For a particular scenario, there is a science hypothesis about what the vectors and exacerbating factors are.  On the basis of this model, we need a specific selection of datasets to test or make predictions.  For a different model, a different set would be required.  There may be a need for a pre-rolled set of required datasets to respond to each scenario.  

Do we have the classification system that will allow us to find the set of data sets needed for each scenario?
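
To make the question concrete, a sketch of scenario-driven selection: each dataset is tagged with themes, each scenario lists the themes it requires, and a lookup returns the matching datasets plus any themes left uncovered. The dataset names are invented; the themes are taken from the two staged requirements in the vision.

```python
# Invented catalogue: dataset name -> set of theme tags.
CATALOGUE = {
    "vector-occurrence-2015": {"vector occurrence"},
    "land-cover-grid": {"land use"},
    "population-grid": {"population density"},
    "transit-links": {"transit"},
}

# Themes required by each scenario.
SCENARIOS = {
    "outbreak prediction": {"vector occurrence", "land use", "population density"},
    "outbreak tracking": {"vector occurrence", "population density", "transit"},
}

def datasets_for(scenario):
    """Return (datasets matching the scenario, required themes with no dataset)."""
    required = SCENARIOS[scenario]
    selected = {name for name, themes in CATALOGUE.items() if themes & required}
    covered = set().union(*(CATALOGUE[name] for name in selected))
    return sorted(selected), sorted(required - covered)

print(datasets_for("outbreak prediction"))
```

The second return value is the useful diagnostic: it lists the themes for which the classification system finds no dataset at all.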


Identify the vocabularies that allow a dataset to be tagged with a theme, and that allow measurements within a dataset to be tagged with standard terms, so that the dataset can be integrated.


Provider-centric data description?  Comes from this experiment, mission etc.  Less attention is paid to the property or quality being observed.  


IDDO adapted/adapting CDISC for hospital data.  Stretched the standard because it did not cover some key information relating to hospital occurrences: 'IDDO is standardising clinical, laboratory and epidemiological data with CDISC. We have had challenges with CDISC in that it did not initially cover the pure contact tracing data (who did you interact with? what was the nature of that interaction? for how long?... who did they interact with? etc) but CDISC have been helpful to work on it with us and to "stretch" the standard to now cover this.'

CDISC is very much designed to represent the relevant data points of the trial.  

Points to the need for crosswalks between CDISC and FHIR.  What would the recommendation be for the data platform: would need to ingest FHIR data from clinical treatment and CDISC data from clinical trials.
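
A very small sketch of what one such crosswalk could look like, mapping a CDISC SDTM demographics (DM) row to a FHIR Patient resource. The field correspondences shown (USUBJID to identifier, SEX to gender, BRTHDTC to birthDate) are an illustrative assumption, not an endorsed CDISC or HL7 mapping, and the subject identifier is made up.

```python
# Assumed value mapping from SDTM SEX codes to FHIR administrative gender codes.
SEX_TO_GENDER = {"M": "male", "F": "female", "U": "unknown"}

def dm_to_patient(dm_row):
    """Convert one SDTM DM row (a dict) into a minimal FHIR Patient dict."""
    patient = {
        "resourceType": "Patient",
        "identifier": [{"value": dm_row["USUBJID"]}],
    }
    sex = dm_row.get("SEX")
    if sex:
        # Codes outside the assumed mapping fall back to "unknown".
        patient["gender"] = SEX_TO_GENDER.get(sex, "unknown")
    birth = dm_row.get("BRTHDTC")
    if birth:
        # BRTHDTC may carry a time part; FHIR birthDate holds a date only.
        patient["birthDate"] = birth[:10]
    return patient

print(dm_to_patient({"USUBJID": "STUDY1-001", "SEX": "F", "BRTHDTC": "1980-05-17T00:00"}))
```

Even this toy case shows where a crosswalk loses or invents information (code lists of different size, date precision), which is the kind of detail a recommendation would need to address.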

Gates / Wellcome interest in constructing a data platform to this 

Issue of having to stretch CDISC to accommodate the data.  This may be happening because an Ebola outbreak has its own particularities.  There is no controlled environment.  The outbreak is the only occasion on which one can test drugs.  This requires adapted protocols.  To what extent does this affect what is recorded?

Suitability of CDISC as primary data dictionary is problematic. It is designed for clinical/drug trials, not for reporting medical/health incidents. It partially works for an outbreak iff there is an accompanying drug trial.