...
Important to get to the questions that the researcher will be asking.
Challenges faced: accessibility - access to data and to the appropriate tools to open different types of files. How do we match these specific data with the standards or specifications that already exist?
...
Complex in terms of time: deals with event data, spell data.
Link to the Wikipedia page.
Jay will explore this and identify how CDISC has been stretched.
...
GHO Metadata link: http://apps.who.int/gho/data/node.metadata, list of metadata items as XML: http://apps.who.int/gho/athena/data/
GHO Codelist example of item "Action area": http://apps.who.int/gho/data/node.metadata.ALCACTIONAREA?lang=en (also available as XML and JSON, http://apps.who.int/gho/athena/data/GBDCHILDCAUSES.xml, http://apps.who.int/gho/athena/data/GBDCHILDCAUSES.json)
We can use the existing WHO queries as an approximation. The queries can be customised and made through the API.
Related XML Schema available at: http://converters.eionet.europa.eu/schemas/620 (broken!)
Diagram of reengineered GHO XML Schema (on the basis of the XML downloads).
Default is XML. Likely to be the most reliable of the formats.
...
http://apps.who.int/gho/athena/xmart/DATAPACKAGEID/2016-05-11?format=xml&profile=text&filter=COUNTRY:SLE (results in csv, NOT in XML!)
API: http://apps.who.int/gho/data/node.resources.api
WHO has a database which allows access for particular queries, either via the API or ready made.
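A minimal sketch of how a customised query URL could be assembled, following the shape of the example URL noted above. The helper name `build_gho_query` is hypothetical. The `format`, `profile` and `filter` parameters are copied from that URL. Joining several filters with `;` is an assumption to check against the API page.

```python
from urllib.parse import urlencode

# Base path copied from the example URL in these notes.
ATHENA_BASE = "http://apps.who.int/gho/athena/xmart"


def build_gho_query(target, response_format="xml", profile=None, filters=None):
    """Assemble a GHO Athena-style query URL (hypothetical helper)."""
    params = {"format": response_format}
    if profile:
        params["profile"] = profile
    if filters:
        # Assumption: several DIMENSION:CODE pairs are joined with ';'.
        params["filter"] = ";".join(f"{dim}:{code}" for dim, code in filters.items())
    # Keep ':' and ';' readable instead of percent-encoding them.
    return f"{ATHENA_BASE}/{target}?{urlencode(params, safe=':;')}"


url = build_gho_query(
    "DATAPACKAGEID/2016-05-11", profile="text", filters={"COUNTRY": "SLE"}
)
print(url)
```

The sketch only builds the URL and does not fetch anything. As noted above, the format actually returned may differ from the one requested, so a client should check the response before parsing it.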
Recommendation to the WHO - URIs for each of the controlled vocabulary terms at http://apps.who.int/gho/data/node.metadata
The ultimate goal for IDDO is being able to map a new outbreak.
Possible project: describe a couple of standards and the design of the platform that can harmonise the data in real time. Can we make recommendations for the data types such that they can be more effectively and rapidly integrated?
Imagining a future outbreak: is it feasible to imagine applying standards by the groups closely monitoring
...
Data Access at IDDO. Patient-level information is harder to obtain. If I want anonymised data, but with age and sex, how long would that take? It takes up to 3 months to get access. When the patient data is integrated, they hope to shorten that access time.
SCox Point: the inventory and questions designed by the project team have a lot of similarities to DCAT. We can in fact integrate this methodology with DCAT. In future iterations, we should use the inventory stage to populate a DCAT entry on the data set.
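A rough sketch of what populating a DCAT entry from an inventory record might look like, written as JSON-LD. The inventory field names (`name`, `summary`, `keywords`, `downloads`) and the title and summary text are hypothetical placeholders, not the project's actual questions. The `dcat:` and `dct:` terms are standard DCAT and Dublin Core properties. The two download URLs are the codelist files listed earlier in these notes.

```python
import json

# Namespace URIs for DCAT and Dublin Core Terms.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def inventory_to_dcat(answers):
    """Turn one inventory record (hypothetical field names) into a DCAT dataset entry."""
    return {
        "@context": CONTEXT,
        "@type": "dcat:Dataset",
        "dct:title": answers["name"],
        "dct:description": answers["summary"],
        "dcat:keyword": answers.get("keywords", []),
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": url},
                "dct:format": fmt,
            }
            for url, fmt in answers.get("downloads", [])
        ],
    }


entry = inventory_to_dcat(
    {
        "name": "Example inventory record",  # placeholder title
        "summary": "Placeholder description taken from the inventory answers.",
        "keywords": ["codelist"],
        "downloads": [
            ("http://apps.who.int/gho/athena/data/GBDCHILDCAUSES.xml", "XML"),
            ("http://apps.who.int/gho/athena/data/GBDCHILDCAUSES.json", "JSON"),
        ],
    }
)
print(json.dumps(entry, indent=2))
```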
...
Task for a subgroup is to go through the column heads on Fernando's spreadsheet and to redescribe them in DCAT or where necessary another specification.
Infectious disease outbreak: population density, environmental factors, vectors.
Desideratum for a platform that allows this. Characterise the datasets that could be relevant to allow this.
Predicting an Ebola outbreak would require land-use data and bat population data; predicting the spread of an outbreak would require data relating to human-to-human interaction, movement etc. Population movement, transport, social practice.
Issues about locating datasets of relevance. Can identify a superset of potentially relevant data sets. For the computer to zoom in, we need more domain-specific semantics.
Project to support this kind of science: a discovery platform that goes beyond dataset discovery and can identify variable-level issues relevant to particular domains.
CDISC - there is CDISC-related tooling which would need to be looked into. Need to understand the meaning of the columns in the CDISC model and their relationship to other standards. Would want to map the symptoms column to an appropriate ontology (HPO, SNOMED etc.).
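A small sketch of the symptoms-to-ontology mapping idea, under stated assumptions: the symptoms cell is taken to be a comma-separated free-text string (the real CDISC column layout has not been checked), and the lookup table is a hand-made illustration whose HPO identifiers should be verified against a current HPO release before use.

```python
# Illustrative lookup only; identifiers should be checked against a current HPO release.
SYMPTOM_TO_HPO = {
    "fever": "HP:0001945",
    "headache": "HP:0002315",
    "vomiting": "HP:0002013",
}


def map_symptoms(cell):
    """Split a free-text symptoms cell and return (term, HPO id or None) pairs."""
    pairs = []
    for raw in cell.split(","):
        term = raw.strip().lower()
        if term:
            pairs.append((term, SYMPTOM_TO_HPO.get(term)))
    return pairs


mapped = map_symptoms("Fever, headache , joint pain")
# Terms with no ontology match are kept so they can be reviewed by hand.
unmapped = [term for term, code in mapped if code is None]
print(mapped)
print(unmapped)
```

Keeping the unmapped terms visible, rather than dropping them, gives a worklist for whoever curates the mapping.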
Awareness of data sources. The Monarch Disease ontology - established network of mappings across spaces in the medical domain.
Keep Wikidata on the radar. Focusing on the health domain. Consuming and aggregating public health data sources. Computable data.
Landuse:
WG Tasks
Recommendation to the WHO - URIs for each of the controlled vocabulary terms at http://apps.who.int/gho/data/node.metadata - should be generalised for the various codesets and data. Need a canonical URI for each item in all of these vocabularies. (e.g. see SNOMED URI policy and SNOMED controlled terms service, or Library of Congress authorities and vocabularies service)
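To make the recommendation concrete, here is a sketch of one possible URI pattern: one URI per codelist and one per code within it. The base `https://example.org/gho/vocab` is purely hypothetical, since the real namespace would be chosen and published by the WHO. The codelist and code names are ones that appear elsewhere in these notes.

```python
from urllib.parse import quote

# Hypothetical base; the real namespace would be chosen and published by the WHO.
VOCAB_BASE = "https://example.org/gho/vocab"


def term_uri(codelist, code=None):
    """Return a canonical URI for a codelist, or for one code within it."""
    parts = [VOCAB_BASE, quote(codelist, safe="")]
    if code is not None:
        parts.append(quote(code, safe=""))
    return "/".join(parts)


print(term_uri("ALCACTIONAREA"))
print(term_uri("COUNTRY", "SLE"))
```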
Task for a subgroup is to go through the column heads on Fernando's spreadsheet and to redescribe them in DCAT or where necessary another specification.
Possible project: describe the design principles for a platform that 1) accesses species/vector occurrence data, land-use data, and population density data for outbreak prediction; 2) allows the rapid ingest of occurrence data into a platform that already has population and transit data integrated, such that outbreaks can be predicted. Describe the standards and the design of a platform that can harmonise the data in real time. Can we make recommendations for the data types, and the upstream practices, such that they can be more effectively and rapidly integrated?
For this platform, we would want a set of canned queries - a convenience API.
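One way to picture a set of canned queries is a registry of named parameter templates that the platform expands on request. Everything in this sketch is hypothetical (query names, dataset labels, parameter keys). Only the `COUNTRY:SLE` filter style is borrowed from the WHO URL earlier in these notes.

```python
# Hypothetical registry of canned queries: name -> (description, parameter template).
CANNED_QUERIES = {
    "occurrences_by_country": (
        "Occurrence records for one country",
        {"dataset": "occurrence", "filter": "COUNTRY:{country}"},
    ),
    "population_density_by_country": (
        "Population density for one country",
        {"dataset": "population_density", "filter": "COUNTRY:{country}"},
    ),
}


def run_canned(name, **values):
    """Expand a canned query into concrete parameters; raises KeyError for unknown names."""
    _description, template = CANNED_QUERIES[name]
    return {key: value.format(**values) for key, value in template.items()}


params = run_canned("occurrences_by_country", country="SLE")
print(params)
```

The point of the design is that users pick a named query and supply a few values, while the query syntax itself stays in one place on the platform side.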
Need to identify some metadata and standards that are important; data formats.
Need to map out how these can be applied to the data formats.