Pilot Break Out Notes
What are we doing?
Data types or data sets can form the columns and we can complete information against the dataset.
Important to get to the questions that the researcher will be asking.
Challenges faced: accessibility - access to data and the appropriate tools to open different types of files. How do we match these specific data with the standards or specifications that already exist?
Fernando's spreadsheet.
MO. In some instances the data is downloaded and used. In others the data is dynamic and so refer to the website.
Units of measure.
Decision to go with the Ebola Case Study. What is the research question?
The IDDO CDISC Ebola data was created from data collected in clinical trials conducted during the Ebola outbreak.
IDDO is currently creating a database from public health institutions (hospitals, clinics, labs) that is based not on clinical trials but on patient events.
Name (Working term) | Area of coverage | Demographic Data (Gridded Population of the World (GPW), v4) | IDDO CDISC Ebola Data | WHO Ebola Mortality Data Sets (Basic Download; XML) |
---|---|---|---|---|
Spatial and geographic | Information required for describing geographic information and services (e.g. ISO 19115) | Available in an xml format (FGDC) that provides bounding boxes. Granularity is an issue. | Clinical data collected from different organisations; across a number of hospitals. This is an assumption. CDISC data dictionary seems to only deal with country. Location of the clinic is important information, which is also potentially disclosive. Probably is in the data set but not clear where. | The csv and json downloads have location information; when hacking the query to create xml the locations were not included. |
Temporal | Information required for describing time-based characteristics of the data (e.g. the date of publication, a time stamp) | Changes over time, but slowly. Periodic updates. Good enough for basic model. | Dates of a number of variables, dates at which particular observations happened. Temporal information is recorded in the data dictionary and pertains to a number of the events. Deals with observation time and event time. Has spell information. | |
Contributors | People, organisations, agents, ... | Data aggregator is known. Have detailed, manual description of data sources. | Organisational identifiers? Hospitals, clinics, temporary treatment units? To what extent was this collected and is it contained in the data model. There is a field for evaluator, which gives some indication. Identification risks, but may not have been collected. | |
Process | Description of processes, workflows, transformations, ... | Importance of this depends on the research question. This dataset has detailed human readable account of the process and provenance of the data product. | Document that documents the process that went into compiling the data. Internal documentation. | |
Provenance | (Related to process) Descriptions of the process used to create/produce/transform/publish data | Covered in cell above. | ||
Vocabularies / lists / classifications | Enumerated lists of terms that may be applied to content being described. | No obvious incompatibilities; can be worked around because of data dictionary. | Uses standard CDISC domains. Customisations can be rolled back into the CDISC ontology. CDISC share references a number of vocabularies. | |
Resources | Objects being described or referenced. May include datasets, but also publications, software, code, other metadata, ... | Not applicable for use case. | Varies from organisational source. Often a relatively raw data dump. Compiled from pdf forms - these can be referenced. | |
Datasets | Specific descriptions of datasets as primary objects | Yes. | Jay will explore this. | |
Observation / Capture | Classes/objects that describe the processes by which data is created, generated, captured, transformed. (QUESTION: Is this the same as Provenance/Process?) | See above, process description. Dictionary has information about the devices used to make measurements. | ||
Data | The logical structure of the data being described - variables, units of measurement, concepts, sample units and populations, records, datum(s), cells. (May or may not be a subset of datasets) | Aggregate data set. Estimate of gridded population against time. Dimensionality to be identified. | Detailed data dictionary. There is a separate standard related to SDTM, called ADaM. Is IDDO using this? | |
Storage | The physical representation of the data (files, formats, locations, ...) | NetCDF, GeoTIFF, ASCII | Part of the CDISC package. Typically xml. CDISC standards have tools. Integrated with clinical systems. Generated directly by clinical systems. | |
Access | Who can access the data and how | "You are required to login to download data or maps. Click "LOGIN" to proceed to log in or to register. If you click "CANCEL", you may browse the page but you will still be required to login to download data or maps." Is there an API to access these data? Or is it just by selection and download? | Data access committee. External community can apply for access. Data providers can determine to what extent they wish data to be made available. What criteria are used? How does the data gatekeeper manage this? What restrictions are imposed? Is access information documented? | |
Administrative / core / ... | Foundational classes for use in building the specification - e.g. identification, versioning, primitives | CDISC standard. | ||
Web compatibility? | Is it easy to use in a web environment? URLs, etc | Requires download. | Possibly in so far as CDISC is, but IDDO restricts access. | |
Is the resource maintained and supported? | | Yes. | Curated and maintained by IDDO. | |
Updating? | Dynamic or batch? | Periodic, every 3-4 years. | Periodic according to the extent that different organisations provide the data. | |
Capacity for extensions? | | | Over the past few years CDISC has developed an ontology (CDISC Share) to connect data in different CDISC domains. There is a standard resource to help. | |
Discussion of Disaster Risk Reduction
Good practice guide? Maturity model? Guidance on aggregation of data for reporting.
Possible recommendation of training materials such as those that Ernie Boyko prepared for national statistics offices, international household survey?
Value of the standards is increased by tools. Importance of crosswalks between tools and standards.
Scoping exercise for Fernando and Virginia: how is the data best described?
Columns in the table point out what is needed.
What is missing for the Ebola model?
Overall outputs of the workshop: a workflow that assists Sendai reporting on
WHO guidelines - deadline 21-23 November.
What can Sendai learn from the SDMX experience and process?
Solution for Virginia: AM session to prepare the outline of a 'paper' on this.
- Recommendations around process.
- Outline of a 'system' that assists reporting.
- Discussion of definitional and data issues.
Lessons from IHSN. Data gathering. SDMX, statistical data. MDG indicators were described in SDMX.
CDISC
Complex in terms of time: deals with event data, spell data.
Link to the Wikipedia page.
Jay will explore this and identify how CDISC has been stretched.
To add variables there are solutions through the ontology.
Continuation of Pilot Notes Day Two
Infectious Disease Data Spreadsheet: https://docs.google.com/spreadsheets/d/1twjmiu0_3bk_zgwJQI8y4Jk1601UjF3IhpqpA4dsb_Q/edit#gid=1931579469
Access issues. IDDO is not the data owner.
WHO data set. Ebola data and statistics: http://apps.who.int/gho/data/node.ebola-sitrep.quick-downloads?lang=en
JSON file has an implicit schema.
There does not appear to be a well-defined point of contact / author to find out about the structure and metadata.
The WHO data download is querying a database with a number of filters.
Global Health Observatory resources Data query API http://apps.who.int/gho/data/node.resources.api
GHO Metadata link: http://apps.who.int/gho/data/node.metadata, list of metadata items as XML: http://apps.who.int/gho/athena/data/
GHO Codelist example of item "Action area": http://apps.who.int/gho/data/node.metadata.ALCACTIONAREA?lang=en (also available as XML and JSON, http://apps.who.int/gho/athena/data/GBDCHILDCAUSES.xml, http://apps.who.int/gho/athena/data/GBDCHILDCAUSES.json)
We can use the existing WHO queries as an approximation. The queries can be customised and made through the API.
Related XML Schema available at: http://converters.eionet.europa.eu/schemas/620 (broken!)
Diagram of reengineered GHO XML Schema (on the basis of the XML downloads).
Default is XML. Likely to be the most reliable of the formats.
Query being used by the website: http://apps.who.int/gho/athena/xmart/EBOLA_MEASURE/CASES,DEATHS.xml?filter=COUNTRY:*;LOCATION:-;DATAPACKAGEID:2016-05-11;INDICATOR_TYPE:SITREP_CUMULATIVE;INDICATOR_TYPE:SITREP_CUMULATIVE_21_DAYS
http://apps.who.int/gho/athena/xmart/DATAPACKAGEID/2016-05-11?format=json&filter=COUNTRY:SLE
http://apps.who.int/gho/athena/xmart/DATAPACKAGEID/2016-05-11?format=xml&profile=text&filter=COUNTRY:SLE (results in csv, NOT in XML!)
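The filter syntax in the queries above appears to be a semicolon-separated list of DIMENSION:value pairs. A minimal sketch of assembling such a URL, assuming that pattern holds; the helper names are hypothetical and only the URL shape is taken from the query the website uses. Nothing is fetched here, and other extensions or filter values are untested assumptions.

```python
# Base path copied from the website query noted above.
BASE = "http://apps.who.int/gho/athena/xmart"


def build_filter(pairs):
    """Join (dimension, value) pairs into 'DIM:value;DIM:value'.

    Pairs are kept as a list, not a dict, because a dimension
    (e.g. INDICATOR_TYPE) can be repeated in one query.
    """
    return ";".join(f"{dim}:{value}" for dim, value in pairs)


def ebola_measure_url(measures, pairs, ext="xml"):
    """Assemble an EBOLA_MEASURE query URL (hypothetical helper)."""
    return f"{BASE}/EBOLA_MEASURE/{','.join(measures)}.{ext}?filter={build_filter(pairs)}"


url = ebola_measure_url(
    ["CASES", "DEATHS"],
    [
        ("COUNTRY", "*"),
        ("LOCATION", "-"),
        ("DATAPACKAGEID", "2016-05-11"),
        ("INDICATOR_TYPE", "SITREP_CUMULATIVE"),
        ("INDICATOR_TYPE", "SITREP_CUMULATIVE_21_DAYS"),
    ],
)
print(url)
```

The values are deliberately not percent-encoded, so the result matches the query as it appears on the website.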
API: http://apps.who.int/gho/data/node.resources.api
WHO has a database which allows access for particular queries, either via the API or ready made.
Recommendation to the WHO - URIs for each of the controlled vocabulary terms at http://apps.who.int/gho/data/node.metadata
The ultimate goal for IDDO is being able to map a new outbreak.
Possible project: describe a couple of standards and the design of the platform that can harmonise the data in real time. Can we make recommendations for the data types such that they can be more effectively and rapidly integrated?
Imagining a future outbreak: is it feasible to imagine applying standards by the groups closely monitoring
Get the data that we have now, do a data quality analysis. Survey the people responsible for generating the data and get some input into how the data was gathered and entered. That would allow improvements to the collection of the data in the spreadsheet.
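A first pass at the data quality analysis could be as simple as counting missing values per column. A minimal sketch, assuming the data can be exported as CSV; the column names, the sample rows and the list of strings treated as "missing" are all hypothetical.

```python
import csv
import io


def missing_rates(rows, missing=("", "NA", "N/A")):
    """Return the fraction of missing values per column.

    `rows` is an iterable of dicts (e.g. from csv.DictReader).
    """
    rows = list(rows)
    if not rows:
        return {}
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col in counts:
            # Short rows give None for absent columns; treat that as missing too.
            if (row.get(col) or "").strip() in missing:
                counts[col] += 1
    return {col: n / len(rows) for col, n in counts.items()}


# Hypothetical two-row sample standing in for an exported spreadsheet.
sample = io.StringIO("country,onset_date,age\nSLE,2014-08-01,34\nSLE,,NA\n")
rates = missing_rates(csv.DictReader(sample))
print(rates)
```

Columns with high missing rates would be natural candidates for the survey questions to the people who gathered and entered the data.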
Link to the IDDO CDISC Data Dictionary: https://docs.google.com/spreadsheets/d/1gvJ1pPcDaqdeetRjtVT8RZ7ZINxhS4bn7v9B7YOeVrE/edit#gid=610474778
http://converters.eionet.europa.eu/schemas --> collection of schemas, although, curiously, the ghodata 'schema' only has an 'xsl' link.
Data Access at IDDO. Patient-level information is harder to obtain. If I want anonymised data, but with age and sex, how long would that take? It takes up to 3 months to get access. When the patient data is integrated, they hope to shorten that access time.
SCox Point: the inventory and questions designed by the project team have a lot of similarities to DCAT. We can in fact integrate this methodology with DCAT. In future iterations, we should use the inventory stage to populate a DCAT entry on the data set.
There might be information available about the datasets which is not describable in DCAT, so what are we doing about this?
Many of the columns in the spreadsheet should be in a controlled vocabulary.
Task for a subgroup is to go through the column heads on Fernando's spreadsheet and to redescribe them in DCAT or where necessary another specification.
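A rough sketch of what redescribing the column heads in DCAT could look like. The column-to-property mapping is a hypothetical starting point for the subgroup, not an agreed crosswalk, and cell text is stored as plain strings although DCAT expects structured values for several of these properties. Columns with no mapping are returned separately so they can be assigned to another specification.

```python
import json

# Hypothetical crosswalk from inventory column heads to DCAT / Dublin Core terms.
COLUMN_TO_DCAT = {
    "Name": "dct:title",
    "Spatial and geographic": "dct:spatial",
    "Temporal": "dct:temporal",
    "Contributors": "dct:contributor",
    "Provenance": "dct:provenance",
    "Access": "dct:accessRights",
    "Updating?": "dct:accrualPeriodicity",
}


def inventory_row_to_dcat(row):
    """Turn one inventory row into a JSON-LD style dcat:Dataset dict.

    Returns (entry, unmapped): unmapped holds columns with no DCAT
    property in the crosswalk above.
    """
    entry = {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
    }
    unmapped = {}
    for column, value in row.items():
        prop = COLUMN_TO_DCAT.get(column)
        if prop:
            entry[prop] = value
        else:
            unmapped[column] = value
    return entry, unmapped


# Cell text abbreviated from the inventory table.
row = {
    "Name": "IDDO CDISC Ebola Data",
    "Access": "Data access committee. External community can apply for access.",
    "Storage": "Part of the CDISC package. Typically xml.",
}
entry, unmapped = inventory_row_to_dcat(row)
print(json.dumps(entry, indent=2))
print("Not covered by the crosswalk:", list(unmapped))
```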
Infectious disease outbreak: population density, environmental factors, vectors.
Desideratum for a platform that allows this. Characterise the datasets that could be relevant to allow this.
Predicting an Ebola outbreak would require land-use data, bat population data; predicting the spread of an outbreak would require data relating to human-to-human interaction, movement etc. Population movement, transport, social practice.
Issues about locating datasets of relevance. Can identify a superset of potentially relevant data sets. For the computer to zoom in, we need more domain-specific semantics.
Project to support this kind of science: a discovery platform that goes beyond dataset discovery and can identify variable-level issues relevant to particular domains.
CDISC - there is CDISC related tooling which would need to be looked into. Need to understand the meaning of the columns in the CDISC model and their relationship to other standards. Would want to map the symptoms column to an appropriate ontology (HPO, SNOMED etc).
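Mapping the symptoms column to an ontology could start from a simple lookup that also reports what it could not map. A minimal sketch only: the lookup table is illustrative, the HPO identifiers should be verified against the current HPO release before use, and the free-text values are hypothetical rather than taken from the IDDO data.

```python
# Illustrative lookup only; verify every identifier against the current HPO release.
SYMPTOM_TO_HPO = {
    "fever": "HP:0001945",
    "headache": "HP:0002315",
    "vomiting": "HP:0002013",
}


def map_symptoms(values):
    """Map free-text symptom values to HPO identifiers.

    Returns (mapped, unmapped): mapped is {raw value: HPO id},
    unmapped lists raw values with no entry in the lookup.
    """
    mapped, unmapped = {}, []
    for raw in values:
        key = raw.strip().lower()
        if key in SYMPTOM_TO_HPO:
            mapped[raw] = SYMPTOM_TO_HPO[key]
        else:
            unmapped.append(raw)
    return mapped, unmapped


mapped, unmapped = map_symptoms(["Fever ", "headache", "hiccups"])
print(mapped)
print(unmapped)
```

The unmapped list is the useful output for curation: it shows which terms need a manual decision or a different ontology (e.g. SNOMED).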
Awareness of data sources. The Monarch Disease ontology - established network of mappings across spaces in the medical domain.
Keep Wikidata on the radar. Focusing on the health domain. Consuming and aggregating public health data sources. Computable data.
Landuse:
WG Tasks
Recommendation to the WHO - URIs for each of the controlled vocabulary terms at http://apps.who.int/gho/data/node.metadata - should be generalised for the various codesets and data. Need a canonical URI for each item in all of these vocabularies. (e.g. see SNOMED URI policy and SNOMED controlled terms service, or Library of Congress authorities and vocabularies service)
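To illustrate what one canonical URI per vocabulary item could mean in practice, here is a minimal sketch that mints a URI from a codelist name and a code. The base URI and the path pattern are hypothetical placeholders, not an existing WHO scheme; COUNTRY and SLE are taken from the query filters noted earlier.

```python
from urllib.parse import quote

# Hypothetical base; WHO would choose and maintain the real namespace.
BASE_URI = "http://example.org/gho/codelist"


def term_uri(codelist, code):
    """Mint one stable URI per (codelist, code) pair.

    Both parts are percent-encoded so that codes containing '/' or
    spaces still yield a single, unambiguous path segment.
    """
    return f"{BASE_URI}/{quote(codelist, safe='')}/{quote(code, safe='')}"


print(term_uri("COUNTRY", "SLE"))
```

The point of the pattern is that the same term always yields the same URI, whichever download format or query it appears in.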
Task for a subgroup is to go through the column heads on Fernando's spreadsheet and to redescribe them in DCAT or where necessary another specification.
Possible project: describe the design principles for a platform that 1) accesses species/vector occurrence data, land-use data, and population density data for outbreak prediction; 2) allows the rapid ingest of occurrence data into a platform that already has population and transit data integrated such that outbreaks can be predicted. Describe the standards and the design of the platform that can harmonise the data in real time. Can we make recommendations for the data types, and the upstream practices, such that they can be more effectively and rapidly integrated?
For this platform, we would want a set of canned queries - a convenience API.
Need to identify some metadata and standards that are important; data formats.
Need to map out how these can be applied to the data formats.