Monday notes
Contents
Introduction
Initial remarks, timings (Catharina)
DDI series as context for this workshop (Achim)
- workshop supported by DDI-Alliance, CODATA, Schloss Dagstuhl
- timings and organisation
- Agenda fixed for initial 1.5 days
CODATA, ISC and "Data for our planet" (SimonH)
- Presentation: http://bit.ly/ISC-CODATA-Programme-PPTX
- CODATA activities
- Data Policies
- Data Science - incl. International Data Week
- Data Skills - training, FAIR
- Data Best Practices
- ISC - Global Grand Challenges -
- Future Earth, Integrated Research on Disaster Risk (IRDR), Urban Health and Wellbeing
- CODATA & ISC Data Integration Pilot
- Infectious diseases
- Disaster Risk Reduction
- Resilient Cities
- Project 2.1 of ISC Science Action Plan - Tackling Complexity: Data Driven Interdisciplinarity in http://bit.ly/ISC-Science-Action-Plan
- Decadal initiative to advance science + practice of data documentation & semantics
- interdisciplinary case studies
- projects with unions and partners
- Partnerships, fundraising
- Role of this workshop?
- Technical recommendations
- Case studies
- Community of interest on metadata & semantics
- Advice + partners for ISC program
Ice-Breaker
Catharina Wasner - GO FAIR
Joachim 'Achim' Wackerow - DDI Alliance and GESIS
Arofan Gregory - Consultant, DDI,
Simon Hodson - CODATA
Simon Cox - CSIRO
Steve McEachern - ADA, DDI Alliance
Maria-Cristina Marinescu - Computer science, was very technical, now understanding the importance of human factors, how to influence/work with city halls, programming-language design!
Jim Todd - LSHTM - decolonise data?
Andrew Simmons - Resilience Brokers, UK based NGO, resilience.io integrated systems modelling platform; social sciences and urban design; urban development practitioner's lens - issues of urban governance; hopes to identify new pilots and collaborations for the initiative, community of practice to support the pilot case studies.
Niklas Kolbe - Researcher, Ph.D. student University of Luxembourg, vocabulary search (how to find the most relevant vocabulary for a given task); application of semantic web technologies for internet of things project, vocabularies for smart cities projects, ecosystem perspective for data providers and data consumers. Wants to take away an understanding of real-world issues.
Barbara Magagna - Landscape ecologist, GIS expert. INSPIRE air-quality specifications. Data management. Reference Model for research data infrastructure.
Pier Luigi Buttigieg - AWI, genomics background, got drawn into semantics. Ontologies for SDGs. Co-chairs ESIP committee and working on OBO Foundry (open biomedical ontologies)
Doug Fils - Geosciences, ocean drilling. Semantics of exposing data across the ocean drilling community. Recently has been involved in NSF EarthCube group, ESIP Semantics Committee, structured data on the web. What can we leave: i.e. what is being done well elsewhere that we can learn from?
Steve Richard - GIS, Geology, Earth-cube - interested in advancing the "research-data-profile" for DCAT/schema.org metadata
Erol Orel - University of Geneva. HIV prediction using machine learning on available data; wants to reduce time spent cleaning and (low-level) processing data, more time using it.
Dan Brickley - Digital libraries, RDF originator, SKOS, FOAF, RDFS, SPARQL; schema.org - all intended to be x-domain - how do simple structures from RDF & schema.org interface with complex structures in science/research; Google dataset search; Data Commons = observations+schema.org
Massimo Migliorini - improve security of sensitive assets; STAG UN DRR (scientific and technology advisory group, disaster risk reduction) implementation of Sendai principle with governments. Data for resilience, challenges and barriers for data interoperability for DRR. Increase knowledge of current data interoperability issues and solutions. Greater awareness of challenges affecting other sectors than DRR.
Luis Gerardo Gonzalez Morales - UN Stats. Work on interoperability for SDGs. Working with global partnership for sustainable development data. Collaborative of sustainable development data. Basic guidelines on data interoperability to implement better practices in organisations that may not have a lot of resources or infrastructure. Statistical commission, inter-governmental body that sets standards for statistical production, SDMX, capacity building for countries to be able to implement. Heading the web development section - mostly a user of infrastructure. Job description is to bring data to users. Use of semantic web. Practical ideas on how we use metadata to increase value of data for SDGs.
Ernie Boyko - Stats Canada, since then a number of projects including WB, OECD. Currently working on capacity building and training issues in Canada. Stronger arguments and direction for the ISC CODATA programme. Interested in cross-domain research areas.
Dan Gilman - US Bureau of Labor Statistics. US has 13 offices to produce national statistics. DDI Moving Forward project, model based evolution of DDI. UN Economic Commission for Europe cooperative projects. Better understanding of how to integrate data products/projects(?) from multiple sources.
Jay Greenfield - consultant. Recent work on the new DDI standard, which is a model for data integration. Also interested in the decolonisation of data and metadata. Experience in developing metadata standards in a range of domains. Data discovery in structured and unstructured data collected by US security agencies. Keen to explore new ways for how metadata standards can play together.
Larry Hoyle - Senior Scientist at the institute of policy and social research. DDI. Producer and consumer of data.
Alejandra Gonzalez-Beltran - Computer scientist, team lead at Science and Technology Facilities Council (STFC), Harwell, UK (previous background in health/clinical metadata profile of DCAT)
Summary of 2018 Workshop
Three pilot projects.
Sendai Disaster Risk Reduction; IDDO Infectious Disease Outbreaks; Resilient Cities
Not a lot of written outputs. Sendai paper still needs work. Overview paper is here http://bit.ly/Dagstuhl-2018-Outcomes
Pilots:
Sendai:
Demand for data exceeds the available data.
Multiple data sources: supra-national, national, regional, local data.
Focus on data discovery. Traditional portal approach could be enhanced: standards, broader metadata, etc.
Infectious Diseases:
Oxford based initiative building on approach and tools produced by WWARN.
Data is not well documented, poor metadata, not easy to find, kept hidden by researchers. Not easy to access.
Many issues are cultural issues, around making data accessible and creating metadata.
Resilient cities:
Need a methodology for overcoming the methodological disconnect.
Need cross disciplinary ways for referring to the shared challenges, issues.
Lessons Learnt
Many cross-domain problems reflect issues also found within domains.
Focussing this year on solutions.
Last year was a validating exercise. We have an opportunity to meet the existing need.
Goals and Deliverables for 2019 workshop
Focussing on solutions. Guidelines. Model of AHRC of guidelines for different levels: high level interoperability guidelines; practitioner guidelines; expert guidelines.
Awareness level, working level, expert level. We will mostly be working at the working and expert level.
Aim to take the FAIR principles from high level guidelines to something more detailed that can be implemented. Working level - how can I play with this in practice. More detailed outline for implementation.
Introduction to Conceptual Framework
Overview of the simplified data lifecycle, mapped against FAIR principles and stakeholder roles: https://docs.google.com/spreadsheets/d/1fowYA4hWKMyrpQiv2uc7l6cFSp2HnhTKC4KRoZbWyEc/edit#gid=1890236617
Can we identify a cell of the matrix which identifies a moment in data processing, relates to a given FAIR principle, and indicates the stakeholder group to which recommendations might be targeted? The idea is to use the matrix to help identify and structure pain points that will form the focus of given recommendations.
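A minimal sketch of how pain points could be filed against cells of such a matrix. The FAIR principle names are standard; the lifecycle stage names, stakeholder role names and example notes are hypothetical placeholders, since the actual values live in the workshop spreadsheet.

```python
from collections import defaultdict
from typing import NamedTuple

FAIR = ("Findable", "Accessible", "Interoperable", "Reusable")
# Hypothetical stage and role names; the workshop spreadsheet defines the real ones.
STAGES = ("collect", "process", "publish", "reuse")
ROLES = ("data producer", "data manager", "data user")


class Cell(NamedTuple):
    """One cell of the matrix: lifecycle stage x FAIR principle x stakeholder role."""
    stage: str
    principle: str
    role: str


def make_cell(stage, principle, role):
    """Build a cell, rejecting values that are not on one of the three axes."""
    if stage not in STAGES:
        raise ValueError(f"unknown lifecycle stage: {stage}")
    if principle not in FAIR:
        raise ValueError(f"unknown FAIR principle: {principle}")
    if role not in ROLES:
        raise ValueError(f"unknown stakeholder role: {role}")
    return Cell(stage, principle, role)


# Maps each cell to the list of pain points filed against it.
pain_points = defaultdict(list)


def record(stage, principle, role, note):
    """File a pain point note against one cell of the matrix."""
    pain_points[make_cell(stage, principle, role)].append(note)


def busiest_cells(n=1):
    """Return the n cells with the most pain points, as (cell, notes) pairs."""
    ranked = sorted(pain_points.items(), key=lambda kv: len(kv[1]), reverse=True)
    return ranked[:n]


# Illustrative notes only.
record("process", "Interoperable", "data user", "survey questions differ between rounds")
record("process", "Interoperable", "data user", "records carry no timestamps")
record("publish", "Findable", "data producer", "dataset not listed in any catalogue")
```

Cells that collect the most notes would be natural candidates for the two or three issues each group is asked to pick.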
Deliverables package
- A set of guidelines
- A conceptual framework for the problem space. Could be the basis for organising future work.
The guidelines fit into the conceptual framework: use the terminology so they are consistent and reusable;
Workshop participants to determine what is valuable in terms of guidelines.
Guidelines could include different components: standards mappings, technology approaches, examples, profiles of standards.
Some solutions will be presented, but they should not limit the solutions and recommendations presented.
Need to work out what we can do to work together with these potential solutions across domain boundaries - big tent effort.
Proposed framework is a work in progress. It will evolve throughout the week.
Groups should meet to identify two or three issues of interest. Which boxes might the issues belong in? Group will discuss commonalities and select topics for work during the workshop.
Case Studies
Sustainable Development Goals (SDGs)
Link to slides (to be added)
Luis González
Improve interoperability because there are many unrealised opportunities to extract information from data that already exists. Vision: integrating and joining up data from multiple sources.
Pathways to interoperability: governance > data structures with users in mind > standardising data content > providing standard interfaces > disseminating LOD for knowledge creation (see http://www.data4sdgs.org/resources/interoperability-practitioners-guide-joining-data-development-sector)
Bringing the SDGs to the semantic web.
SDG ontology to link data with the web of knowledge that relates
Facilitate the discovery of SDG related data and information by humans and machines. Ensure documentation and traceability. Enable interoperability. Facilitate easier publication. Four objectives worth bearing in mind.
Working definition of metadata: data about the nature and characteristics of digital SDG information resources published on the web, structured according to standardised schemas and encoded following commonly agreed rules and vocabularies that enhance their findability, searchability etc.
Deliverables list: express and utilise metadata about existing SDG datasets following linked data standards (Schema.org and DCAT); enrich metadata with references to external KOS (e.g. skos:exactMatch); map traditional SDG datasets to their representation using the data cube vocabulary.
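As a rough illustration of the first two deliverables, the sketch below builds a JSON-LD record for one dataset using DCAT and Dublin Core terms, and links its theme concept to an external KOS with skos:exactMatch. The function name and all example.org URIs are hypothetical; only the vocabulary namespaces and property names are taken from the published DCAT, DCMI Terms and SKOS specifications.

```python
import json

# Namespace prefixes for the vocabularies used in the record.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def dataset_record(dataset_uri, title, theme_uri, external_match_uri):
    """Describe a dataset and map its theme concept to an external concept."""
    return {
        "@context": CONTEXT,
        "@graph": [
            {
                "@id": dataset_uri,
                "@type": "dcat:Dataset",
                "dct:title": title,
                # dcat:theme points at a concept in a knowledge organisation system.
                "dcat:theme": {"@id": theme_uri},
            },
            {
                "@id": theme_uri,
                "@type": "skos:Concept",
                # Enrichment step: declare the concept equivalent to an external one.
                "skos:exactMatch": {"@id": external_match_uri},
            },
        ],
    }


# All URIs below are placeholders, not real identifiers.
rec = dataset_record(
    "https://example.org/dataset/1",
    "Example SDG dataset",
    "https://example.org/concept/indicator-x",
    "https://example.org/other-kos/indicator-x",
)
serialised = json.dumps(rec, indent=2)
```

The serialised string can be embedded in a web page or loaded by any JSON-LD aware tool; a Schema.org rendering would follow the same shape with a different context.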
Challenges: cost of implementing a standard.
'The value of a specific metadata collection is increased when users are able to establish and exploit links with other metadata collections.'
SDMX information model is compatible with data cube model, with DCAT etc.
Current status of the SDG linked-data project. Basic schema of the SDG goal-target-indicator-series ontology.
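The goal-target-indicator-series structure can be pictured as a simple containment hierarchy. The sketch below is an assumption about its shape, not the project's actual ontology: class names, field names and the series code are hypothetical, while the codes 3, 3.3 and 3.3.1 follow the official SDG numbering pattern.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Series:
    code: str


@dataclass
class Indicator:
    code: str
    series: List[Series] = field(default_factory=list)


@dataclass
class Target:
    code: str
    indicators: List[Indicator] = field(default_factory=list)


@dataclass
class Goal:
    code: str
    targets: List[Target] = field(default_factory=list)


def find_series_path(goals: List[Goal], series_code: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (goal, target, indicator, series) codes for a series, or None."""
    for goal in goals:
        for target in goal.targets:
            for indicator in target.indicators:
                for series in indicator.series:
                    if series.code == series_code:
                        return (goal.code, target.code, indicator.code, series.code)
    return None


# EXAMPLE_HIV_SERIES is a made-up series code used only for illustration.
goals = [
    Goal("3", [Target("3.3", [Indicator("3.3.1", [Series("EXAMPLE_HIV_SERIES")])])])
]
path = find_series_path(goals, "EXAMPLE_HIV_SERIES")
```

Walking from a series back up to its goal is the kind of traceability the linked-data version is meant to give by following links rather than nested loops.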
Major task: develop capacity to publish SDG data as linked data; develop capacity of users to develop new SDG data products and services leveraging semantic web technologies
Discussion
XKOS - extensions to describe statistical data.
Challenge of standardising at the data level as opposed to the metadata level.
Global indicator data set, maintained by UN Stats. Relationship between indicators and the 'series' data which feeds into this indicator.
Relationship of Sendai framework to SDG Framework. Crosses between Sendai ontologies and SDG ontologies?
Problem of an excessive focus on the SDG indicators.
SDGIO interface ontology created with OBO technology.
Capture the linkages between SDG, SDGIO, ENVO and ESIP ontologies which can assist the SDG data interoperability and expression as LOD.
Disaster Risk Reduction (Massimo)
Project goals: promote the creation of standards, standardised methods and technologies for collecting disaster related data and metadata. Promote creation of national disaster loss databases.
DRR Principles and Semantics - links from presentation.
Includes links to Technical Guidance, Data Challenges, UNDRR Terminology, INSPIRE Data Models.
List of data bases in presentation also. Includes numerous European systems and observatories for particular hazards.
General process to achieve data interoperability:
Action Definition; data selection; data mapping; data assessing; ensuring data usability; data application; verifying data effectiveness. The key step of interest for this workshop is data application.
Challenges classified into technical / scientific, social, political, economic.
Good practice in particular databases. National disaster platform that is coherent with the Sendai process: Montenegro, Norway, France, Slovenia. And at European level.
Resilient Cities - Andrew Simmons and Maria-Cristina Marinescu
Resilience Brokers
Definition of resilience. Capacity of cities to function so that the inhabitants continue to thrive as they encounter stresses and shocks. Bridges DRR, climate change adaptation, development, urban planning and design.
Urban resilience framework from 100 Resilient cities.
Medellin example - open data portal developed through a number of civil society agencies. Interrelatedness of questions.
Barcelona Projects
Need for cross domain collaboration
Interoperability challenges. Steps along the process.
Two use cases:
Social Exclusion - predictive tool to assist city hall in engaging with citizens that need help. Including hidden exclusion and hidden poverty. Need data from many different sources. Some data is missing. Some is statistical and high level, not fine grained or individual enough to allow prediction. A lot of data is theoretically available, but has not been provided. Incompleteness is not necessarily a problem. Inconsistencies more so. Self sufficiency matrix. Criteria for establishing whether individuals may be at risk or not. What precisely are these criteria? Can these form the basis for determining what data is needed and therefore what the challenges of data collection and integration are?
Urban Resilience modelling. Data is relatively straightforward. Urban planning model from IBM and Barcelona Urban Ecology agency. How does this relate to what is used in other circumstances, the lancet interconnectedness, the resilience model, resilience.io
Very little of the data was time stamped. Would seem to make it unusable?
High level objectives: data cleaning, approximate queries, use of machine learning (probabilistic models)
Urban Resilience / UN Habitat: wanted to compute indicators based on survey data rather than open data; simulate cascaded effects. Issue of needing disaggregated data. What is the mechanism that allows you to get to disaggregated data which might have access controls.
Interested in understanding how the data can be accessed...
Infectious Diseases
Jim Todd: Alpha network.
Independent institutions collecting various data, including HIV data. Agreement to extract data for pooled analysis.
Linked HIV data to clinical record data.
Population cohort data.
Alpha hub: harmonised and documented data. Aim to combine HIV data and other longitudinal data.
ALPHA harmonised data specs for particular issues. Interesting to see those specs.
How do HDSS adopt DDI standard for primary data documentation.
What does a standard data model look like for this?
Data provenance features - what features are required?
How can data harmonisation be politically and economically sustainable in LMIC settings?
Erol Orel:
Predicting HIV incidence. UNAIDS targets of 90-90-90.
Country level HIV incidence, with similar social and behavioural characteristics.
Machine learning to predict incidence, based on a number of input criteria. Data from DHS programme. Requires a lot of cleaning. Removing a lot of factors to enter the data into models. Import into python. Python libraries for converting from statistical formats into python.
DHS programme, PHIA surveys, population based surveys. But variance in questions.
60% of time spent on data cleaning. ALPHA network estimates 80-95%.
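Much of the cleaning effort described above comes from surveys asking the same thing under different variable names. A minimal sketch of a crosswalk that renames source variables to harmonised ones is given below; the survey labels, variable names and record values are all hypothetical and do not reflect actual DHS or PHIA codebooks.

```python
# Hypothetical crosswalk: per survey, source variable name -> harmonised name.
CROSSWALK = {
    "survey_a": {"q1_age": "age_years", "q7_result": "hiv_status"},
    "survey_b": {"resp_age": "age_years", "final_status": "hiv_status"},
}


def harmonise(survey, record):
    """Rename a record's variables to the harmonised names.

    Returns the harmonised record plus a sorted list of source variables
    that have no mapping, so gaps are reported rather than silently dropped.
    """
    mapping = CROSSWALK[survey]
    harmonised = {}
    unmapped = []
    for name, value in record.items():
        if name in mapping:
            harmonised[mapping[name]] = value
        else:
            unmapped.append(name)
    return harmonised, sorted(unmapped)


# Made-up records from two surveys describing the same concepts.
rec_a, missing_a = harmonise("survey_a", {"q1_age": 31, "q7_result": "negative", "q9_x": 1})
rec_b, missing_b = harmonise("survey_b", {"resp_age": 24, "final_status": "positive"})
```

Keeping the crosswalk as data rather than code means it can be reviewed, documented and shared as metadata alongside the surveys.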
Discussion of HIV examples
Need to get usable results from the surveys. Relationship between the researcher, data scientists and the survey designers.
Mortality is one of the outcomes. Linkage between the use cases.
Discussion of commonalities within case studies
Partitioning of effort and localisation of effort around communities. Gather communities' needs and relay those needs more generally to assist interoperability.
Call - we need a data specification to do... Evaluation of the resources that are returned is often not sufficiently technical. Need expert level review of things that are tabled as supposed solutions and resources.
Useful to try and focus on some specific data sets. Understand what is actually involved in detail. Need to identify some specific instances, with data that can be focussed on.
DRR: Focus on one scenario, on indicators for a given Sendai reporting example.
Luis: working with different countries to put in place the SDG indicator. Countries often say they don't have the data. They have the publications of DHS for example, but are not aware of it and aren't aware how to use it. Valuable for many countries to take the 20 indicators that come out of DHS and model in data cube. Could think of analogous solutions for other fields. Putting the data sources into a given model that suits the field.
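To make the data cube suggestion concrete, the sketch below turns one row of a published indicator table (one area, values by year) into observation records keyed by terms from the RDF Data Cube and SDMX-RDF vocabularies. The function name, the dataset identifier, the area code and the numbers are invented for illustration; only the qb: and sdmx- property names come from the published vocabularies.

```python
def wide_row_to_observations(dataset_id, area, values_by_year):
    """Convert one wide table row into one observation per year.

    Each observation carries two dimensions (area, period) and one measure,
    which is the basic shape of a qb:Observation.
    """
    return [
        {
            "@type": "qb:Observation",
            "qb:dataSet": dataset_id,
            "sdmx-dimension:refArea": area,
            "sdmx-dimension:refPeriod": str(year),
            "sdmx-measure:obsValue": value,
        }
        # Sort by year so the output order does not depend on input order.
        for year, value in sorted(values_by_year.items())
    ]


# Invented values for a placeholder area "XX".
observations = wide_row_to_observations(
    "ex:indicator-1", "XX", {2015: 1.5, 2010: 2.0}
)
```

Once every indicator is expressed this way, joining data from different sources reduces to matching on shared dimension values.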
Jay: a lot of the projects have an explicit or implicit business process model. Not just data processing, resource and capability management. Alpha network student is representing the business processes using structured metadata. To allow modelling of the processes, making recommendations and writing manuals. Modelling processes, lifecycle. BSPN is used by ...
We could take DHS data.
Dataset and series of challenges to different communities.
Urban health: exposure to heat waves and urban heat island effects.
Ontology as a mechanism for data integration.
Data quality and provenance challenges.
Ontology bridging. A number of ontologies in OBO that can be bridged.
Two issues: 1) bridging between existing ontologies; 2) exploring the challenges encountered in a given project which has created a model for the data
Improve the availability of data. Flag particular issues of non-availability.
Provenance is a crucial issue. Provenance process template as an output of the workshop?
Evolution of DDI: models as a way of expressing (still need to deal with classifications, definitions).
Importance of definitions (e.g. cow). Platforms
Dataset metadata & DCAT-2 (Alejandra)
Journey through bespoke metadata standards to profiles of general metadata models
New features in DCAT-2 - DataService
DDI-4 (Arofan & Jay)
Data atoms & Data molecules - can be recombined in different arrangements
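One way to picture the atom and molecule idea, offered as a loose analogy and not as the DDI-4 model itself: treat each datum as a self-describing (unit, variable, value) atom, then assemble the same atoms into different arrangements. All names and values below are hypothetical.

```python
from typing import NamedTuple


class Datum(NamedTuple):
    """A single data atom: one value of one variable for one unit."""
    unit: str
    variable: str
    value: object


def as_wide(data):
    """Arrange atoms as one record per unit (a conventional wide table)."""
    rows = {}
    for d in data:
        rows.setdefault(d.unit, {})[d.variable] = d.value
    return rows


def as_columns(data):
    """Arrange the same atoms as one column per variable, keyed by unit."""
    cols = {}
    for d in data:
        cols.setdefault(d.variable, {})[d.unit] = d.value
    return cols


# Made-up atoms for two units and two variables.
atoms = [
    Datum("p1", "age", 31),
    Datum("p1", "region", "north"),
    Datum("p2", "age", 24),
    Datum("p2", "region", "south"),
]
wide = as_wide(atoms)
columns = as_columns(atoms)
```

Both arrangements are built from the identical list of atoms, so no information is lost when moving between them.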
Variable Cascade (Larry)
Variable = header from table
Model has recently been extended to traverse the variable/datum boundary.
Beyond interoperation inception (Pier + Barbara)
Shifting gear between semantic gradient (expressivity levels)
Mappings → Hard Mappings → co-development
Development rate is incommensurate with (i) project life-cycle (ii) tooling
I-ADOPT Working Group - Interoperable Descriptions of Observable Property Terminology
Common framework for environmental observations → methodology that might be applied in other domains?
Kickoff - RDA Helsinki