Guidelines for Outputs

Dagstuhl Interoperability Workshop: What Do the “Guidelines” Look Like?


I. Overview

This document attempts to explore what the products of the 2019 Dagstuhl Interoperability Workshop might be, within the context that Achim has proposed: detailed guidance on the implementation of the FAIR principles for cross-domain data as enabled by the current crop of metadata specifications and standards and related technology approaches.

While it is not the case that we can dictate exactly what the workshop outputs will be, and what they must contain, it is incumbent on us to frame up the work so that the week’s outputs are both relevant and substantial. The idea that these be relatively detailed/technical in nature has been embraced, but the shape of such detail remains insufficiently defined. This document looks at one obvious example – the DCAT-DDI profile work – to help us extrapolate what other outputs might be possible. If we can offer examples and “templates” of the sort of outputs we feel would be desirable, that should help participants understand how best to engage with the work.

Another aspect of the problem is that – as we learned at last year’s workshop – the issues surrounding data use across domain boundaries are plentiful and extremely broad. Different use cases encounter many of the same basic issues. FAIR helps us to narrow our focus on specific aspects of the problem space, and the use cases give us examples to work through. If we do not have the optimal depth of participant knowledge in terms of the use cases, then we may find that our work takes a more general shape. This dynamic will feed into our “Plan B,” and may encourage us to take a slightly more generic but more technical direction in terms of the workshop outputs.

II. The FAIR Principles

Any guidelines we produce should be immediately understandable, at a general level, to a high-level audience, and FAIR offers us a good vehicle for this. Any of the guidelines could be explained by saying something like “it is a concrete guide for domains implementing the FAIR principles around Finding data, for users both inside and external to their domain.”

For some of the FAIR principles, such as those concerned with data discovery, such guidelines may take the form of “how to” guides for using existing metadata standards. The DCAT-DDI profile example is one of these. The FAIR principles which are concerned with interoperability and reuse may take a more sophisticated form, insofar as data harmonization and reuse can be extremely complex topics. These guidelines may involve proposing activities which involve further development of standards and technologies. (The work around PLINTH and Sendai is an example of this.)

In many cases, there is a clear distinction between the immediate and technically feasible challenges addressed by FAIR (such as publishing data as CSV instead of as PDF, or using persistent, unique identifiers) and those barriers which are methodological in nature (such as issues around the periodicity of observations, as encountered by the Resilient Cities group in 2018). We need to characterize the challenges facing those who want to implement the FAIR principles as technical, methodological, or a combination of both.

FAIR does not make this distinction and can only go so far in helping us provide guidance to those looking at cross-domain data challenges. It gives us a widely accepted framework for identifying the challenges we are addressing. However, this is not a workshop on implementing FAIR, and at a certain point we will need to become more nuanced in how we describe the value of what we are doing. Ultimately, FAIR provides an entry point for understanding our work, but cannot guide us all the way to our end goals.

III. Scope of the Guidelines

One idea is for the Guidelines to address the following topics, in whatever combination seems relevant:

  1. The Problem Space – A general statement positioning the challenge in terms of FAIR and any other frameworks which would be approachable from a cross-domain perspective. A discussion of the issues identified to which the guidelines offer solutions, structured along the typology suggested above: technical, methodological, other…
  2. Domain Relevance – What domains are covered by the specific guideline?
  3. Stakeholders – A discussion of the intended audience(s) for the guidelines. Researchers? Data managers? Funders/strategists? Systems implementers?
  4. Specifications/Standards/Technologies – A description and explanation for the selection of the relevant resources applied to the problem. Domain standards? Generic technologies?
  5. Methodological Considerations – A discussion of the methodological implications of the guideline. Are there best practices in a business sense which would change to help provide a solution?
  6. Proposal – A detailed description of the overall guideline being recommended, and its business justification.
  7. Elaboration of the Use Case – A description of the concrete case(s) analyzed in the formulation of the guidelines/solutions.
  8. Exemplary Data and Metadata – Concrete examples of the kind of data and metadata being discussed.
  9. Application of Standards/Specifications/Technologies – “Code” examples of how the approach being advocated can be realized, in each of the identified standards/technologies and data/metadata examples.

A checklist approach to producing these guidelines is clearly not sufficient – it is provided here as a way of helping us to think about what we produce in more concrete terms.

IV. An Example: the DCAT-DDI Profile

This example is the most obvious one, coming from the work at last year’s workshop. I will try to address the topics given above in light of this example, to give a better sense of what they might be in concrete terms.

Problem Space: Discoverability – the “F” in FAIR. If you can’t find it, it doesn’t exist. Domain knowledge is often a requisite tool for finding out about the existence of data, but such domain knowledge is only possessed by insiders. Explicit metadata for supporting discovery needs to be available across domains in a form which is known and comprehensible to outsiders and insiders both.

Domain Relevance: Archives and catalogs exist for archiving and distributing data within the social, behavioral, and economic sciences, and to a large (and overlapping) degree for data of interest in the health sciences (notably public health and epidemiology) and official statistical domains. These domains tend to use similar mechanisms for collecting data (questionnaires, registers) and a common set of tools for analyzing, processing, and publishing data. There is a high degree of collaboration across the internal subdivisions within this picture, even though they do not all belong to the same domain. Some data of importance, however, comes from outside the domain, such as geographical data, sensor data from clinical sources, and genomic data.

Stakeholders: This guideline addresses the work of data managers, archivists, data producers, and the technical implementers of systems which support them. The needs of users at all levels for finding data are also addressed.

Specifications/Standards/Technologies: The DDI standard, first devised by archivists in this domain, has been more broadly adopted by other players, and has been shown to be a very effective tool for enabling data discovery between the different types of players described above. It is domain-specific, in that it addresses the types of data and processing which are commonly used. It is not well-supported or recognized in other domains.

DCAT is a generic standard which is widely known and used, but which requires configuration to be meaningful for the data coming from a particular domain source. Here, it is applied to DDI in line with its intended use as defined by its developers. Since the requirements of social science data are fairly extreme, potential issues with DCAT and its implementation have been highlighted, and some ideas for resolving such issues outlined.
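To make the idea of such a profile more concrete, the sketch below maps a handful of DDI study-level elements onto DCAT/Dublin Core properties and renders a minimal Turtle description. The DDI element paths, the example study record, and the URI are all invented for illustration; the actual mapping would be the substance of the profile the workshop defines.

```python
# Illustrative DDI-to-DCAT mapping sketch. The element paths and the
# example record are hypothetical; only the dcat:/dct: property names
# come from the published vocabularies.
DDI_TO_DCAT = {
    # DDI Codebook-style path          -> DCAT/DCTERMS property
    "stdyDscr/citation/titlStmt/titl": "dct:title",
    "stdyDscr/stdyInfo/abstract": "dct:description",
    "stdyDscr/citation/distStmt/distrbtr": "dct:publisher",
    "stdyDscr/stdyInfo/subject/keyword": "dcat:keyword",
}

def ddi_study_to_dcat_turtle(study: dict) -> str:
    """Render a DDI-style study record as a DCAT Dataset in Turtle."""
    lines = [
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dct:  <http://purl.org/dc/terms/> .",
        "",
        f"<{study['uri']}> a dcat:Dataset ;",
    ]
    props = []
    for ddi_path, dcat_prop in DDI_TO_DCAT.items():
        value = study.get(ddi_path)
        if value is not None:  # omit properties the study does not supply
            props.append(f'    {dcat_prop} "{value}"')
    lines.append(" ;\n".join(props) + " .")
    return "\n".join(lines)

# Invented example record (not real data):
study = {
    "uri": "https://example.org/study/1234",
    "stdyDscr/citation/titlStmt/titl": "Household Survey 2018",
    "stdyDscr/stdyInfo/abstract": "A cross-sectional household survey.",
}
print(ddi_study_to_dcat_turtle(study))
```

A real implementation would of course work from parsed DDI XML rather than a flat dictionary, but the shape of the profile – a declared correspondence between DDI elements and DCAT properties – would be the same.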

PROV-O is another standard which is important in understanding the fitness-for-purpose of data. Although it serves a different purpose than DCAT, it may also need to be addressed as something to be used in combination with DCAT in advertising the existence of social science data. We may wish to provide a PROV-O profile as well as a DCAT one!
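Pairing the two vocabularies could look something like the sketch below, where the same resource is typed as both a dcat:Dataset and a prov:Entity, with its derivation recorded alongside the discovery metadata. The URIs and the activity label are invented examples, not part of any agreed profile.

```python
# Sketch of combining DCAT discovery metadata with PROV-O provenance
# statements about the same dataset. URIs and labels are hypothetical.
def provenance_turtle(dataset_uri: str, source_uri: str, activity: str) -> str:
    """Describe a dataset as both dcat:Dataset and prov:Entity in Turtle."""
    return "\n".join([
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        f"<{dataset_uri}> a dcat:Dataset, prov:Entity ;",
        f"    prov:wasDerivedFrom <{source_uri}> ;",
        f'    prov:wasGeneratedBy [ a prov:Activity ; rdfs:label "{activity}" ] .',
    ])

# Invented example: a study derived from an administrative register.
print(provenance_turtle(
    "https://example.org/study/1234",
    "https://example.org/register/population",
    "Register linkage and anonymization",
))
```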

It is assumed that modern Internet-based technologies will be the primary means of data discovery, including existing search engines such as Google, Semantic Web tools, and more traditional approaches such as XML-driven catalogue technologies.

Methodological Considerations: Data discovery across domains requires that many of the implicit assumptions made about what data is and how it is collected and used be explicitly described. In the social sciences, data tend to be used in the form of static data sets, analysis of which produces tabulations, aggregate data, etc. for the purposes of supporting research findings and policy decisions. The models used in this analysis, while important, are not seen as reusable in the same sense as the data are. This implicit approach needs to be explained so that those outside the domain can understand the parameters in which data are made available: as static “slices” of data grouped into data sets, versioned and maintained throughout the process of production and analysis, and tied to specific research or dissemination outputs. Metadata is attached to each of the various versions, structured according to standard models and accessible with a small range of known protocols. Collection, processing, analysis, and dissemination of data can be understood in reference to generic domain models (GSIM, GSBPM, GLBPM, etc.).

The objects of study are often persons or groups of persons with a high degree of legal protection in terms of their personal data. Data confidentiality informs every stage of the data lifecycle, impacting data collection, processing, dissemination, etc. It also has impacts on the forms in which data are made available for reuse (public use files vs. scientific use files; synthetic data sets for determining assessability, etc.). These considerations will impact the discoverability of data, and even the best approach to searching for it.

Proposal: This would be a proposed use of the DCAT standard, in reference to DDI and related domain-specific models (for Provenance, etc.). Specific fields would be identified, and guidance in their use provided. (This would be documented at both general and more detailed levels). Efforts to promote changes in practice – in effect, promoting the use of such a profile – would be elaborated here.
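One concrete form the guidance on specific fields could take is a conformance check: a declared list of required and recommended DCAT properties that a catalogue record must carry to conform to the profile. The property lists below are assumptions for the sake of illustration; the workshop would decide the actual profile contents.

```python
# Hypothetical sketch of checking a catalogue record against a minimal
# DCAT profile. The required/recommended lists are illustrative
# assumptions, not the profile the workshop would actually define.
REQUIRED = ["dct:title", "dct:description", "dct:publisher", "dcat:keyword"]
RECOMMENDED = ["dct:license", "dcat:distribution", "prov:wasDerivedFrom"]

def check_profile(record: dict) -> dict:
    """Report which required and recommended properties a record lacks."""
    return {
        "missing_required": [p for p in REQUIRED if p not in record],
        "missing_recommended": [p for p in RECOMMENDED if p not in record],
    }

# Invented example record, deliberately incomplete:
record = {
    "dct:title": "Household Survey 2018",
    "dct:description": "A cross-sectional household survey.",
}
report = check_profile(record)
print(report)
```

Such a checklist could back both the general-level documentation (what a conforming record looks like) and the detailed guidance (per-field usage notes), and could be automated in catalogue tooling.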

Elaboration of the Use Case: The specific use cases examined would be described (searching for mortality data used in HIV research and combining these with National Accounts data regarding economic productivity to support an assessment of the impact of health policies in urban settings in East Africa, etc.).

Exemplary Data and Metadata: Examples in human-readable form of the data and metadata described in the preceding Use Case.

Application of Standards/Specifications/Technologies: Code examples and a guide to implementing the recommended standards, using the data and metadata presented in the preceding section. This could be quite technical, and should be packaged so that it could be taken as a stand-alone product by implementers.