Notes from Implementation RDF group

RDF for CDI/Lifecycle

 

CDI UML to RDF examined, also Beta RDF for Lifecycle. These looked good but serve different purposes. Should work together OK. Pierre-Antoine was able to provide some guidance on minor tweaks to the OWL description of the DDI Lifecycle RDF.

We may wish to produce a guide to how OWL is used in describing RDF in future, esp. once we have gone through the process of developing the Lifecycle syntax representation in RDF.

What about an RDF DDI Codebook? This would be useful from a CDI perspective, but there may be some issues. Will depend a bit on tolerance for change within the existing Codebook community. (A “structural shift”, per Wendy.)

Namespacing/URIs

The goal of this discussion was to agree how best to publish RDF and other artefacts with namespaces and URIs to support the use of RDF syntax representations and other formats (XSD, XMI, etc.).

The basic organization of the DDI product suite is around a discrete set of offerings, which have many different technical expressions, but form “conceptual” entities. Good Web publishing practice is to provide a single namespace for each conceptual product, producing a “200” response. Specific implemented versions can then be given their own files with direct URLs, or can be reached from the “conceptual” namespace using content negotiation based on redirects configured according to each requested MIME type. This requires the correct server configuration.

We should use a trailing slash rather than a hash: it is a bit harder, but more durable.
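One reason the slash form can be more durable is sketched below in Python: the fragment after a hash is stripped before the request is sent, so with a hash namespace the server only ever sees the namespace document, while with a slash namespace each term has its own URL that can be served or redirected individually. The term names used here are illustrative assumptions, not agreed URIs.

```python
from urllib.parse import urldefrag, urljoin

# Hash namespace (legacy XKOS style). The term name is illustrative.
hash_term = "https://rdf-vocabulary.ddialliance.org/XKOS#ClassificationLevel"

# Slash namespace (proposed style). Using "InstanceVariable" as a path
# segment is an assumption made for this example.
slash_ns = "https://ddialliance.org/Specification/DDI-CDI/1.0/"
slash_term = urljoin(slash_ns, "InstanceVariable")

# The fragment never reaches the server: these are the URLs actually requested.
requested_for_hash, fragment = urldefrag(hash_term)
requested_for_slash, _ = urldefrag(slash_term)

print(requested_for_hash)   # the whole vocabulary document
print(fragment)             # resolved by the client only
print(requested_for_slash)  # one URL per term
```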

List of “conceptual” products (including possible futures):

Legacy Namespaces:

XKOS – https://rdf-vocabulary.ddialliance.org/XKOS#

DISCO – https://rdf-vocabulary.ddialliance.org/discovery#

DDI-VOCABS – https://rdf-vocabulary.ddialliance.org/CV/[name of cv]/[version]/

New and Current Products:

DDI-CDI – https://ddialliance.org/Specification/DDI-CDI/1.0/

DDI-LIFECYCLE – https://ddialliance.org/Specification/DDI-L/3.3/

DDI-CODEBOOK – https://ddialliance.org/Specification/DDI-C/2.5/

UCMIS – https://ddialliance.org/Specification/UCMIS/1.0/

SDTL – https://ddialliance.org/Specification/SDTL/1.0/

Legacy redirects to support existing stuff EXCEPT the stand-alone RDF vocabs (including the CVs)

We are assuming that the sub-domains were included for consistency with current practice. We do not think it is worth it – let the old stuff be legacy and that is fine. This assumption was validated in discussion with Achim Wackerow.

For content negotiation, we should use MIME types – we can get new ones by asking IANA. Syntax implementations do not represent conceptual resources. If you want an implementation, ask for it directly. Interacting with IANA is time-consuming, so it is best to avoid using new MIME types if possible.

There is a vendor-specific MIME type for XMI: application/vnd.xmi+xml. All other needed MIME types exist and are common.

The default response would be a simple HTML page describing the available formats.
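The negotiation described above could look roughly like the following Python sketch. It is not a server configuration, only the decision logic: pick a redirect target from the Accept header, and fall back to the HTML page otherwise. The file paths and the choice of a 303 redirect are assumptions made for illustration; the notes only say that redirects are configured per MIME type.

```python
# Hypothetical mapping from MIME type to a file under the conceptual
# namespace. The file names are assumptions for illustration only.
NAMESPACE = "https://ddialliance.org/Specification/DDI-CDI/1.0/"
REPRESENTATIONS = {
    "text/turtle": "RDF/ddi-cdi.ttl",
    "application/ld+json": "RDF/ddi-cdi.jsonld",
    "application/xml": "XSD/ddi-cdi.xsd",
    "application/vnd.xmi+xml": "XMI/ddi-cdi.xmi",
}
DEFAULT = "index.html"  # simple HTML page listing the available formats


def parse_accept(header: str) -> list[str]:
    """Return the acceptable MIME types, highest q-value first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        mime = parts[0].lower()
        if not mime:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "not acceptable"
            ranked.append((-q, position, mime))
    return [mime for _, _, mime in sorted(ranked)]


def negotiate(accept_header: str) -> tuple[int, str]:
    """Pick a redirect target, falling back to the default HTML page."""
    for mime in parse_accept(accept_header):
        if mime in REPRESENTATIONS:
            return 303, NAMESPACE + REPRESENTATIONS[mime]
    return 200, NAMESPACE + DEFAULT


print(negotiate("text/html;q=0.9, application/vnd.xmi+xml"))
print(negotiate("text/html"))
```

Keeping the table this small is in the spirit of the "keep it simple" principle: one entry per existing MIME type, and everything else gets the HTML page.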

Best practice is now to use https:// rather than http://

This is predicated on the idea that making everybody angry when complicated configurations break down is bad. Keep it simple!

Should we reflect minor version changes in the URIs? They must be backward- and forward-compatible at the sub-sub level. Currently, backward-compatible changes will drive changes in the URI, so we will use a two-place version indicator in the namespace.
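As a small illustration of the two-place rule, the Python sketch below builds a namespace from a product name and a version string, dropping anything below the second place. The function name and the three-place version used in the example are hypothetical; the base URL follows the product list above.

```python
BASE = "https://ddialliance.org/Specification/"


def namespace_for(product: str, version: str) -> str:
    """Build a product namespace from the first two places of a version."""
    places = version.split(".")
    if len(places) < 2 or not all(p.isdigit() for p in places):
        raise ValueError(f"expected a dotted numeric version, got {version!r}")
    major, minor = places[0], places[1]
    return f"{BASE}{product}/{major}.{minor}/"


print(namespace_for("DDI-CDI", "1.0"))
# A hypothetical sub-sub release stays in the same namespace:
print(namespace_for("DDI-L", "3.3.1"))
```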

Subdomains are all a bad idea (except for the rdf-vocabulary one for legacy reasons).

Retain the /Specification/

Break out by product (for new products as above)

Action on Darren to write this up and circulate.

If ICPSR can’t handle this, then we will deal with it at the point of failure. Darren to pursue the needed configuration with them.

Other Issues: Transformations across Standards/“Union” Model for RDF

Identification

Discussion of IDs and how they can be expressed (esp. vis-à-vis RDF IRIs).

Codebook has:

  • DDI Lifecycle URN and a Codebook URN attributes + standard ID – the URNs are only on specific elements

  • Round-tripping between Lifecycle & Codebook is possible where the content exists in both places, in a morphologically similar fashion.

  • CDI provides a place for “DDI” identifiers, but only one.

Question: Does the metadata have an identity independent of its expression in a particular version of DDI?

Answer: The identifier refers to the metadata item, not its representation in a particular version of a particular syntax. This may be problematic if there is a non-perfect mapping between syntax versions.

When there is a morphological mismatch, one structure may have more IDs than another expression.

Unless we want to have separate IDs for every syntax expression of every version of every model, plus an identifier for the metadata item itself, we should default to using just the identifier of the metadata, as maintained by its agency.

Resolved: The agency has the canonical version of the item. It is on the systems implementers to persist IDs as required by their use of the metadata in different encodings. (Often, these are one-way flows which can get away with being lossy.) Identifiers refer to the “conceptual” metadata item, not an expression of it in a syntax/syntax or schema version/specification.

When it comes to IRIs, these can be associated with DDI IDs as required, but we do not dictate any specific relationship between URIs and DDI IDs.

If we formalize the “union” conceptual model behind all versions of DDI, then this problem becomes more tractable: we can establish the set of objects which are consistent across the different DDI specs. This is an after-the-fact “Oh Shit!” moment: the union formalization becomes a target for future products/product versions, to heighten consistency.

The development of a union conceptual model seems like a good idea, and UML object models would be a least-worst way to do it. This would be the basis of development for the standards, not a product designed for users – it is an internal reference model for the DDI Alliance.

Instance Variables

The instance variable exists in exactly one data set. If an identical instance variable exists in another data set, it is a separate one with its own ID. The reusable aspects are a Represented Variable, not an Instance Variable. Summary statistics (etc.) are always applicable to the Instance Variable.

We do not have an object representing a selected set of cases which can be used across data sets. This is implicit in the data set, but does not work for a more granular approach to data.

When you subset to create a new data set, you get new Instance Variables.

Note that this is at the logical level.
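The rules above can be sketched as a small Python model: the Represented Variable is the shared, reusable part, each Instance Variable belongs to exactly one data set, and subsetting produces a new data set with new Instance Variables. The class and function names, and the ID scheme, are illustrative assumptions and are not taken from any DDI specification.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # stand-in for an agency's ID assignment; purely illustrative


def new_id(prefix: str) -> str:
    return f"{prefix}-{next(_ids)}"


@dataclass(frozen=True)
class RepresentedVariable:
    """The reusable part: may be shared by many data sets."""
    id: str
    name: str


@dataclass(frozen=True)
class InstanceVariable:
    """Belongs to exactly one data set and carries its own ID."""
    id: str
    dataset_id: str
    represented: RepresentedVariable


@dataclass
class DataSet:
    id: str
    variables: list[InstanceVariable] = field(default_factory=list)

    def add(self, represented: RepresentedVariable) -> InstanceVariable:
        iv = InstanceVariable(new_id("iv"), self.id, represented)
        self.variables.append(iv)
        return iv

    def subset(self, names: set[str]) -> "DataSet":
        """Create a new data set; kept columns get new Instance Variables."""
        child = DataSet(new_id("ds"))
        for iv in self.variables:
            if iv.represented.name in names:
                child.add(iv.represented)
        return child


age = RepresentedVariable(new_id("rv"), "age")
income = RepresentedVariable(new_id("rv"), "income")
survey = DataSet(new_id("ds"))
survey.add(age)
survey.add(income)
extract = survey.subset({"age"})

# Same Represented Variable, different Instance Variable IDs.
print(survey.variables[0].id, extract.variables[0].id)
```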

The variable cascade is an obvious starting point for the union model.

Do we need a more machine-actionable model of a Universe? DDI-L and CDI seem to handle this in the same, fuzzy way by simply letting you do what you want in an external system.

The same applies to Populations.

Survey Description

Survey description is not retrospective: it describes the design, not the administration of the survey to specific respondents. Such a retrospective description might be useful, especially with dynamically generated surveys.

INSEE example – across waves, questions might be added. Fields may be pre-populated based on earlier waves.

Union Model as the Basis for Consistent Product Creation

We could have a union model, and products would be implementations of this for specific uses. This could help us to explain how the products fit together.

Could you query across the metadata regardless of which product was used to describe it?

This union model might not be very deep. It would consist of the set of metadata items which are valuable for management.

We start by identifying a set of high-priority types and then start mapping back to it:

  • Data Sets/Physical Instances

Question: What about data streams? “Data Frames”? Are these something we need to start to cover (CDI does)? Today we focus on tabular data. Is that sufficient?

  • Variable cascade (Conceptual, represented, and instance variables)

  • Concepts

  • Classifications and Codelists – CVs as applied to data representation

Question: How far do we push this? Concordances? Management classes?

  • Questions

  • Data Structures

  • Universe

  • Population

  • Unit/UnitType

  • Agents (Organizations, Individuals, etc.)

What is the intended use of the union model? Is it a product?  We need to define a scope for the union model. It is a “product” in the sense of needing to be published as a namespace.

Idea: The union model would only exist at the class level, and not go into the properties of any of the objects. These union classes would be mapped against existing products. This is a model of types. The RDF serializations need to establish equivalencies (an OWL “isA”). Union model classes would have their own PIDs to provide a layer of description spanning the breadth of the DDI products which link to them.
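A class-level mapping of this kind could be sketched as follows in Python, emitting N-Triples. The “isA” in the notes is read here as rdfs:subClassOf, which is an assumption (owl:equivalentClass would be the stronger alternative). The union namespace is hypothetical, since none has been agreed, and the class local names are illustrative.

```python
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Hypothetical namespace for the union model; nothing has been agreed.
UNION = "https://ddialliance.org/Specification/Union/1.0/"
CDI = "https://ddialliance.org/Specification/DDI-CDI/1.0/"
LIFECYCLE = "https://ddialliance.org/Specification/DDI-L/3.3/"

# Union class -> product classes mapped to it (local names are illustrative).
MAPPINGS = {
    "InstanceVariable": [CDI + "InstanceVariable", LIFECYCLE + "Variable"],
    "Concept": [CDI + "Concept", LIFECYCLE + "Concept"],
}


def to_ntriples(mappings: dict[str, list[str]]) -> list[str]:
    """Emit one subClassOf triple per product class, classes only."""
    lines = []
    for union_class, product_classes in sorted(mappings.items()):
        for product_class in product_classes:
            lines.append(
                f"<{product_class}> <{RDFS_SUBCLASS_OF}> <{UNION}{union_class}> ."
            )
    return lines


def products_for(union_class: str, mappings: dict[str, list[str]]) -> list[str]:
    """List the product classes that link to a given union class."""
    return list(mappings.get(union_class, []))


for line in to_ntriples(MAPPINGS):
    print(line)
```

A lookup such as products_for("InstanceVariable", MAPPINGS) is the simplest form of querying across products: the union class acts as the shared handle, whichever product described the metadata.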

What about other standards? Schema.org? DCAT? PROV?

Clarifying Description of Reuse in Data (“Dan’s Simple Requirement”): A Potential Action with an Eye to the Future

Three needed constructs:

  • Dataset, “data frame” (pattern of a record), instance variables (columns)

  • Lifecycle: PhysicalInstance, RecordRelationship made up of Logical Records, (Instance)Variables

  • How does this map to DCAT?

By adding these objects to the union model, we have a basis for alignment and further development of DDI-L and CDI.

Section 5.3, https://ddialliance.org/sites/default/files/DDI%203.2%20Best%20Practices_0.pdf (Best Practices Guide for DDI-L 3.2, 3.3) as a starting point.

Look at these inputs, and determine how they work vis-à-vis streaming data, services, databases, data files, etc. Come up with an agreed set of types which recognize the differences between reusable structures across “instances,” patterns within instances, etc., where an “instance” is a single DB, data file, or data stream. Clarify what a “distribution” is in DCAT (https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution), similar to Schema.org.

This clarified view could be added to the Union Model to help drive future developments and convergence.

Hackathon Ideas

A DDI-CDI WG working meeting is being arranged in the margins of the RDA Plenary on the two days prior to the DDI Developer’s Group Hackathon. (On the first day of the Hackathon, there will be a public DDI-CDI side event on the RDA program).

Because members of the CDI WG will be present in Gothenburg, it was suggested that if some CDI-related proposals were made to the Hackathon as possible work projects, the CDI WG members could help developers work with the new CDI model.

Two ideas immediately suggest themselves, which would be of value both to the DDI community looking at implementing CDI, and help in understanding at a detailed level how DDI-CDI aligns with other DDI standards:

(1)    DDI Lifecycle-to-DDI-CDI: This would be a transformation tool which would take existing DDI Lifecycle metadata expressed in XML and render it as DDI-CDI RDF, for the description of a data set.

(2)    DDI Codebook-to-DDI-CDI: This would be a transformation tool which would take existing DDI Codebook metadata expressed in XML and render it as DDI-CDI RDF, for the description of a data set.
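To give a feel for the first idea, here is a minimal Python sketch that reads a heavily simplified Lifecycle fragment and emits N-Triples. It is a starting point for discussion, not a proposed design. Several things are assumptions: the sample XML and its URN are invented, the target class IRI is built from the namespace proposed in these notes, the DDI URN is used directly as the RDF subject (the notes do not dictate any relationship between IRIs and DDI IDs), and rdfs:label stands in for whatever naming property DDI-CDI would actually use.

```python
import xml.etree.ElementTree as ET

NS = {
    "l": "ddi:logicalproduct:3_3",
    "r": "ddi:reusable:3_3",
}
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Assumed target class IRI, built from the namespace proposed in these notes.
CDI_INSTANCE_VARIABLE = (
    "https://ddialliance.org/Specification/DDI-CDI/1.0/InstanceVariable"
)

# Invented, heavily simplified sample; real Lifecycle instances carry far more.
SAMPLE = """\
<l:VariableScheme xmlns:l="ddi:logicalproduct:3_3" xmlns:r="ddi:reusable:3_3">
  <l:Variable>
    <r:URN>urn:ddi:int.example:var-age:1</r:URN>
    <l:VariableName><r:String>AGE</r:String></l:VariableName>
  </l:Variable>
</l:VariableScheme>
"""


def lifecycle_to_cdi(xml_text: str) -> list[str]:
    """Turn each Lifecycle Variable into a typed, labelled RDF resource."""
    root = ET.fromstring(xml_text)
    triples = []
    for variable in root.iterfind(".//l:Variable", NS):
        urn = variable.findtext("r:URN", default="", namespaces=NS).strip()
        name = variable.findtext(
            "l:VariableName/r:String", default="", namespaces=NS
        ).strip()
        if not urn:
            continue  # nothing to use as a subject in this sketch
        triples.append(f"<{urn}> <{RDF_TYPE}> <{CDI_INSTANCE_VARIABLE}> .")
        if name:
            escaped = name.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'<{urn}> <{RDFS_LABEL}> "{escaped}" .')
    return triples


for triple in lifecycle_to_cdi(SAMPLE):
    print(triple)
```

A Codebook-to-CDI tool could follow the same shape with a different set of XPath expressions; where the two source structures differ morphologically, the ID issues discussed under Identification would need to be handled explicitly.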

Other, similar transformations could also be developed, but the two specified here are seen as the highest-value ones: they give adopters of DDI-CDI who already use DDI an easy way to explore the use of the new standard as an extended dissemination format for FAIR implementations where DDI-CDI descriptions of data at a granular level would be most appropriate.

Ideally, the DDI-CDI output could be combined with the generation of RDF study-level metadata expressed as Schema.org (in JSON-LD) or in DCAT-AP. Details are open for discussion. Since these standards are popular in many domains among institutions implementing FAIR, they are seen as good candidates.