SDTL Working Group Sept 17, 2020

Video of this meeting is available at https://drive.google.com/file/d/1EdrOOmhTmMuaFWvibn9GM9kC5-LAx8Kj/view?usp=sharing

Notes:

1. DDI Alliance adoption process

George explained that the next step in the process for the DDI Alliance to adopt SDTL is a vote of the Scientific Committee, which has representatives of all members. The vote will be scheduled for November. George will do a webinar at the beginning of November to explain SDTL to the Scientific Committee.

2. C2Metadata Resource Page at MTNA
Carson and her colleagues have created a resource page at MTNA pointing to all of the resources created by the C2Metadata Project. The page points to a variety of tools, and it includes information about using APIs to invoke the tools. The page includes the MTNA tool that runs the entire workflow from a command script to a codebook. We are also adding a simpler open source workflow tool.

3. Pending SDTL changes
The items on our Pending Changes page will be resolved in the next two weeks.

4. SDTL issues
The Sept. 11 meeting of the C2Metadata Project considered several difficult issues, which were not resolved. We continued these discussions.

a. Can SDTL include a URN that would link a variable name in SDTL to a variable description in an external ontology?

Larry Hoyle from the DDI-Cross Domain Integration (DDI-CDI) working group asked George whether external identifiers could be inserted in SDTL. These identifiers would be URNs pointing to the definition of a variable in an official ontology somewhere. Larry was on the call to provide more information about the issue. People from the CODATA Digital Representation of Units of Measure (DRUM) Initiative asked the DDI-CDI team how it would point to official definitions of units of measurement. For example, how do you know that "km" stands for "kilometers"? The extension of this is how to link a variable name to an official identifier for the variable itself. Larry asked if we could add something in SDTL to make these associations.

George offered four methods:

  1. Use the URN as a variable name

  2. Create a mapping between variable names and URNs at the file level

  3. Add an identifier property to every command that creates or modifies a variable

  4. Add a SetVariableIdentifier command

It was agreed that #1 is not a good approach, because URNs would not be good variable names in the source languages.
George argued against #2, because the contents of a variable might change several times in a script. For example, a "Distance" variable might start out measured in kilometers and then be changed to miles. We need to associate different identifiers with these two states. A file-level mapping would not show which states of the variable link to which identifier.
George favored #4, because it is simpler than #3.
None of these approaches resolve the problem of how a user would assign an identifier in a command script.
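For illustration, a SetVariableIdentifier command (#4) might look something like the sketch below in SDTL JSON. The property names and the URN are placeholders invented for this example, not agreed schema elements:

    {
      "$type": "SetVariableIdentifier",
      "variableName": "Distance",
      "identifier": "urn:example:ontology:distance-in-kilometers"
    }

A second command of this kind placed after the conversion to miles could point to a different URN, which would capture the two states of "Distance" described above.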

Larry said that if a variable named "Distance" changed from kilometers to miles, it would be two different variables in DDI. George said that we treat the variable that way when we use SDTL to update DDI: the Metadata Updater built by MTNA creates a new variable description each time a variable changes state. Variable descriptions that are not associated with a file identifier are considered temporary. Larry said that this shows why a name is not a real identifier for a variable.

Tommy explained how the ProvONE implementation of SDTL adds IDs to keep track of unique states of variables. It should be easy to associate URNs with variables in JSON-LD and RDF.

Jeremy said that he preferred file-level mappings (#2) if we can find a way to handle multiple changes. Jeremy said that every other SDTL command describes an actual transform, and this is not a transform happening in the SPSS file. It is useful information, but it is not a transform. Mapping at a higher level would avoid repeating information. Dan said that if variables are moving to new states, we might be able to put a mapping in the dataframe definition properties.

Dan said that we're talking about two things here. One is a globally unique identifier for specific states of a variable. Larry was also talking about assigning units of measure to a variable, which is different from describing a variable. Dan said that units of measure are usually handled in the metadata. Dan said that he doesn't see how the parsers could add globally unique identifiers.

This issue was unresolved and requires further discussion.

b. How do we preserve the order of commands in a script when SDTL is translated into RDF?

George showed a very simplified example of the knowledge graph that Tommy is creating using ProvONE. Once one has a graph like this, it is possible to trace a variable backwards to see where it came from.

As part of his work, Tommy asked if the commands in an SDTL JSON file are recorded in the order of execution. This is important for recreating the workflow of a program in ProvONE. The consensus of those on the C2Metadata project was that SDTL JSON files are in the order of execution, because commands are executed in order in the source languages.

It is easy to specify ordering when SDTL is expressed in JSON and XML formats. But specifying ordering is more difficult when SDTL is translated into RDF format. So, we face the problem of representing that ordering in the RDF world that Tommy is working in. One of the solutions is to represent states of a variable. In the knowledge graph a variable will appear each time it changes state.
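As a rough sketch of the variable-state approach, a JSON-LD fragment along the following lines could represent two states of the same variable and record their order through a derivation link. The identifiers and the use of prov:wasDerivedFrom are assumptions made for this example, not details of Tommy's actual implementation:

    {
      "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "ex": "http://example.org/"
      },
      "@graph": [
        { "@id": "ex:Distance_state1", "rdfs:label": "Distance (kilometers)" },
        { "@id": "ex:Distance_state2", "rdfs:label": "Distance (miles)",
          "prov:wasDerivedFrom": { "@id": "ex:Distance_state1" } }
      ]
    }

Because each state points back to the state it was derived from, the order of the transformations can be recovered by following those links.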

There is a feature in COGS for specifying that elements must appear in a specified order. Dan has some suggestions about how that can be used to assure that commands are executed in order when SDTL is translated to RDF. Dan and George are looking for places in SDTL where order matters. Dan will modify the documentation to show where order matters.

George said that there is a question about how the nesting of functions will be handled in RDF. For example, suppose that we have an expression "X + Y/Z". How does the RDF show that "Y/Z" must be done before the addition? Tommy said that he thinks this will work in RDF by leveraging the recursion in RDF.
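As a sketch of the point about recursion, the expression "X + Y/Z" is already a nested tree in SDTL JSON, with the division appearing as an argument of the addition. The property and function names below are placeholders rather than exact SDTL element names:

    {
      "function": "add",
      "arguments": [
        { "variableName": "X" },
        {
          "function": "divide",
          "arguments": [
            { "variableName": "Y" },
            { "variableName": "Z" }
          ]
        }
      ]
    }

When this tree is mapped to RDF, each inner expression can become its own node, so the graph records that the addition takes the result of the division as one of its inputs.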

c. Should SDTL support user-supplied functions and sub-programs?

Currently, when a Parser encounters a macro, it tries to expand it into separate commands. Expansion is not always possible, and SDTL has elements for describing macros. R and Python make it very easy to create functions. Currently, SDTL has a Function Library that maps functions in other languages to equivalent functions in SDTL. We can't do that if the function is defined by a user, and it is not clear how we could support user-defined functions in SDTL.
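As an illustration of the gap, a call to a built-in function can be mapped through the Function Library, while a user-defined function has no entry to map to. The mapping below is invented for this sketch and does not use actual Function Library entries:

    {
      "builtInFunction": {
        "sourceLanguage": "R",
        "sourceFunction": "toupper",
        "sdtlFunction": "upcase"
      },
      "userDefinedFunction": {
        "sourceLanguage": "R",
        "sourceFunction": "my_recode_wave",
        "sdtlFunction": null
      }
    }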

This leads to questions:

  • How often does this happen?

  • How high a priority should adding user-defined functions/subprograms to SDTL be?

  • Are there ways to divide the problem into parts?

Dara said that most of the programs used to manage data in the CLOSER Project are in SPSS or Stata, which would not involve bespoke functions or macros. But Dara often creates functions in R scripts used to analyze the data when there are operations that need to be repeated. It would enhance the usefulness of SDTL if it could capture these functions.

Tommy said that this is definitely a use case in Whole Tale. It is very common to see user-defined functions in the DataONE community. It would also generalize SDTL from statistical procedural scripts to more general purposes.

Dan said that the Parsers try to unroll functions to describe what is going on inside them. We can do that with many user-defined functions, because they are still in the same language. But there will be times when the Parsers encounter things that they cannot unroll. For those cases, we have the Unsupported command in SDTL. We could have the Parsers put more information into the Unsupported command; that would be a better way to surface what was encountered there. Currently, Unsupported only has SourceInformation, which gives the text of the command in the original language. But the parser might know more about what was happening, even though it cannot follow the command. So we might be able to expand the Unsupported command to carry more explicit information.
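As a sketch of what an expanded Unsupported command might carry, the fragment below keeps the original command text and adds properties for the partial facts a parser could still report. Only the idea of a SourceInformation element giving the original text comes from the current SDTL as described above; the other property names are hypothetical additions for this example:

    {
      "$type": "Unsupported",
      "sourceInformation": {
        "originalSourceText": "result <- my_recode_wave(df$Distance)"
      },
      "variablesMentioned": ["Distance"],
      "reason": "call to a user-defined function that could not be unrolled"
    }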

George said that we also encountered this when we were looking at loops. If the loop says "do this loop 3 times", it is easy to describe. But if the loop says "do this loop as X goes from 1 to Z", we have the problem of not knowing the value or values of Z. We can translate the loop into SDTL, but we can't provide the same level of description as when we can expand the loop into its individual operations.
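A minimal sketch of the two loop cases, using invented property names rather than actual SDTL loop elements:

    {
      "expandableLoop": {
        "iterations": 3,
        "comment": "bounds are known, so the loop can be unrolled into three sets of commands"
      },
      "unexpandableLoop": {
        "lowerBound": 1,
        "upperBound": "Z",
        "comment": "Z is not known at parse time, so the loop cannot be unrolled"
      }
    }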

The comments from Dara and Tommy suggest that we need to do more work on this.