SDTL Working Group, June 22, 2020
Created by George Alter
Last updated Jun 26, 2020
Video of this meeting is available at SDTL-WG_2020-06-22.mp4
Introductions
Wendy Thomas and Jon Johnson of the DDI Alliance Technical Committee joined the meeting.
DDI Review Process (Wendy and Jon)
Wendy explained that the DDI Alliance standards are governed by a “Standards Development and Review Process and Procedures” document (https://ddialliance.org/sites/default/files/DDIAllianceStandardsDevelopmentandReviewProcessandProcedure.pdf). A proposed standard goes through a process of feedback and revision prior to a vote of the Scientific Board of the Alliance.
The Alliance has developed a process of announcements and a review webpage with background information and a way to file an issue. The review page for SDTL is SDTL Review
Comments are submitted through a JIRA issue tracker.
The Technical Committee (TC) has been working with George, and Dan and Jeremy are also on the TC. The content of the review page is close to final, and the plan is to announce a review for the period July 1 to August 31. Barry Radler, chair of the DDI Alliance Marketing Group, is also reviewing the review page to be sure that it explains the relevance of SDTL to the DDI Alliance community.
After the review begins the TC will check with us to be sure that the tools are working correctly, to send additional publicity if necessary, to contact specific groups, and generally to facilitate the process.
After the review period, the TC will check with the WG to be sure that all comments are reviewed and properly resolved.
Jon pointed to the part of the Review page explaining the scope of the standard: SDTL Command Language, Function Library, Pseudocode Library Schema. Other aspects of the C2Metadata Project, e.g. the computer code for parsers and updaters, are not part of the SDTL standard to be maintained by the DDI Alliance. George said that the software developed under the NSF project is all open source, but it is not part of the SDTL standard.
SDTL WG Process
Relationship to C2Metadata Project
There is a lot of overlap with the SDTL WG, but the WG includes people who have not been part of C2Metadata (Dara, Jim, Chifundo, Tommy).
The C2Metadata Project will end in 2020, and the DDI Alliance will be responsible for SDTL from then on.
George asked Wendy and Jon to release SDTL in July, so that there would be an overlap between the DDI Alliance review process and the NSF project.
Work on parsers and updaters is ongoing, and that work will bring up issues for the SDTL standard. But George believes that we can expect only minor tweaks to the existing model.
Since the C2Metadata Project and the SDTL WG will be operating in parallel for a while, George has added Chifundo, Dara, Tommy, and Jim to the C2Metadata email list and to the calendar invite for monthly project meetings.
Future meeting time
We agreed on a monthly meeting on the 3rd Thursday of every month at 12:00 ET / 16:00 UTC.
George will record meetings and post notes on the Confluence site.
SDTL Documentation: Comments and suggestions
SDTL sites
SDTL review page: SDTL Review
SDTL Working Group page: SDTL - Structured Data Transformation Language Working Group
Other features can be added to this page. This page is on a system maintained by the DDI Alliance.
Documentation about SDTL is in COGS (Convention-based Ontology Generation System), which was developed by Jeremy and Dan. The SDTL COGS is currently in a GitLab repository of the C2Metadata Project. COGS turns the files in the repository into web pages.
Repository page: https://gitlab.com/c2metadata/sdtl-cogs
SDTL introduction page: http://c2metadata.gitlab.io/sdtl-docs/
SDTL User Guide: http://c2metadata.gitlab.io/sdtl-docs/master/
Detailed descriptions of SDTL elements: http://c2metadata.gitlab.io/sdtl-docs/master/composite-types/
Tommy said that the COGS system is very helpful from a developer perspective. Examples and exercises would be helpful.
Dara asked which libraries in R are supported. It is important here to distinguish between the SDTL specification and the C2Metadata parsers. SDTL as a language should be able to cover any library. The implementation of a parser to translate an R script into an SDTL script is not under the auspices of the DDI Alliance. The C2Metadata parser for R covers the base and tidyverse libraries, and the parser for Python covers the base and Pandas libraries.
Links to external files are not working. George and Jeremy are working on those links.
Jim expressed interest in the analysis side of the statistical packages. The goal of SDTL is to describe a dataset. It is not intended to describe tables or coefficients produced by an analysis command. [SDTL does have an “Analysis” command, which reports the source language commands of analysis procedures, but it does not do anything with them.] However, analysis commands can also create data. For example, regression commands can create new variables for predicted values and residuals. George has started some conversations about how data created by analysis commands could be documented in SDTL. Since the number and diversity of analysis commands are very large, it may be more useful to rely on an external ontology of statistical methods. George had a preliminary conversation with Philippe Rocca-Serra at the Oxford e-Research Center, who maintains the STATO ontology of statistical methods (http://stato-ontology.org/). They are thinking about a project to use STATO in SDTL to describe how data from statistical analysis were created. If anyone is interested in a project like this, please contact George.
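To illustrate the point that an analysis command can create data, here is a minimal Python sketch of a regression step that adds predicted values and residuals to a dataset. The function names, the column names "predicted" and "residual", and the sample data are all hypothetical; this is not SDTL and not how any statistical package or C2Metadata tool documents such variables.

```python
from typing import Dict, List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def add_regression_variables(
    rows: List[Dict[str, float]], x_name: str, y_name: str
) -> List[Dict[str, float]]:
    """Return copies of the rows with two new variables added.

    The new column names are illustrative assumptions.
    """
    xs = [row[x_name] for row in rows]
    ys = [row[y_name] for row in rows]
    intercept, slope = fit_line(xs, ys)
    out = []
    for row in rows:
        predicted = intercept + slope * row[x_name]
        out.append(
            {**row, "predicted": predicted, "residual": row[y_name] - predicted}
        )
    return out


# Made-up sample data, for illustration only.
data = [
    {"age": 20.0, "income": 21.0},
    {"age": 30.0, "income": 33.0},
    {"age": 40.0, "income": 38.0},
    {"age": 50.0, "income": 52.0},
]
augmented = add_regression_variables(data, "age", "income")
for row in augmented:
    print(row)
```

The two added columns are the kind of derived variables whose origin a dataset description would need to record, even though the regression coefficients themselves are outside the scope of SDTL.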
Chifundo said that he has an interest in using SDTL for harmonization of data from different sources. He offered to write up a use case about harmonization that we can add to the User Guide. Dara also works on data harmonization, and she offered to help.
Ornulf asked about the work that Tommy has done to integrate SDTL in the PROV world. Tommy pointed to his work on GitHub (https://github.com/ThomasThelen/sdtl-provone). He is taking a query-based approach: creating the queries that we would like to ask and making sure that they are supported by the data model. SDTL can be represented in PROV in a number of different ways, and we need to derive the data model from the questions that we want to answer. For example, SDTL could be used in a code checker that verifies that no variables are unused in an R script. Tommy is moving toward a system that will support SPARQL queries.
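The unused-variable check mentioned above could look roughly like the following Python sketch. It assumes a hypothetical, heavily simplified record for each command with "produces" and "consumes" lists; these field names are illustrative only and are not the SDTL schema or Tommy's data model. A real implementation would more likely be a query against the PROV representation.

```python
from typing import Dict, List, Set

# Hypothetical, simplified command records (NOT the SDTL schema).
# Each record lists the variables a command creates and the ones it reads.
commands: List[Dict[str, object]] = [
    {"command": "Compute", "produces": ["bmi"], "consumes": ["weight", "height"]},
    {"command": "Compute", "produces": ["age_sq"], "consumes": ["age"]},
    {"command": "Compute", "produces": ["obese"], "consumes": ["bmi"]},
    {"command": "Save", "produces": [], "consumes": ["obese", "weight", "height", "age"]},
]


def unused_variables(cmds: List[Dict[str, object]]) -> List[str]:
    """Return variables that some command creates but no command reads.

    This simple version ignores command order: a variable counts as used
    if any command in the script consumes it.
    """
    produced: Set[str] = set()
    consumed: Set[str] = set()
    for cmd in cmds:
        produced.update(cmd["produces"])
        consumed.update(cmd["consumes"])
    return sorted(produced - consumed)


print(unused_variables(commands))  # prints ['age_sq']
```

Starting from the question ("which variables are never used?") and then checking that the records carry enough information to answer it is the same direction of reasoning as the query-based approach described in the notes.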
Chifundo asked if developers on the C2Metadata Project would work with him on a parser for Pentaho. George said that he would be happy to talk about it.
George asked that members of the WG continue to review the SDTL documents and send comments and suggestions to the WG email list.
Other matters
George is preparing two articles about SDTL and C2Metadata for submission to journals.
We will probably schedule a webinar about SDTL for late July.
George hopes to do a brief video demonstrating C2Metadata software and explaining the role of SDTL.