SDTL - Structured Data Transformation Language Working Group

Purpose of SDTL

Structured Data Transformation Language (SDTL) is an independent intermediate language for representing data transformation commands.  Statistical analysis packages (e.g., SPSS, Stata, SAS, and R) provide similar functionality, but each one has its own command language.  SDTL consists of JSON schemas for common operations, such as RECODE, MERGE FILES, and VARIABLE LABELS.  SDTL provides machine-actionable descriptions of variable-level data transformation histories derived from any data transformation language.  Provenance metadata represented in SDTL can be added to documentation in DDI and other metadata standards.
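
As a rough illustration of what "machine-actionable" could mean here, the sketch below represents a single recode command as a JSON-style structure and serializes it.  The field names (command, sourceVariable, targetVariable, rules, sourceLanguage, sourceText) are hypothetical stand-ins chosen for this example; they are not taken from the SDTL schemas, which should be consulted for the actual property names.

```python
import json

# Hypothetical, simplified description of a RECODE command.
# Field names are illustrative assumptions, not the real SDTL schema.
recode_command = {
    "command": "Recode",
    "sourceVariable": "age",
    "targetVariable": "age_group",
    "rules": [
        {"from": [0, 17], "to": 1},
        {"from": [18, 64], "to": 2},
        {"from": [65, 120], "to": 3},
    ],
    # The original statement is kept so the description stays traceable.
    "sourceLanguage": "SPSS",
    "sourceText": "RECODE age (0 thru 17=1) (18 thru 64=2) (65 thru 120=3) INTO age_group.",
}


def variables_touched(command):
    """Return (inputs, outputs) for a command in this illustrative format."""
    return [command["sourceVariable"]], [command["targetVariable"]]


# Serialize to JSON text, the form in which such a description would travel.
sdtl_like_json = json.dumps(recode_command, indent=2)
```

Because the command is plain data rather than package-specific syntax, a tool can answer questions such as "which variable was this one derived from?" without parsing SPSS, Stata, SAS, or R code again.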

Benefits of SDTL for DDI

SDTL greatly enhances the value of DDI, because it is a key component of an automated metadata production process.  Currently, DDI metadata is almost always created by data repositories, not by data producers.  Even when data are born digital, data producers discard provenance information that could be transported into DDI, because they do data management and variable transformations in statistical packages with minimal metadata capabilities.  SDTL and the tools created by the C2Metadata Project are designed to create a metadata life cycle that parallels the data life cycle.  The same scripts that are used to transform and manage data files can be used to update metadata files.  As a result, data producers can create more accurate and complete DDI metadata with less time and effort, both for themselves and for data repositories.

Description of SDTL

There are currently three parts of SDTL:

  1. JSON Schemas:  SDTL has been developed as a set of JSON schemas, which describe data transformation commands and the expressions within commands, such as variable names.  SDTL JSON can be rendered in other formats, such as XML and OWL RDF.  SDTL documentation and downloads are available on GitLab.

  2. Function Library:  Each statistical package has hundreds of functions for common mathematical, statistical, and text operations, such as LOG, SINE, AVERAGE, and LENGTH.  The Function Library is a crosswalk between a standard SDTL representation of each function and the implementations of that function in various statistical languages. The Function Library minimizes program code for SDTL applications, because all functions can be handled in the same way, and additions to the Function Library do not require changes to program code.

  3. Pseudocode Library:  The Pseudocode Library provides human-readable translations of SDTL commands.  Like the Function Library, it is external to the SDTL JSON schemas, so it minimizes program code and can easily be updated.
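
The idea that both libraries live outside the program code can be sketched as two lookup tables plus two small functions.  This is a minimal sketch under stated assumptions: the standard names (natural_log, string_length), the table layout, and the template wording are all invented for illustration and do not reproduce the actual Function Library or Pseudocode Library.  The package function names themselves (LN, ln, LOG, log, CHAR.LENGTH, strlen, LENGTH, nchar) are the real ones in each package.

```python
# Hypothetical crosswalk: assumed standard name -> native name per package.
FUNCTION_LIBRARY = {
    "natural_log": {"SPSS": "LN", "Stata": "ln", "SAS": "LOG", "R": "log"},
    "string_length": {
        "SPSS": "CHAR.LENGTH",
        "Stata": "strlen",
        "SAS": "LENGTH",
        "R": "nchar",
    },
}

# Hypothetical templates: command type -> human-readable sentence pattern.
PSEUDOCODE_TEMPLATES = {
    "Compute": "Set {variable} to {expression}.",
    "Rename": "Rename {old} to {new}.",
}


def to_standard(language, native_name):
    """Map a package-specific function name to the assumed standard name."""
    for standard_name, natives in FUNCTION_LIBRARY.items():
        if natives.get(language) == native_name:
            return standard_name
    return None  # not in the crosswalk


def render(command):
    """Fill the template for a command with the command's own fields."""
    template = PSEUDOCODE_TEMPLATES[command["command"]]
    return template.format(**command["fields"])


# Supporting a new function is a data change only; no function above changes.
FUNCTION_LIBRARY["square_root"] = {
    "SPSS": "SQRT", "Stata": "sqrt", "SAS": "SQRT", "R": "sqrt",
}

sentence = render({"command": "Rename", "fields": {"old": "v1", "new": "income"}})
```

The design choice illustrated is that the lookup and rendering functions never mention a specific function or command, so extending either table does not require touching them.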

The C2Metadata Project has developed software applications based on SDTL:

  1. Parsers for SPSS, Stata, SAS, R, and Python translate command scripts in these languages into SDTL

  2. Updaters for DDI and Ecological Metadata Language (EML) incorporate SDTL into existing metadata files

  3. A Pseudocode Generator translates SDTL commands into human-readable text

  4. A Codebook Formatter reads SDTL-enhanced DDI Codebook XML, and creates an interactive HTML codebook that displays variable-level provenance metadata with hyperlinks to antecedent versions of variables.
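
The "antecedent versions of variables" that the Codebook Formatter links to can be thought of as a walk backwards through the transformation history.  The sketch below assumes a deliberately flat history in which each step lists the variables it read and the variables it produced; that shape, the variable names, and the function name antecedents are assumptions for illustration, not the structure used by the C2Metadata tools.

```python
# Each step records which variables a command read and which it produced.
# This flat (command, inputs, outputs) shape is an illustrative assumption.
history = [
    {"command": "Compute", "inputs": ["height_cm"], "outputs": ["height_m"]},
    {"command": "Compute", "inputs": ["weight_kg", "height_m"], "outputs": ["bmi"]},
    {"command": "Recode", "inputs": ["bmi"], "outputs": ["bmi_cat"]},
]


def antecedents(variable, steps):
    """Collect every variable that the given variable was derived from."""
    found = set()
    pending = [variable]
    while pending:
        current = pending.pop()
        for step in steps:
            if current in step["outputs"]:
                for source in step["inputs"]:
                    if source not in found:
                        found.add(source)
                        pending.append(source)
    return found


bmi_cat_sources = antecedents("bmi_cat", history)
```

This simplified walk ignores the order of steps, so it would over-report antecedents if a variable were overwritten partway through a script; a real implementation would need to track versions of each variable.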

These applications are prototypes intended to demonstrate the potential of SDTL, and they will be made available as open source code.  We expect software developers to incorporate these functions into future products.

Please note that SDTL is currently limited to describing data transformations.  A similar approach to statistical analysis is possible, but beyond the scope of our current project.

SDTL Working Group

Name                Role
George Alter        Lead
Carson Hunter
Jeremy Iverson
Chifundo Kanjala
Dara O’Neill
Ornulf Risnes
Dan Smith
Thomas Thelen

SDTL Review Page – Public page for the comment period