SDTL Working Group, August 19, 2021
Describing how SDTL handles dataframes
Dan is drafting new documentation on how SDTL handles dataframes. He explained that the goal is to describe the use of the consumesDataframe and producesDataframe attributes on SDTL commands, and to clarify what a dataframe name means at a given point in a script. For example, if a command changes a dataframe by creating a new variable, should producesDataframe use the same name as the consumesDataframe that went into the command?
When a dataframe is changed, we need a convention to indicate that the produced dataframe is a mutation of the consumed dataframe. This could be done by using the same dataframe name for both the consumed and produced dataframes, or by adding a flag that references the incoming dataframe.
Commands in SDTL are ordered. If a command uses an existing name for a produced dataframe, we know that the following commands can only use the new version of the dataframe with that name. This works in SDTL, but other representations that do not maintain ordering of commands (e.g., RDF) may need to modify names (e.g., DFname01, DFname02, DFname03…) or create separate identifiers. For example, the DDI Updater creates a new dataframe ID when the content of a dataframe changes, even though the name of the dataframe remains the same.
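The renaming idea above can be sketched in Python. This is a minimal illustration only, not the DDI Updater's method: the command dictionaries are a simplified stand-in for SDTL JSON (dataframes are given as plain name strings), and the function name version_dataframes and the two-digit suffix format are assumptions made for the example.

```python
from collections import defaultdict


def version_dataframes(commands):
    """Walk an ordered command list and give each dataframe version its own ID.

    Hypothetical helper: `commands` is a simplified stand-in for SDTL, where
    consumesDataframe and producesDataframe are plain lists of names.
    """
    counters = defaultdict(int)  # number of versions seen so far, per name
    current = {}                 # name -> ID of the latest version
    result = []
    for cmd in commands:
        consumed = []
        for name in cmd.get("consumesDataframe", []):
            if name not in current:
                # Dataframe existed before the script started: first version.
                counters[name] += 1
                current[name] = f"{name}{counters[name]:02d}"
            consumed.append(current[name])
        produced = []
        for name in cmd.get("producesDataframe", []):
            # Every produced dataframe is a new version, even if the name repeats.
            counters[name] += 1
            current[name] = f"{name}{counters[name]:02d}"
            produced.append(current[name])
        result.append(
            {"command": cmd["command"], "consumes": consumed, "produces": produced}
        )
    return result


# Illustrative script: one load followed by two commands that mutate DFname.
commands = [
    {"command": "Load", "producesDataframe": ["DFname"]},
    {"command": "Compute", "consumesDataframe": ["DFname"], "producesDataframe": ["DFname"]},
    {"command": "Compute", "consumesDataframe": ["DFname"], "producesDataframe": ["DFname"]},
]
versioned = version_dataframes(commands)
for step in versioned:
    print(step["command"], step["consumes"], "->", step["produces"])
```

Once each version has its own identifier, the consumes and produces links no longer depend on command order, which is the property an unordered representation would need.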
George asked Dan to consider guidance on when a variableInventory should be used.
Dan proposed adding documentation about the rowDimension and columnDimension properties, which are used in multi-dimensional or n-cube data.
Representing indexed arrays in SDTL
George proposed text for the SDTL Best Practices and Conventions to describe the use of the VariableArrayDereference() and ValueArrayDereference() functions. These functions provide SDTL with a way to describe the functionality of arrays without including arrays in the language.
Each function takes two arguments, an index and a list, where the index points to an element in the list. This works the same way as an array in most programming languages. However, an array is a persistent object that is declared once and then referenced by name, while these functions keep no state. The full list of items must therefore be given every time the function is used.
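The behavior described above can be mimicked with a small Python sketch. This is not the SDTL definition of either function: the name array_dereference, the base parameter, and the choice of 1-based indexing (as in SAS arrays by default) are assumptions made for illustration.

```python
def array_dereference(index, items, base=1):
    """Return the element of `items` that `index` points to.

    Hypothetical stand-in for the SDTL dereference functions. The list is
    passed in on every call because no array object persists between calls.
    `base=1` assumes SAS-style 1-based indexing.
    """
    position = index - base
    if not 0 <= position < len(items):
        raise IndexError(f"index {index} is outside a list of {len(items)} items")
    return items[position]


# A loop over an "array" of variables: the full list is repeated in each call.
selected = [
    array_dereference(i, ["score1", "score2", "score3"])
    for i in range(1, 4)
]
print(selected)
```

The repeated list literal inside the loop is deliberate: it shows the cost of describing array behavior without a persistent array object.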
These functions were created for use in SAS, which allows users to create arrays of variables for use in loops. It might be possible to use these functions for lists in R and Python in specific cases.
The rest of the meeting was a discussion of a path from SDTL to ProvONE. George showed a schema for triples that could be generated in the DDI Updater and exported for use in creating ProvONE RDF. Dan and Thomas were skeptical of George’s proposal.