Pending SDTL Changes

  1. Add row number function

    1. The SDTL row_number() function will be defined as returning the current row number in the dataframe.

    2. Stata, R, and Python have simple functions for selecting a subset of data by row number

      1. For example, dataFrame.iloc[2:4] will select the 3rd and 4th rows in the data frame. (Ranges in Python are 0-indexed and open on the right.)

    3. To use the IfRows command in SDTL to select rows by row number, we need a way to access the row number as IfRows iterates through a dataframe.
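A minimal Python sketch of the idea, using a plain list of rows rather than a real dataframe. It assumes row_number() is 1-based, which the proposal above does not state, and `if_rows` is a hypothetical helper standing in for IfRows, not part of SDTL.

```python
# Rows of a toy "dataframe"; each dict is one row.
rows = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}, {"id": "e"}]


def if_rows(data, condition):
    """Hypothetical stand-in for IfRows: keep rows whose row number passes `condition`.

    Assumes row_number() starts at 1; the SDTL definition may differ.
    """
    return [row for row_number, row in enumerate(data, start=1) if condition(row_number)]


# Python slice: 0-indexed and open on the right, so [2:4] is the 3rd and 4th rows.
by_slice = rows[2:4]

# The same selection expressed as a condition on the row number.
by_row_number = if_rows(rows, lambda n: 3 <= n <= 4)

print(by_slice == by_row_number)
```

Under the 1-based assumption, the slice bounds 2 and 4 correspond to the inclusive row numbers 3 and 4.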

  2. Create Weight element

    1. Replace the weightVariable property of Aggregate and Collapse with a weighting property whose required type is the new Weight element

    2. Add a Weight element with two properties:

      1. weightVariable

      2. weightType: frequency (default), probability, Stata_aweight, Stata_iweight

    3. The SDTL Collapse and Aggregate commands currently allow a weightVariable property. However, this property does not capture the four types of weights available in Stata.

    4. SPSS and SAS use only “frequency” weights, but all languages have additional procedures for defining complex survey weights.

    5. More properties will be added to Weight to cover complex sampling designs when we expand SDTL to include data created by analytical procedures, like regression.
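As a rough illustration of the proposed Weight element, the Python sketch below builds the JSON that a Collapse command might carry. The property names weighting, weightVariable, and weightType come from the proposal; the `$type` value "Weight", the use of a plain string for weightVariable, and the helper `make_weight` are assumptions made for this example.

```python
import json

# The four weight types listed in the proposal; "frequency" is the default.
WEIGHT_TYPES = ("frequency", "probability", "Stata_aweight", "Stata_iweight")


def make_weight(weight_variable, weight_type="frequency"):
    """Build a Weight element as a dict (hypothetical layout, not the final schema)."""
    if weight_type not in WEIGHT_TYPES:
        raise ValueError(f"unknown weightType: {weight_type!r}")
    return {
        "$type": "Weight",  # assumed type name
        "weightVariable": weight_variable,  # assumed to be a plain variable name
        "weightType": weight_type,
    }


# A Collapse command using the new weighting property in place of weightVariable.
collapse = {
    "$type": "Collapse",
    "command": "Collapse",
    "weighting": make_weight("wt", "Stata_aweight"),
}

print(json.dumps(collapse, indent=2))
```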

  3. Documentation: Factor subtypes

    1. R and Python both include a categorical data type, which is called Factor in R and Categorical in Python (pandas). SDTL calls the type Factor.

    2. Both R and Python allow Factor/Categorical variables to be either ordered or unordered. Only ordered factor variables have a meaningful sort order and can be used in greater/less than logical conditions. Unordered factors have no inherent order, so they should only be used in equality conditions.

    3. However, there are several differences in the ways that factors are implemented in R and Python. For example, factor levels in R are always string values, but categories in Python can be string or numeric.

    4. Because of these differences between languages, Factor variables should be described using the subTypeSchema and subType properties in the SetDataType command. These can be implemented like this:

    5. Python factors
         subTypeSchema: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
         subType: ordered, unordered

       R factors
         subTypeSchema: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors
         subType: ordered, unordered
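The sketch below shows one way a parser might emit these properties on a SetDataType command. Only subTypeSchema, subType, and the Factor type name come from the text above; the `variables` and `dataType` property names, the VariableSymbolExpression layout, and the helper `set_factor_type` are assumptions for illustration.

```python
import json

# Schema URLs taken from the listing above.
SUBTYPE_SCHEMAS = {
    "Python": "https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html",
    "R": "https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors",
}


def set_factor_type(variable_name, language, ordered):
    """Build a SetDataType command for a Factor variable (assumed layout)."""
    return {
        "$type": "SetDataType",
        "command": "SetDataType",
        # Assumed shape of a variable reference.
        "variables": [{"$type": "VariableSymbolExpression", "variableName": variable_name}],
        "dataType": "Factor",
        "subTypeSchema": SUBTYPE_SCHEMAS[language],
        "subType": "ordered" if ordered else "unordered",
    }


print(json.dumps(set_factor_type("education", "R", ordered=True), indent=2))
```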

       

  4. Should the type of the value property of BooleanConstantExpression be changed from string to Boolean, so that it appears as a Boolean in SDTL JSON? See http://c2metadata.gitlab.io/sdtl-docs/master/composite-types/BooleanConstantExpression/

     

Completed SDTL Changes

  6. Update documentation to make argumentName required in function calls

    1. Jan. 21, 2021; Implemented Mar. 11, 2021

    2. SDTL element: argumentName in FunctionArgument

    3. Change “Best Practices and Conventions” document

  7. Add software and fileFormat to SDTL file description elements

    1. Nov. 19, 2020; Implemented Dec. 9, 2020

    2. SDTL elements: Load, Save, AppendFileDescription, MergeFileDescription

    3. Properties: software and fileFormat

    4. Justification: The Load and Save commands have a software property that can hold either a software package or a file format. These two uses should be separated into two properties: software and fileFormat. The software property will describe the software used to read or write a file, which could include specific libraries in R and Python, like “pandas.read_csv”. fileFormat should have a limited controlled vocabulary, such as csv, sav, and dta.
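A small Python sketch of how the two properties might sit side by side on a Load command. The software and fileFormat names come from the change above; the `fileName` property, the helper `describe_load`, and the three-entry vocabulary (only the formats named in the justification) are assumptions, not the actual controlled vocabulary.

```python
# Only the formats named in the justification; the real vocabulary is not given here.
FILE_FORMATS = {"csv", "sav", "dta"}


def describe_load(file_name, software, file_format):
    """Build a Load command with separate software and fileFormat (assumed layout)."""
    if file_format not in FILE_FORMATS:
        raise ValueError(f"fileFormat not in controlled vocabulary: {file_format!r}")
    return {
        "$type": "Load",
        "command": "Load",
        "fileName": file_name,  # assumed property name
        "software": software,  # free text, e.g. a library function
        "fileFormat": file_format,  # controlled vocabulary
    }


load = describe_load("survey.csv", "pandas.read_csv", "csv")
print(load["software"], load["fileFormat"])
```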

  8. Add DateTimeConstant and TimeDurationConstant

    1. Nov. 19, 2020 (revised after e-mail discussion); Implemented Dec. 9, 2020

    2. New elements: DateTimeConstant and TimeDurationConstant

    3. Properties: DateTimeConstant

      1. ISO 8601 compliant string

    4. Properties: TimeDurationConstant

      1. ISO 8601 compliant string

    5. These constants will be used in expressions involving time. DateTimeConstant provides a way to enter a date, time, or date-time combination. TimeDurationConstant is a measurement of elapsed time, which is used in computations involving time.
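To show how the two kinds of constant interact, here is a Python sketch that adds an ISO 8601 duration to an ISO 8601 date-time. `parse_duration` is a hypothetical helper that handles only a subset of duration strings (weeks, days, hours, minutes, seconds); years and months are left out because `timedelta` cannot represent them. The example values are invented.

```python
import re
from datetime import datetime, timedelta

# Subset of ISO 8601 durations: PnWnDTnHnMnS (no years or months).
_DURATION = re.compile(
    r"^P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_duration(text):
    """Convert a duration string such as "P1DT12H" to a timedelta."""
    match = _DURATION.match(text)
    parts = {} if match is None else {
        unit: float(value) for unit, value in match.groupdict().items() if value is not None
    }
    # Reject non-matching strings, empty durations ("P"), and a dangling "T".
    if not parts or text.endswith("T"):
        raise ValueError(f"unsupported ISO 8601 duration: {text!r}")
    return timedelta(**parts)


start = datetime.fromisoformat("2020-11-19T08:30:00")  # a date-time constant
elapsed = parse_duration("P1DT12H")  # a duration constant
end = start + elapsed

print(end.isoformat())
```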

  9. Change Function Library Schema property from “required” to “isRequired” and “defaultValue”

    1. Oct. 15, 2020; Implemented Dec. 9, 2020

    2. SDTL element: Function Library Schema

    3. Property: required

    4. Justification: The required property currently works in two ways. It shows whether a parameter of a function is required (“yes”, “no”), or it gives the value of a default if there is one. This is not best practice. These uses will be separated into two properties: isRequired will be a Boolean (True/False), and defaultValue will hold the default value when there is one.

    5. Implementation: Update to the Function Library Schema document on GitLab
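The split can be pictured as a small migration step. The Python sketch below converts a legacy `required` value into the two new properties. The helper `split_required` is hypothetical, and it assumes that a parameter carrying a default value is not required, which the change description implies but does not state.

```python
def split_required(required):
    """Split a legacy `required` value into isRequired and defaultValue."""
    text = str(required).strip().lower()
    if text in ("yes", "no"):
        # The old property was acting as a required/optional flag.
        return {"isRequired": text == "yes", "defaultValue": None}
    # Anything else was a default value; assume such a parameter is optional.
    return {"isRequired": False, "defaultValue": required}


print(split_required("yes"))
print(split_required("0"))
```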

  10. Change sourceInformation from a single element to an array, i.e., cardinality will change from 1..1 to 0..n.

    1. Date: 13 July 2020; Implemented 30 September 2020

    2. SDTL element: CommandBase

    3. Property: sourceInformation

    4. Justification: There are some cases where it would be useful to refer to a discontinuous list of commands. For example, SAS allows more than one KEEP statement in a DATA step and keeps the union of the variables listed on the KEEP statements. In SDTL this would be consolidated into a single KeepVariables command.

    5. Implementation: Update CommandBase in SDTL COGS
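The KEEP example can be sketched in Python as follows. `consolidate_keep` is a hypothetical helper that merges several KEEP statements into one KeepVariables command with one sourceInformation entry per statement. The `variables` property is shown as a plain list of names for brevity, and the parsing is deliberately naive (no SAS variable ranges such as a1-a5).

```python
def consolidate_keep(statements):
    """Merge several SAS KEEP statements into one KeepVariables command."""
    variables = []
    source_information = []
    for text in statements:
        # "keep id age;" -> ["id", "age"]
        names = text.strip().rstrip(";").split()[1:]
        for name in names:
            if name not in variables:  # union, preserving first-seen order
                variables.append(name)
        # One sourceInformation entry per original statement.
        source_information.append({"originalSourceText": text})
    return {
        "$type": "KeepVariables",
        "command": "KeepVariables",
        "variables": variables,  # simplified; assumed layout
        "sourceInformation": source_information,
    }


keep = consolidate_keep(["keep id age;", "keep age income;"])
print(keep["variables"], len(keep["sourceInformation"]))
```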

  11. $type and command should be spelled the same way including capitalization.

    1. Date: 13 July 2020; Implemented 30 September 2020

    2. SDTL element: CommandBase

    3. Property: $type and command

    4. Justification: This may not be necessary, but differences in case or spelling between $type and command can lead to confusion.

    5. Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide
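A validator could enforce this convention with a check along the following lines. This is only a sketch; `type_matches_command` is a hypothetical helper, not part of any SDTL tool.

```python
def type_matches_command(element):
    """Return True when $type and command are both present and spelled identically."""
    return "$type" in element and element["$type"] == element.get("command")


print(type_matches_command({"$type": "KeepVariables", "command": "KeepVariables"}))
print(type_matches_command({"$type": "KeepVariables", "command": "keepVariables"}))
```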

  12. If the source language is case insensitive, the parser will change all variable names to either all caps or all lower case. The originalSourceText property of the SourceInformation element will show capitalization as it appears in the original script. A flag at the beginning of the SDTL script should say that variable names have been standardized.

    1. Date: 13 July 2020; Implemented 30 September 2020

    2. SDTL element: VariableReferenceBase

    3. Property:

    4. Justification: SPSS and SAS are case insensitive. This means that a variable may be “varx” in one place in a script and “VarX” in a different command in the same script. Case sensitive languages will interpret “varx” and “VarX” as two distinct variables.

    5. Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide
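A parser for a case-insensitive language might standardize names roughly as in this Python sketch. `standardize_names` is a hypothetical helper, the SPSS-like source line is invented, and the `variableNames` key is used only for illustration. The original capitalization survives in originalSourceText.

```python
def standardize_names(names, to_upper=False):
    """Map every variable name to a single case so "VarX" and "varx" coincide."""
    convert = str.upper if to_upper else str.lower
    return [convert(name) for name in names]


source_text = "COMPUTE VarX = varx + 1."  # invented example line

command = {
    "variableNames": standardize_names(["VarX", "varx"]),  # illustrative key only
    "sourceInformation": [{"originalSourceText": source_text}],
}

# Both spellings now refer to one variable.
print(set(command["variableNames"]))
```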

  13. There are three ways that optional properties can be omitted from the SDTL JSON file.

    1. The options are:

      1. The property is omitted -- used for single objects or arrays

      2. "property": null -- used for single objects or arrays

      3. "property": [] -- only used for arrays

    2. Date: 13 July 2020; Implemented 30 September 2020

    3. SDTL element: All

    4. Property: Null properties

    5. Justification: This clarifies how SDTL JSON should be written when a property is omitted.

    6. Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide
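A reader of SDTL JSON would then need to treat the three forms as equivalent. The Python sketch below shows one way to do that; `is_omitted` is a hypothetical helper, and it does not check whether the property is actually array-valued, so it accepts `[]` for any property.

```python
import json


def is_omitted(element, name):
    """True if `name` is absent, null, or an empty array in an SDTL JSON object."""
    value = element.get(name)
    return value is None or value == []


absent = json.loads('{"command": "Load"}')
null_value = json.loads('{"command": "Load", "sourceInformation": null}')
empty_array = json.loads('{"command": "Load", "sourceInformation": []}')

print([is_omitted(e, "sourceInformation") for e in (absent, null_value, empty_array)])
```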