Pending SDTL Changes

  1. Copying a dataframe

    1. The SDTL NewDataframe command can be used to copy a dataframe by using a consumesDataframe property.

    2. Change text in NewDataframe to:

      1. The NewDataframe command copies or creates a new dataframe. It can be used in two ways.

        An existing dataframe can be copied to a new dataframe by using the consumesDataframe and producesDataframe properties of NewDataframe. The new dataframe will be a "deep" copy in the sense used in R and Python.

        NewDataframe can also be used to create an empty dataframe of a specific size. In Stata, the "set obs #" command will create a dataframe with a user-defined number of rows. This may be used in simulations to preset a number of simulated observations, which are then filled with randomly generated data.

    3. Add text to SDTL Best Practices and Conventions:

      1. Deep copy of a dataframe
        Python and R distinguish between a deep copy and a shallow copy (Python) or a copy by reference (R). A deep copy creates a duplicate of a dataframe that is independent of the original. A shallow copy has a new name, but it points to the storage locations of the original dataframe, so it acts as an alias for the original. If a deep copy is changed, the contents of the original dataframe are not affected. However, changing a shallow copy also changes the contents of the original dataframe. In SDTL, the NewDataframe command can be used to create deep copies. SDTL does not support shallow copies at this time.
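The difference between an alias and a deep copy can be illustrated with a minimal Python sketch. It uses a plain dict of column lists as a stand-in for a dataframe, an assumption made so that the example needs only the standard library. The names `original`, `alias`, and `deep` are hypothetical.

```python
import copy

# A dict of column lists stands in for a dataframe (illustrative assumption).
original = {"id": [1, 2, 3], "score": [10, 20, 30]}

alias = original                 # a second name for the same storage
deep = copy.deepcopy(original)   # an independent duplicate

alias["score"][0] = 99           # also visible through `original`
deep["score"][1] = -1            # does not touch `original`

print(original["score"])  # [99, 20, 30]
print(deep["score"])      # [10, -1, 30]
```

Only the `deep` case corresponds to what NewDataframe is described as producing.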

 

Completed changes:

  1. Change WeightVariable to ExpressionBase

    1. Implemented 2 June 2021

    2. WeightVariable is currently defined as VariableSymbolExpression, which means that it points to a variable. But it is possible for a weight to be an expression that includes a variable. The Stata manual gives this example:

      regress y x1 x2 x3 [pweight=1/prob]

       

    3. We can accommodate this by defining WeightVariable as ExpressionBase, which will allow both simple variables and complex expressions.

  2. Collapse: add properties for CaseWise and ColumnWise deletion

    1. Implemented 2 June 2021

    2. Use enumeration for each property giving the name of the source language, so that users can refer to documentation for details about the behavior of the property in context.

  3. Add row number function

    1. Implemented 10 May 2021

    2. The SDTL row_number() function will be defined as returning the current row number in the dataframe.

    3. Stata, R, and Python have simple functions for selecting a subset of data by row number

      1. For example, dataFrame.iloc[2:4] will select the 3rd and 4th rows in the data frame. (Ranges in Python are 0-indexed and open on the right.)

    4. In order to use the IfRows command in SDTL to select rows by row number, we need a way to access the row number as IfRows iterates through a dataframe
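A minimal Python sketch of row selection driven by a per-row condition, in the spirit of IfRows combined with row_number(). The helper `select_rows` is hypothetical, and the sketch assumes that row numbers start at 1, which the text above does not state.

```python
# Rows of a small dataframe, represented as a list of dicts (illustrative).
rows = [{"name": n} for n in ["a", "b", "c", "d", "e"]]

def select_rows(rows, condition):
    """Keep rows for which condition(row_number, row) is true.

    row_number is assumed to be 1-based here; the SDTL definition may differ.
    """
    return [row for row_number, row in enumerate(rows, start=1)
            if condition(row_number, row)]

# Same selection as dataFrame.iloc[2:4]: the 3rd and 4th rows.
subset = select_rows(rows, lambda row_number, row: 3 <= row_number <= 4)
print([row["name"] for row in subset])  # ['c', 'd']
```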

  4. Create Weight element

    1. Implemented 10 May 2021. Add a weighting property to Aggregate and Collapse in place of weightVariable, with a required type of Weight element.

    2. Add a Weight element with two properties:

      1. weightVariable

      2. weightType: frequency (default), probability, Stata_aweight, Stata_iweight

    3. The SDTL Collapse and Aggregate commands currently allow a weightVariable property. However, this property does not capture the four types of weights available in Stata.

    4. SPSS and SAS use only “frequency” weights, but all languages have additional procedures for defining complex survey weights.

    5. More properties will be added to Weight to cover complex sampling designs when we expand SDTL to include data created by analytical procedures, like regression.
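The shape of the Weight element might be modeled as follows. This is a sketch only: the class names `Weight` and `WeightType` are hypothetical, and weightVariable is simplified to a plain string even though SDTL treats it as an expression.

```python
from dataclasses import dataclass
from enum import Enum

class WeightType(Enum):
    """The four weight types listed above."""
    FREQUENCY = "frequency"
    PROBABILITY = "probability"
    STATA_AWEIGHT = "Stata_aweight"
    STATA_IWEIGHT = "Stata_iweight"

@dataclass
class Weight:
    weightVariable: str  # simplification: a variable name, not an expression
    weightType: WeightType = WeightType.FREQUENCY  # frequency is the default

w = Weight("popwt")
print(w.weightType.value)  # frequency
```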

  5. Documentation: Factor subtypes

    1. Implemented 10 May 2021

    2. R and Python both include a categorical data type, which is called Factor in R and Categorical in Python. SDTL calls the type Factor.

    3. Both R and Python allow Factor/Categorical variables to be either ordered or unordered. Only ordered factor variables can be sorted and used in greater/less than logical conditions. Unordered factors cannot be sorted, and they can only be used in equality conditions.

    4. However, there are several differences in the ways that factors are implemented in R and Python. For example, factors in R are always string values, but factors in Python can be string or numeric.

    5. Because of these differences between languages, Factor variables should be described using the subTypeSchema and subType properties in the SetDataType command. These can be implemented like this:

      Python factors
      subTypeSchema: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
      subType: ordered, unordered

      R factors
      subTypeSchema: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors
      subType: ordered, unordered

       

  6. Change type of BooleanConstantExpression

    1. Implemented 10 May 2021

    2. Change the type of the value property of BooleanConstantExpression from string to Boolean, so that it appears as a Boolean in SDTL JSON. See http://c2metadata.gitlab.io/sdtl-docs/master/composite-types/BooleanConstantExpression/
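The effect on the serialized JSON can be seen in a short Python sketch. Only the `$type` and `value` properties are shown; the objects are illustrative fragments, not complete SDTL.

```python
import json

# Before the change: the value is carried as a string.
as_string = {"$type": "BooleanConstantExpression", "value": "true"}
# After the change: the value is a real Boolean.
as_boolean = {"$type": "BooleanConstantExpression", "value": True}

print(json.dumps(as_string))   # {"$type": "BooleanConstantExpression", "value": "true"}
print(json.dumps(as_boolean))  # {"$type": "BooleanConstantExpression", "value": true}
```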

  7. Update documentation to make argumentName required in function calls

    1. Jan. 21, 2021; Implemented Mar. 11, 2021

    2. SDTL element: argumentName in FunctionArgument

    3. Change “Best Practices and Conventions” document

  8. Add software and fileFormat to SDTL file description elements

    1. Nov. 19, 2020; Implemented Dec. 9, 2020

    2. SDTL elements: Load, Save, AppendFileDescription, MergeFileDescription

    3. Properties: software and fileFormat

    4. Justification: The Load and Save commands have a software property that can be used for either the software package or a file format.  These functions should be separated into two properties: software and fileFormat.  The software property will be used to describe the software used to read/write a file.  This could include specifying libraries in R and Python, like “pandas.read_csv”.  fileFormat should have a limited controlled vocabulary, e.g. csv, sav, dta, etc.
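A sketch of how the two properties could be kept apart, with fileFormat checked against a controlled vocabulary. The helper `describe_file` and the set `FILE_FORMATS` are hypothetical; the vocabulary here contains only the three formats named above.

```python
# Hypothetical controlled vocabulary; the text names only csv, sav, and dta.
FILE_FORMATS = {"csv", "sav", "dta"}

def describe_file(software, file_format):
    """Return software and fileFormat as separate properties."""
    if file_format not in FILE_FORMATS:
        raise ValueError(f"unknown fileFormat: {file_format!r}")
    return {"software": software, "fileFormat": file_format}

load = describe_file("pandas.read_csv", "csv")
print(load)  # {'software': 'pandas.read_csv', 'fileFormat': 'csv'}
```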

  9. Add DateTimeConstant and TimeDurationConstant

    1. Nov. 19, 2020 (revised after e-mail discussion); Implemented Dec. 9, 2020

    2. New elements: DateTimeConstant and TimeDurationConstant

    3. Properties: DateTimeConstant

      1. ISO 8601 compliant string

    4. Properties: TimeDurationConstant

      1. ISO 8601 compliant string

    5. These constants will be used in expressions involving time. DateTimeConstant provides a way to enter a date, time, or date-time combination. TimeDurationConstant is a measurement of elapsed time, which is used in computations involving time.
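To illustrate how the two kinds of ISO 8601 strings combine in a computation, here is a Python sketch. The standard library parses ISO 8601 date-times but has no duration parser, so `parse_duration` is a hypothetical helper that handles only days, hours, minutes, and seconds.

```python
import re
from datetime import datetime, timedelta

# Restricted ISO 8601 duration: PnDTnHnMnS (no years, months, or weeks).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def parse_duration(text):
    """Parse a restricted ISO 8601 duration string into a timedelta."""
    match = _DURATION.match(text)
    parts = {k: int(v) for k, v in match.groupdict().items() if v} if match else {}
    if not parts:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**parts)

start = datetime.fromisoformat("2020-11-19T09:00:00")  # a DateTimeConstant value
elapsed = parse_duration("P1DT2H30M")                  # a TimeDurationConstant value
print(start + elapsed)  # 2020-11-20 11:30:00
```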

  10. Change Function Library Schema property from “required” to “isRequired” and “defaultValue”

    1. Oct. 15, 2020; Implemented Dec. 9, 2020

    2. SDTL element: Function Library Schema

    3. Property: required

    4. Justification: The required property currently works in two ways. It shows whether a parameter of a function is required (“yes”, “no”), or it gives the value of a default if there is one. This is not best practice. These functions will be separated into two properties. isRequired will be a Boolean (True/False). defaultValue will hold a value when there is one.

    5. Implementation: Update to the Function Library Schema document on Gitlab
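The separation described in the justification could be expressed as a small conversion function. This is a sketch: `split_required` is a hypothetical helper, and it assumes that a parameter with a default value is not required.

```python
def split_required(required):
    """Map the old overloaded `required` value to isRequired and defaultValue."""
    if required == "yes":
        return {"isRequired": True, "defaultValue": None}
    if required == "no":
        return {"isRequired": False, "defaultValue": None}
    # Any other value is treated as a default; a parameter with a default
    # is assumed to be optional.
    return {"isRequired": False, "defaultValue": required}

print(split_required("yes"))  # {'isRequired': True, 'defaultValue': None}
print(split_required("10"))   # {'isRequired': False, 'defaultValue': '10'}
```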

  11. Change sourceInformation from a single element to an array, i.e., cardinality will change from 1..1 to 0..n.

    1. Date: 13 July 2020; Implemented 30 September 2020

    2. SDTL element: CommandBase

    3. Property: sourceInformation

    4. Justification: There are some cases where it would be useful to refer to a discontinuous list of commands. For example, SAS allows more than one KEEP statement in a DATA step. SAS keeps the union of the variables listed on the KEEP statements. In SDTL this would be consolidated into a single KeepVariables command.

    5. Implementation: Update CommandBase in SDTL COGS
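The KEEP example can be sketched in Python as follows. The function `consolidate_keeps` and the `variables` property name are hypothetical, and the statement parsing is deliberately naive; the point is that one command carries an array of sourceInformation entries, one per original statement.

```python
def consolidate_keeps(keep_statements):
    """Merge several SAS KEEP statements into one KeepVariables-style dict."""
    variables = []
    for text in keep_statements:
        # Naive parse: drop the trailing semicolon and the KEEP keyword.
        for name in text.rstrip(";").split()[1:]:
            if name not in variables:
                variables.append(name)  # union, in order of first appearance
    return {
        "$type": "KeepVariables",
        "command": "KeepVariables",
        "variables": variables,  # hypothetical property name
        "sourceInformation": [
            {"originalSourceText": text} for text in keep_statements
        ],
    }

cmd = consolidate_keeps(["KEEP id age;", "KEEP age income;"])
print(cmd["variables"])               # ['id', 'age', 'income']
print(len(cmd["sourceInformation"]))  # 2
```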

  12. $type and command should be spelled the same way including capitalization.

    1. Date: 13 July 2020; Implemented 30 September 2020

    2. SDTL element: CommandBase

    3. Property: $type and command

    4. Justification: This may not be necessary, but differences in case or spelling between $type and command can lead to confusion.

    5. Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide

  13. If the source language is case insensitive, the parser will change all variable names to either all caps or all lower case. The originalSourceText property of the SourceInformation element will show capitalization as it appears in the original script. A flag at the beginning of the SDTL script should say that variable names have been standardized.

    1. Date: 13 July 2020; Implemented 30 September 2020

    2. SDTL element: VariableReferenceBase

    3. Property:

    4. Justification: SPSS and SAS are case insensitive. This means that a variable may be “varx” in one place in a script and “VarX” in a different command in the same script. Case sensitive languages will interpret “varx” and “VarX” as two distinct variables.

    5. Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide
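A minimal sketch of the standardization step. The helper `standardize_reference` and the `variableName` key are hypothetical; the sketch lower-cases names, although all caps would serve equally well, and it keeps the original spelling in originalSourceText.

```python
def standardize_reference(name):
    """Lower-case a variable name while preserving the original spelling."""
    return {
        "variableName": name.lower(),  # hypothetical property name
        "sourceInformation": {"originalSourceText": name},
    }

references = [standardize_reference(n) for n in ["varx", "VarX", "VARX"]]
print({ref["variableName"] for ref in references})  # {'varx'}
print(references[1]["sourceInformation"]["originalSourceText"])  # VarX
```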

  14. There are three ways that optional properties can be omitted from the SDTL JSON file.

    1. The options are:

      1. The property is omitted -- used for single objects or arrays

      2. "property": null -- used for single objects or arrays

      3. "property": [] -- only used for arrays

    2. Date: 13 July 2020; Implemented 30 September 2020

    3. SDTL element: All

    4. Property: Null properties

    5. Justification: This clarifies how SDTL JSON should be written when a property is omitted.

    6. Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide
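A reader of SDTL JSON can treat the three forms as equivalent for array properties. The sketch below assumes a hypothetical helper `get_array` and uses sourceInformation as the example property.

```python
import json

def get_array(obj, name):
    """Return an optional array property, treating all omitted forms alike."""
    value = obj.get(name)  # None when the key is absent or set to null
    return [] if value is None else value

docs = ['{}', '{"sourceInformation": null}', '{"sourceInformation": []}']
print([get_array(json.loads(d), "sourceInformation") for d in docs])  # [[], [], []]
```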