SDTL Working Group 15 April 2021

  • Announcements

    • Papers about the C2Metadata Project have been accepted for two conferences

      • International Digital Curation Conference April 19-20

        • The paper will be published automatically in the conference proceedings, and we have also submitted it for review to the International Journal of Digital Curation

      • Provenance Week July 19-22

        • This paper will be published in the conference proceedings

        • The focus of this paper is on what we have learned

    • The group planning an extension to SDTL is meeting next week

  • SDTL issues

  1. Add row number function

    1. The SDTL row_number() function will be defined as returning the current row number in the dataframe.

    2. Stata, R, and Python have simple functions for selecting a subset of data by row number

      1. For example, dataFrame.iloc[2:4] will select the 3rd and 4th rows in the data frame. (Ranges in Python are 0-indexed and open on the right.)

    3. In order to use the IfRows command in SDTL to select rows by row number, we need a way to access the row number as IfRows iterates through a dataframe

    4. Discussion:

      1. This issue came up in work on the Stata Parser. The Collapse command in Stata has an option for choosing which rows are to be included in the computation.

      2. In R, the rownames command will report the row number.
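A minimal sketch of the point in items 2 and 3 above, written in plain Python so that it has no dependencies. It compares a positional slice (the same 0-indexed, right-open convention as dataFrame.iloc[2:4]) with a row-by-row filter that needs access to the current row number. The helper row_number_filter and the toy data are hypothetical, and the 1-based numbering is an assumption; the minutes do not say whether row_number() starts at 0 or 1.

```python
# Toy rows standing in for a dataframe (hypothetical data).
rows = [
    {"id": 10, "score": 3.5},
    {"id": 11, "score": 4.0},
    {"id": 12, "score": 2.5},
    {"id": 13, "score": 5.0},
    {"id": 14, "score": 4.5},
]


def row_number_filter(rows, condition):
    """Keep the rows whose row number satisfies `condition`.

    Hypothetical helper: it mimics a row-wise test that can see the
    current row number, which is what row_number() would provide.
    Row numbers are assumed to be 1-based here.
    """
    return [row for number, row in enumerate(rows, start=1) if condition(number)]


# Positional slice: 0-indexed and open on the right, so positions 2 and 3,
# which are the 3rd and 4th rows.
by_slice = rows[2:4]

# The same selection expressed as a condition on the row number.
by_row_number = row_number_filter(rows, lambda number: 3 <= number <= 4)

print(by_slice == by_row_number)
```

The slice form has no row-wise equivalent unless the row number is available inside the condition, which is the gap the proposed function fills.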

  2. Create Weight element

    1. Add a weighting property to Aggregate and Collapse in place of weightVariable. The required type of this property is the Weight element.

    2. Add a Weight element with two properties:

      1. weightVariable

      2. weightType: frequency (default), probability, Stata_aweight, Stata_iweight

    3. The SDTL Collapse and Aggregate commands currently allow a weightVariable property. However, this property does not capture the four types of weights available in Stata.

    4. SPSS and SAS use only “frequency” weights, but all languages have additional procedures for defining complex survey weights.

    5. More properties will be added to Weight to cover complex sampling designs when we expand SDTL to include data created by analytical procedures, like regression.

    6. Discussion

      1. This change will allow us to add more properties to the Weight element to handle more complex types of weights.

      2. SPSS and Stata have two ways of applying weights. There is a simple way that is used mostly for frequency weights. They also have a separate system for applying complex sampling weights.

      3. Stata has 4 types of weights. Stata also has a separate system for complex sampling weights.
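The proposed Weight element could be sketched as follows. This is only an illustration of the two properties and the default listed above, not the SDTL schema itself: the Python class names, the use of a plain string for weightVariable, and the weight_to_json helper are all assumptions made for this example.

```python
import json
from dataclasses import dataclass
from enum import Enum


class WeightType(Enum):
    """The four weightType values listed in the proposal."""
    FREQUENCY = "frequency"
    PROBABILITY = "probability"
    STATA_AWEIGHT = "Stata_aweight"
    STATA_IWEIGHT = "Stata_iweight"


@dataclass
class Weight:
    # Assumption: the variable is represented by its name as a plain string.
    weightVariable: str
    # "frequency" is the default according to the proposal.
    weightType: WeightType = WeightType.FREQUENCY


def weight_to_json(weight):
    """Serialize a Weight to a JSON object (hypothetical helper)."""
    return json.dumps(
        {
            "weightVariable": weight.weightVariable,
            "weightType": weight.weightType.value,
        }
    )


default_weight = weight_to_json(Weight("wt"))
stata_weight = weight_to_json(Weight("wt", WeightType.STATA_AWEIGHT))
print(default_weight)
print(stata_weight)
```

Keeping the weight in its own element means that later properties for complex sampling designs can be added to Weight without changing the commands that carry it.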

  3. Documentation: Factor subtypes

    1. R and Python both include a categorical data type, which is called Factor in R and Categorical in Python. SDTL calls the type Factor.

    2. Both R and Python allow Factor/Categorical variables to be either ordered or unordered. Only ordered factor variables can be sorted and used in greater/less than logical conditions. Unordered factors cannot be sorted, and they can only be used in equality conditions.

    3. However, there are several differences in the ways that factors are implemented in R and Python. For example, factors in R are always string values, but factors in Python can be string or numeric.

    4. Because of these differences between languages, Factor variables should be described using the subTypeSchema and subType properties in the SetDataType command. These can be implemented like this:

    5. Python and R factors:

      1. Python factors

        • subTypeSchema: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

        • subType: ordered, unordered

      2. R factors

        • subTypeSchema: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors

        • subType: ordered, unordered

    6. Discussion

      1. George presented this for the information of the WG.

      2. The issue here is that R and Python implement Factor data types in different ways.

      3. SDTL uses a minimal set of data types. It describes more specific data types by referencing external data type schemas. This allows SDTL to describe software-specific types without developing its own controlled vocabulary.

      4. Dan asked how one would translate these different types of weights if one were converting from SDTL to another language. George replied that one would need to look at the specific meaning of the type of weight in the external documentation. He argued that this is a case where we cannot build all of the functionality for different weighting systems into SDTL, and this solution allows us to refer the user to a source that explains all of the particulars.

      5. George asked Dan if “bases” in COGS are always hierarchical. Can we have a base that cuts across other bases? Dan said that cross-cutting bases are not possible in COGS.

      6. Dan recommended against putting the weighting property in CommandBase. He proposed putting this property only in the SDTL commands that need it. At this time, only the Aggregate and Collapse commands use weights. We could create a separate class of commands in the future when we start to include analysis commands, but we don’t know how that will develop yet. This proposal was accepted.
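The ordered/unordered distinction described under "Documentation: Factor subtypes" can be illustrated on the Python side with pandas. This is a small sketch with made-up data, and it covers only comparison conditions and numeric categories; it makes no claim about sorting behavior.

```python
import pandas as pd

levels = ["low", "medium", "high"]
values = ["medium", "low", "high", "low"]

# The same values as an ordered and as an unordered Categorical.
ordered = pd.Series(pd.Categorical(values, categories=levels, ordered=True))
unordered = pd.Series(pd.Categorical(values, categories=levels, ordered=False))

# Ordered categoricals accept greater/less than conditions.
above_low = (ordered > "low").tolist()

# Unordered categoricals accept equality conditions.
is_low = (unordered == "low").tolist()

# A greater/less than condition on an unordered categorical raises TypeError.
try:
    unordered > "low"
    unordered_inequality_allowed = True
except TypeError:
    unordered_inequality_allowed = False

# Categories in pandas may also be numeric rather than strings.
numeric = pd.Categorical([3, 1, 2, 1])
numeric_categories = numeric.categories.tolist()

print(above_low, is_low, unordered_inequality_allowed, numeric_categories)
```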

  4. Should the type of the value property of BooleanConstantExpression be changed from string to Boolean, so that it appears as a Boolean in SDTL JSON? See http://c2metadata.gitlab.io/sdtl-docs/master/composite-types/BooleanConstantExpression/

    1. Discussion

      1. The COGS system supports Boolean as a data type, and the current specification should be changed to use Boolean.

      2. George asked if there was a standard way to format Boolean constants. Dan replied that the formatting depends upon the serialization. In JSON, true and false are written in all lowercase.

      3. Dara did a test, and she found that sorting does work with unordered Factors in R.
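To make the serialization point in item 4 concrete, the sketch below uses Python's standard json module to show how a string value and a Boolean value appear in JSON. The dictionary layout, including the "$type" key, is an assumption made for illustration and is not taken from the SDTL specification.

```python
import json

# Current form: the value is typed as a string (layout is hypothetical).
as_string = {"$type": "BooleanConstantExpression", "value": "true"}

# Proposed form: the value is typed as a Boolean.
as_boolean = {"$type": "BooleanConstantExpression", "value": True}

string_json = json.dumps(as_string)
boolean_json = json.dumps(as_boolean)

# The string form keeps its quotation marks; the Boolean form is
# written as the bare lowercase literal true.
print(string_json)
print(boolean_json)

# Reading the Boolean form back yields a real Boolean, not text.
round_trip = json.loads(boolean_json)["value"]
```

With the Boolean type, a consumer of SDTL JSON does not need to parse the text "true" or "false" to recover the value.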