Add text to SDTL Best Practices and Conventions:
Representing indexed arrays and lists in SDTL using VariableArrayDereference() and ValueArrayDereference()
SDTL does not include a data type for indexed arrays or lists, but the same functionality can sometimes be achieved using SDTL functions VariableArrayDereference() and ValueArrayDereference().
VariableArrayDereference() and ValueArrayDereference() both take two arguments. EXP1 is a number pointing to the location of the desired item in the list given as EXP2. EXP2 is an SDTL list expression (VariableListExpression or ValueListExpression), which may consist of a range expression (VariableRangeExpression, NumberRangeExpression, StringRangeExpression). The list expression must be repeated every time that the array dereference function is used.
For example, the following SAS code uses a SAS array of variables in a loop.
array musicArray {18} BIGBAND -- HVYMETAL ;
do i = 1 to 18 ;
  if (musicArray[i] EQ 1 OR musicArray[i] EQ 2) then musicLike2 = musicLike2 + 1 ;
end;
In SDTL, we would replace musicArray[i] with a VariableArrayDereference(EXP1, EXP2) in which EXP1 is an SDTL IteratorSymbolExpression for i and EXP2 is a VariableRangeExpression for variables BIGBAND to HVYMETAL.
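The dereference pattern can be illustrated with a minimal Python sketch. This is not SDTL and not a translator's output: the helper names variable_range and variable_array_dereference, the sample data, and the variable BLUGRASS are all invented for illustration. Positions are assumed to be 1-based, as in the SAS array, and the counter starts at zero rather than from an existing musicLike2 value.

```python
# Hypothetical data: one dict per row, keys are variable names in file order.
# BLUGRASS is an invented stand-in for the variables between BIGBAND and HVYMETAL.
rows = [
    {"BIGBAND": 1, "BLUGRASS": 2, "HVYMETAL": 5},
    {"BIGBAND": 3, "BLUGRASS": 4, "HVYMETAL": 1},
]


def variable_range(first, last, names):
    """Expand a first-to-last range into the list of variable names it covers."""
    start, stop = names.index(first), names.index(last)
    return names[start:stop + 1]


def variable_array_dereference(exp1, exp2):
    """Return the item at position exp1 (assumed 1-based) of the list exp2."""
    return exp2[exp1 - 1]


names = list(rows[0])
music_like2 = []
for row in rows:
    count = 0
    for i in range(1, 4):  # the SAS loop runs 1 to 18; this toy data has 3 variables
        # The range expression is written out again at every dereference.
        name = variable_array_dereference(
            i, variable_range("BIGBAND", "HVYMETAL", names)
        )
        if row[name] in (1, 2):
            count += 1
    music_like2.append(count)

print(music_like2)  # [2, 1]
```

Repeating the range expression inside the loop mirrors the rule that the list expression is restated at each use, since there is no named array to refer back to.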
Completed changes:
Copying a dataframe
Implemented 10 Sept 2021
The SDTL NewDataframe command can be used to copy a dataframe by using a consumesDataframe property.
Change text in NewDataframe to:
The NewDataframe command copies or creates a new dataframe. It can be used in two ways.
An existing dataframe can be copied to a new dataframe by using the consumesDataframe and producesDataframe properties of NewDataframe. The new dataframe will be a "deep" copy in the sense used in R and Python.
NewDataframe can also be used to create an empty dataframe of a specific size. In Stata, the "set obs #" command will create a dataframe with a user-defined number of rows. This may be used in simulations to preset a number of simulated observations, which are then filled with randomly generated data.
Add text to SDTL Best Practices and Conventions:
Deep copy of a dataframe
Python and R distinguish between a deep copy and a shallow copy (Python) or copy by reference (R). A deep copy creates a duplicate of a dataframe that is independent of the original. A shallow copy has a new name, but it points to the storage locations of the original dataframe. This acts as an alias for the original dataframe. If a deep copy is changed, the contents of the original dataframe are not affected. However, changing a shallow copy also changes the contents of the original dataframe. In SDTL, the NewDataframe command can be used to create deep copies. SDTL does not support shallow copies at this time.
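The difference can be seen in a small Python sketch using the standard copy module. A dict of column lists stands in for a dataframe here; that representation is an assumption made for illustration only and is not how any particular dataframe library stores data.

```python
import copy

# A toy "dataframe": a dict mapping column names to lists of values.
original = {"x": [1, 2, 3], "y": [4, 5, 6]}

deep = copy.deepcopy(original)   # independent duplicate of every column
shallow = copy.copy(original)    # new dict, but the column lists are shared

deep["x"][0] = 100      # the original is not affected
shallow["y"][0] = 400   # the original changes too, because the list is shared

print(original)  # {'x': [1, 2, 3], 'y': [400, 5, 6]}
```

Only the behavior of the deep copy corresponds to what NewDataframe produces.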
Change WeightVariable to ExpressionBase
Implemented 2 June 2021
WeightVariable is currently defined as VariableSymbolExpression, which means that it points to a variable. But it is possible for a weight to be an expression including a variable. The Stata manual gives this example:
regress y x1 x2 x3 [pweight=1/prob]
We can accommodate this by defining WeightVariable as ExpressionBase, which will allow both simple variables and complex expressions.
Collapse: add properties for CaseWise and ColumnWise deletion
Implemented 2 June 2021
Use enumeration for each property giving the name of the source language, so that users can refer to documentation for details about the behavior of the property in context.
Add row number function
Implemented 10 May 2021
The SDTL row_number() function will be defined as returning the current row number in the dataframe.
Stata, R, and Python have simple functions for selecting a subset of data by row number.
For example, dataFrame.iloc[2:4] will select the 3rd and 4th rows in the data frame. (Ranges in Python are 0-indexed and open on the right.)
In order to use the IfRows command in SDTL to select rows by row number, we need a way to access the row number as IfRows iterates through a dataframe.
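A rough Python sketch of the idea follows. The function if_rows is a hypothetical stand-in for the IfRows command, a list of dicts stands in for a dataframe, and row numbers are assumed to start at 1, which the text above does not specify.

```python
rows = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}, {"id": "e"}]


def if_rows(rows, condition):
    """Keep the rows for which condition(row_number, row) is true."""
    # Assumption: row numbers are 1-based.
    return [
        row
        for row_number, row in enumerate(rows, start=1)
        if condition(row_number, row)
    ]


# Select the 3rd and 4th rows, the same rows as the 0-indexed slice rows[2:4].
selected = if_rows(rows, lambda row_number, row: 3 <= row_number <= 4)
print([row["id"] for row in selected])  # ['c', 'd']
```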
Create Weight element
Implemented 10 May 2021
Add a weighting property to
Add a Weight element with two properties:
weightVariable
weightType: frequency (default), probability, Stata_aweight, Stata_iweight
The SDTL Collapse and Aggregate commands currently allow a weightVariable property. However, this property does not capture the four types of weights available in Stata.
SPSS and SAS use only “frequency” weights, but all languages have additional procedures for defining complex survey weights.
More properties will be added to Weight to cover complex sampling designs when we expand SDTL to include data created by analytical procedures, like regression.
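As a sketch only, the two properties could be assembled and checked like this in Python. The function make_weight is hypothetical, and weightVariable is simplified to a plain variable name, although in SDTL it is an expression.

```python
# The four weight types listed above; "frequency" is the default.
WEIGHT_TYPES = ("frequency", "probability", "Stata_aweight", "Stata_iweight")


def make_weight(weight_variable, weight_type="frequency"):
    """Build a Weight element as a dict, rejecting unknown weight types."""
    if weight_type not in WEIGHT_TYPES:
        raise ValueError(f"unknown weightType: {weight_type!r}")
    # Simplification: a bare variable name stands in for an SDTL expression.
    return {"weightVariable": weight_variable, "weightType": weight_type}


default_weight = make_weight("wt")
stata_weight = make_weight("wt", "Stata_aweight")
print(default_weight)  # {'weightVariable': 'wt', 'weightType': 'frequency'}
```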
Documentation: Factor subtypes
Implemented 10 May 2021
R and Python both include a categorical data type, which is called Factor in R and Categorical in Python. SDTL calls the type Factor.
Both R and Python allow Factor/Categorical variables to be either ordered or unordered. Only ordered factor variables can be sorted and used in greater/less than logical conditions. Unordered factors cannot be sorted, and they can only be used in equality conditions.
However, there are several differences in the ways that factors are implemented in R and Python. For example, factors in R are always string values, but factors in Python can be string or numeric.
Because of these differences between languages, Factor variables should be described using the subTypeSchema and subType properties in the SetDataType command. These can be implemented like this:
Python factors
subTypeSchema: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
subType: ordered, unordered
R factors
subTypeSchema: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors
subType: ordered, unordered
Change type of BooleanConstantExpression
Implemented 10 May 2021
Change the type of the value property of BooleanConstantExpression from string to Boolean, so that it appears as a Boolean in SDTL JSON. See http://c2metadata.gitlab.io/sdtl-docs/master/composite-types/BooleanConstantExpression/
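The practical effect on the serialized JSON can be shown with Python's json module. The dicts below contain only the value property; any other properties a real BooleanConstantExpression carries are left out of this sketch.

```python
import json

as_string = {"value": "true"}   # value typed as a string
as_boolean = {"value": True}    # value typed as a Boolean

print(json.dumps(as_string))    # {"value": "true"}
print(json.dumps(as_boolean))   # {"value": true}

# A consumer reading the Boolean form gets a real bool, with no string comparison.
parsed = json.loads(json.dumps(as_boolean))
print(parsed["value"] is True)  # True
```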
Update documentation to make argumentName required in function calls
Jan. 21, 2021; Implemented Mar. 11, 2021
SDTL element: argumentName in FunctionArgument
Change “Best Practices and Conventions” document
Add software and fileFormat to SDTL file description elements
Nov. 19, 2020; Implemented Dec. 9, 2020
SDTL elements: Load, Save, AppendFileDescription, MergeFileDescription
Properties: software and fileFormat
Justification: The Load and Save commands have a software property that can be used for either the software package or a file format. These functions should be separated into two properties: software and fileFormat. The software property will be used to describe the software used to read/write a file. This could include specifying libraries in R and Python, like “pandas.read_csv”. fileFormat should have a limited controlled vocabulary, e.g. csv, sav, dta, etc.
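A minimal Python sketch of the separated properties is shown below. The function describe_file and the fileName property are hypothetical, and the vocabulary contains only the three formats named above as examples; the actual controlled vocabulary is not defined here.

```python
# Illustrative subset only: the text gives csv, sav, and dta as examples.
FILE_FORMATS = {"csv", "sav", "dta"}


def describe_file(file_name, software, file_format):
    """Build a file description with separate software and fileFormat values."""
    if file_format not in FILE_FORMATS:
        raise ValueError(f"fileFormat not in controlled vocabulary: {file_format!r}")
    return {
        "fileName": file_name,      # hypothetical property name
        "software": software,       # free text, e.g. a library function
        "fileFormat": file_format,  # restricted to the controlled vocabulary
    }


load = describe_file("survey.csv", "pandas.read_csv", "csv")
print(load["software"], load["fileFormat"])  # pandas.read_csv csv
```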
Add DateTimeConstant and TimeDurationConstant
Nov. 19, 2020 (revised after e-mail discussion); Implemented Dec. 9, 2020
New elements: DateTimeConstant and TimeDurationConstant
Properties: DateTimeConstant
ISO 8601 compliant string
Properties: TimeDurationConstant
ISO 8601 compliant string
These constants will be used in expressions involving time. DateTimeConstant provides a way to enter a date, time, or date-time combination. TimeDurationConstant is a measurement of elapsed time, which is used in computations involving time.
Change Function Library Schema property from “required” to “isRequired” and “defaultValue”
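To illustrate how the two kinds of ISO 8601 strings combine in a computation, here is a Python sketch. The function parse_duration is written for this example and handles only the day, hour, minute, and second designators; year and month designators are deliberately left out because a timedelta cannot represent them.

```python
import re
from datetime import datetime, timedelta

# Partial ISO 8601 duration pattern: PnDTnHnMnS with every part optional.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Convert a duration such as 'P1DT2H30M' into a timedelta."""
    match = _DURATION.match(text)
    parts = (
        {key: int(value) for key, value in match.groupdict().items() if value is not None}
        if match
        else {}
    )
    if not parts:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**parts)


start = datetime.fromisoformat("2020-11-19T08:00:00")  # a date-time constant
elapsed = parse_duration("P1DT2H30M")                  # a time duration constant
end = start + elapsed
print(end.isoformat())  # 2020-11-20T10:30:00
```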
Oct. 15, 2020; Implemented Dec. 9, 2020
SDTL element: Function Library Schema
Property: required
Justification: The required property currently works in two ways. It shows whether a parameter of a function is required (“yes”, “no”), or it gives the value of a default if there is one. This is not best practice. These functions will be separated into two properties. isRequired will be a Boolean (True/False). defaultValue will hold a value when there is one.
Implementation: Update to the Function Library Schema document on GitLab
Change sourceInformation from a single element to an array, i.e., cardinality will change from 1..1 to 0..n.
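The separation of the required property can be sketched as a small Python conversion. The function split_required is hypothetical, and it assumes that any value other than "yes" or "no" is a default, and that a parameter with a default is not required.

```python
def split_required(required):
    """Map an old-style `required` value to isRequired and defaultValue."""
    if required == "yes":
        return {"isRequired": True, "defaultValue": None}
    if required == "no":
        return {"isRequired": False, "defaultValue": None}
    # Assumption: anything else is a default value, so the parameter is optional.
    return {"isRequired": False, "defaultValue": required}


print(split_required("yes"))  # {'isRequired': True, 'defaultValue': None}
print(split_required("10"))   # {'isRequired': False, 'defaultValue': '10'}
```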
Date: 13 July 2020; Implemented 30 September 2020
SDTL element: CommandBase
Property: sourceInformation
Justification: There are some cases where it would be useful to refer to a discontinuous list of commands. For example, SAS allows more than one KEEP statement in a DATA step. SAS keeps the union of the variables listed on the KEEP statements. In SDTL this would be consolidated into a single KeepVariables command.
Implementation: Update CommandBase in SDTL COGS
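The KEEP example could look roughly like this in Python. The function consolidate_keep and the variables property are hypothetical, and the statement parsing is deliberately naive (whitespace-separated names only, no variable ranges).

```python
def consolidate_keep(statements):
    """Merge several KEEP statements into one KeepVariables-style dict."""
    variables = []
    source_information = []
    for text in statements:
        # One sourceInformation entry per original statement.
        source_information.append({"originalSourceText": text})
        # Drop the trailing semicolon and the KEEP keyword, then take the names.
        for name in text.rstrip("; ").split()[1:]:
            if name not in variables:  # union, preserving first-seen order
                variables.append(name)
    return {
        "$type": "KeepVariables",
        "command": "KeepVariables",
        "variables": variables,  # hypothetical property name
        "sourceInformation": source_information,
    }


command = consolidate_keep(["KEEP id age;", "KEEP age income;"])
print(command["variables"])               # ['id', 'age', 'income']
print(len(command["sourceInformation"]))  # 2
```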
$type and command should be spelled the same way including capitalization.
Date: 13 July 2020; Implemented 30 September 2020
SDTL element: CommandBase
Property: $type and command
Justification: This may not be necessary, but differences in case or spelling between $type and command can lead to confusion.
Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide
If the source language is case insensitive, the parser will change all variable names to either all caps or all lower case. The originalSourceText property of the SourceInformation element will show capitalization as it appears in the original script. A flag at the beginning of the SDTL script should say that variable names have been standardized.
Date: 13 July 2020; Implemented 30 September 2020
SDTL element: VariableReferenceBase
Property:
Justification: SPSS and SAS are case insensitive. This means that a variable may be “varx” in one place in a script and “VarX” in a different command in the same script. Case sensitive languages will interpret “varx” and “VarX” as two distinct variables.
Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide
There are three ways that optional properties can be omitted from the SDTL JSON file.
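A small Python sketch of the standardization step follows. The function standardize, the variableName property, and the sample source line are invented for illustration; lower case is chosen arbitrarily, since either convention is allowed.

```python
def standardize(name, upper=False):
    """Fold a variable name to one case so that varx and VarX coincide."""
    return name.upper() if upper else name.lower()


source_line = "COMPUTE VarX = 1."  # hypothetical line from a case-insensitive script
reference = {
    "variableName": standardize("VarX"),  # hypothetical property name
    # The original capitalization survives in the source text.
    "sourceInformation": [{"originalSourceText": source_line}],
}

print(standardize("varx") == standardize("VarX"))  # True
print(reference["variableName"])                   # varx
```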
The options are:
The property is omitted -- used for single objects or arrays
"property": null -- used for single objects or arrays
"property": [] -- only used for arrays
Date: 13 July 2020; Implemented 30 September 2020
SDTL element: All
Property: Null properties
Justification: This clarifies how SDTL JSON should be written when a property is omitted.
Implementation: This will be added to the "SDTL Best Practices and Conventions" in the SDTL User Guide
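A reader of SDTL JSON can treat the three forms alike. The Python sketch below uses a hypothetical helper, optional_array, and uses sourceInformation as the example array property; the command objects are stripped down to the minimum needed.

```python
import json


def optional_array(obj, name):
    """Return the array stored under name, or [] for any of the omitted forms."""
    value = obj.get(name)  # None when the key is missing or set to null
    return [] if value is None else value


documents = [
    '{"command": "Load"}',                             # property omitted
    '{"command": "Load", "sourceInformation": null}',  # property set to null
    '{"command": "Load", "sourceInformation": []}',    # empty array
]

results = [optional_array(json.loads(doc), "sourceInformation") for doc in documents]
print(results)  # [[], [], []]
```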