Describing the Variable Cascade¶
Introduction¶
The DDI-4 standard is intended to address the metadata needs of the entire survey lifecycle. This document gives an account of variables in DDI-4. It lays out the advantages and uses of the typology of variables available in DDI, which is called the variable cascade.
Variables are the main way we think of data – in production, analysis, and dissemination. Metadata for variables are, then, proxies for descriptions of data, and the metadata for a variable composes a description of it. If all the variables in some data set are described, then, in theory, we understand the data therein. Therefore, a complete and straightforward set of metadata for variables is a requirement for an effective metadata standard and system. DDI achieves this through the use of the variable cascade.
The variable cascade arose because the metadata needed for variables can be layered. That is, the metadata can be separated into bundles, and the bundles form layers whose order corresponds to the natural order in which they are specified. In DDI, there are five layers.
The advantage of layering lies in the possibility of reuse. This is often expressed as the principle of “write once – use many”. Sometimes, for example, the metadata needed to describe two variables differ only at the third layer. In that case, the top two layers are written down once and reused for both descriptions. Differences can, of course, arise at any of the layers. It is reuse that makes metadata management such an effective approach: it not only reduces the amount of metadata needed but also increases compatibility and interoperability, because gratuitous differences between descriptions that should be the same are eliminated.
For variables, reusable descriptions are brittle in the sense that if one of the attributes describing a variable changes, then a new variable needs to be defined. This is especially true when considering the allowed values (the Value Domain) for a variable. Many small variations in value domains exist in production databases, yet these differences are often of the gratuitous kind (e.g., simple differences in the way some category is described that do not alter the meaning), differences in representation (e.g., simple changes from letter codes to numeric ones), or differences in the way missing (or sentinel) values are represented.
Differences in representations, including codes, are simplified by separating them from the underlying meaning. This is equivalent to the idea of allowing for synonyms and homonyms of terms. Through reuse, all representations with the same meaning are linked to the same concept.
Missing (or sentinel) values are important for processing statistical data, as there are multiple reasons why some data are not obtainable. Typically, these values are added to the value domain for a variable. However, each time the list of sentinel values changes during the processing lifecycle, the value domain changes, which forces the variable to change as well. Given that each stage of processing requires a different set of sentinel values, the number of variables mushrooms, and the resulting metadata overload is unmanageable and unsustainable. In addition, the codes for sentinel values often change when the processing environment does. For example, the codes for missing and refused in SAS differ from those in SPSS.
In the next sections, we illustrate the variable cascade through an example.
Variable Cascade¶
The variable cascade consists of 5 layers:
- Concept
- Conceptual Variable
- Represented Variable
- Instance Variable
- Value Mapping
Each layer is a specification in increasing granularity from a simple conceptualization down to how data are mapped onto a file. These specifications are designed to afford specific reuse among descriptions of variables.
The concept layer provides an overarching concept from which a variable is defined. Examples include gender, income, occupation, and wages.
The conceptual variable includes the concept as the meaning of the variable, the associated unit type, and the categories the variable may assign. Note, these categories might be enumerated or described by a set of rules. The categories include processing or sentinel categories. These include reasons why data might be missing or incomplete.
The represented variable includes everything already specified for the conceptual variable plus the way the categories are represented, the intended datatype, a unit of measure if applicable, and any precision (e.g., number of decimal places) if applicable.
The instance variable includes everything already specified for the represented variable plus the representations for any sentinel (processing) categories.
The value mapping provides the means to find and retrieve data corresponding to the instance variable defined above out of a record in a file. It includes the specification of a physical datatype.
Each of the upper four layers may be reused to help specify layers underneath.
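The five layers described above can be sketched as a chain of records, each holding a reference to the layer above it. This is only an illustrative model, not the DDI-4 schema: the class and field names are assumptions made for this sketch, and using "person" as the unit type for gender is likewise assumed.

```python
from dataclasses import dataclass
from typing import Tuple

Code = Tuple[str, str]  # (code, category) pair, e.g. ("m", "male")


@dataclass(frozen=True)
class Concept:
    name: str


@dataclass(frozen=True)
class ConceptualVariable:
    concept: Concept
    unit_type: str
    categories: Tuple[str, ...]           # substantive categories
    sentinel_categories: Tuple[str, ...]  # reasons data may be missing


@dataclass(frozen=True)
class RepresentedVariable:
    conceptual: ConceptualVariable
    codes: Tuple[Code, ...]               # representation of the categories
    intended_datatype: str


@dataclass(frozen=True)
class InstanceVariable:
    represented: RepresentedVariable
    sentinel_codes: Tuple[Code, ...]      # representation of sentinel categories


@dataclass(frozen=True)
class ValueMapping:
    instance: InstanceVariable
    physical_datatype: str


gender = ConceptualVariable(
    Concept("gender"), "person",
    ("male", "female", "other"), ("missing", "refused"),
)
letters = RepresentedVariable(
    gender, (("m", "male"), ("f", "female"), ("o", "other")), "Nominal"
)
instance = InstanceVariable(letters, ((".m", "missing"), (".r", "refused")))
mapping = ValueMapping(instance, "char")

# Walking up the references recovers the concept from the lowest layer.
concept_name = mapping.instance.represented.conceptual.concept.name
```

Because each record only points upward, a second represented variable or a second value mapping can reuse the same upper-layer objects without copying them.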
Details¶
Datatypes¶
Here we discuss the difference between an intended datatype and a physical datatype. Data are often designed to be interpreted in a particular way, but the available datatypes in some applications don’t make all those distinctions. For instance, data for monetary values (e.g., dollars and cents) are usually represented as floating point numbers with 2 digits of precision even though banks treat monetary amounts using scaled arithmetic. In scaled datatypes the remainders are just dropped, not rounded. It turns out, this causes a problem.
Take the average of the following amounts: $1.21, $1.22, and $1.25. The exact average is $1.2266..., with the 6 repeating. As a scaled number with 2 decimal places, the answer is $1.22, because the remainder is dropped. As a floating point number rounded to 2 digits of precision, the answer is $1.23. Therefore, the intended datatype is sometimes not adequately accounted for in available software.
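The arithmetic can be checked with a short sketch using Python's standard `decimal` module. Treating "scaled" as truncation to two decimal places (`ROUND_DOWN`) and the floating point case as round-half-up is an assumption of this sketch, chosen to match the description in the text.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

amounts = [Decimal("1.21"), Decimal("1.22"), Decimal("1.25")]
mean = sum(amounts) / len(amounts)  # 1.2266... (repeating)

cents = Decimal("0.01")
scaled = mean.quantize(cents, rounding=ROUND_DOWN)      # remainder dropped
rounded = mean.quantize(cents, rounding=ROUND_HALF_UP)  # rounded to 2 places

print(scaled)   # 1.22
print(rounded)  # 1.23
```

The one-cent difference comes entirely from how the remainder is handled, which is why the intended datatype needs to be recorded separately from the physical one.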
Cascade¶
Technically, the variable cascade includes the conceptual, represented, and instance variables. We added the value mapping to include how the data from some variable are mapped to a file. We added the concept to describe a variable at its most basic.
The value mapping satisfies a need that the instance variable could not. At first, it was thought the instance variable should change for every data set. The following example shows why this is ineffective. Suppose a study disseminates data in two file formats with the same sentinel value representations. Then just one instance variable is needed, yet two value mappings are required, corresponding to the two formats.
For an ongoing study that disseminates data in the same way each cycle, the only thing that changes from cycle to cycle is the time stamp. In this case, even the value mappings stay the same, i.e., they are reused.
Example¶
Concepts¶
Concepts:
- gender
- person
- male
- female
- other
Note – gender is the concept we will use for our variable.
Represented Variables¶
Reuse the conceptual variable already specified above
Substantive Values:
- <m, male>
- <f, female>
- <o, other>
Intended Datatype:
- Nominal
Note – Nominal is a kind of categorical data where the categories have no order given.
Again, reuse the conceptual variable above
Substantive Values:
- <0, male>
- <1, female>
- <2, other>
Intended Datatype:
- Nominal
Notice that we have 2 represented variables, because the new information provided at this layer, the codes for the categories, differs between the two cases. Note also that the SAME conceptual variable is used in both cases.
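As a minimal sketch of this point, the two code lists can be written as lookup tables over one shared set of categories. The names `CATEGORIES`, `LETTER_CODES`, `NUMERIC_CODES`, and `decode` are hypothetical and used only for illustration; the sample data are invented.

```python
# Categories come from the shared conceptual variable.
CATEGORIES = ("male", "female", "other")

# Two represented variables: same categories, different codes.
LETTER_CODES = {"m": "male", "f": "female", "o": "other"}
NUMERIC_CODES = {"0": "male", "1": "female", "2": "other"}


def decode(values, codes):
    """Translate coded values into their categories."""
    return [codes[value] for value in values]


# The same four responses, stored under each representation.
from_letters = decode(["m", "f", "o", "f"], LETTER_CODES)
from_numbers = decode(["0", "1", "2", "1"], NUMERIC_CODES)

# Once decoded, the data are identical: only the representation differed.
same_meaning = from_letters == from_numbers
```

Linking both code lists to the same categories is what lets data recorded under either representation be compared directly.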
Instance Variables¶
Reuse the represented variable #1
Sentinel Values:
- <.m, missing>
- <.r, refused>
Actual datatype:
- character
These sentinel values might typically be used in a SAS application.
Reuse the represented variable #1
Sentinel Values:
- <-999, missing>
- <-998, refused>
Actual datatype:
- character
These sentinel values might typically be used in an SPSS application.
Note again, the differences here are with the sentinel values, and everything else (above it in the cascade) is the same.
Reuse the represented variable #2
Sentinel Values:
- <.m, missing>
- <.r, refused>
Actual datatype:
- character
These sentinel values might typically be used in a SAS application.
Reuse the represented variable #2
Sentinel Values:
- <-999, missing>
- <-998, refused>
Actual datatype:
- character
These sentinel values might typically be used in an SPSS application.
Note again, the differences here are with the sentinel values, and everything else (above it in the cascade) is the same.
The result here is 4 different variables. There are 2 versions of the substantive codes and 2 versions of the sentinel codes. The combinations result in 4 instance variables, and these may be associated with data sets.
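The count of 4 can be reproduced by combining the two lists of substantive codes with the two lists of sentinel codes. The sketch below is illustrative only: the labels "RV1", "RV2", "SAS-style", and "SPSS-style" are names assumed here, and each instance variable is reduced to a single code-to-category dictionary.

```python
from itertools import product

# Substantive codes from the two represented variables.
represented = {
    "RV1": {"m": "male", "f": "female", "o": "other"},
    "RV2": {"0": "male", "1": "female", "2": "other"},
}

# Sentinel codes from the two processing environments.
sentinels = {
    "SAS-style": {".m": "missing", ".r": "refused"},
    "SPSS-style": {"-999": "missing", "-998": "refused"},
}

# Each instance variable pairs one represented variable with one sentinel set.
instance_variables = {
    (rv, sv): {**represented[rv], **sentinels[sv]}
    for rv, sv in product(represented, sentinels)
}

print(len(instance_variables))  # 4
```

Only the two dictionaries at the top are written by hand; the four instance variables are derived by reuse.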
Value Mapping¶
Suppose the data for a gender variable are stored in a CSV file. The value mapping for this variable might be as follows:
physicalDataType | char |
defaultDecimalSeparator | n/a |
defaultDigitGroupSeparator | n/a |
numberPattern | n/a |
defaultValue | “” |
nullSequence | “” |
format | %c #from the C formats for printing a character |
length | 1 |
minimumLength | 1 |
maximumLength | 1 |
scale | n/a |
decimalPositions | n/a |
required | true |
The main issues for the cascade are the physical datatype and the fact that the mapping exists for this variable. Its existence means the data are written somewhere.
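To make the role of the mapping concrete, the sketch below reads a gender column from CSV text and checks each field against a few of the properties listed above. It is a hedged illustration rather than a DDI-4 implementation: the function `read_column`, the dictionary layout, and the sample data are assumptions, and only the properties relevant to a one-character field are used.

```python
import csv
import io

# A subset of the value mapping properties, held as a plain dictionary.
value_mapping = {
    "physicalDataType": "char",
    "nullSequence": "",
    "minimumLength": 1,
    "maximumLength": 1,
    "required": True,
}


def read_column(text, column, mapping):
    """Return the values of one CSV column, checked against the mapping."""
    values = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = row[column]
        if raw == mapping["nullSequence"]:
            if mapping["required"]:
                raise ValueError(f"{column}: a value is required")
            values.append(None)
            continue
        if not mapping["minimumLength"] <= len(raw) <= mapping["maximumLength"]:
            raise ValueError(f"{column}: unexpected length for {raw!r}")
        values.append(raw)
    return values


sample = "id,gender\n1,m\n2,f\n3,o\n"
genders = read_column(sample, "gender", value_mapping)
print(genders)  # ['m', 'f', 'o']
```

The codes returned here would then be interpreted through the instance variable, which supplies both the substantive and the sentinel code lists.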