Simple Codebook Meeting
May 11, 2015

Present: Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group continued its review of the mapping between DDI Codebook and DDI 4 – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646.

The group returned to the elements regarding availability and access. There is currently no archive information in DDI4, and this needs to be modeled, perhaps at the upcoming sprint. Some of the content of the use statement is not covered by the access object in Discovery in DDI4; this also needs to be modeled. SAML isn't useful for us because it is too high level. Both data and metadata may need access controls attached. We might look at this in the Datum discussion (covering not only columns but also rows), and also at attaching controls to the metadata. One option is to treat access like annotations, which can be attached to anything: access could have a relationship to annotated identifiable, so that any object could have an access control. Another solution is a relationship pointing from the access description to the object. This could make sense because an object could have different access policies when stored in different archives. This should also be discussed at the sprint. There is an Access Control XML language that we looked at but didn't decide on. Michelle will be representing CISER at the sprint and can express their needs in this area.
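As a rough illustration of the second option (a link from the access description to the object), the sketch below keeps policies outside the objects they govern, so the same object can carry a different policy in each archive. All class names, field names, and archive names here are hypothetical and are not DDI4 classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical access description; not a DDI4 class."""
    archive: str  # archive that holds this copy of the object
    rule: str     # free-text rule, e.g. "public" or "restricted"


@dataclass(frozen=True)
class Identifiable:
    """Stand-in for any identifiable object (variable, datum, study)."""
    object_id: str


@dataclass
class AccessRegistry:
    """Links policies to objects from the outside, the way annotations do."""
    links: dict = field(default_factory=dict)

    def attach(self, target: Identifiable, policy: AccessPolicy) -> None:
        # The policy points at the object; the object itself is unchanged.
        self.links.setdefault(target.object_id, []).append(policy)

    def policies_for(self, target: Identifiable, archive: str) -> list:
        # Only the policies that apply to this object in the given archive.
        return [p for p in self.links.get(target.object_id, [])
                if p.archive == archive]


income = Identifiable("var-income")
registry = AccessRegistry()
registry.attach(income, AccessPolicy("Archive A", "public"))
registry.attach(income, AccessPolicy("Archive B", "restricted"))
```

Because the link lives in the registry rather than on the object, the same approach would work for a row, a column, or a piece of metadata without changing those classes.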

Imputation is currently the same as it has been in 3. Generation Instructions and General Instructions seem to have the same text, and we need some clarification from Wendy on this. They can describe an Imputation procedure. This has not yet been brought up in 4. It would belong to methodology or fieldwork; it is in the Processing package now. We need clarification at the sprint.

Security in variable relates to the discussion above. 3.2 doesn't do much at the row level but this is becoming a requirement.

Embargo is in Simple Codebook, but this is basically a set of placeholders right now. This should be part of the Access Rights discussion at the sprint so we do this consistently. Where should this come from? A use case or the modeling team proposing an approach. We probably need both directions. Maybe two use cases – one from Bill for metadata and one from Ornulf for data.

Response Unit is not yet modeled and will come up in complex instrument. It can apply at the study and variable level. An equivalent should be covered in methodology.

For question elements, there is a container in Data Capture that will work for this and allow you to instantiate pre-, post-, and literal question as well as interviewer instructions. Statement is the container.

Invalid range is in Simple Codebook. How are we tying this to missing? In 3.2 and in Simple Codebook in 4 you can point to a managed missing values representation, and in that you can define ranges, for example that everything from this value to that value is a missing value. This is there by virtue of having been brought over from 3.2. The ISO 11404 notion of sentinel value (each instance variable has a set of such values, but it might point to the same represented variable) has been modeled to allow the valid set of data to be handled in different statistical packages. You have to represent the semantics in different ways. The Data Description group should handle this.
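The idea of a shared missing values representation with ranges can be sketched as follows. This is only an illustration under assumed names: `MissingRange`, `ManagedMissingValues`, the range bounds, and the labels are hypothetical and do not come from 3.2 or DDI4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MissingRange:
    """Every value from low to high (inclusive) counts as missing."""
    low: float
    high: float
    label: str

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


@dataclass(frozen=True)
class ManagedMissingValues:
    """A reusable set of missing ranges that several variables can point to."""
    ranges: tuple

    def classify(self, value: float) -> Optional[str]:
        # Return the label of the first matching range, or None if valid.
        for missing_range in self.ranges:
            if missing_range.contains(value):
                return missing_range.label
        return None


# One shared representation; two variables could both reference it.
shared = ManagedMissingValues((
    MissingRange(-9, -7, "missing"),
    MissingRange(998, 999, "refused"),
))
```

Calling `shared.classify(value)` returns a label for values inside a missing range and `None` for valid data, which is the distinction each statistical package then has to express in its own way.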

Undocumented Codes – they should have had a label but didn't get documented. Codebook is the obvious group to handle this.

Total Responses is another part of the documentation for variable and should be handled by Codebook. This is handled with a controlled vocabulary when you say what type of statistic it is.

Summary Statistics are in Complex Data Type. They are not in the Simple Codebook view now, but that view hasn't been built out yet, and we would need to include them in it.

In terms of Descriptive Text, all the variables in 4 inherit Description as members.

Simple Codebook
June 8, 2015

Present: Dan Gillman, Larry Hoyle, Jenny Linnerud, Oliver Hopt, Mary Vardigan

The group continued to review the spreadsheet mapping DDI 2.* to DDI4 and to note items that the modeling should take up.

Then the group turned to the metadata that the statistical packages include. Larry provided a spreadsheet that he and Achim had developed to show which metadata are included in each of the major statistical packages. It will be important for Codebook to contain all of this metadata. There are other ways of handling data, like SQL, that might also be appropriate. In the Big Data world, Python is becoming popular. Python is a general scripting language and has taken over the role that Perl had at one point. You can explicitly represent trees like JSON and XML, so it is very flexible. People have developed modules that do statistical kinds of things with Python.
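The point about representing trees can be shown with a small sketch using Python's standard `json` module. The variable description and its keys are invented for illustration and are not DDI element names.

```python
import json

# Hypothetical variable description; keys are illustrative only.
variable = {
    "name": "AGE",
    "label": "Age of respondent",
    "categories": [
        {"code": 1, "label": "Under 18"},
        {"code": 2, "label": "18 or older"},
    ],
}

as_json = json.dumps(variable, indent=2)  # nested Python structure -> JSON text
round_tripped = json.loads(as_json)       # JSON text -> nested Python structure
```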

Looking at all the software metadata from the statistical point of view is important. We need to make sure that everything in Larry's spreadsheet is accounted for in a meaningful way. We need to identify things that are not in the DDI 2.* spreadsheet. We can go through this all together or do assignments.

The number of significant digits is important in some scientific data. Whether the number has been rounded can also be important. This should be included in DDI4. In the 11179 community, there was a discussion of accuracy and precision, which is related to significant digits. The Data Description Team should address this. In an Instance Variable we may want to talk about significant digits, while for a Represented Variable we talk about accuracy.
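A small sketch of why significant digits need to be recorded as metadata: a binary float cannot tell 12.50 from 12.5, while the recorded text can. The helper name `significant_digits` is hypothetical, and it simply counts the digits as written, so trailing zeros in whole numbers such as 1200 are counted too.

```python
from decimal import Decimal


def significant_digits(recorded: str) -> int:
    """Count the digits in a value as it was written down.

    Decimal keeps trailing zeros ("12.50" stays four digits) and drops
    leading zeros ("0.0120" is three digits).
    """
    return len(Decimal(recorded).as_tuple().digits)


# As floats the two recordings are indistinguishable...
same_as_float = float("12.50") == float("12.5")

# ...but the recorded text still carries the difference.
digits_a = significant_digits("12.50")
digits_b = significant_digits("12.5")
```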

Larry and Dan will talk with the Data Description and Modeling teams about these issues.