
 Simple Codebook View Team

 



Minutes Codebook 2016-03-01

 

Present: Michelle Edwards, Dan Gillman, Steve McEachern, Oliver Hopt, Jon Johnson, Larry Hoyle

Reviewing Michelle’s addition to the view spreadsheet.

We need a view

Unclear if we need a package

What other classes/objects are we likely to need?

 - Study

 - Contributor and a Role CV

 - Notes class

Need something like 2.x Study

Stats agencies may not need Study

Michelle looked at DDI Lite and the “Archives using” column and cut out unused elements. She cut down the DDI-LITE profile even further to suggest items that we think are essential.

Variable description looks like not being used by archives but we know we need that

Column E has mapping to DDI4

Can we come up with an (incomplete) list of methodologies that we need?

Last meeting: Discussion on methodology. We discussed the need for a reference document or a descriptive statement that we may wish to refer to for describing a methodology. We may have a need for a more detailed description, or even a call to a process, but this is an initial high-level description only.

- Distinction between Usage (in the Methodology model) and a text description (what we are considering here). We would have a reference to the more detailed usage.

- This will be raised by DG, LH and OH in the modelling group for a preferred approach.

How do we handle short descriptive statements in the codebook?

A codebook will not necessarily have the detail to drive a process. What would a template have?

There is usage in the methodology model but this is something higher level than that.  A codebook would then have a reference to the more detailed metadata.

Does the modeling group have a preferred approach to this?

Bring this up at the modeling meeting tomorrow.

What in DDI4 corresponds to the note in DDI-C?

 

Variables

Do we want to reproduce the variable cascade in a codebook? Could it just be the IV? Or will we need the references to further up the cascade - RV and CV

In the current DDI4 model we have references in the InstanceVariable to RepresentedVariable and ConceptualVariable
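
The cascade question can be made concrete with a minimal sketch. It assumes only what is stated above, namely that an InstanceVariable may reference a RepresentedVariable and a ConceptualVariable; the field names and the `cascade` helper are hypothetical and are not taken from the DDI4 model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConceptualVariable:
    name: str


@dataclass
class RepresentedVariable:
    name: str
    # Optional link further up the cascade (hypothetical field name).
    conceptual: Optional[ConceptualVariable] = None


@dataclass
class InstanceVariable:
    name: str
    # Both references are optional, so a codebook can carry the IV alone.
    represented: Optional[RepresentedVariable] = None
    conceptual: Optional[ConceptualVariable] = None


def cascade(iv: InstanceVariable) -> List[str]:
    """Return the names reachable from an InstanceVariable, most specific first."""
    names = [iv.name]
    if iv.represented is not None:
        names.append(iv.represented.name)
    # Prefer a direct CV reference; otherwise follow the RV's reference, if any.
    cv = iv.conceptual or (iv.represented.conceptual if iv.represented else None)
    if cv is not None:
        names.append(cv.name)
    return names
```

A codebook that records only the InstanceVariable still works with this shape; the references simply stay empty.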

The discussion of Variable use led naturally into the link from Variable to Question. There are also Variables that do NOT have Questions (e.g. derived variables, instrument measures, etc.).

What do we really want to be supporting in the Codebook view? Steve and Michelle to develop use case(s) to identify requirements.

What about references to reuse of others' materials/variables/etc.?

Larry suggests that the codebook should be able to reference such content, where it is available. The question becomes how to do that referencing, and to what extent?

Steve and Jon's query: what form would that codebook take?

Question mode has an impact – do we want to record this in a codebook?

In multi-mode surveys, a codebook could contain different forms of the questions for each mode.

Example: BLS CES has multiple modes, and the codebook would need to account for all of them.

In the DDI2 world, when people think of variables they think of questions.

In DDI2 you can’t really describe an instrument. Many variables don’t come from questions.

A variable focus is more general than a question focus and should be our focus.

DDI2.5 moves to a variable focus.

How far should we move into a focus on questions?

Steve and Michelle will put together a use case.

Is a reference to a RepresentedVariable etc. meaningful unless there is a repository containing it?

Is a codebook for reuse? Information should be included by instance in the codebook but references to reused elements could be included. What if the target objects are not DDI encoded?

Do we want references at each possible point? Or just at the InstanceVariable?

What would a PDF look like? A weakness of a codebook is that it is stand-alone. We should have a side email discussion of how we include pointers.

No one will want to retool existing study documentation. These features need to be optional.

Two discussions for next meeting:

1. Steve and Michelle for Variable use cases

2. Larry and Dan to lead a discussion on what a "codebook" might look like in the 21st century / RDF / Linked Data world? How far can we move towards that new Codebook form?

 

 


Codebook meeting

2 February 2016

Attending: Dan, Michelle, Steve, Oliver, Jon, Larry, Jared

There’s some lack of clarity about where this group is at.  Discussed what to include in simple codebooks.  One idea is to review the spreadsheet of common elements (summary of CESSDA) and build on that.  Essentials seem to include: enough information to read the data into statistical package, label values, understand universe, understand what measure means so you can interpret the data, attribution information.  Another idea is to look at examples of simple codebooks, identify what they use, and then map to a model.

We need to be careful to keep things simple.  Even older versions of DDI 2 weren’t exactly simple.

If we nail down definitions, then do we make instances of previous versions incompatible?  As we define what information elements we want in DDI 4.0, we can specify which element you want in 2 if you’re going backwards.  

Next steps:

  1. Michelle will go through spreadsheet and narrow down to those elements that are DDI Lite and any others that are heavily used (e.g., key words).

  2. Will paste those elements into new sheet within the spreadsheet.


November 23, 2015

Present: Dan Gillman, Michelle Edwards, Steve McEachern, Larry Hoyle

  • We want to incorporate everything from the InstanceVariable
  • Add in the connection to Question
  • Structure of the physical representation
  • We want to describe DDI-C using DDI4 elements.
  • Reuse would make some codebook instances shorter. Will people think that referencing RepresentedVariable and ConceptualVariable is required even if those references are optional?
  • What are the next directions for Codebook? Think about surveying big Codebook users, IHSN and Nesstar users in particular, along with 5-6 archives
  • Where Nesstar goes these users will follow
  • Cost will be primary driver for folks to migrate from 2.x to 4.x – some see the benefits of the DDI-L extensions
  • We need a migration path from 2.x to 4.x
  • 4.x is flexible enough that the migration path doesn’t need to be well defined
    • Should be based on your needs and what you think is appropriate first step to reuse
  • Variable bank, question bank, and Universes/Populations may be the natural first step to migration but each may present a different migration path
  • We may be able to recommend different paths
  • ISO Community – have technical reports – series of recommendations that folks ought to follow – think of it as “Best Practices” – these may exist but they provide guidance rather than how-to instructions
  • This is something we should seriously consider doing – maybe a Grad Student project
    • Jane Greenberg, at Drexel University – great opportunity to collaborate
    • Dan G may reach out to Jane to start a conversation
  • Back to how are we going to build codebook
  • We want to create a model-based Codebook in 4.x rather than a way to create the XML from 4.x to put into 2.x

 

    • This way we can do things more efficiently
    • Create an attribute that states it is being used for Purpose A or Purpose B
  • We could document how the information could be transferred without having a one-to-one relationship between objects.
  • To implement codebook in 4.x we need to describe attributes and their purpose
    • Examples:
    • Title / Alternate title / Parallel title -> have an attribute with a Controlled Vocabulary for what kind of title it is
    • Similar situation for Roles – in Codebook we have a number of different roles; let’s pare that down and use Agent, with a CV and a usage attribute that states Codebook - Roles – we recommended the Credit Taxonomy.
  • 1 object that covers a number of Codebook XML elements
  • Compactness will make it easier to maintain over the years – these could include these areas:
    • Citations
    • Publications
    • Related Materials
    • Methodology
  • Cluster elements?
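
The Title example above (one object plus a controlled vocabulary covering several Codebook XML elements) could look roughly like the following sketch. The class, the `typeOfTitle` attribute, and the vocabulary values are hypothetical; the mapping keys are the DDI Codebook element names `titl`, `altTitl` and `parTitl`, used here for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class TitleType(Enum):
    """Hypothetical controlled vocabulary for the kind of title."""
    MAIN = "main"
    ALTERNATE = "alternate"
    PARALLEL = "parallel"


# Assumed mapping from DDI Codebook element names to vocabulary terms.
DDI_C_TITLE_ELEMENTS = {
    "titl": TitleType.MAIN,
    "altTitl": TitleType.ALTERNATE,
    "parTitl": TitleType.PARALLEL,
}


@dataclass
class Title:
    text: str
    typeOfTitle: TitleType  # a single attribute replaces three separate elements


def from_ddi_c(element_name: str, text: str) -> Title:
    """Build a Title from a DDI Codebook element name and its text content."""
    return Title(text=text, typeOfTitle=DDI_C_TITLE_ELEMENTS[element_name])
```

The same pattern (one class, one typed attribute) would apply to Roles on Agent; the transfer back to 2.x is then a lookup rather than a one-to-one object mapping.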

Goal for next Meeting – December 7, 2015:

  • Review Codebook and see how we can handle current Codebook elements
    • Clusters that can stand on their own – then figure out how we can do this
    • What we need and how to manage it – then take to modellers
  • Going forward – we will review and  look at clustering elements in the Google spreadsheet. What are different uses of the same structure?
 

 

 

 


November 9, 2015

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The group discussed whether Data Capture had made enough progress to enable Codebook to move forward. Mary will get in touch with Barry about this.

In terms of Oliver's model (the second model he proposed), the next step would be to bring in information from other groups. Access conditions was the only area not yet covered. We need to ensure that everything in Oliver's model is covered (except for Access Conditions). Oliver will go through the group's spreadsheet and map to this model to ensure full coverage.

We also need to ensure that we have adequate methodology information. We also need to be sure that full file level documentation is enabled (not just study level).

And do we want to include all of the datum level information for reuse? This may be too much for the codebook view, which has traditionally been a more flat view of a study and the files it produces. There is a connection between variable and datum so if we want this to be part of codebook or an extended version it is possible.

Do we care about anything other than the instance variables in Codebook? Codebook is something you get with a file that lets you use it and interpret it. But if you have pointers to represented variable and conceptual variables you can do more.

Since codebooks are created ad hoc, that's how it's designed. There is no guarantee that the way someone creates a conceptual variable is the same as how someone else creates it. There would be no semantic interoperability. But in a future world comparability could be designed into new surveys. A DOI to what has been defined elsewhere would be OK.

We have polled various organizations to see which elements they use. Do we need to carry forward elements that are not used? This is a good point in time to simplify. To survey DDI 3 usage, Oliver has a small XSL transformation that outputs statistics on the downward paths in any given document, which could be helpful.
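
Oliver's XSL transformation is not reproduced in these minutes. As a rough illustration of the same idea in Python rather than XSLT, the sketch below counts how often each downward element path occurs in a document. The function name is hypothetical and namespaces are simply stripped, which a real usage survey might not want.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_counts(xml_text: str) -> Counter:
    """Count occurrences of each root-to-element path in an XML document."""
    root = ET.fromstring(xml_text)
    counts: Counter = Counter()

    def walk(elem: ET.Element, prefix: str) -> None:
        # Drop a leading "{namespace}" so paths stay readable (an assumption).
        tag = elem.tag.split("}")[-1]
        path = prefix + "/" + tag
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return counts


# Toy document for illustration only, not a valid DDI instance.
sample = (
    "<codeBook><stdyDscr><citation/></stdyDscr>"
    "<dataDscr><var/><var/></dataDscr></codeBook>"
)
for path, n in sorted(path_counts(sample).items()):
    print(n, path)
```

Run over a set of archive documents, the summed counters would show which elements are never used.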

In Data Description, there was a related discussion about how far we should chase legacy file layouts. In one sense you want to encourage people to do things in simpler ways, rather than more complicated formats.

It was decided that the ability to include references to represented and conceptual variables is a good addition to codebook to bring in the notion of reuse.

 

 

 

 

 

 

Meeting Minutes Sept 28 2015

Attendees: Dan Gillman (chair), Michelle Edwards, Oliver Hopt, Larry Hoyle, Steve McEachern

 

Oliver distributed a PDF of his thinking around the Codebook model. He presented this work, and provided commentary on his thinking. 

Scenario A was discussed at the last meeting, but was seen to be problematic. 

Scenario B was his revised approach. This includes:

 - Study, DataResource and DataFile

 - Citation from Annotation

 

DataResource is consistent with the GSIM equivalent

 - Carries Citation which allows various subclasses to be citable

 - Has one attribute: productionInformation

 - VariableBasket and DataFile would be subclasses of DataResource

 

Study includes:

 - StudyDesign

 - Fieldwork

 - Etc.

 - Study would have an attribute DataResource  
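
Scenario B as described above can be sketched as a small class hierarchy. Oliver's PDF is not part of these minutes, so everything beyond the class names and the productionInformation attribute (field types, `fileName`, `variableNames`, and the list-valued `dataResources`) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    title: str


@dataclass
class DataResource:
    # Carrying the Citation here makes every subclass citable.
    citation: Optional[Citation] = None
    productionInformation: str = ""


@dataclass
class DataFile(DataResource):
    fileName: str = ""  # hypothetical attribute


@dataclass
class VariableBasket(DataResource):
    variableNames: List[str] = field(default_factory=list)  # hypothetical


@dataclass
class Study:
    studyDesign: str = ""
    fieldwork: str = ""
    # Assumed repeatable so that a Study can hold more than one DataFile.
    dataResources: List[DataResource] = field(default_factory=list)
```

For example, `Study(dataResources=[DataFile(fileName="wave1.csv"), VariableBasket(variableNames=["AGE"])])` holds two citable resources of different kinds.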

 

Comments and discussion

 

1. Dan asked about the meaning of the blue box around DataResource in Scenario B. Oliver said that it indicates a new package, DataResource.

2. Dan asked what the cardinality of the relationship between Study and DataResource is. Oliver suggested that this should be repeatable, e.g. more than one DataFile in a Study.

3. Dan asked whether DataResource is currently a collection of files or a collection of Variables, and whether it could include Questions.

- Oliver noted that there is currently a relationship through Measure from Question to InstanceVariable.

- We may not want to include all of the DataCapture view within Codebook

- Dan suggests that DataCapture has not yet laid out the link between the Questionnaire in the abstract versus the Instrument in the physical.

- We would want to include the Questions, Skips, ResponseCategories and InterviewerInstructions.

- Which do we want - the PhysicalInstrument or the ConceptualInstrument?

- Examples: Blood Pressure measurement, CATI instrument execution

- By including Physical, do we as a result account for Conceptual?

- Larry asks can we include by reference? Dan argues for the need for explicit rather than implicit reference. Larry notes that this means that this would make an Instrument required content.

- Dan asks if it is adequate to have just a pointer? If so, how do you link the Variable to the Question?

- Dan suggests that there IS a link between a Question and a Variable - but it is just not enough to tell you sufficient detail as to how a Datum was derived.

The group generally wasn't sure if we do want to try and link the Question and Variable - mostly due to content already existing (particularly pre-2000).

 

Oliver brought the conversation back to what we are currently trying to model.

 

Preferably there should be some machine actionable generated documentation which allows the links between these to be automatically (or semi-automatically) created. However in many cases this simply may not be available for past content (ADA and GESIS have examples, and we believe ICPSR as well).

 

As such, we may want to allow for simple external documents which describe the content in a human-readable (but not machine readable or actionable) form.

External resource is an option in Lifecycle - this might be the means for this.

 

Oliver's current model does enable this: it allows for the simple form, which can be replaced by more complex content where that is available and/or "generateable".

Steve noted that this would also be consistent with the approach taken in Methodology.

 

Where does this leave us, and where to next?

 

Dan is concerned that we may be adding a fair amount of complexity over DDI version 2.5.

E.g., we have been having discussions about the link between Question and Variable - how would the user community respond to this?

 

Oliver also noted that this may touch on the discussion had with Ornulf about maintaining Codebook 2.5 through the DDI4 implementation. Ornulf's and Oliver's concern was the potential creation of too many identifiers to be maintained within a Codebook instance, and whether we would be able to handle what's done in 2.5 in a DDI4 codebook.

 

Larry noted that Colectica seem to have a potential solution to this in their current work. This seems to bypass the Lifecycle 3.2 approach, and simply use UUIDs to manage identifiability, which might be a possible solution.

 

What to do for next meeting?

 

Oliver undertook to clarify what the relationships between his Study object and the other packages would be (e.g. to DataCapture, Methodology, etc.).

We also need to ensure that we keep track of what the requirements are for aligning with DDI2.5

 

Next meeting: Monday 12th October, 8am U.S. Eastern time

Note that there will be changes for other locations due to daylight savings.


August 17, 2015

Present: Michelle Edwards, Dan Gillman, Larry Hoyle, Steve McEachern, Mary Vardigan

What is the advantage of moving from Codebook to Lifecycle?

One benefit is building a collection of reusable instruments in multiple languages. Reusing the census variables in other questionnaires is another area.

Something we should promote is building in limited amounts of reuse. It may be possible to incorporate areas of reuse without incorporating in others where we don't see the benefit – variables are an area. Can this be done piecemeal? Most variables are instance and possibly represented variables. As they see the need they can build out to conceptual level. The recent work on ANES and GSS is a good example of this. With the concept management perspective we have, you can always argue that any two usages are different in some way. We will be imprecise in some ways always. There is a push among NSOs for question banks, but there is a recognition that modes affect the responses. Your intent is to measure the same concept. This is why concept management is a powerful idea.

One of the problems may be the tools that are needed. We can't yet articulate the use case to build the tools we need.

Identifiers

What is the best way to proceed in terms of identifiers? Oliver is doing some modeling so we should be able to look at identification based on what he does. Mary will introduce the identifier discussion with Ornulf so we can get Nesstar Publisher on board.

We hope there will be a way to use the identifiers in DDI-C and append to them to make them unique. At what level do we ultimately need a unique identifier? Anything that is identifiable in 4 requires a unique identifier. It is pretty well assured that everything identifiable in 2 will be in 4. The IDs in 2 are at the variable level. If the variable has a unique identifier and the study has a unique identifier, adding the registry ID should give global uniqueness. In the Linked Data world you could then find all the variables in the world related to a concept. Another benefit is that many studies are ongoing, so the yearly or monthly variables could be looked at across time. Any time you are making comparisons over time, subject, or geography you need this.
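
The composition described above (registry ID plus study ID plus variable ID) can be illustrated with a minimal sketch. The separator, the function name, and the example values are assumptions; this is not the DDI URN syntax.

```python
def global_id(registry_id: str, study_id: str, variable_id: str, sep: str = ":") -> str:
    """Join registry, study and variable IDs into one composite identifier.

    Each part only needs to be unique within its parent: the variable within
    the study, the study within the registry.
    """
    parts = (registry_id, study_id, variable_id)
    for part in parts:
        # Reject empty parts and parts containing the separator, since either
        # would make two different triples collide.
        if not part or sep in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return sep.join(parts)


# The same local variable ID in two studies yields two distinct global IDs.
print(global_id("example.org", "S001", "V12"))
print(global_id("example.org", "S002", "V12"))
```

The check on the separator matters: without it, ("a:b", "c", "d") and ("a", "b:c", "d") would produce the same string.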

 


July 20, 2015

Present: Michelle Edwards, Oliver Hopt, Larry Hoyle, Mary Vardigan

Managing DDI Codebook in DDI4

The group discussed whether it would be possible to reconcile the different approaches to identification if we were to manage DDI Codebook in DDI4 in the future, which is the goal. Currently in DDI Codebook IDs are unique only for the individual instance, not across instances, and the approach of DDI Lifecycle and DDI4 is to have globally unique IDs for all DDI objects. It was the sense of the group, however, that the IDs are not a big barrier, either using the URNs or using UUIDs; it should be fairly easy to make a transfer. Scripts can generate UUIDs. We could manage Codebook in DDI4 without taking advantage of referencing and reuse. Also, there is a Local ID in DDI4, which could carry what is currently the ID in DDI Codebook and a UUID could be added. Colectica goes back and forth between DDI Codebook and DDI Lifecycle and they use UUIDs so it would be helpful to talk with them about these issues.
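
As a minimal sketch of the "scripts can generate UUIDs" point, the following keeps each DDI Codebook ID as a local ID and adds a UUID next to it. The dictionary keys and the use of a study URI as input are assumptions. Name-based UUIDs (`uuid.uuid5`) are used so that rerunning the script gives the same result; random UUIDs from `uuid.uuid4()` would serve equally well if repeatability is not needed.

```python
import uuid
from typing import Dict, List


def assign_uuids(study_uri: str, local_ids: List[str]) -> List[Dict[str, str]]:
    """Pair each instance-local Codebook ID with a globally unique UUID."""
    records = []
    for local_id in local_ids:
        # Derive the UUID from the study URI plus the local ID, so the same
        # local ID in a different study maps to a different UUID.
        name = f"{study_uri}#{local_id}"
        records.append({
            "localId": local_id,  # what is currently the ID in DDI Codebook
            "uuid": str(uuid.uuid5(uuid.NAMESPACE_URL, name)),
        })
    return records


# Hypothetical study URI and variable IDs, for illustration only.
for record in assign_uuids("http://example.org/study/1", ["V1", "V2"]):
    print(record["localId"], record["uuid"])
```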

There is also a political issue in that DDI Codebook has been handled separately and people feel ownership of it as it stands now. It is used around the world by the IHSN. We want to maintain close relationships with these partners, so we will need to design a system that works for them. We should contact Nesstar to start a conversation about how Nesstar Publisher might make some relatively small changes to accommodate this switch to managing DDI Codebook elements in DDI4.

Status of Spreadsheet and Modeling

In the past weeks the Codebook group annotated a spreadsheet – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646 – containing all of the Codebook elements used by CESSDA archives with the objective of determining which elements are currently in DDI4 and which might need to be added. It was the sense of the modelers on the call that the spreadsheet as it stands now is adequate input for the modeling effort. Oliver with support from Larry will start to add classes to Drupal based on the spreadsheet and will get back to the group with any questions. He estimates that he will have a first Codebook View to show in four weeks.


Simple Codebook June 8, 2015

Present: Dan Gillman, Larry Hoyle, Jenny Linnerud, Oliver Hopt, Mary Vardigan

The group continued to review the spreadsheet mapping DDI 2.* to DDI4 and noting items that the modeling should take up.

Then the group turned to the metadata that the statistical packages include. Larry provided a spreadsheet that he and Achim had developed to show which metadata were included in each of the major statistical packages. It will be important for Codebook to contain all of this metadata. There are other ways of handling data, like SQL, that might also be appropriate. In the Big Data world, Python is becoming popular. Python is a general scripting language and has taken over the role that Perl had at one point. You can explicitly represent trees like JSON and XML, so it is very flexible. People have developed modules that do statistical kinds of things with Python.

Looking at all the software metadata from the statistical point of view is important. We need to make sure that everything in Larry's spreadsheet is accounted for in a meaningful way. We need to identify things that are not in the DDI 2.* spreadsheet. We can go through this all together or do assignments.

Number of significant digits is important in some scientific data. Whether the number has been rounded can be important. This should be included in DDI4. In 11179 community, there was a discussion of accuracy and precision. This is related to significant digits. The Data Description Team should address this. In an Instance Variable we may want to talk about significant digits while for a Represented Variable we talk about accuracy. We don't want to lose simple statistics on variables.

Larry and Dan will talk with the Data Description and Modeling teams about these issues.

 

 

 


Simple Codebook Meeting May 11, 2015

Present: Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group continued its review of the mapping between DDI Codebook and DDI 4 – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646.

The group returned to the elements regarding availability and access. There is currently no archive information in DDI4 and this needs to be modeled, perhaps at the upcoming sprint. In terms of the use statement, some is not covered in the access object in Discovery in DDI4. This needs to be modeled also. SAML isn't useful for us because it is too high level. Both data and metadata may need something attached. We might look at this in the Datum discussion (not only columns but rows) and also attaching things to the metadata to control access. This might be like annotations where it can be attached to anything – access could have a relationship to annotated identifiable. Then any object could have an access control. From access description to object could be another solution. This could make sense because an object could have different access policies when stored in different archives. This should be discussed at the sprint also. There is an Access Control XML language that we looked at but didn't decide on. Michelle will be representing CISER at the sprint and can express their needs in this area.

In terms of Imputation, it is now the same as it has been in 3. Generation Instructions and General Instructions seem to have the same text. We need some clarification from Wendy on this. They can describe an Imputation procedure. This has not yet been brought up in 4. This would be methodology or fieldwork. It is in the Processing package now. Need clarification at the sprint.

Security in variable relates to the discussion above. 3.2 doesn't do much at the row level but this is becoming a requirement.

Embargo is in Simple Codebook, but this is basically a set of placeholders right now. This should be part of the Access Rights discussion at the sprint so we do this consistently. Where should this come from? A use case or the modeling team proposing an approach. We probably need both directions. Maybe two use cases – one from Bill for metadata and one from Ornulf for data.

Response Unit not yet modeled and will come up in complex instrument. This can be at the study and variable level. An equivalent should be covered in methodology.

For question elements, there is a container in Data Capture that will work for this and allow you to instantiate pre-, post-, and literal question as well as interviewer instructions. Statement is the container.

In terms of invalid range, this is in Simple Codebook. How are we tying this to missing? In 3.2 and in Simple Codebook in 4 you can point to a managed missing values representation and in that you can do ranges. You can do things like from this value to that value is a missing value. This is there by virtue of having been brought over from 3.2. The ISO 11404 notion of sentinel value (each instance variable has a set of such values but it might point to the same represented variable) has been modeled to allow for the valid set of data to be handled in different statistical packages. You have to represent the semantics in different ways. The Data Description group should handle this.
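
The combination of individual sentinel values and "from this value to that value" ranges can be illustrated with a small sketch. The function, its parameters, and the example codes are hypothetical and do not reflect how the 3.2 or DDI4 missing values representation is actually structured.

```python
from typing import Iterable, Optional, Tuple


def is_missing(
    value: Optional[float],
    sentinels: Iterable[float] = (),
    ranges: Iterable[Tuple[float, float]] = (),
) -> bool:
    """Return True if value is absent, a sentinel, or inside a missing range."""
    if value is None:
        return True
    if value in set(sentinels):
        return True
    # Ranges are treated as inclusive on both ends (an assumption).
    return any(low <= value <= high for low, high in ranges)


# Hypothetical variable: -9 and -8 are sentinels, 990 to 999 is a missing range.
codes = [42, -9, 997, 0]
print([is_missing(c, sentinels=(-9, -8), ranges=[(990, 999)]) for c in codes])
```

Keeping the sentinels and ranges per instance variable, as the minutes describe, lets the same declaration be rendered differently for each statistical package.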

Undocumented Codes – they should have had a label but didn't get documented. Codebook is the obvious group to handle this.

Total Responses is another part of the documentation for variable and should be handled by Codebook. This is handled with a controlled vocabulary when you say what type of statistic it is.

Summary Statistics is in Complex Data Type. They are not in the Simple Codebook view now but that hasn't been built out yet and we would need to include them in the view.

In terms of Descriptive Text, all the variables in 4 inherit Description as members.


Simple Codebook Meeting April 27, 2015

Present: Michelle Edwards, Dan Gillman, Oliver Hopt, Steve McEachern, Mary Vardigan

The group went back to the mapping between DDI Codebook and what is in DDI4. In terms of Access Conditions, there is an Access module in Discovery, where it is streamlined. It looks as if availability and use statements are not included; everything is structured string. We might look at SAML or another controlled vocabulary for access control like XACML (Extensible Access Control Markup Language). The issue is whether the outside source maintains previous versions, which we don't have control over.

In terms of Other Material, this was all found in DDI4 except for the Other Material table. This was part of DDI Codebook to mark up a table for presentation. In terms of VarDoc version, none of that was in DDI4. In DDI4 versioning is done at a low level, so this is taken care of at a level of the model that is not about particular content but about everything – Identifiable and Annotated Identifiable. There is an ID and a version. The question is that in Codebook the description is applied against Variable; in DDI4 identification applies broadly.

The group traced identification through the DDI4 model and looked at Collections and Members. Version Type in DDI Codebook does not seem to be covered, but no one is using this. Type seems no longer relevant and related to documents rather than to elements. People who understand this element from the old way of thinking have to know that the idea of a version is being expanded. We need to table this for now but are leaning toward deprecating this element.

Coding Instructions probably maps to Fieldwork and Methodology, which we don't have yet in DDI4.

 

 

 


Simple Codebook Meeting 2015-04-13

The meeting focussed on reviewing the next set of metadata elements from DDI-C - those covered by Steve.

Steve had created an additional three columns to his copy of the spreadsheet for his work - adding:

  1. Package (for elements already matched in DDI4)
  2. Suggested Package (for elements that have no match)
  3. the DDI-C definition.

These additional columns have now been added to the Google Spreadsheet - linked here.

The discussion then focussed on the elements. Notes on specific elements are included in the spreadsheet, and summarised below

Elements

Source

  • Example: for a digitized statistical abstract, the original print publication; for administrative data, the original administrative program. A simple version of provenance.

Geographic unit

  • “Lowest level of geographic aggregation covered by the data.”
  • Would GeographicLevels (plural) be better to indicate that multiple levels can be used? Is GeographicLevel a better term than GeographicUnit?

Control operations

  • Description of what was done. Data collection process.

General comments and issues

It was noted that much of the methodology section of DDI-C was not yet covered in DDI4. Part of this will be addressed by the Methodology working group.

There is however a set of elements that are not really methodology (or at least the research design), but rather are descriptive of the process and outcome of the execution of the methodology. These elements might most appropriately fall under the heading of "Fieldwork". Examples from DDI-C include:

  • CollectionSituation
  • MinimizeLossActions
  • ControlOperations

and, notably, RESPONSE RATE.

The group was concerned that it is unclear how we might provide recommendations here. For example, what is meant by ResponseRate: the “opposite of rate of refusal”, or other types? It was also recognised that this is not really part of methodology, but has an impact on methodology, as well as on analysis and post-processing. For example, was there an intervention based on low response rate? These are fieldwork issues.

On similar lines, there was a recognition that Methodology is the ugly part of DDI Codebook. Dan suggested that this section may be in need of a significant revamp, given the developments in survey methodology that have occurred since the original development of DDI-C, in particular the Total Survey Error framework.

It was noted that these issues with Methodology and Fieldwork need to be raised with the AG sooner rather than later, as they have resource and workload implications for the Moving Forward program. Steve will write something up on this and distribute to the group, prior to sending to the AG.

 

 

 


Simple Codebook Meeting

March 16, 2015

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The agenda for the meeting was to determine if all elements in the CESSDA profile/Nesstar profile are present in DDI 4. Larry Hoyle had created a spreadsheet of DDI Lite and the list of elements from CESSDA profiles. There seems to be a wide variety of the selection of the elements and attributes in the repositories using DDI Lite. The Nesstar Webview comes as the base. The group compared elements used across different repositories.

The task was to find out which elements are in DDI4, so the group decided to divide up the list of 200+ elements. There appear not to be any DDI4 elements about the metadata itself, i.e. the DDI document. It basically parallels the study description information. This may not be relevant for DDI4. Perhaps the Data Citation group should think about this. This is often the archive's intellectual property, so some representation of it will be of interest to most of the archives. Citing the user guide or documentation is a common practice.

DDI Codebook has some elements of description that DDI4 has not been talking about. We need to bring forth something to the Advisory Group about this – this is an issue that we need to discuss. In DDI Lifecycle there is the corresponding instance with a citation on it. There is no DDI4 instance because instance is a root element for documents in general.

Will the idea of a document description disappear in 4? The archive creates a document describing the data. The landing page is sometimes (always?)  metadata.

Study level, variable level, record level, file level: should the Data Citation group look at what are targets of citation?

In DDI Codebook, we have DocumentDescription; in DDI Lifecycle we have DDIInstance. Should DDIInstance be brought back into DDI4? – with revised content but allowing attachment of annotation.

Being able to point to an XML file with the model and generate that file from elements in 4 is adequate. But it is no longer enough to point to one object that contains everything.

We have the logical vs. physical distinction. A DDIInstance is a physical thing – something that's there. Pulling together the information into that representation is an activity with Authors, etc. There is the "same" content in two archives – different contact people, different URIs for each. This is parallel to data description.

Assignments for the next meeting

Where in DDI4 does each of these elements exist?

FirstLine   LastLine   N    Who        Content
70          101        31   Dan        Citation
102         131        29   Steve      Scope Methodology
132         155        23   Oliver     Access Conditions
156         184        28   Larry      File Variable
185         205        20   Mary       VarDoc
206         232        26   Michelle   CategoryGroups OtherMaterial

 

 

...

Expand
titleFebruary 02 2015

Simple Codebook Meeting Minutes February 2, 2015

 

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The Simple Codebook committee will now be chaired by Dan Gillman as Wolfgang is not able to chair currently.

This group has been in a holding pattern because we are waiting on the results of other groups. However, it was suggested that we look at the Codebook 2.5 (Codebook Version) in comparison to DDI 3.* (Lifecycle Version).

XML permits a detailed description of elements and this is part of the distinction between 2 and 3. But UML doesn't allow this and doesn't account for nesting and levels of detail. We should try to incorporate what is in Version 2 into the model as best we can. We as a group should try to build this. One additional possible advantage would be that we could then have a single model to account for both Codebook and Lifecycle. Both views would be under one spec in this approach.

Is referencing and reusability a distinction between the two versions that we should take into account? Should it be communicated to the modeling team that we may not need the complexity?

For users who want to describe their data, they should be able to write a description and fit it into a framework. If you want to have interoperability with other systems, then that is a different issue.

For the standalone one-off research project, users will not need to be reusing variables and questions, but for longitudinal and research across languages and cultures, this is important; there is a need to harmonize across questionnaires, reuse metadata across time, etc. Maybe this is Complex Codebook?

We need a distinction between the user perspective and the technical perspective. Simple and complex need to be interoperable. It's necessary to reduce the complexity of what is modeled in the library by choosing the simple cases.

One of the decisions for DDI 4 is to make everything identifiable and drop the container aspect of identifiability. This takes away a lot of the complexity.

From a marketing perspective, we need to distinguish between the DDI Codebook version and the Simple Codebook view. Looking at what is in 2 now will be required and we need to lay out what we need to account for. In the study section for DDI Codebook, there were a number of elements that allowed you to provide a high level text description of various methodological things. Preserving that is important.

Capturing what is in an SPSS or SAS representation, including all the metadata you can put there, is also important. When you move data around, you don't want to lose anything. When you look at how researchers want to record information, it is often difficult for them to record things in detail. Guided structures as part of their workflow are important, and Codebook is one view that could help them with this. You need some structure that becomes machine-actionable. You don't want people to just write a narrative.

At BLS, there is a Handbook of Methods. It has narrative descriptions of the surveys BLS does and it doesn't have a lot of detail. This should be captured in DDI rather than in a PDF. There is a need for high level and detailed as well. There may still be a need for some kind of a DDI Lite as a way of inducing reluctant data producers to get involved. For variables the detail is necessary. We should make this as flexible as we can.

We can start by looking at what is in 2.5 and drawing up a list of what we need to account for. This would be a set of requirements that we as a group need to figure out how to meet. One question we want to address from a modeling point of view is, for example, when we need to say how the sample is constructed: Would those higher level descriptions go in a class of things that are independent of everything else or part of a sampling class? These are design issues that might have an impact on the way the more detailed model plays out.

If we can manage both 2 and 3 in the same structure we as a standards body will have an easier time with this. We should consult with Wendy on this.

Several archives still rely on DDI Codebook, Nesstar, etc. There is a set of codebook specs from different archives.

Are we talking about having our Simple Codebook view covering everything that is in 2.5? It should be even less. But should there be a view that is everything in 2.5? One idea is a view that is a really simple codebook but to allow for complexity in any direction you would like to go so we could incorporate everything that is in 2.5. Or go into more detail in 3 for whatever direction you want to go so there is a seamless distinction between high and detailed levels. This is basically what DDI 4 is. We should provide a lot of different options about how much detail the user wants. With 4 right now we have detailed descriptions of a lot of things but we are not allowing for high level descriptions. The description and definition were discussed in London with respect to Drupal in the sense that there could be radio buttons to indicate that they should be used to standardize those objects. It could be possible to have a description without any usage of detailed sub-elements.

There could be an attribute that could be high-level description. Or we have an element saying this is the Sample Description. Just having an element called description associated with identifiable objects may not be sufficient. In the annotated identifiable there is an annotation element that has Dublin Core properties like Title, Contributor, etc. It has an abstract. But there is nothing that is a high-level description.

On the one hand it might be nice to have a Sampling Description, but it might be over-specified. It's important to have an element dedicated to a high-level description that you are offering in place of the detail or as a supplement to the detail. A general description like the annotation will lose semantic interoperability. We need machine-interpretability. We also want the possibility to reference just the high level description in the simple codebook.

We should be able to allow for user-defined views that provide for whatever level of detail an organization uses. A Simple Codebook view that maps back to 2.5 would be useful. It would allow those organizations just using 2 to feel comfortable using 4.

DDI 4 does not have the same hierarchy as DDI 3. We would still need an object carrying high level content for the sampling process and nothing else. In 3 there was a parent node but we don't have this structure in DDI 4, which means you need to create a container for this description. It's not a question of using description as a property containing the text, but which element carries the description.

Between now and the next meeting, Oliver will make some slides with an example of what we have been talking about. We also need to dig into DDI 2.5 to get a handle on what is needed at the higher level. Dan and Larry will look at this. Dan will also consult with Wendy on this.

 

 

 

Expand
titleNovember 23 2014

Simple Codebook Meeting Minutes
November 23, 2014

 

Present: Dan Gillman, Steve McEachern, Mary Vardigan

Meeting Times

The current time is midnight for Canberra, so we need to find another meeting time. 2pm EST U.S. time is the preferred time for the new year.

DDI 3.2 vs. 4

We are thinking in terms of forward compatibility so that everything in 3.* is covered in 4. This is not the best approach. Rather, we should solve the problem we want to solve and then worry about how to map it after we have solved it.

Framing happens unconsciously -- the circumstances of how you think about a problem constrain the way you are conceiving it.

Still it’s worth having a look at what we have right now to see what the overlap is.

By sticking with the nicely defined distinction between logical and physical we can be more precise going forward.

There is not much that is not actually covered in 4, but it is going to be reorganized.

Next Steps

Steve will compare the spreadsheet to Data Description in 4 to determine how they map and overlap.

...

Expand
titleSeptember 15, 2014

 

Simple Codebook Meeting
September 15, 2014

 

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Jenny Linnerud, Steve McEachern, Ørnulf Risnes, Wendy Thomas, Mary Vardigan

Discussion

The group affirmed Wendy’s definition of a codebook (See Appendix A for the full document):

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).
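
As a rough illustration of the data dictionary part of this definition, the sketch below models one entry with the properties the definition lists (name, physical location, data type, size, and the meaning of valid and invalid values). The class and field names are hypothetical and are not DDI element names; this is only one possible way to hold the information.

```python
from dataclasses import dataclass, field


@dataclass
class VariableEntry:
    """One data dictionary entry (illustrative field names, not DDI elements)."""
    name: str
    start_column: int   # physical location: 1-based first column in the record
    width: int          # size in characters
    data_type: str
    valid_values: dict = field(default_factory=dict)    # code -> meaning
    missing_values: dict = field(default_factory=dict)  # invalid (missing) code -> meaning

    def meaning(self, code):
        """Return the documented meaning of a code, flagging missing codes."""
        if code in self.valid_values:
            return self.valid_values[code]
        if code in self.missing_values:
            return f"missing: {self.missing_values[code]}"
        return "undocumented code"


# Hypothetical example variable
sex = VariableEntry(
    name="SEX",
    start_column=12,
    width=1,
    data_type="integer",
    valid_values={"1": "Male", "2": "Female"},
    missing_values={"9": "Not stated"},
)
```

The additional codebook content (source, methodology, processing, quality) is deliberately left out here, since the definition allows it to be informational or held by reference.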

The group discussed overlap with other groups and packages since codebook is a compilation of other packages. Simple Codebook is most likely a compilation of Conceptual, Simple Data Description, Discovery, and additional information that facilitates interpretation of the data and intelligent use. The difficulty is determining what depth of information is appropriate. For replication purposes, you need a lot of detail.

The Simple Data Description group is first focusing on data description in a broad way and will then define a subset for “simple.” Perhaps this group should do the same.

It would be helpful to have reports from other groups so that we know where they are and what makes sense to combine for simple codebook.

In Wendy’s list (Appendix A), much of the content we need is covered by other groups, but we could use more detail in Data Source, Data Processing, and Methodology. Methodology framed its scope broadly in Toronto but hasn’t yet met as a group. One activity for that group would be to review the sampling and weighting specifications that came out of the Survey Design and Implementation working group to see what is needed beyond that work.

Next Meeting

The group will meet again on Monday, September 29, to get reports from other groups.

Appendix A

What is a codebook?

[also referred to by DataONE as science metadata for science data]

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

Discussion

The definitions below for "codebook" are survey centric when referring to the broader set of metadata related to a data file. Another term may be preferable but there isn't one that leaps to mind. Whether called a codebook, science metadata, metadata, or something else, data files have 2 levels of description:

·         A structured physical description that supports the ability of the programmer to access the data accurately

·         Supporting information that allows the researcher to evaluate “fitness of use” of the data to a particular research question, the overall quality of the data, and the specifics of the conceptual (objects, universe/population, conceptual definitions, spatial and temporal) coverage. This information may be applicable to the study as a whole or to the individual variable. This also includes information on why and how the data were captured, processed, and preserved.

 

Type of information, by codebook type (Basic Codebook / Survey / Fauna (Wildlife))

Data structure:

·         Record type

·         Record layout

·         Record relationship

·         Data type

·         Valid values

·         Invalid values

Basic Codebook: Structured metadata to support access

Survey: Structured metadata to support access

Fauna (Wildlife): Structured metadata to support access

Data source:

·         Why was data collected

·         How was data collected

·         Who collected the data

·         The universe or population and how it was identified and selected

Basic Codebook: Descriptive to support assessment of quality and fitness-for-use

Survey: Purpose of the survey; Survey content and flow (may or may not need to be actionable); identification and sampling of survey population (may or may not need to be actionable for replication purposes)

Fauna (Wildlife): Purpose of study, how data was collected (may need to be actionable to support replication and/or calibration); identification and sampling of survey population (may or may not need to be actionable for replication purposes)

Data processing:

·         Data capture process

·         Validation

·         Quality control

·         Normalizing, coding, derivations

·         Protection (confidentiality, suppression, interpolation, embargo, etc.)

Basic Codebook: Informational material; support provenance

Survey: May need structured metadata for purposes of replication; Include processes, background information, proposed, actual, and implications for data

Fauna (Wildlife): May need structured to support mechanical capture instruments, calibrations, situational variants, etc.

Discovery information:

·         Who

·         What

·         When

·         Why

·         Coverage

o   Topical

o   Temporal

o   Spatial

Basic Codebook: Structured metadata to support discovery and access to the data as a whole

Survey: Structured metadata to support discovery and access to the data as a whole

Fauna (Wildlife): Structured metadata to support discovery and access to the data as a whole

Conceptual basis

·         Object

·         Concept

Basic Codebook: Informational material

Survey: Structured to support analysis of change over time and relationship between studies. May just be descriptive / informational.

Fauna (Wildlife): Structured to support genre level comparison (heavy use of common taxonomies, etc.)

Methodologies employed

Basic Codebook: Informational material

Survey: Structured to support replication and comparison between studies

Fauna (Wildlife): Structured to support replication and comparison between studies

Related materials of relevance to data

Basic Codebook: Informational material

  

Definitions

Data Dictionary

·         A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."[1] The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

·         A document describing a database or collection of databases

·         An integral component of a DBMS that is required to determine its structure

·         A piece of middleware that extends or supplants the native data dictionary of a DBMS

·         Database about a database. A data dictionary defines the structure of the database itself (not that of the data held in the database) and is used in control and maintenance of large databases. Among other items of information, it records (1) what data is stored, (2) name, description, and characteristics of each data element, (3) types of relationships between data elements, (4) access rights and frequency of access. Also called system dictionary when used in the context of a system design. Read more: http://www.businessdictionary.com/definition/data-dictionary.html#ixzz3Am5wCgZI

·         A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. (Posted by Margaret Rouse  @ WhatIs.com)

Codebook

What is a codebook? (http://www.sscnet.ucla.edu/issr/da/tutor/tutcode.htm)

A codebook describes and documents the questions asked or items collected in a survey. Codebooks and study documentation will provide you with crucial details to help you decide whether or not a particular data collection will be useful in your research. The codebook will describe the subject of the survey or data collection, the sample and how it was constructed, and how the data were coded, entered, and processed.  The questionnaire or survey instrument will be included along with a description or layout of how the data file is organized.  Some codebooks are available electronically, and you can read them on your computer screen, download them to your machine, or print them out. Others are not electronic and must be used in a library or archive, or, depending on copyright, photocopied if you want your own for personal use.

Codebook : Lisa Carley-Baxter (http://srmo.sagepub.com/view/encyclopedia-of-survey-research-methods/n69.xml)

Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file. Data files usually contain one line for each observation, such as a record or person (also called a "respondent"). Each column generally represents a single variable; however, one variable may span several columns. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean. Codebooks are used to document the values associated with the answer options for a given survey question. Each answer category is given a unique numeric value, and these unique numeric values are then used by researchers in their analysis of the ...

Codebook (Wikipedia.com)

A codebook is a type of document used for gathering and storing codes. Originally codebooks were often literally books, but today codebook is a byword for the complete record of a series of codes, regardless of physical format.

ICPSR

What is a codebook?

A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the datafiles.

While codebooks vary widely in quality and amount of information given, a typical codebook includes:

• Column locations and widths for each variable

• Definitions of different record types

• Response codes for each variable

• Codes used to indicate nonresponse and missing data

• Exact questions and skip patterns used in a survey

• Other indications of the content and characteristics of each variable

Additionally, codebooks may also contain:

• Frequencies of response

• Survey objectives

• Concept definitions

• A description of the survey design and methodology

• A copy of the survey questionnaire (if applicable)

• Information on data collection, data processing, and data quality
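
To make the first items in the ICPSR list concrete, here is a minimal sketch of how column locations and widths from a codebook can drive reading a fixed-width record, with nonresponse codes checked afterwards. The layout, variable names, and codes are invented for illustration and do not come from any real study.

```python
def read_record(line, layout):
    """Slice one fixed-width record using (name, start_column, width) triples.

    start_column is 1-based, as column locations usually are in printed codebooks.
    """
    return {
        name: line[start - 1:start - 1 + width].strip()
        for name, start, width in layout
    }


# Hypothetical layout and nonresponse codes, as a codebook might document them
LAYOUT = [("CASEID", 1, 4), ("AGE", 5, 3), ("SEX", 8, 1)]
NONRESPONSE = {"AGE": {"999"}, "SEX": {"9"}}

record = read_record("0001 342", LAYOUT)
is_missing = {
    name: value in NONRESPONSE.get(name, set())
    for name, value in record.items()
}
```

Without the codebook's layout and missing-data codes, the raw line "0001 342" cannot be interpreted reliably, which is the point the ICPSR description makes.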

 

 

Expand
titleDagstuhl Sprint Oct 2014

Minutes from Dagstuhl Sprint 2014 Working Group

...

Expand
titleJune 30, 2014
 

Meeting: 2014-06-30

 

Attending: Guillaume Duffes, Dan Gillman, Larry Hoyle, Ørnulf Risnes, Steve McEachern, Wendy Thomas

 

Reviewed list of related package and view content from Wolfgang

 

Decisions:

 

There is currently a lot of duplication in the list and it needs to be normalized prior to review.

 

Steve will normalize the list and send it out to members later this week with the following instructions:

 

Review the list and do the following:

 
  1. Add any unlisted objects that you would expect to find in a basic or simple codebook

  2. For each item indicate if the item is one which would be required in order to publish the codebook or is one that would be useful to have in the codebook

  3. Return your review to the group.

 

 

 

Unless other agenda items arise, schedule the next meeting after the deadline for returning reviews.

 

Process:

 
  • Items that have agreement in terms of "required" will go into a basic view

  • Items that have agreement in terms of "would like to see" will go into an "intermediate" view

  • Items without agreement will be discussed and assigned during the next meeting
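
The sorting rule above can be sketched as follows. This is only an illustration: it assumes "agreement" means that every returned review gives the same rating, which the minutes do not state, and the item names and the rating labels "required" and "useful" are hypothetical.

```python
def assign_views(reviews):
    """Sort items into basic, intermediate, and to-discuss lists.

    reviews maps an item name to the list of ratings returned by members,
    each rating being "required" or "useful".
    Assumption: agreement means all ratings for the item are identical.
    """
    basic, intermediate, to_discuss = [], [], []
    for item, ratings in reviews.items():
        distinct = set(ratings)
        if distinct == {"required"}:
            basic.append(item)
        elif distinct == {"useful"}:
            intermediate.append(item)
        else:
            # mixed ratings, or no ratings returned at all
            to_discuss.append(item)
    return basic, intermediate, to_discuss


# Hypothetical reviews from three members
reviews = {
    "Variable": ["required", "required", "required"],
    "Sampling description": ["useful", "useful", "useful"],
    "Contributor": ["required", "useful", "required"],
}
basic, intermediate, to_discuss = assign_views(reviews)
```

A looser rule (for example, a majority rather than unanimity) would only change the two comparisons inside the loop.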

 

This may result in the creation of two "simple codebook" views and appropriate names should be determined.

 

Discussion:

 

Given the range of use cases (something above a simple data set to a simple study housed in an archive) it is difficult to determine what is meant by "simple". Rather than discuss in the abstract it may be helpful to get a list of objects one would like to see in a simple codebook from the members of the group and then identify those objects that are considered to be the minimum requirement for publication. This may result in two levels for a simple codebook (basic and intermediate) but the approach would provide clear information on where there is consensus and where there is debate.

 

Statements that may help define the differences between these two levels:

 
  • The bare minimum needed in order to publish (basic)

  • What would you like to see in this view (intermediate)?

 

There has been a shift from the initial content creation in Drupal of a simple codebook "package" to the idea of a "view" and we need to reorient the Drupal content to this shift. In addition, packages and views relating to the simple codebook view that were not in existence when the work of this group was started are now more fully defined. The content of these packages and views needs to be considered when defining the view(s) of a simple codebook.

 

View orientation is liberating

 
  • A view contains objects (it is not a compilation of views)

  • A view (specific version) may partially or fully support another view - the intent to do this should be noted in the description of the new view

 

The following process could be useful in defining the view(s) for a simple codebook:

 

Creating the list of objects for a simple codebook:

 
  • Start with Wolfgang's list as an example (normalized version of this list)

  • What would you add?

  • What would you like?

  • What is required vs. what is optional (simple to intermediate)?

 

Create a view of Simple codebook in Drupal - using the final agreed upon list of a view

 

Note: Some of the objects being included are complex objects. These should then be reviewed to see if a simpler basic object of that type is needed. (I.e. we may only want to include a "stripped down" version in the view)

 

Steve will have a go at normalizing the list and send it out to the group

 

Wolfgang can then enforce getting responses.

 

Meeting in two weeks:

 
  • this week if possible for list out

  • wish list turnaround

  • may want to delay next meeting until after due date for getting lists back from members