
 Simple Codebook View Team


Mary to spearhead proposal for Dagstuhl that includes requirements for Simple Codebook

Mary to contact Oliver regarding the Simple Codebook view on Drupal


Simple Codebook Team Minutes
2014 11 10

Dan Gillman, Larry Hoyle, Jenny Linnerud, Steve McEachern, Wendy Thomas, Mary Vardigan, Wolfgang Zenk-Moeltgen

Mappings

There is a spreadsheet for the archival codebook use case that lists elements used by ICPSR, CESSDA, and IHSN. Wendy will map the IHSN codebook elements to DDI 3.2 and send this to the group.

Package vs. View

The group also looked at the Simple Codebook package on the Lion Drupal site. The question was raised of whether we should still work on the Lion site since the simple codebook is still a package and not a view.

All Discovery information came from the Disco specification. We should model our own objects for Simple Codebook and then map to Disco. There hasn’t yet been a discussion about what is a property and what is an object.

In terms of information elements needed for codebooks, there are more things in the package than in the spreadsheet because we copied the DDI 3.2 elements and didn’t delete anything with the idea that we would need the other elements later. We moved all the objects specified as Keep from DDI 3.2 into the package.

The content groups should create a complete list of the information elements needed and then the modelers will arrange this into packages and make decisions about objects vs. properties.

Should we add all elements from package into the view? Right now we should not put in anything we are not using. We are trying to start with the essential and then add onto it.

Things go in the view on Lion if we want them in the simple DDI 4 codebook. We are compiling the list of elements used by ICPSR, CESSDA, and IHSN for the basic set of elements. The package that Wolfgang created at Dagstuhl contains additional things not in the spreadsheet. Everything from the Excel list should go into the view.

Graphs are only created for packages and not for views. This is something we will miss if we work only in the view. It doesn’t make sense to work on the Lion site until basic processes have been defined. How do we capture the results of this group?

After the EDDI sprint the whole process should be working and documentation will be produced from a view and a diagram will be produced from the nightly build. Right now things in other groups’ work (e.g., instrument) are not linked in to our codebook set of elements.

Things that need to be fixed on Lion include:

a. Arrows on aggregation and composition are the wrong way round

b. On each object the current DDI 3.2 and GSIM fields should also be rendered in View mode (they are currently only visible in Edit mode)

c. No graph appears for View, only a flat list

Wendy will relay these points to the modelers and the Lion maintainers.

Composite View Modifications

If you take elements that have been defined elsewhere in your composite view, you get everything but you may not want everything. You should be able to make a simple codebook. Someone needs to remodel it so that we can take just the portion we want.

We want to confirm what we need in our use case through the Excel spreadsheet. We need to draw a line for our first proposal for our simple codebook. The more things we decide are properties of an object, the more remodeling we will have to do when people want only some properties. Does this argue for making more things objects rather than properties? This is a tough call as we want to decrease the number of elements.

Working with Data Description

We should take a look at the Data Description model that came out of Dagstuhl and use that as a test because the current discussion about Datum is going to change things. What came out of Dagstuhl has had a lot of review and is considered solid. On Lion, what’s there now is the representation of what was decided at Dagstuhl and is up to date. Steve can compile this in a straightforward way and generate a view which is all the objects that will be used by Data Description. All the relevant content is in Lion.

First we need to look at the Excel list and compile the data description level and then look at the View for Data Description to make sure this matches what we need.

Next Meeting

The next meeting will be on November 24.

Actions

 

  • Wendy will complete spreadsheet with information for IHSN

  • The group will pull out the needed elements at the data description level in the Excel sheet

  • Steve will create a view of Data Description (we can flatten this if needed)

  • The group can compare the use case elements to what is in Data Description


Simple Codebook Meeting Minutes
November 23, 2014

 

Present: Dan Gillman, Steve McEachern, Mary Vardigan

Meeting Times

The current time is midnight for Canberra, so we need to find another meeting time. 2pm EST U.S. time is the preferred time for the new year.

DDI 3.2 vs. 4

We are thinking in terms of forward compatibility so that everything in 3.* is covered in 4. This is not the best approach. Rather, we should solve the problem we want to solve and then worry about how to map it after we have solved it.

Framing happens unconsciously -- the circumstances in which you think about a problem constrain the way you conceive it.

Still it’s worth having a look at what we have right now to see what the overlap is.

By sticking with the nicely defined distinction between logical and physical we can be more precise going forward.

There is not too much not actually covered in 4 but it is going to be reorganized.

Next Steps

Steve will compare the spreadsheet to Data Description in 4 to determine how they map and overlap.


Simple Codebook Meeting Minutes
February 2, 2015

 

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The Simple Codebook committee will now be chaired by Dan Gillman as Wolfgang is not able to chair currently.

This group has been in a holding pattern because we are waiting on the results of other groups. However, it was suggested that we look at the Codebook 2.5 (Codebook Version) in comparison to DDI 3.* (Lifecycle Version).

XML permits a detailed description of elements and this is part of the distinction between 2 and 3. But UML doesn't allow this and doesn't account for nesting and levels of detail. We should try to incorporate what is in Version 2 into the model as best we can. We as a group should try to build this. One further possible advantage is that we could then have a single model to account for both Codebook and Lifecycle. Both views would be under one spec in this approach.

Is referencing and reusability a distinction between the two versions that we should take into account? Should it be communicated to the modeling team that we may not need the complexity?

For users who want to describe their data, they should be able to write a description and fit it into a framework. If you want to have interoperability with other systems, then that is a different issue.

For the standalone one-off research project, users will not need to be reusing variables and questions, but for longitudinal and research across languages and cultures, this is important; there is a need to harmonize across questionnaires, reuse metadata across time, etc. Maybe this is Complex Codebook?

We need a distinction between the user perspective and the technical perspective. Simple and complex need to be interoperable. It's necessary to reduce the complexity of what is modeled in the library by choosing the simple cases.

One of the decisions for DDI 4 is to make everything identifiable and drop the container aspect of identifiability. This takes away a lot of the complexity.

From a marketing perspective, we need to distinguish between the DDI Codebook version and the Simple Codebook view. Looking at what is in 2 now will be required and we need to lay out what we need to account for. In the study section for DDI Codebook, there were a number of elements that allowed you to provide a high level text description of various methodological things. Preserving that is important.

Capturing what is in an SPSS or SAS representation, including all the metadata you can put there, is also important. When you move data around, you don't want to lose anything. When you look at how researchers want to record information, it is often difficult for them to record things in detail. Guided structures as part of their workflow are important, and Codebook is one view that could help them with this. You need some structure that becomes machine-actionable. You don't want people to just write a narrative.

At BLS, there is a Handbook of Methods. It has narrative descriptions of the surveys BLS does and it doesn't have a lot of detail. This should be captured in DDI rather than in a PDF. There is a need for high level and detailed as well. There may still be a need for some kind of a DDI Lite as a way of inducing reluctant data producers to get involved. For variables the detail is necessary. We should make this as flexible as we can.

We can start by looking at what is in 2.5 and figure out from the point of view of a list of what we need to account for. This would be a set of requirements that we as a group need to figure out how to solve. One question we want to address from a modeling point of view is, for example, when we need to say how the sample is constructed: Would those higher level descriptions go in a class of things that are independent of everything else or part of a sampling class? These are design issues that might have an impact on the way the more detailed model plays out.

If we can manage both 2 and 3 in the same structure we as a standards body will have an easier time with this. We should consult with Wendy on this.

Several archives still rely on DDI Codebook, Nesstar, etc. There is a set of codebook specs from different archives.

Are we talking about having our Simple Codebook view covering everything that is in 2.5? It should be even less. But should there be a view that is everything in 2.5? One idea is a view that is a really simple codebook but to allow for complexity in any direction you would like to go so we could incorporate everything that is in 2.5. Or go into more detail in 3 for whatever direction you want to go so there is a seamless distinction between high and detailed levels. This is basically what DDI 4 is. We should provide a lot of different options about how much detail the user wants. With 4 right now we have detailed descriptions of a lot of things but we are not allowing for high level descriptions. The description and definition were discussed in London with respect to Drupal in the sense that there could be radio buttons to indicate that they should be used to standardize those objects. It could be possible to have a description without any usage of detailed sub-elements.

There could be an attribute that could be high-level description. Or we have an element saying this is the Sample Description. Just having an element called description associated with identifiable objects may not be sufficient. In the annotated identifiable there is an annotation element that has Dublin Core properties like Title, Contributor, etc. It has an abstract. But there is nothing that is a high-level description.

On the one hand it might be nice to have a Sampling Description, but it might be over-specified. It's important to have an element dedicated to a high-level description that you are offering in place of the detail or as a supplement to the detail. A general description like the annotation will lose semantic interoperability. We need machine-interpretability. We also want the possibility to reference just the high level description in the simple codebook.

We should be able to allow for user-defined views that provide for whatever level of detail an organization uses. A Simple Codebook view that maps back to 2.5 would be useful. It would allow those organizations just using 2 to feel comfortable using 4.

DDI 4 does not have the same hierarchy as DDI 3. We would still need an object carrying high level content for the sampling process and nothing else. In 3 there was a parent node but we don't have this structure in DDI 4, which means you need to create a container for this description. It's not a question of using description as a property containing the text, but which element carries the description.

Between now and the next meeting, Oliver will make some slides with an example of what we have been talking about. We also need to dig into DDI 2.5 to get a handle on what is needed at the higher level. Dan and Larry will look at this. Dan will also consult with Wendy on this.

 

 


Simple Codebook Meeting
February 16, 2015

Present: Dan Gillman, Larry Hoyle, Steve McEachern, Mary Vardigan

Completeness of cross walk between 2 and 3

The crosswalk or mapping is essentially one-way from 2 to 3. Codebook doesn't have the reusability that Lifecycle does. This is the same issue as between SPSS and Stata/SAS. We should look at the mapping in more detail.

Content and functionality of Simple Codebook

We want to make sure that Simple Codebook lets us write or ingest 2.x fairly seamlessly. Are the same kinds of element names available in 3? The names change even at the highest level.

Many miss the Tag Library as it was so simple. This kind of resource would be useful along with a mapping. However, Wendy advises that we don't have to worry about 2 since the mapping is there.

Even 2 has a lot of content. Are we still talking about a simple codebook as opposed to a complex codebook? Simple should allow you to take information from a major statistical package and move to another without losing any information (this is our definition of simple). Questions should be included, as should sampling and universe. We should review DDI Lite and DDI Core, which have not been updated to the most recent versions of Codebook and Lifecycle. This may enable us to have a framework for content. We will deal with functionality later.

We have been making the assumption that we have the Instrument information and the Data Description information from those two views. What else do we need? We need context information or study level – Universe, sampling, design, bibliographic information. In DDI 2.* we have Citation, Study information (which is discovery related), Methodology, and Access. This is good content.

What do you need to know to use the data? You need the variable information. Question order and the way questions are asked may be important.

There is a tension between being very simple and following best practice for good documentation. Can we add pointers to relevant information? The simple/complex distinction is levels of detail.

For secondary users, we need enough information for a researcher to be able to understand and evaluate the quality of a dataset without reference back to the original data producer. We also need enough information to pull the data into a statistical package.

We started an exercise to take the common set of CESSDA, ICPSR, and IHSN mandatory schemas, and figure out what is the superset. The spreadsheet can be found in the attachments on the page: Simple Codebook View Team. We should compare this set of elements to what is in DDI Lite and DDI Core.
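The comparison described above comes down to set operations over element lists. The following is a minimal sketch only: the element names are hypothetical placeholders, and the real lists are the ones in the group's spreadsheet.

```python
# Hypothetical element lists for illustration; the real lists are in the
# spreadsheet attached to the Simple Codebook View Team page.
profiles = {
    "CESSDA": {"Title", "Abstract", "Universe", "SamplingProcedure"},
    "ICPSR": {"Title", "Abstract", "Universe", "Weighting"},
    "IHSN": {"Title", "Abstract", "SamplingProcedure", "Weighting"},
}


def superset(profiles):
    """Every element used by at least one profile."""
    return set().union(*profiles.values())


def common(profiles):
    """Elements used by all profiles."""
    return set.intersection(*profiles.values())


def usage(profiles):
    """Map each element to the sorted list of profiles that use it."""
    table = {}
    for name, elements in profiles.items():
        for element in elements:
            table.setdefault(element, []).append(name)
    return {element: sorted(names) for element, names in table.items()}
```

The same functions could be pointed at DDI Lite and DDI Core lists to see which elements fall outside the superset.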

Necessary for a simple codebook: variables and questions and layout; universe or population; level of geography (basically coverage, including temporal and subject); sampling or weights (with a pointer to a thorough description of sampling).

The distinction between simple and complex for data description is between a simple rectangular file and other data types; this applies to codebook in some ways as well. But there may be a cascading effect if we limit ourselves to simple rectangular files (we should describe hierarchical files as well, like CPS). You can have hierarchical data in CSV with a record type field but historically we have had files with physical representations of the data that are esoteric. How much of this do we need to handle? For a simple codebook, the simple representation should be limited to Unicode or something like that.

Homework: review DDI Core: http://www.ddialliance.org/sites/default/files/ddi3/DDI3_CR3_Core.xml and DDI Lite: http://www.ddialliance.org/sites/default/files/ddi-lite.html

And think about what limitations we want to put on format to keep the idea of simple codebook but to keep it rich enough so we are covering enough situations.

The next meeting will be in two weeks on March 2.


Simple Codebook Meeting
March 2, 2015

Present: Michelle Edwards, Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group welcomed Michelle Edwards of CISER. The chair noted that this group is in a sense waiting for other groups (Discovery, Data Description, Instrument) to complete what they are doing so that we can finish our work. We recognize a need to incorporate both Codebook and Lifecycle into one spec (DDI 4), so we have been exploring that in our group a bit.

DDI Lite was reviewed and compared with the element sets that ICPSR, GESIS, and IHSN use and they are a fairly good match.

We won't be able to exactly duplicate Codebook and Lifecycle as views of DDI 4 but we can get close. Organizations that have invested in 3.2 do not want to lose that investment. Can we map 3.2 to 4 by automatically importing what's in 3.2? We may need a conversation with Guillaume about this. This should probably be at the Advisory Group level.

DDI Codebook and Lifecycle have different names for the same element. We will need mappings for people.

What we write out is also important. Interoperability can be defined in terms of reading and writing out of a system. If we can read 2.5 into 4, we are able to ingest anything that occurs anywhere under 2.5. We want to be able to write an instance that contains all the semantic content of Codebook. If we know that there is an equivalence we should have a 2.5 writer to write it out in that name. It is the structure and the mappings that matter.
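The read-and-write idea above can be illustrated with a name crosswalk applied in both directions. This is a sketch under stated assumptions: the left-hand names follow DDI Codebook 2.5 style, while the right-hand DDI 4 names are hypothetical placeholders rather than names taken from the model, and a flat dictionary stands in for a real XML instance.

```python
# Hypothetical crosswalk: Codebook 2.5 element name -> DDI 4 name.
# The DDI 4 names are placeholders for illustration only.
CODEBOOK_TO_DDI4 = {
    "titl": "Title",
    "AuthEnty": "Creator",
    "abstract": "Abstract",
    "universe": "Universe",
}
# Inverse mapping used by the writer; valid only while the crosswalk is one-to-one.
DDI4_TO_CODEBOOK = {v: k for k, v in CODEBOOK_TO_DDI4.items()}


def read_codebook(record):
    """Rename 2.5 elements to DDI 4 names; return (mapped, unmapped)."""
    mapped, unmapped = {}, {}
    for name, value in record.items():
        if name in CODEBOOK_TO_DDI4:
            mapped[CODEBOOK_TO_DDI4[name]] = value
        else:
            unmapped[name] = value  # gaps to report back to the modelers
    return mapped, unmapped


def write_codebook(ddi4_record):
    """Write DDI 4 content back out under its 2.5 names."""
    return {
        DDI4_TO_CODEBOOK[name]: value
        for name, value in ddi4_record.items()
        if name in DDI4_TO_CODEBOOK
    }
```

Anything that lands in the unmapped result marks an element the crosswalk does not yet account for, which is the kind of gap the group wants to identify.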

There were changes between Codebook and Lifecycle that were not necessarily clean because of the use of things by reference in 3 (categories and codes). Upward compatibility may be tougher than downward compatibility. We should probably not worry about 3 here but concern ourselves with mapping 2.5 into 4.

Is Codebook still an aggregation of Discovery, Description, and Instrument? Right now Discovery is a stripped down element set.

We could start with 2.5 as a starting point and we need to be able to account for this. Then we could look at 4 and ask whether everything is covered. Can we restrict this to 2.5 Lite? Generally, yes.

A Codebook view would be intended for an audience that is creating or managing codebooks and it doesn't matter what things are in other views or packages.

Views can overlap as much as you want. DDI Lite is a view. DDI 2.5 is a view. We are leveraging the experience of repositories (ICPSR, GESIS, IHSN) in serving up data, so that makes a good codebook. It makes sense to rely on DDI Lite, which we know is used.

The group reviewed the elements in DDI Lite. ADA uses a few other elements like deposit date, alternative title, collection situation, etc. ADA uses the default Nesstar template which is close to DDI Lite. We should look at Nesstar also. The CESSDA Profile would be the best thing to use. We need to identify where things are already defined in 4 and where things still need to be defined in 4. We need to know what is missing from 4 in order to have a sense of where we stand. Our group could then go to the AG to say what needs to be addressed in sprints.

If we have something in 4 that maps to Nesstar/CESSDA profile, that allows a big chunk of DDI users to adopt 4. There is another migration path we can look at: we have 2.5 codebook - is there a more modern one? Migrate 2.5 to something different? This may be out of scope for our group but we should discuss it.

 

 


Simple Codebook Meeting

March 16, 2015

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The agenda for the meeting was to determine if all elements in the CESSDA profile/Nesstar profile are present in DDI 4. Larry Hoyle had created a spreadsheet of DDI Lite and the list of elements from CESSDA profiles. There seems to be a wide variety of the selection of the elements and attributes in the repositories using DDI Lite. The Nesstar Webview comes as the base. The group compared elements used across different repositories.

The task was to find out which elements are in DDI4, so the group decided to divide up the list of 200+ elements. There appear not to be any DDI4 elements about the metadata itself, the DDI document. It basically parallels the study description information. This may not be relevant for DDI4. Perhaps the Data Citation group should think about this. This is often the archive's intellectual property, so some representation of it will be of interest to most of the archives. Citing the user guide or documentation is a common practice.

DDI Codebook has some elements of description that DDI4 has not been talking about. We need to bring forth something to the Advisory Group about this – this is an issue that we need to discuss. In DDI Lifecycle there is the corresponding instance with a citation on it. There is no DDI4 instance because instance is a root element for documents in general.

Will the idea of a document description disappear in 4? The archive creates a document describing the data. The landing page is sometimes (always?) metadata.

Study level, variable level, record level, file level: should the Data Citation group look at what are targets of citation?

In DDI Codebook, we have DocumentDescription; in DDI Lifecycle we have DDIInstance. Should DDIInstance be brought back into DDI4? – with revised content but allowing attachment of annotation.

Being able to point to an XML file with the model and generate that file from elements in 4 is adequate. But it is no longer enough to point to one object that contains everything.

We have the logical vs. physical distinction. A DDIInstance is a physical thing, something that's there. Pulling together the information into that representation is an activity with Authors, etc. There can be the "same" content in two archives, with different contact people and different URIs for each. This is parallel to data description.

Assignments for the next meeting

Where in DDI4 does each of these elements exist?

FirstLine  LastLine  N   Who       Content
70         101       31  Dan       Citation
102        131       29  Steve     Scope Methodology
132        155       23  Oliver    Access Conditions
156        184       28  Larry     File Variable
185        205       20  Mary      VarDoc
206        232       26  Michelle  CategoryGroups OtherMaterial

 

 


Simple Codebook Meeting
2015-04-13

The meeting focussed on reviewing the next set of metadata elements from DDI-C - those covered by Steve.

Steve had created an additional three columns to his copy of the spreadsheet for his work - adding:

  1. Package (for elements already matched in DDI4)
  2. Suggested Package (for elements that have no match)
  3. the DDI-C definition.

These additional columns have now been added to the Google Spreadsheet - linked here.

The discussion then focussed on the elements. Notes on specific elements are included in the spreadsheet, and summarised below.

Elements

Source

  • For example, for a digitized statistical abstract, the source is the original print publication; for administrative data, it is the original administrative program. A simple version of provenance.

Geographic unit

  • “Lowest level of geographic aggregation covered by the data.”
  • Would GeographicLevels (plural) be better, to indicate that multiple levels can be used? Is GeographicLevel a better term than GeographicUnit?

Control operations

  • Description of what was done; the data collection process.

General comments and issues

It was noted that much of the methodology section of DDI-C was not yet covered in DDI4. Part of this will be addressed by the Methodology working group.

There is however a set of elements that are not really methodology (or at least the research design), but rather are descriptive of the process and outcome of the execution of the methodology. These elements might most appropriately fall under the heading of "Fieldwork". Examples from DDI-C include:

  • CollectionSituation
  • MinimizeLossActions
  • ControlOperations

and, notably, RESPONSE RATE.

The group was concerned that it is unclear how we might provide recommendations here. For example, what is meant by ResponseRate: the “opposite of rate of refusal,” or other types? It was also recognised that this is not really part of methodology, but has an impact on methodology, as well as on analysis and post processing. For example, was there an intervention based on low response rate? These are fieldwork issues.

On similar lines, there was a recognition that Methodology is the ugly part of DDI Codebook. Dan suggested that this section may be in need of a significant revamp, given the developments in survey methodology that have occurred since the original development of DDI-C, in particular the Total Survey Error framework.

It was noted that these issues with Methodology and Fieldwork need to be raised with the AG sooner rather than later, as they have resource and workload implications for the Moving Forward program. Steve will write something up on this and distribute to the group, prior to sending to the AG.

 

 


Simple Codebook Meeting
April 27, 2015

Present: Michelle Edwards, Dan Gillman, Oliver Hopt, Steve McEachern, Mary Vardigan

The group went back to the mapping between DDI Codebook and what is in DDI4. In terms of Access Conditions, there is an Access module in Discovery, where it is streamlined. It looks as if availability and use statements are not included; everything is structured string. We might look at SAML or another controlled vocabulary for access control like XACML (Extensible Access Control Markup Language). The issue is whether the outside source maintains previous versions, which we don't have control over.

In terms of Other Material, this was all found in DDI4 except for the Other Material table. This was part of DDI Codebook to mark up a table for presentation. In terms of VarDoc version, none of that was in DDI4. In DDI4 versioning is done at a low level, so this is taken care of at a level of the model that is not about particular content but about everything – Identifiable and Annotated Identifiable. There is an ID and a version. The question is that in Codebook the description is applied against Variable; in DDI4 identification applies broadly.

The group traced identification through the DDI4 model and looked at Collections and Members. Version Type in DDI Codebook does not seem to be covered, but no one is using this. Type seems no longer relevant and related to documents rather than to elements. People who understand this element from the old way of thinking have to know that the idea of a version is being expanded. We need to table this for now but are leaning toward deprecating this element.

Coding Instructions probably maps to Fieldwork and Methodology, which we don't have yet in DDI4.

 

 


Simple Codebook Meeting
May 11, 2015

Present: Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group continued its review of the mapping between DDI Codebook and DDI 4 – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646.

The group returned to the elements regarding availability and access. There is currently no archive information in DDI4 and this needs to be modeled, perhaps at the upcoming sprint. In terms of the use statement, some is not covered in the access object in Discovery in DDI4. This needs to be modeled also. SAML isn't useful for us because it is too high level. Both data and metadata may need something attached. We might look at this in the Datum discussion (not only columns but rows) and also attaching things to the metadata to control access. This might be like annotations where it can be attached to anything – access could have a relationship to annotated identifiable. Then any object could have an access control. From access description to object could be another solution. This could make sense because an object could have different access policies when stored in different archives. This should be discussed at the sprint also. There is an Access Control XML language that we looked at but didn't decide on. Michelle will be representing CISER at the sprint and can express their needs in this area.

In terms of Imputation, it is now the same as it has been in 3. Generation Instructions and General Instructions seem to have the same text. We need some clarification from Wendy on this. They can describe an Imputation procedure. This has not yet been brought up in 4. This would be methodology or fieldwork. It is in the Processing package now. We need clarification at the sprint.

Security in variable relates to the discussion above. 3.2 doesn't do much at the row level but this is becoming a requirement.

Embargo is in Simple Codebook, but this is basically a set of placeholders right now. This should be part of the Access Rights discussion at the sprint so we do this consistently. Where should this come from? A use case or the modeling team proposing an approach. We probably need both directions. Maybe two use cases – one from Bill for metadata and one from Ornulf for data.

Response Unit is not yet modeled and will come up in complex instrument. This can be at the study and variable level. An equivalent should be covered in methodology.

For question elements, there is a container in Data Capture that will work for this and allow you to instantiate pre-, post-, and literal question as well as interviewer instructions. Statement is the container.

In terms of invalid range, this is in Simple Codebook. How are we tying this to missing? In 3.2 and in Simple Codebook in 4 you can point to a managed missing values representation and in that you can do ranges. You can do things like from this value to that value is a missing value. This is there by virtue of having been brought over from 3.2. The ISO 11404 notion of sentinel value (each instance variable has a set of such values but it might point to the same represented variable) has been modeled to allow for the valid set of data to be handled in different statistical packages. You have to represent the semantics in different ways. The Data Description group should handle this.
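The "from this value to that value is a missing value" idea can be sketched as a small range check. This is an illustration only: the class name, the function name, and the example codes are hypothetical and do not come from DDI 3.2 or the DDI 4 model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissingRange:
    """Hypothetical inclusive range of sentinel values that mean 'missing'."""
    low: float
    high: float
    label: str

    def contains(self, value):
        return self.low <= value <= self.high


def classify(value, missing_ranges):
    """Return ('missing', label) if value falls in any range, else ('valid', None)."""
    for rng in missing_ranges:
        if rng.contains(value):
            return ("missing", rng.label)
    return ("valid", None)


# Example codes are invented for illustration.
ranges = [
    MissingRange(-9, -7, "refused or not asked"),
    MissingRange(998, 999, "don't know"),
]
```

Because the ranges sit apart from the values themselves, one list of ranges could be shared by several instance variables, in the spirit of a managed missing values representation.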

Undocumented Codes – they should have had a label but didn't get documented. Codebook is the obvious group to handle this.

Total Responses is another part of the documentation for variable and should be handled by Codebook. This is handled with a controlled vocabulary when you say what type of statistic it is.

Summary Statistics is in Complex Data Type. They are not in the Simple Codebook view now but that hasn't been built out yet and we would need to include them in the view.

In terms of Descriptive Text, all the variables in 4 inherit Description as members.


Simple Codebook
June 8, 2015

Present: Dan Gillman, Larry Hoyle, Jenny Linnerud, Oliver Hopt, Mary Vardigan

The group continued to review the spreadsheet mapping DDI 2.* to DDI4 and noting items that the modeling should take up.

Then the group turned to the metadata that the statistical packages include. Larry provided a spreadsheet that he and Achim had developed to show which metadata were included in each of the major statistical packages. It will be important for Codebook to contain all of this metadata. There are other ways of handling data, like SQL, that might also be appropriate. In the Big Data world, Python is becoming popular. Python is a general scripting language and has taken over the role that Perl had at one point. You can explicitly represent trees like JSON and XML, so it is very flexible. People have developed modules that do statistical kinds of things with Python.

Looking at all the software metadata from the statistical point of view is important. We need to make sure that everything in Larry's spreadsheet is accounted for in a meaningful way. We need to identify things that are not in the DDI 2.* spreadsheet. We can go through this all together or do assignments.

The number of significant digits is important in some scientific data. Whether the number has been rounded can also be important. This should be included in DDI4. In the ISO/IEC 11179 community, there was a discussion of accuracy and precision, which is related to significant digits. The Data Description Team should address this. In an Instance Variable we may want to talk about significant digits, while for a Represented Variable we talk about accuracy.

Larry and Dan will talk with the Data Description and Modeling teams about these issues.

 

 

 

 

 

...

Expand
titleJune 30, 2014
 

Meeting: 2014-06-30

 

Attending: Guillaume Duffes, Dan Gillman, Larry Hoyle, Ørnulf Risnes, Steve McEachern, Wendy Thomas

 

Reviewed list of related package and view content from Wolfgang

 

Decisions:

 

There is currently a lot of duplication in the list and it needs to be normalized prior to review.

 

Steve will normalize the list and send it out to members later this week with the following instructions:

 

Review the list and do the following:

 
  1. Add any unlisted objects that you would expect to find in a basic or simple codebook

  2. For each item indicate if the item is one which would be required in order to publish the codebook or is one that would be useful to have in the codebook

  3. Return your review to the group.

 

 

 

Unless other agenda items arise, schedule the next meeting after the deadline for returning reviews.

 

Process:

 
  • Items that have agreement in terms of "required" will go into a basic view

  • Items that have agreement in terms of "would like to see" will go into an "intermediate" view

  • Items without agreement will be discussed and assigned during the next meeting

 

This may result in the creation of two "simple codebook" views and appropriate names should be determined.

 

Discussion:

 

Given the range of use cases (something above a simple data set to a simple study housed in an archive) it is difficult to determine what is meant by "simple". Rather than discuss in the abstract, it may be helpful to get a list of objects one would like to see in a simple codebook from the members of the group and then identify those objects that are considered to be the minimum requirement for publication. This may result in two levels for a simple codebook (basic and intermediate), but the approach would provide clear information on where there is consensus and where there is debate.

 

Statements that may help define the differences between these two levels:

 
  • The bare minimum needed in order to publish (basic)

  • What would you like to see in this view (intermediate)?

 

There has been a shift from the initial content creation in Drupal of a simple codebook "package" to the idea of a "view" and we need to reorient the Drupal content to this shift. In addition, packages and views relating to the simple codebook view that were not in existence when the work of this group was started are now more fully defined. The content of these packages and views needs to be considered when defining the view(s) of a simple codebook.

 

View orientation is liberating

 
  • A view contains objects (it is not a compilation of views)

  • A view (specific version) may partially or fully support another view - the intent to do this should be noted in the description of the new view

 

The following process could be useful in defining the view(s) for a simple codebook:

 

Creating the list of objects for a simple codebook:

 
  • Start with Wolfgang's list as an example (the normalized version of this list)

  • What would you add?

  • What would you like?

  • What is required vs. what is optional (simple to intermediate)?

 

Create a view of Simple codebook in Drupal - using the final agreed upon list of a view

 

Note: Some of the objects being included are complex objects. These should then be reviewed to see if a simpler basic object of that type is needed. (I.e. we may only want to include a "stripped down" version in the view)

 

Steve will have a go at normalizing the list and send it out to the group

 

Wolfgang can then enforce getting responses.

 

Meeting in two weeks:

 
  • this week if possible for list out

  • wish list turnaround

  • may want to delay next meeting until after due date for getting lists back from members

 

 

Expand
titleDagstuhl Sprint Oct 2014

Minutes from Dagstuhl Sprint 2014 Working Group

...

titleSeptember 15, 2014

 

Simple Codebook Meeting
September 15, 2014

 

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Jenny Linnerud, Steve McEachern, Ørnulf Risnes, Wendy Thomas, Mary Vardigan

Discussion

The group affirmed Wendy’s definition of a codebook (See Appendix A for the full document):

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

The group discussed overlap with other groups and packages since codebook is a compilation of other packages. Simple Codebook is most likely a compilation of Conceptual, Simple Data Description, Discovery, and additional information that facilitates interpretation of the data and intelligent use. The difficulty is determining what depth of information is appropriate. For replication purposes, you need a lot of detail.

The Simple Data Description group is first focusing on data description in a broad way and will then define a subset for “simple.” Perhaps this group should do the same.

It would be helpful to have reports from other groups so that we know where they are and what makes sense to combine for simple codebook.

In Wendy’s list (Appendix A), much of the content we need is covered by other groups, but we could use more detail in Data Source, Data Processing, and Methodology. Methodology framed its scope broadly in Toronto but hasn’t yet met as a group. One activity for that group would be to review the sampling and weighting specifications that came out of the Survey Design and Implementation working group to see what is needed beyond that work.

Next Meeting

The group will meet again on Monday, September 29, to get reports from other groups.

Appendix A

What is a codebook?

[also referred to by DataONE as science metadata for science data]

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

Discussion

The definitions below for "codebook" are survey-centric when referring to the broader set of metadata related to a data file. Another term may be preferable but there isn't one that leaps to mind. Whether called a codebook, science metadata, metadata, or something else, data files have 2 levels of description:

·         A structured physical description that supports the ability of the programmer to access the data accurately

·         Supporting information that allows the researcher to evaluate “fitness of use” of the data to a particular research question, the overall quality of the data, and the specifics of the conceptual (objects, universe/population, conceptual definitions, spatial and temporal) coverage. This information may be applicable to the study as a whole or to the individual variable. This also includes information on why and how the data were captured, processed, and preserved.

 

...

Type of information | Basic Codebook | Survey | Fauna (Wildlife)

Data structure:
·         Record type
·         Record layout
·         Record relationship
·         Data type
·         Valid values
·         Invalid values
Basic Codebook: Structured metadata to support access
Survey: Structured metadata to support access
Fauna (Wildlife): Structured metadata to support access

Data source:
·         Why was data collected
·         How was data collected
·         Who collected the data
·         The universe or population and how it was identified and selected
Basic Codebook: Descriptive to support assessment of quality and fitness-for-use
Survey: Purpose of the survey; Survey content and flow (may or may not need to be actionable); identification and sampling of survey population (may or may not need to be actionable for replication purposes)
Fauna (Wildlife): Purpose of study, how data was collected (may need to be actionable to support replication and/or calibration); identification and sampling of survey population (may or may not need to be actionable for replication purposes)

Data processing:
·         Data capture process
·         Validation
·         Quality control
·         Normalizing, coding, derivations
·         Protection (confidentiality, suppression, interpolation, embargo, etc.)
Basic Codebook: Informational material; support provenance
Survey: May need structured metadata for purposes of replication; Include processes, background information, proposed, actual, and implications for data
Fauna (Wildlife): May need structured to support mechanical capture instruments, calibrations, situational variants, etc.

Discovery information:
·         Who
·         What
·         When
·         Why
·         Coverage
o   Topical
o   Temporal
o   Spatial
Basic Codebook: Structured metadata to support discovery and access to the data as a whole
Survey: Structured metadata to support discovery and access to the data as a whole
Fauna (Wildlife): Structured metadata to support discovery and access to the data as a whole

Conceptual basis
·         Object
·         Concept
Basic Codebook: Informational material
Survey: Structured to support analysis of change over time and relationship between studies. May just be descriptive / informational.
Fauna (Wildlife): Structured to support genre level comparison (heavy use of common taxonomies, etc.)

Methodologies employed
Basic Codebook: Informational material
Survey: Structured to support replication and comparison between studies
Fauna (Wildlife): Structured to support replication and comparison between studies

Related materials of relevance to data
Basic Codebook: Informational material

...

Definitions

Data Dictionary

·         A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format." The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

·         A document describing a database or collection of databases

·         An integral component of a DBMS that is required to determine its structure

·         A piece of middleware that extends or supplants the native data dictionary of a DBMS

·         Database about a database. A data dictionary defines the structure of the database itself (not that of the data held in the database) and is used in control and maintenance of large databases. Among other items of information, it records (1) what data is stored, (2) name, description, and characteristics of each data element, (3) types of relationships between data elements, (4) access rights and frequency of access. Also called system dictionary when used in the context of a system design. Read more: http://www.businessdictionary.com/definition/data-dictionary.html#ixzz3Am5wCgZI

·         A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. (Posted by Margaret Rouse @ WhatIs.com)

Codebook

What is a codebook? (http://www.sscnet.ucla.edu/issr/da/tutor/tutcode.htm)

A codebook describes and documents the questions asked or items collected in a survey. Codebooks and study documentation will provide you with crucial details to help you decide whether or not a particular data collection will be useful in your research. The codebook will describe the subject of the survey or data collection, the sample and how it was constructed, and how the data were coded, entered, and processed.  The questionnaire or survey instrument will be included along with a description or layout of how the data file is organized.  Some codebooks are available electronically, and you can read them on your computer screen, download them to your machine, or print them out. Others are not electronic and must be used in a library or archive, or, depending on copyright, photocopied if you want your own for personal use.

Codebook : Lisa Carley-Baxter (http://srmo.sagepub.com/view/encyclopedia-of-survey-research-methods/n69.xml)

Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file. Data files usually contain one line for each observation, such as a record or person (also called a "respondent"). Each column generally represents a single variable; however, one variable may span several columns. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean. Codebooks are used to document the values associated with the answer options for a given survey question. Each answer category is given a unique numeric value, and these unique numeric values are then used by researchers in their analysis of the ...

Codebook (Wikipedia.com)

A codebook is a type of document used for gathering and storing codes. Originally codebooks were often literally books, but today codebook is a byword for the complete record of a series of codes, regardless of physical format.

ICPSR

What is a codebook?

A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the datafiles.

While codebooks vary widely in quality and amount of information given, a typical codebook includes:

• Column locations and widths for each variable

• Definitions of different record types

• Response codes for each variable

• Codes used to indicate nonresponse and missing data

• Exact questions and skip patterns used in a survey

• Other indications of the content and characteristics of each variable

Additionally, codebooks may also contain:

• Frequencies of response

• Survey objectives

• Concept definitions

• A description of the survey design and methodology

• A copy of the survey questionnaire (if applicable)

• Information on data collection, data processing, and data quality

 

...

titleSeptember 29, 2014

Simple Codebook Meeting
September 29, 2014

Present: Jenny Linnerud, Steve McEachern, Barry Radler, Wendy Thomas, Mary Vardigan

Discussion

The group had a discussion of Simple Codebook as a compilation of different components, including Conceptual, Simple Data Description, Discovery, and Simple Instrument, as well as elements of data processing/provenance and Methodology. Steve showed a diagram he had created to show the big picture. He subsequently added boxes for Methodology and Discovery to that big picture – see DDI4_view_overview.pdf. This helps us visualize the structure of DDI 4 and also the codebook view. This diagram should be part of the upcoming Dagstuhl meeting orientation on the first day to ensure that everyone is united in their understanding of how DDI 4 works.

It was pointed out that there might be links to additional information (e.g., Methodology) in Simple Codebook but a more Complex Codebook could bring some of that information inline in a structured, machine-actionable way (e.g., routing/skip patterns through a questionnaire). The group also discussed that we need to distinguish questionnaire-centric codebooks from more generic codebooks that talk about measurement rather than variable, for example.

In Toronto the Methodology group got started but it needs more time to focus on this area. Initially, the group drew a line between design and implementation. It was pointed out that we need to separate what you want to do from what you are doing and what you have done. We also need to think about replication.

In terms of data citation, we may need to cite at a very granular level. Should Discovery be part of all views in this sense?

We should think about a set of administrative metadata that accompanies each view and describes it so there is some consistency across views. This might indicate the order that compilations should take and whether they have a logical sequence. This would be a guide to the view.

For Dagstuhl, our group should put out a requirements document indicating what we need for the Simple Codebook view, including the administrative metadata. We also want to have guidelines for future groups doing composite views.

The Simple Codebook needs a view on the Drupal server, so we will contact Oliver about that.

Action Items:

Mary to make sure Steve's diagram is part of Dagstuhl orientation

2017-11-21 Simple Codebook meeting

Participants:

Dan Gillman

Jay Greenfield

Larry Hoyle

 

Discussion:

We discussed the Codebook example Larry put together based on the Australian National Election Study. The file and another example can be found at the Simple Codebook View team page (https://ddi-alliance.atlassian.net/wiki/spaces/DDI4/pages/491676/Simple+Codebook+View+Team). The file for ANES is ANES2017.xml and the original codebook from the study is the file ADA.CODEBOOK.01259Highlighted.pdf.


The ANES example, built using Oxygen, is pretty complete. Larry led us through a discussion of each element and where each description went. We found a few things Larry forgot including Variable Groups. In the new View, these are called Variable Collections after the Collections Pattern. The new Variables Collection allows one to form a group with variables from anywhere in the cascade. So, the type (conceptual, represented, and instance) can be mixed. In DDI 2.x, this cannot be done.
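
The mixing of variable types within one collection can be pictured with a short Python sketch. This is only an illustration under assumed names: the Variable and VariableCollection classes below are simplified stand-ins, not the DDI 4 classes, and the member names are invented.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Variable:
    name: str
    level: str  # "conceptual", "represented", or "instance"


@dataclass
class VariableCollection:
    """Simplified stand-in for a collection that may mix cascade levels."""
    name: str
    members: List[Variable] = field(default_factory=list)

    def levels(self) -> Set[str]:
        return {v.level for v in self.members}


# One group drawing on all three levels of the variable cascade.
collection = VariableCollection(
    "Age-related",
    [
        Variable("Age", "conceptual"),
        Variable("AgeInYears", "represented"),
        Variable("AGE_W1", "instance"),
    ],
)

print(sorted(collection.levels()))
```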


We wondered if the name, Variables Collection, should be changed to the DDI 2.x name, Variables Group. Since the functionality has expanded, it might be good to accept the name change. The Modeling Team needs to take a look at this.


Expand
title2017_08_15 Simple Codebook Minutes

Simple Codebook Meeting 2017-08-15

Attendees: Dan Gillman, Larry Hoyle

We discussed the classes imported from DDI3.2 used for variable statistics, currently residing in the SimpleCodebook Package. After discussion we decided that these classes should be added to the SimpleCodebookView. That has been done.



Expand
title2017_03_28 Simple Codebook Minutes

Simple Codebook Meeting 2017-03-28

Attendees: Dan Gillman, Larry Hoyle, Jared Lyle, Oliver Hopt

Discussion of a Jira issue (CWG-8) concerning the physical data description within SC and its relation to DataDescriptionView


- The result was that all classes listed there should be in SimpleCodebookView, so Larry will add them.


The group looked back at the work in the Norway Sprint, which showed that we could cover all DDI-C elements with DDI-V classes.


- This might convince DDI-C users to upgrade, especially if we offer more user-friendly access to the mapping that we created.

- It will also be important to show what actual examples will look like without all the extra attributes within the classes that are not applicable within a certain view.

- This is also a strength of the views in general.

- And tools will help with hiding complexity. A tool for editing SimpleCodebookView metadata might look exactly the same as a DDI-C tool.

- From that point on, we could start to show the benefit of switching to DDI-V by making visible what would then be possible.


Expand
title2017_03_14 Simple Codebook Minutes

Simple Codebook Meeting 2017-03-14

Attendees: Dan Gillman, Larry Hoyle, Oliver Hopt


Looking into Larry's YAML template, which is very useful for identifying classes with too much detail modeled.


Reduction is absolutely necessary


A lot of things could be reduced to just a URI. This will include ExternalControlledVocabulary and ExternalMaterial.

If interoperability is needed for the content of these classes, it will be possible to download it through the URI. Most cases requiring interoperability will be those that rely on DDI data, which will be identified directly by the prefix of the URI.
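
A minimal Python sketch of the reduction idea follows. It is an assumption-laden illustration: the ExternalReference class is hypothetical, the example URIs are invented, and "urn:ddi:" is assumed as the prefix that marks DDI content, which the minutes do not specify.

```python
from dataclasses import dataclass

DDI_PREFIX = "urn:ddi:"  # assumed prefix for DDI-identified content


@dataclass(frozen=True)
class ExternalReference:
    """Hypothetical reduced form: the whole class collapses to one URI."""
    uri: str

    def is_ddi(self) -> bool:
        # DDI content is recognized from the URI prefix alone.
        return self.uri.lower().startswith(DDI_PREFIX)


# Invented example references.
vocabulary = ExternalReference("urn:ddi:example.agency:ModeOfCollection:1")
material = ExternalReference("https://example.org/questionnaire.pdf")

print(vocabulary.is_ddi(), material.is_ddi())
```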


This is to be discussed within the modelling team.


Expand
title2017_01_17 Simple Codebook Minutes

Simple Codebook Meeting 2017-01-17

Attendees: Jared Lyle, Dan Gillman, Larry Hoyle


The group discussed the remaining needed information items for simple codebook from the google sheet https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=294218085

We discussed Embargo and Security. Do these need to be attached to more than Study and InstanceVariable?

For the Methodology related items we discussed creating realizations of the Methodology class.


This led to questions to be raised with the Modeling group:


Where does Embargo go?  (one proposal: in Annotation)

               Or Embargo relationship from InstanceVariable

Where should security information go? (one proposal: in Annotation)

               Access from InstanceVariable


Make a Methodology realization to use overview for:

               TimeMethodology  (0..1)

               SamplingMethodology  (0..1)

               DataCollectMethodology (0..1)

… (more)

               OtherMethodology  (0..n)


InstanceVariable (ConceptualVariable?) points to            

               derivationMethodology
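
The cardinalities proposed above can be sketched in Python as follows. This is a reading of the notes, not the model itself: StudyMethodology is a hypothetical container name, only the listed properties are shown (the items elided in the notes are left out), and the example text values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Methodology:
    overview: str


@dataclass
class StudyMethodology:  # hypothetical container name
    time_methodology: Optional[Methodology] = None          # 0..1
    sampling_methodology: Optional[Methodology] = None      # 0..1
    data_collect_methodology: Optional[Methodology] = None  # 0..1
    other_methodology: List[Methodology] = field(default_factory=list)  # 0..n


@dataclass
class InstanceVariable:
    name: str
    derivation_methodology: Optional[Methodology] = None


# Invented example content.
study = StudyMethodology(
    sampling_methodology=Methodology("Stratified random sample")
)
study.other_methodology.append(Methodology("Interviewer training notes"))
weight = InstanceVariable("WEIGHT", Methodology("Post-stratification weight"))

print(study.time_methodology, len(study.other_methodology))
```

Optional fields stand in for 0..1 and a list stands in for 0..n, so everything remains optional.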



Need Methodology at the high level - do we need all of the classes it relates to as well?

For derivation we need to add Design and Algorithm


General question for modeling: how do you add in just part of a pattern?





Expand
titleMeeting 2016_11_22

Simple Codebook Meeting 2016-11-22

Attendees:

Dan Gillman, Oliver Hopt, Larry Hoyle

We discussed a YAML template of the codebook view and looked at it in Notepad++. This revealed the complexity of the model (20992 lines in the template), which should probably be discussed in the Modeling Team.

The YAML template was produced by a small Python program that reads the xmi.xml from the lion site, parses it and produces the YAML.
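
The program itself is not reproduced in these minutes. As a hedged sketch of the general approach only, the following Python reads a small XMI fragment and writes an empty YAML template with one key per class and attribute. The sample XMI, the element names (packagedElement, ownedAttribute), and the namespace URI are assumptions about a typical UML export, not the actual structure of the xmi.xml on the lion site.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/spec/XMI/20131001"  # assumed namespace URI

# Tiny stand-in for xmi.xml; the real file is far larger.
SAMPLE_XMI = f"""<xmi:XMI xmlns:xmi="{XMI_NS}">
  <packagedElement xmi:type="uml:Class" name="InstanceVariable">
    <ownedAttribute name="name"/>
    <ownedAttribute name="description"/>
  </packagedElement>
  <packagedElement xmi:type="uml:Class" name="Study">
    <ownedAttribute name="title"/>
  </packagedElement>
</xmi:XMI>"""


def xmi_to_yaml_template(xmi_text: str) -> str:
    """Emit one YAML mapping per UML class, with an empty key per attribute."""
    root = ET.fromstring(xmi_text)
    lines = []
    for element in root.iter("packagedElement"):
        if element.get(f"{{{XMI_NS}}}type") != "uml:Class":
            continue
        lines.append(f"{element.get('name')}:")
        for attribute in element.findall("ownedAttribute"):
            lines.append(f"  {attribute.get('name')}:")
    return "\n".join(lines) + "\n"


template = xmi_to_yaml_template(SAMPLE_XMI)
print(template)
```

The YAML is written by hand here to keep the sketch free of third-party libraries.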

We discussed the possibility of having a YAML template accessible for each DDI4 class either through the nightly build or from lion.

The codebook YAML template is available in the list of files on the Simple Codebook Team page.



Expand
title13 September 2016 Meeting Minutes

Notes on Codebook meeting 13 September 2016

Participants: Dan Gillman, Steve McEachern, Larry Hoyle, Gillian Kerr

Review of missing content in Data Description required by Codebook

1. Embargo for variables:
Perhaps add to AnnotatedIdentifiable

Security, embargoes, access:
These issues need to be addressed by Modelling team
New issue added to MT on Jira

2. Variable "interval"
Dan suggests a new property "Statistical Data Type Family": a property to allow you to define a classification system (e.g. nominal/ordinal/interval/ratio; discrete/continuous)

Note that both of the above have "primitive data types" in ISO 11404 (nominal - state, ordinal - enumerated, interval - integer, ratio - real). Use these as the starting point.
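
The correspondence noted above can be written down as a simple lookup. The sketch below only restates the four pairs from the minutes; the names MeasurementLevel, ISO_11404_TYPE, and primitive_type_for are invented for the example and are not proposed property or class names.

```python
from enum import Enum


class MeasurementLevel(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Pairs as listed in the minutes: level -> ISO 11404 primitive data type.
ISO_11404_TYPE = {
    MeasurementLevel.NOMINAL: "state",
    MeasurementLevel.ORDINAL: "enumerated",
    MeasurementLevel.INTERVAL: "integer",
    MeasurementLevel.RATIO: "real",
}


def primitive_type_for(level: str) -> str:
    """Look up the primitive data type for a measurement level name."""
    return ISO_11404_TYPE[MeasurementLevel(level)]


print(primitive_type_for("ordinal"))
```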

3. Variable files

Suggest leaving this out and seeing whether it is needed in the review. Note that the codebook approach here may be counter to the modelling strategy we have for DDI4 - it also makes reuse problematic.

4. Summary statistics

suggest leaving this out of review and note that it has to be developed. We could note what they will be, and where they may be located (on the DataStore??). Suggest looking at how Lifecycle handles them (similar to a key-value pair) after the review is completed. (Conversation following with Wendy: probably attach with the physical file, but should not be tightly integrated).

5. Derivation Description (and related):

suggest to be addressed by Methodology and Process

Methodology items: need to be addressed by the Methodology group
Imputation, weighting, sampling, dataColl

Other items:
aboutMissing
responseUnit (and analysisUnit)

ACTION ITEM: responseUnit and analysisUnit should be discussed via email in the next two weeks



Expand
title30 August 2016 Minutes

Attendees: Dan Gillman, Michelle Edwards, Oliver Hopt, Larry Hoyle, Gillian Kerr

  • Reviewed Larry's work
    • Matched classes currently in Lion to highlighted DDI-C elements from the Knutholmen sprint
    • Some classes within the highlighted elements include attributes that may not be required for Codebook
    • Ensure that all classes in the Codebook View are included in the DDI-C list
  • Reviewed W3C tabular view document suggestions from the Data Description team
    • some areas covered in this document may need to be considered for Codebook
    • line terminators are one example
  • Are we ready to pass this work onto the TC?
    • let's wait for Methodology and Data Description
  • Documentation needed for the Release?
    • there should be some documentation
    • maybe a list highlighting elements that make a thorough codebook?
    • core set of elements
    • a template

Next Steps:

  • Oliver will update the example used in Knutholmen to accompany the Release
  • Oliver will create a document with XPath - then we can discuss how we should add documentation to this for the Release





Expand
title5 July 2016 meeting minutes

Meeting Minutes, 5 July 2016

Attendees: Dan Gillman, Michelle Edwards, Larry Hoyle, Gillian Kerr, Steve McEachern

The meeting focussed on reviewing the output of the previous meeting (see minutes below) for those who were unable to attend previously.

As noted in the previous minutes, there are three core areas of decisions to be made by this group to progress for the Q3 release. Each is considered below.


Decisions for Modelling group

1. Optional vs mandatory content

It appears that the focus will be on making everything optional. Wendy Thomas has circulated a document on this through the SRG.

Larry commented that we need to consider what content is being imported into the Codebook view when we import the packages that we rely on - particularly those classes that are not appropriate or relevant.

Gillian also noted on mandatory content that if it IS mandatory, then people will often include nonsense content to comply with the field requirements - REDUCING data quality. Suggested that content might be better managed by making it optional, and then enabling links to reference content (e.g. ORCID for author/investigator information). Dan noted that content could be supported by VALUE or by REFERENCE - which would be one means of enabling this.

2. Citations

Should we retain the text string citation, or just use constructed citations? In Norway, the preference was to remove the text string. However, there are cases where there may be recommended text that a data producer requires.
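
One way to picture the two options side by side is the Python sketch below, which builds a citation from constituent parts but lets producer-required text take precedence when it exists. The Citation class, its fields, the output format, and the sample values are all assumptions for illustration; no citation style was agreed in the meeting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Citation:
    creators: List[str]
    year: int
    title: str
    publisher: str
    required_text: Optional[str] = None  # producer-mandated wording, if any

    def render(self) -> str:
        # Producer-required text wins; otherwise construct from the parts.
        if self.required_text:
            return self.required_text
        names = "; ".join(self.creators)
        return f"{names} ({self.year}). {self.title}. {self.publisher}."


# Invented sample values.
generated = Citation(["Doe, J.", "Roe, R."], 2016, "Example Survey", "Example Archive")
mandated = Citation(["Doe, J."], 2016, "Example Survey", "Example Archive",
                    required_text="Please cite as: Example Survey (2016).")

print(generated.render())
print(mandated.render())
```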

3. Access conditions
There are a number of use cases that require managed access conditions. We don't yet have this on the work program (AG to consider) or a model to support access conditions (TC/Mod to consider). Dan suggested this might be added onto the AnnotatedIdentifiable class. It was suggested that the Codebook group develop an approach and propose it to Modelling/AG for broader usage.

Decisions for the SimpleCodebook group (or for referral by SC to other teams)

1. Geographic polygons

Why opt these out? Can include by reference?

2. Variable metadata

There is content from 2.5 that is basically describing fixed width files. If we want to include something along these lines (and this was agreed by consensus from the Codebook group - although we may improve upon the current handling within 2.5) then the DataDescription group needs to address this. (Steve to include in next DD agenda)

Larry noted that we probably should include the relevant PhysicalLayout classes and attributes (contributes :-)
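
For readers less familiar with fixed-width files, the kind of physical layout metadata at issue (a start column and a width per variable) can be illustrated with a short Python sketch. The layout, variable names, and sample record are invented, and the code does not use any DDI PhysicalLayout class.

```python
from typing import Dict, List, Tuple

# (variable name, 1-based start column, width): illustrative layout only.
LAYOUT: List[Tuple[str, int, int]] = [
    ("CASEID", 1, 4),
    ("AGE", 5, 2),
    ("SEX", 7, 1),
]


def parse_record(line: str, layout: List[Tuple[str, int, int]] = LAYOUT) -> Dict[str, str]:
    """Slice one fixed-width record into named fields using the layout."""
    record = {}
    for name, start, width in layout:
        record[name] = line[start - 1:start - 1 + width].strip()
    return record


print(parse_record("0001342"))
```

Without the start and width metadata the record "0001342" cannot be split correctly, which is why this content matters to a codebook.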

3. ResponseUnit and AnalysisUnit

These are rather mixed up in Codebook. Dan suggested referring back to the Unit/UnitType/Population/Universe content in the Conceptual package. But we need to carefully specify the relationships.

Going forward:

- Draw on the existing content on Access Conditions and the available classes from the Data Description
- There is content that will be incomplete in the Codebook Q3 release - need to recognise this
- Also want to consider the extent to which we improve past content (e.g V2.5 methodology)

To do:
- Steve to raise Variable metadata (and FixedWidth content) with DataDescription group
- Modelling/AG need to consider AccessConditions (Codebook potentially to provide a solution)
- Unit/ResponseUnit/AnalysisUnit: may need to be postponed


Expand
title21 June 2016

Minutes of Simple Codebook group, Tuesday June 21, 2016

Attendees: Steve McEachern, Larry Hoyle, Oliver Hopt

Review of the output from Norway continued. The group focussed on how to resolve the outstanding fields identified in the DDI-C profile (from the Google spreadsheet here:

https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=1652443366).

The proposal was for the remaining content to be addressed through three mechanisms:
- Referral to AG/Modelling group for "general approach" matters (such as Citation)
- Specific issues for the team to resolve (or to be addressed by related teams including DataDescription)
- Content that needed to be deferred due to dependencies on future activities of current working groups (Methodology and Physical Layout).

Details of each are below.

Discussions for modelling and/or advisory group:
1. Optional vs. mandatory content
2. Citations: text citations (e.g. Bibliographic Citation) vs. constructed/compiled/generated citation (from constituent parts)
- (also need to account for required citation text from data producers)
- Dublin Core: BibCit is one of the DCMI terms (but not the core 15 terms)
3. Access conditions:
- Whole datasets (DDI-C, DDI-L profiles)
- Variables within datasets (DDI-C, DDI-L profiles)
- Units within datasets
- Cases (records?) within datasets
- Metadata (e.g. Census RDC content restricts information on variable metadata)
AND
- What content is required within the access conditions (there was a model mentioned that may be a candidate)
- Variable Security and Variable Embargo (from DDI2.5)

SC team (or related teams) to resolve - Additional questions/fields outstanding:
1. Geographic Polygons

2. Variable metadata:
- VariableFiles: (files that contain this variable??) - probably covered by a DDI4 Relationship - recommend deferring this if needed (as may be part of future DataDescription model development)
- VariableInterval: continuous or discrete

3. ResponseUnit and AnalysisUnit (and Unit of Measurement)
- Consider a situation where the respondent to a survey is a Parent but the unit of interest is the Child - and the unit of analysis might be either the Child or the Household?? How do we describe these different "units"
- In particular, "AnalysisUnit" is problematic - because the unit of analysis is dependent on the research use - not on the data as captured.
- Might be related to Viewpoint??

Variable content characteristics may be best addressed now:

- VariableInterval,
- 3.2 dimensions such as format, scale, decimalPositions, ... - numeric representation, classification level, ... (3.2 ties this more closely to the data type). Fundamentally these are attributes of the DATA TYPE and the MEASUREMENT
- Also consider the SummaryStatistics (DDI-C 2.5) in this discussion (note that this is probably more a characteristic of the "set of datums" rather than the InstanceVariable)
- Should be addressed by DataDescription to make a recommendation on when these attributes will be incorporated
- Larry Hoyle recommends including PRECISION as an attribute of the measurement
- What other content is commonly available from statistical packages? Reference: Hoyle and Wackerow paper in IASSIST Quarterly, V39 N.3-4

Recommended for deferral
1. Methodology - all related fields
- All fields within Methodology section of DDI-C
- Also includes Imputation
This requires output of Methodology team

2. PhysicalLayout
- MissingData
- VariableLocationStart-End-... (i.e. location in FixedWidthFiles)
This content requires the fixed width layout from the DataDescription group - which may not be included in the initial DD preview release.
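To illustrate what start/end locations in a fixed-width file would need to carry, here is a minimal Python sketch. The class name, variable names, column positions and record are all invented for illustration, and the 1-based inclusive column convention is an assumption rather than anything the group agreed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FixedWidthLocation:
    """Hypothetical stand-in for a variable's start/end location in a record."""
    name: str
    start: int  # 1-based column of the first character (assumed convention)
    end: int    # 1-based column of the last character, inclusive


def read_record(line: str, layout: list[FixedWidthLocation]) -> dict[str, str]:
    """Slice one fixed-width record into raw string values, one per variable."""
    return {loc.name: line[loc.start - 1:loc.end].strip() for loc in layout}


# Invented layout and record, for illustration only.
LAYOUT = [
    FixedWidthLocation("CASEID", 1, 4),
    FixedWidthLocation("AGE", 5, 7),
    FixedWidthLocation("SEX", 8, 8),
]
RECORD = "0042 372"
VALUES = read_record(RECORD, LAYOUT)
```

The values come back as strings; data type, decimal positions and missing-data handling would still need to be described separately.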



Minutes May 10, 2016

Codebook meeting 2016-05-10

Attending: Dan Gillman, Gillian Kerr, Oliver Hopt, Larry Hoyle

We reviewed the spreadsheet https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=1652443366 (sheet NewStartingPointCdbk_4), which now has XPaths to DDI 2.5 elements (column F) and descriptions of the corresponding DDI4 classes (column E), as well as descriptions of DDI4 classes that are still needed.

 We discussed the creation of a view.  

We need two new classes: one for the whole activity producing data and one to describe each wave or phase.

Issues are associated with the top level (e.g. design) but then there are specifics at each repeated instance producing different data.  The general and specific shouldn’t be duplicative for one time activities. An example of the top level information would be the purpose for the whole set of activities. Another would be the funding source for the whole, or authorizing legislation.

What terms could we use? --- Activity? Data capture activity?  

We need an anchor class and a specific class: an “anchor class” and a “concrete anchor instance class”.


In stats agencies there are ongoing activities: designs change, but the overall activity is known by a name and has a funding source (e.g. CPS or the American Community Survey). The specific might be a monthly collection, e.g. the monthly CPS as input to the calculation of the unemployment rate.

Another example would be the Christmas Bird Count, which has annual data collections but can also be considered an overall series.

Decision:

“StudySeries” as overall

“Study”  for the specific – the  user community is familiar with this term, even if developers don’t like it

Conceptual would be the best current package for these classes

Oliver will create classes then the rest of us can work on descriptions.

Larry will add other classes to the view


Goodbye “TOFKAS” (The Object Formerly Known As Study). Even Prince went back to “Prince”


April 26, 2016

Notes from Codebook Meeting 2016-04-26, 8am EDT
Attending: Dan Gillman, Gillian Kerr, Larry Hoyle
We discussed the need to update column E in https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=1652443366  Larry will make a first pass at this.
Does DDI4 have the ability to describe a Reference period for a question e.g.  “over the last three months have you…?”  The ReferenceDate class http://lion.ddialliance.org/ddiobjects/referencedate has a typeOfDate property that should be able to do this.
We need a controlled vocabulary for the semantics of a ReferenceDate typeOfDate. Examples: the date range to which a question refers…. This will be a heavily used vocabulary. Should this be a fixed choice in the standard? That is less flexible than an external vocabulary that can expand. The latter is preferable. The DDI Alliance controlled vocabulary group might address this issue.
What class will we use for the “study” in 4? GSIM has statistical activity.  We have derived data, experimental data, data collection from administrative sources. Scraped data from an administrative registry, mashed data from the web,  something like the consumer price index, qualitative data. TOFKAS – the object formerly known as study – TOFKAT the object formerly known as TOFKAS. Perhaps Thingamabob? Data Activity?  Study? ThatWhichWasCaptured? DataCollection? AcquisitionActivity? Whatever the name we need a class to represent the overall activity of creating/collecting the data.


March 15, 2016

Present: Dan Gillman, Steve McEachern, Jared, Oliver Hopt, Gillian Kerr, Kristiyan Panayotov

Gillian Kerr and Kristiyan Panayotov joined the group; Steve and Jared will take care of access to websites and mailing lists.

  • Continuation of discussion on Variables in SC

Decisions on the level of information we want to keep in and out of Simple Codebook should be deferred until the other views and packages are a bit more settled.

The main direction of discussion in the next meetings will therefore be defining what we would expect to be documented for a variable and its surrounding classes. Main question: What do we want to see in a codebook, especially a basic one?

What is a variable?

- What is the meaning of a variable?

  - Characteristic

  - Population

    - Is a variable the same, if everything is the same but the population? (Sex of humans and/or bears)

- What are the values of a variable?


Extend the use cases, especially those from Gillian, which sounded quite complex while still fitting into SC.

 

March 1, 2016

Minutes Codebook 2016 – 03 - 01


Present: Michelle Edwards, Dan Gillman, Steve McEachern, Oliver Hopt, Jon Johnson, Larry Hoyle

Reviewing Michelle’s addition to the view spreadsheet.

We need a view

Unclear if we need a package

What other classes/objects are we likely to need?

 - Study

 - Contributor and a Role CV

 - Notes class

Need something like 2.x Study

Stats agencies may not need Study

Michelle looked at DDI Lite and the “Archives using” column and cut out unused elements. She cut down the DDI-LITE profile even further to suggest items that we think are essential.

Variable description looks like not being used by archives but we know we need that

Column E has mapping to DDI4

Can we come up with an (incomplete) list of methodologies that we need?

Last meeting: Discussion on methodology. We discussed the need for a reference document or a descriptive statement that we may wish to refer to for describing a methodology. We may have a need for a more detailed description, or even a call to a process, but this is an initial high-level description only.

- Distinction between Usage (in the Methodology model) and a text description (what we are considering here). We would have a reference to the more detailed usage.

- This will be raised by DG, LH and OH in the modelling group for a preferred approach.

How do we handle short descriptive statements in the codebook?

A codebook will not necessarily have the detail to drive a process. What would a template have?

There is usage in the methodology model but this is something higher level than that.  A codebook would then have a reference to the more detailed metadata.

Does the modeling group have a preferred approach to this?

Bring this up at the modeling meeting tomorrow.

What in DDI4 corresponds to the note in DDI-C?


  • Variables

Do we want to reproduce the variable cascade in a codebook? Could it just be the IV? Or will we need the references to further up the cascade - RV and CV

In the current DDI4 model we have references in the InstanceVariable to RepresentedVariable and ConceptualVariable
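A minimal Python sketch of how optional references up the cascade might look. The classes are simplified stand-ins, not the DDI4 model itself; the field names and example values are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ConceptualVariable:
    name: str


@dataclass
class RepresentedVariable:
    name: str


@dataclass
class InstanceVariable:
    """Simplified stand-in: both references up the cascade are optional."""
    name: str
    represented: Optional[RepresentedVariable] = None
    conceptual: Optional[ConceptualVariable] = None


def cascade_names(iv: InstanceVariable) -> list[str]:
    """List the IV name followed by whichever cascade references are present."""
    names = [iv.name]
    if iv.represented is not None:
        names.append(iv.represented.name)
    if iv.conceptual is not None:
        names.append(iv.conceptual.name)
    return names


# Invented examples: a stand-alone IV and one that points up the cascade.
STANDALONE = InstanceVariable("SEX_2016")
LINKED = InstanceVariable(
    "SEX_2016",
    represented=RepresentedVariable("Sex, coded 1/2"),
    conceptual=ConceptualVariable("Sex"),
)
```

A codebook built only from instance variables remains valid under this shape; the references add information where it exists.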

The discussion of Variable use led naturally to the link from Variable to Question. There are also Variables that do NOT have Questions (e.g. derived variables, instrument measures, etc.).

What do we really want to be supporting in the Codebook view? Steve and Michelle to develop usecase(s) to identify requirements.

What about references to reuse of others' materials/variables/etc.?

Larry suggests that the codebook should be able to reference such content, where it is available. The question becomes how to do that referencing, and to what extent?

Steve and Jon's query is what form would that codebook take?

Question mode has an impact – do we want to record this in a codebook?

For multi-mode surveys, a codebook could carry different forms of the questions for each mode.

Example: BLS CES has multiple modes; the codebook would need to account for all of them.

In the DDI2 world, when people think of variables they think of questions.

In DDI2 you can’t really describe an instrument. Many variables don’t come from questions.

A variable focus is more general than a question focus and should be our focus.

DDI2.5 moves to a variable focus.

How far should we move into a focus on questions?

Steve and Michelle will put together a use case.

Is a reference to Represented etc. meaningful unless there is a repository containing it?

Is a codebook for reuse? Information should be included by instance in the codebook but references to reused elements could be included. What if the target objects are not DDI encoded?

Do we want references at each possible point? Or just at the InstanceVariable?

What would a PDF look like? A weakness of a codebook is that it is stand-alone. We should have a side email discussion of how we include pointers.

No one will want to retool existing study documentation. These features need to be optional.

Two discussions for next meeting:

1. Steve and Michelle for Variable usecases

2. Larry and Dan to lead a discussion on what a "codebook" might look like in the 21st century / RDF / Linked Data world? How far can we move towards that new Codebook form?


 

February 2, 2016

Codebook meeting

2 February 2016

Attending: Dan, Michelle, Steve, Oliver, Jon, Larry, Jared

There’s some lack of clarity about where this group is at.  Discussed what to include in simple codebooks.  One idea is to review the spreadsheet of common elements (summary of CESSDA) and build on that.  Essentials seem to include: enough information to read the data into statistical package, label values, understand universe, understand what measure means so you can interpret the data, attribution information.  Another idea is to look at examples of simple codebooks, identify what they use, and then map to a model.

We need to be careful to keep things simple.  Even older versions of DDI 2 weren’t exactly simple.

If we nail down definitions, then do we make instances of previous versions incompatible?  As we define what information elements we want in DDI 4.0, we can specify which element you want in 2 if you’re going backwards.  

Next steps:

  • Michelle will go through spreadsheet and narrow down to those elements that are DDI Lite and any others that are heavily used (e.g., key words).

  • Will paste those elements into new sheet within the spreadsheet.

 

November 23, 2015

Present: Dan Gillman, Michelle Edwards, Steve McEachern, Larry Hoyle

  • We want to incorporate everything from the InstanceVariable
  • Add in the connection to Question
  • Structure of the physical representation
  • We want to describe DDI-C using DDI4 elements.
  • Reuse would make some codebook instances shorter. Will people think that referencing RepresentedVariable and ConceptualVariable is required if those references are optional?
  • What are the next directions for Codebook? Think about surveying big Codebook users, IHSN and Nesstar users in particular – along with 5-6 archives
  • Where Nesstar goes these users will follow
  • Cost will be primary driver for folks to migrate from 2.x to 4.x – some see the benefits of the DDI-L extensions
  • We need a migration path from 2.x to 4.x
  • 4.x is flexible enough that the migration path doesn’t need to be well defined
    • Should be based on your needs and what you think is appropriate first step to reuse
  • Variable bank, question bank, and Universes/Populations may be the natural first step to migration but each may present a different migration path
  • We may be able to recommend different paths
  • ISO community – they have technical reports: a series of recommendations that folks ought to follow. Think of them as “Best Practices”; these may exist, but they provide guidance rather than step-by-step instructions
  • This is something we should seriously consider doing – maybe a Grad Student project
    • Jane Greenberg, at Drexel University – great opportunity to collaborate
    • Dan G may reach out to Jane to start a conversation
  • Back to how are we going to build codebook
  • We want to create a model-based Codebook in 4.x rather than a way to create the XML from 4.x to put into 2.x


    • This way we can do things more efficiently
    • Create an attribute that states it is being used for Purpose A or Purpose B
  • We could document how the information could be transferred without having a one-to-one relationship between objects.
  • To implement codebook in 4.x we need to describe attributes and their purpose
    • Examples:
    • Title / Alternate title / Parallel title -> have an attribute with a Controlled Vocabulary for what kind of title it is
    • Similar situation for Roles – in Codebook we have a number of different roles; let’s pare that down and use Agent, with a CV and a usage attribute that states Codebook - Roles – we recommended the CRediT Taxonomy.
  • 1 object that covers a number of Codebook XML elements
  • Compactness will make it easier to maintain over the years – these could include these areas:
    • Citations
    • Publications
    • Related Materials
    • Methodology
  • Cluster elements?
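
The Title / Alternate title / Parallel title example above could be sketched roughly as follows in Python: one class with a type attribute drawn from a controlled vocabulary, instead of three separate elements. The class, attribute and vocabulary terms here are hypothetical, and a fixed enumeration stands in for whatever controlled vocabulary would actually be used.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class TitleType(Enum):
    """Hypothetical controlled vocabulary for the kind of title."""
    MAIN = "Title"
    ALTERNATE = "AlternateTitle"
    PARALLEL = "ParallelTitle"


@dataclass(frozen=True)
class Title:
    """One object covering several Codebook title elements."""
    text: str
    type_of_title: TitleType


def titles_of_type(titles: list[Title], wanted: TitleType) -> list[str]:
    """Return the text of every title of the requested kind."""
    return [t.text for t in titles if t.type_of_title is wanted]


# Invented titles, for illustration only.
TITLES = [
    Title("Example Household Survey 2015", TitleType.MAIN),
    Title("EHS 2015", TitleType.ALTERNATE),
    Title("Enquête exemple auprès des ménages 2015", TitleType.PARALLEL),
]
```

The same pattern (one class plus a typed attribute) would apply to Roles on Agent.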

Goal for next Meeting – December 7, 2015:

  • Review Codebook and see how we can handle current Codebook elements
    • Clusters that can stand on their own – then figure out how we can do this
    • What we need and how to manage it – then take to modellers
  • Going forward – we will review and  look at clustering elements in the Google spreadsheet. What are different uses of the same structure?


 

November 9, 2015

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The group discussed whether Data Capture had made enough progress to enable Codebook to move forward. Mary will get in touch with Barry about this.

In terms of Oliver's model (the second model he proposed), the next step would be to bring in information from other groups. Access conditions was the only area not yet covered. We need to ensure that everything in Oliver's model is covered (except for Access Conditions). Oliver will go through the group's spreadsheet and map to this model to ensure full coverage.

We also need to ensure that we have adequate methodology information. We also need to be sure that full file level documentation is enabled (not just study level).

And do we want to include all of the datum level information for reuse? This may be too much for the codebook view, which has traditionally been a more flat view of a study and the files it produces. There is a connection between variable and datum so if we want this to be part of codebook or an extended version it is possible.

Do we care about anything other than the instance variables in Codebook? Codebook is something you get with a file that lets you use it and interpret it. But if you have pointers to represented variable and conceptual variables you can do more.

Codebooks are created ad hoc, and the design reflects that. There is no guarantee that the way someone creates a conceptual variable is the same as how someone else creates it. There would be no semantic interoperability. But in a future world there will be new surveys with comparability designed in. A DOI pointing to what has been defined elsewhere would be OK.

We have polled various organizations to see which elements they use. Do we need to keep carrying unused elements? This is a good point in time to simplify. To survey DDI 3 usage, Oliver has a small XSL transformation that reports statistics on the downward paths in any given document, which could be helpful.
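
As a rough illustration of the kind of path statistic described, here is a Python sketch that counts how often each root-to-element path occurs in an XML document. This is not Oliver's XSL transformation, only a stand-in written for these notes; the sample document is invented and uses a few DDI Codebook element names without namespaces.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_counts(xml_text: str) -> Counter:
    """Count how often each root-to-element path occurs in a document."""
    counts: Counter = Counter()

    def walk(element: ET.Element, parent_path: str) -> None:
        path = f"{parent_path}/{element.tag}"
        counts[path] += 1
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


# Invented sample document, for illustration only.
SAMPLE = """<codeBook>
  <stdyDscr><citation><titlStmt><titl>Example</titl></titlStmt></citation></stdyDscr>
  <dataDscr><var><labl>Age</labl></var><var><labl>Sex</labl></var></dataDscr>
</codeBook>"""
COUNTS = path_counts(SAMPLE)
```

Summing such counters over many instances would show which paths are never used. For namespaced documents, ElementTree reports tags in the form `{namespace}name`, so the paths would include the namespace URI.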

In Data Description, there was a related discussion about how far we should chase legacy file layouts. In one sense you want to encourage people to do things in simpler ways, rather than more complicated formats.

It was decided that the ability to include references to represented and conceptual variables is a good addition to codebook to bring in the notion of reuse.


 

Meeting Minutes Sept 28 2015

Attendees: Dan Gillman (chair), Michelle Edwards, Oliver Hopt, Larry Hoyle, Steve McEachern


Oliver distributed a PDF of his thinking around the Codebook model. He presented this work, and provided commentary on his thinking. 

Scenario A was discussed at the last meeting, but was seen to be problematic. 

Scenario B was his revised approach. This includes:

 - Study, DataResource and DataFile

 - Citation from Annotation


DataResource is consistent with the GSIM equivalent

 - Carries Citation which allows various subclasses to be citable

 - Has one attribute: productionInformation

 - VariableBasket and DataFile would be subclasses of DataResource


Study includes:

 - StudyDesign

 - Fieldwork

 - Etc.

 - Study would have an attribute DataResource  


Comments and discussion


1. Dan asked the meaning of the blue box around DataResource in Scenario B? Oliver indicated that this would indicate a new package DataResource.

2. Dan asked what is the cardinality of the relationship b/w Study and DataResource? Oliver suggested that this should be repeatable - e.g. more than one DataFile in a Study.

3. Dan asked whether DataResource is currently a collection of files or a collection of Variables. Could this include Questions?

- Oliver noted that there is currently a relationship through Measure from Question to InstanceVariable.

- We may not want to include all of the DataCapture view within Codebook

- Dan suggests that DataCapture has not yet laid out the link between the Questionnaire in the abstract versus the Instrument in the physical.

- We would want to include the Questions, Skips, ResponseCategories and InterviewerInstructions.

- Which do we want - the PhysicalInstrument or the ConceptualInstrument?

- Examples: Blood Pressure measurement, CATI instrument execution

- By including Physical, do we as a result account for Conceptual?

- Larry asks can we include by reference? Dan argues for the need for explicit rather than implicit reference. Larry notes that this means that this would make an Instrument required content.

- Dan asks if it is adequate to have just a pointer? If so, how do you link the Variable to the Question?

- Dan suggests that there IS a link between a Question and a Variable - but it is just not enough to tell you sufficient detail as to how a Datum was derived.

The group generally wasn't sure if we do want to try and link the Question and Variable - mostly due to content already existing (particularly pre-2000).


Oliver brought the conversation back to what we are currently trying to model.


Preferably there should be some machine actionable generated documentation which allows the links between these to be automatically (or semi-automatically) created. However in many cases this simply may not be available for past content (ADA and GESIS have examples, and we believe ICPSR as well).


As such, we may want to allow for simple external documents which describe the content in a human-readable (but not machine readable or actionable) form.

External resource is an option in Lifecycle - this might be the means for this.


Oliver's current model does enable this - allows for the simple, but allowing to be replaced by more complex where it is available and/or "generateable".

Steve noted that this would also be consistent with the approach taken in Methodology.


Where does this leave us, and where to next?


Dan is concerned that we may be adding a fair amount of complexity over DDI version 2.5.

e.g. We have been having discussions about the link between Question and Variable - how would the user community respond to this?


Oliver also noted that this may touch on the discussion had with Ornulf about maintaining Codebook 2.5 through the DDI4 implementation. Ornulf's and Oliver's concern was the potential creation of too many identifiers to be maintained within a Codebook instance, and whether we would be able to handle what's done in 2.5 in a DDI4 codebook.


Larry noted that Colectica seem to have a potential solution to this in their current work. This seems to bypass the Lifecycle 3.2 approach, and simply use UUIDs to manage identifiability, which might be a possible solution.


What to do for next meeting?


Oliver undertook to clarify what the relationships between his Study object and the other packages would be (e.g. to DataCapture, Methodology, etc.).

We also need to ensure that we keep track of what the requirements are for aligning with DDI2.5


Next meeting: Monday 12th October, 8am U.S. Eastern time

Note that there will be changes for other locations due to daylight savings.

 

August 17, 2015

Present: Michelle Edwards, Dan Gillman, Larry Hoyle, Steve McEachern, Mary Vardigan

  • What is the advantage of moving from Codebook to Lifecycle?

One benefit is building a collection of reusable instruments in multiple languages. Reusing the census variables in other questionnaires is another area.

Something we should promote is building in limited amounts of reuse. It may be possible to incorporate areas of reuse without incorporating in others where we don't see the benefit – variables are an area. Can this be done piecemeal? Most variables are instance and possibly represented variables. As they see the need they can build out to conceptual level. The recent work on ANES and GSS is a good example of this. With the concept management perspective we have, you can always argue that any two usages are different in some way. We will be imprecise in some ways always. There is a push among NSOs for question banks, but there is a recognition that modes affect the responses. Your intent is to measure the same concept. This is why concept management is a powerful idea.

One of the problems may be the tools that are needed. We can't yet articulate the use case to build the tools we need.

  • Identifiers

What is the best way to proceed in terms of identifiers? Oliver is doing some modeling so we should be able to look at identification based on what he does. Mary will introduce the identifier discussion with Ornulf so we can get Nesstar Publisher on board.

We hope there will be a way to use identifiers in DDI-C and append to them to make them unique. In the end, at what level do we need a unique identifier? Anything that is identifiable in 4 requires a unique identifier. It is pretty well assured that everything identifiable in 2 will also be identifiable in 4. The IDs in 2 are at the variable level. If the variable has a unique identifier and the study has a unique identifier, there should be global uniqueness if we could add the registry ID. One benefit is that in the Linked Data world you could find all the variables in the world related to a concept. Another is the simple fact that many studies are ongoing, so the yearly or monthly variables could be looked at across time. Any time you are making comparisons over time, subject, or geography you need this.

 

July 20, 2015

Present: Michelle Edwards, Oliver Hopt, Larry Hoyle, Mary Vardigan

  • Managing DDI Codebook in DDI4

The group discussed whether it would be possible to reconcile the different approaches to identification if we were to manage DDI Codebook in DDI4 in the future, which is the goal. Currently, in DDI Codebook, IDs are unique only for the individual instance, not across instances, and the approach of DDI Lifecycle and DDI4 is to have globally unique IDs for all DDI objects. It was the sense of the group, however, that the IDs are not a big barrier, either using the URNs or using UUIDs; it should be fairly easy to make a transfer. Scripts can generate UUIDs. We could manage Codebook in DDI4 without taking advantage of referencing and reuse. Also, there is a Local ID in DDI4, which could carry what is currently the ID in DDI Codebook, and a UUID could be added. Colectica goes back and forth between DDI Codebook and DDI Lifecycle and they use UUIDs, so it would be helpful to talk with them about these issues.
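
A minimal Python sketch of the "scripts can generate UUIDs" idea: keep the instance-local Codebook ID as a local ID and attach a freshly generated UUID. The class and field names are hypothetical and the IDs are invented; only the standard-library `uuid` module is used.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Identified:
    """Hypothetical pairing of a Codebook-local ID with a global one."""
    local_id: str  # the ID as it appears in the DDI Codebook instance
    uuid: str      # globally unique ID added for use in DDI4


def assign_uuids(local_ids: list[str]) -> list[Identified]:
    """Attach a random (version 4) UUID to every local ID, preserving order."""
    return [Identified(local_id, str(uuid.uuid4())) for local_id in local_ids]


# Invented variable IDs, for illustration only.
ITEMS = assign_uuids(["V1", "V2", "V3"])
```

Random UUIDs differ on every run, so the mapping would have to be stored. If repeatable IDs were wanted, `uuid.uuid5` with an agreed namespace derives the same UUID from the same name each time.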

There is also a political issue in that DDI Codebook has been handled separately and people feel ownership of it as it stands now. It is used around the world by the IHSN. We want to maintain close relationships with these partners, so we will need to design a system that works for them. We should contact Nesstar to start a conversation about how Nesstar Publisher might make some relatively small changes to accommodate this switch to managing DDI Codebook elements in DDI4.

  • Status of Spreadsheet and Modeling

In the past weeks the Codebook group annotated a spreadsheet – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646 – containing all of the Codebook elements used by CESSDA archives with the objective of determining which elements are currently in DDI4 and which might need to be added. It was the sense of the modelers on the call that the spreadsheet as it stands now is adequate input for the modeling effort. Oliver with support from Larry will start to add classes to Drupal based on the spreadsheet and will get back to the group with any questions. He estimates that he will have a first Codebook View to show in four weeks.

 

Simple Codebook Meeting July 6, 2015

Present: Dan Gillman, Larry Hoyle, Jenny Linnerud, Mary Vardigan

  • DDI 2.5 and DDI 4

Do we bring anything forward to the AG or go directly to the modelers? In terms of how we go through the spreadsheet again, are we asking for changes or is it more informational? At the AG meeting, when we discussed the issues we talked about in this meeting in terms of freezing 2.5 and doing everything within DDI4, we met with some resistance. We don't want to announce that we are freezing 2.5 until we have to. But the basic thrust of what we want to do (manage everything in 4) doesn't seem to be that controversial, but we have to have that in place to move forward with what we are doing. If we put forward the set of requirements, it won't make sense till we have an agreement that this way makes sense. We are saying: this is what we have to have in 4 for us to be able to handle 2.5 and handle further refinements of 2.5 from within 4. Can we get everyone to agree that we want to maintain all the attributes of 2.5 in 4 and not have two separate management activities going on? We want to maintain at least all new things in 4. Right now 2.5 is in XML but there is no reason we can't bind it to RDF.

  • Relation to CSPA and GSIM

We have a sales job here. The modelers' way of doing a binding may force them into a certain way of describing objects. As long as you have everything you need in 4 to map to 2.5, you should be able to write a binding. The bindings should not drive the design.

The CSPA LIM (Logical Information Model) was undertaken partly because the DDI was not delivering as fast as desired for the NSIs. Now we need to make sure that DDI and the LIM stay aligned so that we are conformant to GSIM. DDI should be a profile of GSIM and it should instantiate processes as GSIM does. GSIM is the more high level, abstract version of what DDI is becoming. We are filling in the details of what GSIM leaves to the implementer for DDI and it reduces the amount of variability in the implementations.

LIM is supposed to be halfway between GSIM and a physical implementation. So far the LIM covers codelists and the next step is statistical classifications.

We don't want to see another standard with small differences we need to bridge.

  • DDI Codebook and Moving to 4

The perception of Lifecycle is that it has added complexity that people don't want to deal with. Some of the complexity comes from reuse. We may have some issues in terms of whether we can actually model 2.5 in the codebook view of 4. There is a lot of stuff that you may have to bring along that ends up complicating things. But if attributes are what we really care about (combination of class and property – could be a relationship), we are totally flattening out the model into a set of these attributes and taking what we need. In terms of identification, we need to figure out what the requirements are in 2.5 and make sure there are attributes in 4 that handle all of those things in 2. If we have the flat model view of Codebook as not a view in the strict sense but essentially just a SQL dump of attributes out of 4, can we produce 2 from that? This is what we need to be able to show. This is how we need to present 2. As a group, we need to go down into the identification area and figure out how to map to 4. The binding doesn't have to take into account all the relationships that exist among all the classes – it is simply a dump of all the attributes.

We need to gather more evidence among our group. Once we resolve this, we can answer all the concerns. This requires a different way of thinking. We should be able to automatically say: we want the following attributes and write them out. There will be an issue of whether an instance of 2 can be ingested into 4 and make sense. DDI 2 does not indicate that code schemes are the same.
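
The "flat dump of attributes" idea can be sketched in a few lines of Python: treat each attribute as a (class, property) pair, flatten a set of objects into rows, then write out only the attributes that are wanted. The classes and values below are invented stand-ins, not DDI4 classes, and the sketch ignores relationships between objects, as the binding described above would.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class Study:
    title: str
    funding_source: str


@dataclass
class Variable:
    name: str
    label: str


def flatten(objects: list) -> list[tuple[str, str, str]]:
    """Dump every object as (class, property, value) rows."""
    return [
        (type(obj).__name__, f.name, getattr(obj, f.name))
        for obj in objects
        for f in fields(obj)
    ]


def select(rows: list[tuple[str, str, str]], wanted: set[tuple[str, str]]):
    """Keep only the rows whose (class, property) pair was asked for."""
    return [row for row in rows if (row[0], row[1]) in wanted]


# Invented content, for illustration only.
ROWS = flatten([
    Study("Example Study", "Example Funder"),
    Variable("AGE", "Age in years"),
])
WANTED = select(ROWS, {("Study", "title"), ("Variable", "label")})
```

Whether a 2.5 instance can be produced from such a list depends on the list containing everything 2.5 needs, which is the mapping question raised above.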

  • Next Meeting

We will go back through the spreadsheet and make sure we have everything and are ready to send things to the modelers and then start to look at the IDs.

 

Simple Codebook Meeting June 22, 2015

Present: Michelle Edwards, Dan Gillman, Larry Hoyle, Mary Vardigan

Larry and Achim developed a spreadsheet that shows the metadata that is included in all the major statistical packages. This should go up on the site.

The group began by talking about category groups in DDI Codebook. We now have the order relation mechanism in DDI 4 to handle hierarchies in categories. None of the CESSDA archives uses this in DDI Codebook. Do we need to map back to this? We have other ways of handling this in DDI 4 with classification schemes, etc., but it would be hard to map explicitly. Should we be deprecating some of these elements and attributes? We don't want to lose the notion of statistics in the Codebook View. In 4 there are summary statistics and category statistics that roll up to the instance variable. This is in complex data type and imported from 3.2. There is a Variable Statistic that belongs to a Statistical Summary which is attached to a physical instance. This has a reference to a variable and its payload in terms of what actual statistic content it has. It can be frequencies or aggregate summary statistics. We can represent the same content.

Are we providing a view that closely follows 2.5 or do we just want to map to 2.5? The mapping makes the most sense. If we do a copy we will mess up our model structure. There is no reuse in 2.5 and there is a way of thinking in 2.5 that may not match the thinking in 4. This approach could make tools more complicated, but ideally the tools using 2.5 will support the new way of doing it.

There is a Simple Codebook package on Lion, but right now the Simple Codebook view contains only Study Unit and Other Material. Is the Simple Codebook View intended to look like 2.5 or is it something new? We should create a new Codebook. What we need to do is show users how to migrate their 2.5 Codebook to a 4.0 Codebook. Most likely there will be a lot of people who choose to stay at 2.5. Tools will have to figure out how to map these things. Codebook 2.5 will need to be frozen and any changes to Codebook will be done in 4.0. To describe a process, you would need to convert 2.5 to 4 to harness the process. It will be incumbent on us to do this mapping, which we are doing.

We have several issues that are AG/Scientific Board issues:

  • Freezing 2.5 while continuing to support it
  • Having a mapping
  • All new work will happen in 4 in the Codebook View

Tools for developing countries use DDI Codebook so this could be an issue.

We need to get the mapping as clear as we can before we give the spreadsheet to the modelers. We should provide this spreadsheet with more detail to the modelers – this is how you map 2.5 to 4.0 for the Codebook View. Then we have to work with the modelers to figure out what the Codebook View looks like using the spreadsheet as a guide. Then the people doing tools in 2.5 will have a way to translate. We should be able to export 4 into a 2.5 framework like an API and it should be readable. It's a binding called a coding to map attributes. This means that the community using 2.5 with the available tools should be able to read and interoperate with a Codebook developed under 4.0. We also need to address whether we maintain the ability to write 2.5 out of 4.0 even though we will be making updates to the Codebook View over time. We will have to version the views. Version 1 of the Codebook View will be equivalent to 2.5 but Version 2 will not unless this is a constraint that we want to include.


 

  • Simple Codebook June 8, 2015

Present: Dan Gillman, Larry Hoyle, Jenny Linnerud, Oliver Hopt, Mary Vardigan

The group continued to review the spreadsheet mapping DDI 2.* to DDI4, noting items that the modeling team should take up.

Then the group turned to the metadata that the statistical packages include. Larry provided a spreadsheet that he and Achim had developed to show which metadata were included in each of the major statistical packages. It will be important for Codebook to contain all of this metadata. There are other ways of handling data, like SQL, that might also be appropriate. In the Big Data world, Python is becoming popular. Python is a general scripting language and has taken over the role that Perl had at one point. You can explicitly represent trees like JSON and XML, so it is very flexible. People have developed modules that do statistical kinds of things with Python.

Looking at all the software metadata from the statistical point of view is important. We need to make sure that everything in Larry's spreadsheet is accounted for in a meaningful way. We need to identify things that are not in the DDI 2.* spreadsheet. We can go through this all together or do assignments.

The number of significant digits is important in some scientific data. Whether the number has been rounded can be important. This should be included in DDI4. In the ISO/IEC 11179 community, there was a discussion of accuracy and precision. This is related to significant digits. The Data Description Team should address this. In an Instance Variable we may want to talk about significant digits while for a Represented Variable we talk about accuracy. We don't want to lose simple statistics on variables.

Larry and Dan will talk with the Data Description and Modeling teams about these issues.

 

  • Simple Codebook Meeting May 11, 2015

Present: Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group continued its review of the mapping between DDI Codebook and DDI 4 – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646.

The group returned to the elements regarding availability and access. There is currently no archive information in DDI4 and this needs to be modeled, perhaps at the upcoming sprint. In terms of the use statement, some is not covered in the access object in Discovery in DDI4. This needs to be modeled also. SAML isn't useful for us because it is too high level. Both data and metadata may need something attached. We might look at this in the Datum discussion (not only columns but rows) and also attaching things to the metadata to control access. This might be like annotations where it can be attached to anything – access could have a relationship to annotated identifiable. Then any object could have an access control. From access description to object could be another solution. This could make sense because an object could have different access policies when stored in different archives. This should be discussed at the sprint also. There is an Access Control XML language that we looked at but didn't decide on. Michelle will be representing CISER at the sprint and can express their needs in this area.

In terms of Imputation, it is now the same as it has been in 3. Generation Instructions and General Instructions seem to have the same text. We need some clarification from Wendy on this. They can describe an Imputation procedure. This has not yet been brought up in 4. This would be methodology or fieldwork. It is in the Processing package now. We need clarification at the sprint.

Security in variable relates to the discussion above. 3.2 doesn't do much at the row level but this is becoming a requirement.

Embargo is in Simple Codebook, but this is basically a set of placeholders right now. This should be part of the Access Rights discussion at the sprint so we do this consistently. Where should this come from? A use case or the modeling team proposing an approach. We probably need both directions. Maybe two use cases – one from Bill for metadata and one from Ornulf for data.

Response Unit is not yet modeled and will come up in complex instrument. This can be at the study and variable level. An equivalent should be covered in methodology.

For question elements, there is a container in Data Capture that will work for this and allow you to instantiate pre-, post-, and literal question as well as interviewer instructions. Statement is the container.

In terms of invalid range, this is in Simple Codebook. How are we tying this to missing? In 3.2 and in Simple Codebook in 4 you can point to a managed missing values representation and in that you can do ranges. You can do things like from this value to that value is a missing value. This is there by virtue of having been brought over from 3.2. The ISO 11404 notion of sentinel value (each instance variable has a set of such values but it might point to the same represented variable) has been modeled to allow for the valid set of data to be handled in different statistical packages. You have to represent the semantics in different ways. The Data Description group should handle this.
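To make the "from this value to that value is a missing value" idea concrete, here is a minimal sketch. It assumes inclusive numeric ranges; the `MissingRange` class, the `is_missing` helper, and the example codes are hypothetical and do not reproduce the DDI 3.2 or DDI 4 structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissingRange:
    """Inclusive range of codes treated as missing (hypothetical structure)."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def is_missing(value: float, ranges: list[MissingRange]) -> bool:
    """True if the value falls in any declared missing-value range."""
    return any(r.contains(value) for r in ranges)


# Hypothetical declaration: codes 97 through 99, and the single code -1, are missing.
age_missing = [MissingRange(97, 99), MissingRange(-1, -1)]
```

A single sentinel value is expressed here as a range whose low and high bounds are equal, so one structure covers both cases.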

Undocumented Codes – they should have had a label but didn't get documented. Codebook is the obvious group to handle this.

Total Responses is another part of the documentation for variable and should be handled by Codebook. This is handled with a controlled vocabulary when you say what type of statistic it is.

Summary Statistics is in Complex Data Type. They are not in the Simple Codebook view now but that hasn't been built out yet and we would need to include them in the view.

In terms of Descriptive Text, all the variables in 4 inherit Description as members.

 

  • Simple Codebook Meeting April 27, 2015

Present: Michelle Edwards, Dan Gillman, Oliver Hopt, Steve McEachern, Mary Vardigan

The group went back to the mapping between DDI Codebook and what is in DDI4. In terms of Access Conditions, there is an Access module in Discovery, where it is streamlined. It looks as if availability and use statements are not included; everything is structured string. We might look at SAML or another controlled vocabulary for access control like XACML (Extensible Access Control Markup Language). The issue is whether the outside source maintains previous versions, which we don't have control over.

In terms of Other Material, this was all found in DDI4 except for the Other Material table. This was part of DDI Codebook to mark up a table for presentation. In terms of VarDoc version, none of that was in DDI4. In DDI4 versioning is done at a low level, so this is taken care of at a level of the model that is not about particular content but about everything – Identifiable and Annotated Identifiable. There is an ID and a version. The question is that in Codebook the description is applied against Variable; in DDI4 identification applies broadly.

The group traced identification through the DDI4 model and looked at Collections and Members. Version Type in DDI Codebook does not seem to be covered, but no one is using this. Type seems no longer relevant and related to documents rather than to elements. People who understand this element from the old way of thinking have to know that the idea of a version is being expanded. We need to table this for now but are leaning toward deprecating this element.

Coding Instructions probably maps to Fieldwork and Methodology, which we don't have yet in DDI4.

 

  • Simple Codebook Meeting 2015-04-13

The meeting focussed on reviewing the next set of metadata elements from DDI-C - those covered by Steve.

Steve had created an additional three columns to his copy of the spreadsheet for his work - adding:

  • Package (for elements already matched in DDI4)
  • Suggested Package (for elements that have no match)
  • the DDI-C definition.

These additional columns have now been added to the Google Spreadsheet - linked here.

The discussion then focussed on the elements. Notes on specific elements are included in the spreadsheet, and summarised below

Elements

Source

  • For example, for a digitized statistical abstract, the original print publication; for administrative data, the original administrative program. A simple version of provenance.

Geographic unit

  • “Lowest level of geographic aggregation covered by the data.”
  • Would GeographicLevels (plural) be better, to indicate that multiple levels can be used? Is GeographicLevel a better term than GeographicUnit?

Control operations

  • Description of what was done. Data collection process.

General comments and issues

It was noted that much of the methodology section of DDI-C was not yet covered in DDI4. Part of this will be addressed by the Methodology working group.

There is however a set of elements that are not really methodology (or at least the research design), but rather are descriptive of the process and outcome of the execution of the methodology. These elements might most appropriately fall under the heading of "Fieldwork". Examples from DDI-C include:

  • CollectionSituation
  • MinimizeLossActions
  • ControlOperations

and, notably, RESPONSE RATE.

The group was concerned that it is unclear how we might provide recommendations here. For ResponseRate, for example, what is meant: the “opposite of rate of refusal,” or other types? It was also recognised that this is not really part of methodology, but has an impact on methodology as well as on analysis and post-processing. For example, was there an intervention based on low response rate? These are fieldwork issues.

On similar lines, there was a recognition that Methodology is the ugly part of DDI Codebook. Dan suggested that this section may be in need of a significant revamp, given the developments in survey methodology that have occurred since the original development of DDI-C, in particular the Total Survey Error framework.

It was noted that these issues with Methodology and Fieldwork need to be raised with the AG sooner rather than later, as they have resource and workload implications for the Moving Forward program. Steve will write something up on this and distribute to the group, prior to sending to the AG.

 

  • Simple Codebook Meeting March 16, 2015

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The agenda for the meeting was to determine if all elements in the CESSDA profile/Nesstar profile are present in DDI 4. Larry Hoyle had created a spreadsheet of DDI Lite and the list of elements from CESSDA profiles. There seems to be a wide variety of the selection of the elements and attributes in the repositories using DDI Lite. The Nesstar Webview comes as the base. The group compared elements used across different repositories.

The task was to find out which elements are in DDI4, so the group decided to divide up the list of 200+ elements. There appear not to be any DDI4 elements about the metadata itself, the DDI document. It basically parallels the study description information. This may not be relevant for DDI4. Perhaps the Data Citation group should think about this. This is often the archive's intellectual property, so some representation of it will be of interest to most of the archives. Citing the user guide or documentation is a common practice.

DDI Codebook has some elements of description that DDI4 has not been talking about. We need to bring forth something to the Advisory Group about this – this is an issue that we need to discuss. In DDI Lifecycle there is the corresponding instance with a citation on it. There is no DDI4 instance because instance is a root element for documents in general.

Will the idea of a document description disappear in 4? The archive creates a document describing the data. The landing page is sometimes (always?) metadata.

Study level, variable level, record level, file level: should the Data Citation group look at what are targets of citation?

In DDI Codebook, we have DocumentDescription; in DDI Lifecycle we have DDIInstance. Should DDIInstance be brought back into DDI4? – with revised content but allowing attachment of annotation.

Being able to point to an XML file with the model and generate that file from elements in 4 is adequate. But it is no longer enough to point to one object that contains everything.

We have the logical vs. physical distinction. A DDIInstance is a physical thing – something that's there. Pulling together the information into that representation is an activity with Authors, etc. There can be the "same" content in two archives, with different contact people and different URIs for each. This is parallel to data description.

  • Assignments for the next meeting

Where in DDI4 does each of these elements exist?

FirstLine  LastLine  N   Who       Content
70         101       31  Dan       Citation
102        131       29  Steve     Scope Methodology
132        155       23  Oliver    Access Conditions
156        184       28  Larry     File Variable
185        205       20  Mary      VarDoc
206        232       26  Michelle  CategoryGroups OtherMaterial



 

  • Simple Codebook Meeting March 2, 2015

Present: Michelle Edwards, Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group welcomed Michelle Edwards of CISER. The chair noted that this group is in a sense waiting for other groups (Discovery, Data Description, Instrument) to complete what they are doing so that we can finish our work. We recognize a need to incorporate both Codebook and Lifecycle into one spec (DDI 4), so we have been exploring that in our group a bit.

DDI Lite was reviewed and compared with the element sets that ICPSR, GESIS, and IHSN use and they are a fairly good match.

We won't be able to exactly duplicate Codebook and Lifecycle as views of DDI 4 but we can get close. Organizations that have invested in 3.2 do not want to lose that investment. Can we map 3.2 to 4 by automatically importing what's in 3.2? We may need a conversation with Guillaume about this. This should probably be at the Advisory Group level.

DDI Codebook and Lifecycle have different names for the same element. We will need mappings for people.

What we write out is also important. Interoperability can be defined in terms of reading and writing out of a system. If we can read 2.5 into 4, we are able to ingest anything that occurs anywhere under 2.5. We want to be able to write an instance that contains all the semantic content of Codebook. If we know that there is an equivalence we should have a 2.5 writer to write it out in that name. It is the structure and the mappings that matter.
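The read-and-write view of interoperability described above can be sketched as a pair of name mappings. This is only an illustration under stated assumptions: the element names in `CODEBOOK_TO_DDI4` and the helper functions are hypothetical, and a real mapping would also have to handle structure, not just names.

```python
# Hypothetical name equivalences; not the actual DDI 2.5 or DDI 4 vocabularies.
CODEBOOK_TO_DDI4 = {
    "titl": "Title",
    "abstract": "Abstract",
    "varLabel": "VariableLabel",
}
# The 2.5 writer uses the inverse of the reader's mapping.
DDI4_TO_CODEBOOK = {v: k for k, v in CODEBOOK_TO_DDI4.items()}


def read_codebook(record: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Ingest 2.5-style names as 4-style names and report names with no mapping."""
    mapped = {CODEBOOK_TO_DDI4[k]: v for k, v in record.items() if k in CODEBOOK_TO_DDI4}
    unmapped = sorted(k for k in record if k not in CODEBOOK_TO_DDI4)
    return mapped, unmapped


def write_codebook(record: dict[str, str]) -> dict[str, str]:
    """Write 4-style names back out under their 2.5-style equivalents."""
    return {DDI4_TO_CODEBOOK[k]: v for k, v in record.items() if k in DDI4_TO_CODEBOOK}


source = {"titl": "Example Study", "abstract": "An illustration.", "unknownElem": "x"}
mapped, unmapped = read_codebook(source)
round_trip = write_codebook(mapped)
```

Anything that lands in `unmapped` is content that would be lost on ingest, which is one way to check the claim that everything occurring under 2.5 can be read into 4.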

There were changes between Codebook and Lifecycle that were not necessarily clean because of the use of things by reference in 3 (categories and codes). Upward compatibility may be tougher than downward compatibility. We should probably not worry about 3 here but concern ourselves with mapping 2.5 into 4.

Is Codebook still an aggregation of Discovery, Description, and Instrument? Right now Discovery is a stripped down element set.

We could start with 2.5 as a starting point and we need to be able to account for this. Then we could look at 4 and ask whether everything is covered. Can we restrict this to 2.5 Lite? Generally, yes.

A Codebook view would be intended for an audience that is creating or managing codebooks and it doesn't matter what things are in other views or packages.

Views can overlap as much as you want. DDI Lite is a view. DDI 2.5 is a view. We are leveraging the experience of repositories (ICPSR, GESIS, IHSN) in serving up data, so that makes a good codebook. It makes sense to rely on DDI Lite, which we know is used.

The group reviewed the elements in DDI Lite. ADA uses a few other elements like deposit date, alternative title, collection situation, etc. ADA uses the default Nesstar template, which is close to DDI Lite. We should look at Nesstar also. The CESSDA Profile would be the best thing to use. We need to identify where things are already defined in 4 and where things still need to be defined in 4. We need to know what is missing from 4 in order to have a sense of where we stand. Our group could then go to the AG to say what needs to be addressed in sprints.

If we have something in 4 that maps to Nesstar/CESSDA profile, that allows a big chunk of DDI users to adopt 4. There is another migration path we can look at: we have 2.5 codebook - is there a more modern one? Migrate 2.5 to something different? This may be out of scope for our group but we should discuss it.



 

  • Simple Codebook Meeting February 16, 2015

Present: Dan Gillman, Larry Hoyle, Steve McEachern, Mary Vardigan

  • Completeness of crosswalk between 2 and 3

The crosswalk or mapping is essentially one-way from 2 to 3. Codebook doesn't have the reusability that Lifecycle does. This is the same issue as between SPSS and Stata/SAS. We should look at the mapping in more detail.

  • Content and functionality of Simple Codebook

We want to make sure that Simple Codebook lets us write or ingest 2.x fairly seamlessly. Are the same kinds of element names available in 3? The names change even at the highest level.

Many miss the Tag Library as it was so simple. This kind of resource would be useful along with a mapping. However, Wendy advises that we don't have to worry about 2 since the mapping is there.

Even 2 has a lot of content. Are we still talking about a simple codebook as opposed to a complex codebook? Simple should allow you to take information from a major statistical package and move to another without losing any information (this is our definition of simple). Questions should be included, as should sampling and universe. We should review DDI Lite and DDI Core, which have not been updated to the most recent versions of Codebook and Lifecycle. This may enable us to have a framework for content. We will deal with functionality later.

We have been making the assumption that we have the Instrument information and the Data Description information from those two views. What else do we need? We need context information or study level – Universe, sampling, design, bibliographic information. In DDI 2.* we have Citation, Study information (which is discovery related), Methodology, and Access. This is good content.

What do you need to know to use the data? You need the variable information. Question order and the way questions are asked may be important.

There is a tension between being very simple and following best practice for good documentation. Can we add pointers to relevant information? The simple/complex distinction is levels of detail.

For secondary users, we need enough information for a researcher to be able to understand and evaluate the quality of a dataset without reference back to the original data producer. We also need enough information to pull the data into a statistical package.

We started an exercise to take the common set of CESSDA, ICPSR, and IHSN mandatory schemas, and figure out what is the superset. The spreadsheet can be found in the attachments on the page: Simple Codebook View Team. We should compare this set of elements to what is in DDI Lite and DDI Core.
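The superset exercise amounts to a set union across the organizations' mandatory element lists. The sketch below shows the idea only; the element names are hypothetical placeholders, and the real lists are in the team spreadsheet.

```python
# Hypothetical mandatory element lists; the real ones are in the team spreadsheet.
mandatory = {
    "CESSDA": {"Title", "Abstract", "Universe"},
    "ICPSR": {"Title", "Abstract", "Sampling"},
    "IHSN": {"Title", "Universe", "Weights"},
}

# Superset: every element that at least one organization requires.
superset = set().union(*mandatory.values())

# Common core: elements that all organizations require.
common = set.intersection(*mandatory.values())

# For each element, which organizations require it.
required_by = {
    element: sorted(org for org, elements in mandatory.items() if element in elements)
    for element in sorted(superset)
}
```

The same comparison could be repeated against the DDI Lite and DDI Core element lists to see which superset elements they already cover.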

Necessary for a simple codebook: variables and questions and layout; universe or population; level of geography (basically coverage, including temporal and subject); sampling; or weights (and point to thorough description of sampling).

The distinction between simple and complex for data description is between a simple rectangular file and other data types; this applies to codebook in some ways as well. But there may be a cascading effect if we limit ourselves to simple rectangular files (we should describe hierarchical files as well, like CPS). You can have hierarchical data in CSV with a record type field, but historically we have had files with physical representations of the data that are esoteric. How much of this do we need to handle? For a simple codebook, the simple representation should be limited to Unicode or something like that.
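As an illustration of hierarchical data in CSV with a record type field, here is a minimal sketch. The file layout is an assumption made for the example: the first field is the record type, 'H' rows are households, and 'P' rows are persons belonging to the most recent household. It is not the actual CPS layout.

```python
import csv
import io

# Hypothetical layout: 'H' = household (id, state), 'P' = person (line number, age).
raw = """H,1001,Kansas
P,1,34
P,2,36
H,1002,Michigan
P,1,71
"""


def read_hierarchical(text: str) -> list[dict]:
    """Group person records under the household record that precedes them."""
    households: list[dict] = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # ignore blank lines
        record_type = row[0]
        if record_type == "H":
            households.append({"id": row[1], "state": row[2], "persons": []})
        elif record_type == "P":
            if not households:
                raise ValueError("person record before any household record")
            households[-1]["persons"].append({"line": int(row[1]), "age": int(row[2])})
        else:
            raise ValueError(f"unknown record type: {record_type!r}")
    return households


households = read_hierarchical(raw)
```

A codebook for such a file would need to document the record type field and a separate variable layout for each record type.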

Homework: review DDI Core: http://www.ddialliance.org/sites/default/files/ddi3/DDI3_CR3_Core.xml and DDI Lite: http://www.ddialliance.org/sites/default/files/ddi-lite.html

And think about what limitations we want to put on format to keep the idea of simple codebook but to keep it rich enough so we are covering enough situations.

The next meeting will be in two weeks on March 2.

 

  • Simple Codebook Meeting Minutes February 2, 2015


Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The Simple Codebook committee will now be chaired by Dan Gillman as Wolfgang is not able to chair currently.

This group has been in a holding pattern because we are waiting on the results of other groups. However, it was suggested that we look at the Codebook 2.5 (Codebook Version) in comparison to DDI 3.* (Lifecycle Version).

XML permits a detailed description of elements and this is part of the distinction between 2 and 3. But UML doesn't allow this and doesn't account for nesting and levels of detail. We should try to incorporate what is in Version 2 into the model as best we can. We as a group should try to build this. One additional possible advantage would be that we could then have a single model to account for both Codebook and Lifecycle. Both views would be under one spec in this approach.

Is referencing and reusability a distinction between the two versions that we should take into account? Should it be communicated to the modeling team that we may not need the complexity?

For users who want to describe their data, they should be able to write a description and fit it into a framework. If you want to have interoperability with other systems, then that is a different issue.

For the standalone one-off research project, users will not need to be reusing variables and questions, but for longitudinal and research across languages and cultures, this is important; there is a need to harmonize across questionnaires, reuse metadata across time, etc. Maybe this is Complex Codebook?

We need a distinction between the user perspective and the technical perspective. Simple and complex need to be interoperable. It's necessary to reduce the complexity of what is modeled in the library by choosing the simple cases.

One of the decisions for DDI 4 is to make everything identifiable and drop the container aspect of identifiability. This takes away a lot of the complexity.

From a marketing perspective, we need to distinguish between the DDI Codebook version and the Simple Codebook view. Looking at what is in 2 now will be required and we need to lay out what we need to account for. In the study section for DDI Codebook, there were a number of elements that allowed you to provide a high level text description of various methodological things. Preserving that is important.

Capturing what is in an SPSS or SAS representation including all the metadata you can put there is also important. When you move data around, you don't want to lose anything. When you look at how researchers want to record information, it is often difficult for them to record things in detail. Guided structures as part of their workflow are important, and Codebook is one view that could help them with this. You need some structure that becomes machine-actionable. You don't want people to just write a narrative.

At BLS, there is a Handbook of Methods. It has narrative descriptions of the surveys BLS does and it doesn't have a lot of detail. This should be captured in DDI rather than in a PDF. There is a need for high level and detailed as well. There may still be a need for some kind of a DDI Lite as a way of inducing reluctant data producers to get involved. For variables the detail is necessary. We should make this as flexible as we can.

We can start by looking at what is in 2.5 and figure out from the point of view of a list of what we need to account for. This would be a set of requirements that we as a group need to figure out how to solve. One question we want to address from a modeling point of view is, for example, when we need to say how the sample is constructed: Would those higher level descriptions go in a class of things that are independent of everything else or part of a sampling class? These are design issues that might have an impact on the way the more detailed model plays out.

If we can manage both 2 and 3 in the same structure we as a standards body will have an easier time with this. We should consult with Wendy on this.

Several archives still rely on DDI Codebook, Nesstar, etc. There is a set of codebook specs from different archives.

Are we talking about having our Simple Codebook view covering everything that is in 2.5? It should be even less. But should there be a view that is everything in 2.5? One idea is a view that is a really simple codebook but to allow for complexity in any direction you would like to go so we could incorporate everything that is in 2.5. Or go into more detail in 3 for whatever direction you want to go so there is a seamless distinction between high and detailed levels. This is basically what DDI 4 is. We should provide a lot of different options about how much detail the user wants. With 4 right now we have detailed descriptions of a lot of things but we are not allowing for high level descriptions. The description and definition were discussed in London with respect to Drupal in the sense that there could be radio buttons to indicate that they should be used to standardize those objects. It could be possible to have a description without any usage of detailed sub-elements.

There could be an attribute that could be high-level description. Or we have an element saying this is the Sample Description. Just having an element called description associated with identifiable objects may not be sufficient. In the annotated identifiable there is an annotation element that has Dublin Core properties like Title, Contributor, etc. It has an abstract. But there is nothing that is a high-level description.

On the one hand it might be nice to have a Sampling Description, but it might be over-specified. It's important to have an element dedicated to a high-level description that you are offering in place of the detail or as a supplement to the detail. A general description like the annotation will lose semantic interoperability. We need machine-interpretability. We also want the possibility to reference just the high level description in the simple codebook.

We should be able to allow for user-defined views that provide for whatever level of detail an organization uses. A Simple Codebook view that maps back to 2.5 would be useful. It would allow those organizations just using 2 to feel comfortable using 4.

DDI 4 does not have the same hierarchy as DDI 3. We would still need an object carrying high level content for the sampling process and nothing else. In 3 there was a parent node but we don't have this structure in DDI 4, which means you need to create a container for this description. It's not a question of using description as a property containing the text, but which element carries the description.

Between now and the next meeting, Oliver will make some slides with an example of what we have been talking about. We also need to dig into DDI 2.5 to get a handle on what is needed at the higher level. Dan and Larry will look at this. Dan will also consult with Wendy on this.


 

  • Simple Codebook Meeting Minutes
    November 23, 2014


Present: Dan Gillman, Steve McEachern, Mary Vardigan

  • Meeting Times

The current time is midnight for Canberra, so we need to find another meeting time. 2pm EST U.S. time is the preferred time for the new year.

  • DDI 3.2 vs. 4

We are thinking in terms of forward compatibility so that everything in 3.* is covered in 4. This is not the best approach. Rather, we should solve the problem we want to solve and then worry about how to map it after we have solved it.

Framing happens unconsciously -- the circumstances of how you think about a problem constrain the way you are conceiving it.

Still it’s worth having a look at what we have right now to see what the overlap is.

By sticking with the nicely defined distinction between logical and physical we can be more precise going forward.

Not much is left uncovered in 4, but it is going to be reorganized.

  • Next Steps

Steve will compare the spreadsheet to Data Description in 4 to determine how they map and overlap.

 

November 10, 2014
  • Simple Codebook Team Minutes 2014 11 10

Dan Gillman, Larry Hoyle, Jenny Linnerud, Steve McEachern, Wendy Thomas, Mary Vardigan, Wolfgang Zenk-Moeltgen

  • Mappings

There is a spreadsheet for the archival codebook use case that lists elements used by ICPSR, CESSDA, and IHSN. Wendy will map the IHSN codebook elements to DDI 3.2 and send this to the group.

  • Package vs. View

The group also looked at the Simple Codebook package on the Lion Drupal site. The question was raised of whether we should still work on the Lion site since the simple codebook is still a package and not a view.

All Discovery information came from the Disco specification. We should model our own objects for Simple Codebook and then map to Disco. There hasn’t yet been a discussion about what is a property and what is an object.

In terms of information elements needed for codebooks, there are more things in the package than in the spreadsheet because we copied the DDI 3.2 elements and didn’t delete anything with the idea that we would need the other elements later. We moved all the objects specified as Keep from DDI 3.2 into the package.

The content groups should create a complete list of the information elements needed and then the modelers will arrange this into packages and make decisions about objects vs. properties.

Should we add all elements from package into the view? Right now we should not put in anything we are not using. We are trying to start with the essential and then add onto it.

Things go in the view on Lion if we want them in the simple DDI 4 codebook. We are compiling the list of elements used by ICPSR, CESSDA, and IHSN for the basic set of elements. The package that Wolfgang created at Dagstuhl contains additional things not in the spreadsheet. Everything from the Excel list should go into the view.

Graphs are only created for packages and not for views. This is something we will miss if we work only in the view. It doesn’t make sense to work on the Lion site until basic processes have been defined. How do we capture the results of this group?

After the EDDI sprint the whole process should be working and documentation will be produced from a view and a diagram will be produced from the nightly build. Right now things in other groups’ work (e.g., instrument) are not linked in to our codebook set of elements.

Things that need to be fixed on Lion include:

a. Arrows on aggregation and composition are the wrong way round

b. On each object the current DDI 3.2 and GSIM fields should also be rendered in View mode (they are currently only visible in Edit mode)

c. No graph appears for View, only a flat list

Wendy will relay these points to the modelers and the Lion maintainers.

  • Composite View Modifications

If you bring elements that have been defined elsewhere into your composite view, you get everything, but you may not want everything. You should be able to make a simple codebook. Someone needs to remodel the elements so that we can take just the portion we want.

We want to confirm what we need in our use case through the Excel spreadsheet. We need to draw a line for our first proposal for our simple codebook. The more things we decide are properties of an object, the more remodeling we will have to do when people want only some properties. Does this argue for making more things objects rather than properties? This is a tough call as we want to decrease the number of elements.

  • Working with Data Description

We should take a look at the Data Description model that came out of Dagstuhl and use that as a test because the current discussion about Datum is going to change things. What came out of Dagstuhl has had a lot of review and is considered solid. On Lion, what’s there now is the representation of what was decided at Dagstuhl and is up to date. Steve can compile this in a straightforward way and generate a view which is all the objects that will be used by Data Description. All the relevant content is in Lion.

First we need to look at the Excel list and compile the data description level and then look at the View for Data Description to make sure this matches what we need.
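
As a rough illustration of this comparison step, the sketch below treats the Excel list and the Data Description view as two sets of element names and reports the overlap. It is only a sketch under stated assumptions: the function name compare_elements and all element names are hypothetical and are not taken from the spreadsheet or from Lion.

```python
def compare_elements(use_case_elements, view_objects):
    """Report how the use case list and the view overlap."""
    needed = set(use_case_elements)
    available = set(view_objects)
    return {
        "in_both": sorted(needed & available),
        "needed_but_not_in_view": sorted(needed - available),
        "in_view_but_not_needed": sorted(available - needed),
    }


# Hypothetical element names, for illustration only.
excel_list = ["Variable", "Category", "Universe", "Weight"]
data_description_view = ["Variable", "Category", "Datum", "Universe"]

report = compare_elements(excel_list, data_description_view)
for label, names in report.items():
    print(f"{label}: {', '.join(names) or '(none)'}")
```

Elements reported as needed but not in the view would be candidates for discussion with the modelers; elements in the view but not needed could be left out of a simple codebook.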

  • Next Meeting

The next meeting will be on November 24.

  • Actions


  • Wendy will complete spreadsheet with information for IHSN

  • The group will pull out the needed elements at the data description level in the Excel sheet

  • Steve will create a view of Data Description (we can flatten this if needed)

  • The group can compare the use case elements to what is in Data Description

 

September 29, 2014
  • Simple Codebook Meeting September 29, 2014

Present: Jenny Linnerud, Steve McEachern, Barry Radler, Wendy Thomas, Mary Vardigan

  • Discussion

The group had a discussion of Simple Codebook as a compilation of different components, including Conceptual, Simple Data Description, Discovery, and Simple Instrument, as well as elements of data processing/provenance and Methodology. Steve showed a diagram he had created to show the big picture. He subsequently added boxes for Methodology and Discovery to that big picture – see DDI4_view_overview.pdf. This helps us visualize the structure of DDI 4 and also the codebook view. This diagram should be part of the upcoming Dagstuhl meeting orientation on the first day to ensure that everyone is united in their understanding of how DDI 4 works.

It was pointed out that there might be links to additional information (e.g., Methodology) in Simple Codebook but a more Complex Codebook could bring some of that information inline in a structured, machine-actionable way (e.g., routing/skip patterns through a questionnaire). The group also discussed that we need to distinguish questionnaire-centric codebooks from more generic codebooks that talk about measurement rather than variable, for example.

In Toronto the Methodology group got started but it needs more time to focus on this area. Initially, the group drew a line between design and implementation. It was pointed out that we need to separate "what do you want to do" from "what are you doing" and "what have you done." We also need to think about replication.

In terms of data citation, we may need to cite at a very granular level. Should Discovery be part of all views in this sense?

We should think about a set of administrative metadata that accompanies each view and describes it so there is some consistency across views. This might indicate the order that compilations should take and whether they have a logical sequence. This would be a guide to the view.

For Dagstuhl, our group should put out a requirements document indicating what we need for the Simple Codebook view, including the administrative metadata. We also want to have guidelines for future groups doing composite views.

The Simple Codebook needs a view on the Drupal server, so we will contact Oliver about that.

Action Items:

Mary to make sure Steve's diagram is part of Dagstuhl orientation

Mary to spearhead proposal for Dagstuhl that includes requirements for Simple Codebook

Mary to contact Oliver regarding the Simple Codebook view on Drupal

 

 

September 15, 2014


  • Simple Codebook Meeting
    September 15, 2014


Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Jenny Linnerud, Steve McEachern, Ørnulf Risnes, Wendy Thomas, Mary Vardigan

  • Discussion

The group affirmed Wendy’s definition of a codebook (See Appendix A for the full document):

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

The group discussed overlap with other groups and packages since codebook is a compilation of other packages. Simple Codebook is most likely a compilation of Conceptual, Simple Data Description, Discovery, and additional information that facilitates interpretation of the data and intelligent use. The difficulty is determining what depth of information is appropriate. For replication purposes, you need a lot of detail.

The Simple Data Description group is first focusing on data description in a broad way and will then define a subset for “simple.” Perhaps this group should do the same.

It would be helpful to have reports from other groups so that we know where they are and what makes sense to combine for simple codebook.

In Wendy’s list (Appendix A), much of the content we need is covered by other groups, but we could use more detail in Data Source, Data Processing, and Methodology. Methodology framed its scope broadly in Toronto but hasn’t yet met as a group. One activity for that group would be to review the sampling and weighting specifications that came out of the Survey Design and Implementation working group to see what is needed beyond that work.

  • Next Meeting

The group will meet again on Monday, September 29, to get reports from other groups.

  • Appendix A

  • What is a codebook?

[also referred to by DataONE as science metadata for science data]

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).
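
To make the definition above more concrete, here is a minimal sketch of how the data dictionary part (name, physical location, data type, size, valid and missing values) and the surrounding codebook information might be represented so that a program can use them. All class names, field names, and example values are assumptions made for illustration; they are not DDI objects or properties.

```python
from dataclasses import dataclass, field


@dataclass
class VariableEntry:
    """One data dictionary entry (field names are illustrative assumptions)."""
    name: str
    start_column: int   # physical location in a fixed-width record (1-based)
    width: int          # size in columns
    data_type: str
    valid_values: dict = field(default_factory=dict)    # code -> meaning
    missing_values: dict = field(default_factory=dict)  # code -> meaning

    def decode(self, record: str) -> str:
        """Read this variable from a fixed-width record and return its meaning."""
        start = self.start_column - 1
        raw = record[start:start + self.width].strip()
        if raw in self.missing_values:
            return f"missing ({self.missing_values[raw]})"
        return self.valid_values.get(raw, raw)


@dataclass
class Codebook:
    """Data dictionary plus supporting information, inline or by reference."""
    title: str
    variables: list
    methodology: str = ""        # informational content kept inline
    questionnaire_ref: str = ""  # reference to external content


sex = VariableEntry(
    name="SEX", start_column=5, width=1, data_type="integer",
    valid_values={"1": "Male", "2": "Female"},
    missing_values={"9": "Refused"},
)
codebook = Codebook(title="Example study", variables=[sex],
                    questionnaire_ref="questionnaire.pdf")

print(sex.decode("00012"))
print(sex.decode("00029"))
```

The structured entry is what supports "programming for access"; the methodology text and the questionnaire reference stand in for informational content held inline or by reference.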

  • Discussion

The definitions below for "codebook" are survey centric when referring to the broader set of metadata related to a data file. Another term may be preferable, but none leaps to mind. Whatever the description is called (codebook, science metadata, metadata, or something else), data files have two levels of description:

·         A structured physical description that supports the ability of the programmer to access the data accurately

·         Supporting information that allows the researcher to evaluate “fitness of use” of the data to a particular research question, the overall quality of the data, and the specifics of the conceptual (objects, universe/population, conceptual definitions, spatial and temporal) coverage. This information may be applicable to the study as a whole or to the individual variable. This also includes information on why and how the data were captured, processed, and preserved.


Type of information, with its treatment in a Basic Codebook, a Survey, and Fauna (Wildlife) data:

Data structure:

·         Record type

·         Record layout

·         Record relationship

·         Data type

·         Valid values

·         Invalid values

Basic Codebook: Structured metadata to support access

Survey: Structured metadata to support access

Fauna (Wildlife): Structured metadata to support access

Data source:

·         Why was data collected

·         How was data collected

·         Who collected the data

·         The universe or population and how it was identified and selected

Basic Codebook: Descriptive to support assessment of quality and fitness-for-use

Survey: Purpose of the survey; Survey content and flow (may or may not need to be actionable); identification and sampling of survey population (may or may not need to be actionable for replication purposes)

Fauna (Wildlife): Purpose of study, how data was collected (may need to be actionable to support replication and/or calibration); identification and sampling of survey population (may or may not need to be actionable for replication purposes)

Data processing:

·         Data capture process

·         Validation

·         Quality control

·         Normalizing, coding, derivations

·         Protection (confidentiality, suppression, interpolation, embargo, etc.)

Basic Codebook: Informational material; support provenance

Survey: May need structured metadata for purposes of replication; Include processes, background information, proposed, actual, and implications for data

Fauna (Wildlife): May need structured to support mechanical capture instruments, calibrations, situational variants, etc.

Discovery information:

·         Who

·         What

·         When

·         Why

·         Coverage

o   Topical

o   Temporal

o   Spatial

Basic Codebook: Structured metadata to support discovery and access to the data as a whole

Survey: Structured metadata to support discovery and access to the data as a whole

Fauna (Wildlife): Structured metadata to support discovery and access to the data as a whole

Conceptual basis:

·         Object

·         Concept

Basic Codebook: Informational material

Survey: Structured to support analysis of change over time and relationship between studies. May just be descriptive / informational.

Fauna (Wildlife): Structured to support genre level comparison (heavy use of common taxonomies, etc.)

Methodologies employed:

Basic Codebook: Informational material

Survey: Structured to support replication and comparison between studies

Fauna (Wildlife): Structured to support replication and comparison between studies

Related materials of relevance to data:

Basic Codebook: Informational material



  • Definitions

  • Data Dictionary

·         A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."[1] The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

·         A document describing a database or collection of databases

·         An integral component of a DBMS that is required to determine its structure

·         A piece of middleware that extends or supplants the native data dictionary of a DBMS

·         Database about a database. A data dictionary defines the structure of the database itself (not that of the data held in the database) and is used in control and maintenance of large databases. Among other items of information, it records (1) what data is stored, (2) name, description, and characteristics of each data element, (3) types of relationships between data elements, (4) access rights and frequency of access. Also called system dictionary when used in the context of a system design. Read more: http://www.businessdictionary.com/definition/data-dictionary.html#ixzz3Am5wCgZI

·         A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. (Posted by Margaret Rouse  @ WhatIs.com)

A codebook describes and documents the questions asked or items collected in a survey. Codebooks and study documentation will provide you with crucial details to help you decide whether or not a particular data collection will be useful in your research. The codebook will describe the subject of the survey or data collection, the sample and how it was constructed, and how the data were coded, entered, and processed.  The questionnaire or survey instrument will be included along with a description or layout of how the data file is organized.  Some codebooks are available electronically, and you can read them on your computer screen, download them to your machine, or print them out. Others are not electronic and must be used in a library or archive, or, depending on copyright, photocopied if you want your own for personal use.

Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file. Data files usually contain one line for each observation, such as a record or person (also called a "respondent"). Each column generally represents a single variable; however, one variable may span several columns. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean. Codebooks are used to document the values associated with the answer options for a given survey question. Each answer category is given a unique numeric value, and these unique numeric values are then used by researchers in their analysis of the ...

A codebook is a type of document used for gathering and storing codes. Originally codebooks were often literally books, but today codebook is a byword for the complete record of a series of codes, regardless of physical format.

  • ICPSR

What is a codebook?

A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the datafiles.

While codebooks vary widely in quality and amount of information given, a typical codebook includes:

• Column locations and widths for each variable

• Definitions of different record types

• Response codes for each variable

• Codes used to indicate nonresponse and missing data

• Exact questions and skip patterns used in a survey

• Other indications of the content and characteristics of each variable

Additionally, codebooks may also contain:

• Frequencies of response

• Survey objectives

• Concept definitions

• A description of the survey design and methodology

• A copy of the survey questionnaire (if applicable)

• Information on data collection, data processing, and data quality

 

Dagstuhl Sprint Oct 2014

Minutes from Dagstuhl Sprint 2014 Working Group

 

June 30, 2014
 
  • Meeting: 2014-06-30

 

Attending: Guillaume Duffes, Dan Gillman, Larry Hoyle, Ørnulf Risnes, Steve McEachern, Wendy Thomas

 

Reviewed list of related package and view content from Wolfgang

 
  • Decisions:

 

There is currently a lot of duplication in the list and it needs to be normalized prior to review.

 

Steve will normalize the list and send it out to members later this week with the following instructions:

 

Review the list and do the following:

 
  • Add any unlisted objects that you would expect to find in a basic or simple codebook

  • For each item, indicate whether it is required in order to publish the codebook or would be useful to have in the codebook

  • Return your review to the group.

Unless other agenda items arise, schedule the next meeting after the deadline for returning reviews.

 

Process:

 
  • Items that have agreement in terms of "required" will go into a basic view

  • Items that have agreement in terms of "would like to see" will go into an "intermediate" view

  • Items without agreement will be discussed and assigned during the next meeting
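
The sorting rule above can be sketched as a small function. This is a minimal illustration, assuming each reviewer returns one rating per item; the rating labels "required" and "useful", the function name assign_to_views, and the example items are hypothetical and not something the group has agreed on.

```python
def assign_to_views(reviews):
    """Sort items into views based on reviewer agreement.

    reviews maps an item name to a list of ratings, one per reviewer.
    Ratings are "required" or "useful" (hypothetical labels).
    """
    basic, intermediate, to_discuss = [], [], []
    for item, ratings in reviews.items():
        distinct = set(ratings)
        if distinct == {"required"}:
            basic.append(item)          # everyone says required
        elif distinct == {"useful"}:
            intermediate.append(item)   # everyone says "would like to see"
        else:
            to_discuss.append(item)     # mixed or missing ratings
    return basic, intermediate, to_discuss


# Hypothetical review results from three members.
reviews = {
    "Variable name": ["required", "required", "required"],
    "Frequencies": ["useful", "useful", "useful"],
    "Sampling description": ["required", "useful", "required"],
}

basic, intermediate, to_discuss = assign_to_views(reviews)
print("Basic view:", basic)
print("Intermediate view:", intermediate)
print("Discuss at next meeting:", to_discuss)
```

Treating any mixed or missing rating as "no agreement" is a deliberately strict reading of the rule; the group could instead choose a majority threshold.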

 

This may result in the creation of two "simple codebook" views and appropriate names should be determined.

 
  • Discussion:

 

Given the range of use cases (something above a simple data set to a simple study housed in an archive) it is difficult to determine what is meant by "simple". Rather than discuss in the abstract, it may be helpful to get a list of objects one would like to see in a simple codebook from the members of the group and then identify those objects that are considered to be the minimum requirement for publication. This may result in two levels for a simple codebook (basic and intermediate), but the approach would provide clear information on where there is consensus and where there is debate.

 

Statements that may help define the differences between these two levels:

 
  • The bare minimum needed in order to publish (basic)

  • What would you like to see in this view (intermediate)?

 

There has been a shift from the initial content creation in Drupal of a simple codebook "package" to the idea of a "view" and we need to reorient the Drupal content to this shift. In addition, packages and views relating to the simple codebook view that were not in existence when the work of this group was started are now more fully defined. The content of these packages and views needs to be considered when defining the view(s) of a simple codebook.

 

View orientation is liberating

 
  • A view contains objects (it is not a compilation of views)

  • A view (specific version) may partially or fully support another view - the intent to do this should be noted in the description of the new view

 

The following process could be useful in defining the view(s) for a simple codebook:

 

Creating the list of objects for a simple codebook:

 
  • Start with Wolfgang's list (the normalized version) as an example

  • What would you add?

  • What would you like?

  • What is required vs. what is optional (simple to intermediate)?

 

Create a view of Simple Codebook in Drupal, using the final agreed-upon list of objects for the view

 

Note: Some of the objects being included are complex objects. These should then be reviewed to see if a simpler basic object of that type is needed (i.e., we may only want to include a "stripped down" version in the view).

 

Steve will have a go at normalizing the list and will send it out to the group

 

Wolfgang can then enforce getting responses.

 

Meeting in two weeks:

 
  • this week if possible for list out

  • wish list turnaround

  • may want to delay next meeting until after due date for getting lists back from members