
 June 30, 2014
 

Meeting: 2014-06-30

 

Attending: Guillaume Duffes, Dan Gillman, Larry Hoyle, Ørnulf Risnes, Steve McEachern, Wendy Thomas

 

Reviewed list of related package and view content from Wolfgang

 

Decisions:

 

There is currently a lot of duplication in the list and it needs to be normalized prior to review.

 

Steve will normalize the list and send it out to members later this week with the following instructions:

 

Review the list and do the following:

 
  1. Add any unlisted objects that you would expect to find in a basic or simple codebook

  2. For each item, indicate whether it is one that would be required in order to publish the codebook or one that would be useful to have in the codebook

  3. Return your review to the group.

 

 

 

Unless other agenda items arise, schedule the next meeting after the deadline for returning reviews.

 

Process:

 
  • Items that have agreement in terms of "required" will go into a basic view

  • Items that have agreement in terms of "would like to see" will go into an "intermediate" view

  • Items without agreement will be discussed and assigned during the next meeting

 

This may result in the creation of two "simple codebook" views and appropriate names should be determined.

 

Discussion:

 

Given the range of use cases (anything from something above a simple data set to a simple study housed in an archive) it is difficult to determine what is meant by "simple". Rather than discussing in the abstract, it may be helpful to get a list of objects one would like to see in a simple codebook from the members of the group and then identify those objects that are considered the minimum requirement for publication. This may result in two levels for a simple codebook (basic and intermediate), but the approach would provide clear information on where there is consensus and where there is debate.

 

Statements that may help define the differences between these two levels:

 
  • The bare minimum needed in order to publish (basic)

  • What would you like to see in this view (intermediate)?

 

There has been a shift from the initial content creation in Drupal of a simple codebook "package" to the idea of a "view" and we need to reorient the Drupal content to this shift. In addition, packages and views relating to the simple codebook view that were not in existence when the work of this group was started are now more fully defined. The content of these packages and views needs to be considered when defining the view(s) of a simple codebook.

 

View orientation is liberating

 
  • A view contains objects (it is not a compilation of views)

  • A view (specific version) may partially or fully support another view - the intent to do this should be noted in the description of the new view

 

The following process could be useful in defining the view(s) for a simple codebook:

 

Creating the list of objects for a simple codebook:

 
  • Start with Wolfgang's list (the normalized version) as an example

  • What would you add?

  • What would you like?

  • What is required vs. what is optional (simple to intermediate)?

 

Create a view of the Simple Codebook in Drupal, using the final agreed-upon list for the view

 

Note: Some of the objects being included are complex objects. These should then be reviewed to see if a simpler basic object of that type is needed (i.e., we may only want to include a "stripped down" version in the view).

 

Steve will take a first pass at normalizing the list and send it out to the group.

 

Wolfgang can then follow up to make sure responses come in.

 

Meeting in two weeks:

 
  • send the normalized list out this week if possible

  • allow turnaround time for the wish lists

  • we may want to delay the next meeting until after the due date for getting lists back from members

 

 

 September 15, 2014

 

Simple Codebook Meeting
September 15, 2014

 

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Jenny Linnerud, Steve McEachern, Ornulf Risnes, Wendy Thomas, Mary Vardigan

Discussion

The group affirmed Wendy’s definition of a codebook (See Appendix A for the full document):

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to support the incorporation of the data into a program for analysis, including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships, and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

The group discussed overlap with other groups and packages, since a codebook is a compilation of other packages. Simple Codebook is most likely a compilation of Conceptual, Simple Data Description, Discovery, and additional information that facilitates interpretation of the data and intelligent use. The difficulty is determining what depth of information is appropriate. For replication purposes, you need a lot of detail.

The Simple Data Description group is first focusing on data description in a broad way and will then define a subset for “simple.” Perhaps this group should do the same.

It would be helpful to have reports from other groups so that we know where they are and what makes sense to combine for simple codebook.

In Wendy’s list (Appendix A), much of the content we need is covered by other groups, but we could use more detail in Data Source, Data Processing, and Methodology. Methodology framed its scope broadly in Toronto but hasn’t yet met as a group. One activity for that group would be to review the sampling and weighting specifications that came out of the Survey Design and Implementation working group to see what is needed beyond that work.

Next Meeting

The group will meet again on Monday, September 29, to get reports from other groups.

Appendix A

What is a codebook?

[also referred to by DataONE as science metadata for science data]

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to support the incorporation of the data into a program for analysis, including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships, and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

Discussion

The definitions below for "codebook" are survey-centric when referring to the broader set of metadata related to a data file. Another term may be preferable, but there isn't one that leaps to mind. Whether called a codebook, science metadata, metadata, or something else, data files have two levels of description:

  • A structured physical description that supports the ability of the programmer to access the data accurately

  • Supporting information that allows the researcher to evaluate “fitness of use” of the data to a particular research question, the overall quality of the data, and the specifics of the conceptual (objects, universe/population, conceptual definitions, spatial and temporal) coverage. This information may be applicable to the study as a whole or to the individual variable. This also includes information on why and how the data were captured, processed, and preserved.

 

Type of information, and how it is treated in a Basic Codebook, a Survey codebook, and a Fauna (Wildlife) codebook:

Data structure (record type; record layout; record relationship; data type; valid values; invalid values)
  • Basic Codebook: structured metadata to support access
  • Survey: structured metadata to support access
  • Fauna (Wildlife): structured metadata to support access

Data source (why was the data collected; how was the data collected; who collected the data; the universe or population and how it was identified and selected)
  • Basic Codebook: descriptive, to support assessment of quality and fitness-for-use
  • Survey: purpose of the survey; survey content and flow (may or may not need to be actionable); identification and sampling of the survey population (may or may not need to be actionable for replication purposes)
  • Fauna (Wildlife): purpose of the study; how the data was collected (may need to be actionable to support replication and/or calibration); identification and sampling of the survey population (may or may not need to be actionable for replication purposes)

Data processing (data capture process; validation; quality control; normalizing, coding, derivations; protection - confidentiality, suppression, interpolation, embargo, etc.)
  • Basic Codebook: informational material; supports provenance
  • Survey: may need structured metadata for purposes of replication; include processes, background information, proposed and actual processing, and implications for the data
  • Fauna (Wildlife): may need structure to support mechanical capture instruments, calibrations, situational variants, etc.

Discovery information (who; what; when; why; coverage - topical, temporal, spatial)
  • Basic Codebook: structured metadata to support discovery and access to the data as a whole
  • Survey: structured metadata to support discovery and access to the data as a whole
  • Fauna (Wildlife): structured metadata to support discovery and access to the data as a whole

Conceptual basis (object; concept)
  • Basic Codebook: informational material
  • Survey: structured to support analysis of change over time and relationships between studies; may just be descriptive/informational
  • Fauna (Wildlife): structured to support genre-level comparison (heavy use of common taxonomies, etc.)

Methodologies employed
  • Basic Codebook: informational material
  • Survey: structured to support replication and comparison between studies
  • Fauna (Wildlife): structured to support replication and comparison between studies

Related materials of relevance to the data
  • Basic Codebook: informational material

Definitions

Data Dictionary

  • A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."[1] The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

  • A document describing a database or collection of databases

  • An integral component of a DBMS that is required to determine its structure

  • A piece of middleware that extends or supplants the native data dictionary of a DBMS

  • Database about a database. A data dictionary defines the structure of the database itself (not that of the data held in the database) and is used in control and maintenance of large databases. Among other items of information, it records (1) what data is stored, (2) the name, description, and characteristics of each data element, (3) types of relationships between data elements, and (4) access rights and frequency of access. Also called a system dictionary when used in the context of a system design. (Source: http://www.businessdictionary.com/definition/data-dictionary.html)

  • A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. (Posted by Margaret Rouse @ WhatIs.com)

Codebook

What is a codebook? (http://www.sscnet.ucla.edu/issr/da/tutor/tutcode.htm)

A codebook describes and documents the questions asked or items collected in a survey. Codebooks and study documentation will provide you with crucial details to help you decide whether or not a particular data collection will be useful in your research. The codebook will describe the subject of the survey or data collection, the sample and how it was constructed, and how the data were coded, entered, and processed.  The questionnaire or survey instrument will be included along with a description or layout of how the data file is organized.  Some codebooks are available electronically, and you can read them on your computer screen, download them to your machine, or print them out. Others are not electronic and must be used in a library or archive, or, depending on copyright, photocopied if you want your own for personal use.

Codebook : Lisa Carley-Baxter (http://srmo.sagepub.com/view/encyclopedia-of-survey-research-methods/n69.xml)

Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file. Data files usually contain one line for each observation, such as a record or person (also called a "respondent"). Each column generally represents a single variable; however, one variable may span several columns. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean. Codebooks are used to document the values associated with the answer options for a given survey question. Each answer category is given a unique numeric value, and these unique numeric values are then used by researchers in their analysis of the ...

Codebook (Wikipedia.com)

A codebook is a type of document used for gathering and storing codes. Originally codebooks were often literally books, but today codebook is a byword for the complete record of a series of codes, regardless of physical format.

ICPSR

What is a codebook?

A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the datafiles.

While codebooks vary widely in quality and amount of information given, a typical codebook includes:

• Column locations and widths for each variable

• Definitions of different record types

• Response codes for each variable

• Codes used to indicate nonresponse and missing data

• Exact questions and skip patterns used in a survey

• Other indications of the content and characteristics of each variable

Additionally, codebooks may also contain:

• Frequencies of response

• Survey objectives

• Concept definitions

• A description of the survey design and methodology

• A copy of the survey questionnaire (if applicable)

• Information on data collection, data processing, and data quality
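
As a hedged illustration only, the typical contents listed above (column locations and widths, response codes, missing-data codes, question text) map onto the DDI Codebook XML format roughly as follows. The fragment sketches a single variable description; the element names follow the DDI Codebook (2.x) tag library, but the variable, values, and question wording are invented for the example.

```xml
<!-- Hypothetical variable entry in the style of DDI Codebook (2.x).
     Element names follow the 2.x tag library; all content is invented. -->
<var ID="V7" name="AGE" files="F1">
  <location StartPos="12" EndPos="13" width="2"/>  <!-- column location and width -->
  <labl>Age of respondent at last birthday</labl>
  <qstn><qstnLit>How old were you on your last birthday?</qstnLit></qstn>
  <valrng><range min="18" max="97"/></valrng>      <!-- valid response codes -->
  <catgry missing="Y"><catValu>98</catValu><labl>Don't know</labl></catgry>
  <catgry missing="Y"><catValu>99</catValu><labl>Refused</labl></catgry>
  <varFormat type="numeric"/>
</var>
```

A codebook in this format would carry one such entry per variable in the data file, giving a program enough structured detail to read the data without reference back to the producer.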

 

 September 29, 2014

Simple Codebook Meeting
September 29, 2014

Present: Jenny Linnerud, Steve McEachern, Barry Radler, Wendy Thomas, Mary Vardigan

Discussion

The group had a discussion of Simple Codebook as a compilation of different components, including Conceptual, Simple Data Description, Discovery, and Simple Instrument, as well as elements of data processing/provenance and Methodology. Steve showed a diagram he had created to show the big picture. He subsequently added boxes for Methodology and Discovery to that big picture – see DDI4_view_overview.pdf. This helps us visualize the structure of DDI 4 and also the codebook view. This diagram should be part of the upcoming Dagstuhl meeting orientation on the first day to ensure that everyone is united in their understanding of how DDI 4 works.

It was pointed out that there might be links to additional information (e.g., Methodology) in Simple Codebook but a more Complex Codebook could bring some of that information inline in a structured, machine-actionable way (e.g., routing/skip patterns through a questionnaire). The group also discussed that we need to distinguish questionnaire-centric codebooks from more generic codebooks that talk about measurement rather than variable, for example.

In Toronto the Methodology group got started but it needs more time to focus on this area. Initially, the group drew a line between design and implementation. It was pointed out that we need to separate what do you want to do from what are you doing and what have you done. We also need to think about replication.

In terms of data citation, we may need to cite at a very granular level. Should Discovery be part of all views in this sense?

We should think about a set of administrative metadata that accompanies each view and describes it so there is some consistency across views. This might indicate the order that compilations should take and whether they have a logical sequence. This would be a guide to the view.

For Dagstuhl, our group should put out a requirements document indicating what we need for the Simple Codebook view, including the administrative metadata. We also want to have guidelines for future groups doing composite views.

The Simple Codebook needs a view on the Drupal server, so we will contact Oliver about that.

Action Items:

Mary to make sure Steve's diagram is part of Dagstuhl orientation

Mary to spearhead proposal for Dagstuhl that includes requirements for Simple Codebook

Mary to contact Oliver regarding the Simple Codebook view on Drupal


 November 10, 2014

Simple Codebook Team Minutes
2014 11 10

Dan Gillman, Larry Hoyle, Jenny Linnerud, Steve McEachern, Wendy Thomas, Mary Vardigan, Wolfgang Zenk-Moeltgen

Mappings

There is a spreadsheet for the archival codebook use case that lists elements used by ICPSR, CESSDA, and IHSN. Wendy will map the IHSN codebook elements to DDI 3.2 and send this to the group.

Package vs. View

The group also looked at the Simple Codebook package on the Lion Drupal site. The question was raised of whether we should still work on the Lion site since the simple codebook is still a package and not a view.

All Discovery information came from the Disco specification. We should model our own objects for Simple Codebook and then map to Disco. There hasn’t yet been a discussion about what is a property and what is an object.

In terms of information elements needed for codebooks, there are more things in the package than in the spreadsheet because we copied the DDI 3.2 elements and didn’t delete anything with the idea that we would need the other elements later. We moved all the objects specified as Keep from DDI 3.2 into the package.

The content groups should create a complete list of the information elements needed and then the modelers will arrange this into packages and make decisions about objects vs. properties.

Should we add all elements from package into the view? Right now we should not put in anything we are not using. We are trying to start with the essential and then add onto it.

Things go into the view on Lion if we want them in the simple DDI 4 codebook. We are compiling the list of elements used by ICPSR, CESSDA, and IHSN for the basic set of elements. The package that Wolfgang created at Dagstuhl contains additional things not in the spreadsheet. Everything from the Excel list should go into the view.

Graphs are only created for packages and not for views. This is something we will miss if we work only in the view. It doesn’t make sense to work on the Lion site until basic processes have been defined. How do we capture the results of this group?

After the EDDI sprint the whole process should be working and documentation will be produced from a view and a diagram will be produced from the nightly build. Right now things in other groups’ work (e.g., instrument) are not linked in to our codebook set of elements.

Things that need to be fixed on Lion include:

a. Arrows on aggregation and composition are the wrong way round

b. On each object the current DDI 3.2 and GSIM fields should also be rendered in View mode (they are currently only visible in Edit mode)

c. No graph appears for View, only a flat list

Wendy will relay these points to the modelers and the Lion maintainers.

Composite View Modifications

If you take elements that have been defined elsewhere in your composite view, you get everything but you may not want everything. You should be able to make a simple codebook. Someone needs to remodel it so that we can take just the portion we want.

We want to confirm what we need in our use case through the Excel spreadsheet. We need to draw a line for our first proposal for our simple codebook. The more things we decide are properties of an object, the more remodeling we will have to do when people want only some properties. Does this argue for making more things objects rather than properties? This is a tough call as we want to decrease the number of elements.

Working with Data Description

We should take a look at the Data Description model that came out of Dagstuhl and use that as a test because the current discussion about Datum is going to change things. What came out of Dagstuhl has had a lot of review and is considered solid. On Lion, what’s there now is the representation of what was decided at Dagstuhl and is up to date. Steve can compile this in a straightforward way and generate a view which is all the objects that will be used by Data Description. All the relevant content is in Lion.

First we need to look at the Excel list and compile the data description level and then look at the View for Data Description to make sure this matches what we need.

Next Meeting

The next meeting will be on November 24.

Actions

 

  • Wendy will complete spreadsheet with information for IHSN

  • The group will pull out the needed elements at the data description level in the Excel sheet

  • Steve will create a view of Data Description (we can flatten this if needed)

  • The group can compare the use case elements to what is in Data Description

 November 23, 2014

Simple Codebook Meeting Minutes
November 23, 2014

 

Present: Dan Gillman, Steve McEachern, Mary Vardigan

Meeting Times

The current time is midnight for Canberra, so we need to find another meeting time. 2pm EST U.S. time is the preferred time for the new year.

DDI 3.2 vs. 4

We are thinking in terms of forward compatibility so that everything in 3.* is covered in 4. This is not the best approach. Rather, we should solve the problem we want to solve and then worry about how to map it after we have solved it.

Framing happens unconsciously -- the circumstances of how you think about a problem constrain the way you conceive of it.

Still it’s worth having a look at what we have right now to see what the overlap is.

By sticking with the nicely defined distinction between logical and physical we can be more precise going forward.

Most of what is in 3.* is actually covered in 4, but it is going to be reorganized.

Next Steps

Steve will compare the spreadsheet to Data Description in 4 to determine how they map and overlap.

 February 02 2015

Simple Codebook Meeting Minutes
February 2, 2015

 

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The Simple Codebook committee will now be chaired by Dan Gillman as Wolfgang is not able to chair currently.

This group has been in a holding pattern because we are waiting on the results of other groups. However, it was suggested that we look at the Codebook 2.5 (Codebook Version) in comparison to DDI 3.* (Lifecycle Version).

XML permits a detailed description of elements and this is part of the distinction between 2 and 3. But UML doesn't allow this and doesn't account for nesting and levels of detail. We should try to incorporate what is in Version 2 into the model as best we can. We as a group should try to build this. One additional possible advance is that we could then have a single model to account for both Codebook and Lifecycle. Both views would be under one spec in this approach.

Are referencing and reusability distinctions between the two versions that we should take into account? Should it be communicated to the modeling team that we may not need the complexity?

Users who want to describe their data should be able to write a description and fit it into a framework. If you want interoperability with other systems, that is a different issue.

For the standalone one-off research project, users will not need to be reusing variables and questions, but for longitudinal and research across languages and cultures, this is important; there is a need to harmonize across questionnaires, reuse metadata across time, etc. Maybe this is Complex Codebook?

We need a distinction between the user perspective and the technical perspective. Simple and complex need to be interoperable. It's necessary to reduce the complexity of what is modeled in the library by choosing the simple cases.

One of the decisions for DDI 4 is to make everything identifiable and drop the container aspect of identifiability. This takes away a lot of the complexity.

From a marketing perspective, we need to distinguish between the DDI Codebook version and the Simple Codebook view. Looking at what is in 2 now will be required and we need to lay out what we need to account for. In the study section for DDI Codebook, there were a number of elements that allowed you to provide a high level text description of various methodological things. Preserving that is important.

Capturing what is in an SPSS or SAS representation, including all the metadata you can put there, is also important. When you move data around, you don't want to lose anything. When you look at how researchers want to record information, it is often difficult for them to record things in detail. Guided structures as part of their workflow are important, and Codebook is one view that could help them with this. You need some structure that becomes machine-actionable. You don't want people to just write a narrative.

At BLS, there is a Handbook of Methods. It has narrative descriptions of the surveys BLS does and it doesn't have a lot of detail. This should be captured in DDI rather than in a PDF. There is a need for both high-level and detailed description. There may still be a need for some kind of a DDI Lite as a way of inducing reluctant data producers to get involved. For variables the detail is necessary. We should make this as flexible as we can.

We can start by looking at what is in 2.5 and figure out from the point of view of a list of what we need to account for. This would be a set of requirements that we as a group need to figure out how to solve. One question we want to address from a modeling point of view is, for example, when we need to say how the sample is constructed: Would those higher level descriptions go in a class of things that are independent of everything else or part of a sampling class? These are design issues that might have an impact on the way the more detailed model plays out.

If we can manage both 2 and 3 in the same structure we as a standards body will have an easier time with this. We should consult with Wendy on this.

Several archives still rely on DDI Codebook, Nesstar, etc. There is a set of codebook specs from different archives.

Are we talking about having our Simple Codebook view cover everything that is in 2.5? It should be even less. But should there be a view that covers everything in 2.5? One idea is a view that is a really simple codebook but allows for complexity in any direction you would like to go, so that we could incorporate everything that is in 2.5. Or go into more detail, as in 3, for whatever direction you want, so there is a seamless transition between high-level and detailed descriptions. This is basically what DDI 4 is. We should provide a lot of different options for how much detail the user wants. With 4 right now we have detailed descriptions of a lot of things, but we are not allowing for high-level descriptions. Description and definition were discussed in London with respect to Drupal, in the sense that there could be radio buttons to indicate that they should be used to standardize those objects. It could be possible to have a description without any use of detailed sub-elements.

There could be an attribute carrying a high-level description. Or we could have an element saying this is the Sample Description. Just having an element called description associated with identifiable objects may not be sufficient. The annotated identifiable has an annotation element with Dublin Core properties like Title, Contributor, etc. It has an abstract. But there is nothing that is a high-level description.

On the one hand it might be nice to have a Sampling Description, but it might be over-specified. It's important to have an element dedicated to a high-level description that you are offering in place of the detail or as a supplement to the detail. A general description like the annotation will lose semantic interoperability. We need machine-interpretability. We also want the possibility to reference just the high level description in the simple codebook.

We should be able to allow for user-defined views that provide for whatever level of detail an organization uses. A Simple Codebook view that maps back to 2.5 would be useful. It would allow those organizations just using 2 to feel comfortable using 4.

DDI 4 does not have the same hierarchy as DDI 3. We would still need an object carrying high-level content for the sampling process and nothing else. In 3 there was a parent node, but we don't have this structure in DDI 4, which means you need to create a container for this description. It's not a question of using description as a property containing the text, but of which element carries the description.

Between now and the next meeting, Oliver will make some slides with an example of what we have been talking about. We also need to dig into DDI 2.5 to get a handle on what is needed at the higher level. Dan and Larry will look at this. Dan will also consult with Wendy on this.

 

 

 February 16, 2015

Simple Codebook Meeting
February 16, 2015

Present: Dan Gillman, Larry Hoyle, Steve McEachern, Mary Vardigan

Completeness of cross walk between 2 and 3

The crosswalk or mapping is essentially one-way from 2 to 3. Codebook doesn't have the reusability that Lifecycle does. This is the same issue as between SPSS and Stata/SAS. We should look at the mapping in more detail.

Content and functionality of Simple Codebook

We want to make sure that Simple Codebook lets us write or ingest 2.x fairly seamlessly. Are the same kinds of element names available in 3? The names change even at the highest level.

Many miss the Tag Library as it was so simple. This kind of resource would be useful along with a mapping. However, Wendy advises that we don't have to worry about 2 since the mapping is there.

Even 2 has a lot of content. Are we still talking about a simple codebook as opposed to a complex codebook? Simple should allow you to take information from a major statistical package and move it to another without losing any information (this is our definition of simple). Questions should be included, as should sampling and universe. We should review DDI Lite and DDI Core, which have not been updated to the most recent versions of Codebook and Lifecycle. This may give us a framework for content. We will deal with functionality later.

We have been making the assumption that we have the Instrument information and the Data Description information from those two views. What else do we need? We need context information or study level – Universe, sampling, design, bibliographic information. In DDI 2.* we have Citation, Study information (which is discovery related), Methodology, and Access. This is good content.
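The DDI 2.* sections named above can be pictured with a minimal sketch. This is illustrative only, not a schema-valid instance; the element names (citation, stdyInfo, method, dataAccs) follow DDI 2.5, but the skeleton itself is an assumption for demonstration:

```python
# Minimal DDI-Codebook-style (2.*) study description skeleton,
# for illustration only -- not a schema-valid DDI instance.
import xml.etree.ElementTree as ET

codebook = ET.Element("codeBook")
stdy = ET.SubElement(codebook, "stdyDscr")
ET.SubElement(stdy, "citation")  # bibliographic information
ET.SubElement(stdy, "stdyInfo")  # study information (discovery related)
ET.SubElement(stdy, "method")    # methodology
ET.SubElement(stdy, "dataAccs")  # access conditions

print(ET.tostring(codebook, encoding="unicode"))
```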

What do you need to know to use the data? You need the variable information. Question order and the way questions are asked may be important.

There is a tension between being very simple and following best practice for good documentation. Can we add pointers to relevant information? The simple/complex distinction is levels of detail.

For secondary users, we need enough information for a researcher to be able to understand and evaluate the quality of a dataset without reference back to the original data producer. We also need enough information to pull the data into a statistical package.

We started an exercise to take the common set of CESSDA, ICPSR, and IHSN mandatory schemas, and figure out what is the superset. The spreadsheet can be found in the attachments on the page: Simple Codebook View Team. We should compare this set of elements to what is in DDI Lite and DDI Core.

Necessary for a simple codebook: variables, questions, and layout; universe or population; level of geography (basically coverage, including temporal and subject); and sampling or weights (with a pointer to a thorough description of sampling).

The distinction between simple and complex for data description is between a simple rectangular file and other data types; this applies to codebook in some ways as well. But there may be a cascading effect if we limit ourselves to simple rectangular files (we should describe hierarchical files as well, like CPS). You can have hierarchical data in CSV with a record type field, but historically we have had files with esoteric physical representations of the data. How much of this do we need to handle? For a simple codebook, the simple representation should be limited to Unicode or something like that.

Homework: review DDI Core: http://www.ddialliance.org/sites/default/files/ddi3/DDI3_CR3_Core.xml and DDI Lite: http://www.ddialliance.org/sites/default/files/ddi-lite.html

And think about what limitations we want to put on format to keep the idea of simple codebook but to keep it rich enough so we are covering enough situations.

The next meeting will be in two weeks on March 2.

 March 2, 2015

Simple Codebook Meeting
March 2, 2015

Present: Michelle Edwards, Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group welcomed Michelle Edwards of CISER. The chair noted that this group is in a sense waiting for other groups (Discovery, Data Description, Instrument) to complete what they are doing so that we can finish our work. We recognize a need to incorporate both Codebook and Lifecycle into one spec (DDI 4), so we have been exploring that in our group a bit.

DDI Lite was reviewed and compared with the element sets that ICPSR, GESIS, and IHSN use and they are a fairly good match.

We won't be able to exactly duplicate Codebook and Lifecycle as views of DDI 4 but we can get close. Organizations that have invested in 3.2 do not want to lose that investment. Can we map 3.2 to 4 by automatically importing what's in 3.2? We may need a conversation with Guillaume about this. This should probably be at the Advisory Group level.

DDI Codebook and Lifecycle have different names for the same element. We will need mappings for people.

What we write out is also important. Interoperability can be defined in terms of reading and writing out of a system. If we can read 2.5 into 4, we are able to ingest anything that occurs anywhere under 2.5. We want to be able to write an instance that contains all the semantic content of Codebook. If we know that there is an equivalence we should have a 2.5 writer to write it out in that name. It is the structure and the mappings that matter.
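The equivalence-and-writer idea can be sketched as follows. The element-name pairs here are purely hypothetical placeholders, not an agreed Codebook-to-DDI4 mapping:

```python
# Sketch of the name-equivalence idea: where a mapping is known, a
# "2.5 writer" emits DDI4 content under its equivalent Codebook name.
# The pairs below are hypothetical, not an agreed mapping.
CODEBOOK_TO_DDI4 = {
    "universe": "Universe",            # hypothetical pairing
    "sampProc": "SamplingProcedure",   # hypothetical pairing
}
DDI4_TO_CODEBOOK = {v: k for k, v in CODEBOOK_TO_DDI4.items()}

def write_as_codebook(ddi4_name: str, text: str) -> str:
    """Emit a value held under a DDI4 name as a Codebook element."""
    tag = DDI4_TO_CODEBOOK[ddi4_name]
    return f"<{tag}>{text}</{tag}>"

print(write_as_codebook("Universe", "Adults aged 18 and over"))
# <universe>Adults aged 18 and over</universe>
```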

There were changes between Codebook and Lifecycle that were not necessarily clean because of the use of things by reference in 3 (categories and codes). Upward compatibility may be tougher than downward compatibility. We should probably not worry about 3 here but concern ourselves with mapping 2.5 into 4.

Is Codebook still an aggregation of Discovery, Description, and Instrument? Right now Discovery is a stripped down element set.

We could take 2.5 as a starting point, and we need to be able to account for it. Then we could look at 4 and ask whether everything is covered. Can we restrict this to 2.5 Lite? Generally, yes.

A Codebook view would be intended for an audience that is creating or managing codebooks and it doesn't matter what things are in other views or packages.

Views can overlap as much as you want. DDI Lite is a view. DDI 2.5 is a view. We are leveraging the experience of repositories (ICPSR, GESIS, IHSN) in serving up data, so that makes a good codebook. It makes sense to rely on DDI Lite, which we know is used.

The group reviewed the elements in DDI Lite. ADA uses a few other elements like deposit date, alternative title, collection situation, etc. ADA uses the default Nesstar template, which is close to DDI Lite. We should look at Nesstar also. The CESSDA Profile would be the best thing to use. We need to identify where things are already defined in 4 and where things still need to be defined in 4. We need to know what is missing from 4 in order to have a sense of where we stand. Our group could then go to the AG to say what needs to be addressed in sprints.

If we have something in 4 that maps to Nesstar/CESSDA profile, that allows a big chunk of DDI users to adopt 4. There is another migration path we can look at: we have 2.5 codebook - is there a more modern one? Migrate 2.5 to something different? This may be out of scope for our group but we should discuss it.

 

 

 March 16, 2015

Simple Codebook Meeting

March 16, 2015

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The agenda for the meeting was to determine if all elements in the CESSDA profile/Nesstar profile are present in DDI 4. Larry Hoyle had created a spreadsheet of DDI Lite and the list of elements from the CESSDA profiles. There seems to be wide variety in the selection of elements and attributes across the repositories using DDI Lite; the Nesstar Webview serves as the base. The group compared elements used across different repositories.

The task was to find out which elements are in DDI4, so the group decided to divide up the list of 200+ elements. There appear not to be any DDI4 elements about the metadata itself, the DDI document; this basically parallels the study description information. This may not be relevant for DDI4. Perhaps the Data Citation group should think about this. This documentation is often the archive's intellectual property, so some representation of it will be of interest to most archives. Citing the user guide or documentation is a common practice.

DDI Codebook has some elements of description that DDI4 has not been talking about. We need to bring forth something to the Advisory Group about this – this is an issue that we need to discuss. In DDI Lifecycle there is the corresponding instance with a citation on it. There is no DDI4 instance because instance is a root element for documents in general.

Will the idea of a document description disappear in 4? The archive creates a document describing the data. The landing page is sometimes (always?) metadata.

Study level, variable level, record level, file level: should the Data Citation group look at what are targets of citation?

In DDI Codebook, we have DocumentDescription; in DDI Lifecycle we have DDIInstance. Should DDIInstance be brought back into DDI4? – with revised content but allowing attachment of annotation.

Being able to point to an XML file with the model and generate that file from elements in 4 is adequate. But it is no longer enough to point to one object that contains everything.

We have the logical vs. physical distinction. A DDIInstance is a physical thing – something that's there. Pulling together the information into that representation is an activity with Authors, etc. The "same" content can exist in two archives – with different contact people and different URIs for each. This is parallel to data description.

Assignments for the next meeting

Where in DDI4 do each of these elements exist?

FirstLine   LastLine   N    Who        Content
70          101        31   Dan        Citation
102         131        29   Steve      Scope Methodology
132         155        23   Oliver     Access Conditions
156         184        28   Larry      File Variable
185         205        20   Mary       VarDoc
206         232        26   Michelle   CategoryGroups OtherMaterial

 

 

 April 13, 2015

Simple Codebook Meeting
2015-04-13

The meeting focussed on reviewing the next set of metadata elements from DDI-C - those covered by Steve.

Steve had created three additional columns in his copy of the spreadsheet for his work, adding:

  1. Package (for elements already matched in DDI4)
  2. Suggested Package (for elements that have no match)
  3. the DDI-C definition.

These additional columns have now been added to the Google Spreadsheet - linked here.

The discussion then focussed on the elements. Notes on specific elements are included in the spreadsheet, and summarised below.

Elements

Source

  • For example, for a digitized statistical abstract, the source is the original print publication; for administrative data, the original administrative program. A simple version of provenance.

Geographic unit

  • “Lowest level of geographic aggregation covered by the data.”
  • Would GeographicLevels (plural) be better, to indicate that multiple levels can be used? Is GeographicLevel a better term than GeographicUnit?

Control operations

  • Description of what was done during the data collection process.

General comments and issues

It was noted that much of the methodology section of DDI-C was not yet covered in DDI4. Part of this will be addressed by the Methodology working group.

There is however a set of elements that are not really methodology (or at least the research design), but rather are descriptive of the process and outcome of the execution of the methodology. These elements might most appropriately fall under the heading of "Fieldwork". Examples from DDI-C include:

  • CollectionSituation
  • MinimizeLossActions
  • ControlOperations

and, notably, RESPONSE RATE.

The group was concerned that it is unclear how we might provide recommendations here. For example, for ResponseRate, what is meant – the opposite of the refusal rate? Other types? It was also recognised that this is not really part of methodology, but it has an impact on methodology, as well as on analysis and post-processing. For example, was there an intervention based on a low response rate? These are fieldwork issues.

On similar lines, there was a recognition that Methodology is the ugly part of DDI Codebook. Dan suggested that this section may be in need of a significant revamp, given the developments in survey methodology that have occurred since the original development of DDI-C, in particular the Total Survey Error framework.

It was noted that these issues with Methodology and Fieldwork need to be raised with the AG sooner rather than later, as they have resource and workload implications for the Moving Forward program. Steve will write something up on this and distribute to the group, prior to sending to the AG.

 

 

 April 27, 2015

Simple Codebook Meeting
April 27, 2015

Present: Michelle Edwards, Dan Gillman, Oliver Hopt, Steve McEachern, Mary Vardigan

The group went back to the mapping between DDI Codebook and what is in DDI4. In terms of Access Conditions, there is an Access module in Discovery, where it is streamlined. It looks as if availability and use statements are not included; everything is structured string. We might look at SAML or another controlled vocabulary for access control like XACML (Extensible Access Control Markup Language). The issue is whether the outside source maintains previous versions, which we don't have control over.

In terms of Other Material, this was all found in DDI4 except for the Other Material table. This was part of DDI Codebook to mark up a table for presentation. In terms of VarDoc version, none of that was in DDI4. In DDI4 versioning is done at a low level, so this is taken care of at a level of the model that is not about particular content but about everything – Identifiable and Annotated Identifiable. There is an ID and a version. The question is that in Codebook the description is applied against Variable; in DDI4 identification applies broadly.

The group traced identification through the DDI4 model and looked at Collections and Members. Version Type in DDI Codebook does not seem to be covered, but no one is using this. Type seems no longer relevant and related to documents rather than to elements. People who understand this element from the old way of thinking have to know that the idea of a version is being expanded. We need to table this for now but are leaning toward deprecating this element.
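The shift described above – versioning carried broadly via Identifiable rather than applied only against Variable – can be sketched roughly like this. Class and field names are illustrative, not taken from the DDI4 model:

```python
# Rough sketch: identification and versioning live on a common base
# class, so any object -- not just Variable -- gets an ID and a version.
# Names are illustrative, not taken from the DDI4 model.
from dataclasses import dataclass

@dataclass
class Identifiable:
    agency: str
    object_id: str
    version: int

@dataclass
class Variable(Identifiable):
    name: str

v = Variable(agency="example.org", object_id="var-0001", version=2, name="age")
print(v.version)  # 2
```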

Coding Instructions probably maps to Fieldwork and Methodology, which we don't have yet in DDI4.

 

 

 May 11, 2015

Simple Codebook Meeting
May 11, 2015

Present: Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group continued its review of the mapping between DDI Codebook and DDI 4 – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646.

The group returned to the elements regarding availability and access. There is currently no archive information in DDI4 and this needs to be modeled, perhaps at the upcoming sprint. In terms of the use statement, some is not covered in the access object in Discovery in DDI4. This needs to be modeled also. SAML isn't useful for us because it is too high level. Both data and metadata may need something attached. We might look at this in the Datum discussion (not only columns but rows) and also attaching things to the metadata to control access. This might be like annotations where it can be attached to anything – access could have a relationship to annotated identifiable. Then any object could have an access control. From access description to object could be another solution. This could make sense because an object could have different access policies when stored in different archives. This should be discussed at the sprint also. There is an Access Control XML language that we looked at but didn't decide on. Michelle will be representing CISER at the sprint and can express their needs in this area.

In terms of Imputation, it is now the same as it has been in 3. Generation Instructions and General Instructions seem to have the same text; we need some clarification from Wendy on this. They can describe an imputation procedure. This has not been brought up in 4 yet; it would be methodology or fieldwork. It is in the Processing package now. Clarification is needed at the sprint.

Security in variable relates to the discussion above. 3.2 doesn't do much at the row level but this is becoming a requirement.

Embargo is in Simple Codebook, but this is basically a set of placeholders right now. This should be part of the Access Rights discussion at the sprint so we do this consistently. Where should this come from – a use case, or the modeling team proposing an approach? We probably need both directions. Maybe two use cases – one from Bill for metadata and one from Ørnulf for data.

Response Unit not yet modeled and will come up in complex instrument. This can be at the study and variable level. An equivalent should be covered in methodology.

For question elements, there is a container in Data Capture that will work for this and allow you to instantiate pre-, post-, and literal question as well as interviewer instructions. Statement is the container.

In terms of invalid range, this is in Simple Codebook. How are we tying this to missing? In 3.2, and in Simple Codebook in 4, you can point to a managed missing values representation, and in that you can do ranges – you can say that everything from this value to that value is a missing value. This is there by virtue of having been brought over from 3.2. The ISO 11404 notion of sentinel value (each instance variable has a set of such values but might point to the same represented variable) has been modeled to allow the valid set of data to be handled in different statistical packages. You have to represent the semantics in different ways. The Data Description group should handle this.
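The sentinel-value idea – each instance variable carrying its own set of missing-value ranges while pointing at a shared represented variable – can be sketched as follows. Class and field names are illustrative, not the DDI4 model:

```python
# Sketch of the sentinel-value idea: each instance variable carries its
# own sentinel ("missing") value ranges, while pointing at a shared
# represented variable. Names are illustrative, not the DDI4 model.
from dataclasses import dataclass, field

@dataclass
class SentinelRange:
    low: int
    high: int

    def contains(self, value: int) -> bool:
        return self.low <= value <= self.high

@dataclass
class InstanceVariable:
    represented_variable: str              # reference to shared definition
    sentinels: list = field(default_factory=list)

    def is_missing(self, value: int) -> bool:
        return any(r.contains(value) for r in self.sentinels)

age = InstanceVariable("age", [SentinelRange(97, 99)])  # e.g. 97-99 = refused/DK
print(age.is_missing(98))  # True
print(age.is_missing(45))  # False
```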

Undocumented Codes – they should have had a label but didn't get documented. Codebook is the obvious group to handle this.

Total Responses is another part of the documentation for variable and should be handled by Codebook. This is handled with a controlled vocabulary when you say what type of statistic it is.

Summary Statistics is in Complex Data Type. They are not in the Simple Codebook view now but that hasn't been built out yet and we would need to include them in the view.

In terms of Descriptive Text, all the variables in 4 inherit Description as members.

 June 8, 2015

Simple Codebook
June 8, 2015

Present: Dan Gillman, Larry Hoyle, Jenny Linnerud, Oliver Hopt, Mary Vardigan

 

