Minutes from Dagstuhl Sprint 2014 Working Group

 Simple Codebook View Team

Dagstuhl Sprint Oct 2014

 

 

 

Meeting: 2014-06-30

 

Attending: Guillaume Duffes, Dan Gillman, Larry Hoyle, Ørnulf Risnes, Steve McEachern, Wendy Thomas

 

Reviewed list of related package and view content from Wolfgang

 

Decisions:

 

There is currently a lot of duplication in the list and it needs to be normalized prior to review.

 

Steve will normalize the list and send it out to members later this week with the following instructions:

 

Review the list and do the following:

 
  1. Add any unlisted objects that you would expect to find in a basic or simple codebook

  2. For each item indicate if the item is one which would be required in order to publish the codebook or is one that would be useful to have in the codebook

  3. Return your review to the group.

 

 

 

Unless other agenda items arise, schedule the next meeting after the deadline for returning reviews.

 

Process:

 
  • Items that have agreement in terms of "required" will go into a basic view

  • Items that have agreement in terms of "would like to see" will go into an "intermediate" view

  • Items without agreement will be discussed and assigned during the next meeting

 

This may result in the creation of two "simple codebook" views and appropriate names should be determined.
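The sorting rule described under Process can be sketched in Python. This is only an illustration, not the group's tooling: the function names, the vote labels "required" and "useful", the reading of "agreement" as unanimity among reviewers, and the sample object names are all assumptions made for the example.

```python
from typing import Dict, List


def assign_view(votes: List[str]) -> str:
    """Place one object based on reviewer votes ("required" or "useful").

    Assumption: "agreement" means every reviewer gave the same answer.
    """
    distinct = set(votes)
    if distinct == {"required"}:
        return "basic"
    if distinct == {"useful"}:
        return "intermediate"
    # Mixed or missing votes go to the next meeting for discussion.
    return "discuss"


def sort_objects(reviews: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group every reviewed object under basic, intermediate, or discuss."""
    result: Dict[str, List[str]] = {"basic": [], "intermediate": [], "discuss": []}
    for name, votes in reviews.items():
        result[assign_view(votes)].append(name)
    return result


# Hypothetical review results from three members.
reviews = {
    "Variable": ["required", "required", "required"],
    "Summary statistics": ["useful", "useful", "useful"],
    "Embargo": ["required", "useful", "useful"],
}
views = sort_objects(reviews)
```

With these made-up votes, only the object without a unanimous answer would be carried over to the next meeting.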

 

Discussion:

 

Given the range of use cases (from something just above a simple data set to a simple study housed in an archive), it is difficult to determine what is meant by "simple". Rather than discuss this in the abstract, it may be helpful to get from the members of the group a list of objects they would like to see in a simple codebook, and then identify those objects that are considered the minimum requirement for publication. This may result in two levels for a simple codebook (basic and intermediate), but the approach would provide clear information on where there is consensus and where there is debate.

 

Statements that may help define the differences between these two levels:

 
  • The bare minimum needed in order to publish (basic)

  • What would you like to see in this view (intermediate)?

 

There has been a shift from the initial content creation in Drupal of a simple codebook "package" to the idea of a "view" and we need to reorient the Drupal content to this shift. In addition, packages and views relating to the simple codebook view that were not in existence when the work of this group was started are now more fully defined. The content of these packages and views needs to be considered when defining the view(s) of a simple codebook.

 

View orientation is liberating

 
  • A view contains objects (it is not a compilation of views)

  • A view (specific version) may partially or fully support another view - the intent to do this should be noted in the description of the new view

 

The following process could be useful in defining the view(s) for a simple codebook:

 

Creating the list of objects for a simple codebook:

 
  • Start with the normalized version of Wolfgang's list as an example

  • What would you add?

  • What would you like?

  • What is required vs. what is optional (simple to intermediate)?

 

Create a Simple Codebook view in Drupal using the final agreed-upon list of objects for the view

 

Note: Some of the objects being included are complex objects. These should then be reviewed to see if a simpler basic object of that type is needed. (I.e. we may only want to include a "stripped down" version in the view)

 

Steve will have a go at normalizing the list and send it out to the group

 

Wolfgang can then enforce getting responses.

 

Meeting in two weeks:

 
  • this week if possible for list out

  • wish list turnaround

  • may want to delay next meeting until after due date for getting lists back from members

 

 


 

 


 

Simple Codebook Meeting
September 15, 2014

 

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Jenny Linnerud, Steve McEachern, Ørnulf Risnes, Wendy Thomas, Mary Vardigan

Discussion

The group affirmed Wendy’s definition of a codebook (See Appendix A for the full document):

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to support the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

The group discussed overlap with other groups and packages since codebook is a compilation of other packages. Simple Codebook is most likely a compilation of Conceptual, Simple Data Description, Discovery, and additional information that facilitates interpretation of the data and intelligent use. The difficulty is determining what depth of information is appropriate. For replication purposes, you need a lot of detail.

The Simple Data Description group is first focusing on data description in a broad way and will then define a subset for “simple.” Perhaps this group should do the same.

It would be helpful to have reports from other groups so that we know where they are and what makes sense to combine for simple codebook.

In Wendy’s list (Appendix A), much of the content we need is covered by other groups, but we could use more detail in Data Source, Data Processing, and Methodology. Methodology framed its scope broadly in Toronto but hasn’t yet met as a group. One activity for that group would be to review the sampling and weighting specifications that came out of the Survey Design and Implementation working group to see what is needed beyond that work.

Next Meeting

The group will meet again on Monday, September 29, to get reports from other groups.

Appendix A

What is a codebook?

[also referred to by DataONE as science metadata for science data]

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to support the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

Discussion

The definitions below for "codebook" are survey-centric when referring to the broader set of metadata related to a data file. Another term may be preferable but there isn't one that leaps to mind. Whether called a codebook, science metadata, metadata, or something else, data files have 2 levels of description:

·         A structured physical description that supports the ability of the programmer to access the data accurately

·         Supporting information that allows the researcher to evaluate “fitness of use” of the data to a particular research question, the overall quality of the data, and the specifics of the conceptual (objects, universe/population, conceptual definitions, spatial and temporal) coverage. This information may be applicable to the study as a whole or to the individual variable. This also includes information on why and how the data were captured, processed, and preserved.
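The two levels of description can be sketched as a small Python data structure. This is a sketch under stated assumptions, not a DDI model: the class names `VariableEntry` and `Codebook`, the fixed-width layout, the two data types, and the sample variables and codes are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class VariableEntry:
    """Structured physical description of one variable (hypothetical layout)."""
    name: str
    start: int       # 1-based start column in a fixed-width record
    width: int       # number of columns occupied
    data_type: str   # "int" or "str" in this sketch
    labels: Dict[str, str] = field(default_factory=dict)   # valid codes and meanings
    missing: Dict[str, str] = field(default_factory=dict)  # invalid (missing) codes


@dataclass
class Codebook:
    variables: List[VariableEntry]
    # Supporting information, held in-line or as a reference to external content.
    supporting: Dict[str, str] = field(default_factory=dict)

    def read_record(self, line: str) -> Dict[str, Optional[Union[int, str]]]:
        """Use the physical description to pull typed values out of one record."""
        record: Dict[str, Optional[Union[int, str]]] = {}
        for var in self.variables:
            raw = line[var.start - 1 : var.start - 1 + var.width].strip()
            if raw in var.missing:
                record[var.name] = None  # documented missing code
            elif var.data_type == "int":
                record[var.name] = int(raw)
            else:
                record[var.name] = raw
        return record


# Hypothetical two-variable file: AGE in columns 1-3, SEX in column 4.
codebook = Codebook(
    variables=[
        VariableEntry("AGE", start=1, width=3, data_type="int",
                      missing={"999": "Not stated"}),
        VariableEntry("SEX", start=4, width=1, data_type="str",
                      labels={"1": "Male", "2": "Female"},
                      missing={"9": "Refused"}),
    ],
    supporting={"methodology": "described in an external document (by reference)"},
)

row = codebook.read_record("0422")
missing_row = codebook.read_record("9999")
```

The `variables` list stands in for the first level (what a programmer needs to read the data accurately). The `supporting` dictionary stands in for the second level, which a researcher would read to judge fitness for use.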

 

Type of information: Data structure
·         Record type
·         Record layout
·         Record relationship
·         Data type
·         Valid values
·         Invalid values
Basic Codebook: Structured metadata to support access
Survey: Structured metadata to support access
Fauna (Wildlife): Structured metadata to support access

Type of information: Data source
·         Why was data collected
·         How was data collected
·         Who collected the data
·         The universe or population and how it was identified and selected
Basic Codebook: Descriptive to support assessment of quality and fitness-for-use
Survey: Purpose of the survey; Survey content and flow (may or may not need to be actionable); identification and sampling of survey population (may or may not need to be actionable for replication purposes)
Fauna (Wildlife): Purpose of study, how data was collected (may need to be actionable to support replication and/or calibration); identification and sampling of survey population (may or may not need to be actionable for replication purposes)

Type of information: Data processing
·         Data capture process
·         Validation
·         Quality control
·         Normalizing, coding, derivations
·         Protection (confidentiality, suppression, interpolation, embargo, etc.)
Basic Codebook: Informational material; support provenance
Survey: May need structured metadata for purposes of replication; Include processes, background information, proposed, actual, and implications for data
Fauna (Wildlife): May need structured to support mechanical capture instruments, calibrations, situational variants, etc.

Type of information: Discovery information
·         Who
·         What
·         When
·         Why
·         Coverage
o   Topical
o   Temporal
o   Spatial
Basic Codebook: Structured metadata to support discovery and access to the data as a whole
Survey: Structured metadata to support discovery and access to the data as a whole
Fauna (Wildlife): Structured metadata to support discovery and access to the data as a whole

Type of information: Conceptual basis
·         Object
·         Concept
Basic Codebook: Informational material
Survey: Structured to support analysis of change over time and relationship between studies. May just be descriptive / informational.
Fauna (Wildlife): Structured to support genre level comparison (heavy use of common taxonomies, etc.)

Type of information: Methodologies employed
Basic Codebook: Informational material
Survey: Structured to support replication and comparison between studies
Fauna (Wildlife): Structured to support replication and comparison between studies

Type of information: Related materials of relevance to data
Basic Codebook: Informational material

Definitions

Data Dictionary

·         A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."[1] The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

·         A document describing a database or collection of databases

·         An integral component of a DBMS that is required to determine its structure

·         A piece of middleware that extends or supplants the native data dictionary of a DBMS

·         Database about a database. A data dictionary defines the structure of the database itself (not that of the data held in the database) and is used in control and maintenance of large databases. Among other items of information, it records (1) what data is stored, (2) name, description, and characteristics of each data element, (3) types of relationships between data elements, (4) access rights and frequency of access. Also called system dictionary when used in the context of a system design. Read more: http://www.businessdictionary.com/definition/data-dictionary.html#ixzz3Am5wCgZI

·         A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. (Posted by Margaret Rouse  @ WhatIs.com)

Codebook

What is a codebook? (http://www.sscnet.ucla.edu/issr/da/tutor/tutcode.htm)

A codebook describes and documents the questions asked or items collected in a survey. Codebooks and study documentation will provide you with crucial details to help you decide whether or not a particular data collection will be useful in your research. The codebook will describe the subject of the survey or data collection, the sample and how it was constructed, and how the data were coded, entered, and processed.  The questionnaire or survey instrument will be included along with a description or layout of how the data file is organized.  Some codebooks are available electronically, and you can read them on your computer screen, download them to your machine, or print them out. Others are not electronic and must be used in a library or archive, or, depending on copyright, photocopied if you want your own for personal use.

Codebook : Lisa Carley-Baxter (http://srmo.sagepub.com/view/encyclopedia-of-survey-research-methods/n69.xml)

Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file. Data files usually contain one line for each observation, such as a record or person (also called a "respondent"). Each column generally represents a single variable; however, one variable may span several columns. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean. Codebooks are used to document the values associated with the answer options for a given survey question. Each answer category is given a unique numeric value, and these unique numeric values are then used by researchers in their analysis of the ...

Codebook (Wikipedia.com)

A codebook is a type of document used for gathering and storing codes. Originally codebooks were often literally books, but today codebook is a byword for the complete record of a series of codes, regardless of physical format.

ICPSR

What is a codebook?

A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the datafiles.

While codebooks vary widely in quality and amount of information given, a typical codebook includes:

• Column locations and widths for each variable

• Definitions of different record types

• Response codes for each variable

• Codes used to indicate nonresponse and missing data

• Exact questions and skip patterns used in a survey

• Other indications of the content and characteristics of each variable

Additionally, codebooks may also contain:

• Frequencies of response

• Survey objectives

• Concept definitions

• A description of the survey design and methodology

• A copy of the survey questionnaire (if applicable)

• Information on data collection, data processing, and data quality

 

...


Simple Codebook Meeting
May 11, 2015

Present: Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group continued its review of the mapping between DDI Codebook and DDI 4 – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646.

The group returned to the elements regarding availability and access. There is currently no archive information in DDI4 and this needs to be modeled, perhaps at the upcoming sprint. In terms of the use statement, some of it is not covered by the access object in Discovery in DDI4. This needs to be modeled also. SAML isn't useful for us because it is too high level. Both data and metadata may need something attached. We might look at this in the Datum discussion (not only columns but rows) and also attaching things to the metadata to control access. This might be like annotations, which can be attached to anything: access could have a relationship to annotated identifiable. Then any object could have an access control. A relationship running from the access description to the object could be another solution. This could make sense because an object could have different access policies when stored in different archives. This should be discussed at the sprint also. There is an Access Control XML language that we looked at but didn't decide on. Michelle will be representing CISER at the sprint and can express their needs in this area.
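The idea that an access description could point at any object, with different policies in different archives, can be sketched in Python. This is an illustration only and not a DDI4 model: the function names, object identifiers, archive names, and policy texts are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Assumption for this sketch: a policy is keyed by the identifier of any
# identifiable object (data or metadata) together with the holding archive.
AccessIndex = Dict[Tuple[str, str], str]


def attach_policy(index: AccessIndex, object_id: str, archive: str, policy: str) -> None:
    """Attach an access policy to one object as held by one archive."""
    index[(object_id, archive)] = policy


def policy_for(index: AccessIndex, object_id: str, archive: str) -> Optional[str]:
    """Look up the policy for an object in an archive, or None if none is attached."""
    return index.get((object_id, archive))


index: AccessIndex = {}
# The same variable carries different policies in two archives.
attach_policy(index, "variable:income", "Archive A", "registered users only")
attach_policy(index, "variable:income", "Archive B", "embargoed")
# A policy can also sit at the row (datum) level.
attach_policy(index, "datum:row-17", "Archive A", "restricted")
```

Keeping the policy outside the object, as in this sketch, lets the object stay unchanged when it is deposited in a second archive with different rules.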

In terms of Imputation, it is now the same as it has been in 3. Generation Instructions and General Instructions seem to have the same text. We need some clarification from Wendy on this. They can describe an Imputation procedure. This has not yet been brought up in 4. This would be methodology or fieldwork. It is in the Processing package now. Need clarification at the sprint.

Security in variable relates to the discussion above. 3.2 doesn't do much at the row level but this is becoming a requirement.

Embargo is in Simple Codebook, but this is basically a set of placeholders right now. This should be part of the Access Rights discussion at the sprint so we do this consistently. Where should this come from? A use case or the modeling team proposing an approach. We probably need both directions. Maybe two use cases – one from Bill for metadata and one from Ørnulf for data.

Response Unit is not yet modeled and will come up in complex instrument. This can be at the study and variable level. An equivalent should be covered in methodology.

For question elements, there is a container in Data Capture that will work for this and allow you to instantiate pre-, post-, and literal question as well as interviewer instructions. Statement is the container.

In terms of invalid range, this is in Simple Codebook. How are we tying this to missing? In 3.2 and in Simple Codebook in 4 you can point to a managed missing values representation and in that you can do ranges. For example, you can state that everything from one value to another is a missing value. This is there by virtue of having been brought over from 3.2. The ISO 11404 notion of sentinel value (each instance variable has a set of such values but it might point to the same represented variable) has been modeled to allow for the valid set of data to be handled in different statistical packages. You have to represent the semantics in different ways. The Data Description group should handle this.
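A shared missing values representation with ranges can be sketched in Python. This is an illustration, not the DDI 3.2 or DDI 4 structure: the class names `MissingRange` and `ManagedMissingValues`, the inclusive range bounds, and the sample codes are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MissingRange:
    """Every value from low to high (inclusive, by assumption) is missing."""
    low: float
    high: float
    meaning: str

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


@dataclass
class ManagedMissingValues:
    """One reusable set of missing ranges that several variables can point to."""
    ranges: List[MissingRange]

    def classify(self, value: float) -> str:
        for rng in self.ranges:
            if rng.contains(value):
                return rng.meaning
        return "valid"


# Hypothetical codes: -9 to -7 and 98 to 99 are treated as missing.
shared = ManagedMissingValues([
    MissingRange(-9, -7, "not applicable"),
    MissingRange(98, 99, "refused"),
])

# Two variables pointing to the same managed representation.
missing_by_variable = {"INCOME": shared, "HOURS": shared}
```

Because both variables point to the same object, a change to the shared ranges would apply to every variable that references them.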

Undocumented Codes – they should have had a label but didn't get documented. Codebook is the obvious group to handle this.

Total Responses is another part of the documentation for variable and should be handled by Codebook. This is handled with a controlled vocabulary when you say what type of statistic it is.

Summary Statistics is in Complex Data Type. They are not in the Simple Codebook view now but that hasn't been built out yet and we would need to include them in the view.

In terms of Descriptive Text, all the variables in 4 inherit Description as members.