Simple Codebook View Team

 

Minutes May 10, 2016

Codebook meeting 2016-05-10

Attending: Dan Gillman, Gillian Kerr, Oliver Hopt, Larry Hoyle

We reviewed the spreadsheet https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=1652443366, sheet NewStartingPointCdbk_4, which now has XPaths to DDI 2.5 elements (column F) and descriptions of the corresponding DDI4 classes (column E), as well as descriptions of needed DDI4 classes.

We discussed the creation of a view.

We need two new classes: one for the whole activity producing data and one to describe each wave or phase.

Issues are associated with the top level (e.g., design), but then there are specifics at each repeated instance producing different data. The general and the specific shouldn’t be duplicative for one-time activities. An example of top-level information would be the purpose for the whole set of activities; another would be the funding source for the whole, or the authorizing legislation.

What terms could we use? Activity? Data capture activity?

We need an anchor class and a specific class: an “anchor class” and a “concrete anchor instance class.”

 

In statistical agencies activities are ongoing: designs change, but the overall activity is known by a name and has a funding source (e.g., the CPS or the American Community Survey). The specific might be a monthly collection, e.g., the monthly CPS used as input to the calculation of the unemployment rate.

Another example would be the Christmas Bird Count, which has annual data collections but can also be considered an overall series.

Decision:

“StudySeries” as the overall

“Study” for the specific; the user community is familiar with this term, even if developers don’t like it

Conceptual would be the best current package for these classes.
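The pair of classes decided on above can be sketched roughly as follows. This is an illustrative model only: the property names (purpose, funding_source, etc.) are assumptions drawn from the discussion, not the final DDI4 definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StudySeries:
    """The overall, ongoing activity (e.g., the CPS as a whole)."""
    name: str
    purpose: Optional[str] = None          # purpose for the whole set of activities
    funding_source: Optional[str] = None   # funding source for the whole
    studies: List["Study"] = field(default_factory=list)


@dataclass
class Study:
    """One concrete wave or phase producing data (e.g., the monthly CPS)."""
    name: str
    series: Optional[StudySeries] = None   # a one-time study may stand alone

    def attach(self) -> None:
        """Register this study with its series, if it belongs to one."""
        if self.series is not None:
            self.series.studies.append(self)


# Usage: an ongoing series with one concrete wave
cps = StudySeries(name="Current Population Survey",
                  purpose="Measure labor force participation",
                  funding_source="BLS")
may_2016 = Study(name="CPS May 2016", series=cps)
may_2016.attach()
```

A one-time activity would simply be a Study with no series, avoiding the duplication concern raised above.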

Oliver will create the classes; then the rest of us can work on descriptions.

Larry will add other classes to the view

 

Goodbye, “TOFKAS” (The Object Formerly Known As Study). Even Prince went back to “Prince.”

April 26, 2016

Notes from Codebook Meeting 2016-04-26, 8am EDT
Attending Dan Gillman, Gillian Kerr, Larry Hoyle
We discussed the need to update column E in https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=1652443366. Larry will make a first pass at this.
Does DDI4 have the ability to describe a reference period for a question, e.g., “over the last three months have you…?” The ReferenceDate class http://lion.ddialliance.org/ddiobjects/referencedate has a typeOfDate property that should be able to do this.
We need a controlled vocabulary for the semantics of a ReferenceDate typeOfDate. An example term would be the date range to which a question refers. This will be a heavily used vocabulary. Should it be a choice built into the standard? That would be less flexible than an external vocabulary that can expand; the latter is preferable. The DDI Alliance controlled vocabulary group might address this issue.
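The idea can be sketched as an XML fragment built in Python. The element names loosely follow the ReferenceDate/typeOfDate discussion above, but the exact names, the vocabulary attribute, the URI, and the term "ReferencePeriod" are all assumptions for illustration, not the published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch: a question's reference period ("over the last three
# months have you...?") captured as a ReferenceDate whose typeOfDate points
# at a term in an external controlled vocabulary. Names are assumptions.

ref = ET.Element("ReferenceDate")
type_of_date = ET.SubElement(ref, "typeOfDate")
type_of_date.set("vocabularyURI", "http://example.org/cv/TypeOfDate")  # invented URI
type_of_date.text = "ReferencePeriod"  # assumed CV term

date_range = ET.SubElement(ref, "DateRange")
ET.SubElement(date_range, "StartDate").text = "2016-02-01"
ET.SubElement(date_range, "EndDate").text = "2016-04-30"

fragment = ET.tostring(ref, encoding="unicode")
print(fragment)
```

Because the vocabulary lives outside the schema, new terms (e.g., for other date semantics) can be added without revising the standard, which is the flexibility argued for above.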
What class will we use for the “study” in 4? GSIM has statistical activity. We have derived data, experimental data, data collection from administrative sources, scraped data from an administrative registry, mashed data from the web, something like the consumer price index, and qualitative data. TOFKAS (the object formerly known as study); TOFKAT (the object formerly known as TOFKAS). Perhaps Thingamabob? Data Activity? Study? ThatWhichWasCaptured? DataCollection? AcquisitionActivity? Whatever the name, we need a class to represent the overall activity of creating/collecting the data.

...

February 2, 2016

Codebook meeting

2 February 2016

Attending: Dan, Michelle, Steve, Oliver, Jon, Larry, Jared

There’s some lack of clarity about where this group stands. We discussed what to include in simple codebooks. One idea is to review the spreadsheet of common elements (a summary of CESSDA) and build on that. The essentials seem to include: enough information to read the data into a statistical package, label the values, understand the universe, understand what the measure means so you can interpret the data, and attribution information. Another idea is to look at examples of simple codebooks, identify what they use, and then map that to a model.

We need to be careful to keep things simple.  Even older versions of DDI 2 weren’t exactly simple.

If we nail down definitions, do we then make instances of previous versions incompatible? As we define what information elements we want in DDI 4.0, we can specify which element to use in 2 if you’re going backwards.

Next steps:

  1. Michelle will go through spreadsheet and narrow down to those elements that are DDI Lite and any others that are heavily used (e.g., key words).

  2. Michelle will paste those elements into a new sheet within the spreadsheet.

...

March 16, 2015

Simple Codebook Meeting

March 16, 2015

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The agenda for the meeting was to determine whether all elements in the CESSDA profile/Nesstar profile are present in DDI 4. Larry Hoyle had created a spreadsheet of DDI Lite and the list of elements from the CESSDA profiles. There is wide variety in the selection of elements and attributes among the repositories using DDI Lite. The Nesstar Webview serves as the base. The group compared elements used across different repositories.

The task was to find out which elements are in DDI4, so the group decided to divide up the list of 200+ elements. There appear not to be any DDI4 elements about the metadata itself, the DDI document; this basically parallels the study description information. This may not be relevant for DDI4; perhaps the Data Citation group should think about it. This is often the archive's intellectual property, so some representation of it will be of interest to most archives. Citing the user guide or documentation is a common practice.

DDI Codebook has some elements of description that DDI4 has not been talking about. We need to bring this to the Advisory Group; it is an issue that we need to discuss. In DDI Lifecycle there is the corresponding instance with a citation on it. There is no DDI4 instance, because instance is a root element for documents in general.

Will the idea of a document description disappear in 4? The archive creates a document describing the data. The landing page is sometimes (always?) metadata.

Study level, variable level, record level, file level: should the Data Citation group look at what the targets of citation are?

In DDI Codebook we have DocumentDescription; in DDI Lifecycle we have DDIInstance. Should DDIInstance be brought back into DDI4, with revised content but allowing attachment of annotation?

Being able to point to an XML file with the model and generate that file from elements in 4 is adequate. But it is no longer enough to point to one object that contains everything.

We have the logical vs. physical distinction. A DDIInstance is a physical thing, something that's there. Pulling together the information into that representation is an activity with authors, etc. The "same" content can exist in two archives, with different contact people and different URIs for each. This is parallel to data description.

Assignments for the next meeting

Where in DDI4 does each of these elements exist?

FirstLine | LastLine | N  | Who      | Content
70        | 101      | 31 | Dan      | Citation
102       | 131      | 29 | Steve    | Scope Methodology
132       | 155      | 23 | Oliver   | Access Conditions
156       | 184      | 28 | Larry    | File Variable
185       | 205      | 20 | Mary     | VarDoc
206       | 232      | 26 | Michelle | CategoryGroups OtherMaterial

March 2, 2015

Simple Codebook Meeting March 2, 2015

Present: Michelle Edwards, Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan

The group welcomed Michelle Edwards of CISER. The chair noted that this group is in a sense waiting for other groups (Discovery, Data Description, Instrument) to complete what they are doing so that we can finish our work. We recognize a need to  incorporate both Codebook and Lifecycle into one spec (DDI 4), so we have been exploring that in our group a bit.

DDI Lite was reviewed and compared with the element sets that ICPSR, GESIS, and IHSN use; they are a fairly good match.

We won't be able to exactly duplicate Codebook and Lifecycle as views of DDI 4 but we can get close. Organizations that have invested in 3.2 do not want to lose that investment. Can we map 3.2 to 4 by automatically importing what's in 3.2? We may need a conversation with Guillaume about this. This should probably be at the Advisory Group level.

DDI Codebook and Lifecycle have different names for the same element. We will need mappings for people.

What we write out is also important. Interoperability can be defined in terms of reading into and writing out of a system. If we can read 2.5 into 4, we are able to ingest anything that occurs anywhere under 2.5. We want to be able to write an instance that contains all the semantic content of Codebook. If we know there is an equivalence, we should have a 2.5 writer that writes it out under that name. It is the structure and the mappings that matter.

There were changes between Codebook and Lifecycle that were not necessarily clean because of the use of things by reference in 3 (categories and codes). Upward compatibility may be tougher than downward compatibility. We should probably not worry about 3 here but concern ourselves with mapping 2.5 into 4.

Is Codebook still an aggregation of Discovery, Description, and Instrument? Right now Discovery is a stripped down element set.

We could take 2.5 as a starting point, and we need to be able to account for it. Then we could look at 4 and ask whether everything is covered. Can we restrict this to 2.5 Lite? Generally, yes.

A Codebook view would be intended for an audience that is creating or managing codebooks and it doesn't matter what things are in other views or packages.

Views can overlap as much as you want. DDI Lite is a view. DDI 2.5 is a view. We are leveraging the experience of repositories (ICPSR, GESIS, IHSN) in serving up data, so that makes a good codebook. It makes sense to rely on DDI Lite, which we know is used.

The group reviewed the elements in DDI Lite. ADA uses a few other elements like deposit date, alternative title, collection situation, etc. ADA uses the default Nesstar template, which is close to DDI Lite. We should look at Nesstar also. The CESSDA Profile would be the best thing to use. We need to identify where things are already defined in 4 and where things still need to be defined. We need to know what is missing from 4 in order to have a sense of where we stand. Our group could then go to the AG to say what needs to be addressed in sprints.

If we have something in 4 that maps to the Nesstar/CESSDA profile, that allows a big chunk of DDI users to adopt 4. There is another migration path we can look at: we have the 2.5 codebook; is there a more modern one? Migrate 2.5 to something different? This may be out of scope for our group, but we should discuss it.

...

September 15, 2014

 

Simple Codebook Meeting
September 15, 2014

 

Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Jenny Linnerud, Steve McEachern, Ornulf Risnes, Wendy Thomas, Mary Vardigan

Discussion

The group affirmed Wendy’s definition of a codebook (See Appendix A for the full document):

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to support the incorporation of the data into a program for analysis, including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships, and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).
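As a concrete illustration of the data-dictionary portion of this definition, a single entry might carry the name, physical location, data type, and value meanings like so. The variable, field names, and codes are entirely hypothetical.

```python
# Minimal sketch of one data-dictionary entry carrying the fields the
# definition above lists: name, physical location, data type, size, and the
# meaning of values, both valid and missing. All values are invented.

entry = {
    "name": "MARSTAT",
    "label": "Marital status",
    "location": {"start_column": 12, "width": 1},  # physical layout in the file
    "data_type": "numeric",
    "valid_values": {1: "Married", 2: "Widowed", 3: "Divorced",
                     4: "Separated", 5: "Never married"},
    "missing_values": {9: "Not ascertained"},
}


def decode(entry, code):
    """Translate a raw code into its meaning, flagging missing data."""
    if code in entry["missing_values"]:
        return entry["missing_values"][code], True   # (label, is_missing)
    return entry["valid_values"].get(code), False


print(decode(entry, 2))   # ('Widowed', False)
print(decode(entry, 9))   # ('Not ascertained', True)
```

This structured part is what a statistical package needs to read the data; everything else in the codebook (source, methodology, quality) layers interpretation on top of it.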

The group discussed overlap with other groups and packages since codebook is a compilation of other packages. Simple Codebook is most likely a compilation of Conceptual, Simple Data Description, Discovery, and additional information that facilitates interpretation of the data and intelligent use. The difficulty is determining what depth of information is appropriate. For replication purposes, you need a lot of detail.

The Simple Data Description group is first focusing on data description in a broad way and will then define a subset for “simple.” Perhaps this group should do the same.

It would be helpful to have reports from other groups so that we know where they are and what makes sense to combine for simple codebook.

In Wendy’s list (Appendix A), much of the content we need is covered by other groups, but we could use more detail in Data Source, Data Processing, and Methodology. Methodology framed its scope broadly in Toronto but hasn’t yet met as a group. One activity for that group would be to review the sampling and weighting specifications that came out of the Survey Design and Implementation working group to see what is needed beyond that work.

Next Meeting

The group will meet again on Monday, September 29, to get reports from other groups.

Appendix A

What is a codebook?

[also referred to by DataONE as science metadata for science data]

A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail to support the incorporation of the data into a program for analysis, including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships, and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.

A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).

Discussion

The definitions below for "codebook" are survey-centric when referring to the broader set of metadata related to a data file. Another term may be preferable, but there isn't one that leaps to mind. Whether called a codebook, science metadata, metadata, or something else, data files have two levels of description:

• A structured physical description that supports the ability of the programmer to access the data accurately

• Supporting information that allows the researcher to evaluate the "fitness for use" of the data for a particular research question, the overall quality of the data, and the specifics of the conceptual (objects, universe/population, conceptual definitions, spatial and temporal) coverage. This information may be applicable to the study as a whole or to an individual variable. It also includes information on why and how the data were captured, processed, and preserved.

 

Type of information, and how each kind of codebook treats it (Basic Codebook, Survey, Fauna (Wildlife)):

Data structure: record type, record layout, record relationship, data type, valid values, invalid values
• Basic Codebook: Structured metadata to support access
• Survey: Structured metadata to support access
• Fauna (Wildlife): Structured metadata to support access

Data source: why the data was collected, how it was collected, who collected it, and the universe or population and how it was identified and selected
• Basic Codebook: Descriptive, to support assessment of quality and fitness-for-use
• Survey: Purpose of the survey; survey content and flow (may or may not need to be actionable); identification and sampling of the survey population (may or may not need to be actionable for replication purposes)
• Fauna (Wildlife): Purpose of the study; how the data was collected (may need to be actionable to support replication and/or calibration); identification and sampling of the survey population (may or may not need to be actionable for replication purposes)

Data processing: data capture process, validation, quality control; normalizing, coding, derivations; protection (confidentiality, suppression, interpolation, embargo, etc.)
• Basic Codebook: Informational material; supports provenance
• Survey: May need structured metadata for purposes of replication; include processes, background information, proposed and actual procedures, and implications for the data
• Fauna (Wildlife): May need structure to support mechanical capture instruments, calibrations, situational variants, etc.

Discovery information: who, what, when, why; coverage (topical, temporal, spatial)
• Basic Codebook: Structured metadata to support discovery and access to the data as a whole
• Survey: Structured metadata to support discovery and access to the data as a whole
• Fauna (Wildlife): Structured metadata to support discovery and access to the data as a whole

Conceptual basis: object, concept
• Basic Codebook: Informational material
• Survey: Structured to support analysis of change over time and relationships between studies; may just be descriptive/informational
• Fauna (Wildlife): Structured to support genre-level comparison (heavy use of common taxonomies, etc.)

Methodologies employed
• Basic Codebook: Informational material
• Survey: Structured to support replication and comparison between studies
• Fauna (Wildlife): Structured to support replication and comparison between studies

Related materials of relevance to the data
• Basic Codebook: Informational material

Definitions

Data Dictionary

·         A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."[1] The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

·         A document describing a database or collection of databases

·         An integral component of a DBMS that is required to determine its structure

·         A piece of middleware that extends or supplants the native data dictionary of a DBMS

·         Database about a database. A data dictionary defines the structure of the database itself (not that of the data held in the database) and is used in control and maintenance of large databases. Among other items of information, it records (1) what data is stored, (2) the name, description, and characteristics of each data element, (3) types of relationships between data elements, and (4) access rights and frequency of access. Also called a system dictionary when used in the context of a system design. (Source: http://www.businessdictionary.com/definition/data-dictionary.html)

·         A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. (Posted by Margaret Rouse  @ WhatIs.com)

Codebook

What is a codebook? (http://www.sscnet.ucla.edu/issr/da/tutor/tutcode.htm)

A codebook describes and documents the questions asked or items collected in a survey. Codebooks and study documentation will provide you with crucial details to help you decide whether or not a particular data collection will be useful in your research. The codebook will describe the subject of the survey or data collection, the sample and how it was constructed, and how the data were coded, entered, and processed.  The questionnaire or survey instrument will be included along with a description or layout of how the data file is organized.  Some codebooks are available electronically, and you can read them on your computer screen, download them to your machine, or print them out. Others are not electronic and must be used in a library or archive, or, depending on copyright, photocopied if you want your own for personal use.

Codebook : Lisa Carley-Baxter (http://srmo.sagepub.com/view/encyclopedia-of-survey-research-methods/n69.xml)

Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file. Data files usually contain one line for each observation, such as a record or person (also called a "respondent"). Each column generally represents a single variable; however, one variable may span several columns. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean. Codebooks are used to document the values associated with the answer options for a given survey question. Each answer category is given a unique numeric value, and these unique numeric values are then used by researchers in their analysis of the ...

Codebook (Wikipedia.com)

A codebook is a type of document used for gathering and storing codes. Originally codebooks were often literally books, but today codebook is a byword for the complete record of a series of codes, regardless of physical format.

ICPSR

What is a codebook?

A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the data files.

While codebooks vary widely in quality and amount of information given, a typical codebook includes:

• Column locations and widths for each variable

• Definitions of different record types

• Response codes for each variable

• Codes used to indicate nonresponse and missing data

• Exact questions and skip patterns used in a survey

• Other indications of the content and characteristics of each variable
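The structured items in this list are what make a codebook machine-actionable: column locations and widths, response codes, and missing-data codes are enough to read a fixed-width raw data file. A minimal sketch, with invented variables and codes:

```python
# Sketch of the core machine-actionable use of a codebook. The layout and
# code tables below stand in for what a real codebook would document; the
# variables, positions, and codes are invented for illustration.

layout = [                      # (name, start column, width) from the codebook
    ("CASEID", 0, 4),
    ("SEX",    4, 1),
    ("AGE",    5, 2),
]
missing_codes = {"SEX": {"9"}, "AGE": {"99"}}          # nonresponse / missing
response_codes = {"SEX": {"1": "Male", "2": "Female"}}  # answer-option labels


def parse_record(line):
    """Decode one fixed-width record using the codebook's layout and codes."""
    rec = {}
    for name, start, width in layout:
        raw = line[start:start + width].strip()
        if raw in missing_codes.get(name, set()):
            rec[name] = None                      # flag missing data
        else:
            rec[name] = response_codes.get(name, {}).get(raw, raw)
    return rec


print(parse_record("0001199"))  # {'CASEID': '0001', 'SEX': 'Male', 'AGE': None}
```

The remaining items (skip patterns, question text) serve interpretation rather than parsing, which is why codebooks mix structured and informational content.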

Additionally, codebooks may also contain:

• Frequencies of response

• Survey objectives

• Concept definitions

• A description of the survey design and methodology

• A copy of the survey questionnaire (if applicable)

• Information on data collection, data processing, and data quality

...