- Created by Former user, last modified by Mary Vardigan on Feb 16, 2015
Meeting: 2014-06-30
Attending: Guillaume Duffes, Dan Gillman, Larry Hoyle, Ørnulf Risnes, Steve McEachern, Wendy Thomas
Reviewed list of related package and view content from Wolfgang
Decisions:
There is currently a lot of duplication in the list and it needs to be normalized prior to review.
Steve will normalize the list and send it out to members later this week with the following instructions:
Review the list and do the following:
Add any unlisted objects that you would expect to find in a basic or simple codebook
For each item indicate if the item is one which would be required in order to publish the codebook or is one that would be useful to have in the codebook
Return your review to the group.
Unless other agenda items arise, schedule the next meeting after the deadline for returning reviews.
Process:
Items that have agreement in terms of "required" will go into a basic view
Items that have agreement in terms of "would like to see" will go into an "intermediate" view
Items without agreement will be discussed and assigned during the next meeting
This may result in the creation of two "simple codebook" views and appropriate names should be determined.
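The sorting rule in the process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "agreement" is taken to mean that every reviewer gave the same rating, the rating labels "required" and "useful" are shorthand invented here, and the item names are placeholders rather than entries from the actual list.

```python
def assign_views(reviews):
    """Sort reviewed items into views.

    `reviews` maps an item name to the list of ratings it received.
    The labels "required" and "useful" are hypothetical shorthand.
    """
    views = {"basic": [], "intermediate": [], "discuss": []}
    for item, ratings in reviews.items():
        distinct = set(ratings)
        if distinct == {"required"}:
            # Everyone agrees the item is needed to publish.
            views["basic"].append(item)
        elif distinct == {"useful"}:
            # Everyone agrees the item is nice to have.
            views["intermediate"].append(item)
        else:
            # Mixed (or missing) ratings: discuss at the next meeting.
            views["discuss"].append(item)
    return views


# Placeholder items and ratings, for illustration only.
reviews = {
    "Variable": ["required", "required", "required"],
    "Weighting": ["useful", "useful"],
    "Universe": ["required", "useful"],
}
result = assign_views(reviews)
print(result)
```

A looser definition of agreement (for example, a majority) would only change the two comparisons inside the loop.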
Discussion:
Given the range of use cases (something above a simple data set to a simple study housed in an archive) it is difficult to determine what is meant by "simple". Rather than discuss in the abstract it may be helpful to get a list of objects one would like to see in a simple codebook from the members of the group and then identify those objects that are considered to be the minimum requirement for publication. This may result in two levels for a simple codebook (basic and intermediate) but the approach would provide clear information on where there is consensus and where there is debate.
Statements that may help define the differences between these two levels:
The bare minimum needed in order to publish (basic)
What would you like to see in this view (intermediate)?
There has been a shift from the initial content creation in Drupal of a simple codebook "package" to the idea of a "view" and we need to reorient the Drupal content to this shift. In addition, packages and views relating to the simple codebook view that were not in existence when the work of this group was started are now more fully defined. The content of these packages and views needs to be considered when defining the view(s) of a simple codebook.
View orientation is liberating
A view contains objects (it is not a compilation of views)
A view (specific version) may partially or fully support another view - the intent to do this should be noted in the description of the new view
The following process could be useful in defining the view(s) for a simple codebook:
Creating the list of objects for a simple codebook:
Start with Wolfgang's list as an example (the normalized version of this list)
What would you add?
What would you like?
What is required vs. what is optional (simple to intermediate)?
Create a view of Simple codebook in Drupal - using the final agreed upon list of a view
Note: Some of the objects being included are complex objects. These should then be reviewed to see if a simpler basic object of that type is needed. (I.e. we may only want to include a "stripped down" version in the view)
Steve will take a go at normalizing and send list out to group
Wolfgang can then enforce getting responses.
Meeting in two weeks:
this week if possible for list out
wish list turnaround
may want to delay next meeting until after due date for getting lists back from members
Simple Codebook Meeting
September 15, 2014
Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Jenny Linnerud, Steve McEachern, Ørnulf Risnes, Wendy Thomas, Mary Vardigan
Discussion
The group affirmed Wendy’s definition of a codebook (See Appendix A for the full document):
A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail for the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.
A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).
The group discussed overlap with other groups and packages since codebook is a compilation of other packages. Simple Codebook is most likely a compilation of Conceptual, Simple Data Description, Discovery, and additional information that facilitates interpretation of the data and intelligent use. The difficulty is determining what depth of information is appropriate. For replication purposes, you need a lot of detail.
The Simple Data Description group is first focusing on data description in a broad way and will then define a subset for “simple.” Perhaps this group should do the same.
It would be helpful to have reports from other groups so that we know where they are and what makes sense to combine for simple codebook.
In Wendy’s list (Appendix A), much of the content we need is covered by other groups, but we could use more detail in Data Source, Data Processing, and Methodology. Methodology framed its scope broadly in Toronto but hasn’t yet met as a group. One activity for that group would be to review the sampling and weighting specifications that came out of the Survey Design and Implementation working group to see what is needed beyond that work.
Next Meeting
The group will meet again on Monday, September 29, to get reports from other groups.
Appendix A
What is a codebook?
[also referred to by DataONE as science metadata for science data]
A codebook combines the contents of a data dictionary with additional information to support the intelligent use of the data which it describes. The data dictionary provides structured information on the layout of the data, providing sufficient detail for the incorporation of the data into a program for analysis including the name, physical location of the data, data type, size, and meaning of the values. This should include both valid and invalid (missing) values as well as information on the record types, relationships and internal layout. The codebook pulls together additional information required for understanding the source of the data, its relevance to the research question, and related information about the survey design, methodologies employed, the data collection process, data processing, and data quality.
A codebook should contain information for discovery and for data manipulation (data dictionary contents) in a structured format to support programming for access. Other sections of metadata may be machine actionable or informational depending on the use of the codebook structure. Informational content can be maintained in-line (as specific content of the codebook) or by reference to external content (a questionnaire, research proposal, methodology resources, etc.).
Discussion
The definitions below for "codebook" are survey centric when referring to the broader set of metadata related to a data file. Another term may be preferable but there isn't one that leaps to mind. Whether called a codebook, science metadata, metadata, or something else, data files have two levels of description:
· A structured physical description that supports the ability of the programmer to access the data accurately
· Supporting information that allows the researcher to evaluate “fitness of use” of the data to a particular research question, the overall quality of the data, and the specifics of the conceptual (objects, universe/population, conceptual definitions, spatial and temporal) coverage. This information may be applicable to the study as a whole or to the individual variable. This also includes information on why and how the data were captured, processed, and preserved.
Type of information | Basic Codebook | Survey | Fauna (Wildlife) |
Data structure: record type; record layout; record relationship; data type; valid values; invalid values | Structured metadata to support access | Structured metadata to support access | Structured metadata to support access |
Data source: why the data were collected; how the data were collected; who collected the data; the universe or population and how it was identified and selected | Descriptive to support assessment of quality and fitness-for-use | Purpose of the survey; survey content and flow (may or may not need to be actionable); identification and sampling of survey population (may or may not need to be actionable for replication purposes) | Purpose of study; how data was collected (may need to be actionable to support replication and/or calibration); identification and sampling of survey population (may or may not need to be actionable for replication purposes) |
Data processing: data capture process; validation; quality control; normalizing, coding, derivations; protection (confidentiality, suppression, interpolation, embargo, etc.) | Informational material; support provenance | May need structured metadata for purposes of replication; include processes, background information, proposed, actual, and implications for data | May need structured metadata to support mechanical capture instruments, calibrations, situational variants, etc. |
Discovery information: who; what; when; why; coverage (topical, temporal, spatial) | Structured metadata to support discovery and access to the data as a whole | Structured metadata to support discovery and access to the data as a whole | Structured metadata to support discovery and access to the data as a whole |
Conceptual basis: object; concept | Informational material | Structured to support analysis of change over time and relationship between studies. May just be descriptive / informational. | Structured to support genre level comparison (heavy use of common taxonomies, etc.) |
Methodologies employed | Informational material | Structured to support replication and comparison between studies | Structured to support replication and comparison between studies |
Related materials of relevance to data | Informational material | | |
Definitions
Data Dictionary
· A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format." The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):
· A document describing a database or collection of databases
· An integral component of a DBMS that is required to determine its structure
· A piece of middleware that extends or supplants the native data dictionary of a DBMS
· Database about a database. A data dictionary defines the structure of the database itself (not that of the data held in the database) and is used in control and maintenance of large databases. Among other items of information, it records (1) what data is stored, (2) name, description, and characteristics of each data element, (3) types of relationships between data elements, (4) access rights and frequency of access. Also called system dictionary when used in the context of a system design. (Source: http://www.businessdictionary.com/definition/data-dictionary.html#ixzz3Am5wCgZI)
· A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. (Posted by Margaret Rouse @ WhatIs.com)
Codebook
What is a codebook? (http://www.sscnet.ucla.edu/issr/da/tutor/tutcode.htm)
A codebook describes and documents the questions asked or items collected in a survey. Codebooks and study documentation will provide you with crucial details to help you decide whether or not a particular data collection will be useful in your research. The codebook will describe the subject of the survey or data collection, the sample and how it was constructed, and how the data were coded, entered, and processed. The questionnaire or survey instrument will be included along with a description or layout of how the data file is organized. Some codebooks are available electronically, and you can read them on your computer screen, download them to your machine, or print them out. Others are not electronic and must be used in a library or archive, or, depending on copyright, photocopied if you want your own for personal use.
Codebook : Lisa Carley-Baxter (http://srmo.sagepub.com/view/encyclopedia-of-survey-research-methods/n69.xml)
Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file. Data files usually contain one line for each observation, such as a record or person (also called a "respondent"). Each column generally represents a single variable; however, one variable may span several columns. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean. Codebooks are used to document the values associated with the answer options for a given survey question. Each answer category is given a unique numeric value, and these unique numeric values are then used by researchers in their analysis of the ...
Codebook (Wikipedia.com)
A codebook is a type of document used for gathering and storing codes. Originally codebooks were often literally books, but today codebook is a byword for the complete record of a series of codes, regardless of physical format.
ICPSR
What is a codebook?
A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the datafiles.
While codebooks vary widely in quality and amount of information given, a typical codebook includes:
• Column locations and widths for each variable
• Definitions of different record types
• Response codes for each variable
• Codes used to indicate nonresponse and missing data
• Exact questions and skip patterns used in a survey
• Other indications of the content and characteristics of each variable
Additionally, codebooks may also contain:
• Frequencies of response
• Survey objectives
• Concept definitions
• A description of the survey design and methodology
• A copy of the survey questionnaire (if applicable)
• Information on data collection, data processing, and data quality
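As a rough illustration of how the structural items in the ICPSR list (column locations and widths, response codes, and missing-data codes) let a program read a data file, here is a minimal Python sketch. The `VariableSpec` class, the variable names, the codes, and the sample records are all hypothetical and are not taken from any real codebook.

```python
from dataclasses import dataclass, field


@dataclass
class VariableSpec:
    """One codebook entry for a fixed-width file (hypothetical structure)."""
    name: str
    start: int                                   # 1-based column location
    width: int                                   # number of columns
    codes: dict = field(default_factory=dict)    # response code -> label
    missing: set = field(default_factory=set)    # codes meaning nonresponse/missing


def decode_record(line, specs):
    """Slice one fixed-width record and decode it using the codebook entries."""
    row = {}
    for spec in specs:
        raw = line[spec.start - 1 : spec.start - 1 + spec.width].strip()
        if raw in spec.missing:
            row[spec.name] = None                # invalid (missing) value
        else:
            # Use the label if the code is documented, else keep the raw value.
            row[spec.name] = spec.codes.get(raw, raw)
    return row


# Illustrative codebook: variable names and codes are made up.
CODEBOOK = [
    VariableSpec("respondent_id", start=1, width=4),
    VariableSpec("sex", start=5, width=1,
                 codes={"1": "Male", "2": "Female"}, missing={"9"}),
    VariableSpec("age", start=6, width=2, missing={"99"}),
]

first = decode_record("0001234", CODEBOOK)
second = decode_record("0002999", CODEBOOK)
print(first)
print(second)
```

Without the column locations, widths, and code lists, the string "0001234" cannot be interpreted, which is the sense in which this part of a codebook supports access to the data.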
Simple Codebook Meeting
September 29, 2014
Present: Jenny Linnerud, Steve McEachern, Barry Radler, Wendy Thomas, Mary Vardigan
Discussion
The group had a discussion of Simple Codebook as a compilation of different components, including Conceptual, Simple Data Description, Discovery, and Simple Instrument, as well as elements of data processing/provenance and Methodology. Steve showed a diagram he had created to show the big picture. He subsequently added boxes for Methodology and Discovery to that big picture – see DDI4_view_overview.pdf. This helps us visualize the structure of DDI 4 and also the codebook view. This diagram should be part of the upcoming Dagstuhl meeting orientation on the first day to ensure that everyone is united in their understanding of how DDI 4 works.
It was pointed out that there might be links to additional information (e.g., Methodology) in Simple Codebook but a more Complex Codebook could bring some of that information inline in a structured, machine-actionable way (e.g., routing/skip patterns through a questionnaire). The group also discussed that we need to distinguish questionnaire-centric codebooks from more generic codebooks that talk about measurement rather than variable, for example.
In Toronto the Methodology group got started but it needs more time to focus on this area. Initially, the group drew a line between design and implementation. It was pointed out that we need to separate "what do you want to do" from "what are you doing" and "what have you done." We also need to think about replication.
In terms of data citation, we may need to cite at a very granular level. Should Discovery be part of all views in this sense?
We should think about a set of administrative metadata that accompanies each view and describes it so there is some consistency across views. This might indicate the order that compilations should take and whether they have a logical sequence. This would be a guide to the view.
For Dagstuhl, our group should put out a requirements document indicating what we need for the Simple Codebook view, including the administrative metadata. We also want to have guidelines for future groups doing composite views.
The Simple Codebook needs a view on the Drupal server, so we will contact Oliver about that.
Action Items:
Mary to make sure Steve's diagram is part of Dagstuhl orientation
Mary to spearhead proposal for Dagstuhl that includes requirements for Simple Codebook
Mary to contact Oliver regarding the Simple Codebook view on Drupal
Simple Codebook Team Minutes
November 10, 2014
Present: Dan Gillman, Larry Hoyle, Jenny Linnerud, Steve McEachern, Wendy Thomas, Mary Vardigan, Wolfgang Zenk-Moeltgen
Mappings
There is a spreadsheet for the archival codebook use case that lists elements used by ICPSR, CESSDA, and IHSN. Wendy will map the IHSN codebook elements to DDI 3.2 and send this to the group.
Package vs. View
The group also looked at the Simple Codebook package on the Lion Drupal site. The question was raised of whether we should still work on the Lion site since the simple codebook is still a package and not a view.
All Discovery information came from the Disco specification. We should model our own objects for Simple Codebook and then map to Disco. There hasn’t yet been a discussion about what is a property and what is an object.
In terms of information elements needed for codebooks, there are more things in the package than in the spreadsheet because we copied the DDI 3.2 elements and didn’t delete anything with the idea that we would need the other elements later. We moved all the objects specified as Keep from DDI 3.2 into the package.
The content groups should create a complete list of the information elements needed and then the modelers will arrange this into packages and make decisions about objects vs. properties.
Should we add all elements from package into the view? Right now we should not put in anything we are not using. We are trying to start with the essential and then add onto it.
Things go in the view on Lion if we want it in the simple DDI 4 codebook. We are compiling the list of elements used by ICPSR, CESSDA, and IHSN for the basic set of elements. The package that Wolfgang created at Dagstuhl contains additional things not in the spreadsheet. Everything from the Excel list should go into the view.
Graphs are only created for packages and not for views. This is something we will miss if we work only in the view. It doesn’t make sense to work on the Lion site until basic processes have been defined. How do we capture the results of this group?
After the EDDI sprint the whole process should be working and documentation will be produced from a view and a diagram will be produced from the nightly build. Right now things in other groups’ work (e.g., instrument) are not linked in to our codebook set of elements.
Things that need to be fixed on Lion include:
a. Arrows on aggregation and composition are the wrong way round
b. On each object the current DDI 3.2 and GSIM fields should also be rendered in View mode (they are currently only visible in Edit mode)
c. No graph appears for View, only a flat list
Wendy will relay these points to the modelers and the Lion maintainers.
Composite View Modifications
If you take elements that have been defined elsewhere in your composite view, you get everything but you may not want everything. You should be able to make a simple codebook. Someone needs to remodel it so that we can take just the portion we want.
We want to confirm what we need in our use case through the Excel spreadsheet. We need to draw a line for our first proposal for our simple codebook. The more things we decide are properties of an object, the more remodeling we will have to do when people want only some properties. Does this argue for making more things objects rather than properties? This is a tough call as we want to decrease the number of elements.
Working with Data Description
We should take a look at the Data Description model that came out of Dagstuhl and use that as a test because the current discussion about Datum is going to change things. What came out of Dagstuhl has had a lot of review and is considered solid. On Lion, what’s there now is the representation of what was decided at Dagstuhl and is up to date. Steve can compile this in a straightforward way and generate a view which is all the objects that will be used by Data Description. All the relevant content is in Lion.
First we need to look at the Excel list and compile the data description level and then look at the View for Data Description to make sure this matches what we need.
Next Meeting
The next meeting will be on November 24.
Actions
Wendy will complete spreadsheet with information for IHSN
The group will pull out the needed elements at the data description level in the Excel sheet
Steve will create a view of Data Description (we can flatten this if needed)
The group can compare the use case elements to what is in Data Description
Simple Codebook Meeting Minutes
November 23, 2014
Present: Dan Gillman, Steve McEachern, Mary Vardigan
Meeting Times
The current time is midnight for Canberra, so we need to find another meeting time. 2pm EST U.S. time is the preferred time for the new year.
DDI 3.2 vs. 4
We are thinking in terms of forward compatibility so that everything in 3.* is covered in 4. This is not the best approach. Rather, we should solve the problem we want to solve and then worry about how to map it after we have solved it.
Framing happens unconsciously -- the circumstances of how you think about a problem constrain the way you are conceiving it.
Still it’s worth having a look at what we have right now to see what the overlap is.
By sticking with the nicely defined distinction between logical and physical we can be more precise going forward.
Not much is left uncovered in 4, but it is going to be reorganized.
Next Steps
Steve will compare the spreadsheet to Data Description in 4 to determine how they map and overlap.
Simple Codebook Meeting Minutes
February 2, 2015
Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Steve McEachern, Mary Vardigan
The Simple Codebook committee will now be chaired by Dan Gillman as Wolfgang is not able to chair currently.
This group has been in a holding pattern because we are waiting on the results of other groups. However, it was suggested that we look at the Codebook 2.5 (Codebook Version) in comparison to DDI 3.* (Lifecycle Version).
XML permits a detailed description of elements and this is part of the distinction between 2 and 3. But UML doesn't allow this and doesn't account for nesting and levels of detail. We should try to incorporate what is in Version 2 into the model as best we can. We as a group should try to build this. One further possible advance would be that we could then have a single model to account for both Codebook and Lifecycle. Both views would be under one spec in this approach.
Is referencing and reusability a distinction between the two versions that we should take into account? Should it be communicated to the modeling team that we may not need the complexity?
For users who want to describe their data, they should be able to write a description and fit it into a framework. If you want to have interoperability with other systems, then that is a different issue.
For the standalone one-off research project, users will not need to be reusing variables and questions, but for longitudinal and research across languages and cultures, this is important; there is a need to harmonize across questionnaires, reuse metadata across time, etc. Maybe this is Complex Codebook?
We need a distinction between the user perspective and the technical perspective. Simple and complex need to be interoperable. It's necessary to reduce the complexity of what is modeled in the library by choosing the simple cases.
One of the decisions for DDI 4 is to make everything identifiable and drop the container aspect of identifiability. This takes away a lot of the complexity.
From a marketing perspective, we need to distinguish between the DDI Codebook version and the Simple Codebook view. Looking at what is in 2 now will be required and we need to lay out what we need to account for. In the study section for DDI Codebook, there were a number of elements that allowed you to provide a high level text description of various methodological things. Preserving that is important.
Capturing what is in an SPSS or SAS representation including all the metadata you can put there is also important. When you move data around, you don't want to lose anything. When you look at how researchers want to record information, it is often difficult for them to record things in detail. Guided structures as part of their workflow are important, and Codebook is one view that could help them with this. You need some structure that becomes machine-actionable. You don't want people to just write a narrative.
At BLS, there is a Handbook of Methods. It has narrative descriptions of the surveys BLS does and it doesn't have a lot of detail. This should be captured in DDI rather than in a PDF. There is a need for high level and detailed as well. There may still be a need for some kind of a DDI Lite as a way of inducing reluctant data producers to get involved. For variables the detail is necessary. We should make this as flexible as we can.
We can start by looking at what is in 2.5 and figure out from the point of view of a list of what we need to account for. This would be a set of requirements that we as a group need to figure out how to solve. One question we want to address from a modeling point of view is, for example, when we need to say how the sample is constructed: Would those higher level descriptions go in a class of things that are independent of everything else or part of a sampling class? These are design issues that might have an impact on the way the more detailed model plays out.
If we can manage both 2 and 3 in the same structure we as a standards body will have an easier time with this. We should consult with Wendy on this.
Several archives still rely on DDI Codebook, Nesstar, etc. There is a set of codebook specs from different archives.
Are we talking about having our Simple Codebook view covering everything that is in 2.5? It should be even less. But should there be a view that is everything in 2.5? One idea is a view that is a really simple codebook but to allow for complexity in any direction you would like to go so we could incorporate everything that is in 2.5. Or go into more detail in 3 for whatever direction you want to go so there is a seamless distinction between high and detailed levels. This is basically what DDI 4 is. We should provide a lot of different options about how much detail the user wants. With 4 right now we have detailed descriptions of a lot of things but we are not allowing for high level descriptions. The description and definition were discussed in London with respect to Drupal in the sense that there could be radio buttons to indicate that they should be used to standardize those objects. It could be possible to have a description without any usage of detailed sub-elements.
There could be an attribute that could be high-level description. Or we have an element saying this is the Sample Description. Just having an element called description associated with identifiable objects may not be sufficient. In the annotated identifiable there is an annotation element that has Dublin Core properties like Title, Contributor, etc. It has an abstract. But there is nothing that is a high-level description.
On the one hand it might be nice to have a Sampling Description, but it might be over-specified. It's important to have an element dedicated to a high-level description that you are offering in place of the detail or as a supplement to the detail. A general description like the annotation will lose semantic interoperability. We need machine-interpretability. We also want the possibility to reference just the high level description in the simple codebook.
We should be able to allow for user-defined views that provide for whatever level of detail an organization uses. A Simple Codebook view that maps back to 2.5 would be useful. It would allow those organizations just using 2 to feel comfortable using 4.
DDI 4 does not have the same hierarchy as DDI 3. We would still need an object carrying high level content for the sampling process and nothing else. In 3 there was a parent node but we don't have this structure in DDI 4, which means you need to create a container for this description. It's not a question of using description as a property containing the text, but which element carries the description.
Between now and the next meeting, Oliver will make some slides with an example of what we have been talking about. We also need to dig into DDI 2.5 to get a handle on what is needed at the higher level. Dan and Larry will look at this. Dan will also consult with Wendy on this.
Simple Codebook Meeting
February 16, 2015
Present: Dan Gillman, Larry Hoyle, Steve McEachern, Mary Vardigan
Completeness of cross walk between 2 and 3
It is essentially one-way from 2 to 3. Codebook doesn't have the reusability that Lifecycle does. This is the same issue as between SPSS and Stata/SAS. We should look at the mapping.
Content and functionality of Simple Codebook
We want to make sure that Simple Codebook lets us write or ingest 2.x fairly seamlessly. Are the same kinds of element names available in 3? The names change even at the highest level.
Many miss the Tag Library as it was so simple. This kind of resource would be useful along with a mapping. However, Wendy advises that we don't have to worry about 2 since the mapping is there.
Even 2 has a lot of content. Are we still talking about a simple codebook as opposed to a complex codebook? Simple should allow you to take information from a major statistical package and move it to another without losing any information (this is our definition of simple). Question information should be included, as should sampling and universe. We should review DDI Lite and DDI Core, which have not been updated to the most recent versions of Codebook and Lifecycle. This may enable us to have a framework for content. We will deal with functionality later.
We make the assumption that we have the instrument information and the data description information from those two views. What else do we need? Context or study-level information: universe, sampling, design, bibliographic information, citation, study information (which is discovery related), methodology, and access. Does access belong?
What do you need to know to use the data? You need the variable information. Question order and the way questions are asked may be important.
There is a tension between being very simple and following best practice for good documentation. Can we add pointers to relevant information? The simple/complex distinction is levels of detail.
For secondary users, we need enough information for a researcher to be able to understand and evaluate the quality of a dataset without reference back to the original data producer and to pull it into a statistical package.
Take common set of CESSDA, ICPSR, and IHSN mandatory schemas, and figure out what is the superset?
Necessary: variables and questions and layout; universe or population; level of geography (basically coverage); sampling; or weights (and point to thorough description of sampling)
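The superset question can be framed as plain set operations. The sketch below is illustrative only: the element names listed for each organization are placeholders invented for the example, not the actual mandatory elements of the CESSDA, ICPSR, or IHSN schemas, which would have to come from the group's spreadsheet.

```python
# Placeholder element lists; NOT the real mandatory elements of each schema.
MANDATORY = {
    "CESSDA": {"title", "variable", "universe", "sampling"},
    "ICPSR": {"title", "variable", "question", "universe"},
    "IHSN": {"title", "variable", "geography", "weights"},
}


def superset(schemas):
    """Every element that is mandatory in at least one schema."""
    return set().union(*schemas.values())


def common_core(schemas):
    """Only the elements that are mandatory in every schema."""
    return set.intersection(*schemas.values())


all_elements = superset(MANDATORY)
shared = common_core(MANDATORY)
print(sorted(all_elements))
print(sorted(shared))
```

The superset gives an upper bound on what a simple codebook might have to carry, while the common core is a candidate for the bare minimum; the "necessary" list above would sit somewhere between the two.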
The distinction between simple and complex for data description is between a simple rectangular file and other data types; this applies to codebook in some ways as well. Is there a cascading effect if we limit ourselves to simple rectangular files? We should describe hierarchical files as well, like CPS. If we are describing the files themselves, you can describe qualitative files as objects with the existing DDI. You can have hierarchical data in CSV with a record type field, but historically we have had files with physical representations of the data.
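To make the point about a record type field concrete, here is a small Python sketch of a hierarchical file stored as CSV. The column names, the "H" (household) and "P" (person) record types, and the sample rows are all assumptions made for illustration and do not reproduce the CPS layout; the sketch also assumes each household row appears before its person rows.

```python
import csv
import io

# Made-up sample: one CSV holding two record types, told apart by "rectype".
SAMPLE = """\
rectype,id,parent_id,value
H,100,,urban
P,1,100,34
P,2,100,29
H,200,,rural
P,3,200,51
"""


def read_hierarchy(text):
    """Rebuild the household -> persons hierarchy from the flat CSV rows."""
    households = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["rectype"] == "H":
            # Household record: "value" holds an area type in this sample.
            households[row["id"]] = {"area": row["value"], "persons": []}
        elif row["rectype"] == "P":
            # Person record: "value" holds age; parent_id links to the household.
            households[row["parent_id"]]["persons"].append(
                {"id": row["id"], "age": int(row["value"])}
            )
    return households


households = read_hierarchy(SAMPLE)
print(households)
```

Note that the meaning of the `value` column depends on the record type, so a codebook for such a file has to describe each record type and the relationship between them, not just a single list of variables.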
For a simple codebook, the simple representation needs to be limited to Unicode text or something similar.
Homework: review DDI Core: http://www.ddialliance.org/sites/default/files/ddi3/DDI3_CR3_Core.xml and DDI Lite: http://www.ddialliance.org/sites/default/files/ddi-lite.html
And think about what limitations we want to put on format to keep the idea of simple codebook but keep it rich enough so we are covering enough situations.