Meeting: 2014-06-30
Attending: Guillaume Duffes, Dan Gillman, Larry Hoyle, Ørnulf Risnes, Steve McEachern, Wendy Thomas

Reviewed the list of related package and view content from Wolfgang.

Decisions:
There is currently a lot of duplication in the list, and it needs to be normalized prior to review. Steve will normalize the list and send it out to members later this week with the following instructions: review the list and do the following:

Discussion:
Given the range of use cases (from something just above a simple data set to a simple study housed in an archive), it is difficult to determine what is meant by "simple". Rather than discuss this in the abstract, it may be helpful to collect from the members of the group a list of objects they would like to see in a simple codebook and then identify those objects considered the minimum requirement for publication. This may result in two levels for a simple codebook (basic and intermediate), and possibly in the creation of two "simple codebook" views, for which appropriate names should be determined; the approach would provide clear information on where there is consensus and where there is debate. Statements that may help define the differences between these two levels:

There has been a shift from the initial content creation in Drupal of a simple codebook "package" to the idea of a "view", and we need to reorient the Drupal content to reflect this shift. In addition, packages and views relating to the simple codebook view that did not exist when the work of this group started are now more fully defined; their content needs to be considered when defining the view(s) of a simple codebook. View orientation is liberating.

Process:
The following process could be useful in defining the view(s) for a simple codebook: create the list of objects for a simple codebook, then create a view of Simple Codebook in Drupal using the final agreed-upon list. Note: some of the objects being included are complex objects. These should be reviewed to see whether a simpler, basic object of that type is needed (i.e., we may only want to include a "stripped-down" version in the view). Steve will take a first pass at normalizing the list and send it out to the group; Wolfgang can then follow up to get responses.

Meeting in two weeks:
Unless other agenda items arise, schedule the next meeting after the deadline for returning reviews.
Simple Codebook Meeting
Type of information | Basic Codebook | Survey | Fauna (Wildlife) |
Data structure: · Record type · Record layout · Record relationship · Data type · Valid values · Invalid values | Structured metadata to support access | Structured metadata to support access | Structured metadata to support access |
Data source: · Why was data collected · How was data collected · Who collected the data · The universe or population and how it was identified and selected | Descriptive to support assessment of quality and fitness-for-use | Purpose of the survey; Survey content and flow (may or may not need to be actionable); identification and sampling of survey population (may or may not need to be actionable for replication purposes) | Purpose of study, how data was collected (may need to be actionable to support replication and/or calibration); identification and sampling of survey population (may or may not need to be actionable for replication purposes) |
Data processing: · Data capture process · Validation · Quality control · Normalizing, coding, derivations · Protection (confidentiality, suppression, interpolation, embargo, etc.) | Informational material; supports provenance | May need structured metadata for purposes of replication; include processes (proposed and actual), background information, and implications for the data | May need structured metadata to support mechanical capture instruments, calibrations, situational variants, etc. |
Discovery information: · Who · What · When · Why · Coverage o Topical o Temporal o Spatial | Structured metadata to support discovery and access to the data as a whole | Structured metadata to support discovery and access to the data as a whole | Structured metadata to support discovery and access to the data as a whole |
Conceptual basis · Object · Concept | Informational material | Structured to support analysis of change over time and relationship between studies. May just be descriptive / informational. | Structured to support genre level comparison (heavy use of common taxonomies, etc.) |
Methodologies employed | Informational material | Structured to support replication and comparison between studies | Structured to support replication and comparison between studies |
Related materials of relevance to data | Informational material |
· A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."[1] The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):
· A document describing a database or collection of databases
· An integral component of a DBMS that is required to determine its structure
· A piece of middleware that extends or supplants the native data dictionary of a DBMS
· Database about a database. A data dictionary defines the structure of the database itself (not that of the data held in the database) and is used in control and maintenance of large databases. Among other items of information, it records (1) what data is stored, (2) name, description, and characteristics of each data element, (3) types of relationships between data elements, (4) access rights and frequency of access. Also called system dictionary when used in the context of a system design. Read more: http://www.businessdictionary.com/definition/data-dictionary.html#ixzz3Am5wCgZI
· A data dictionary is a collection of descriptions of the data objects or items in a data model for the benefit of programmers and others who need to refer to them. (Posted by Margaret Rouse @ WhatIs.com)
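As an illustration of the definitions above, the following is a minimal sketch in Python of what one data dictionary entry might record; the field names and values are invented for this example and do not come from any particular DBMS or standard.

```python
# A hypothetical data dictionary entry, illustrating the kinds of information
# listed above: what data is stored, the name/description/characteristics of a
# data element, its relationships to other elements, and access rights.
data_dictionary_entry = {
    "name": "employment_status",
    "description": "Current employment status of the respondent",
    "data_type": "integer",
    "format": "1 digit",
    "origin": "Derived from survey question Q12",
    "usage": "Labour force tabulations",
    "related_elements": ["occupation_code", "hours_worked"],
    "access_rights": "public",
}

for key, value in data_dictionary_entry.items():
    print(f"{key}: {value}")
```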
A codebook describes and documents the questions asked or items collected in a survey. Codebooks and study documentation will provide you with crucial details to help you decide whether or not a particular data collection will be useful in your research. The codebook will describe the subject of the survey or data collection, the sample and how it was constructed, and how the data were coded, entered, and processed. The questionnaire or survey instrument will be included along with a description or layout of how the data file is organized. Some codebooks are available electronically, and you can read them on your computer screen, download them to your machine, or print them out. Others are not electronic and must be used in a library or archive, or, depending on copyright, photocopied if you want your own for personal use.
Codebooks are used by survey researchers to serve two main purposes: to provide a guide for coding responses and to serve as documentation of the layout and code definitions of a data file. Data files usually contain one line for each observation, such as a record or person (also called a "respondent"). Each column generally represents a single variable; however, one variable may span several columns. At the most basic level, a codebook describes the layout of the data in the data file and describes what the data codes mean. Codebooks are used to document the values associated with the answer options for a given survey question. Each answer category is given a unique numeric value, and these unique numeric values are then used by researchers in their analysis of the ...
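To make this concrete, the following is a small, hypothetical sketch of how a codebook entry might tie a fixed-width column layout to value labels and missing-data codes; the variable name, column positions, and codes are invented for illustration.

```python
# Hypothetical codebook entry for one variable in a fixed-width data file:
# where it sits in each record, what its numeric codes mean, and which
# codes indicate nonresponse or missing data.
codebook_entry = {
    "variable": "MARSTAT",
    "label": "Marital status",
    "columns": (12, 13),          # start and end column in each record
    "value_labels": {
        1: "Married",
        2: "Widowed",
        3: "Divorced",
        4: "Separated",
        5: "Never married",
    },
    "missing_codes": {8: "Refused", 9: "No answer"},
}

def decode(record: str, entry: dict) -> str:
    """Read the variable's columns from a fixed-width record and label it."""
    start, end = entry["columns"]
    code = int(record[start - 1:end])   # column locations are 1-based in codebooks
    if code in entry["missing_codes"]:
        return entry["missing_codes"][code]
    return entry["value_labels"].get(code, f"Unknown code {code}")

# Example: a record where columns 12-13 contain " 5"
print(decode("00123456789 5 2014", codebook_entry))  # -> "Never married"
```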
A codebook is a type of document used for gathering and storing codes. Originally codebooks were often literally books, but today codebook is a byword for the complete record of a series of codes, regardless of physical format.
What is a codebook?
A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the datafiles.
While codebooks vary widely in quality and amount of information given, a typical codebook includes:
• Column locations and widths for each variable
• Definitions of different record types
• Response codes for each variable
• Codes used to indicate nonresponse and missing data
• Exact questions and skip patterns used in a survey
• Other indications of the content and characteristics of each variable
Additionally, codebooks may contain:
• Frequencies of response
• Survey objectives
• Concept definitions
• A description of the survey design and methodology
• A copy of the survey questionnaire (if applicable)
• Information on data collection, data processing, and data quality
Simple Codebook Meeting
March 16, 2015
Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The agenda for the meeting was to determine whether all elements in the CESSDA profile/Nesstar profile are present in DDI 4. Larry Hoyle had created a spreadsheet of DDI Lite and the list of elements from the CESSDA profiles. There is a wide variety in the selection of elements and attributes across the repositories using DDI Lite. The Nesstar Webview comes as the base. The group compared elements used across different repositories. The task was to find out which elements are in DDI4, so the group decided to divide up the list of 200+ elements.

There appear not to be any DDI4 elements about the metadata itself, the DDI document. This basically parallels the study description information. This may not be relevant for DDI4; perhaps the Data Citation group should think about this. This is often the archive's intellectual property, so some representation of it will be of interest to most of the archives. Citing the user guide or documentation is a common practice. DDI Codebook has some elements of description that DDI4 has not been talking about. We need to bring something to the Advisory Group about this – this is an issue that we need to discuss.

In DDI Lifecycle there is the corresponding instance with a citation on it. There is no DDI4 instance because instance is a root element for documents in general. Will the idea of a document description disappear in 4? The archive creates a document describing the data. The landing page is sometimes (always?) metadata. Study level, variable level, record level, file level: should the Data Citation group look at what the targets of citation are? In DDI Codebook we have DocumentDescription; in DDI Lifecycle we have DDIInstance. Should DDIInstance be brought back into DDI4 – with revised content but allowing attachment of annotation? Being able to point to an XML file with the model and generate that file from elements in 4 is adequate, but it is no longer enough to point to one object that contains everything. We have the logical vs. physical distinction: a DDIInstance as a physical thing – something that's there. Pulling together the information into that representation is an activity with authors, etc. There can be the "same" content in two archives – different contact people, different URIs for each. This is parallel to data description.

Assignments for the next meeting: Where in DDI4 does each of these elements exist?
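As a rough illustration of the element-checking task, a first pass could be scripted along the following lines; the DDI Codebook element paths are real DDI 2.x names, but the DDI4 item names and the crosswalk are invented placeholders, since the actual comparison was done in the group's spreadsheet.

```python
# Hypothetical first-pass check: which profile elements already have an
# obvious counterpart in a list of DDI4 class/property names?
cessda_profile_elements = {
    "stdyDscr/citation/titlStmt/titl",
    "stdyDscr/citation/rspStmt/AuthEnty",
    "fileDscr/fileTxt/fileName",
    "docDscr/citation/titlStmt/titl",
}

# Invented stand-in for an export of DDI4 item names.
ddi4_items = {"Study.title", "Study.contributor", "PhysicalFileName"}

# A hand-maintained crosswalk would map profile paths to DDI4 items.
crosswalk = {
    "stdyDscr/citation/titlStmt/titl": "Study.title",
    "stdyDscr/citation/rspStmt/AuthEnty": "Study.contributor",
    "fileDscr/fileTxt/fileName": "PhysicalFileName",
}

unmapped = sorted(e for e in cessda_profile_elements if e not in crosswalk)
print("Elements still needing a DDI4 home:")
for element in unmapped:
    # e.g. docDscr elements, since no DDI4 document description exists yet
    print(" -", element)
```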
Simple Codebook Meeting
July 20, 2015
Present: Michelle Edwards, Oliver Hopt, Larry Hoyle, Mary Vardigan

Managing DDI Codebook in DDI4
The group discussed whether it would be possible to reconcile the different approaches to identification if we were to manage DDI Codebook in DDI4 in the future, which is the goal. Currently in DDI Codebook, IDs are unique only for the individual instance, not across instances, while the approach of DDI Lifecycle and DDI4 is to have globally unique IDs for all DDI objects. It was the sense of the group, however, that the IDs are not a big barrier, whether using URNs or UUIDs; it should be fairly easy to make a transfer, and scripts can generate UUIDs. We could manage Codebook in DDI4 without taking advantage of referencing and reuse. Also, there is a Local ID in DDI4, which could carry what is currently the ID in DDI Codebook, and a UUID could be added. Colectica goes back and forth between DDI Codebook and DDI Lifecycle and uses UUIDs, so it would be helpful to talk with them about these issues. There is also a political issue in that DDI Codebook has been handled separately and people feel ownership of it as it stands now. It is used around the world by the IHSN. We want to maintain close relationships with these partners, so we will need to design a system that works for them. We should contact Nesstar to start a conversation about how Nesstar Publisher might make some relatively small changes to accommodate this switch to managing DDI Codebook elements in DDI4.

Status of Spreadsheet and Modeling
In the past weeks the Codebook group annotated a spreadsheet – https://docs.google.com/spreadsheets/d/1VDbVz2KRRSX_KEf0IfuE-QqMyTDupftCZfBdBM6VPT8/edit#gid=2125503646 – containing all of the Codebook elements used by CESSDA archives, with the objective of determining which elements are currently in DDI4 and which might need to be added. It was the sense of the modelers on the call that the spreadsheet as it stands now is adequate input for the modeling effort. Oliver, with support from Larry, will start to add classes to Drupal based on the spreadsheet and will get back to the group with any questions. He estimates that he will have a first Codebook View to show in four weeks.
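A minimal sketch of the "keep the Codebook ID as a Local ID and add a UUID" idea, assuming a migration script; the namespace URL and field names are invented, and a deterministic UUID (uuid5) is used only so that re-running the migration produces the same identifier.

```python
import uuid

# Invented namespace for illustration; a real migration would agree on one.
AGENCY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/ddi-agency")

def globalize(instance_id: str, local_id: str) -> dict:
    """Pair a DDI Codebook ID (unique only within its instance) with a
    globally unique, reproducible UUID derived from instance + local ID."""
    return {
        "localId": local_id,  # e.g. the existing DDI-C ID attribute
        "uuid": str(uuid.uuid5(AGENCY_NAMESPACE, f"{instance_id}/{local_id}")),
    }

print(globalize("example-study-06104", "V17"))
```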
Simple Codebook Meeting
August 17, 2015
Present: Michelle Edwards, Dan Gillman, Larry Hoyle, Steve McEachern, Mary Vardigan

What is the advantage of moving from Codebook to Lifecycle?
One benefit is building a collection of reusable instruments in multiple languages. Reusing the census variables in other questionnaires is another area. Something we should promote is building in limited amounts of reuse. It may be possible to incorporate reuse in some areas without incorporating it in others where we don't see the benefit – variables are one such area. Can this be done piecemeal? Most variables are instance variables and possibly represented variables. As they see the need, users can build out to the conceptual level. The recent work on ANES and GSS is a good example of this. With the concept management perspective we have, you can always argue that any two usages are different in some way; we will always be imprecise in some ways. There is a push among NSOs for question banks, but there is a recognition that modes affect the responses. The intent is to measure the same concept. This is why concept management is a powerful idea. One of the problems may be the tools that are needed; we can't yet articulate the use case to build the tools we need.

Identifiers
What is the best way to proceed in terms of identifiers? Oliver is doing some modeling, so we should be able to look at identification based on what he does. Mary will introduce the identifier discussion with Ornulf so we can get Nesstar Publisher on board. We hope there will be a way to use identifiers in DDI-C and append to them to make them unique. In the end, at what level do we need a unique identifier? Anything that is identifiable in 4 requires a unique identifier, and it is pretty well assured that everything identifiable in 2 will be identifiable in 4. The IDs in 2 are at the variable level. If the variable has a unique identifier and the study has a unique identifier, there should be global uniqueness if we could add the registry ID. In the Linked Data world you could find all the variables in the world related to a concept. Another example is the simple fact that many studies are ongoing, so the yearly or monthly variables could be looked at across time. Any time you are making comparisons over time, subject, or geography you need this.
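The registry ID plus study ID plus variable ID point could be sketched as a simple composed identifier; the URN-like format below is illustrative only and is not the DDI URN syntax.

```python
def composed_id(registry: str, study: str, variable: str) -> str:
    """Compose a globally unique, human-readable identifier from the
    registry, study, and variable identifiers (illustrative format only)."""
    return f"urn:example:{registry}:{study}:{variable}"

# Two yearly waves of the "same" variable remain distinct but comparable:
print(composed_id("int.example", "GSS-2012", "MARSTAT"))
print(composed_id("int.example", "GSS-2014", "MARSTAT"))
```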
Simple Codebook Meeting
Attendees: Dan Gillman (chair), Michelle Edwards, Oliver Hopt, Larry Hoyle, Steve McEachern
Oliver distributed a PDF of his thinking around the Codebook model. He presented this work and provided commentary on his thinking. Scenario A was discussed at the last meeting but was seen to be problematic. Scenario B was his revised approach. This includes:
- Study, DataResource and DataFile
- Citation from Annotation
DataResource:
- is consistent with the GSIM equivalent
- carries Citation, which allows various subclasses to be citable
- has one attribute: productionInformation
- VariableBasket and DataFile would be subclasses of DataResource
Study includes:
- StudyDesign
- Fieldwork
- Etc.
Study would have an attribute DataResource.
Comments and discussion
1. Dan asked about the meaning of the blue box around DataResource in Scenario B. Oliver indicated that this would indicate a new package, DataResource.
2. Dan asked what the cardinality of the relationship between Study and DataResource is. Oliver suggested that this should be repeatable - e.g. more than one DataFile in a Study.
3. Dan asked whether DataResource is currently a collection of files or a collection of Variables, and whether it could include Questions.
   - Oliver noted that there is currently a relationship through Measure from Question to InstanceVariable.
   - We may not want to include all of the DataCapture view within Codebook.
   - Dan suggests that DataCapture has not yet laid out the link between the Questionnaire in the abstract versus the Instrument in the physical.
   - We would want to include the Questions, Skips, ResponseCategories and InterviewerInstructions.
   - Which do we want - the PhysicalInstrument or the ConceptualInstrument? Examples: blood pressure measurement, CATI instrument execution. By including the Physical, do we as a result account for the Conceptual?
   - Larry asks whether we can include by reference. Dan argues for the need for explicit rather than implicit reference. Larry notes that this would make an Instrument required content.
   - Dan asks if it is adequate to have just a pointer. If so, how do you link the Variable to the Question?
   - Dan suggests that there IS a link between a Question and a Variable - but it is just not enough to tell you sufficient detail as to how a Datum was derived. The group generally wasn't sure whether we want to try to link the Question and Variable - mostly due to content already existing (particularly pre-2000).
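A rough sketch of Scenario B as discussed above, written as Python dataclasses; the class and attribute names follow the minutes where given (Study, DataResource, DataFile, VariableBasket, productionInformation), while the remaining fields and defaults are guesses added for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Citation:
    """Stand-in for the Citation carried via Annotation."""
    title: str
    creator: Optional[str] = None

@dataclass
class DataResource:
    """Consistent with the GSIM equivalent; citable via its Citation."""
    citation: Citation
    productionInformation: Optional[str] = None

@dataclass
class DataFile(DataResource):
    uri: Optional[str] = None

@dataclass
class VariableBasket(DataResource):
    variable_names: List[str] = field(default_factory=list)

@dataclass
class Study:
    """Study carries StudyDesign, Fieldwork, etc., plus repeatable DataResources."""
    title: str
    study_design: Optional[str] = None
    fieldwork: Optional[str] = None
    data_resources: List[DataResource] = field(default_factory=list)

study = Study(
    title="Example survey",
    data_resources=[
        DataFile(citation=Citation("Main data file"), uri="file:data.csv"),
        VariableBasket(citation=Citation("Core variables"),
                       variable_names=["MARSTAT", "AGE"]),
    ],
)
print(len(study.data_resources))  # repeatable, per the cardinality discussion
```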
Oliver brought the conversation back to what we are currently trying to model.
Preferably there should be some machine-actionable documentation which allows the links between these to be created automatically (or semi-automatically). However, in many cases this simply may not be available for past content (ADA and GESIS have examples, and we believe ICPSR does as well).
As such, we may want to allow for simple external documents which describe the content in a human-readable (but not machine readable or actionable) form. External resource is an option in Lifecycle - this might be the means for this.
Oliver's current model does enable this - it allows for the simple form, while allowing it to be replaced by something more complex where that is available and/or can be generated. Steve noted that this would also be consistent with the approach taken in Methodology.
Where does this leave us, and where to next?
Dan is concerned that we may be adding a fair amount of complexity over DDI version 2.5 - e.g., we have been having discussions about the link between Question and Variable. How would the user community respond to this?
Oliver also noted that this may touch on the discussion with Ornulf about maintaining Codebook 2.5 through the DDI4 implementation. Ornulf's and Oliver's concern was the potential creation of too many identifiers to be maintained within a Codebook instance, and whether we would be able to handle what is done in 2.5 in a DDI4 codebook.
Larry noted that Colectica seems to have a potential solution to this in their current work: it appears to bypass the Lifecycle 3.2 approach and simply use UUIDs to manage identifiability, which might be a possible solution.
What to do for next meeting?
Oliver undertook to clarify what the relationships between his Study object and the other packages would be (e.g. to DataCapture, Methodology, etc.). We also need to ensure that we keep track of what the requirements are for aligning with DDI 2.5.
Next meeting: Monday 12 October, 8am U.S. Eastern time. Note that there will be changes for other locations due to daylight saving time.
Simple Codebook Meeting
November 9, 2015
Present: Dan Gillman, Oliver Hopt, Larry Hoyle, Mary Vardigan

The group discussed whether Data Capture had made enough progress to enable Codebook to move forward. Mary will get in touch with Barry about this.

In terms of Oliver's model (the second model he proposed), the next step would be to bring in information from other groups. Access conditions was the only area not yet covered. We need to ensure that everything in Oliver's model is covered (except for Access Conditions). Oliver will go through the group's spreadsheet and map it to this model to ensure full coverage. We also need to ensure that we have adequate methodology information, and that full file-level documentation is enabled (not just study level).

Do we want to include all of the datum-level information for reuse? This may be too much for the codebook view, which has traditionally been a more flat view of a study and the files it produces. There is a connection between variable and datum, so if we want this to be part of codebook or an extended version, it is possible. Do we care about anything other than the instance variables in Codebook? A codebook is something you get with a file that lets you use and interpret it, but if you have pointers to represented variables and conceptual variables you can do more. Since codebooks are created ad hoc, that's how it's designed: there is no guarantee that the way someone creates a conceptual variable is the same as how someone else creates it, so there would be no semantic interoperability. But in a future world, comparability can be designed into newer surveys. A DOI pointing to what has been defined elsewhere would be OK.

We have polled various organizations to see which elements they use. Do we need to continue elements that are not used? This is a good point in time to simplify. To survey DDI 3 usage, Oliver has a small XSL transformation that produces statistics on the downward paths used in any given document, which could be helpful. In Data Description, there was a related discussion about how far we should chase legacy file layouts; in one sense you want to encourage people to do things in simpler ways rather than with more complicated formats.

It was decided that the ability to include references to represented and conceptual variables is a good addition to codebook, bringing in the notion of reuse.
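Oliver's XSL transformation is not reproduced here, but the idea of reporting statistics on the downward element paths used in a document can be approximated with a short script; the input file name below is a placeholder.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def path_statistics(xml_file: str) -> Counter:
    """Count how often each element path (root/child/.../element) occurs,
    giving a rough picture of which parts of the schema a document uses."""
    counts = Counter()

    def walk(element, path):
        tag = element.tag.split("}")[-1]     # drop any XML namespace prefix
        current = f"{path}/{tag}" if path else tag
        counts[current] += 1
        for child in element:
            walk(child, current)

    walk(ET.parse(xml_file).getroot(), "")
    return counts

# Placeholder file name for illustration:
for path, n in path_statistics("example-ddi-codebook.xml").most_common(10):
    print(n, path)
```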
Simple Codebook Meeting
November 23, 2015
Present: Dan Gillman, Michelle Edwards, Steve McEachern, Larry Hoyle
Goal for next meeting – December 7, 2015: