Notes of 30 November meeting

Attendees: Dan Gillman, Larry Hoyle, Steve McEachern

Discussion began with the distinctions between precision and number of digits, and similarly between intended and physical data types. (Note that there are differences introduced by the choice of platform.) Larry provided an update on the analysis of the DD model that he had used in rendering the Australian Election Study codebook in DDI4. There was one outstanding item to be resolved: the number of variables.

Discussion:
To be added as an issue for the Modelling group - suggest adding into SimpleCollection.

As part of the discussion, the group identified a need to file a further issue to allow representation of groups of VariableGroups (and more generally of "Groups of Groups"). Both issues were filed for the Modelling group.

At this point, the Data Description group is satisfied that the model is sufficient to support the requirements of the DDI prototype, and is ready for handover to the Modelling group.

Proposed next meeting:
Data Description Meeting - 5 October 2017

Attendees: Larry Hoyle, Dan Gillman, Dan Smith, Steve McEachern, Jay Greenfield

The agenda for this meeting was to outline the basic work program for Dagstuhl.

Questions on Larry's work

To facilitate this discussion, Larry walked through his slides (JIRA issue 20):

1. Is Datum "the thing we have written down", or "the thing we are observing"? This is an application of the Signification pattern.
2. Is Datum in the LogicalRecord or in the PhysicalRecord? Datum - the Sign - is in the LogicalRecord.
3. Is DataPoint in the LogicalRecord or in the PhysicalRecord?
Touchpoints for DataDescription and DataCapture

Proposals coming from the DataCapture group:

1. When creating a ResponseDomain for use within either RepresentedMeasure or RepresentedQuestion, they would like to be able to reference a RepresentedVariable in the cascade. In DDI3 you could have multiple domains joined together; the proposal for DDI4 is a 1-to-1 relationship between a ResponseDomain and the ValueDomain associated with a RepresentedVariable. Note also that Capture is REUSABLE - and therefore Capture is REPEATED and PROSPECTIVE.
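The 1-to-1 relationship proposed in point 1 above might be sketched as follows. This is a minimal illustration only; the class and attribute names mirror the DDI4 vocabulary but the Python shapes are assumptions, not the model itself.

```python
from dataclasses import dataclass

@dataclass
class ValueDomain:
    """The set of allowed values, e.g. a code list or a numeric range."""
    description: str

@dataclass
class RepresentedVariable:
    """A variable in the cascade that carries a concrete representation."""
    name: str
    value_domain: ValueDomain

@dataclass
class ResponseDomain:
    """Proposed DDI4 shape: exactly one ValueDomain, reached via a
    reference to a RepresentedVariable in the cascade (1-to-1)."""
    represented_variable: RepresentedVariable

    @property
    def value_domain(self) -> ValueDomain:
        return self.represented_variable.value_domain

# A RepresentedQuestion's ResponseDomain then reuses the variable's domain:
age = RepresentedVariable("age", ValueDomain("integer 0-120"))
rd = ResponseDomain(age)
assert rd.value_domain is age.value_domain
```

The design point is that the ResponseDomain holds no domain of its own: it always resolves to the ValueDomain of the referenced RepresentedVariable.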
This is the RETROSPECTIVE case. DC have not gone to this in the Capture model, but there is the capacity to record the SourceCapture in an InstanceVariable. Data collection would have to be done as a PROCESS. However, we do want to ensure that the InstanceVariable is able to point to the Capture that created it. Should this be an InstanceCapture? Dan Smith suggests probably yes.

Dan G. suggests one means for traversing the questionnaire is by working up the cascade to the concept. Dan S. suggests that there is actually a graph - one path through the Cascade, and the other through the Instrument to the Concept (finding all the Data that have been collected from this instrument).

There is still the open discussion of where the Sentinel values should fit in the cascade - Dan S. suggests putting them at the Definition level rather than the Usage level. Larry also noted that we still need to keep in mind Units (particularly changing Units, e.g. in data harmonisation).

3. Common data elements (This is coming from the ICPSR project. Jay also identified the NIH example of this: https://cde.nlm.nih.gov/home) These are definitions that combine Questions with Representations. Dan S. suggests that we explicitly model this in DDI4. The CDE is the Item (RepresentedQuestion or RepresentedMeasure), plus its ResponseDomain, plus the RepresentedVariable it creates. We may also want the ConceptualVariable. This could be a View, which brings together the relevant content from DC/DD, etc.

Work program for Dagstuhl:

1. Addressing the above touchpoints 1 & 2
2. Reviewing and resolving the maturity issues identified by Jay in LogicalDataDescription
3. Exploring the CommonDataElements use case (item 3 above). Feeds into the UseCase program in Week02.

Work process for Dagstuhl: Jay notes that we don't have everyone in the room. We will need to coordinate possible dial-in times at Dagstuhl. Noted that end of day in Germany is start of day in the US (4pm in Dagstuhl is 9am in Minneapolis).
Next meeting: To be confirmed - will be early November after Dagstuhl workshops
Data Description meeting, 14 January 2016, 2100 CET

Attendees: Barry Radler, Flavio Rizzolo, Dan Smith, Jay Greenfield, Ornulf Risnes, Steve McEachern, Dan Gillman (from 21.40 onwards)
Apologies: Larry Hoyle

There were three outstanding questions from the previous meeting designated for discussion - see previous meeting notes below.

1. Relationships between DataPoint and DataStructure

It was agreed to remove the relationships between DataPoint and DataStructure
and then to add the same two relationships from DataStructure to InstanceVariable.

Questions on this point:
Dan's argument: DataRecord and DataStructure store data, but Viewpoint stores relationships.
Flavio: DataStructure has homogeneous DataRecords only (confirmed by Ornulf).
THUS - the definition of DataStructure needs to state that it is a homogeneous set of DataRecords.

Agreed that the following needs to be added to the model documentation:
Further questions:

Dan: How do we associate specific Viewpoints with the DataStructure?
Jay: Can a Viewpoint describe, for example, an RDF triple? Dan suggests that this might be possible with the use of Roles (e.g. Predicate is defined as an Identifier role for an IV).
Ornulf noted that some of the uses here are documented in the paper he and Dan authored at the Dagstuhl sprint: https://docs.google.com/document/d/1-vxWdastNsTWMf8qlR35wj1128FNSX-4YBrA_MJBaLk/edit

Different Viewpoints could be layered on top of the DataRecord. You also don't necessarily need to use the Viewpoint. Dan S. noted that there are three layers that can be used:
You will always need to use the DataStructure, but the other two will be optional.

DataStructure will therefore have the following relationships:
2. ORDERING

Agreed that ordering of DataRecords in a DataStructure should be possible but OPTIONAL. Ordering of InstanceVariables in a DataStructure still needs to be clarified.

3. Use cases

This point wasn't covered directly in the discussion. Agreed that there is a need for testing use cases against the model now, but we first need to finalise the clean-up of Lion (per Wendy Thomas's review - see minutes below). Agreed therefore that Flavio would update Lion/Drupal, and we would have a special meeting Monday Jan 25 to review this, ahead of the regular meeting on Jan 28. Steve, Jay and Flavio will convene the review meeting, with others welcome if available.

Actions:
Next meeting(s):
a) Review meeting Monday Jan 25th, time TBC.
b) Regular meeting Thursday Jan 28th, 10PM CET, GoToMeeting: https://global.gotomeeting.com/join/148887013

(Note that the meeting time will return to 10pm CET for the next regular meeting.)
Meeting minutes 17/12/2015

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern, Barry Radler, Ornulf Risnes, Chris Seymour, Dan Smith

Dan Gillman opened with a review of the PPT he provided earlier this week on "Tracking Datums". Key points in Dan's proposal:
Jay: What about the collection of copies of the Datum? What is this thing (if not Datum)?
Larry: How do we identify the particular Datum that is put into the DataPointInstance?
Jay asked whether Dan wants a class to indicate that all of the Datums represent the same conceptual thing. Dan agreed.
Ornulf: If we have access to the Variable Cascade, can we infer the relevant concepts associated with the Datum?
Ornulf: What does this add that we don't already have?
Jay's interpretation was that the RHS of Dan's model could improve the model, while the LHS is more complicated. He suggests that there are two roads:
Dan: The aim of his model is to associate a copy of a Datum and an InstanceVariable into a DataPointInstance.
Ornulf: Not comfortable with where we are at. He argues that we CAN re-use DataPoints, and that we can track DataPoints (he is currently doing this in RAIRD). Dan asks whether Ornulf can reuse STRUCTURES.
Jay suggests that what Ornulf is doing is actually using DataPointInstance (but naming it DataPoint, as is currently in the model). The question here is fundamentally about reusability.
Larry: Is what is "in" the DataPointInstance a Signifier? And is DataPoint the LOGICAL and DataPointInstance the PHYSICAL?
Dan: The key argument is that we have the concept we want to represent (e.g. the NUMERAL five) and a series of strings that signify the concept (e.g. different strings: "5", "IV", ...).
Dan: What isn't currently covered is the fact that DataPoints can be RE-USED. Ornulf argued that he thinks that's covered, but Dan's position is that we don't yet have the "empty bin".
Dan S./Larry: Are we talking about the difference between a logical and a physical, between empty and populated, ...?

(Dan G. left the meeting at this point.)

Dan S. suggests that everything Dan G. is covering is represented in the current version of the model in Lion - in particular, we can address a DataPoint from the InstanceVariable and DataRecord. HOWEVER, Dan S. did have a concern that Ordering in the DataStructure is ordering DataPoints. Dan S. suggests that ordering should be of InstanceVariables, and argued that the DataStructure relationship should be to InstanceVariables rather than DataPoints. Larry asks whether the relationship should be between the DataRecord and InstanceVariables. Dan notes that if the Record complies with the Structure, then that isn't necessary.

Questions for discussion at the next meeting:
Next meeting: January 14, 2016. GoToMeeting: https://global.gotomeeting.com/join/148887013

Proposed time is ONE HOUR EARLIER - 2100 CET. Steve to poll group members about this.

NOTE ALSO: NO MEETING DECEMBER 31
HOW FAR DO WE WANT TO GO WITH WHAT WE DESCRIBE?

Jay has put together a deck and has a proposal. He is modifying the GSIM model of "data set". The first interesting point is that the way GSIM represents attributes doesn't give them the possibility of having a structure. We'd want to modify it so it could have a structure. This would be a hook to enter what Larry and Arofan are doing.

Discussion took place about what defines 1NF/3NF in the GSIM model and Jay's proposal. But does it matter, or can the terms be changed for description? The description that Jay proposed makes sense, but terms should be changed to avoid NFs. Attributes need to be worked into the GSIM model as they are variables. There are variables in the attribute sets.

LARRY - In DDI do we want to model a datum as a collection of variables or a single variable?
DAN - It's a single variable.
LARRY - But then Ornulf describes a datum as a collection of variables.

So what are the terms to be used if we're calling a datum a single variable?

Datum
Data Structure
Coming back to Jay's material this morning: 2 different types, the logical record and the basic idea of the key-value pair (reordering above).
Would the key-value pair possibly be triples? Graph data? Where are we in relation to the work done yesterday? We have a basic structure with which to describe a CSV file.

DAN - What could be called a key-value triple contains a variable (attribute), unit (ID), value (measure). (There are parallels between this and the datum structure.) So this is the fundamental thing. Let's use that to define a record, and from that define a CSV. A record is an ordered set of these key-value triples ("kvipples") that share the same unit.

Larry making a proposal: We've got this record which has 3 collections associated with it: ID, Measures, Attributes. Record, ID, Measures, and Attributes are all collections. Then we want to define a structure of records. That can be instantiated as a dataset:

RecordSet: a set of Records (a sub-class of Collection)
DataStore: a store of a RecordSet

STEVE - Can we describe a CSV at this point?

Moving from RecordSet to DataStore we move from logical to physical. We have separated the logical and physical forms. A CSV is one type of DataStore, and all the logical parts are in the RecordSet. Fixed Format is another type of DataStore. What does a Key-Value Triple option look like?

How can this work with aggregated data? GSIM didn't try to tackle them all under one structure; are we trying to do it with one? We can use the basic model of building this up, but we have to interpret it differently and have different relationships associated with it in the case of aggregates. We need to solve the problem of dimensional data. Take the combination of the values of each of the dimensions; every combination defines a different cell. Applied to the unit type in the microdata, it itself defines an aggregate unit.

Record: Cell
Unit Type (e.g. "people")
Dimensions (e.g. "age", "sex")
Measure (e.g. "income")
Key: 40 y.o. male plumbers (1..n components)

The components could be represented by variables. Each kvipple is a cell. And every cell is a record.
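The "kvipple" model sketched above can be made concrete in a few lines. This is a minimal illustration under the group's working terms - the Python shapes and helper names (make_record, to_csv) are assumptions, not part of any DDI specification.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Kvipple:
    """Key-value triple: variable (attribute), unit (ID), value (measure)."""
    variable: str
    unit: str
    value: object

def make_record(unit: str, **values) -> list[Kvipple]:
    """A Record is an ordered set of kvipples sharing the same unit."""
    return [Kvipple(var, unit, val) for var, val in values.items()]

# A RecordSet is a set of Records (the logical form).
record_set = [
    make_record("person-1", age=40, sex="M", income=27000),
    make_record("person-2", age=35, sex="F", income=31000),
]

def to_csv(record_set: list[list[Kvipple]]) -> str:
    """Serialise the logical RecordSet into one physical DataStore: a CSV."""
    out = io.StringIO()
    variables = [k.variable for k in record_set[0]]
    writer = csv.writer(out)
    writer.writerow(["unit"] + variables)
    for record in record_set:
        writer.writerow([record[0].unit] + [k.value for k in record])
    return out.getvalue()

print(to_csv(record_set).splitlines()[0])  # unit,age,sex,income
```

This keeps the logical/physical split from the discussion: all the meaning lives in the RecordSet of kvipples, while the CSV (or a fixed-format file) is just one serialisation of it.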
The unit incorporates the key. Are we losing the dimensions? Does the model work? The only thing that's really changing is the idea that the unit is going from one kind of object to an abstract collection object. It's the set as a completed set - not the individual elements within it - that is the unit. The dimension isn't lost; it's a combination of aggregated variables.

Unit + dimensions + variable + value = Key

The unit is shared by the entire cube. It describes the characteristics of the entire population (working with census data). For the microdata the dimensions are constant (e.g. person). For the macrodata the unit is constant. Key is M, 40. Variable is income. Value is 27,000.

Is the unit the cube or the combination of things in the key? What is the unit? In a microdata case each cell is a record. The unit is identified by the key; it's the interpretation of each cell.

Dimensional data takeaways:
Units, whether groups or individuals, mean different things. The unit is dependent on the key. What's the unit of analysis - the unit of the cube or the unit of the cell? What do we want to do with it? The answer to the unit question lies in where we attach more information. We want to put in rules for putting together different slices to build the RecordSet in the unit. We need to say what the "thing" is before we put everything together. We need to look at how a datum is described from the point of view of the variables.

The following email and links were provided by Ornulf following the call:

Regarding the question of relations; we've lately come across some interesting thinking in what seems to be an alternative (and more forgiving) way of Data Warehousing; Data Vault Modeling:
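Returning to the dimensional-data discussion above - every combination of dimension values defines a cell, and each cell is a record whose key combines the dimensions - a toy cube can make this concrete. This is a sketch of the idea only; the dictionary-of-cells representation is an assumption, not a DDI construct.

```python
from itertools import product

# Unit type shared by the whole cube (e.g. "people"); the measure is income.
dimensions = {"sex": ["M", "F"], "age_group": ["30-39", "40-49"]}

# Every combination of dimension values defines a different cell.
cube = {key: None for key in product(*dimensions.values())}
# None marks the "empty bin": a cell awaiting its datum.

# Populating one cell: the key both identifies and interprets the value.
cube[("M", "40-49")] = 27000  # income for males aged 40-49

assert len(cube) == 4  # 2 sexes x 2 age groups = 4 cells
```

Note how this matches the takeaway that the unit is shared by the entire cube, while each cell's meaning comes entirely from its key (here, the tuple of dimension values).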
In seeking to start creating a simple logical structure, we began by looking at the 4 objects that had been created during Dagstuhl: DataPoint, DataStructure, DataStore, and DataStoreSummary. Dan Gillman also began brainstorming a model of DataStructure along with the group.

Review of the DataStructure led to discussion of whether any parts of it needed to be reviewed and redesigned. A DataStructure is an ordered set of DataPoints (a record). And a RecordSet is a collection of DataStructures (a table).

The discussion raised the issue of types of records and sequence of records. Question - do we want to describe a very simple CSV (all DataPoints in a column are the same variable), or a more complex type, e.g. a Household/Person structure with record type variables and sequence variables? If all records do not contain the same sequence of variables then we need to describe record types and sequences.
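The Household/Person question above can be illustrated with a small sketch in which a record-type variable tells a reader which sequence of variables each record carries. The variable names and layout are purely illustrative assumptions, not any actual study's file description.

```python
# Hierarchical file: record type "H" (household) and "P" (person) carry
# different variable sequences, so the description must cover both types.
record_types = {
    "H": ["rectype", "hh_id", "n_persons"],
    "P": ["rectype", "hh_id", "person_seq", "age"],
}

raw = [
    ["H", "001", "2"],
    ["P", "001", "1", "42"],
    ["P", "001", "2", "40"],
]

def parse(rows: list[list[str]]) -> list[dict]:
    """Pair each raw record with the variable sequence for its record type.

    The first field is the record-type variable; person_seq is the
    sequence variable ordering persons within a household."""
    return [dict(zip(record_types[row[0]], row)) for row in rows]

parsed = parse(raw)
assert parsed[1]["age"] == "42"
```

In the simple-CSV case, record_types would collapse to a single entry and the record-type and sequence variables would be unnecessary - which is exactly the distinction the group is weighing.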
DataDescription Meeting Minutes: Thursday March 26th, 2015

Attendees: Jay Greenfield, Dan Gillman, Larry Hoyle, Barry Radler, Ornulf Risnes, Steve McEachern

Jay walked through the current state of the Process model and what had fed into the work so far. He pointed out that the model (and 3.1 generally) were based on our "traditional" model of questionnaires and datasets, but that new datatypes are now becoming commonplace and possibly dominant. Our recent work has largely been exploring these types. Known cases we are now asked to support include:
Jay pointed out that we need to take on board a new notion of lifecycle - or in other words, per Ornulf, there is more than one way to generate a datum. Dan and Jay both pointed out that in this "new world", we have no clear paths to a datum. This is something that needs to be further fleshed out.

Dan's comment: The logic for questionnaire data is clear: question - observation - capture - datum. Other cases are less so, e.g. derivation, which generates data but requires no question. Here the input is an existing datum. Ornulf noted that a derivation has various characteristics: it has an input datum, a formula for the derivation, and a datum as an output.

Larry gave an example from a clinical psychologist in which a process is used to collect a combination of questions and observations, but the ultimate "thing" being recorded is actually the scale score as the datum. Barry noted that there are similar sections in MIDUS where the parts are not relevant, but it is the whole that matters. Barry points out that the step between capture and datum (subsumed now within Observation and ProcessStep) is "hiding" a number of significant steps - but that we can probably draw on the strength of the process model to document this.

Jay considered the similar case of Computer Adaptive Testing, which works from a battery of test questions to ask a set of increasingly difficult or easy questions, adapting based on previous responses. Dan points out that there are some similar cases in the survey community; Barry gave a similar case of conjoint analysis in marketing, as did Jay in EHR.

It may therefore be appropriate to start digging into the process model to see if we can accommodate some of the above use cases using the current combination of Capture, DataDescription and Process. Jay suggested that we should be exploring these in detail - and that it cannot be rushed.
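Ornulf's characterisation of a derivation - input datums, a formula, and a datum as output - could be sketched minimally as below. The class and its fields are illustrative assumptions for this discussion, not DDI model objects.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Derivation:
    """A derivation per the discussion: existing datums in, a formula,
    and a new datum out - no question involved."""
    inputs: list[float]
    formula: Callable[[list[float]], float]

    def run(self) -> float:
        """Produce the output datum by applying the formula to the inputs."""
        return self.formula(self.inputs)

# e.g. Larry's clinical example: the recorded datum is the scale score,
# not the component items it was derived from.
scale = Derivation(inputs=[3, 4, 5], formula=lambda xs: sum(xs) / len(xs))
assert scale.run() == 4.0
```

The point of the sketch is the shape, not the arithmetic: unlike the question-observation-capture-datum path, the inputs here are themselves datums, which is why the group sees derivation as a distinct route to a datum.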
It would be useful therefore to now develop these use cases to test out the current version of the model, to (a) assess the current objects and process model, and (b) determine what else needs to be included. Suggested worked use cases:
Jay noted his work with Splunk here, where they are always aggregating and disaggregating from the datum level. Dan noted worries here about confidentiality in such a process. Jay also recognised this, but pointed out the access rights associated with each datum as one means to resolve it. Ornulf has also been addressing this in the RAIRD work, using statistical disclosure control on the end products.

Moving forward, it was agreed to take away these use cases and start describing them using the Capture/DataDescription/Process views. Example cases are given above, but it would be good to get additional cases of interest to the members of the group - particularly where group members are collaborating on cases.

This work will require some extensive thinking, so the agreement was made to continue to work on these use cases, but to switch focus for our fortnightly meeting to the Physical Data Description.

Next meeting: Thursday 9 April. Time to be confirmed (due to daylight savings changes in Europe and Aust/NZ). The agenda will be to review and evaluate the current status of Physical Data Description. This will need to focus on:
In preparation, it would be useful if team members could review the three pieces of work so far in this area:
Data Description Meeting 11/3/2015

Attendees: Steve McEachern (ADA, Australian National University), Larry Hoyle (IPSR, University of Kansas), Dan Gillman (BLS), Barry Radler (MIDUS, University of Wisconsin), Simon Lloyd (ABS), Ornulf Risnes (NSD)

We reviewed progress since the last meeting, particularly the document Steve and Barry generated out of the "Linking..." presentation developed by Dan and Jay. This integrated model, bringing together the interface between Capture and DataDescription, is available here as a PDF, with the objects and relationships specified in the document available in the http://lion.ddialliance.org Drupal site.
The general conclusion from the discussion is that the relationship between ProcessStep, Observation and Datum looks sound, but that the ProcessStep and Observation objects may need additional work in order to see if they are sub-classes of a broader type. The next meeting will therefore explore further the requirements both Capture and DataDescription have for the Process model. In the interim, additional email discussion will continue around comments on the Capture-DataDescription link, building on Jay's discussion of similar issues in HL7 and OpenEHR.

The provisional time for the next meeting is Thursday March 26 at 8.00PM Central European time. The GoToMeeting URL is: https://global.gotomeeting.com/join/148887013

However, given Jay's existing work and his role with the Process model, which are the next step in our discussion, we will coordinate times around Jay's availability if required.
2014-03-17 Meeting Minutes

Time: 15:00 CET
Meeting URL: https://www3.gotomeeting.com/join/685990342
Agenda:

1) Status update. Where are we now with SimpleDataDescription? (ØR)
2) Clarify relationship between domain experts and modeler. Define role responsibilities, desired workflow in group (ØR, AW?)
Domain experts add object descriptions and relationships; the modeler puts them into the overall model; then iteration.

What is the status of the round trip (Drupal to XMI to EA)? It works. Is there machine-actionable feedback into Drupal? No. It is possible, but some work is required, and it is not yet clear if there are resources for this task. Furthermore, there are different positions on whether the roundtrip makes sense.
3) Identified issues with the current version (ØR/all)

a) The model is sparse on properties for InstanceVariable, RepresentedVariable, ConceptualVariable. Out of scope for this group? Comments: These objects currently only exist in the SimpleDataDescription package. Discussion about GSIM/DDI 3.2 and who's responsible for the "core variable objects".

b) Do we need DataSerialisation (the physical counterpart of DataDescription)? DataDescription already relates to InstanceVariable, which relates to Field (column) in the RectangularDataFile. Because of this, a path exists from the Fields in the RectangularDataFile via InstanceVariable up to DataDescription and "TOFKAS".

c) DataSerialisation has no relationship to RectangularDataFile. If we decide to keep DataSerialisation, the relationship to RectangularDataFile must surely be added.
4) TODO: Identify outstanding tasks (ØR/all)
See above.
6) Plan milestones (based upon TODO-list, goals and availability) (ØR/all) Overall milestone plan/timelines to be clarified during NADDI sprint. Thérèse Lalor (ABS) is currently the project manager for DDI4 - but only until July 2014.
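The "core variable objects" discussed under item 3a form the familiar cascade, and item 3b's path from a Field up to DataDescription runs through it. A minimal sketch of how they relate follows; the attribute names are assumptions for illustration, not the Drupal/Lion model.

```python
from dataclasses import dataclass

@dataclass
class ConceptualVariable:
    """Concept plus unit type, independent of any representation."""
    concept: str
    unit_type: str

@dataclass
class RepresentedVariable:
    """Adds a concrete value representation to the conceptual variable."""
    conceptual: ConceptualVariable
    representation: str

@dataclass
class InstanceVariable:
    """A RepresentedVariable as used in one concrete dataset; relates to
    a Field (column) in a RectangularDataFile, per item 3b."""
    represented: RepresentedVariable
    field_name: str

# Walking up the cascade from a Field to the concept:
iv = InstanceVariable(
    RepresentedVariable(ConceptualVariable("income", "person"), "integer USD"),
    field_name="INC2014",
)
assert iv.represented.conceptual.concept == "income"
```

This also shows why a separate DataSerialisation may be redundant (item 3b): the InstanceVariable already links the physical Field to the logical description.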
Other notes: