Simple Data Description meeting minutes

 30 November 2017 meeting minutes

Notes of 30 November meeting

Attendees: Dan Gillman, Larry Hoyle, Steve McEachern

Discussion began with the distinctions between precision and number of digits, and then similarly between intended and physical data types. (Note that there are differences introduced by the choice of platform.)

Larry provided an update on the analysis of the DD model that he had used in rendering the Australian Election Study codebook in DDI4.

There was one outstanding item to be resolved: Number of variables

Discussion:

  • Is this a specific case of something that could assist for collections more generally (the number of items in a collection)? It could be added as an attribute. Dan noted that the "number of items" may not be static - so it may be difficult to include.
  • Larry suggested that this could be tied into VariableCollection (the former VariableGroup). Steve suggested that this is conflating two issues - number of variables in a record (our specific problem) and VariableGroups.
  • Dan suggested instead to include this as an optional attribute in the Collection pattern. (Could consider static versus dynamic in the future).

To be added as an issue for the Modelling group - suggest adding into SimpleCollection.

As part of the discussion, the group identified that there was a need also to file an issue to allow representation of groups of VariableGroups (and more generally of "Groups of Groups").

Both issues were filed for the Modelling group.

At this point, the Data Description group is satisfied that the model is sufficient to support the requirements of the DDI prototype, and is ready for handover to the Modelling group.

Proposed next meeting:
Hiatus in December
Return to meet week of January 8th
Meeting day and time to be discussed and confirmed.

 5 October 2017 meeting minutes

Data Description Meeting - 5 October 2017

Attendees: Larry Hoyle, Dan Gillman, Dan Smith, Steve McEachern, Jay Greenfield

Agenda for this meeting was to outline the basic work program for Dagstuhl.
Key priority for Dagstuhl is to finalise links/interactions between DD and Data Capture.


Questions on Larry's work

To facilitate this discussion, Larry walked through his slides (JIRA issue 20)

1. Is Datum "the thing we have written down", or "the thing we are observing"?

This is an application of the Signification pattern
Signifier - the string - FormatDescription
Signified - the thing that is being represented (the label/handle) - Observation (both are NOUNS here)
Sign - the association between the Signifier and the Signified - Datum
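As an illustration only (a plain Python sketch, not the DDI4 classes themselves), the mapping just described might look like this:

    # Illustrative sketch of the Signification pattern as applied above.
    # The class names echo the pattern roles, not DDI4 class names.
    from dataclasses import dataclass

    @dataclass
    class Observation:        # the Signified: the thing being observed
        unit: str
        variable: str

    @dataclass
    class Signifier:          # the string as physically written down (FormatDescription side)
        text: str

    @dataclass
    class Datum:              # the Sign: the association between Signifier and Signified
        signifier: Signifier
        signified: Observation

    # e.g. the string "45" recorded as the age of unit 1
    age_of_unit_1 = Datum(Signifier("45"), Observation(unit="1", variable="age"))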

2. Is Datum in the LogicalRecord or in the PhysicalRecord?

Datum - the Sign - is in the LogicalRecord
Signifier - the string - is in the PhysicalRecord (in the FormatDescription)

3. Is DataPoint in the LogicalRecord or in the PhysicalRecord?


Further comment from Dan Smith: Trying to clarify where Observation comes from. It is not in the DataCapture model or in DD. Later discussion suggests that Observation is probably a process - we will need to develop this in Dagstuhl.


Touchpoints for DataDescription and DataCapture

Proposals coming from the DataCapture group:

1. When creating a ResponseDomain for use within either RepresentedMeasure or RepresentedQuestion, would like to be able to reference a RepresentedVariable in the cascade

You could have multiple domains joined together in DDI3 - proposing for DDI4 a 1-to-1 relationship between ResponseDomain and a Value Domain associated with a RepresentedVariable.

Note also that Capture is REUSABLE - and therefore Capture is REPEATED and PROSPECTIVE.


2. After data has been collected, how do we say where it has been collected from?

This is the RETROSPECTIVE case.

DC have not gone this far in the Capture model, but there is the capacity to record the SourceCapture in an InstanceVariable.

Data collection would have to be done as a PROCESS. However we do want to ensure that the InstanceVariable is able to point to the Capture that created it.

Should this be an InstanceCapture? Dan Smith suggests probably yes.

Dan G suggests one means for traversing the questionnaire by working up the cascade to the concept. Dan S suggests that there are actually two routes through the graph - one through the Cascade, and the other through the Instrument to the Concept (finding all the Data that have been collected from this instrument).

There is still the open discussion of where the Sentinel values should fit in the cascade - Dan S suggests putting them at the Definition level rather than the Usage level.

Larry also noted that we still need to keep in mind Units (particularly changing Units, e.g. in data harmonisation)


3. Common data elements

(This is coming from ICPSR project. Jay also identified the NIH example of this: https://cde.nlm.nih.gov/home)

These are definitions that combine Questions with Representations. Dan S suggests that we explicitly model this in DDI4.

The CDE is the Item (RepresentedQuestion or RepresentedMeasure), plus its ResponseDomain, plus the RepresentedVariable it creates. May also want the ConceptualVariable.

This could be a View, which brings together the relevant content from DC/DD, etc.


Work program for Dagstuhl:

1. Addressing the above touchpoints 1 & 2

2. Reviewing and resolving the maturity issues identified by Jay in LogicalDataDescription

3. Exploring the CommonDataElements use case (item 3 above). Feeds into UseCase program in Week02.

Work process for Dagstuhl: Jay notes that we don't have everyone in the room. We will need to coordinate possible dial-in times at Dagstuhl. Noted that end of day Germany is start of day in US (4pm in Dagstuhl is 9am in Minneapolis).


Next meeting:

To be confirmed - will be early November after Dagstuhl workshops

 21 September 2017 meeting minutes

Meeting minutes, 21 September 2017

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern


The meeting focussed on triage and resolution of open Jira issues. Status update on issues:

Resolved
DDI4DATA-6
DDI4DATA-13
DDI4DATA-17
DDI4DATA-18
DDI4DATA-19

Complete prior to Dagstuhl
DDI4DATA-7 (Larry)
DDI4DATA-8 (Jay)

For discussion at Dagstuhl
DDI4DATA-1
DDI4DATA-3
DDI4DATA-4
DDI4DATA-20

Next meeting

Focus on Dagstuhl preparation and planning

Date: Thursday 5 October, 2017, 1500 CEST

 14 September 2017 meeting minutes

Data Description meeting 14 September 2017

Attendees: Jay Greenfield, Larry Hoyle, Steve McEachern

Discussed the development of the graph approach that's being developed in the modelling group.

Considered whether there is a situation where there may be a hierarchy of ClassificationSeries

Worked through open JIRA issues:

DDI4DATA-19 https://ddi-alliance.atlassian.net/browse/DDI4DATA-19
Where to put the file name?
Looked further at PhysicalInstance from DDI3.3
Sense is that we will need to work on more modelling of this to bring this in.
Assigned to Jay

DDI4DATA-1 https://ddi-alliance.atlassian.net/browse/DDI4DATA-1
Several properties related to the content of cells that are lists

Will return to issue list next week (21 September).

In the meantime, Steve to review open issues, Jay to work on FileName proposal.

Next meeting: 21 September, 1500 CEST

 13 August 2017 meeting notes

Data description

Larry, Jay, Steve

Jay described some of his current work with Chifundo in Africa - may be an interesting use case.

Larry then went through his review of the current state of the Data Description / Capture relationship.
Specifically the presentation he had developed at the end of the last meeting.

Points to note:
- Is there a level below the Instance level - namely the Use of that Instance?
- There is a need to link a Capture to a Datum.
- What's missing are the relationships between the packages (and the specific objects that bridge between packages).

Steve noted that some of the ideas covered in Larry's ppt revisited discussions that we had explored in Sept 2015. This content is worth reviewing again - along with the related discussion which occurred in Sept - Dec 2015.

Next meeting: Propose to convene a joint session with the Capture group to develop shared content and relationships, expressed as relationships in the model.

See Data Description JIRA issue 20 for a copy of Larry's slides.


 29 July 2017 meeting minutes

July 29 2017  meeting minutes

Attendees: Dan Gillman, Steve McEachern, Larry Hoyle, Jay Greenfield


Discussion for this meeting - What do we need to do now?

  • Clean up the remaining open issues from the Review
  • Preparations for the Dagstuhl review
  • Look at the Data Capture to Data Description relationship
    - What's not in Capture?
    - Could Capture take what we have, and bring it back into their model?

Points for Capture discussion:
1. need to work with the Capture group to identify and model the relationships between the two parts
2. should there be a parallel structure from Capture to Description?
3. Consideration of whether there is a level below the Instance Instrument (like we have discussed the format level below the Instance Variable)

On point 3 - Jay asks: is that structural or process? Dan indicated that he is having a little trouble moving to this sub-level. Dan's question: Do we really want DDI recording what is happening during some process? Larry notes Datums are recordings. Attributes could also be paradata.

So essentially here - are we using DDI to record paradata? Larry suggests that we don't want a structure specifically for paradata - that Viewpoint can handle this. Dan wonders the reverse - that there might be specific structures required. eg. How long does it take to answer a question? How long does it take an interviewer to ask a question? Probably similar issues in other processes?

Larry notes also that we haven't tied the Datum to a Unit - possibly it employs an Agent and executes a Process. Dan is ok for this?

Jay - returning to Viewpoints - without specifying what the paradata elements are, the Viewpoint does at least allow a way of enumerating them. Jay suggests that we will want to specify what these paradata attributes consist of - we may just need to see if/how they can be applied in the paradata context.

Dan expects and hopes we could provide a general structure for this - so is agnostic on how this could be achieved at this point.

Larry - could use Viewpoint to enable paradata to be captured at the Unit Record level.

Dan - thought experiment on paradata. Posits paradata are METAdata - that they describe something (e.g. the interview process). An alternative definition is that paradata are "data about data", or "data about the process".

Dan and Larry discussed whether the challenge is that this process paradata may not apply at the unit level at some point (e.g. sampling or perturbation process). At that level, would there need to be a different paradata content (e.g. aggregate versus unit paradata).

Jay pointed out that there are other combinations of things that we would apply similar processes to - e.g. the performance of a questionnaire item, or the length of an instrument.

Dan response: are we using paradata as descriptive data, or for statistical analysis? If descriptive, then we may need something different (Steve isn't so sure though). If statistical - then the generalisable case probably applies.

Next steps:
- Larry to distribute content he has been working on in this space mapping Capture - DD relationships (and possible changes to Capture). See attached.
- Steve and Jay to meet next week to see if we can map the Viewpoint model to paradata
- Need to round out the outstanding items on the Q2 and Lawrence reviews (particularly 17 and 18)


 29 June 2017 Meeting Minutes

Data Description Meeting notes 29 June 2017

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern

Discussion on realisation of new collections pattern and its implications

- Jay noted that Wendy was going to work on CodeList realisation
- Intent was to include a realisation of the Signification pattern
- At the CodeList level, we would be able to introduce concepts
- Talk about concepts as a whole, but also Concepts at the CodeItem level
- Wendy and Jay indicated that they won't engage any of the NodeSet and Node classes
- Will hopefully then have a framework to roll out elsewhere (particularly Statistical Classification)

There was then some question about the application of a concept to the Statistical Classification (Dan wasn't following the approach Wendy - and Jay - had developed based on the current state of the MODEL)
- Larry noted that we can currently model Concepts without a Signifier (through a Name and Label).
- What is the difference between a StatisticalClassification and a CodeList?

  • In a CodeList, you are NOT assuming exhaustivity of the CATEGORIES. Also in a CodeList, you might be using MULTIPLE categories - you are not using the list for a classification to an INDIVIDUAL category.

- Larry: is there a need for additional properties on a CodeList - that would then make it a SC?
- Characterisation of SC is having properties of EXHAUSTIVITY and MUTUAL EXCLUSIVITY
- Three cases of possibly the same thing: StatisticalClassification, CategorySet and CodeList
- The distinction may come down to USAGE rather than STRUCTURE
- Thus, can we model a SC as a USE of a CodeList that imposes EXHAUSTIVITY and MUTUAL EXCLUSIVITY?
- We would need to ensure that a CodeList has all of the structures needed to represent a complex StatisticalClassification
- (We may also have a naming problem - people will look at the model and EXPECT to see something called a "Classification")
- (Also: how would you represent the MultiSelect option in ...)

Result of this discussion:
1. Does maintaining the distinction of SC and CL make sense given the above?
2. Potential alternative: SC becomes a use of CL, with relevant attributes and structures added so that representation of the SC is possible
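As a rough illustration of the two properties under discussion (hypothetical structures, not DDI4 classes): a code list behaves as a statistical classification when its categories are mutually exclusive and exhaustive for the values being classified.

    # Illustrative sketch only - the dict keys and function name are hypothetical.
    def is_usable_as_classification(code_list, observed_values):
        codes = [item["code"] for item in code_list]
        mutually_exclusive = len(codes) == len(set(codes))              # no code appears twice
        exhaustive = all(value in codes for value in observed_values)   # every observed value is classified
        return mutually_exclusive and exhaustive

    marital_status = [
        {"code": 1, "category": "Married"},
        {"code": 2, "category": "Never married"},
        {"code": 3, "category": "Other"},
    ]

    print(is_usable_as_classification(marital_status, [1, 2, 2, 3]))  # True
    print(is_usable_as_classification(marital_status, [1, 4]))        # False: 4 is not classified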

Updates on Q2 Review and Kansas sprint items:
The following items were updated:


Next meeting

Steve is unable to attend the next scheduled meeting on Thursday 13 July. Either another group member will chair or a different meeting time will be determined.


 14 June, 2017 meeting minutes

Meeting minutes for Data description meeting, 15 June 2017

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern

Implications of Lawrence sprint

  • Revisited collections.
  • Previous version required writing out all the relationships as pairs.
  • Revised version is "New Collections" - Wendy has posted this on Lion.
  • Wendy will then be working up some of the realisations.
  • Big implications exist for Ordered Relations

Will need to update

  • LogicalDataDescription
  • PhysicalDataDescription
  • ValueDomains, Classifications, CodeLists etc.

All are likely to be affected by this change.

Don't yet have the names for the classes, but will make changes when they are established.

What are the implications of the changes to Collections pattern?

  • Don't need to list all the Variables in a Collection, and then list their relations in pairs
  • One example of this was the LogicalRecord and PhysicalRecord (and their relations).
  • We would probably now have a DataRecord which is a list of DataPoints, or possibly also an OrderedListRelation (to describe a hierarchy).
  • Lots of implications, and considerations in a number of directions.
  • This is all Dan's fault :-).

Question for this group: how to manage this next development.

  • One area for consideration is the GSIM (and now DDI) Node Set.
  • We no longer need the generic method, but can put all the power in the pattern.
  • Don't need the intermediate "realisation" classes.
  • Instead we just realise the pattern, and add in the particular attributes for the type of pattern we want, rather than relying on inheritance.

When will we have a useable "NewCollections" pattern to use?

  • Wendy is working on Realisations now, and the Modelling group checked these yesterday
  • We could also work on realisations in parallel, to see if we can leverage the Pattern to realise our own classes.
  • Suggest that we use "Powerpoint engineering" to develop this in the first instance.
  • We are also trying to figure out if we can break the pattern.
  • We would then move to Powerpointless engineering once we are confident in the pattern itself.

So what are we going to pick first?

  • Before answering this, Larry raised a related question of the DataDescription view - and what is the difference between DD and Codebook. Dan suggested that DD is subset, but Jay suggested it is probably an intersection.
  • Dan asked whether the Codebook focussed more on InstanceVariables than the full cascade. Jay noted however that DD also enables the management of data at other points in the data lifecycle - that Codebook often related to data at the time of PUBLISHING. Dan thought that this could still be a subset of Codebook. There was some disagreement here :-).
  • Larry noted that the key distinction probably is that DD includes DATUMs - but a Codebook would not. This ends up then being more about the touchpoints between the two (and also Data Capture). This needs to be one of the key outputs of the Dagstuhl meeting.
  • This also leads into Marketing - we need to produce Views that PEOPLE CAN USE. A good example is the World Bank. They won't use the DDI codebook (at least not right now) - they will rather be developing a CORE codebook plus a set of supplemental codebook extensions for particular requirements.
  • We are also well positioned to string together multiple views into a larger view to accommodate particular needs (such as the World Bank). The production framework does allow us to do this.

So what to focus on first?

  • Dan suggested something we use a lot - VALUE DOMAIN.
  • Dan will take the first shot at this, and distribute for discussion.
  • Jay will also take a first run at RecordLayout.

We would then progress to other key realisations:

  • ValueDomain
  • RecordLayout
  • CodeList
  • CategorySet
  • ConceptualDomain
  • Classification (note that these fall within Conceptual, but we need them)

Remaining items in Q2 review:

  • Dan's paper on Justifying the Variable Cascade - is written, and will be posted for discussion (Wendy and Larry have received this)
  • Paper on Bridging the Gap from DDI to SDMX - need to develop to address Issue 13
  • Paper on "multiple Sentinel Conceptual Domains, and then look at how they could be used in combination" - Steve is still developing this, and will bring back for the next meeting

Next meeting:

  • We will review the open items, so that we can see the extent to which the above papers close out some of the open issues.
  • Before next meeting, Steve will review the open issues on the review and update their status.

Agenda for June 29 meeting:

  • Confirm updates to the Q2 Review
  • Examine the first two Collections realisations: ValueDomain and RecordLayout


 21 April, 2017 Meeting minutes

Meeting minutes April 21, 2017

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern

Discussions continued on the applications of the variable cascade, in particular the 2 possible applications of the conceptual variable and conceptual domain.

Initial discussions suggest that a prescriptive view mapping the conceptual domain to the value domain on a one-to-one basis may be the preferred approach - Dan will consider this further in his discussion paper - to provide an unassailably clear explanation of the variable cascade.

 March 30 2017 meeting minutes

Meeting minutes, March 30, 2017

Attendees: Larry Hoyle, Jay Greenfield and Steve McEachern

Larry raised the question of the complexity of code lists

Nodelist -> CodeList; CodeItem -> Designation; Designation -> ... etc.

There's a large number of classes to be completed for a variable. He wondered, for example, whether a CodeItem should just have a PROPERTY of a Code.
Jay suggested that a lot of this came from GSIM. (Often relates to the management of Code Lists / Statistical Classifications). It isn't clear from the GSIM approach how codes are associated with categories.

Larry demonstrated the extent of description that is required to describe a single variable. Jay noted that the reusability is beneficial in various situations. Steve also noted that some of this complexity is also present in earlier DDI versions. The lack of reusability is also a problem for DDI-C (the same code lists then are repeated over and over).

Sense from the group was that we should put the capacity for "trimming" the instance on the table and see how this works (probably in the tools, not in DDI-Views). Might be a recommendation for tools as to how to approach this.

Next meeting: Thursday April 20, 1500 CEST

Comparable times:

Mannheim, Germany Thu, 20 Apr 2017 at 3:00 pm CEST
Bergen, Norway Thu, 20 Apr 2017 at 3:00 pm CEST
Canberra, Australia Thu, 20 Apr 2017 at 11:00 pm AEST
Washington DC, USA Thu, 20 Apr 2017 at 9:00 am EDT
Ottawa, Canada Thu, 20 Apr 2017 at 9:00 am EDT
Lawrence, USA Thu, 20 Apr 2017 at 8:00 am CDT
Madison, USA Thu, 20 Apr 2017 at 8:00 am CDT



 March 09 2017 meeting minutes

Data Description meeting minutes - 09 March 2017

Attendees: Steve McEachern, Larry Hoyle, Dan Gillman

Upcoming calendar:
- Proposal is to stick with CET 1400 for upcoming meetings
- March 23: US participants will be one hour later (9am DC, 8am Kansas)
- April 6: No meeting
- April 20: US participants back to current time (8am DC, 7am Kansas), Australia one hour earlier (10pm Canberra)

Today:

Updates on action items from last meeting
- Dan and Steve still working on papers
- Larry has been developing the DDI4 markup in a YAML template

Larry walked the group through the template


Noted in reviewing:

1. Proposal for the modelling group:
- Higher level concern: what are the interoperability requirements for DDI? What do we need to interoperate on?
- Example was the controlled vocabulary in PhysicalDataType. The CV comes with a significant number of its own additional properties - cVAgencyName, cVID, etc...
- Need to take out a lot of these - they are no longer necessary with URIs.
- Larry and Dan suggested that all of this information could be removed - instead we could get by with just CONTENT and URI - in fact, having the additional properties may lead to interoperability issues. (It could even be a DDI URN).
- i.e. Let's make this a USEABLE standard, not an overwhelming one
Note: we need to review the DataDescription model to find all the instances of this that WE wish to remove.


2. More on intendedDataType and Additivity:

Dan suggested we may want to put Additivity alongside intendedDataType.
This is more of a general problem as well - how do we describe the "intended" or "appropriate" characteristics of the variable? Properties include:
- "Type" of measurement (e.g. nominal -> ratio)
- Unit of measurement
- IntendedDataType

In the end we need a more detailed characterisation of the variable - to allow an algorithm to determine what are the permissible/appropriate machine actionable operations on a particular variable. Need to discuss this further.
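As a sketch of what such machine-actionable characterisation could enable (the typology and property names below are illustrative assumptions, not DDI4 attributes):

    # Hypothetical mapping from level of measurement to permissible operations.
    PERMISSIBLE_OPS = {
        "nominal":  {"count", "mode"},
        "ordinal":  {"count", "mode", "median", "min", "max"},
        "interval": {"count", "mode", "median", "min", "max", "mean", "difference"},
        "ratio":    {"count", "mode", "median", "min", "max", "mean", "difference", "ratio", "sum"},
    }

    def operation_allowed(variable, operation):
        # `variable` is a dict describing e.g. an InstanceVariable; the keys here are illustrative.
        return operation in PERMISSIBLE_OPS[variable["measurementType"]]

    age = {"name": "age", "measurementType": "ratio", "unitOfMeasure": "years", "intendedDataType": "integer"}
    print(operation_allowed(age, "mean"))                                               # True
    print(operation_allowed({"name": "region", "measurementType": "nominal"}, "mean"))  # False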


3. Scale

No place for this (which came out of 3.2).
Again, suggest that this is part of the UnitOfMeasurement?
(e.g. Dozen is 12, Dozen x Dozen = Gross, and Dozen x Dozen x Dozen = Great Gross)


4. InstanceQuestion

Doesn't currently have a relationship to InstanceVariable?
Dan noted that there may not necessarily need to be one.
InstanceQuestion has no direct question text - it is in REPRESENTEDQuestion.
Should be able to inherit this.
Should also be able to express the relationship between the IV and the InstanceQuestion as well.

5. InstanceVariable

Has "measures" from two different things:
- inherits (from RepresentedVariable): "measures" Universe
- relationship: "measures" Population
Larry suggested qualifying the name, e.g. using "measuresPopulation" as the name of the relationship
May be a more general problem in the model where we have inheritance of relationships.

Next meeting: March 23, 1400 CET


 February 23 2017 meeting minutes

February 23, 2017 meeting minutes

Attendees: Dan Gillman, Larry Hoyle, Steve McEachern

The group discussed the open issue 16, particularly considering the "Need to look at multiple Sentinel Conceptual Domains, and then look at how they could be used in combination."

Larry agreed to provide the DDI4 expression of the variables in his sample SAS file, and how these support the 3.2 ManagedRepresentations.

Dan suggested developing an overview of the possible sentinel domains that might exist across the data lifecycle (e.g. data collection, processing, editing, imputation, dissemination), and how the sentinel domains might be used in real-world situations. Dan also noted that there is a difference between substantive and sentinel domains, and that the overview should demonstrate the possible uses within a processing workflow. Steve undertook to develop this paper.


 9 February 2017 Meeting minutes

Meeting minutes: 9 February 2017

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern

Agenda for this meeting was continuing discussion of the open Q2 issues.

Issues 11 and 12:
Jay had added additional description to the LogicalDataDescription. Issues were agreed as resolved.

Issue 13:

FormatDescription now includes ValueMapping - which gives the additional "level" in the variable cascade (below the IV). Dan to write a "Justifying the Variable Cascade" paper walking through the cascade. This is tangentially related to Issue 13, informing the query about relationships from Measure.

Related issue in 13 is the issue of identifying Measures and Dimensions (particularly how to establish the relationships being mapped to SDMX). There was a brief discussion of the need to clarify how Viewpoints could be used to describe these in dimensional data.
Suggestion - to develop a paper "Bridging the gap from DDI to SDMX". Dan Gillman to draft for discussion

Question from DG: In terms of Viewpoints - do the combination of Attribute roles provide an identifier for a Dimension? (Larry asked why we would want to do it?)
Cells contain the measure. Attribute combinations identify the location of the cells. So should the attribute combinations be given an IDENTIFIER role or an ATTRIBUTE role?

Some questions were raised about the cardinality of the three ViewpointRoles. May also want to have the ability to have an attribute on each Measure that is in a cell.
(As there may be MULTIPLE measures within a cell).
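Purely as an illustration of that point (field names are hypothetical, not DDI4 classes), a single cell in a dimensional table might carry roles like these:

    # One cell: identifier-role values locate it, and it may hold several measures.
    cell = {
        "identifiers": {"sex": "female", "ageGroup": "25-34", "region": "ACT"},  # locate the cell
        "measures":    {"count": 1243, "meanIncome": 58200.0},                   # multiple measures in one cell
        "attributes":  {"status": "provisional"},                                 # descriptive attribute role
    }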

Jay notes that having complex identifiers is an (increasingly) common situation - for example with big data systems.

Issue is to be held over while the above papers are developed.


Issue 14: ConceptualDomain and ValueDomain

Several questions are raised here. It does appear that some of the questions will be addressed by the "Cascade" paper above. The questions do also seem to be looking at how to then connect the physical layer.

Dan to develop papers for the next meeting.


Issue 16: Documentation of DDI-Views versions of 3.2 ManagedRepresentations

Larry has prepared a SAS data file with the set of managed representations that exist in DDI3.2 (except for a ManagedScaleRepresentation), and a 3.2 instance documenting these. Now need to put together a DDI-Views instance that does the same.

Last discussion: Larry raised whether sentinel values are missing values. Dan argued the reverse - that missing values are ONE TYPE of sentinel value. Need to look at multiple Sentinel Conceptual Domains, and then look at how they could be used in combination.
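A small illustration of that distinction (the codes below are hypothetical): missing-value codes form one sentinel domain among several possible ones, alongside the substantive domain.

    # Hypothetical substantive and sentinel domains for a single variable.
    substantive_domain = {1: "Strongly agree", 2: "Agree", 3: "Neutral", 4: "Disagree", 5: "Strongly disagree"}
    sentinel_domain    = {-7: "Refused", -8: "Don't know", -9: "Not applicable"}

    def classify(value):
        if value in substantive_domain:
            return "substantive"
        if value in sentinel_domain:
            return "sentinel"
        return "undefined"

    print(classify(2))    # substantive
    print(classify(-8))   # sentinel (a missing value is one type of sentinel value)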

Issue held over for next meeting.



 January 26, 2017 meeting minutes

Meeting minutes 26 Jan 2017

Attendees: Dan Gillman, Larry Hoyle, Jay Greenfield, Steve McEachern


The meeting focussed on discussion of the open issues from the Q2 review.

Comments were added to JIRA and status updated for all items discussed, and the comments are replicated below.


DDI4DATA-9

A value domain could participate in several DataTypes. As such, incorporating IntendedDataType into ValueDomain would be restrictive.
Example - if you add up a set of numbers with a Scale DataType, the result will be different from adding a set of numbers with a floating point DataType (which has greater precision).

Right now we have IntendedDataType on RepresentedVariable - this seems to the group to be the correct place for the attribute.

Status: resolved


DDI4DATA-10

New property formatPattern to be added to ValueAndConceptDescription. Will be a 0...1 property and will use the UAX35 standard (see link in the Issue Description).
Issue assigned to Larry to complete this work.

Status: In progress
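For reference, UAX #35 (the Unicode LDML specification) expresses format patterns as strings like the following; these particular examples are illustrative, not values prescribed by the issue:

    # Example UAX #35 / LDML format patterns of the kind formatPattern could hold.
    example_format_patterns = {
        "decimal_number": "#,##0.##",    # grouped thousands, up to two decimal places
        "currency":       "¤ #,##0.00",  # currency-symbol placeholder plus amount
        "percent":        "#,##0 %",
        "iso_date":       "yyyy-MM-dd",  # e.g. 2017-02-09
        "time":           "HH:mm:ss",
    }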


DDI4DATA-11 & 12

The comments here (in Issues 11 and 12) suggest that there is some misunderstanding of the purpose of a Viewpoint. The Viewpoint provides the capacity for the end user to describe the use of a set of variables in a particular context. (This is similar to the Measure/Attribute/Identifier roles that exist in GSIM. The difference is that in GSIM the roles are fixed, but the roles of Variables can change in DDI. In GSIM the roles are also applied to both dimensional and unit record data.)

Need to develop some documentation to clarify this meaning. No need for changes to the model as it stands.

Jay will add relevant documentation into Lion to address this.

Status: In progress


DDI4DATA-13

Need to clarify some misreading of the model:
a) There is a relationship between IV to Concept - as IV inherits from RV and CV.
b) DataPoint in DDI contains only one IV - whereas in GSIM it can contain one OR MORE.

Point (b) has implications, particularly if DDI considers the inclusion of complex values such as lists in DataPoints in future. The group agreed to return to this question (and Issue 13) at the next meeting, as well as Issue 14 which was not discussed.


Next meeting: Thursday February 9th 2017, 1400 CET.


 1 December 2016 Meeting Minutes

Data Description Meeting minutes, 1 December 2016

Attendees: Jay Greenfield, Dan Gillman, Larry Hoyle, Steve McEachern

Jay provided a short discussion of the process and workflow views, and the need for the review of these by the modelling team - as to whether the approach is suitable (and potentially overly complex). Also some consideration of the pre-event design versus post-event description aspects of this model.

Opening of discussion focussed on assignment of the tasks from the previous meeting. The following allocations were agreed:

Issue 1: Steve and Larry
Issue 2: Referred to the modelling group
Issue 3: Steve and Larry
Issue 4: Dan Gillman to convene sub-group (Dan G, Jay, Flavio, Daniella - and probably also George Alter and Ornulf)
Issue 5: Issue on hold - depends partly on outcome of Issue 4

A related issue to the event history: can we describe an algebra for how to put data together (or not) in certain circumstances? There are elements of the Event History model that also need to be discussed in regard to TIME and temporal characteristics. Jay also mentioned a related project he was working on with Eric P and Daniella Meeker on automated curation of electronic health records, which would be based on FHIR - and also describing events.

Also related was the development of the VariablePointer type - to describe “which variable a value is associated with”. Both Issue 4 and Issue 5 might be appropriate to be discussed by the sub-group. Dan G suggested however that we want to maintain separability (Jay also).

Larry also noted that the Pointer type probably allows us to decompose and address the DataPoints and Datums (also Dan G agreed).

We need to flag also in the Issues list - return to discussing DataPoint, Datum, DataStore (issue listing). Proposed to discuss this at next meeting, and to list out in the Issue Tracker.

Also agreed to assign some of the properties issues to Steve and Larry to work through: Issues 1, 3, 7 and 8.

Jay and Larry also raised the notion of how to handle some of the RDF bindings for CSV on the web. Larry pointed out the use of "correspondsTo" in Lion that may be problematic (and relates to Dan Smith's point raised at a previous meeting).

Next meeting: 15 December, 1400 CET.
Agenda: list out in the Issue Tracker the outstanding issues on DataPoint, Datum, DataStore and InstanceVariable.

Follow up meetings:

NO MEETING DECEMBER 29

First meeting for 2017: Thursday 12 January, 1400CET


 DataDescription Meeting 2016-11-17

Data Description Minutes 2016-11-17


We reviewed DDI4DATA-1 through DDI4DATA-5 and we talked about IASSIST presentations.


DDI4DATA-1 and DDI4DATA-3 are overlapping lists of properties that may not be represented in either LogicalDataDescription or FormatDescription. Larry will do crosswalks to determine what the delta might be.


DDI4DATA-2 talks about whether we are able to represent RDF triples. The focus is on whether we can support URL references (resource references). We discussed how RDF was perhaps something that doesn’t belong in Data Description but is one of our bindings. Dan Smith talked about a paper that came out of the Paris sprint that delineated an approach for producing RDF as a binding. However, this approach didn’t account for different RDF vocabularies like, for example, the CSV vocabulary that we were exposed to this year in Dagstuhl. Dan Smith pointed out that mapping to other standards would be a departure from the Paris paper approach. We agreed that this issue needs to be referred to Modeling Team.


DDI4DATA-4 considers whether we have the datatypes needed to support constructing an event history. Specifically, are we able to talk about the relationships between, for example, variables or unit types over time. Flavio wondered whether PROV provides a good pattern we might adopt in approaching event histories. Dan Gillman suggested we spin up a small group to review that could include Dan Gillman, Daniella, Jay and Flavio among others.


DDI4DATA-5 considers whether we are missing a ViewPoint role. This role would be used to describe “which variable a value is associated with”. Larry presented an example. Jay said perhaps we can do this already because a LogicalRecordLayout in combination with a ViewPoint is very flexible. We thought we should try to model Larry’s example with our existing classes. Flavio mentioned that we might explore an application of functional languages here.




The example had two tables.

The first:

id | age | height
1  | 45  | 167
2  | 66  | 180

And the second:

var    | value
ID     | 1
age    | 45
height | 167
ID     | 2
age    | 66
height | 180


The first table is a traditional tabular layout where each column is associated with an InstanceVariable.


In the second table each DataPoint in the var column contains a pointer to the instance variable associated with the DataPoint to its right. The value column contains a string representation which can be interpreted through information associated with the InstanceVariable and its ValueMapping.
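A minimal sketch of that relationship (plain Python, with hypothetical structures): the var column acts as the pointer that lets the long layout be folded back into the traditional wide layout.

    # Pivot the long (var/value) rows back into wide records.
    long_rows = [("ID", "1"), ("age", "45"), ("height", "167"),
                 ("ID", "2"), ("age", "66"), ("height", "180")]

    wide_rows, current = [], {}
    for var, value in long_rows:
        if var == "ID" and current:    # a new ID starts a new logical record
            wide_rows.append(current)
            current = {}
        current[var] = value           # how `value` is interpreted depends on the variable named by `var`
    wide_rows.append(current)

    print(wide_rows)
    # [{'ID': '1', 'age': '45', 'height': '167'}, {'ID': '2', 'age': '66', 'height': '180'}]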


 3 November 2016 meeting minutes

Data Description meeting 3/11/2016

Attendees: Dan Gillman, Dan Smith, Flavio Rizzolo, Jay Greenfield, Ornulf Risnes, Larry Hoyle, Steve McEachern

Review of the outcomes of Dagstuhl week 02 (and week 01)

  • The attendees reported that Instances were developed for various use cases, and documented on the Data Description team page.
  • Issues were raised in terms of the requirements for Event History data. This would probably be a new type of data: a pointer (fundamentally to suit Event History data). It was noted that this was probably required for the RAIRD model. The rectangular approach can probably be generalised for this, but probably still needs a separate DataType.
  • There are two issues (4 and 5) which appear to be related to this, but may not be exactly correctly documented. Dan G.'s sense was that this could probably be addressed by an extension to the Rectangular class.
  • Flavio had a query in regards to what the pointers were for: to the units involved, or to the records about the unit.

Event History Data

George Alter had some examples for this, as did Ornulf. Example for Ornulf: Marital Status
- The change in status
- The units for which the change occurred (the husband and wife)
- The records about those units
- The timestamp (and other characteristics) about the recording of the change
Ornulf suggested that he will probably be able to discuss this more clearly in February

Flavio suggested that there may be a similarity to Linked Data, but Ornulf thought the closer parallel was to how Stata handles merges - you can link anything to anything.

Arofan to write up EventHistory - note as Issue in JIRA

Further developments

Jay discussed the additional work he had done with Daniella Meeker and Eric Prudhommeaux after week 1, which may also feed into this discussion (particularly Use Case 2)

More on Dagstuhl Week 02

The cases that were covered in Week 2 were:
- CSV
- Fixed width (Fixed, SegmentedFixed, HierarchicalFixed)
- Aggregate
- EventHistory

Code lists

There may be some means for simplifying the nesting of code lists. Dan G. and Larry had a look at this - Dan has documented some of the issues with this in the Dagstuhl outputs. Dan suggested that this is primarily for the Modelling team to consider.

Larry noted that there may be some additional points to consider about how the CodeList-CodeItem-Code layering operated. Dan also thought that this should be considered alongside Names and Labels. Flavio also noted that some of the repetition between CodeItem-CategoryItem-ClassificationItem-Node-etc. may be there because of GSIM - and may be able to be simplified. Dan and Flavio agreed to review this.

DDI and GSIM

Dan G. also noted that DDI will probably work well as an implementation model for GSIM - Klas Blomqvist, Flavio and Dan may be able to progress this within the NSI community. (These need to be socialised inside the community to align CSPA and CSPA-LIM modelling - and the HLG - as well. Particularly the forthcoming CSPA project proposals - with the upcoming meeting in Geneva).

Variable Cascade

Dan G. indicated that we need to write up how to use the VariableCascade and how to use particularly the Substantive and Sentinel domains - and the likely fourth level of the cascade. (and possibly the need to separate out how everything is linked out at the InstanceVariable). Also what the attributes really mean and where they are applied.

Ornulf also noted a possible further extension which may require the change to the RepresentedVariable level but not the InstanceVariable level (i.e. changes to the SubstantiveDomain over time). Flavio suggested that adding a time dimension to the Variable cascade may assist in this.

Data Capture and Data Description

Dan S. noted a set of issues that came from the DataCapture/DataDescription - particularly that there is an interest in linking the DataDescription to the Capture that produces it (so people can determine the lineage of the DataDescription). Dan noted that the process pattern may be able to handle this. There are two elements to this:
a) This variable is derived from that variable
b) This variable was generated from that variable using this process

Jay agreed to describe some of the requirements here, using an example derived from the OMB review process. ADD ISSUE TO JIRA HERE


Finally, the group agreed to put together (can't recall final agreement here?)

Next meeting: 17 November 2016

 6 October 2016 - meeting minutes

Data Description Meeting Minutes, October 6 2016

Attendees: Larry Hoyle, Barry Radler, Jon Johnson, Achim Wackerow, Jay Greenfield, Steve McEachern

This meeting was focussed on planning the priorities for upcoming meetings at Dagstuhl. Readers here should note that the minutes of this meeting are for this reason ordered in terms of priorities discussed rather than in temporal order of the meeting.

Priorities for the Dagstuhl Workshop

Achim's suggestion was to focus on the description of the current status of the model rather than extensions of the model into new directions. Larry suggested that a couple of examples of data marked up for use would be good here (aligned with the use cases).

Priority 1: The application of the Data Description model to the basic use cases
- Unit record
- Aggregate
- Database
- Data lake

Comments: Can we use Jay's paper as the starting point for this? DOC: Overview of Data Description model - Jay Greenfield, September 2016
Reference also the application use cases developed in the 2015 workshop.
Jay noted that we are well positioned to be able to describe clinical data in the current model (based on the discoveries he noted in the paper).

Priority 2: (Week 2 primarily, Week 1 if relevant) Relationship between Instrument/DataCapture and DataDescription

Priority 3: (Week 1 and 2) Paper also required on the relationship of the new Data Description model to GSIM more generally and SDMX.

Priority 4: (Week 2) What is the relationship here to DDI-C and DDI-L - to demonstrate both the application of the model and what the migration path would look like.

Priority 5: (Week 2) - What did we learn from the first week?
- What adaptations do we need to make?

Priority 6: (Week 2) Initial work planning for 2017 - updates/new areas:
- Qualitative data - particularly the Physical
- Datum classes
- Other formats

Next meeting:
- NO MEETING OCTOBER 20 (Dagstuhl workshop)
- Next meeting is November 3. Time to be confirmed due to changes in Daylight Savings.


Appendix: Notes regarding above priorities and possible additional content

The following are notes taken from the Sept 22 minutes and other sources about required future work for the Data Description model. These may be relevant to the above, or to the work planning discussion, and are included for reference.

1. Datum classes
- Continuing work on Datum

2. Other formats
- e.g. RDF, DBs (relational, graph, noSQL, ???, ...)
- The current model also supports some flavours of RDF (particularly N3) that reduces RDF in a particular way, but may not work with a graph in general - i.e. the key issue is that we don't have a model here that would represent triples in general

3. DD and FHIR/OpenEHR
- Larry noted that we had made provision for Event and Cube Layouts in the FormatDescription. Larry thought we could establish this as another form of layout. Dan S also thought that we may be able to use the proposed nesting of LogicalRecordLayouts to address this as well. This and other complex types can be discussed further in Dagstuhl. We may also have ways of using the extension approach within DDI4 to model some of the FHIR extensions.

4. Variable cascade:
- Additional levels
- Conformance with GSIM (Dan Smith will comment on the review)
- Being able to describe the relationship to GSIM


 22 September 2016 meeting minutes

Data Description Meeting 22 September 2016

Attendees: Dan G, Dan S, Larry, Jay, Steve, Barry

Comments from Modelling Group

The modelling group was going to discuss the DataDescription and other packages due for release.
- Methodology is likely to be held back, however the Methodology PATTERN should be part of the release (consistent with the release of patterns more generally)

For DD, we will include:
- VariableCascade
- LogicalDataDescription
- PhysicalDataDescription
- FormatDescription

What will not be released:
- Datum-related classes

Wendy also added a DataDictionary which provides the functional view which will use the Logical and Physical (there are also separate Logical and Physical views - the Physical view probably has only limited value)

There was also a discussion on the relationship between Instrument/DataCapture and DataDescription, which raised some serious discussion among the Modelling group. Barry made the comment that the model has been pretty stable since Dagstuhl 2015. He noted that it was also not fully clear at this point what the touchpoints between the two models are (although Dan G has a sense that this is reasonably in hand). This will depend somewhat on completing the work on the Datum-related classes, but can be addressed in the second week of Dagstuhl 2016.

Use case discussion

Jay then went through the use cases he had developed from a STRUCTURAL (rather than CONTENT) perspective to see whether the current model had the capacity to support other cases.

Jay noted that LogicalRecordLayout is great for representing tables, but not for representing other schemas (including JSON, RDF and FHIR). He found that it could achieve this if the LRL was made to be NESTED. As such he suggests (and actually added to Lion??) the NESTS relationship to allow it to nest sub-layouts

The current model also supports some flavours of RDF (particularly N3) that reduces RDF in a particular way, but may not work with a graph in general - i.e. the key issue is that we don't have a model here that would represent triples in general.

He also found that FHIR type models can also be represented using the VIEWPOINTS classes (see document for details). There are some issues with representing the different areas of FHIR. This could be addressed by adding a Type property to the AttributeRole - which would then allow us to model OpenEHR and FHIR.

The second issue with FHIR is the question to how to address the "Resource References" in FHIR.

Generally, Jay noted that the content of FHIR is very broad - but the basic architecture is relatively small - which may make the interoperability with DDI more amenable.

Larry noted that we had made provision for Event and Cube Layouts in the FormatDescription. Larry thought we could establish this as another form of layout. Dan S also thought that we may be able to use the proposed nesting of LogicalRecordLayouts to address this as well. This and other complex types can be discussed further in Dagstuhl. We may also have ways of using the extension approach within DDI4 to model some of the FHIR extensions.

On the Type property on AttributeRole - Larry asked about whether we should ... (missing notes on this comment)
It could possibly also be addressed through multiple viewpoints, and/or multiple AttributeRoles. Dan S noted that we should also provide recommendations about what level we think Viewpoints should be applied at. The question is partly where the Type should be applied.

Actions:
- Jay will revise the use cases over the next few days, and circulate for discussion among the group
- Group members to respond to Jay's updated document via email discussion
- Next meeting (Oct. 6th) to focus on any preparation required for Dagstuhl


 8 September 2016 meeting minutes

Data Description Meeting 8 September 2016

Attendees: SM, DG, BR, DS, FR, JG, DS

Agenda

1. Review PhysicalLayout and FormatDescription
2. The proposed "Physical" layer in the Variable Cascade

Larry led the discussion, walking through the updates to the PhysicalLayout model and the changes he had made. The group walked through the objects in the model in turn.

PhysicalLayout and ValueMapping

  • Larry noted that we may want to make the PhysicalLayout (PL) realise a Collection.
  • PhysicalLayout also has a relationship to the new ValueMapping object (basically describing how the IV is physically represented in a file).
  • Larry noted that if PL realises a Collection, then the ValueMapping can be ordered using an OrderRelation.
  • Flavio asked whether there could be more than one IV associated with a ValueMapping. (It was noted that a ValueDomain could encode more than one IV.)
  • Dan also asked whether ValueMapping was really a "Field" (in more familiar terms) - and Jay asked whether the implication was whether we should rename it. This should be considered in the review.
  • Flavio suggested changing the ValueMapping relationship to PL to a Composition (which would support the realisation of Collection Larry raised earlier and allow the Order relation). The group agreed with this suggestion.
  • Larry suggested an additional (optional) relationship to a PhysicalSegmentLocation. Dan G and Flavio agreed
  • Dan G recommended removing the start, end and length from ValueMapping and use the relationship to connect to the properties on Segments. All supported this.

Summary of proposed changes to PL and VM:

  • ValueMapping: change relationship to PL to a Composition
  • ValueMapping: add optional relationship to PhysicalSegmentLocation
  • ValueMapping: remove the properties of start, end and length as we can use the relationship to PhysicalSegmentLocation to use the properties of Segments
  • ValueMapping: retain the name, but include note on the object to suggest possible alternate name of Field

Content not included

  • Larry then reviewed the W3C properties that he hadn't included at this point in his word doc. Flavio suggested holding off on adding these for now - and just have the word doc available as reference for reviewers.
  • NOTE FOR JIRA: Jay asked if we have the relevant content available to reference a triple. We may, but we need to return to this at a later date.
  • Procedural note: Larry also recommended starting a JIRA project for following our unresolved issues for us to return to in future (such as Jay's RDF point above).

VariableCollection and related objects

Flavio: is VariableCollection a LogicalRecordLayout? Wendy has made the case that we want to order both Physical and LogicalRecords. But Flavio asks do we need VC? It seems to be LogicalRecord. Jay: and what about RecordRelation in that case?

Summary:
- RecordRelation: move to the LogicalDataDescription
- VariableCollection: use the LogicalRecordLayout (and move to Logical)
- StructureDescription: remains
- StructureDescription points to DataStore
- StructureDescription gets an Overview attribute (to provide an overview description)
- PhysicalLayout points to LogicalRecordLayout
- DataStoreSummary is removed
- DataStore adds a new property RecordCount

NOTES on the summary:

  • Steve: Does a file have multiple PhysicalLayouts? Yes. Then Flavio suggests keeping StructureDescription. Dan S then suggests pointing PL to LogicalRecordLayout.
  • Dan S noted that the parallels maintained in logical and physical was similar in DDI3 - and made it harder to keep collections in sync. Suggestion: avoid the duplication.
  • Dan S: should the DataStore then link to the PhysicalLayout that it represents? DataStoreSummary could then be removed and the RecordCount could be a property on DataStore.

Item 2 for future discussion: The proposed "Physical" layer in the Variable Cascade

  • Agreed to hold this over until next meeting
  • Results of discussion will NOT be included in the review version - i.e. note in model, but any changes will be made after the review version is completed.
  • JIRA NOTE: Note this as a future issue in JIRA - will need significant discussion.

TO DO:

  • Flavio will make the edits above to Lion
  • Jay will review the revised model and validate against a couple of usecases
  • Pending Jay's review, the group present agreed that the model is now ready for review by the modelling team.
  • Can other group members not on the call at the end of the meeting (DG, OR, CS, AW) please indicate if you have any issues with the final model, otherwise it will be handed over to the modelling team for review after Jay's review next week.

Next meeting: Thursday 22 September, 1500 CEST
Agenda focus: The proposed "Physical" layer in the Variable Cascade. (Final revisions following Jay's review only if required)

 25 August 2016 meeting minutes

Meeting notes 25 August 2016

Attendees: Larry Hoyle, Dan Gillman, Jay Greenfield, Barry Radler, Flavio Rizzolo, Steve McEachern

Larry walked through the spreadsheet he and Steve put together last week (LayoutAttributes2016_08_24.xlsx). Larry also put together a description of each of the properties, pulling together the content from W3C and other specs (https://docs.google.com/document/d/166bRLyLyKQxk1YqH5lX-QWwH19pn4bM7VIXDO-jDcdI/edit?usp=sharing). The group proposed to work through these properties through the meeting.

Comments on specific properties have been added into the Google Doc. General queries on the approach and content are included below.

Proposal to have these properties in two classes: PhysicalLayout and a new "VariableLayout". i.e. the VariableLayout describes the physical attributes of the representation of the InstanceVariable in the physical file. (Flavio suggests finding an alternate name).

Note that there are also several levels of application: TableGroup, Table, etc.

Jay queried what to do about the VariableLayout - how would we model this. Dan G. notes that if the sentinel values change, then we need a new IV.
Physical could just be the relationship of an IV to a physical layout. And it doesn't need to be a 1-1 relationship. Larry's VariableLayout class is trying to achieve this. Flavio suggested that this is fundamentally another layer in the variable cascade (hence name should be PhysicalVariable or PhysicalMapping). Flavio agreed to develop the UML for this.

There was a suggestion however that we should move the PhysicalDataType out of the IV and into the PhysicalMapping. Agreed on this, but also agreed that we retain INTENDED DataType in the RepresentedVariable.

Question: Is a PhysicalRecordLayout now a collection of PhysicalMappings??
Flavio notes that PhysicalMapping has a parallel in GSIM of DataStructureComponent.
So PhysicalMapping becomes one of the DataStructureComponents of the PhysicalRecordLayout.

A new layer for variables??

Steve asked about the implications of needing to define a new IV for each physical format (e.g. SAS and SPSS). Barry also asked a variation on this question.
Dan suggested that we need to draw a line at some layer - below what level do we not care? Sense of the group was that we have the classes (by inclusion of PhysicalMapping) to achieve what we want. There was substantial discussion on the implications of this.

A possible alternative approach is to define a canonical format within an organisation's system and then to make transformations from that system. This had some preliminary positive response within the group but requires further consideration.

Three levels of interoperability:
- Distinction between different concepts at the Conceptual layer - requires a CONCORDANCE
- Agreement on substantive concepts but differences in designations at the Represented layer - results in a TRANSLATION
- Agreement on concepts at both the substantive and sentinel level, i.e. the Instance level (same designations) - results in a SIMPLE TRANSFER

Next steps:

- Members of the group to make comments by next Thursday Sept 1st.
- New PhysicalMapping and Properties will then be added to Lion by Larry.

Next meeting: September 8, 2016, 1500 CEST

 11 August 2016 meeting minutes

Minutes of August 11 meeting

We need RecordLayouts at both the Physical and Logical levels. May need to map the relationships between the Physical and Logical Level DataTYPES.

DataPoint: can have a DataType in the LogicalDescription.
Expressed as a String in the Physical file (probably a standard Expression language).
W3C includes a number of "standard" datatypes, defined in the XML schema definition (https://www.w3.org/TR/xmlschema11-2/), discussed in section 4.6.2 of the Tabular Data Model (https://www.w3.org/TR/tabular-data-model/#datatypes).

Need to flesh out the minimum set (from RFC4180) plus some additional properties that we would want to account for, eg.:
- successive delimiters
- leading characters
- leading rows (they have a header row, but there may be rows we need to "ignore")

Tabular data model allows for reading columns from right to left (e.g. Egyptian) and other directions (bottom to top). The question is which extended set of content would we want to support. The starting point might be to account for all the properties considered in the W3C tabular data model, plus then also allow for extensions (eg the right-to-left order above, different character sets, etc.). The custom metadata parameters may be suitable to enable that.

Action: to put together the superset of properties that support the CSV on the Web and PHDD descriptions. (A good start for this appears to be the "CSV Dialect Description Format" that is identified in the tabular data model.)

Steve to map the RFC4180, Dialect, TabularDataModel and PHDD together, and distribute for review. (Has Achim already done this? Steve to contact Achim.)
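A sketch of what such a superset might look like, using hypothetical property names rather than the final W3C or DDI4 names:

    # Hypothetical CSV format-description properties: RFC 4180 basics plus the additions discussed above.
    csv_format_description = {
        "delimiter": ",",
        "quoteChar": '"',
        "encoding": "utf-8",
        "headerRowCount": 1,
        "skipRows": 2,                            # leading rows to ignore before the header
        "skipInitialSpace": True,                 # ignore leading characters after a delimiter
        "successiveDelimitersMeanMissing": True,  # two consecutive delimiters imply a missing value
        "textDirection": "ltr",                   # could also be e.g. "rtl" for right-to-left tables
    }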

Outstanding questions:
1. Do we want to do Fixed Width formats?

Describe a layout for each variable in the FW file.
(Start, End, Length). Also maps the InstanceVariable to the FW variable.

We also need a way of mapping the IV to the "segment" of the file. The qualitative model of segments may provide the means for doing this (describing a record as a series of segments). The mapping then indicates the IV associated with the segment - the Format layout would be a collection of segment identifiers, each with a link to an instance variable. (The PhysicalSegmentSet and SegmentByText would be the relevant classes.)

We may also want to consider whether this may be a Pattern.

Someone needs to put together a proposal for the FormatDescription to use the Qualitative classes to model the FixedWidthLayout.
- Segments
- Mapping the IVs
- Formats
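A minimal sketch of the segment-based layout just listed (the layout table and variable names are invented for illustration; this only approximates what PhysicalSegmentSet/SegmentByText would carry): each segment is a (start, length) slice of the record, linked to an instance variable.

# Sketch: a fixed-width layout as a collection of segment descriptions, each
# mapped to an instance variable. Column positions below are invented.
FIXED_WIDTH_LAYOUT = [
    # (instance variable, start column (1-based), length)
    ("IDNUM",  1, 4),
    ("SEX",    5, 1),
    ("AGE",    6, 2),
    ("INCOME", 8, 6),
]

def parse_record(line):
    # Slice each segment out of the record and attach it to its IV.
    return {iv: line[start - 1:start - 1 + length].strip()
            for iv, start, length in FIXED_WIDTH_LAYOUT}

print(parse_record("0001M34035000"))
# -> {'IDNUM': '0001', 'SEX': 'M', 'AGE': '34', 'INCOME': '035000'}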

2. Logical and Physical Record Layouts

For discussion


Next meeting

Sub-group to discuss superset of properties for FormatDescription - August 18, 1500 CEST

Regular full group meeting - August 25, 1500 CEST


 4 August 2016 subgroup meeting minutes

4 August 2016: Sub group meeting to review FormatDescription properties and CSVWeb spec relationship

Participants: Larry Hoyle, Flavio Rizzolo, Steve McEachern

Should we cover both CSV and Fixed Width? Decided to address CSV first, and then Fixed Width if possible

The group ran through the CSV on the Web use cases to start identifying characteristics that we may or may not have addressed in our model.

We noted that the following are missing:
- Connectors from IVs to the PhysicalLayout (or possibly a "PhysicalVariable" that is related to an InstanceVariable)
- VariableCollection (probably renamed PhysicalRecordLayoutOrder)
- Casting mechanisms from the Logical DataType to the Physical DataType (and possibly from one Physical DataType to another - eg. from SPSS to SAS)

Question: How far down into this transformation do we want to go?

Larry: could we handle this through regular expressions, parsing the string that is received into the system-specific format?

We may also want to support the parsing of the content.

CSV on the web: "A regular expression, with syntax and processing as defined in [ECMASCRIPT], may be used to validate the format of a string value. In this way, the syntax of embedded structured data (e.g. html, json, xml and well known text literals) can be validated.
However, support for the extraction of values from structured data is limited to the parsing the cell content to extract an array of values. Parsers must use the value of the separator annotation, as specified in [tabular-data-model], to split the literal content of the cell. All values within the array are considered to be of the same datatype."
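Read concretely, the two mechanisms quoted above amount to something like the following sketch (parameter names are ours, not the spec's JSON keys): a regular expression only validates the cell's format, while a separator annotation splits the cell into an array of same-typed values.

# Sketch of the quoted behaviour: the regex validates the cell format; the
# separator splits the literal content into an array of one datatype.
import re

def parse_cell(content, fmt_regex=None, separator=None, cast=str):
    if fmt_regex is not None and not re.fullmatch(fmt_regex, content):
        raise ValueError(f"cell {content!r} does not match {fmt_regex!r}")
    if separator is None:
        return cast(content)
    return [cast(v) for v in content.split(separator)]

# e.g. a multiple-response cell holding semicolon-separated integers:
print(parse_cell("1;3;4", fmt_regex=r"\d+(;\d+)*", separator=";", cast=int))
# -> [1, 3, 4]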

We agreed to focus on the extraction of the values from the file.
We could then provide a stub to describe the process of casting this to a particular system's format. This would be an "intermediate" format, from which the developer would then need to determine the final format for their given system.
We did note that the Tabular Data Model (https://www.w3.org/TR/tabular-data-model/#parsing-cells) does have formats for various types (Section 6.4 on Numeric, Boolean, ...) and a formatting approach for other types. We may want to include the formats included here, plus the "other" option.

(We note that this "other format" uses ECMASCRIPT - as their mandate was to support data ON THE WEB - whereas we have a broader mandate to support online and offline systems (e.g. SPSS, SAS, databases, ...).)

In a CSV, it's important to note whether successive delimiters mean a "missing" value: e.g. in a CSV, two consecutive commas typically mean a missing value;
in a tab-delimited file, two tabs may (or may not?) mean that there is a missing value.
Therefore we need a parameter which indicates the meaning of successive delimiters.

The RFC4180 standard referenced in the tabular data model provides a sound starting list for the properties we need to incorporate. We would also want to include:
- successive delimiters
- leading characters
- leading rows (they have a header row, but there may be rows we need to "ignore")


 28 July 2016 Meeting Minutes

Minutes of Data Description Meeting, 28 July 2016, 1500 CEST

Attendees: Dan Gillman, Steve McEachern, Barry Radler, Flavio Rizzolo, Achim Wackerow


The meeting focussed on reviewing the current content of the model by applying the current classes to a simple CSV usecase.

Slides from this review are included in the previous minutes.

Comments on each slide are as follows.


Slides 1-3:

No comments. Descriptions of the container and structure classes appear to be appropriate and comprehensive.

Slide 4: Structure:

Add in a visual representation of viewpoints (e.g. colour-coding of IVs associated with multiple viewpoints).

Slide 5: Format Layout.

PhysicalLayout:

What properties are still needed? Examples: characterSet, escapeCharacter, lineTerminator (line feed??).
We might want to use the TabularData use cases document as test cases here (https://www.w3.org/TR/csvw-ucr/).
Can we also identify the set of properties being suggested by W3C?

VariableCollection:

Application of the VariableCollection: is it necessary? Couldn’t we just re-use the LogicalRecordLayout?
(May depend on where we want the re-use to lie? Do we have a LogicalRecordLayout that has more than one PhysicalLayout?)

Input from others involved in Edmonton and Norway would be appreciated here as to the purpose of the "VariableCollection" class.

Variable Order:

How do we map the variables in the CSV to the Logical content (i.e. the InstanceVariables)? We can use LogicalRecordLayout to identify the variables and a “ColumnLocation” (or PhysicalRecordLayoutOrder???) to map the columns to the InstanceVariables. This needs to have a relationship to the PhysicalLayout (or probably the StructureDescription).

The ordering on the logical level is going to be semantic – in the physical level it is the order of columns
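A small sketch of that split (names are placeholders for whatever the “ColumnLocation”/PhysicalRecordLayoutOrder class ends up being): the physical layout carries the column order, the logical layout only the set of InstanceVariables, and the mapping joins a CSV row to its variables.

# Sketch: logical layout = semantic set of IVs; physical layout order = the
# column order in the file. Class/property names are illustrative only.
logical_record_layout = {"IDNUM", "SEX", "AGE", "INCOME"}
physical_record_layout_order = ["SEX", "AGE", "INCOME", "IDNUM"]  # CSV column order

def row_to_datums(csv_row):
    # Attach each physical column value to its instance variable.
    assert set(physical_record_layout_order) == logical_record_layout
    return dict(zip(physical_record_layout_order, csv_row))

print(row_to_datums(["1", "34", "35000", "0001"]))
# -> {'SEX': '1', 'AGE': '34', 'INCOME': '35000', 'IDNUM': '0001'}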

Overall:

We may want to rename a few objects here to provide clarity (e.g. "physicalDataType" appears with the IV in the Logical data description). This is consistent with the IV, but confusing in the context of the Logical model.

Slide 6: Do we use the following

Cases identified for both objects: Viewpoint and LogicalRecordLayoutOrder.

Slide 7: What is missing

Could we map these classes to the classes of the W3C TabularData spec?
(It would also provide an implicit review of our spec to see what we have missed)
See the working group:
https://www.w3.org/2013/csvw/wiki/Main_Page
And the vocabulary:
https://www.w3.org/TR/tabular-metadata/
See particularly the diagram of classes on the vocabulary page


Next meeting:

Sub-group meeting to map CSV on the web classes (Steve, Flavio, others interested): August 4, 1500 CEST

Full meeting: August 11, 1500 CEST


 Preliminary Slides for 28 July 2016 meeting

Slides for discussion at today's meeting

 23 June 2016 meeting minutes

Additional Meeting held 23 June 2016

Attendees: Jay Greenfield, Larry Hoyle, Steve McEachern, Flavio Rizzolo

Model from Feb 2016 (from Flavio) was as follows - but this was not yet finalised in Lion prior to the Edmonton meeting

PROBLEM:
Work from Edmonton conflated the structure and the container into DataRecord when it had the viewpoints linked to DataRecord and removed (Instance)DataStructure

We need the DataStructure to describe the STRUCTURE of the Record "Type"
We need DataRecord to be the CONTAINER of DataPoints/Datums

COMMENTS:
The DataStructure is basically a SCHEMA

(In Edmonton, they created a "RecordDataPointOrder" which performed the same function)

Jay had a query of the scope:
- Suggested that this "DataStructure" is the format of the RECORD but NOT the relationship BETWEEN records

If we agree on this then we have:
- DataStructure: describes the structure
- DataRecord: contains the DataPoints

DataStore then is a CONTAINER for a collection of DataRecord(s), which REFERENCES zero or more LogicalRecordLayout(s) (but at time of "publishing" it requires ONE or more).
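A minimal sketch of the separation just described (class and attribute spellings here are approximations of the discussion, not the Lion model): the LogicalRecordLayout is the schema, the DataRecord the container of DataPoints, and the DataStore the collection of records referencing one or more layouts.

# Sketch only: schema vs container vs store, with illustrative attributes.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class LogicalRecordLayout:            # the SCHEMA / record "type"
    name: str
    instance_variables: List[str]

@dataclass
class DataRecord:                     # the CONTAINER of DataPoints/Datums
    layout: LogicalRecordLayout
    data_points: Dict[str, Any]

@dataclass
class DataStore:                      # container for a collection of records
    layouts: List[LogicalRecordLayout] = field(default_factory=list)
    records: List[DataRecord] = field(default_factory=list)

person = LogicalRecordLayout("person", ["IDNUM", "SEX", "AGE"])
store = DataStore(layouts=[person],
                  records=[DataRecord(person, {"IDNUM": "0001", "SEX": "M", "AGE": 34})])
print(store.records[0].data_points["AGE"])   # -> 34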

(Aside for the modelling team: we need to be able to describe points in the data lifecycle where cardinality is required as opposed to optional - e.g. in documentation, or more formally in the model, perhaps as a profile?? An example would be a DataStore in design/development versus in use. Jay came up with a possible approach with two states: staging and production (or possibly design and implementation).)


NAMING:
DataStructure is seen to be a confusing name - Suggested DataRecord
Preferred name is "LogicalRecordLayout"
(Recommend also changing PhysicalLayout to PhysicalRecordLayout)
RecordDataPointOrder in FORMATDESCRIPTION would be moved to the LOGICALDATADESCRIPTION (probably becomes the LogicalRecordLayout)

RELATIONSHIPS BETWEEN RECORDS:

Still need to describe relationships among records

  • At the logical level we want to describe the fact that there is a RELATIONSHIP between two records (e.g. a household level and a person level record) - INCLUDE IN THE LOGICALDATADESCRIPTION
  • At the physical level we need to describe how that relationship is represented in the format in use (e.g. position in the record, keys, identifiers) - INCLUDE IN THE FORMATDESCRIPTION

Jay would like in the future to walk through a data warehouse example. And we will also need to walk through a number of other relationship types.

There was further discussion about what to do with data formats that have only an implicit identifier for records (e.g. a line in a CSV file)
Jay suggests there are two mechanisms for describing this:
- Declarative OR
- Procedural
(Larry noted that you could use the procedural approach in combination with the qualitative "segments" to describe a structured text file)

POSSIBLE CHANGES:

  • Rename the DataStructure object to LogicalRecordLayout
  • Moving the relationships from the Record to the LogicalRecordLayout
  • InstanceVariable relationship should also be moved from Record to LogicalRecordLayout
  • Add the LogicalRecordRelation (in the LogicalDataDescription) i.e. the equivalent to the PhysicalRecordRelation
  • RecordRelation rename to PhysicalRecordRelation (and this may have subclasses)
  • Classes to move FROM FormatDescription TO Logical:

RecordDataPointOrder is (probably) REPLACED by LogicalRecordLayout

Q2 RELEASE:

What then should we aim for in the Q2 release?
- Question from Flavio
- Can we finish FormatDescription for the Q2 review? Recommend that we "fix it up", but that it may need some additional work after the review
- LogicalDataDescription will be completed (but not the Datum section of that model)

ACTIONS:
- Flavio to update the LOGICALDATADESCRIPTION model in Lion
- Leave the FormatDescription (i.e. Physical) for now

NEXT MEETING

Regular meeting, June 30, 1500 CEST 


 16 June 2016 meeting minutes

Notes from Data Description meeting 16 June 2016

Attendees: Dan S., Dan G., Jay, Barry, Larry, Flavio, Steve

Discussion of Identification and Annotation issues (per Wendy's document for the TC)
https://ddi-alliance.atlassian.net/wiki/display/DDI4/Technical+Committee?preview=/491555/36470786/Identification-Annotation.docx

Broad discussion occurred for the different requirements for annotation, identification and administration

Current status:
The following are required: "isUniversallyUnique" and "isPersistent"
The remaining properties and relationships are optional

Question from Dan G.: are there any objects that are identifiable but not annotated
There are a small selection (eg. segments of text)

One option is to use notes and point to the notes. However it was found in the Norway sprint that this affects interoperability (e.g. undermined the round-tripping between DDI-C2.5 and DDI4 Codebook view).

There was discussion of the possibility of limiting the properties that are available (to be "seen") in a View. This was discussed in Edmonton, and it was noted that a View cannot EXCLUDE a property - it can only make it OPTIONAL. There was a proposal from Edmonton

Dan G: it might be possible to SHOW different properties within a view at the APPLICATION level (rather than in the model). This might therefore be handled by the database rather than the model.

It was also noted that this then creates interoperability challenges - for interoperability, if a property is there, there has to be a way to handle the property. In particular, it's straightforward for you to create your own instances, but what happens when you need to use an instance created by someone else.

Flavio also asked about what to do when an instance of a class might be used by more than one view (i.e. how do you pass through a value in a field that you don't care about, but someone else has populated it). For example, if someone produces an instance with a fully specified InstanceVariable, but you only are interested in a small set of the properties of that instance variable, how do you handle the additional IV properties.

Issues proposed for discussion by the modelling group:
1. How to set properties as OPTIONAL WITHIN a view
2. How to maintain properties established within one instance of a view when it is used in another view (how do you hide it).
3. Where to manage the integrity of instances of views (in the model, or by the user within their application e.g. in the documentation)

In view of Issue 1, Larry has reviewed the model to identify the current classes, properties etc that are required - and that list will be considered by the modelling group.

What about delivery for the minimum product for the next release?
1. Continuing work on the model (e.g. physical description, datum, record and data store)
OR
2. Working up the examples (e.g. describing the physical and logical content of a CSV)

Dan S: can we describe the logical content of a file, without describing every record within the file.
- We can describe each record TYPE - not each record
- use sequences to describe a record

(This issue is to form the basis of an additional meeting on June 23)

Examples to work up:
1. Viewpoints
2. Logical content of a CSV
3. Physical layout of a CSV
4. Application of a variable cascade

Example: work out the example of how to describe a CSV file
Possible content: SEX, AGE, INCOME, IDNUM

Steve to start a document based on
- Include sample content from the CSV file and
- YAML based on Ornulf and Oliver's output from Kalvag



 9 June 2016 meeting minutes

NOTES FROM Data Description meeting
Thursday 9 June 2016

Attendees: Dan G., Dan S., Ornulf Risnes, Steve McEachern, Jay Greenfield, Barry Radler

Focus of this meeting: Review of Sprint outcomes for this group, Ornulf to discuss his issues

Introduction to sprint summary was provided by Dan.
Significant discussion in the sprint on Codebook, and also work on Methodology, but no direct discussion on Data Description.

However, the sprint did focus on the extent to which the current content in the Data Description can be integrated into the Codebook view. (The key focus here was on replicating a DDI-C instance.)

Job now is to produce something that can be used by the Codebook view.

Right now, we don't really need any of the Datum content, and not a great deal from the Physical Description. Barry queried the use of the Data Capture, and this also had only limited use within the Codebook view at this time.

There were two core groups for much of the week, focussing on three basic "codebook" types:
- one corresponding closely to DISCO
- a second focussed on replication
- the third was a traditional DDI-C codebook
The content of the sprint is currently in Google Docs (https://drive.google.com/drive/u/0/folders/0B0RsNaHM6CqxM0R0ckptX3hOSlE), but will be exported over to the wiki in the near future (Marcel is responsible for this).

The result of the analysis of the DataDescription content to use in Codebook view suggested that the abstracted variable cascade may be too complicated to use inside the view. Ornulf's sense was it was "human mappable", but that it would be difficult to automate the mapping in a machine useable way.

This lead into the discussion of Ornulf's concerns with the current state of the model (refer to Google Doc here). The key issues here were basically two:
a) the cascade needs to be simplified for use in the Codebook
b) that the high level of requirements for identification of objects within the model makes it very difficult to manage - that there will be a significant identification management requirement if identification is made compulsory

Commentary on Ornulf's points
a) Simplifying the cascade
Focus on the good Data Description logical model - the use case for application with the Codebook view is only one usecase here (as the three codebook types above suggest)
- Larry noted that the RV and CV are not required in the model but can OPTIONALLY be REFERENCED.
- Dan S. noted that we may want to hide some or much of the content in a Package when we specify a particular view. Jay suggested that the Technical group at the sprint had worked to address this (that was then reflected in work that Oliver Hopt did to produce an .XSD)
- Larry suggested that we may therefore want to minimise the number of properties that are MANDATORY - to focus more on what we SHOULD do, but not what we MUST do
- We can still inherit all of the properties from the higher levels
- (One quick way to check this would be to identify where we have properties with a minimum cardinality of 1 - indicating Mandatory). Dan S. noted that this had been done in DDI3, and had (by 3.2) made much of both the properties and relationships optional - primarily because a proportion of the DDI content could only be collected at a certain stage in the data lifecycle
- Dan G.'s concern with reducing Mandatory content was ensuring that all of the properties and relationships get instantiated in the way that they are intended. He wondered therefore how we "ensure" this "correct" instantiation.
- Ornulf took the position that some of this could alternatively be handled by an organisation's internal systems

Suggestions for modelling group:
1. We should review the extent of mandatory content within the model, to consider the extent to which it is manageable for implementations of the model in systems
2. As a corollary of (1), there are implications for reducing the integrity of the model - particularly the interoperability of the model - of reducing mandatory content. How can we caution or control against this. (Is this guidance, tools, etc...?)

b) Comments on Ornulf's comments on Identification
- For the Physical Description - it doesn't need to be in the Codebook consideration
- Dan S. agreed that there is an overuse of identification within the current form of the model. (e.g. IF statements in the Data Capture, or the roles in Viewpoints in the Logical model)
- Dan S. suggested that objects be reviewed to assess the REAL requirements for identification of the object
- Dan S. noted that there is a high level of identification, but it is possible to use
- Dan did indicate that we can't not have "identifiers" - but that we may be able to use internal identifiers, but then append agency identifiers (e.g. nsd.*) onto the internal identifiers. He pointed to a paper he put together from the 2014 NADDI sprint on this:
https://ddi-alliance.atlassian.net/wiki/display/DDI4/Identification - recommended to revisit the suggestions in this paper.
- On the variable cascade: we want to retain the cascade in the DataDescription package - but then be able to collapse the cascade in other situations.
- Ornulf suggested that there may be still one issue that isn't resolved regarding the use of inline versus referenced content

Suggestions for modelling group:
1. Review the need for identification on objects
2. To propose for development the description of the "physical layout" (for example the CSVW) (and should this be done for the Codebook view release?)

Discussion for next meeting:
- Minimum Viable Product
Meeting Thursday June 16 2016, 1500 CEST


 19 May 2016 Meeting notes

Meeting notes 19 May 2016

Attendees: Dan Gillman, Steve McEachern, Flavio Rizzolo, Jay Greenfield, Larry Hoyle, Ornulf Risnes

Focus of this meeting was the agenda for the Norway sprint: recommendation was to focus on CSV description. Suggestion for the sprint would be to focus on documenting a real life example to see what's missing. At the sprint, Larry suggested to try to represent what's in an existing archived data set documented in DDI2.5 and see if it can be reproduced in the current DDI4.

Question: Do we want to provide our own physical description, stick with the logical, or to map to the physical description of something (e.g. the W3C standard).

  • Flavio discussed presenting the logical representation of the relevant objects required to populate a physical model
  • Jay asked if we are going to support the physical model at the logical level - question for Dan.
  • Flavio suggested that we need some logical descriptions of a physical file (e.g. CSV, ...). Flavio suggested that we could document the logical elements of a physical file (e.g. EOF, data types, ...)
  • Flavio suggested that the external standards can provide some insight for us in understanding what might be missing (e.g. do we have everything needed to describe a W3C CSV file)
  • Jay asked a further question on what progress was made in the development of the PhysicalLayout at Edmonton. Larry referenced the "PhysicalDataDescription" package which was developed.
  • Jay asked whether this was literally a physical model, or something in between a logical and physical. Dan suggested that this is the ability to describe the "physical layout" of some data, rather than the literal physical format of the file. Larry felt that we were trying to describe how to layout the file.
  • Flavio indicated that we are trying to describe a physical representation, without actually describing a "physical representation" :-)
  • Jay suggested that we may not want to call this package "Physical Data Description" - rather it is a "metamodel". Larry suggested perhaps "Layout Description"? Or perhaps "Data Format"? Would we also include data types? Perhaps "Format Description"? (This was then applied to the Lion model by Larry).
  • Ornulf noted that this choice of "Format Description" provides a nice connection to the RDBM "Relational Model".

Various people noted the variations that are going to exist within "data types" - which Dan noted may have different implications for the operations that you can perform on them. They may also be implemented slightly differently in different software applications.

Larry noted that the "classification level" (equivalent to ISO11404 "family") is not yet in the DDI4 model. We may need additional levels of the "data type" - intended and actual - as well. These are partly in the model (under "datatype") but need fleshing out.

On "schemes" - it was noted that we need both the "scheme" and the "name of the scheme". Larry made the change to "schemeEntry". 

 Meeting Notes Thursday May 5, 2016

Attending: Dan Gillman, Ornulf Risnes, Barry Radler, Flavio Rizzolo, Larry Hoyle

Data Description 2016-05-05


What is the issue we’re trying to solve? 

W3C spec exists and is a W3C recommendation. PHDD and W3C overlap. They have a role in serialization.  W3C has headers and  (anemic) metadata - id,  label, and general text for the column.

We can use the physical side of the  W3C model without using their underlying data model.


W3C Section 5.6  is the metadata header for a column https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#columns



W3C has a larger set of datatypes than DDI e.g. supporting multiple responses as a complex cell value


It would be nice if we could connect instance variables to columns in the W3C model, making a connection to them from the DDI community.


Our initial mission was to describe a simple rectangular CSV table. We wandered from the simple, but have created a logical model that will be useful going into the future and for other ways of storing/representing data.


Can we take the Edmonton PhysicalDataDescription (in Lion) and map it to W3C? http://lion.ddialliance.org/package/physicaldatadescription


We can use section 8  Parsing Tabular Data for physical description. (https://www.w3.org/TR/tabular-data-model/#parsing )

From W3C we need:

 escape character, encoding, line terminators,

And probably: comment prefix, header row count, skip blank rows, skip columns, skip rows, trim


Should we leave fixed width formats to the future? – just handle CSV at Norway sprint?

Much legacy data in archives may be stored in fixed width layouts.


 April 28, 2016 - Meeting notes

Meeting notes Thursday April 28, 1500 CEST

Attending: JG, SM, LH, BR, OR

Larry provided an overview of the Physical model developed in the Edmonton sprint. He also highlighted the possible relevance of both the PHDD developed by DDI, and the Tabular Data Model developed by W3C (https://www.w3.org/TR/tabular-data-model/). He walked briefly through the Physical Layout and RectangularLayout, to determine how to describe the simple CSV use case. He also pointed out the Cube and EventLayout, and then noted that there would be other variations that could be developed in future.

Ornulf noted that there are two core areas for the PhysicalLayout:

  • Data exchange
  • Long term preservation (where text is the long term preservation format)

He notes also that there are other formats that will still need consideration as well, but this appears to be a sound start. It would make a nice exchange format that would allow transport between systems - and would make for a good demonstration.

Jay raised his issue that “i still dont understand what a datastructure is and i dont under the relationship between our logical model and our physical model”. He asked whether we have a description of structure in the LogicalModel, and how does it relate to the structure in the PhysicalModel. In particular, he noted a concern with DataStructures being a Collection of DataRecords.

Larry suggested that the best test of this model will be when we apply it to some real world data and are able to describe it at a relevant level of granularity. Ornulf suggested that they have this in RAIRD - but also that they are not strictly speaking concerned with the physical.

Ornulf also noted that the distinction between physical and logical is that the logical is able to be queried, where the physical is primarily about storage and serialisation. Jay agreed with this, and then noted that part of the concern may be whether we have sufficient detail in the Logical model that allows us to represent a Schema.

Larry noted that the DataStructure had different semantic interpretations in the Edmonton sprint - initially as a Schema, but later it was modelled as a Container for DataRecords. Larry suggested that a Schema in the relational database sense had an implicit order - and hence blurred the lines between the Logical and the Physical - but Ornulf and Jay disagreed here, arguing that the order is implied. Ornulf felt that the DataStructure is consistent with a Schema as well as a Container.

Steve queried Jay as to whether there was anything within the proposed model that would limit what he might want to do (i.e. have we boxed ourselves into a corner). The group noted that the DataRecord remains unordered in the Logical model, and the order is applied in RecordDataPointOrder in the Physical model. This allows for extensions from DataRecord and from PhysicalLayout.

Jay did make a point that we may be “reaching for a lot” with the RecordRelation, and that it may be a limitation. We will need to explore this further.

Where to next?
We don’t need to account for every possible alternative - but rather to allow the extensions from appropriate points.
In reference specifically to the W3C proposal, the group could see some good complementarity between the two, and noted that there may be some compromises that need to be made.

Next week:
The aim for next week’s meeting is to update the objects in the Physical model with the relevant properties - drawing on PHDD and the W3C tabular model.
Jay also requested that we aim to flesh out some of the documentation of the model in a more

Ahead of next week’s meeting, group members are requested to review the PHDD/TabularData comparison document completed by Achim and the GESIS team, and also the documentation on the objects in the Physical model. We will then discuss the object properties in the meeting next Thursday.

Next meeting: Thursday May 5th, 1500 Central European Summer Time.


 April 21, 2016 - Meeting notes

Meeting notes - Thursday April 21, 2016

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern, Gillian Kerr, Barry Radler, Ornulf Risnes


Welcome to everyone, congratulations to Flavio on the birth of his new son.

Group agreement on keeping this (Thursdays at 1500 CEST) as the new meeting time in the short term.


Dan, Larry, Jay and Barry participated in the NADDI sprint in Edmonton, and provided a review of the progress there.

Changes were made to:


Specific changes:

  • Viewpoint was moved from DataStructure to DataRecord
  • New object is “DataStoreLibrary” (formerly Catalogue).
  • InstanceVariable is now clearly related to DataRecord only

Observation-Datum relationship is now (mostly) cleaned up, following discussion with Barry.

Jay had a query about what is the relationship between a DataStructure and a DataStore. Dan indicated that a DataStructure is USED by a DataStore.

  • A DataStructure CONTAINS DataRecords

  • A DataStore USES a DataStructure

Jay clarified his question to whether we are able to include different types of DataStores (e.g. relational data store, table, schema, graph, ...). Ornulf was confident that this was the case based on his experience with RAIRD. Jay also queried whether a DataStructure could contain different types of DataRecords, and the belief is that this is manageable.

Observation is going to need to support both a “Capture” and a “Generation” that produces a Datum. But the Observation location (and link to the Capture model) is now provided.

Also had some involved discussions about the Datum. The Datum is a general object, and then the committing of the Datum to a Record through a DataPoint makes it now a Copy.

 


PhysicalDataDescription:

The physical ordering of records and layouts is now also included in the PhysicalDataDescription, along with the PhysicalLayout, etc. (Suggest also referencing the PHDD for this. Added note: FYI it has not been determined yet if PHDD will be published, due to the publication of similar vocabularies since the development of PHDD - wlt). The properties of the external file are expressed in “verbal” form (isDelimited, the delimiter type, etc.), but the focus is on the RectangularLayout in the first instance.

Ornulf noted that this is only a small subset of the possible serialisations. He also noted that it isn’t possible to model ALL of the possibilities, but that we should at least model some of the common ones (hence the first examples of Rectangular/Cube/Event).

Jay raised, as an example of this, the work he was doing with Splunk, which assumes initially that ALL data is unstructured, and treats the data entirely as key-value pairs. Order is then extracted from the Splunk content using regular expressions and then “late-binding” a data model to enable a particular data structure.


Conclusions:

General consensus is that the LogicalDataDescription modelling looks reasonably sound at this point, and Physical is a good start, but does need (possibly after the initial release) to allow for extensions.


Follow-up:

Ornulf had a related question of what happened to the discussion about the variable cascade and where to tie in the SentinelValueDomain.

  • Most at the sprint agreed that the tie should be to InstanceVariable

  • Dan Smith has proposed tying to the RepresentedVariable

  • The InstanceVariable version has been accepted for now as a starting point to allow the model to be disseminated for discussion - but it is recognised that there is still not a consensus here.

  • Larry noted that there may also be an option for separating the Categories (i.e. the ConceptualDomain) and the Codes (ValueDomain) at different levels.


Preparing for the Norway Sprint

Ornulf queried also whether we will be working on this in Bergen sprint. There will be no work directly - but DD will be important input into the Sprint in terms of the CODEBOOK view - as the Codebook group need these objects to build the codebook.

Larry suggested that it would be “ideal” to be able to express in DDI how to READ the DDI and generate code in a specific package to allow you to read the DATAFILE.

Needs ahead of the Sprint:

  • Get the PhysicalDataDescription bedded down for a RECTANGULAR file

  • Reference PHDD in order to do this

Group agreed to an additional meeting next week (Thursday April 28) to facilitate progress on this.


Upcoming meetings:

1. Additional meeting: Thursday April 28, 1500 Central European Summer Time

2. Regular meeting: Thursday May 5, 1500 Central European Summer Time

3. Regular meeting: Thursday May 19, 1500 Central European Summer Time

(Norway Sprint is the week of May 23rd)

 March 24, 2016 meeting

Meeting Thursday 24 March 2016

Attendees: Dan Gillman, Larry Hoyle, Jon Johnson

Discussion: Dan presented some of the general findings from his evaluation of the variable cascade. He feels there is no obvious or canonical way to separate the levels in the cascade. The decisions need to be based on re-usability. Jon joked that this might mean we have 57 levels instead of 3. Group agreed – 57 levels seems about right!

The discussion then turned general as Jon asked what kind of model are we trying to build: information, metadata, or data? Diving deeper into this, the group realized the distinction might be active metadata (metadata model) versus descriptive metadata (information model), and the distinction or boundary between the two might not be clear. Dan gave the example of the sample description model. An instance of this model could easily be seen as both informational and active. Group agreed that DDI is trying to handle both active and informational metadata. Group also agreed that these needs might be counter to each other. This tension might explain why solving some of our problems seems so hard.

Returning to the variable cascade, the group agreed that knowing the conceptual domains for a variable is an information model concern, whereas knowing individual codes or representations might be more of an active metadata issue. Where the other attributes lie was not discussed. However, the domain issue, which is what started the investigation, might be an example of the deeper concern.

Group noted that our scheduled meeting in 2 weeks will coincide with the NADDI meeting. Several members will be at NADDI, so it might be useful to postpone the meeting.

 Dan Gillman position paper on datum management - response to 10 March meeting

The following is Dan's response position following up the item from the meeting of March 10, 2016.

Email from Dan (24 March, 2016) as follows:

Group,

As promised, I have looked closely at the variable cascade to determine if it makes sense to attach a SenVD to RV instead of IV. I am unable to resolve this. In part, this is due to some observations I lay out below:

1) There are several independent cascades in use together –
a. SubCD to SubVD
b. SenCD to SenVD
c. Intended Datatype to Actual Datatype
d. Dimensionality to Unit of Measure (though Dimensionality may not be part of the DDI as of now)
e. There’s a connection between units of measure and datatypes. Suppose there is a variable called area of property on property owners. The datatype of such a measure is real or float. But is this distinguishable from some linear measure, such as the length of a driveway? Yes, through characterizing operations. An area measure allows for the calculation of perimeter and area, whereas a linear measure only allows for length.
2) At the bottom, there are several considerations that are independent of each other –
a. MIME type or application
b. File location
c. SenVD, since the codes for missing categories might not be application specific
d. SubVD, since different codes or even a different datatype (integer vs real, for instance) might be used to represent the same data

A star schema view from the perspective of a variable is schematically illustrated below -

[Schematic, reconstructed from garbled ASCII art: the Variable sits at the centre of a star, with spokes to SubCD–SubVD, SenCD–SenVD, Actual Datatype–Intended Datatype, Unit of Measure–Dimensionality, MIME Type, and File Location; Concept underlies the conceptual domains.]

This approach allows one to search against any subset of criteria. But, we lose the re-usable conceptual and represented variables because there is no variable cascade.

The main question we have to answer is what attributes change the CV, RV, and IV, and what attributes don’t (if that even makes sense).

I have to admit I am feeling at somewhat of a loss. My main point of concern from the position Dan Smith is advocating is that asking both the SenVD and SubVD to attach at the RV seems to require more RVs and results in gratuitous differences. The SenVD often changes with the application or the MIME Type, which are below the RV in any use case we’ve discussed.

So, I don’t have a solution.

Yours,
Dan

 10 March 2016 minutes

Meeting minutes, 10 March 2016

Attendees: Dan Gillman, Dan Smith, Barry Radler, Flavio Rizzolo, Jay Greenfield, Larry Hoyle, Ornulf Risnes, Steven McEachern

(Note: Steve’s minutes only begin halfway through the meeting)

Major discussion about the location of sentinel values - or more specifically where should the sentinel codes and categories be located.

In the early stages of the meeting, Dan S and Dan G had an involved discussion which presents the two core perspectives (it may be best for you both to describe your positions here):

One option is that the InstanceVariable is the Variable associated with a MIMETYPE (and not a single instance of a variable)?

Dan G sees the potential for sentinel CATEGORIES in the RepresentedVariable, and then the CODES in the IV.

Dan G: we could define IV as something that changes when the MIMETYPE changes?

Dan S: makes the note that most organisations will have one (or homogeneous) system(s) which would manage this - that would enable the management of the missing values in this way.

Dan G: warns against being overly concerned about alignment with GSIM, given that the DDI development has been much more detailed and critical in this area than GSIM has been.

Jay/Flavio: suggestion for a level between RV and IV of a PLATFORM variable

We need to decide what is going to be split where (e.g. codes, categories, data types).
Suggestion that we may need a fourth level (although this is not entirely clear): thus this could be Conceptual/Represented/Platform/Instance

Ornulf also noted that this could be problematic for the current usage of IVs in RAIRD, which would use the Instance rather than Platform variable.

Dan G suggested that keeping the set of CATEGORIES will be the key requirement in these splits. It could be that this is necessary to retain at the CONCEPTUAL level.

Dan G to write a proposal, and then follow-up action is to be determined within the group in the next week.

Next meeting:

Steve has requested to MOVE the meeting to the start of the day on Thursday, rather than the end of the day, due to changes in his commitments (for Fridays). Proposed new time is the hour prior to the Technical Committee meeting.

In addition, due to changes in daylight savings in the US (March 15), Europe (March 27) and Australia (April), the timing of the next two meetings will both be affected by daylight savings time changes.

Thus the proposed times for the next two meetings are as follows:

Thursday March 24, 1400 Central European Time

  • 0800 Madison, Lawrence
  • 0900 Washington DC, Ottawa
  • 1400 Mannheim, Bergen
  • 0000 (Friday) Canberra
  • 0200 (Friday) Christchurch

Thursday April 7 onwards: 1500 Central European Time

  • 0800 Madison, Lawrence
  • 0900 Washington DC, Ottawa
  • 1500 Mannheim, Bergen
  • 2300 Canberra
  • 0100 (Friday) Christchurch


 Datum discussion meeting, 23 February 2016

Attendees: Jay Greenfield, Dan Gillman, Barry Radler, Steve McEachern

Continuation of meeting from 16 February.


Core of the issue per Jay: by attaching the IV to the DataPoint – and not the Datum – we don’t (and can’t?) know the IV of a Datum until it is put into the DataPoint.
Hence the square peg/round hole distinction.

Dan: we don’t want a new IV every time we have a copy of a Datum. We want it every time we change the FORMAT/FILETYPE of the Datum.

From last meeting’s notes:
"So what does COPYING data do?
- It changes the InstanceVariable
- It doesn’t change the Datum
What does a TRANSFORMATION do?
- It changes the RepresentedVariable - there is a new ValueDomain
- It changes the Datum"

Hence in the ICPSR example: if we download the data from ICPSR, are we transforming or copying the datums?
Barry: and if we download in two different formats (SAS vs SPSS), do these data sets share the same IV? (Given different sentinel values).

With different mime types, we will likely have different value domains, and different properties at the bit level.
Do we care about this within DDI?
We care that when we move between SPSS and SAS, that we are changing the value domain, and

Dan: it is the MIMETYPE differences that create the IV differences – not all the copies.
Therefore:
If we have a “Record” (not our current notion, but here defined as:) – an ordered set of DATUMS (not DataPoints)…
How do we find all of the locations where I have copied that “Record”.

Jay: to do this, I would query the DataDescription to find all the DataSets using the InstanceVariable – and then list those DataSets.
Dan: this would list all the versions of the same MIMETYPE (SAS versions) but not of another MIMETYPE (SPSS versions).
Why? Because you may have multiple instance variables associated with different copies of a DATUM? Why: because they have different IVs.
Jay: then you could instead query the RV, or even the CV
Dan: but this may then blow out the results of your query – I have all references to the CV “Sex”, but what about the measure of the Sex of “Dan Gillman”.
I.e. your ability to track an ID (e.g. Person) by its Unit (“Dan Gillman”) is limited under that scenario.

Jay: you can now find all of the results associated with a particular CV.
Dan: True - but you still can’t find the Datum that corresponds to a particular unit?

Barry: We possibly need a different frame of reference. For example: compare the results of two analysts – the analytic results would be the same.
Dan: we could lay this aside – but it seems likely to return to us when we go to the point of doing other things to the data (e.g. TRANSFORMATIONS).
We either care about all of the copies, or we care about none of them.

The result of the Observation PROCESS is the DATUM.
We can’t track all of the Copies of that Result – they are the Observation.
Dan’s claim: there is power in knowing where all of the copies are.

Some possible:
The sentinel values in Data Capture can also change (e.g. X in a form becomes 0-1 in a distributed DataSet – this would be a Transformation)
Observation transformation: The process of writing down the Observation
Datum transformation: The
Steve’s summary:
The differences between sentinel values are reflected in the IV. But can we follow the Datum?
We have been conflating the need to FIND all of the Datums using an IV (Jay), with the need to FOLLOW an individual Datum over its lifecycle (Steve – provenance). We need to better articulate the Provenance use case to make this distinction.

 Capture-Datum sub-group meeting, Feb 16th, 2016

Sub-group meeting 16 February 2016 - Capture-Datum-DataPoint-DataDescription relationship

Participants: Dan Gillman, Barry Radler, Steve McEachern, Jay Greenfield (30 minutes in)


What is our aim? To understand the “lifecycle” of the Datum and DatumSet, and how they impact on Data Description, Capture and Process.

 Dan G: A simple case: data is “captured”, collected, added to a dataset and provided to the “client”. e.g. a question about Gender.

For an individual case, we don’t want to record the fact of recording the respondent’s Gender. Rather, we just want to note that we have made the Observation of the respondent’s Gender.

 There are then multiple copies that are made for the various versions (copies) of the Datum that are made in different stages of the processing cycle (recording, processing, dissemination, ...) in different formats (SAS, Stata, SPSS, ...). They are all copies, but they all share something: the concept of the person’s Gender has been captured in the Datum. We can’t “recapture” that - all we do is have the “cloud of copies” of that “original” Datum.

So conceptually we know that Unit “Dan Gillman” has the “Gender” of “Male”.

We could transform how we record it - from recording M and F, to recording 0 and 1. 


Comments: 

Barry: is COPYING a type of transformation? Dan: NO. Argues that we need to separate the idea that a pure copy of something, is different from changing the characteristics (values, codes, ...). The copy is repeating the same thing, the recoding is not.

What if we change the metadata between two different software versions (e.g. SAS contains things that SPSS does not). Are we changing the InstanceVariable? Yes. Are we changing the Datum? No.


So what does COPYING data do?

- It changes the InstanceVariable

- It doesn’t change the Datum

 What does a TRANSFORMATION do?

- It changes the RepresentedVariable - there is a new ValueDomain

- It changes the Datum


Steve: does this imply 

A) that Datum is (or should be) associated with the CV (OR possibly the RV)??    AND

B) That the “cloud of Datums” share a Conceptual OR Represented Variable?


Barry gives the MIDUS example where they use RepresentedVariable to link different InstanceVariables with the same concept (e.g. Gender) but that use different codes (and even additional Categories). If it’s ConceptualVariable - then the Datum cloud shares the same set of CATEGORIES but not necessarily the enumerated CODES associated with the categories.

So what is the correspondence then between the categories? This might be managed through a CorrespondenceTable (in the Classifications model).


Example: Marital Status

Single (=Not married) - Married

Single (=Never married) - Married - Widowed - Divorced

Conceptually the two could be harmonised through a correspondence.

(Jay notes: There is a similar situation with harmonising Race and Ethnicity)
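A small sketch of such a correspondence (one plausible mapping, invented for illustration; it does not reproduce the Classifications model's CorrespondenceTable class): the detailed Marital Status categories map many-to-one onto the coarse ones.

# Sketch: a many-to-one correspondence between two Marital Status category
# sets. The mapping choices are illustrative, not an agreed harmonisation.
CORRESPONDENCE = {
    "Single (never married)": "Single (not married)",
    "Married":                "Married",
    "Widowed":                "Single (not married)",
    "Divorced":               "Single (not married)",
}

def harmonise(detailed_category):
    # Map a detailed category onto the coarse scheme.
    return CORRESPONDENCE[detailed_category]

print(harmonise("Widowed"))   # -> 'Single (not married)'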


What does this imply for Datum?

Dan: We don’t record each individual Datum - but the “cloud of Datum copies”. This becomes tantamount to the Observation. We will know where the copies are because they share the same InstanceVariable.

Dan suggests here that we do not want to conflate the Datum with the InstanceVariable. We can continue to make copies of the Datum - we only need the InstanceVariable when we want to do something with the Datum. 

 Steve - Is this a Physical–Logical distinction:

  • The Datum is the physical
  • The DatumCloud is the logical


What actually changes the Datum?

  • Transformation CHANGES the DATUM
  • Copying and Re-recording duplicates but does NOT change the Datum
  • From this we note that we need to be able to distinguish a TRANSFORMATION from a COPY.
  • Data are TRANSFORMED when something fundamental about the Datum has CHANGED - either the underlying CONCEPT and/or the CODE(s) have changed.

Further examples (and complications):

What if we move a DATUM from SAS to Stata? The SUBSTANTIVE values would not change, but the SENTINEL values would change. So this move would require Transformation of SOME of the data - the SENTINEL values would change, but not the SUBSTANTIVE values.
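A sketch of that partial transformation (the sentinel codes below are assumptions for illustration, not verified SAS or Stata conventions): substantive values are copied through unchanged, while sentinel values are re-coded for the target system.

# Sketch: moving a column between systems re-codes only the SENTINEL values;
# substantive values pass through as plain copies. Codes here are invented.
SENTINEL_MAP = {".": ".", ".A": ".a", ".B": ".b"}   # source code -> target code

def transfer(values):
    out = []
    for v in values:
        if v in SENTINEL_MAP:
            out.append(SENTINEL_MAP[v])   # sentinel value: TRANSFORMED
        else:
            out.append(v)                 # substantive value: COPIED unchanged
    return out

print(transfer(["42", ".A", "17", "."]))   # -> ['42', '.a', '17', '.']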

 Jay’s suggestion - the SIGNIFIER change should be handled OUTSIDE the DataDescription. It is noted in the PROCESS of recording (CAPTURE?).


So where to next?

We seem to be agreed on what a DATUM is and what some of the ancillary processes (e.g. TRANSFORMATION and COPYING) now imply. We now need to agree on what this then implies for the three areas of the model - CAPTURE, DATA DESCRIPTION and PROCESS (and probably METHODOLOGY).


Next meeting: Tuesday 23rd, 8am US EST (7am CST, Midnight Canberra)

ALTERNATIVE: AFTER THE METHODOLOGY MEETING ON MONDAY?? (i.e. MONDAY 2pm US EST).

 11 February 2016 meeting minutes

Meeting Minutes 11 February 2016
Attendees: Barry Radler, Dan Gillman, Jay Greenfield, Steve McEachern, Larry Hoyle, Flavio Rizzolo, Chris Seymour, Ornulf Risnes, Jon Johnson

Started with discussion of the Presentation of Definitions - LINK.

Discussion of Datum/DataPoint:

Dan G: Need to think about what we want to talk about. Datum can only be RECORDED once. After that, it is always a copy. Hence the need for a "DatumCopySet".
Larry: Is the DatumCopySet a Capture?

DG: we need to understand how Datum behaves, and how they get stored and moved throughout a system. We can’t talk about Datum without a DataPoint - a Datum gets recorded in a DataPoint. DataPoint is structural while Datum is conceptual.

Jay: Can you talk about the copies, without talking about the processes that created the copies? Dan: copies are so frequent that we may not need to capture that.

Ornulf raised concerns about overloading the model - such as:
- Tracking copies of Datums
- DataStructure cascades
For example in Datum copies, can we get sufficient information to assess if we have the same underlying designation and value for a Datum copy? The remainder would/should be left to the system to manage. Dan G argues that the proposed approach does this.

Jay asked the broader question of what problem the DatumCopy is solving. Dan: the models don’t separate sufficiently. For example, in a recent model, data structures couldn’t be addressed without them ALREADY being populated with data. We need a way of distinguishing the structure from the recording. Steve noted: This goes down to the level of Datum and DataPoint, and then has parallels in the aggregate structures.

DatumCopySet: refer to the same Concept and Value, but also use the same Signifier. In doing this, we link the Capture and the Storage of that Datum.
Dan also noted that there is no need for an “original” Datum in this approach.
Larry: how do these then get recorded and represented? Ties into the discussion of the VariableCascade. For example, the SPSS and SAS versions would have different IVs, but the same RVs (and same SUBSTANTIVE value domain). Here the DatumCopySet then has a parallel with the RV (sharing the same substantive value domain).

Larry: Can we say a Capture is putting a Datum in a DataPoint?


Suggestion: Steve, Dan and Barry to work through a usecase to demonstrate how this would work. Sub-group to meet asap, and report back to full group.


DataStructure - DataRecord relationship

Flavio walked through the first of the three DD views the other working group had developed. The three views are shown below:

VIEW ONE

VIEW TWO:

VIEW THREE:

Key Suggestion of the View was to have the DataRecord (the recorded content) point to its DataStructure. In the basic case this is an Instance DS, but moving to the broader case (per Model 20160208) we can also have a Cascade of Structures.

Question from DG: do we need the container to be holding Variables? Couldn’t it be for other things?
Flavio: these are DataStructures for capturing variables - not for other things. We would need other classes for other types.

Discussion of the three views to continue at the next meeting.

Next meeting:

Thursday 25th February, 2200CET.

Subgroup on capture-datum relationship to meet in the interim.


 Pre-reading for Meeting Feb 11, 2016

This is the current version of outputs of the two subgroups. Outputs are for discussion at meeting Thurs Feb 11


Definitions sub-group:

Summary of current definitions - LINK TO GOOGLE DOC


Data Structure Cascades sub-group:

Current version of model (Flavio Rizzolo, 8 Feb 2016):


 28 January 2016 meeting minutes

Data Description Meeting Minutes 28-1-2016


Attendees: Dan Gillman, Larry Hoyle, Dan Smith, Barry Radler, Jay Greenfield, Flavio Rizzolo, Steve McEachern, Ornulf Risnes

Apologies: Chris Seymour


Review of Monday’s meeting:

  1. Start on Tracking Datums: for further discussion

  2. Heterogeneous records


It was noted that heterogeneous records have different IVs - and therefore the record would not and could not be reused. As such, Flavio added the association back to the DataRecord and updated the model (below).


Flavio noted in an email prior to the meeting:

“The only change w.r.t. Monday is the addition of an association between Data Record and Instance Variable to support heterogeneous Data Structures, i.e. containing different types of records. The idea is that when the Data Structure is homogeneous, i.e. all records are of the same type, then IVs must be associated to the Data Structure. That's the traditional rectangular file/table case in which you design your schema way before having any records in it. In other cases you don't have a fixed schema in advance that applies to all your records because they might be of different types. For those cases you won't know your IVs until you get your records, so the IVs are associated to the Data Record instead.

There is a similar argument to be made for Viewpoints: we could associate them to Data Structures when they are homogeneous, but when they are heterogeneous it will make more sense to associate them to the Data Records themselves.”


Dan G.’s concern is that the DataStructure may not be reusable in the case of Heterogeneous records. Flavio compares the case of a Table, where a Column in a table may be repeated, but not the IVs. Larry asked whether it would be the same Concept being re-used each time.

Dan G. comes back to the question of whether a DataStructure is:

  • an empty structure (e.g. a blank spreadsheet)

  • a labelled structure (e.g. a spreadsheet with labelled columns and rows)

We might conceive of the structure as having three layers:

  1. The empty data structure: the container eg. table (rows and columns)
  2. The concepts in the structure: e.g. RV (concept and data type)
  3. The content: data in the cells

Comments:

  • Jay raises a concern that what Dan is describing sounds like a workflow. Dan G. queries whether this is a workflow or reusable items? Jay is concerned that we wouldn’t be able to capture all of the reuse of the items over time, due to the need for transformations, etc. Dan G. agreed, but wanted to explore how much we would be able to retain.

  • Flavio suggested we may be able to use the Conceptual/Represented/Instance distinction to manage Dan’s distinctive layers. The missing part would be the container (empty data structure) - we would need Row and Column artefacts to be able to do this.

  • Dan G. suggested we could possibly talk about the Shape of the data - which would be closely associated with DataTypes (possibly using ISO11404).
  • Jay suggested that an alternative would be to use the Member/Collection structure to build up the structure. Flavio felt that these were not really the same thing - the empty structure is really a description of the “physical” structure. Dan indicated that the semantics are different - variables have a Measurement semantic, and structures have a Computational semantic.


Returning to IV association with a structure: Larry notes that this has the problem that it is linking a DataStructure with a specific Capture. Suggested that there may be use for a “RepresentedDataStructure”, which Dan S. notes has a parallel with DataSet in ISO11179.

Dan S wondered about DataStructure:

  • Is it a collection of DataRecords?

  • Is it a collection of InstanceVariables?

  • Is it a RecordType? (Yes, in the case of HOMOgeneous records)

  • Might need to separate out the different functions

Dan’s interpretation appears consistent with the Drupal definition - but it appears to have morphed somewhat. Flavio suggests that the additional characteristic of DataStructure is that it contains a Schema, as well as the DataRecords.

Dan G. notes that we appear to be mixing up definitions as well - we need to bed these down.

We also seem to be struggling with the conflation of the physical and logical attributes of the structure. Larry notes that we can cover a lot of the Schema characteristics, using IVs, DataRecords and Members/Collections. Flavio suggests the inclusion of the Schema - to separate the distinction between the schema and the grouping of the DataRecords.

Barry summarised the problem as largely being the conflation of the Instance and Represented characteristics of the data.

(Dan G.’s definition of DataRecordType: a set of DataPoints, possibly associated with IVs, that provide an empty structure that can be used to create a DataRecord.)


Where to next:

Flavio to make an additional revision to the model to reflect today’s discussion (possibly adjusting DataStructure to DataRecordType).

Also need to clean up the definitions: Dan G. suggested describing and defining the different objects we are going to need at different points in data production.


For next meeting:

Two work groups to work over the next two weeks

  • Barry R., Steve, Dan G. - definitions

  • Flavio, Jay, Larry - cascades

Teams to report back progress at the next meeting.


Next meeting:

Thursday February 11, 2200 CET


 25 January 2016 Model review extra meeting

Minutes of model review meeting, 25 January 2016, 2200-2300 CET

Attendees: Flavio Rizzolo, Barry Radler, Jay Greenfield, Larry Hoyle (from 2215), Chris Seymour, Steve McEachern (until 2240).

Apologies: Dan Gillman


Flavio overviewed the updates to the model he'd identified from recent meetings. Diagram is included below.


There were three main changes:

1. containsInstanceVariables - relates InstanceVariable to DataStructure

  • allows homogeneous DataRecords
  • do we want HETEROgeneous DataRecords in a DataStructure?

2. ordering of InstanceVariables

  • Needed to add “realizes” to allow IV, DP to be Members of a Collection
  • Added to allow ordering of IVs

3. isViewedFrom - relates ViewPoint to DataStructure



What is missing?

A subset of IVs that highlights that subset as Identifiers.

This could be done by making the Subset an object which is a subclass of a Collection. This is not in the model right now, though.

Also, we may not want to call this "Identifier".

This would allow providing a DataRecord “key”.

This comes back to the question of whether we have HETEROGENEOUS records in a DataStructure.


Comments on Flavio’s revisions:

In trialling the proposed model, Jay found he could model openEHR: the IVs could be modelled first, and then progressively the Archetypes.

The only issue he had was in the modelling of a fact table. He felt this was related to Flavio’s issue over the identifiers.

He also felt he understood the question of how to represent an empty table: the relationship from IV to DS allows this.


But linking heterogeneous records still appears to be problematic, and Jay indicated he would want to do this to describe, for example, relationships.

Flavio noted that the same issue exists with a structure like an XML database.

Flavio noted that it may make sense to make the options more flexible - e.g. if you don’t know what you’ll be getting in advance - which means that you can’t necessarily put NULL values on the DataRecord.


Examples of heterogeneous records:

  • Census PUMS files (both person and HH records)

  • Graph structure (triples are all different)

  • person information arrives with varying amounts of content.

  • XML file

  • JSON file??


Larry: one thing he still can’t see. If you have a traditional table and want to turn it into a fact table, does this model address the Datum/DP versioning problem that had been discussed earlier? Larry suggests that the Datum/DP needs some sort of identification to allow us to track it.

Agreed that this requires reconciliation between Dan G.’s points and Flavio’s model. Focus on this at next meeting this Friday.

Larry also noted that we need to consider the transformation when we aggregate the Datums into an Aggregate Statistic. There is also a question of whether we would try to follow all the Datums that are included in the Aggregate.

Steve left the meeting at this point (10.40pm).

Further minutes from colleagues.


Next meeting: Thursday January 28th, 2200 CET.

Agenda:

Continuation of model review

Reconciliation of model issues between Dan Gillman's powerpoints and Flavio’s model

 14 January 2016 Minutes - Data Description Meeting

Data Description meeting, 14 January 2016, 2100 CET

Attendees: Barry Radler, Flavio Rizzolo, Dan Smith, Jay Greenfield, Ornulf Risnes, Steve McEachern, Dan Gillman (from 21.40 onwards)

Apologies: Larry Hoyle


There were three outstanding questions from the previous meeting designated for discussion - see previous meeting notes below.


1. Relationships between DataPoint and DataStructure

It was agreed to remove the relationships between DataPoint and DataStructure

  • specifiesOrder

  • specifiesIdentifierOrder

And then add relationships from DataStructure to InstanceVariable - the same two relationships above

Questions on this point:

  1. Query from Flavio - link to DataRecord or DataStructure? Dan S. argued for DataStructure, as all DataRecords in a structure are the same - AGREED.
  2. What does DataRecord provide then? Groups together different Measures, Identifiers and Attributes with specific roles. (Note that DataRecord needs a clarification of the definition). Ornulf clarified that the original point of the DataRecord was to group the combination of Datums (each with its InstanceVariable) and their Roles into a Collection.

Dan’s argument: DataRecord and DataStructure store data, but Viewpoint stores relationships

Flavio: DataStructure has homogeneous DataRecords only (confirmed by Ornulf)

THUS - need to add to DataStructure definition that it is a homogeneous set of DataRecords.


Agreed that the following needs to be added to the model documentation (a sketch illustrating these constraints follows the list):

  • A DataStructure can have no DataRecords and therefore no DataPoints - i.e. no records yet collected. It must however have IVs to define what the DataRecord should look like.

  • A DataStructure is a Collection of homogeneous DataRecords

  • A DataRecord must have DataPoints.

  • The DataPoints are then populated with Datums

  • Ordering of IVs would be OPTIONAL (not always appropriate in a Logical structure)
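A minimal sketch of these agreed constraints, with Python used purely for illustration; the names and shapes are simplified and are not the Lion/Drupal definitions.

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class InstanceVariable:
    name: str

@dataclass
class DataPoint:
    instance_variable: InstanceVariable
    datum: Optional[Any] = None                 # populated with a Datum once a value is recorded

@dataclass
class DataRecord:
    data_points: List[DataPoint]                # a DataRecord must have DataPoints

@dataclass
class DataStructure:
    instance_variables: List[InstanceVariable]                      # required: defines what a DataRecord should look like
    data_records: List[DataRecord] = field(default_factory=list)    # 0..n homogeneous DataRecords
    iv_order: Optional[List[InstanceVariable]] = None                # ordering of IVs is OPTIONAL

# an "empty" DataStructure: IVs defined, no DataRecords collected yet
age, sex = InstanceVariable("Age"), InstanceVariable("Sex")
ds = DataStructure(instance_variables=[age, sex])

# later, a record arrives and its DataPoints are populated with Datums
ds.data_records.append(DataRecord([DataPoint(age, 42), DataPoint(sex, "F")]))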


Further questions:

Dan: How do we associate specific Viewpoints with the DataStructure?

Jay: Can a Viewpoint describe, for example, an RDF triple? Dan suggests that this might be possible to do with the use of Roles (e.g. Predicate is defined as an Identifier role for an IV)

Ornulf noted that some of the uses here are documented in the paper he and Dan authored at the Dagstuhl sprint:

https://docs.google.com/document/d/1-vxWdastNsTWMf8qlR35wj1128FNSX-4YBrA_MJBaLk/edit 


Different Viewpoints could be layered on top of the DataRecord. You also don’t necessarily need to use the Viewpoint.


Dan S. noted that there are three layers that can be used:

  1. Logical description of a DataStructure
  2. DataRecords and DataPoints
  3. Viewpoints

You will always need to use the DataStructure, but the other two will be optional

DataStructure will therefore have the following relationships:

  • Viewpoints (0 to Many) associated with a DataStructure.
  • DataRecord (0 to Many) associated with a DataStructure.
  • InstanceVariable (specifies Order and specifiesIdentifierOrder)


2. ORDERING:

Agreed that Ordering of DataRecords in DataStructure should be possible but OPTIONAL.

Ordering of InstanceVariables in a DataStructure still needs to be clarified.


3. Usecases

This point wasn't covered directly in the discussion. Agreed that there is a need for testing usecases against the model now, but need to finalise the clean-up of Lion (per Wendy Thomas's review - see minutes below). Agreed therefore that Flavio would update Lion/Drupal, and we would have a special meeting Monday Jan 25 to review this, ahead of the regular meeting on Jan 28. Steve, Jay and Flavio will convene the review meeting, with others welcome if available.


Actions:

  1. Flavio to update the model, and then Flavio/Jay/Steve to meet and confirm. (Special meeting invite for Monday week meeting).
  2. Flavio to circulate model updates to Dan G as well.
  3. Dan G. to review his position on Datum reusability, in light of model updates


Next meeting(s):

a) Review meeting Monday Jan 25th, time TBC.

b) Regular meeting Thursday Jan 28th, 10PM CET, GoToMeeting:

https://global.gotomeeting.com/join/148887013

 

(Note that meeting time will return to CET 10pm for next regular meeting.)

 Comments from Wendy Thomas on Lion content of DataDescription model

Wendy Thomas has provided a review of the current objects and properties in the Lion version of the model. The group will need to review these ahead of the next release.

 17 December 2015 meeting minutes

Meeting minutes 17/12/2015

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern, Barry Radler, Ornulf Risnes, Chris Seymour, Dan Smith


Dan Gillman opened with a review of the PPT he provided earlier this week on “Tracking Datums”.

Key points in Dan’s proposal:

  • a DataPoint should exist only if its “parent” (a DataStructure) exists.

  • Datum is misnamed (it is actually a group of things)

  • DataPointInstance is the association of a Datum with an InstanceVariable

  • ValueDomain in the model could be either Substantive or Sentinel


Jay: What about the collection of copies of the Datum? What is this thing (if not Datum)?

Larry: How do we identify the particular Datum that is put into the DataPointInstance?

Jay asked whether Dan wants a class to indicate that all of the Datums represent the same conceptual thing. Dan agreed.

Ornulf: if we have access to the Variable Cascade, can we infer the relevant concepts associated with the Datum?


Ornulf: What does this add that we don’t already have?

  • Dan: didn’t think we have a coherent way of talking about this from the perspective of the DataPoint.

  • Ornulf indicated that he believes we can navigate much of the content in Dan’s model using the existing model

  • Dan: argues that the current model conflates the DataPoint with his new DataPointInstance

  • Dan G and Dan S both argue that the model doesn’t allow us to talk about an empty DataStructure. Dan S notes that DataPoints are NOT reused as currently specified - this appears to be a point of clarification needed between Dan’s and Ornulf’s interpretations of the model

  • Ornulf: DataPoint is related to a Record and to an InstanceVariable

  • Dan: as soon as it is associated with an InstanceVariable, a DataPoint has a relationship with a single Datum.


Jay’s interpretation was that the RHS of Dan’s model could improve the model, while the LHS is more complicated. He suggests that there are two roads:

  1. Does this improve what we have?

  2. Assuming that we understand that we are storing an individual copy, ... (missing some detail on this point - please add comment here)


Dan: aim of his model is trying to associate a copy of a Datum and an InstanceVariable into a DataPointInstance.

Ornulf: not comfortable with where we are at. He argues that we CAN re-use DataPoints, and that we can track DataPoints (he is currently doing this in RAIRD). Dan asks whether Ornulf can reuse STRUCTURES. Jay suggests that what Ornulf is doing is actually using DataPointInstance (but naming it DataPoint, as is currently in the model). The question here is fundamentally about reusability.


Larry: Is what is "in" the DataPointInstance a Signifier? And is DataPoint the LOGICAL and DataPointInstance the PHYSICAL?

Dan: key argument is that we have the concept we want to represent (e.g. the NUMERAL five) and a series of strings that signify the concept (e.g. different strings of 5, IV, ...)

  • Conceptual: the NUMBER five

  • Represented: the SIGNIFIER - the NUMERAL five

  • Instance: the actual written down recording

  • (COMMENT FROM STEVE: Colleagues - have I got this right?)

Dan: what isn’t currently covered is the fact that DataPoints can be RE-USED. Ornulf argued that he thinks that’s covered, but Dan's position is that we don’t yet have the “empty bin”.

Dan S./Larry: are we talking about the difference between a logical and a physical, between empty and populated, ...?

(Dan G. left the meeting at this point)


Dan S. suggests that everything that Dan G. is covering is represented in the current version of the model in Lion - in particular, we can address a DataPoint from the InstanceVariable and DataRecord

HOWEVER, Dan S. did have a concern that Ordering in the DataStructure is ordering DataPoints. Dan S. suggests that ordering should be of InstanceVariables. Dan S. argued that DataStructure relationship should be to InstanceVariables rather than DataPoints.

Larry asks whether the relationship should be between the DataRecord and InstanceVariables. Dan notes that if the Record complies with the Structure, then that isn’t necessary.


Questions for discussion at the next meeting:

  1. Dan S.’s solution of realigning the relationships from DataStructure - removing the relationship to DataPoint and instead making the relationship from DataStructure to InstanceVariable - possibly addresses Dan’s concerns. Dan S. also noted that this would allow the ViewPoints and Attributes to become OPTIONAL in specifying a logical structure. Comments requested on this.

  2. Ordering concerns need to be taken into account - Ornulf argues that this doesn’t really make sense in a LOGICAL structure. Previous discussion (from Flavio) is that possibly it could be OPTIONAL. Any comments?

  3. Jay: it would be useful to have USECASES to reflect the uses of the required (IV/DS) and optional (VP/DP/DR) parts of the model. Suggested for Jay to look at the openEHR case. Could others volunteer for the simple CSV case? (Steve is happy to coordinate the CSV group - it would be nice to align/compare this with the new W3C tabular data model: http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/ ).


Next meeting:

January 14, 2016. GoToMeeting: https://global.gotomeeting.com/join/148887013 

Proposed time is ONE HOUR EARLIER - 2100 CET. Steve to poll group members about this.

NOTE ALSO NO MEETING DECEMBER 31


 Pre-reading for Meeting Dec 17 2015 - Tracking Datums

Linked here are Dan Gillman's slides on Tracking Datums, for discussion at the meeting Dec. 17, 2015.

SLIDES LINK (PPTX file)

Please review the slides ahead of the meeting.

 3 December 2015 meeting minutes

Data Description meeting minutes - 3 December 2015, 10PM Central European time

Attendees: Chris Seymour, Dan Gillman, Flavio Rizzolo, Jay Greenfield, Larry Hoyle, Steve McEachern


Continuing discussion from last meeting: Dan Smith and Ornulf not present - issues raised by them are to be held over


Commentary for noting on Ornulf’s email comments - for discussion at next meeting.


a) Ornulf’s suggestion on requirement for single measure per Viewpoint

Dan notes the blood pressure example we have been using has 2 measures. Here we will have two measures that all have the same attributes. Therefore the idea that attributes can only apply to a single measure seems inconsistent.

The suggestion was that Ornulf could manage this by implementing support for only one measure in his own system. Flavio noted that this shouldn’t create inconsistencies: the standard can support multiples even if people implement local limitations.

Some discussion ensued from this on the nature of Viewpoint (and its level in the structure). Dan noted that Viewpoint is setting the roles of the particular Datapoints/Datums in the particular context. Jay suggested that Ornulf’s implementation may be using Viewpoint as a higher order than originally intended - by adding constraints. Flavio noted that it is higher order structures (DataRecord) that impose the structure on the data - Viewpoint is for setting roles.

Jay notes Dan Smith’s issue that a DataRecord may or may not be ordered. Dan had suggested that the Record NOT be ordered - others suggested that order should be optional. Larry noted that the DataRecord should be extending Collection - but Flavio noted that we have modelled this as a “Realises” Collection (effectively the same).

Jay asked whether the DataRecord has a DataStructure. Larry noted that a DataStructure is a Collection of DataRecords.


Conditions that result from this:

  • A DataRecord may (or may not) have an order.

  • A DataStructure will have DataRecords

Flavio noted that the model appears to be flexible enough to support the use cases that we have seen so far in terms of data structures. Specific implementations of the model may need to place specific constraints to enable their specific requirements.

e.g. in Ornulf’s case, the approach would be that if there were multiple measures in a Viewpoint, the attributes in the Viewpoint would need to apply to both measures.

(Flavio notes that we couldn’t currently support specifying attributes applying to “not all” the measures in the Viewpoint).


Jay had a query about an openEHR example. In openEHR, there is basically data and then 3 categories of attributes. It may require implementation of hierarchical relationships between attributes - and (Dan G. noted) associational relationships between roles as well. Dan was concerned about putting constraints on the roles - that the relationships need to apply to the association of the InstanceVariable with the role.


B) Datums requiring a Datapoint

Dan G. was concerned about (his interpretation that) Ornulf was wanting to track DataPoints. Dan thinks of the DataPoint as a structural thing (e.g. DataStructure is a set of organised DataPoints).

Conceptually speaking, Dan is interested in following the Datum of “eye colour of Dan Gillman” - i.e. Datum has a role INDEPENDENT of the DataPoint (its storage location).

Ornulf (as Dan interprets it) appears to want to track the DataPoint (Ornulf can respond on this).

Datum is the Designation - what you are writing down BEGAN as a Signifier, but it has meaning associated with it making it a Designation. The meaning is recognised in the relationship to a Concept in a ValueDomain.


We could associate Datum with many things if we wanted:

  • An InstanceVariable?

  • A ValueDomain? (Not as precise - will be reused)

  • A RepresentedVariable? (It may be represented, but then recorded in different ways in different packages)

What do we care about? If we don’t care about the SentinelValues, then we could only follow SubstantiveValues. (Larry notes that we may want to follow the Concepts underlying the


Jay: this conversation started with the note that Datum and DataPoint have no Properties. Jay asks whether this would be resolved if Datum had a Value (NOT a Unit).

Flavio: “Value” is problematic - but we could say that Datum has a Representation (or Multiple Representations) which is a Signifier. A Representation could then be tied to either a Sentinel or Substantive Value. A Signifier has two aspects to it. (Dan G) Think for example of writing the numeral 5. You can write it down infinite times - each time is slightly different - a different instance of that numeral. The numeral itself has a concept. But there may be the potential to move between the different representations, and to possibly lose information in the different representations. Flavio asked whether we should group those representations?


Dan suggested that the way we need to implement this correctly is to say that each Instance of the recording of the Datum needs to be tied to a particular conceptual class. He is going to do some homework on this prior to the next meeting.


Jay noted that a Representation is a physical thing that has a reference to a Concept. The concept is the Value - that lives in a conceptual system.

Re openEHR: Jay notes that they can set up attributes and assign them to categories. We will want to give this some further thought as to how we might implement this in DDI. Jay will give a presentation at the next meeting.


Next meeting

Dec 17th, 10 PM CET (i.e. same time)

We will look at possible alternative times for meetings next year, but will work on the same time for now.


 19 November 2015 meeting minutes

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern, Barry Radler, Ornulf Risnes, Dan Smith


The meeting focus was on the review of the current state of the objects in the Data Description model on Lion (http://lion.ddialliance.org/package/logicaldatadescription)

Dan Smith provided comments during the week which formed the basis of discussion. Dan's comments are noted below in italics, and discussion on each comment is included below.


1. The Datum entity should be a property called "Value" or "Datum" on the DataPoint with a cardinality 0..n (if the current 0..1,0..n relationship is proper).

- The InstanceVariable defines the ValueDomain; it does not need to be repeated again in a separate Datum entity. A ValueDomain describes the value's -type-, not an actual value.

- The DataPoint and Datum are defined by the same InstanceVariable, so the relationship, from the Datum entity to InstanceVariable, is also redundant.

There is no reason for Datum to be an entity. My impression from the discussions was that a Datum is an actual instance of data, a value, not a conceptual entity, and the DataPoint describes the cell for the actual value (datum) to be placed into.


Core argument: Description of a designation is being conflated with the Datum

Ornulf supported Dan S.’s position that it may just be a value.


Dan G: what is a Value? Value is a Concept. Datum is a Designation. Signifier is what is written down

Larry asks what about the situation where you move a particular value between software packages – how can we represent how they have been moved into a different package?

Dan S. – taking a step back – this can be achieved by using the InstanceVariable to capture the package-specific detail (including a sentinel domain) and the RepresentedVariable to capture the represented content.

Jay suggests that this is largely captured by the Conceptual model (specifically the Variable Cascade). Larry isn’t so sure.


Ornulf: in testing this, it appears that the current content provides the machinery to achieve the management of the Datum.

Dan G.: Since the Datum is the meaning and the representation.

Larry: so where is the Unit in this approach? Ornulf notes that the DataRecord groups a Datum with a Unit. Larry still wonders where the Unit is here. It could be associated with the Identifier, but in certain representations, the Unit may be lost (e.g. in a TripleStore?).


Ornulf suggests: a DataRecord is a family of Datums that relate to a Unit (a person, a concept, …). We need to have something to group these DataPoints into a Record.

Larry suggests that we may need Unit tied to the DataRecord. Ornulf suggests that we don’t really have a UnitType associated with a Record.

UnitTypes are associated with the ConceptualVariable.

Ornulf argues that the Unit for a Record has to be inferred from the Units in the InstanceVariable – as we may group DataPoints into Records with different Units. Dan: the InstanceVariables in a record may not need to have the same UnitType.


DataPoint will have a UnitType inherited from the ConceptualVariable. Datum will have a Unit inferred by the grouping of the DataPoints in the DataRecord.


Does this solve Larry’s problem? A Record lets us have an InstanceVariable association, which allows us to group them and reason about them together.


BACK TO DAN'S ORIGINAL COMMENT:


Proposed points

A Datapoint is always associated with a single InstanceVariable

The DDI value domain describes the ... (missed comments here)

Result:

a)    Datums shouldn’t inherit from ValueDomain – agreed by all to change.

b)   Since a DataPoint always has only a single InstanceVariable, and Datums are associated with DataPoints, there is no need for a relationship between Datum and InstanceVariable (it can be inferred through DataPoint)


Dan G: Agreed on the need to have a place to write a Datum down – the DataPoint. Dan was concerned however about how structures might get re-used.

Dan S: tentative agreement that InstanceVariables can be found through the DataPoint. Dan G: alternative expression – a Datum cannot exist without being written down in a DataPoint.


Dan S: The Designation or Code is described by the ValueDomain of the DataPoint.

But there is no place to put the code right now. DataPoint needs an attribute that allows that to be written down. Suggestion therefore is that there needs to be a Datum-like representation. (This could be Datum)


Jay: not sure we want to give up Datum as an Entity. Larry agreed. Right now, neither Datum nor DataPoint has any properties – we need to work this through.

 Suggestion - for consideration before the next meeting: Datum is removed as an Entity, and instead becomes a Property of DataPoint. (That implies that a given Datum can only be in ONE DataPoint).


2. InstanceVariables cannot be ordered in a DataStructure; only individual DataPoints can be ordered. Same comment for the record identifiers: InstanceVariables should also be able to be noted as DataStructure keys/identifiers. Perhaps specifiesOrder and specifiesIdentifierOrder in DataStructure should relate to InstanceVariables.

It seems incorrect that the total order of all known DataPoints is specified by the specifiesOrder of DataStructure.


Basically - Do we want to order:

  1. DataPoints in a Record?
  2. InstanceVariables in a DataStructure?
  3. Both?


Probably both – 2 (IV in DataStructure) is probably the more common thing. It’s not there right now, but needs to be in the model.

The question – will the DataRecord ever have an ordering different from the associated DataStructure?



3. DataStructure realizes a Collection, but it is ordering three separate collections of different types (InstanceVariable order, key/identifier order, and DataRecord order).

 This comment was held over for discussion at the next meeting.


Next meeting:

The next meeting is scheduled for Thursday December 3rd at 2200 Central European Time (10pm).

As this is the evening of the conclusion of the EDDI meeting, we are not sure of attendance. Could all members of the group please indicate your intention to attend (or not).


 5 November 2015 meeting minutes

Attendees: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern, Barry Radler, Ornulf Risnes, Flavio Rizzolo, Dan Smith


Meeting times

With the change in Daylight savings, there was discussion of alternate times. It was agreed to continue meeting at 10PM Central European time for the near term, with revision as required in future.

Dagstuhl review

There was a brief review of the Dagstuhl 2015 meeting for the benefit of those unable to attend.

Review of new objects from Dagstuhl:

  • Measure/Attribute/Identifier roles
  • Viewpoint
  • Record

Jay noted that he had had an opportunity to review and catch up on the model with Flavio.

Ornulf has started implementing some of the ideas developed at Dagstuhl in his work at NSD - so far no issues.


What to do next?

The group identified a number of areas that require attention in the near term

  1. Housekeeping, particularly cleaning up the model objects: (e.g. Datum - does it have a Value)
  2. Relationship between DataDescription and DataCapture, and relationship between DataDescription and CoreProcess (and Methodology) - i.e. Transformations and Observations. It was noted that Flavio is updating the CoreProcess model based on discussions at Dagstuhl.
  3. Dan Smith raised the question of other areas of Statistics which were previously covered in DDI 3.2. This is clearly relevant, but the question is whether we are now changing the scope of this working group - should Statistics be covered elsewhere?
  4. Jay asked whether we might also want to develop a paper describing the Data Description model and its objects and relationships.
  5. Versioning of a Datum was raised by the TC at the Nov 5th meeting. Larry and others also asked whether we need to have Value as an attribute of Datum.
  6. Larry also noted that we should engage with Qualitative data - we need to review the use of the Data Description model in the Qualitative view.


Physical data description

Additional comment on physical description from Ornulf - see email discussion week of Nov 3rd

Ornulf suggests being able to describe data inline, but not much more than that. Most agree with Ornulf, although Steve noted that there may be implications for data archives - who do not have significant influence over file formats received, and thus may have a greater interest in the documentation of legacy formats.


Next steps and next meeting

For the next meeting, the group agreed to complete our “housekeeping”, to review the current version of the model (available at http://lion.ddialliance.org/package/logicaldatadescription) and circulate any issues to be resolved via email. The next meeting will then focus on the clean-up, and an agenda for future meetings to resolve outstanding issues.

Next meeting date: Thursday November 19th, 10pm Central European Time

GoToMeeting URL: https://global.gotomeeting.com/join/148887013

Equivalent meeting times:

Mannheim (Germany) Thursday, 19 November 2015 at 10:00:00 PM CET UTC+1 hour
Canberra (Australia - Australian Capital Territory) Friday, 20 November 2015 at 8:00:00 AM AEDT UTC+11 hours
Christchurch (New Zealand) Friday, 20 November 2015 at 10:00:00 AM NZDT UTC+13 hours
Bergen (Norway) Thursday, 19 November 2015 at 10:00:00 PM CET UTC+1 hour
Ottawa (Canada - Ontario) Thursday, 19 November 2015 at 4:00:00 PM EST UTC-5 hours
Washington DC (U.S.A. - District of Columbia) Thursday, 19 November 2015 at 4:00:00 PM EST UTC-5 hours
Minneapolis (U.S.A. - Minnesota) Thursday, 19 November 2015 at 3:00:00 PM CST UTC-6 hours
Lawrence (U.S.A. - Kansas) Thursday, 19 November 2015 at 3:00:00 PM CST UTC-6 hours
Madison (U.S.A. - Wisconsin) Thursday, 19 November 2015 at 3:00:00 PM CST UTC-6 hours
Corresponding UTC (GMT) Thursday, 19 November 2015 at 21:00:00

 Variable Cascade and extensions

The attached diagrams show the current model for the variable cascade and some extensions to model the specific representations that are available in DDI3.2. Decisions need to be made as to whether to use relationships to classes, inheritance, or ComplexDatatypes for some of the specific representations and their content.

Some 3.2 elements are not modeled here. NumericRepresentation in 3.2 has the following content:


(Attribute listing for NumericRepresentation - the attribute names are not preserved here; the surviving types are: xs:NMTOKENS, xs:boolean, an enumeration of ("Nominal" | "Ordinal" | "Interval" | "Ratio" | "Continuous"), xs:string, and three xs:integer attributes.)

The constraints on values are not part of 3.2.

 Further revisions to Transaction/Datum modelling

The following is a transcription of the email notes from Jay and Steve in the development of revisions to the Transaction/Datum content since our last meeting.

The revisions are detailed below in the two powerpoint files, with the most recent version in the image presented here. Please review ahead of the meeting on Thursday.



Informatics version 4 (Current version, Sept. 29, 2015) - POWERPOINT

Informatics version 3a (Revision Sept 28, 2015) - POWERPOINT



Jay Greenfield, Sept. 28 2015


Re the model, I worked on it quite a bit.

What I attempted to do was not break any new ground but to reuse work we already have done in the service of describing “derivations” and representing “collection/capture events”.

In this way I was able to bring to bear Methodology, Collections and the Core Process Model.

Perhaps I did break new ground with collection/capture events. They had to be hosted, so I created an Archive.

By and large though, I am impressed at how we can marshal all these classes in the process of representing data description.


Steve McEachern, Sept. 28, 2015


Hi Jay,

 Thanks for this. Just to clarify a couple of things:

 1. Events and observations

I note that Events are moved to “Events or Things”, and are “real world” things. You’re then making a distinction here between a derivation (which I think is now a Process (through Methodology)?) and an Observation. So you see these as quite distinct things? I can see the argument for that, but I would also wonder whether an Event – such as performing a Calculation – could be characterised in the same way?

 2. Analysis

This is nice – provides a clear place for where Analysis might be hooked in!!

 3. Tables and Cubes (data aggregates)

These are now all sub-classes of DataStore, right? So we are proposing here that we have the key object being the DataStore – which is a collection of Datums.

This means that we have the core of our “collective” structure. We just need to figure out how to identify and describe the DataStore subclass.

Do you have some thoughts on what the attributes of a DataStore might be?? We probably need to revisit where we got to in Minneapolis on this one:

  • Rectangular Table Store: could be
    • Statistical aggregate tables: rows are ??, columns are Variables
    • Unit record files: rows are records of a case, columns are Variables
  • Data cubes
  • Graphs: 

4. Archive:

I can see this, but it will need to be clearly defined for our traditional Data Archive user community.


I’ll have more questions I’m sure, but I think this has come together nicely. Seems like a focus on DataStore at Dagstuhl would be a nice conclusion to the conversation started just after last year, and in Minneapolis.


Jay Greenfield, 29 September, 2015


Steve:


 I think I have begun to address #1 and #4.

 Again what is encouraging is that 1) it didn’t take much and 2) what it took is NOT original — just a subclass of method which is in line with how certain “reference ontologies” think.

 I did finally add the note I had in my head about one of these reference ontologies.

 “Incorporating” “reference ontologies” might be a good jumping off point for the review team we are bringing together: here is a use case.

 Please distribute. I think it will put pressure on us to grow “OrderRelation” but that is a good thing.


Jay


 24 September 2015 Meeting Minutes

Notes from Sept 24, 2015 meeting

Attendees: Jay Greenfield, Steve McEachern, Larry Hoyle, Dan Gillman, Chris Seymour, Ornulf Risnes, Barry Radler


Jay and Steve introduced the revised high level diagram integrating previous work, that has been put together by Jay, Steve and Flavio. This is being generated in Google Docs. https://docs.google.com/document/d/1XYBRO7XLx6sasJKcIzdCxWkBlJkVqacmjpdBonkNR_Q/edit?usp=sharing

(Note that the document below allows comments, but editing is restricted right now to Steve, Flavio and Jay)

The current version of this diagram, as was discussed at the meeting, is below.


The key elements of this high level diagram are.

  • Model links in Transaction, Datum, and Event

  • Generation and Observation are subclasses of Event

  • DataStores are also introduced and related (through DataStructure and Record) to DataPoint

  • Generation is associated with a DataStore (this is poorly modelled by Steve, and subject to revisions)

Flavio has also made minor revisions to his model of Transaction.

Comments on this updated modelling:

  • Dan noted Universe should probably be Population

  • Dan queried what was meant by Spawning? Is it similar to the UNIX sense of spawning? Jay didn’t think so, but Ornulf and Chris did think that the relationships between the DataStores do appear to work in this manner in their work at NSD and StatsNZ.

  • Derivation is probably poorly connected here - but can be fixed. It probably needs to show more clearly the link between Derivation and Datum, and with Spawning.

  • Larry noted that we could use Collections to build the Record and the DataStructure

  • Dan’s concern: he’s not sure how DataStructures are integrated into DataStores. Jay agreed, noting that Derivation was a first attempt at this, but that it’s not quite right.

  • Ornulf also noted that DataStore in this way works well in practical terms, in the experience they have had with the RAIRD project


Dagstuhl meeting preparation

Group needs to provide a presentation for the first day.

We also want to bring together an overview of what do we want to do for the week. Initial suggestions from Steve would be to focus on:

  • Finalising the Transaction integration

  • Clarifying the Datum-DataPoint-Record-DataStructure-DataStore relationships

  • Working through the integration of Derivations (and Spawning)

(Side note from Steve after meeting: These three elements appear to be the final requirements of the DataDescription model. It is possible therefore that one aim of the week would be to finalise the DataDescription model for inclusion in the next release.)

Steve to compile a proposed outline for the week and the presentation, for discussion at next meeting.

It was also noted that some group members are keen to move into the work as soon as possible when we arrive (Ornulf has volunteered even Sunday evening). It is possible that some team members could start while the Monday plenaries are occurring - only require some members of the team at plenary. Steve will discuss this with the organising team.


Next Meeting

Thursday 8th October, 10PM Central European Time.

GoToMeeting: https://global.gotomeeting.com/join/148887013



 15 September 2015 working group meeting notes: Jay, Steve, Flavio

Notes on meeting to discuss revisions to Transaction/DatumStructure model

Attendees: Jay Greenfield, Flavio Rizzolo, Steve McEachern

Initial discussion was over where the "disagreement" currently lies.
Debate seems to be over what is associated with Transaction, Datum and Observation, and the relationships between them.

There are many characteristics of an Observation that are of interest.
(Side note: Event could be a Superclass: Observation and Derivation are subclasses)

Additional information describing the context of how an Observation is observed.
e.g.
Measure(s):
* Systolic Blood pressure
* Diastolic BP

Context (attribute):
* Observer: IV "I
* Instrument: IV "Instrument", Datum "BPCuff"
...

Identifier:
* Key

Flavio suggested that we could link Observation to Transaction.
Transaction is the thing that gathers all the information about the Observation that occurs.

We probably want to distinguish between an Event (Observation or Derivation) and the Recording of that Event.

We also need to disconnect Observation directly from Datum as it currently stands, as that limits to one Datum in a Transaction.

Two options:

  1. Either link Observation only to Datum in its Measure role OR
  2. Link Observation to the whole Transaction (several Datums - ID, Measure(s), Attribute(s))

Jay liked the second approach - Steve and Flavio agreed.

Datum would then be linked to Transaction (a sketch of this approach follows the list below):

  • One of the Datums would include the Measure value(s)
  • Other Datums would include the attributes of the context
  • Each Datum has a different Instance Variable associated with it.
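A rough sketch of this second approach, using the blood pressure example above. The Python names and values are illustrative only and are not a proposed binding of the model.

from dataclasses import dataclass
from typing import Any, List

@dataclass
class Datum:
    instance_variable: str    # each Datum has a different InstanceVariable associated with it
    role: str                 # "Identifier", "Measure" or "Attribute"
    value: Any

@dataclass
class Transaction:
    datums: List[Datum]       # the Transaction gathers everything recorded about the Observation

bp_reading = Transaction(datums=[
    Datum("PersonID", "Identifier", "P-042"),        # identifier: the key
    Datum("SystolicBP", "Measure", 120),              # measure 1
    Datum("DiastolicBP", "Measure", 80),              # measure 2
    Datum("Instrument", "Attribute", "BPCuff"),       # context (attribute)
    Datum("Observer", "Attribute", "Observer-1"),     # context (attribute), value invented for illustration
])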

There are other issues it raises to be discussed, e.g. relations between Transaction, Datum and Data Structure

Flavio noted that in Jay's diagram, probably only requires changing the link of Observation from Datum to Transaction.
Steve noted that Flavio's DatumStructure model Version 2 from 28 August also largely reflects this (probably just needs adding in subclasses of Event: Observation and Derivation)

Jay suggests Flavio try modelling the second approach above, then Jay will fill in examples to flesh this out from OpenEHR. Jay thinks OpenEHR call this a Transaction (or an Archetype).

Note for remodelling:
Need to account for the possibility that there can be more than one Measure in a Transaction

  • e.g. BP: consists of two measures
  • Scale: consists of many measures, structured in a certain way

The next step once we update the Transaction would be to dock it within the DataStore

At this point we need to revisit the discussion as well on aggregation of Datums into DataStructures.

Aggregation:
Objects: Records, DataStructures and DataStores
How do we define these things
e.g. Record: Union of Datums sharing common Unit, ...

Similarly are our Structures suited to all aggregates - e.g. What would a Record look like in the context of a Cube?

NEXT STEPS:

Flavio to revise model, Jay to provide OpenEHR examples.
Steve to put together slide deck on the aggregate structures (Records, DataStructures and DataStores) 

 September 10, 2015 meeting minutes

Minutes of 10 Sept Meeting

Present: Steve McEachern, Dan Gillman, Jay Greenfield, Larry Hoyle, Barry Radler, Ornulf Risnes, Flavio Rizzolo

Apologies: Chris Seymour


Steve provided brief overview of the previous meeting, and the three example use cases that have been developed since the last meeting.

There has been a discussion via email (captured on the wiki) that detailed the follow on conversations about the use cases.


Jay provided an introduction to his Informatics case. This largely provides a high level overview embedding the Transaction content within the larger Data Description model, particularly as to where Datums sit in a DataStore. Ornulf noted that NSD has also been working on how they will develop their DataStores, which he will share with the group.

Ornulf then moved on to his example of how to generate derived datums - particularly, within Jay’s model, it appears that Transaction will provide a useful framework for describing the means through which a set of Datums are reorganised/aggregated/recorded. In particular, it will also allow the tracking of the provenance of how the derived Datums were generated.

Ornulf also noted that Jay's model and the linking to the DataStore would provide a good framework for thinking about the links between different Data Sources - particularly how one DataStore might be the source for another.

Ornulf then considered the derivation process he provided a case for. He saw this within the “Observation/Derivation” section of Jay’s model - except to note that Derivation is not currently present there.

There was then some discussion of how to fit in Derivation into the model (probably revisiting Observation), and also to adjust Jay’s model to reflect the relationship between Observation, Datum and DataPoint. Here “Transaction” probably sits to the side from this Observation-Datum-DataPoint link.

The group felt that Derivation probably sits as a separate class - but sharing some common (and some distinct) attributes with Observation. Sorting out the similarities and distinctions here would probably also help us to flesh out the Measurement/Attribute distinction.

Ornulf noted that they will need to be able to say for a given Datum whether it is a Measurement or Attribute. Dan asked whether this is a characteristic of the Model or of the Datum. Ornulf wasn’t sure if this was the case. He gave the example of “Source” information for the origins of another Datum (e.g. the source of a Grade is defined in the Source attribute and stored as a Datum). This is dependent on how the Measurement/Attribute characteristics are represented in the model.

Jay agreed to revise the content of his Informatics high level model to integrate the discussion from this meeting. It was then suggested that Jay, Steve and Flavio look to review this revision before the next meeting, and bring back to the group for discussion in two weeks.


Next meeting: Thursday 24 September, 10pm Central Euro Time

 GoToMeeting: https://global.gotomeeting.com/join/148887013

 Example Use Cases for Datum testing

The following is a set of use cases provided by team members for use in evaluating the Datum-Transaction model.

I've also included the email conversations on each of these following the examples.


1. Data transformation 1 - row and column calculations (Ornulf)

Calculating a new Datum on the basis of two other Datums

https://docs.google.com/spreadsheets/d/144AlQS-cSNbHscRWHNu6OduxHMoI3FTNg6FYNQkoZCg/edit#gid=1072131427

Ornulf's comment:

I have been asked to produce a simple example of representing data transformations for our meeting Thursday.


2. Data Transformation 2 - Multidimensional scaling (Larry)

Larry's comment: Attached is an example of a transformation that is done by a "black box" procedure on a whole dataset.


3. Informatics and the Process Model (Jay)

Informatics example - PDF PPT

Process model - PDF PPT

Jay's comment: In Informatics I tried to think end-to-end about transactions at a very high level. This is less for modelers and more for everyone else. Many details remain to be worked out.  I modified my process model document so it plugs into the larger picture.


Email discussions:


  1. Data transformation 1:

Ornulf Risnes - 2015/09/05:

Hi everyone,


I have been asked to produce a simple example of representing data transformations for our meeting Thursday. (Larry is doing a more complicated example.)

 To finish the example I need your help describing/understanding a subtle problem I have been struggling with for a while.


To illustrate what I have in mind, I have made a small example Body Mass Index (BMI) dataset:

 https://docs.google.com/spreadsheets/d/144AlQS-cSNbHscRWHNu6OduxHMoI3FTNg6FYNQkoZCg/edit?usp=sharing


I want to talk about the green datum and the red datum in that dataset.

 Red: BMI for one person

Green: Average BMI for a group of people

 Both the Red and Green datums are produced by a formula. The difference is the "direction" (for lack of a better word) of the calculation.

The Red datum is calculated with a formula using datums in the "horizontal" direction, i.e. on a per-record basis. This is equivalent to e.g. the "generate"-command in Stata which operates in this direction.

The Green datum, on the other hand, is calculated by a formula that uses datums in the "vertical" direction. This is equivalent to e.g. the "egen mean()"-command in Stata.

 In Excel (and other spreadsheet technologies) you can mix directions fairly flexibly. As we know, Excel uses ranges (D4:E4 or D4:D7) to group datums as parameters to formulas.


What I need help with is how we can generalize this "directionality" into a pure understanding of such ranges (or whatever we need to solve this).

 Hope I made myself clear on a Friday night.

 Thoughts?

 Ørnulf


Barry Radler - 2015/09/05

Ornulf,


Perhaps your two datum 'direction' calculations can be represented by the identifier and measure characteristics of the Datum object in the revised model we looked at last week?

 I'm thinking your horizontal calculation is derived via a case-level (or identifier) property, while your vertical mean calculation is derived via a variable-level (or measure) property.

 It's Friday afternoon here, prior to a 3-day weekend, so my mind has already left work...


Barry


Ornulf, 2015/09/09:

Barry, others,


thank you for your replies.

 I have reviewed the feedback, and have reached a new level of clarity (or rather: ignorance-reduction) since last week. As always I prefer text-based models and examples over UML...


To illustrate this, I have made a second spreadsheet/workbook in the spreadsheet I shared last week:

https://docs.google.com/spreadsheets/d/144AlQS-cSNbHscRWHNu6OduxHMoI3FTNg6FYNQkoZCg/edit#gid=1072131427


Var_A  Var_B  Var_C  Var_Derived
---------------------------------
1      45     12     15.7
2      34     15     16.1
3      65     17     16.8
4      74     15     15.2
5      53     17
6      43     13
---------------------------------


Var_Derived can be described like this (Pseudocode):

 Var_Derived = ln(Var_A) + sin(Var_B) + AVG(Var_C) IF Var_A <= 4


Here we have a number of things going on:

 1) The two mathematical functions ln() and sin() both operate on single values, and thus operate in the "horizontal" direction.


2) The AVG()-function, on the other hand, is an *aggregation* function (or statistical function) that takes a set of values for a given variable in the "vertical" direction.

 It makes sense to talk about "vertical functions" as aggregation functions or statistical functions instead. We are all familiar with this type of function from egen, SAS or SQL. Statistical functions include:

SUM, COUNT, MIN, MAX, STDEV, VARIANCE, KURTOSIS, SKEWNESS, GEOMEAN, CHECKSUM

 Stata's "egen" has some confusing functions (e.g. rowmean()) that blur this clarity a bit - but I see rowmean() as a convenience function that easily can be expressed in terms of simple arithmetic operations in the "horizontal" direction. As such, we don't need support for statistical functions that operate horizontally (i.e. on a per-unit basis)

 TODO:

Statistical functions are typically accompanied by a GROUP BY clause where you specify a set of "break-variables" or grouping variables. The concept is simple, and as long as you tie grouping clauses to *one* statistical function at a time, you may support different grouping clauses in one expression. We still need to find a solid model for this.


3) The IF-clause defines the set of units that the formula operates on.

This definition of course complements the Universe/Population already defined for the variables (typically linked to the Instance Variable)

 This set-definition or filtering-criteria stuff is more relevant when you have statistical functions that operate on sets of values than arithmetic functions that operate on one unit at a time (horizontally).


Does this make sense? Is it compatible with the process model? Can we use the concept of statistical/aggregate function when we talk about cube-datums and their provenance?

 Will write a follow-up to Larry's email shortly.

 Ørnulf
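As a minimal illustration of Ørnulf's pseudocode, a pandas sketch is below. It assumes one possible reading of the expression - that AVG(Var_C) is taken over the filtered set defined by the IF-clause - and the numbers it produces are not meant to reproduce the example table. The last line shows the GROUP BY / break-variable idea as a per-group mean.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Var_A": [1, 2, 3, 4, 5, 6],
    "Var_B": [45, 34, 65, 74, 53, 43],
    "Var_C": [12, 15, 17, 15, 17, 13],
})

mask = df["Var_A"] <= 4                      # IF-clause: the set of units the formula operates on
df["Var_Derived"] = np.nan

# ln() and sin() are "horizontal" (per-record); mean() is a "vertical" aggregation function
df.loc[mask, "Var_Derived"] = (
    np.log(df.loc[mask, "Var_A"])
    + np.sin(df.loc[mask, "Var_B"])
    + df.loc[mask, "Var_C"].mean()           # AVG(Var_C) over the filtered set (one possible reading)
)

# GROUP BY / break-variables: a vertical function evaluated per group
group_mean_C = df.groupby("Var_B")["Var_C"].transform("mean")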


2. Data Transformation 2 - Multidimensional Scaling:

Larry Hoyle - 2015/09/06

Attached is an example of a transformation that is done by a "black box" procedure on a whole dataset.


Ørnulf Risnes, 2015/09/09

Larry,


I don't know or understand MDS at all, but I have this general question:

 the two new variables that you produce; wouldn't these variables typically be _named_ coefficients after the procedure?


If it makes sense to talk about result-variable-1 as "Dim1" and result-variable-2 as "Dim2" I think this is less problematic than originally thought. A lot of analyses produce multiple coefficients, and we have to be able to talk about provenance of analytical results the way we talk about provenance for datums that go into the analyses.

The value for Tom X Dim2 is 0.771126. We have to be able to talk intelligently and structured about how that number came about. Don't we?


Ørnulf


Larry Hoyle, 2015/09/09:

The MDS (multidimensional scaling) procedure maps (projects) data from one space into another space of smaller dimensions. Names for the resulting dimensions are subjective.  

In the example Sally originally has a position in 4 space expressed as a vector (3,3,4,0), and in the new 2 space of (-0.2, -1.4E-05). The data in the new 2 space are a "best" (but usually lossy) representation of the data, in this case used to plot the points on a plane.

 This example is a common use case for network data where it's desirable to plot  the nodes and edges in either a 2 or 3 dimensional plot.

 MDS can be thought of as similar to springform algorithms where there is an attractor between every pair of points and the algorithm minimizes the energy in the system when squashed into the new number of dimensions.

 Other related procedures are factor analysis, and projections for maps where the 3d points on Earth are projected into a plane with varying constraints.

 Larry
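A small sketch of the kind of projection Larry describes, using scikit-learn's MDS. The first input row echoes Sally's 4-space vector from the example; the other rows are made up, and the resulting coordinates will not match the values quoted above.

import numpy as np
from sklearn.manifold import MDS

# each row is a point in 4-space (only the first row comes from the example; the rest are invented)
points_4d = np.array([
    [3, 3, 4, 0],
    [1, 0, 2, 5],
    [4, 4, 4, 1],
    [0, 2, 1, 3],
])

# project into 2-space: a "best" but usually lossy representation, suitable for plotting on a plane
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
points_2d = mds.fit_transform(points_4d)
print(points_2d)        # one (x, y) pair per original point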


Ornulf Risnes, 2015/09/09:

I won't pretend I understand MDS yet, but have two generic comments:

 * lossy functions are also commonplace in e.g. creating cubes from microdata and some of the statistical functions I listed in my example earlier are indeed lossy. I don't think that matters when you talk about the creation of the new datums. Sure, you cannot undo and go back without the raw data - but it's still meaningful to talk about the formula

 * some of my more clever colleagues have pointed out to me that the "horizontal" and "vertical" issues I was struggling with last week have commonalities with the MapReduce-way of solving calculations. I only understand what they talk about in intervals of 5 seconds at a time, so I can't really say whether or not MapReduce can help us understand/explain data transformations in a DDI-context. (will discuss more with my colleagues before our teleconference tomorrow)

Anyway - my point is that we should also be able to describe the result-datums of MapReduce operations or other operations regardless of their complexity. Or something...


Ørnulf


Dan Gillman, 2015/09/09:

Ørnulf,

 I don't feel I am really up with you and others in this discussion, but your reference to row and column operations reminds me of the relational algebra, which may be related to this map-reduce stuff. In the relational algebra you can operate on a row or a column, and these allow you to drill to a single datum.

 On the other hand, when I hear "reduce", the map-reduce operations might be projections of a multi-dimensional space to one of lower dimension. I just am not familiar enough with this yet.

 Dan


Wendy Thomas, 2015/09/09:

I would also think Matrix Manipulation would be of use here. At least we should be able to describe this.

 wendy


Larry Hoyle, 2015/09/09:

Good point. Many software packages can express transformations as linear equations in matrix form: Stata has Mata, SAS has IML, R has R. These include functions that act at the matrix (table?) level e.g. determinant

 --- Larry Hoyle
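A tiny numpy illustration of a transformation expressed in matrix form, with a matrix-level function applied, along the lines Wendy and Larry suggest (the values are made up):

import numpy as np

X = np.array([[1.0, 2.0],       # three cases, two original variables
              [3.0, 4.0],
              [5.0, 6.0]])
A = np.array([[0.5, 0.5],       # each new variable is a linear combination
              [1.0, -1.0]])     # of the original variables

X_new = X @ A.T                  # apply the linear transformation to every row at once
det_A = np.linalg.det(A)         # a function acting at the matrix level, e.g. the determinant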


 Minutes for 27 August 2015 meeting

Minutes of 27 August Meeting

Attendees: Barry Radler, Flavio Rizzolo, Jay Greenfield, Steve McEachern, Ornulf Risnes, Larry Hoyle

Apologies: Dan Gillman


Flavio discussed the draft version(s) of how we might fit the DatumStructure and the Transaction into the model

(see images in previous minutes for the model)


Version One of the Transaction Model

The main new objects are:

  • DatumStructure

  • Transaction (with attributes time - Timespan, space - Location, and identifier)

  • Event, and

  • Key (which identifies the DatumStructure)


Questions from Ornulf

  1. Wanted to know more about the TemporalIntervalRelationPair, and its relevance here. Flavio explained that this allows us to describe the relationship between two or more transactions. Ornulf noted that it could be inferred from the Timespan and Location. Larry noted that this approach might also allow us to specify relations explicitly.

  2. Barry asked whether the RelationPair objects are REQUIRED or optional. Flavio indicated that they are optional - that we could use them if we wanted to and had this available.

  3. Jay felt that he would be able to use this further in articulating how we could utilise this within a Questionnaire Design


Version Two of the Transaction Model


Additional characteristics

  • Transaction is the link to Datum

  • Transaction can be related to Datum as an Identifier, an Attribute or a Measure

  • Transaction might also be applied to other things

  • DatumStructure is then a grouping of Datums for a given Key


Questions on Model 2:

  1. Barry asked what were the major differences between the two models.

  2. Ornulf asked whether the DatumStructure has one or multiple identifiers associated with it. The question is whether we would allow multiple Measures or multiple Datums in a DatumStructure. This is a subtle difference which needs to be clarified to finalise this section of the model

  3. Jay noted that the key issue here is the Identifier/Attribute/Measure characteristic of the Datum. This model appears to allow us to incorporate that distinction.


General discussion

General question of whether we will still need the DatumStructure (does it need removing, or renaming, or...). Larry gave the example of its use in a Questionnaire - we would want to be able to identify all of the variables which were captured in the execution of the questionnaire. (This might however be better described as a RowStructure?).

At this point, it would be useful therefore to take this version of the model and try applying it to some examples. Jay indicated that he would be able to trial this with a Questionnaire. We would also want to try this with the BloodPressure measurement. A third example that would be useful is a Transformation of a Datum.

Ornulf noted that we would want to know the difference between Key and Identifier (the relation on the Datum-Transaction relationship). Flavio indicated that they may be the same, but the Identifier could also identify the Datum within a Transaction, and have multiple Transactions within a Structure. Sometimes we would need both.

Ornulf also noted that we might want a “forgiving” model - with much of the content being optional rather than mandatory. We often won’t have a significant amount of this information.


The group agreed that we should progress with Version 2 of the model above. We still need to clarify:

  • Distinction between Key and Identifier

  • What role is DatumStructure fulfilling

  • Can this now be extended into our Aggregates (Dimension, UnitRecord, RelationalTable - there will be others)

  • Is there still a need for Events in this framework? (Ornulf will have more to say about this at the next meeting - it is probably a separate issue?)

  • How will this then fit into DataCapture?

  • What other types of Transactions will also be relevant?

For the next meeting, Jay and Ornulf will prepare examples of the Questionnaire/BloodPressure and DataTransformation to test whether the proposed model can be applied. Steve will also explore whether the aggregate structures can be developed from the Transaction.


Next meeting:

Thursday 10 September, 10PM Central European Time

GoToMeeting - https://global.gotomeeting.com/join/148887013

 Options for the DatumStructure model

These are two proposed alternatives from Flavio and Steve for how we might include DatumStructure and Transaction in the model


Option One:

Datum Structure as a grouping of Transaction


Option Two:

DatumStructure removed, and replaced by Transaction



 13 August 2015 meeting minutes

Meeting minutes - 13 August 2015

Attendees: Dan Gillman, Flavio Rizzolo, Steve McEachern, Jay Greenfield, Larry Hoyle, Ornulf Risnes


1. Comments from Ornulf on minutes of last meeting

FHIR model looks like an interesting start.

Agreed with Larry’s comment last time on the need for properties of datum - which was discussed further below.


2. Defining properties of datum:

Dan asked whether we are distinguishing datum from the datum structure. That had not yet been decided.

As it stands, we seem to be defining as follows:

- Datum: the result of a determination/observation

- DatumStructure: the content inherent in that determination

We have the potential for disembodying “Datum” if we don’t associate it closely enough with real-world events.


3. Comments on DatumStructure:

If we have the IV, we don't need Universe (it already inherits it)

Should we replace Event with Transaction (based on the FHIR transaction example we have seen)? Possibly - we need to discuss this, as we may need both Event and Transaction.


4. Comments on a new “Transaction” object:

A Transaction captures the characteristics of the determination of the value. We will need to be explicit about the characteristics of a transaction - such as the Interval?

We are distinguishing the Event being recorded from the Transaction that is recording it

Jay noted that there are several common ontologies in this space, with the common characteristic of distinguishing between the real world event and the recording of that event (the Information object). The information in HL7 is put into something called a Document Model (sections, lists, ...)

Flavio argues that Transaction has to have a Time attribute (beginning and interval or end). Similarly Larry argues for a Place attribute. Dan points out that there are some data that don’t have a spatial element (e.g. the number of protons in an atom). Ornulf noted that we would need to record the time and location of the Event and the time and location of the recording of that Event.

Dan noted that we can have “Information Object Events”. We collect some data, then use the results of several variables to compute the values of a new variable. The “time” associated with that computation may or may not be meaningful. (In addition, the “Observation” terminology also has potential interpretations that we may need to be careful about.)

Flavio argued that the Timestamp in terms of a Transaction is something we would want to record, even if it is not meaningful in terms of a “real-world event” - Time is recorded, even if we don’t really “care” about it. Jay similarly argued from an archival perspective that the time of derivation or similar is captured - e.g. where and when did someone do a calculation. Steve also had an interest in the generation of derived content, where the generation or calculation time is still of interest.

Ornulf - it may also be that we have the time and place of the event, and the time and place of where we observed the event.

Dan’s concern is whether we will always "care" about Time and Space. Ornulf and Larry argued that they need to be optional, but that we should at least have them in most of the transactions we deal with.

Flavio argued that the Transaction time needs to be associated with the Transaction. Where should we associate the time of the Event itself? It probably needs to be associated with the Datum. We need to enable the ATOMIC characteristics of a transaction, so that it can’t be subdivided.
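As a hedged illustration of the distinction discussed above - the time/place of the real-world Event versus the time/place of the Transaction that records it, with all of these optional as Ornulf and Larry argued - a possible sketch (all names are illustrative, not agreed model objects):

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional


    @dataclass
    class Event:
        """The real-world occurrence being recorded; time and place are optional."""
        occurred_at: Optional[datetime] = None
        location: Optional[str] = None


    @dataclass
    class Transaction:
        """The act of recording the Event; carries its own optional time and place."""
        recorded_at: Optional[datetime] = None
        recorded_where: Optional[str] = None
        event: Optional[Event] = None   # the Event whose determination this records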


5. Moving forward:

Agreed that we need to do some out of session work to provide a proposed model of some of the objects under discussion. Flavio and Steve will work to model the following and bring back to the group for discussion:


  • Datum

  • DatumStructure

  • Transaction


  • Event

Next meeting:
Meeting to be moved one hour later, to Thursdays at 10pm CET.
Next meeting: Thursday 27th August at 10pm Central European Time.

GoToMeeting - https://global.gotomeeting.com/join/148887013

 30 July 2015 meeting minutes

Minutes of Data Description Meeting, 30 July

Attendees: Steve McEachern, Larry Hoyle, Jay Greenfield, Chris Seymour, Barry Radler

We began by opening up discussion of Transactions as an alternative to Events, continuing from a series of emails following the previous meeting. Jay started discussing his email on the work by Daniela Meeker.

This approach is based on HL7. In this model (see previous minutes entry), everything is an ACT - administrative acts, procedures, fine-grained measurement, etc.

Jay noted that it may be that we can’t escape this very atomic structure. You may want to group things at the transactional level - particularly to describe the relationships between the items in a Transaction. The Meeker structure may also be consistent with the way big data systems are capturing data as well (using NoSQL and JSON over the top of this).

Notably though, this approach probably allows us to separate the RECORDING of the Event from the OCCURRENCE of the Event itself (which seemed to be creating confusion for some).

One of the challenges was to then figure out what ties everything in an aggregate data structure together (e.g. items in a RECORD, variables and cases in a DATASET, and alternatively into a DATACUBE).

Barry considered the application of this to the MIDUS study. There is a particular need to ensure that the context of a particular capture is recorded appropriately.

Jay considered further the Meeker work to describe an ItemSet as a combination of Items within the same Transaction (i.e. with the same TransactionID and PersonID) which would cover the description of context that Barry describes.

It is probably the case that we should consider replacing Event with Transaction in our DatumStructure.

The granularity of the Transaction may also be a function of the domain - e.g. recording the explosion (at an atomic level) of the atomic bomb is a different time scale from recording the sequence of a hospital visit (an eternity?).

It is also the case that we may not always use certain data aggregations (is DataCube still appropriate?). HL7 is moving in this direction - e.g. Triple structures, documents, as well as records and cubes. See FHIR from HL7 - http://hl7.org/fhir/ .

Similarly we will also want to consider whether Transactions are useful for enabling the description of other Processes within the DDI4 model. So for example is a Transaction a useful object for capturing the Cleaning of a Datum from a DataCapture, or the extraction of a combination of Datums into a DataSet?

It appears that a combination of the Process model and the DatumStructure/Transaction would be the appropriate way for describing this (and Dan G. did some modelling of this that was discussed in December in this group).

Coordination and modelling issues

Barry queried where responsibility for developing Datum and DatumStructure, and for describing processes for e.g. editing Datums, would sit within DDI4 development (as he has responsibilities as Chair of the DataCapture group). It was agreed that Datum sits best with the Data Description group, with the Process model available from within the Process group. DataCapture and others can then make use of the Data and Process objects within their models.

Larry noted in passing that we have no properties yet on Datum. We really should have some property that allows us to record a value with a DataType.

We will also need to rationalise the difference between the GSIM and DDI4 Datum objects.

Next meeting

For next time, need to consider:

  • Transactions as combinations of Items
  • Existing or potential aggregate structures for Transactions: (e.g. FHIR)
  • Building aggregate Data structures (Unit and Dimensional data structures, but also other options)

Next Meeting: August 13, 2100 Central European Time,

GoToMeeting - https://global.gotomeeting.com/join/148887013

Addendum (Jay)

Achim asked me to engage Eric Prud'hommeaux, who was weighing whether or not to accept Mary's invitation for Dagstuhl 2015. We talked and this is what he wrote Mary:

After discussing the modeling status with Jay, I'm convinced this
makes sense for me. I have no relevant census or population modeling
experience, but have worked on clinical trial and clinical data modeling.

I think one of the biggest questions a group like this will face is
how generic to be. Traditionally, exchanged clinical data has a
generic concept of an Observation, the semantics/utility of which
comes from leveraging huge medical vocabularies like SNOMED-CT. The
cost is that this representation is far from what researchers want to
deal with as it would make structured questions large and opaque. This
is traditionally not a big issue because the data was extracted
through excruciating curation processes. As we impose more interop and
exchange requirements, that curation becomes a conspicuous impedance
in analysis pipelines.

I suspect your needs are more closely aligned with doctors writing
decision support and event detection rules. Traditionally, there is a
lot of code to write to enable their rules to work on an EMR, but
again, as we get better at interop, we have better chances to strip
that down to something declarative, portable, and maintainable.

My work in FDA and HCLS is all focused around that. If you'd like to
pursue this, let me know. The other wrinkle is that I'm about out of
travel funding for the year so I'm at some risk for actual attendance.
I think I will put together a PowerPoint that describes the current state of our modeling and how we might be on the cusp of an approach that is at once generic and specific and can support the needs that Eric is describing. It will walk the reader through what I told Eric verbally on our call. As discussed in our last meeting, the current data description approach can support the identification and eventual analysis of care pathways (Daniella Meeker). Daniela was also invited to attend Dagstuhl. I am not sure if she accepted. We had our talk last Saturday. In any event I will try to get permission from her to incorporate parts of an interesting manuscript she is working on about care pathways in the PowerPoint.
 Email Discussion following July 2 meeting

The following documents the additional email discussion on Events that followed in the week after the July 2 meeting.


Flavio Rizzolo, 3/7/2015:

I don't want to complicate this even more, but the way this discussion is going reminds me of the definition of context (sorry for using the "c" word) in this paper :

 http://www.cs.toronto.edu/~flavio/birte2010.pdf

 We don't need to go into the details of the framework, just consider the running example in the introduction for now. 

 Table 1 contains a set of observations about a characteristic (patient's temperature) made by an agent (a nurse) with an instrument (thermometer) at different points in time (time and date). The events in this case could be identified by Patient+Time+Date and the additional metadata about the events is given by Tables 3, 4 and 5. This additional metadata is used in this framework to assess the quality of the data in Table 1 and for quality query answering, but it could be used in many other ways. Note that Table 1 has only the minimum amount of information to identify the event (Patient+Time+Date in this case) and link the observation to the contextual metadata. Maybe that's all we need in our datum structure: an identifier of the context that allows us to link the observation to this additional, rich metadata that describes the observation. And in this contextual metadata is where you could link events and create event hierarchies like those needed to express Jay's example and the different steps given by Larry below.

 Just thinking out loud...

 Flavio 


Jay Greenfield, 3/7/2015:

In a paper I am reading by Daniella Meeker temporal pattern matching is used to identify treatment sequences where each “moment” in the sequence is a tuple consisting of a patient id, a transaction id and an “itemset” where an itemset corresponds to a group of HL7 acts performed at once (a battery of tests, one or more medicine orders, etc).

This is pretty much what we did at Minneapolis except we called the grouping thing an “event”. Flavio is calling an event an identifier for context. Daniella avoids the c-word and calls it a transaction. What encourages me here is that Daniella uses these tuples to circumscribe clinical protocols — both expected (standards of care) and actual (treatment in reality).

 The point is that the context / transaction id has a semantic that can vary. When I was in the emergency room waiting to get hydrated the other day the “itemset” covered a half hour and included a single blood pressure reading and my average pulse over a 30 minute period. In a Daniella example an itemset covered a “prescription time” and includes a set of drugs. So the semantic we attach to the transaction id is variable.


Patient id | Transaction id | Item                 | Value
1          | 1              | BPD                  | 85
1          | 1              | BPS                  | 60
1          | 1              | Average Pulse        | 37
1          | 1              | Transaction Interval | 30
1          | 1              | Transaction Begins   | 10:33:21
1          | 1              | Exception 1 Type     | -
1          | 1              | Exception 2 Type     | -
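Purely as an illustration of the itemset idea described above (rows sharing the same patient id and transaction id form one transaction), the table could be grouped like this in Python; the row values are those from the table, and the structure is an assumption rather than an agreed model:

    from collections import defaultdict

    # Rows from the table above: (patient_id, transaction_id, item, value)
    rows = [
        (1, 1, "BPD", "85"),
        (1, 1, "BPS", "60"),
        (1, 1, "Average Pulse", "37"),
        (1, 1, "Transaction Interval", "30"),
        (1, 1, "Transaction Begins", "10:33:21"),
        (1, 1, "Exception 1 Type", "-"),
        (1, 1, "Exception 2 Type", "-"),
    ]

    # Group items into itemsets keyed by (patient_id, transaction_id)
    itemsets = defaultdict(dict)
    for patient_id, transaction_id, item, value in rows:
        itemsets[(patient_id, transaction_id)][item] = value

    # itemsets[(1, 1)] now holds every item recorded in that single transaction
    print(itemsets[(1, 1)])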


Wendy Thomas, 3/7/2015:

The idea of a transaction would probably resonate within a number of research communities. In essence an interview is a transaction where there are a number of environmental conditions, the questionnaire itself, and possibly other variables such as trigger for start of transaction, etc.

 Wendy 


Flavio Rizzolo, 3/7/2015:

I like the word transaction better since event seems to be a little overloaded these days, not to mention the c-word. At least in the data management domain transaction is a well-defined term and has nice properties, like ACID (atomic, consistent, isolated and durable), which seems to apply in our case. 

 I think the key thing here is what Jay just said: this transaction / event / c-word has a semantics that can vary. I would add that this semantics is given by all the metadata that describes the transaction, e.g. instruments used, agents, relationships to other transactions, etc. whether they are expressed in Datalog predicates, an ontology, or just a UML model. But this goes beyond the actual datum structure, for which just a relationship to a transaction (or a transaction id) should suffice.

 From a modelling perspective, these transactions could be described with: 

i) a few well-defined, concrete extensions that satisfy the common use cases related to unit and multidimensional data sets; or

ii) the referential metadata objects of GSIM (or the DDI4 custom metadata structure?).

 I personally favor the former for the known use cases and the latter for unforeseen extensions, but that is something to be discussed further. 

 Flavio



 02 July 2015 Meeting Minutes

Meeting 2 July, 2015.

Attendees: Dan Gillman, Ornulf Risnes, Jay Greenfield, Larry Hoyle, Steve McEachern, Flavio Rizzolo


Opened discussion with update on where discussions were at.

 Ornulf walked through his key concerns, based on notes of the discussion Ornulf and Steve had at NSD. These are identified within the ppt slide on Google docs.

 Dan still wondered whether the approach is quite right.


Review of the current model (per Google Docs / Drupal version):

 Larry noted that key should be a combination of instance variables

 On event, Larry sees Event as the recording of the datum - eg the creation of the cube, or the recording of the Observation in the Datum

 On Key, Jay took us back to the model in Drupal. He noted that Key had two subtypes - DimensionKey and UnitKey.  The DimensionKey would be a Complex structure.


Discussion of alternative models

Ornulf raised the possibility of comparison to the DataCube RDF - http://www.w3.org/TR/vocab-data-cube/

In the DataCube RDF, the Key is really a combination of Area, period, dimension, and Measure. This should still be able to be represented as InstanceVariable.

 In terms of Events, Jay suggested that the link is between annotations and Events, with Event as a grouping.


Next steps:

Dan suggested that we may want to look at the dimensional data case and work from there. This would be: time, space, dimensions, measure

 Jay noted that not adding semantics may be a good thing - these can be added later.

 In essence here we have three possible approaches. Can we reconcile these?

 Flavio suggested that the DimensionalKey should be able to be mapped to the current structure. We still have the issue of the Event to consider.

Jay noted concerns that data flows (e.g. continuous monitoring of individuals) don't really have a reference period. Flavio wondered whether there was a use case for this.

 Can we provide a generic structure to describe the RDF structure?


Next meeting: Thursday 30 July, 2100 Central European Time


 18 June 2015 Meeting minutes

Meeting 18 June 2015, 2100 CET

Attendees: Dan Gillman, Jay Greenfield, Steve McEachern, Ornulf Risnes, Flavio Rizzolo, Chris Seymour, Achim Wackerow

Apologies: Larry Hoyle, Barry Radler


The meeting commenced with a review of the progress on the model made at the Minneapolis sprint and subsequent modelling in Drupal by Flavio Rizzolo (http://lion.ddialliance.org/ddiobjects/datumstructure). Steve began with a summary of the current state of the Datum section of the model, provided in the overview here: https://docs.google.com/presentation/d/10vCbqQVKbAsD5dPJHXl_T5OWhkC7IgSjgV5NQnhgFQ0/edit?usp=sharing

Dan Gillman also proposed a revised definition for Datum, circulated to the team prior to the meeting:

"I have refrained from asking to use a definition of datum I reported at the Stanford IASSIST and METIS (UNECE Statistical Metadata Work Session) about the same time, but the PowerPoint file linked below makes me think it’s time. This is from work I did with Frank Farance, friend and former standards colleague in the 11179 community. The definition states
datum := designation of a value
where
value := concept for which a notion of equality is defined

This will sound very academic and abstract until you realize the following –
1) designation is just the terminological way of referring to a term, name, numeral, code, or some other way of representing a concept, and the representation, or sign, is the letter, numeral, string, or some other symbol in use
2) every datatype defined in 11404 (the standard on datatypes) contains the property that equality is defined for all the values in its value space (the values a datatype defines)
3) equality is a necessary condition for copying, which all data undergo in computer or “paper and pencil” processing
4) it is easy to see where it is the concept rather than the designation that is used in computation – we know 2 + 3 = 5, but it is true independent of whether the numeral 2 designates the number two, etc., for in Roman numerals, II + III = V just the same

I would like the above definition of datum used in our documents."

Unfortunately time did not permit discussion of this proposed definition at the meeting, and it will need to be followed up at a subsequent meeting.


Comments on the model summary

While the remainder of the slide deck was presented, the main discussion for the remainder of the session focussed on two sections:

1. DATUM STRUCTURE EXAMPLES

(https://docs.google.com/presentation/d/10vCbqQVKbAsD5dPJHXl_T5OWhkC7IgSjgV5NQnhgFQ0/edit#slide=id.gb374ea45a_2_43)

Ornulf noted that the current way in which attributes are embedded in the Event may oversimplify the attributes. Key may have similar issues, but he was more concerned about Event.

Flavio noted that the key is subclassed into dimensional and unit keys.

Jay also noted that.. (Missed comment here - need to add)


2. AGGREGATE DATA STRUCTURES - UNIT RECORD DATA

https://docs.google.com/presentation/d/10vCbqQVKbAsD5dPJHXl_T5OWhkC7IgSjgV5NQnhgFQ0/edit#slide=id.gb374ea45a_2_26

Steve noted that this current version implies that a Record may not necessarily have values captured in the same event. Flavio noted that we may want to associate event with either the Record or RecordSet.

For discussion is how Event gets included here within the Unit Record structure. One option may be to manage this by using subclasses of Record - for example, in Time Series we wouldn’t want Event to be included in the Record (as a record will have content from multiple measurement events).
Dan wanted to caution to ensure that we retained sufficient information with Event to ensure that it is able to distinguish between things.
Ornulf noted that there are many more ways of organising unit record data that Record does not capture. That said, we were trying to describe a CSV specifically, but the basic table structure of CSV does cover a significant number of use cases (CSV, Spreadsheet, relational database table, ...)

It was agreed to continue discussion on the implications of the Event requirements for aggregate data structures at the following meeting.


Next meeting will be Thursday July 2nd. The time may be adjusted to earlier in the day depending on the availability of group members.


 Summary slides following Minneapolis sprint, June 2015

There is a new slide deck summarising the model output on Datum Structures from the Minneapolis sprint. This is for discussion at the next meeting of the Data Description group on June 18.

The slides (in Google Docs) are available at:

https://docs.google.com/presentation/d/10vCbqQVKbAsD5dPJHXl_T5OWhkC7IgSjgV5NQnhgFQ0/edit?usp=sharing 


The slide deck is based on Flavio Rizzolo’s modelling of the Datum Structure, which he has added to Drupal:

http://lion.ddialliance.org/ddiobjects/datumstructure

Plus some of the definitions we developed in Minneapolis for aggregate structures

https://ddi-alliance.atlassian.net/wiki/display/DDI4/Simple+Data+Description+meeting+minutes

 MPLS Sprint 2015-05-29 Data Description Morning Meeting Minutes

We had got to the point of needing to be able to describe the dimensions, unit, variable, and value.

  • Can we break out any datum in this style?



                     | Key        | Value | Attributes        | Population                     | Event
Unit Record          | UnitID     | Value | Variable          | Universe (e.g. persons)        |
Dimensional (n-cube) | Dim1, DimN | Value | Variable, (Count) | Unit of analysis (e.g. people) |
"Population" depends on your unit of measure.

  • Datum
  • Record
  • RecordSet
  • Cell
  • Slice

 Two things are being described: cell in the cube, and the cube itself

A slice also needs to be described.

The cell is defined/identified by the key.

The population for the n-cube and the population for the microdata that generated the n-cube are slightly different things.

Population imposes a context on the value.

What's the population in the datum case?


                   | KEY               | EVENT                         | VALUE         | VARIABLE          | UNIVERSE
Blood Pressure Ex. | Patient ID        | 10pm Collection               | 120           | Systolic BP       | Patients
                   | "                 | "                             | 80            | Diastolic BP      | "
                   | "                 | "                             | Skinny (code) | Cuff Type         | "
                   | ID                | Survey                        | F             | Gender            | Patients
Cube Ex.           | Region1,Industry1 | 2015-01 as revised in 2015-02 | 300           | Number(employees) | Establishments in target regions
                   | Region2,Industry1 | "                             | 200           | "                 | "
                   | Region1,Industry1 | 2015-02 as revised in 2015-03 | 200           | "                 | "

Record: Ordered set of Tuples all of the same Key and Universe

RecordSet: Accumulation of records sharing all of the same Universe.

Do these definitions work for a cube? Yes.

Cell: Set of Tuples all of the same Key, Event, and Universe.
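A minimal sketch of these three definitions over (Key, Event, Value, Variable, Universe) tuples, assuming the worked tables above as input; the Python names are illustrative only:

    from collections import namedtuple

    # One tuple per datum, as in the worked examples above
    Tuple5 = namedtuple("Tuple5", ["key", "event", "value", "variable", "universe"])

    def record(tuples, key, universe):
        """Record: ordered set of tuples all of the same Key and Universe."""
        return [t for t in tuples if t.key == key and t.universe == universe]

    def record_set(tuples, universe):
        """RecordSet: accumulation of records sharing the same Universe."""
        keys = {t.key for t in tuples if t.universe == universe}
        return {k: record(tuples, k, universe) for k in keys}

    def cell(tuples, key, event, universe):
        """Cell: set of tuples all of the same Key, Event, and Universe."""
        return [t for t in tuples
                if t.key == key and t.event == event and t.universe == universe]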

(Images of board: 1, 2, 3, 4, 5)

 MPLS Sprint 2015-05-26 Morning Meeting Minutes

HOW FAR DO WE WANT TO GO WITH WHAT WE DESCRIBE?

Jay has put together a deck and has a proposal

He is modifying GSIM model of “data set”

The first interesting thing is that the way GSIM represents attributes doesn’t give them the possibility of having a structure. We’d want to modify it so attributes could have a structure.

This would be a hook to enter what Larry and Arofan are doing.

[See the ppt]

Discussion took place about what defines 1NF/3NF in the GSIM model and Jay’s proposal. But does it matter, or can the terms be changed for description?

The description that Jay proposed makes sense, but terms should be changed to avoid NF’s.

Attributes need to be worked into the GSIM model as they are variables. There are variables in the attribute sets.

LARRY - In DDI do we want to model a datum as a collection of variables or a single variable?

DAN – it’s a single.

LARRY – but then Ornulf describes a datum as a collection of variables

So what are the terms to be used if we’re calling a datum a single variable?

Datum

Data Structure

  1. “Datum Structure”
    1. identifier(s)
    2. measure(s)
    3. attribute(s)
    4. Logical records
      1. Measure(s)

Coming back to Jay’s stuff this morning.

2 different types: the logical record and the basic idea of a key-value pair

(reordering above)

  1. Logical records
  2. Key-value pair
  3. Datum structure (which builds a logical record)

Would the key-value pair be possibly triples? Graph data?

Where are we in relation to the work done yesterday? We have a basic structure to then describe a CSV file.

DAN - What could be called a key-value triple contains a variable (attribute), unit (ID), value (measure). (There are parallels between this and the datum structure.) So this is the fundamental thing. Let’s use that to define a record, and from that define a CSV.

Record is an ordered set of these key-value triples (“kvipple”) that share the same unit.

Larry making a proposal

We’ve got this record which has 3 collections associated with it: ID, Measures, Attributes. 

Record, ID, Measures, and Attributes are all collections.

Then we want to define a structure of records. That can be instantiated as a dataset

RecordSet is a set of Records (a sub-class of collection)

A DataStore is a store of a RecordSet

STEVE - Can we describe a CSV at this point?

Moving from RecordSet to DataStore we move from logical to physical. We have separated the logical and physical forms

A CSV is one type of DataStore, and all the logical parts are in the RecordSet. Fixed Format is another type of DataStore.
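A hedged sketch of the logical/physical split described above, assuming a "kvipple" is a (unit, variable, value) triple and that a CSV file is one physical DataStore for a logical RecordSet; all names, sample values, and the output filename are invented for illustration:

    import csv
    from collections import namedtuple

    # "Kvipple": a key-value triple of unit (ID), variable (attribute) and value (measure)
    Kvipple = namedtuple("Kvipple", ["unit", "variable", "value"])

    kvipples = [
        Kvipple("P1", "age", 40), Kvipple("P1", "income", 27000),
        Kvipple("P2", "age", 35), Kvipple("P2", "income", 31000),
    ]

    # Record: ordered set of kvipples sharing the same unit; RecordSet: set of Records
    record_set = {}
    for k in kvipples:
        record_set.setdefault(k.unit, {})[k.variable] = k.value

    # A CSV file as one physical DataStore rendering of the logical RecordSet
    with open("recordset.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["unit", "age", "income"])
        writer.writeheader()
        for unit, measures in record_set.items():
            writer.writerow({"unit": unit, **measures})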

What does a Key-Value Triple option look like? How can this work with aggregated data?

GSIM didn’t try to tackle them all under one structure; are we trying to do it with one?

We can use the basic model of building this up, but we have to interpret it differently and have different relationships associated with it in the case of aggregates.

We need to solve the problem of dimensional data.

Take the combination of the values of each of the dimensions; every combination defines a different cell. Applied to the unit type in the microdata, each combination itself defines an aggregate unit.

Record: Cell

Unit Type (e.g. “people”)

Dimensions (e.g. “age”, “sex”)

Measure (e.g. “income”)

Key: 40 y.o. male plumbers (1..n components)

            The component could be represented by variables

Each kvipple is a cell. And every cell is a record. The unit incorporates the key.

Are we losing the dimensions?

Does the model work?

The only thing that’s really changing is the idea that the unit is going from one kind of object to an abstract collection object. The unit is the set as a completed set, not the individual elements within it.

The dimension isn’t lost; it’s a combination of aggregated variables.

Unit + dimensions + variable + value = Key

The unit is shared by the entire cube. It describes the characteristics of the entire population. (working with census data)

For the microdata dimensions are constant (e.g. person). For the macrodata the unit is constant.

Key is M,40. Variable is income. Value is 27,000

Is the unit the cube, or the combination of things in the key?

What is the unit?

In a microdata case each cell is a record.

The unit is identified by the key; it’s the interpretation of each cell.

Dimensional data takeaways:

  • There’s something going on here


Units, whether groups or individuals, mean different things. The unit is dependent on the key.

What's the unit of analysis? The unit of the cube or the unit of the cell? What do we want to do with it?

The unit question - the answer lies in where we attach more information.

We want rules for putting together different slices to assemble the RecordSet for the unit.

We need to say what the "thing" is before we put everything together.

Need to look at how datum is described from the point of view of the variables.

The following email and links were provided by Ornulf following the call:

Regarding the question of relations: we've lately come across some interesting thinking in what seems to be an alternative (and more forgiving) way of Data Warehousing - Data Vault Modeling:

http://en.wikipedia.org/wiki/Data_Vault_Modeling

They have this distinction between Hubs (Units), Satellites (Datums) and Links (relations between Hubs) that looks pretty relevant.

Perhaps some of the participants have heard about this (and discarded it). If not, it's worth a glance at least.

Here's a slideshare that also goes into the newer "hyper agile" data vault solution, where satellites (datums) have a flattened-out structure:
http://www.slideshare.net/kgraziano/agile-data-warehouse-modeling-introduction-to-data-vault-data-modeling

If we strive for 3NF (not sure how I feel about that though) we definitely should take DW modeling into consideration.

 MPLS Sprint 2015-05-25 Afternoon Meeting Minutes

In seeking to start creating a simple logical structure, we began by looking at the 4 objects that had been created during Dagstuhl: DataPoint, DataStructure, DataStore, and DataStoreSummary. Also Dan Gillman began brainstorming a model of DataStructure along with the group.

Review of the DataStructure led to discussion of whether any parts of it needed to be reviewed and redesigned.

A DataStructure is an ordered set of DataPoints (a record). And a RecordSet is a collection of DataStructures (a table).

The discussion raised the issue of types of records and sequence of records.

Question – do we want to describe a very simple CSV (all DataPoints in a column are the same variable), or a more complex type e.g. a Household, Person structure with record type variables and sequence variables?

If all records do not contain the same sequence of variables then we need to describe record types and sequences.

 MPLS Sprint 2015-05-25 Morning Meeting Minutes


Discussion focussed on what has been completed and plans for the Sprint.

Done:

  • Logical Model
    • Value Domains
      • Sentinel vs Substantive
    • Variable Cascade
      • Conceptual/Represented/Instance
  • Concepts
    • Variable
    • Datum


To-Do:

  • Subset of Structure:
    • Simple Logical Structures
    • Physical Description
    • Describe a CSV

First goals are to complete this to-do list.

We can't yet:

  • Put data in a structure - or as a derivation of some kind
  • Track a datum (creation, editing)
  • physical descriptions

Discussion was raised about the definitions of "simple" and "complex". Can these distinctions be maintained?

Simple is caring about a datum just as a datum. We talk about what simply applies to the fact that it's a datum - never mind how it arrived.

GSIM model – attributes and unstructured data – represented variable has attributes in a structure. GSIM has a unit data structure – one instance is a table.

Dan brought up that bringing in the data structure issues takes us from simple to complex. So should the two just be combined into a singular data description?

Transformations were discussed, with Dan mentioning VTL and asking whether it should be reviewed and perhaps adopted if it makes sense and does what is needed.

What is in the pipeline as deliverable for the next release? Simple data description.

Coming up with a model that's logical data structure and is "simple" could be a goal for the week.

So the start will be working through the Simple Logical Structure, Physical Description, and then be able to describe a CSV file.

Afternoon will be working through the subset of structure.



 Meeting minutes 21 May 2015

Attendees: Larry Hoyle, Steve McEachern, Barry Radler, Ornulf Risnes, Chris Seymour

Discussion focussed on

A. Plans for Sprint

B. Comments on custom metadata paper

 Starting with custom metadata, Larry raised further thoughts and concerns he had on the potential for recursion in having instance variables describing other instance variables. This works for the RAIRD data structure situation, but may not work in all circumstances.

 Another example where this may be useful is one Larry raised of using this model to describe the “universe” for some metadata item.

 Is there some generalizability of different parts of the data description that might make use of this structure? Ornulf raised the possibility of requiring the conceptual and value domains (rather than the variables). Larry raised that for conceptual purposes you need the two variables, but in implementation the two would only use the instance variable.

 Barry raised the parallel with a “conceptual instrument” as opposed to an implemented instrument. Depending on the mode of collection used, the implemented instrument may use the conceptual model, but have some specific characteristics that can’t be known in advance and that are critical for understanding and describing the measurement. In implementation we would again only make practical use of the implemented question, but make reference to the conceptual/represented questions.

 Ornulf made the point that in practice, a conceptual variable is in fact a placeholder for reusable content that we will want to implement. In that context we may want to avoid specific use of the higher levels of the cascade. The conceptual in this case is really a “class of variables” – a set of variables that share a concept, (or similarly a value domain). The RAIRD approach may or may not make this normalised – that is still to be determined.

 There was agreement that there may need to be discussion on the use of the custom metadata model for describing data structures. Ornulf noted that fundamentally the structures are straightforward from an analytical perspective. It appears that Jay’s data structures from Splunk will be the more relevant to the group in testing the applicability of the custom model.

 In terms of work for the sprint, there was agreement that there were three main areas of work to complete:

-       Review and completion of the datum model

-       Evaluation of the custom metadata model for other data description requirements

-       Review and completion of the physical data description (based on PhDD and Dagstuhl output)

The aim will be to have a useable model for evaluation by early in the week of the sprint, with a teleconference to occur Tuesday May 26th at 9am Minneapolis time (9am Minneapolis, 4pm Bergen, Norway 2am Wed Christchurch, NZ). Contact details for meeting to be provided on Monday when known.

 The outputs from the sprint will then be used by the RAIRD team in a follow-up discussion at NSD in mid-June, working on their metadata model. Other groups will also be asked to provide stress testing at that time.

 Joint meeting with Modelers 14 May 2015

ATTENDEES: Steve, Arofan, Barry, Wendy, Dan G., Jay, Larry, Ornulf

Background:

  • Based on some work Achim, Larry, Wendy, and Arofan were doing to describe some loosely defined data description.
  • Jay: National Cancer Institute wants to grow their metadata repository for NIH (caSDR) <wiki.nci.nih.gov> <cdebrowser.nci.hih.gov>. What they want to do is modernize: grow the semantics, become players on the semantic web, and push into RDF. The problem is transitioning from the current model to one with attributes, and dealing with the evolution of attributes.
  • HL7 is probably the only organization that knows how to do bindings (XML, RDF, etc.)
  • The context around the data element is the key thing
  • Different context may mean different models

See document: Structured Custom Metadata for DDI v2

  • The primary place this happens is in the use of controlled vocabularies - all we do is reference an external list but don't have a means of validation. Also user attribute pairs (intended for internal system support).
  • Section III: (pg 3) where structured customized data may be useful
  • Goes through options: CV approach with a description of CV in DDI so you could validate using DDI collection structure
  • Metamodel approach (pg 11) - describes attributes, structure, type, etc. using lists, hierarchies, and graphs
  • Custom metadata report containing a set of key value pairs
  • When you do know how to model something, you model it; this would be used for those things you don't know how to model
  • Section V: Proposal for types of data description in DDI 4 - the ability to describe rectangular files, n-cubes, and datum-based data structures, but using the metamodel for qualitative and unanticipated quantitative data structures
  • How does this mesh with what this group is doing?
  • This approach seems to work for unanticipated content.
  • Can't see any problems with this approach, but it may be too complicated for use in the RAIRD project. However, the proposal is to have explicit models for the datum-based data structures.
  • Places where we need to add context to data points.
  • The word context makes the hair on the back of Dan's neck stand up. It is not a well defined thing at all, so can we get rid of the "c" word?
  • "Custom metadata attributes"
  • The pg 11 image shows a similarity between CM items and RAIRD variables

May need more discussion in Mpls. This may be closer in a way to what Ornulf has done than it appears at first glance.

We have to be sensitive to complexity. Great to have the metamodel, but I can imagine that at each extension point you would have a direct model based on the metamodel. So from the user's perspective it would look like what Ornulf designed.

You want to hide this stuff from the end user behind the extension point.

You want it to be consistent and regular and the use of the pattern can help us do that.

Who is going to be at the sprint? (Dan, Barry, Larry, Steve, Wendy.) Ornulf is not available Wed or Thurs, but we can set up time Monday or Tuesday. Shoot for Tuesday morning (7 hour difference).

In terms of instrument you could set up parameters for an instrument that you could validate.


 Data Description Meeting Minutes 7 May 2015

Meeting Attendees: Jay Greenfield, Steve McEachern, Larry Hoyle, Barry Radler, Ornulf Risnes

Jay presented the work he has been doing with Splunk around eCQM lifecycles, which appears similar to what the group has been working on with a “datum lifecycle”, and provides a similar approach to providing provenance chains for collating and integrating unstructured and structured data into a common structure, and provides a data structure (with late bindings, and normalised data) for reporting.

Larry queried what the underlying data structure was in the final structure, and Jay noted that he was unable to describe this directly (partly for commercial-in-confidence reasons). However there is likely to be RDF and JSON expressions that could be described. Jay gave some additional information on the background to the Splunk company to help understand where their approach and modelling has come from.

Jay noted that there is some distinction between the predictive and descriptive sides of describing such workflows. The eCQM lifecycle model also has some nice parallels with the DDI and datum lifecycles, and the group discussed using this lifecycle and the NADDI evaluation as two use-cases with which to road test the instrument and data description models in the Minneapolis sprint.

The next meeting will be a joint meeting with the Modelling group in one week’s time (provisionally) on May 14th at 10pm European Central Time (Meeting URL to be confirmed). The group will then meet again at our regular time on May 21st at 9.00PM Central European Time in preparation for the Minneapolis sprint.

 Data Description Meeting Minutes 23 April 2015

Data Description meeting minutes - Thursday April 23rd, 2100 Central European Daylight Savings Time

Attendees: Steve McEachern, Dan Gillman, Larry Hoyle, Barry Radler, Simon Lloyd, Chris Seymour, Ornulf Risnes

Larry discussed the need for meeting with the Modelling group around their metamodel proposal, and the purpose of the metamodel. This was seen as a possible way of addressing Ornulf’s requirement in RAIRD for contextual information on a variable (discussed at the previous meeting).

Dan argued that we need the ability to describe something as either a variable or an attribute depending on the particular situation in which it has been used. Dan also commented on his concerns about the use of a metamodel, and what the implications of that will be.

The understanding is that there is a framework here that can describe elements in the model as, for a given situation, either a measure or an attribute to understand the particular measure and the context under which it was generated (e.g. Ornulf’s blood pressure example discussed in the previous meeting).

Larry gave a short overview of the proposed model - the description of the information objects required, and then the model for using those objects.

Dan queried where the Measure element of the variable(s) would be included in the proposed model - that is, how do we then merge the defined attributes to the Measure. Ornulf articulated this further, with reference to his RAIRD work. He noted that in GSIM and DDI4, all of the information is contained in instance variables. This still has issues that need to be played out in the joint meeting.

Dan noted that we also need to be able to describe relationships between measures, as well as measures to attributes. e.g. Systolic to Diastolic BP, Latitude to Longitude. Larry gave a couple of additional examples where this can also be applied. It does not appear that the initial version of this metamodel proposal enables this at this time.

The group has indicated that we would like to see the metamodel proposal distributed ahead of our next meeting on May 7th. We would then propose the joint meeting for Thursday May 14th at 10pm Central Euro time, subject to availability of Modelling group contributors.


Next meeting: Thursday May 7th, 2100 Euro Central Daylight Savings Time

GoToMeeting URL: https://global.gotomeeting.com/join/148887013

Equivalent times:

New Zealand - 7am Friday

Canberra - 5am Friday

Mannheim, Bergen - 9pm Thursday

Washington DC - 3pm Thursday

Madison - 2pm Thursday

Lawrence - 2pm Thursday

Minneapolis - 2pm Thursday

 Data description meeting minutes April 09, 2015

MEETING MINUTES

Data Description meeting, 9 April 2015

Attendees: Dan Gillman, Jay Greenfield, Steve McEachern, Ornulf Risnes


Aim of meeting was to restart development of the Physical Data Description side of our activity, which has been dormant since Dagstuhl.

This was informed by three sets of work:

  • The PhysicalDataDescription view (developed at Dagstuhl 2014) - (hyperlink)
  • The proposal from the SCOPE committee in the US (Word doc link)
  • PHDD RDF description in the DDI vocabularies - (hyperlink)
  • (And in some ways, the "DataPoint" section of the GSIM model - (hyperlink))

The discussion commenced with review of SCOPE by Dan. Dan gave the history of the SCOPE project within the US Federal statistical system. The aim was to integrate basic requirements across agencies, to enable better data discovery, resource sharing, cost savings, etc. Description is based on variable types - e.g. standard "variables" but also "dimensions" from cubes. So cubes are described as dimensional variables.

Comments on SCOPE:

* Ornulf liked the simplicity of this model. Much of the model can be mapped to the represented variable - and Ornulf has a sense of how we can also incorporate the Datum level.

* Brief question - Is SCOPE extensible? e.g. Variable Label


Ornulf then continued on and gave an example in his email of how to represent some of the OpenEHR examples in this way. Dan commented that a generalisation of the Capture-Observation-Datum model may be the relevant placeholder model that is needed to store the context information that Ornulf is looking for.

Ornulf took a step back to point out the short-term and long-term requirements of this group. He was trying to point out the need for a simple description, and that attributes of a measurement process (e.g. room temp in a blood pressure measurement) may be able to be addressed by thinking of the attributes as their own variables

Dan noted that the contextual information on a variable can be

* adjacency of the columns

* attributes of the field (similar to Ornulf's example)

* reference to another system (by URI)

* by inference

Jay pointed out that we could use the DataStore and DataStructure (from Physical Description), and that Ornulf's example uses the attributes approach above along with the DataStructure to do this. Jay has a similar example which he shared in a Powerpoint.

The question that arises is how we want to represent context. Dan argued that context was "anything that you want to include in your association to a variable". To achieve this, you then need a way of relating variables to context that can be represented in other variables (such as the blood pressure and temperature example). For blood pressure, the context is how it is measured, and the situation it is measured in. If we can represent associations, then we can probably achieve the context description through variables. However this is currently not part of the "data dictionary" approach in SCOPE. This led into a discussion of the current PhysicalDataDescription and particularly where the DataStructure then becomes relevant.

Dan's question: do we want to describe relationships between variables each time on an ad hoc basis, or do we want to describe a more standardised approach that is repeated between usages (i.e. using the same variable in multiple data structures and multiple ways).

Jay indicated that there are many examples in health where reporting of variables is dependent on reporting on a set of related information, in a specific information model (e.g. blood pressure requires instrument information, temperature, etc.)

 Dan suggested that the DataStructure and the "Information model" (in Jay's sense of the term) should be separated. He argued that DataStructure is largely static, for storing and transferring data. Independent of the DataStructure, the information model can be described separately as a "map" for how it can be described together. It can be based on the instance variables, as "in order to understand A, you need to have B and C as attributes of A".
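One possible reading of this separation, as a hypothetical sketch only: the DataStructure remains a static column layout, while the information model is simply a map from a variable to the other variables needed to interpret it. All names below are invented for illustration:

    # DataStructure: the static layout used for storing/transferring the data
    data_structure = ["person_id", "systolic_bp", "room_temp", "cuff_type"]

    # Information model: "to understand A, you need to have B and C as attributes of A",
    # expressed independently of the DataStructure
    information_model = {
        "systolic_bp": ["room_temp", "cuff_type"],  # context needed to interpret BP
    }

    def context_for(variable):
        """Return the variables that provide context for interpreting `variable`."""
        return information_model.get(variable, [])

    print(context_for("systolic_bp"))   # ['room_temp', 'cuff_type']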

Ornulf then had the related question of how to represent the Case Identifier (e.g. a Person ID or a Primary Key in a relational DB table). There was a belief that standard ways of handling primary keys would be suitable for this.


The group agreed at this point that there was a need to separate the DataStructure and Information Model. This means that we need an additional set of objects for describing information models, which we don't yet have. Jay mentioned a possible need for a metamodel for the information model, for which we may already have candidates - e.g. JSON, UML, SQL.

To further this, at the next meeting, the group will then focus on consideration of the current DataStructure objects in the Drupal site, and then start to map out the requirements for the Information Model objects.


Next meeting: Thursday April 23rd, 9pm European Daylight Savings Time

GoToMeeting URL: https://global.gotomeeting.com/join/148887013

Equivalent times:

New Zealand - 7am Friday

Canberra - 5am Friday

Mannheim, Bergen - 9pm Thursday

Washington DC - 3pm Thursday

Madison - 2pm Thursday

Lawrence - 2pm Thursday

Minneapolis - 2pm Thursday



 Data Description Meeting Minutes 26 March 2015

 DataDescription Meeting Minutes: Thursday March 26th, 2015

Attendees: Jay Greenfield, Dan Gillman, Larry Hoyle, Barry Radler, Ornulf Risnes, Steve McEachern

Jay walked through the thinking of where the current Process model is now at, and what had fed into the work so far. He pointed out that the model (and 3.1 generally) were based on our “traditional” model of questionnaires and datasets, but that now new datatypes are becoming commonplace and possibly dominant. Our recent work has largely been exploring these types.

Known cases we are now asked to support include:

  • Administrative data

  • Qualitative data

  • Experimental data

Jay pointed out that we need to take on board a new notion of lifecycle, or in other words, per Ornulf, there is more than one way to generate a datum. Dan and Jay both pointed out that in this “new world”, we have no clear paths to a datum. This is something that needs to be further fleshed out.

Dan’s comment: The logic for Questionnaire data is clear: question - observation - capture - datum. Other cases are less so. e.g. Derivation: generates data, but requires no question. Here the input is an existing datum.

Ornulf noted that a derivation has various characteristics: it has an input datum, a formula for the derivation, and a datum as an output.

Larry gave an example from a clinical psychologist in which a process is used to collect a combination of questions and observations, but the ultimate “thing” being recorded is actually the scale score as the datum. Barry noted that there are similar sections in MIDUS where the parts are not relevant, but it is the whole that matters.

Barry points out that the step between capture and datum (subsumed now within Observation and Process Step) is “hiding” a number of significant steps - but that we can probably draw on the strength of the process model to document this.

Jay considered a similar case of Computer Adaptive Testing which works from a battery of test questions to ask a set of increasingly difficult or easy questions, and that adapts based on previous responses. Dan points out that there are some similar cases in the survey community, and Barry gave a similar case of conjoint analysis in marketing, as did Jay in EHR.

It may therefore be appropriate to start digging into the process model to see if we can accommodate some of the above use cases using the current combination of Capture, DataDescription and Process.

Jay suggested that we should be exploring these in detail - and that it cannot be rushed. It would be useful therefore to now develop these use cases to test out the current version of the model, to (a) assess the current objects and process model, and (b) determine what else needs to be included.

Suggested worked use cases:

  • Ornulf’s derivation process for RAIRD event data

  • Larry’s clinical psychology example

  • An administrative data example (Steve??)

  • Other suggestions??

Jay noted his work with Splunk here, where they are always aggregating and disaggregating from the datum level. Dan noted worries here about confidentiality in such a process. Jay also recognised this, but pointed out the access rights associated with each datum as one means to resolve this. Ornulf also had been addressing this solution in the RAIRD work, using statistical disclosure control on the end products.

Moving forward, it was agreed to take away these use cases, and start describing using the Capture/DataDescription/Process views. Example cases are given above, but it would be good to get additional cases of interest to the members of the group - particularly where group members are collaborating on cases. This work will require some extensive thinking, so the agreement was made to continue to work on these use cases, but to switch focus for our fortnightly meeting to the Physical Data Description.

Next meeting: Thursday 9 April. Time to be confirmed (due to Daylight savings changes in Europe and Aust/NZ)

Agenda will be to review and evaluate the current status of Physical Data Description. This will need to focus on:

  • The file description

  • The logical structure.

In preparation, it would be useful if team members could review the three pieces of work so far in this area:

 Meeting minutes 11 March 2015

Data Description Meeting 11/3/2015

Attendees: Steve McEachern (ADA, Australian National University), Larry Hoyle (IPSR, University of Kansas), Dan Gillman (BLS), Barry Radler (MIDUS, University of Wisconsin), Simon Lloyd (ABS), Ornulf Risnes (NSD)

We updated the progress since the last meeting, particularly the document Steve and Barry generated out of the "Linking..." presentation developed by Dan and Jay. This integrated model, bringing together the interface between Capture and DataDescription, is available here as a PDF, with the objects and relationships specified in the document available in the http://lion.ddialliance.org Drupal site.

  • Dan gave some initial comments on the model: What about those datums that are produced out of an observation that is not from a capture, e.g. a datum from a derived variable

  • Barry and Larry made the point that any observation is an outcome of a process - but that may not be generated by an instrument (e.g. generation of survey weights)

  • Complicating processes include: editing, computation, derivation, weighting

  • The term “observation” also alludes to originating from a physical source - where the above do not originate in the physical world, but are machine-generated processes

  • DDI 3.2 has a Generation as an output of producing a Datum from another machine data source - this might be a good existing option to draw upon

  • Capture and Generation would be sub-classes of a higher level class

  • Ornulf makes the point that this “first capture” versus later “derivations” may over-complicate the model - and may also create an artificial distinction

  • It may be the case that this distinction may be better defined within the Process group (as a “Processing Cascade”??)

  • The distinction between observation and generation would then arise when you determine where this arises in the processing cascade.

  • The class could also be a base class in the Conceptual view, an “UberDatum”

The general conclusion from the discussion is that the relationship between ProcessStep, Observation and Datum looks sound, but that the ProcessStep and Observation objects may need additional work in order to see if they are sub-classes of a broader type.

Thus the next meeting will explore further the requirements both Capture and DataDescription have for the Process model. In the interim, additional email discussion will continue around comments on the Capture-DataDescription link, building on Jay’s discussion of similar issues in HL7 and OpenEHR.

The provisional time for the meeting will be Thursday March 26 at 8.00PM Central European time. The GoToMeeting URL is:

https://global.gotomeeting.com/join/148887013

However given Jay’s existing work and his role with the Process model, which are the next step in our discussion, we will coordinate times around Jay’s availability if required.
 Meeting minutes 26 Feb 2015

Data Description Meeting 27/2/2015

Attendees: Steve, Larry, Jay, Barry, Wendy


The meeting started by reviewing the suggested work plan for now to May.

We started by considering the current status of Datum/DataPoint, and the need for integration with Capture/Instrument.

Jay gave his sense of where Datum is at for now. He felt that this is close to conclusion, but does need some reasonable time dedicated to it, and ought to incorporate the most recent work from Ornulf. It was felt that this work could potentially be done at Minneapolis, although it is noted that Ornulf can't attend that sprint.

The discussion moved to the related issue of how Capture and DataDescription come together. Jay returned to the “Linking Instrument Observation Datum and Variables” paper he and Dan authored that was presented at the Jan 15 meeting, which walks through the link (through Observation and through Questionnaire). This paper still requires some further refinement to harmonise the different objects between the paper, the current DataDescription and Capture objects, but provides a likely resolution of the interface between the two views. Barry and Steve will meet to reconcile Jay and Dan’s paper model with the terms and objects in use, and then to bring this into Drupal. This will then be discussed at the next meeting.

There was a side discussion on the current status of the Drupal (Lion) system. Further clarification is needed on the recent status of the “freeze” on the system. An update is needed from the modelling team on the current usage of Drupal.

 Barry also noted that there is a need for structured use cases for sharing between the Views. Jay suggested the need for end to end use cases.

The next meeting will review the reconciliation process Steve and Barry will complete. It is proposed to then move into a detailed discussion of the Physical Data Description, starting with the current PhysicalDataDescription output from Dagstuhl, and then looking to reconcile this with the more complex case expected from the Datum-DataPoint discussions - particularly the work being done by Ornulf's team on RAIRD at NSD - the paper for this (linked here) was discussed at the Feb 12 meeting.

As an update, Jay also provided the paper he and Dan have written for the FedCASIS conference - available here.

Actions:

- Steve and Barry to reconcile the Linking paper with the current DataDescription and Capture views in Drupal

- Ornulf to provide updated version of his Datum use case

Next meeting: Thursday 12 March, 8.00pm Central European Time

GoToMeeting Link: https://global.gotomeeting.com/join/148887013

 Feb 12 2015 Meeting minutes

Simple Data Description, 12-13 February meeting minutes

Attendees: Steve McEachern (chair), Dan Gillman, Jay Greenfield, Larry Hoyle, Barry Radler, Wendy Thomas, Achim Wackerow

Apologies: Chris Seymour


The meeting started with a status update on Jay and Dan's work on the Variable Cascade and DataPoint/Datum proposal. Dan and Jay will be presenting their paper at a conference in March. The paper will soon be revised in light of this and will be distributed to the group once it is ready for the conference. Ornulf is planning to apply the case to work he is doing with the Data Without Boundaries project.

Ornulf then walked through his current use case, based on the outcome of work from recent meetings on the RAIRD project between NSD and Statistics Norway. He is working on the RAIRD metadata model now, but may also be applying this to DWB project.

Datum model discussion

Ornulf walked the group through a couple of his RAIRD applications, particularly the work they are developing on InstanceVariable storage in their data store, and how to capture the provenance associated with a particular IV. He presented an overview of this work (https://docs.google.com/document/d/1x3iBpL1-WTSYELRY0elETeUpYsf_MPXk7_6z37maoc0/edit#). Ornulf's presentation was followed by various comments and notes:

  • In this example, provenance becomes an attribute of the InstanceVariable in the IV record layout
  • Jay noted the parallels to the OpenEHR approach in Ornulf’s example – but suggested it could provide some additional structure to the attributes.
  • Dan raised the issue of links to value domains from IVs – Ornulf commented that the “Description” field has now been updated to “ValueDomain”, and made this change to the Google Doc during the meeting
  • Some discussion occurred on the capacity of the DataPoint within this to store multiple responses (e.g. which newspapers did you read, sections of a telephone number) – i.e. representation of response arrays
  • There was also additional discussion of the capacity to store a summary score and its constituent parts (e.g. APGAR scores)
  • Ornulf pointed out his interest in the value of GSIM’s distinction of roles of an IV: as attribute, identifier or ?variable? (need to clarify this distinction)
  • In a relational database, this would result in a representation of a table for each variable – i.e. a Bazillion variables means a Bazillion tables
  • Larry pointed out the value in treating attributes as variables to reuse content with similar attributes

 There was a short discussion on the characteristics of DataPoint – is it a row in a table, or is it a cell in the table? The sense was that it could be both – this will depend on your focus. Jay made the point that this distinction is largely managed by DataStructure.

Steve then also quickly noted Chris Seymour's updated document mapping the proposed model to StatsNZ's MEP model. There appears to be a good mapping of the proposed model to the implemented model in this case, which was taken as a good indicator that the discussion and proposed model may be reaching maturity. It was also noted that the interfaces between the DataDescription view and other views (Conceptual and Instrument) are becoming clearer as a result.

Next steps 

The end of the meeting discussed next steps for the working group.

Firstly we considered possible content for the Minnesota sprint:

  • Non-quantitative data formats
  • Newer data formats (register data, process data, physical capture instruments, …)
  • Working on building datums into different structures
  • Complex structures (eg. Hierarchical record layouts, data stores, …)
  • The microdata – aggregate link

There is also a need to review the SimpleDataDescription view for the second DDI4 release.


The following suggested work plan was proposed for going forward with the working group:

  • Finalising our discussion on Datum
  • Finalising the SimpleDataDescription
  • Reviewing and clarifying the DataStructure section (from Ornulf/Chris/Justin in Dagstuhl – Jay and Dan have also discussed this)
  • Look at extensions to the Complex case for discussion in the Minneapolis sprint


Actions:

  • Wendy will also take the work from this group to the Modelling team to look at the interfaces between DataDescription and Conceptual
  • Updated papers: revisions of Jay and Dan’s paper for March, and of Ornulf’s use case, are expected. These may be ready for the next meeting


Next meeting: 

Thursday February 26th at 8.00PM Central European Time

GoToMeeting Link: https://global.gotomeeting.com/join/148887013


 Meeting 29 January 2015

Notes on SimpleDataDescription Group Meeting, Thursday 29/1/2015, 8.00PM CET

Attendees: Steve McEachern (Chair), Dan Gillman, Jay Greenfield, Larry Hoyle, Barry Radler, Ornulf Risnes, Chris Seymour, Achim Wackerow

To begin, a brief overview of the current state of work was provided by Steve and Dan (given Steve's absence at the previous meeting)

The meeting then considered a review of Dan and Jay's paper

Dan walked through the paper and questions and issues were raised throughout:

  • Question regarding ideas of "re-recording" (seen largely as "copying") and "data processing"
  • Ornulf noted that what seems to distinguish one Datum from another is usually the InstanceVariable. In many cases, Unit, Question, etc. remain the same - the data is simply copied
  • Barry gave the example of harmonization over time of the same RepresentedVariable - has similar characteristics

Jay also noted that Flavio and Dan have completed some parallel work on tidying up the Unit object(s) which will enable better association between Unit and *Variable.

Additional comments:

  • Jay commented that one question for any implementer will be "What information do I carry with my Datum?". We need to be able to model this.
  • Dan then walked through the various binary relationships in his model. (Note that the terminology needs to be harmonised with the objects in the Instrument view.)
  • Jay commented on the CollectionProcess-Observation relationship. Notably, there are some paradata that may want to be bound to the Observation at this point (e.g. characteristics of the Process, Observation mode). For example, in some cases there may be a need to record the order of events (e.g. in a health data capture protocol executed by a health professional).
  • Barry noted that there also may be no "Instrument" - but rather a process of data capture in some other way (eg. scraping tweets). Chris noted that StatsNZ creates "artificial instruments" to enable this form of capture. Jay noted that CDISC has a similar capacity for recording such "artificial instruments".
  • Dan identified that the major implication of this model is that there is no direct link between the Question and the Variable.
  • Larry questioned whether the final relationship - between InstanceVariable and Datum - needs refining. The question was asked whether copying a Datum creates a new Datum, and it was agreed that it does.

Following the review, as time had run out, it was agreed to continue the following work outside the meeting for consideration in 2 weeks:

  1. Dan and Jay to provide a new revision of the paper incorporating (a) edits suggested in the review, and (b) harmonisation of terms
  2. Ornulf and Chris would also revise their example papers to evaluate the new model Dan and Jay had provided. Ornulf had a second example he intended to apply. These examples may be used as additional material in the revised paper.

Next meeting: Thursday 12th February at 8.00pm Central European Time
(i.e. Thursday 12/Friday 13 February at the same time(s) for all)
GoToMeeting URL: https://global.gotomeeting.com/join/148887013

 29 January 2015 pre-reading - User examples

Chris Seymour and Ornulf Risnes have provided examples for the group to consider in regards to our discussion of the Datum-DataPoint-Observation objects.

Chris - MEP Example (MEP IDE - Example.docx) and Background information (MEP IDE.docx).

Ornulf - Helper variable example (DDI_Datum_Helper_variables_example.pdf)

This will be discussed at the group's Jan 29 meeting

 Meeting 15 January 2015

Present: Wendy, Dan, Jay, Chris, Jannik, Barry, Ørnulf

Absent: Steve, Justin, Achim, (Larry? He wasn't in the invitation list it seems)

 1. Overview of the new datum-datapoint-observation model for the Instrument and Physical Description teams (Dan Gillman)

  • Dan took us through this, with comments from Jay and others along the way. Many/most expressed confusion regarding the relationship between Datum/Observation and Represented/Instance variable.
  • Moreover, we briefly visited parts of the current DDI-4 model on a shared screen, and a small discussion on the relationship between Unit Type, Universe, Population and the variable cascade emerged before we decided to go back to the bottom-up approach we had started on.
  • The group found that the model laid out by Dan has qualities and potential, but that the Datum/Observation construction needs to be approached from other angles (via examples) to clarify the link to the variable cascade. The variable "dimension" (WHAT) is but one of many contextual dimensions potentially present for a Datum. Others include:
  • WHO, WHEN, WHERE, WHY and HOW
  • Jay commented that the HOW is perhaps the least developed dimension in the DDI, and wanted more focus on that. HOW here indicates how the datum was collected/measured/generated/came about.

 2. Overview of the data process model for the Instrument and Physical teams (Jay Greenfield)

  • Jay didn't go through this as such, but instead added general comments during Dan's walkthrough under point 1.

 3. Discussion of the intersection-Interaction between Instrument and Data Description (All)

  • The point was repeated that a new potential Datum-focused part of the model has to relate nicely with both Instrument and Data Description. The attitude in the meeting seemed to be that such relations will be possible, but that we need a deeper understanding of the Datum first - and possibly also more discussions on this within the Instrument groups.

 4. Next steps

  • There was an agreement that we needed examples before our next meeting, and Jay, Chris and Ørnulf agreed to come up with examples and circulate them.

 5. Next meeting

  • Group leader Steve was absent, so we didn't agree on a date for a next meeting. Wendy checked the Gotomeeting calendar, and confirmed that Thursdays 8PM CET (same time as this meeting) is available going forward. It is up to Steve to schedule the next meeting. (I would like to add that if 6AM is too early for you in Canberra, Steve, I think one hour later could also work for most others.)
 Meeting 8-9 December

Attendees: Jay Greenfield, Dan Gillman, Larry Hoyle, Achim Wackerow, Steve McEachern

The majority of this meeting involved two presentations by Dan Gillman and Jay Greenfield.

The first by Dan provided an overview of his updated proposal for modelling datum, datapoint and observation - available as a Powerpoint presentation. Key elements of this model were:

  • Element out of a value domain associated with a unit
  • Data - written down
  • Observation – before being written down; it makes the datum more real. An Observation is the intersection between a unit and a value domain.
  • DataPoint is the place where the observation is stored in the data structure (see the sketch below).
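
To make these relationships concrete, a minimal sketch follows, using hypothetical class and attribute names (illustrative only, not the agreed DDI4 objects):

    from dataclasses import dataclass

    # Hypothetical names for illustration only - not the agreed DDI4 classes.

    @dataclass
    class Observation:
        """The intersection of a unit and a value domain, before being written down."""
        unit: str       # the thing observed, e.g. "respondent 1017"
        value: str      # element of the value domain, e.g. "Married"

    @dataclass
    class DataPoint:
        """The place in a data structure where an observation is stored."""
        record: int     # row in the data set
        variable: str   # instance variable name, e.g. "mstat"

    @dataclass
    class Datum:
        """An observation once it has been written down at a data point."""
        observation: Observation
        data_point: DataPoint

    d = Datum(Observation("respondent 1017", "Married"),
              DataPoint(record=3, variable="mstat"))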

There was also some discussion of the inclusion of Time and Space within this model.

  • Time (and space) – can be part of concept of variable
  • Time and space – can be part of collection activity (register data)
  • Each observation can be recorded at a separate time

A question raised was what the Simple Instrument group calls an “observation” (a “measure”?). (The word observation denotes a row in SPSS, and it can also mean an activity.) It was recognised that there now needs to be a discussion with the Instrument group to look at the point of intersection between the Instrument model and the DataPoint/Datum objects here. This meeting needs to be set up in the New Year.

The second presentation by Jay provided an overview of a draft approach to a data processing pipeline that he had been working on with Wendy Thomas - available as a PNG file.

  • This pipeline includes a physical data description, plus a Formula attribute for a DataPoint (which Larry suggested might more generally be called an Algorithm)
  • Jay modelled the creation and use of an Excel spreadsheet – creating a processing pipeline that was based on a use case from the DDI 3.2 documentation.

It was noted that the data processing pipeline discussion needs to be held with the physical data description group (Ornulf, Chris, Justin and Achim) and Wendy Thomas.

Action Items:

  • To progress the above, it was agreed that there now needs to be further discussion with (a) those involved in the Physical Data Description and (b) the Simple Instrument group.
  • Steve will coordinate the organisation of a meeting in the New Year to facilitate this discussion.
  • The next meeting of the group is therefore yet to be determined.



 Meeting 18 November 2014

Attending: Dan Gillman, Jay Greenfield, Larry Hoyle, Steve McEachern, Achim Wackerow

The meeting commenced with a brief overview of the recent decisions of the Advisory Group, particularly the acceptance of the output achieved at Dagstuhl by the DataDescription group in the LogicalDataDescription and PhysicalDataDescription packages and associated views. It was noted that the Advisory Group recognised that the recently established discussion on Datum and DataPoint should continue. In part, this was to recognise the intersection between previous work and the structural relationship between files and tables. Larry noted that Datum brings the subject (measured entity) into the model more explicitly. The Advisory Group also indicated, however, that these were not central (at this time) to advancing the completed work from Dagstuhl to the next stage for review by the Modelling team.

The majority of the meeting then focussed primarily on a discussion of the recent changes to and demonstrations of the model. Jay began with an overview of some of the reorganisations he had made to the objects within particular packages, most notably:

  • Represented variables at the logical view (represented and instance) 

  • Datum moved from the logical to the physical view 

  • Confusion around unit type and universe – partly due to universe being treated as a synonym for population. As a result, Universe has been moved from the conceptual level to the instance level, and Unit type now links explicitly to the represented variable

This discussion then continued to explore the efficiency gains that are provided by the implementation of the new variable cascade approach. In particular, Jay and Dan presented an overview of a presentation Dan gave on Friday 14 November at Booz Allen regarding the new approach (slides available here). Of particular note were the efficiencies in the number of objects and value domains to be managed, due to the availability of the SentinelValueDomain. Along with the efficiencies highlighted by Dan in the presentation, it was also noted that there were potential additional gains:

  • in enabling additional SentinelValueDomains to be associated with each stage in the Data production lifecycle - which might then allow each group in a production process to manage and use their own SentinelValueDomain, making explicit what is already implicitly done in practice.
  • The value of the inheritance from Conceptual -> Represented -> InstanceVariable should make for superior comparisons (e.g. in pattern matches for Search)
  • Overall, the new approach appears to have fewer things to manage and more efficient searching.

It was suggested that the examples presented by Dan in the presentation might be further extended to a larger number of variables, as the gains appear to be even greater as the number of variables to manage increases.

Following this overview, there was a discussion of the means through which the new variable cascade might be introduced to enable transfer of existing DDI Codebook instances (and DDI-C users, such as Nesstar and IHSN users) to the new cascade. Steve reported a query raised by the Advisory Group: in particular, how might the new SimpleCodebook view use this? There was also the question of whether everyone would use the full cascade (especially those moving from DDI 2.x). The question was raised of whether we could take something out of the codebook and put it into this new representation.

Jay suggested that this may be possible through the use of a profile (projection?) which ensured all the attributes that the InstanceVariable inherits or references from the ConceptualVariable and RepresentedVariable are available at the InstanceVariable level. It was also recognised however that the emphasis should preferably be on demonstrating the efficiency gains from adopting the full cascade, particularly the reusability of both SubstantiveValueDomains (through inheritance) and SentinelValueDomains (through repeat use from previously implemented InstanceVariables).
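
One possible reading of Jay's profile/projection suggestion is sketched below with hypothetical names (an assumption about how such a projection might look, not a defined DDI4 mechanism): attributes inherited from the conceptual and represented levels are copied down so that a DDI-C style codebook entry can be produced from the InstanceVariable alone.

    # Illustrative sketch of a "flattening" projection; names are assumptions.
    def flatten_instance_variable(conceptual: dict, represented: dict, instance: dict) -> dict:
        """Project attributes inherited from higher levels of the cascade
        down onto the instance-variable level, DDI-C codebook style."""
        flat = {}
        flat.update(conceptual)   # e.g. concept, unit type
        flat.update(represented)  # e.g. label, substantive value domain
        flat.update(instance)     # e.g. physical data type, sentinel codes
        return flat

    codebook_entry = flatten_instance_variable(
        {"concept": "marital status", "unit_type": "person"},
        {"label": "Marital status", "categories": ["Single", "Married"]},
        {"name": "mstat", "datatype": "numeric", "missing": {"9": "Refused"}},
    )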

ACTION: Larry undertook to provide a couple of worked examples (extending the examples provided in Dan's BAH presentation) that demonstrate how both DDI-C and DDI-L 3.2 might be achieved in the DDI4 Variable approach. (Larry noted that DDI-L 3.2 already has the capacity for ManagedMissing value descriptions).

At the end of the meeting, Jay also walked through briefly (given time limitations) some minor changes he had made to the Physical Data Description package objects, in particular the move of DataPoint and Datum to this package. Jay noted that there were also additional objects introduced into this package, particularly Formula and ProcessingInstruction. Jay indicated that these two areas and their relationship to DataPoint and Datum needed more detailed discussion among the group.

It was also noted that Ornulf's email discussion had focussed on similar issues, particularly on the resolution of the meaning of Datum. Ornulf had not yet had the opportunity to forward his discussion paper on his thinking about Datum, but this was pending.

It was resolved therefore to focus the next meeting on a discussion of Datum (per Ornulf's paper) and Jay's extensions into the processing and Formula objects.

ACTION: Jay to work through an example of data collection and derived variable creation using the proposed Datum and Formula objects.

ACTION: Ornulf to finalise his discussion paper on Datum for distribution to the group.

The next meeting date and time is to be determined via Doodle poll, due to the clash with EDDI in 2 weeks, and the SimpleCodebook meeting (at the same time in 3 weeks). Steve will send out the poll for all to contribute their availability.

The meeting concluded at 3.05pm CET.


 Data Description View objects

The Views section of the Lion site is currently experiencing some bugs. This note is simply to record the contents of the two Data Description views currently in development (coming out of Dagstuhl).

PhysicalDataDescription

Completed in Lion at - http://lion.ddialliance.org/view/physicaldatadescription

Objects:

DataPoint
DataSet
DataSetStructure
DataSetSummary

LogicalDataDescription

To be completed in Lion (currently unable to save due to "Alias" bug - reported to Documentation team 13 Nov 2014) - http://lion.ddialliance.org/view/simpledatadescription

Objects:

CategorySet
CodeList
ConceptualDomain
ConceptualVariable
DataPoint
Datum
DescribedConceptualDomain
DescribedValueDomain
EnumeratedConceptualDomain
EnumeratedValueDomain
InstanceVariable
RepresentedVariable
SentinelValueDomain
SubstantiveValueDomain
Unit
UnitType
Universe
ValueDomain
 November 3 meeting minutes

Discussion via email over the last week(s) since the Dagstuhl sprint has been wide-ranging. Notes from the Dagstuhl workshop, and the activities of the Data Description group within that sprint, are available at the Dagstuhl Sprint site.

Discussion this evening continued to explore where the different issues are, without a clear resolution. There was however a definite sense that there will need to be continued discussion of the implications of the “Datum” model.

Several key points distilled from emails and explored in the meeting:
- The discussion started by noting that, in Dagstuhl – we haven’t got datum quite right. The question is: Do we want to open Pandora’s box?
- Dagstuhl model can’t represent the datum. The model just has datum there but doesn’t really model how it fits in.
- Strong belief among the meeting attendees (OR, LH, JG, SM) that the Datum discussion is useful and important for where data is heading (away from traditional “variables x cases” data matrix). Jay walked through the example of data mining, where the modelling of the set of Datums/Facts is not referencing the traditional matrix model in any real way. Data mining instead starts from facts and build constructs rather than starting from hypotheses. National Children’s study is an example – begin with a platform of information and use data mining.
- Jay suggested looking to the CDISC SDTM - Study Data Tabulation Model (http://www.cdisc.org/sdtm) section describing a rectangular structure in CDISC for ideas
- Need some description that is “useable” for the community (Wolfgang)
- However Jay noted that we need also to be able to deal with the "Reality that is changing beneath us" (e.g. RAIRD, NSA). RAIRD was noted as an excellent use case we can work against.
- Related to this, is the example of data “Reshaping” (used in Stata and R) – how do we model the behavior when you transform data without loss?
- One organising method for this may be the W5h molecule – to model around that (Who, what, when, where, why, and how); a small illustrative sketch of this idea follows this list
- We have identified certain issues with the InstanceVariable that may be better associated with the Datum, depending on whether you are bottom-up or top-down
- Top-down/Bottom-up approach has parallels with deductive, hypothesis-testing / inductive, hypothesis-generating distinction in epistemology of research.
- Remodelling may need to occur with the existing Data Description – particularly within the Datum/DataPoint/DataSetStructure section of the model (see for example Jay’s datasets v4 diagram)
- the W5h framework (introduced to the discussion by Dan Gillman - possibly from ISO19773??) appears to be important in our understanding of this as well, particularly for Unit Record files – Ornulf noted (in email) that a Variable is essentially the collection of a set of Datums on “What”, while a row (or a “Case”) is essentially a collection on “Who”. On reflection, I (Steve) think that there is probably a means for using the 5Ws to identify the Universe for a Variable (or Datum?) as well.
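
A rough sketch of the W5H idea follows, with hypothetical field names (illustrative only, not DDI4 objects): a Datum carries its contextual dimensions, and collections over those dimensions give familiar structures such as variables (same What) and cases (same Who).

    from dataclasses import dataclass
    from datetime import date

    # Illustrative only - the field names are assumptions, not DDI4 objects.
    @dataclass
    class DatumW5H:
        what: str    # the variable/concept measured, e.g. "marital status"
        who: str     # the unit, e.g. "respondent 1017"
        when: date   # reference time of the observation
        where: str   # spatial context, e.g. "Norway"
        why: str     # purpose/study, e.g. "example register extract"
        how: str     # how the datum came about, e.g. "copied from register"
        value: str   # the fact itself, e.g. "Married"

    # A variable is then the collection of datums sharing the same "what";
    # a row (or "case") is the collection sharing the same "who".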

Of particular note at the end of the meeting was that the Datum discussion may have implications for several groups:
A) the Grouping discussion in the Modelling Group (for bottom-up collection of a set of atomic units into larger groupings),
B) the Qualitative group in regard to organising of objects and facts - in that the inductive approach is not dissimilar to the data mining approach of computer science, and to inductive analysis in various qualitative methods. Another example is qualitative annotations on a cell in a table – essentially these are metadata on a datum. Annotations can become data – they can affect processing and combination.
C) the Data Transformations discussions – in that data transformations – often on subsections of data or even on specific Datums – take the form of “this fact was wrong – change this fact”.
D) the Methodology working group: there are strong parallels between top-down and bottom-up approaches and inductive/deductive methods. For example, Hal Varian (chief economist at Google) has argued that we don’t need hypotheses any more. Machine learning instead allows brute-force induction from the data - possibly yielding not only the best model but the best methodology. (It was also noted that the existing DDI model may essentially be a deductive model?) There is a need to reconsider methodology and study inception – deductive vs inductive.

For next steps, it was resolved to continue the email discussion, to be led primarily by Ornulf articulating some more specific use cases to work against, partly to clarify his own thinking as it has evolved through the email discussion, and then others to respond. Others should also feel free to put use cases forward. The next meeting will be convened in two weeks' time.

The other step is to present the discussion to the DDI-MF Advisory Group, noting particularly the implications of the Datum discussion and its generalisable aspects for the other working groups (Steve and Larry to raise). The intent is to seek clarification from the Advisory Group as to whom (and which group) to involve in the Datum discussion, and to determine for which group (if any) it is considered “in scope”.

 Pre-Dagstuhl additional notes

Update from Steve McEachern

Following the 6 October meeting, Jay and Dan provided a further revision to the Variable cascade model, shown below and attached as Version 12.

This revision is for discussion at the Dagstuhl workshop.

The additional aspect of this work also for discussion is the Physical Data Description, to be based on the recent PHDD release from Larry, Achim and Thomas.

The current version of this is available at: http://www.ddialliance.org/Specification/RDF/PHDD

An overview of the UML model for PHDD is available at that site. 



 Meeting 06 October 2014

Attendees: Steve McEachern, Dan Gillman, Larry Hoyle, Jay Greenfield

Meeting notes:

Meeting commenced 11.00pm (AEST)

Jay introduced the current state of the “changing data types” problem that he and Dan had been discussing - thinking for example of the date-time data types in different software packages (compare for example SAS and SPSS!!).

Dan continued this introduction to consider the extent to which we might be able to essentially “manage” the data types that we use within the DDI frameworks - based partly on the ISO 11404 framework. He used the example also of currency which is often managed as a real number - but in fact has the characteristics of an integer scale (to the second decimal point).

This ISO 11404 framework comprises three elements:

  • axioms

  • characterising computations

  • (need the third element from Dan)

Data types such as currencies have certain computations that cannot be made against them (e.g. rounding of the “remainders of cents” - mils?)

Larry makes the point that the particular data type applied to, for example, a date, is just a representation of the point in time we are referencing. It is simply that we are using different representations of the same point in time.

Dan continued with Larry's comment, to suggest that part of the solution might be to “take the representation” out. He suggests that our computations are actually of concepts rather than their representations - so we can still potentially manage our variables in this way.

So the discussion we may wish to consider is:

  1. We could conceivably apply the data types to the conceptual value domain.

  2. We may alternatively manage data types separately, and then attach them to variables at the appropriate time and place.

    1. This option would potentially assist in enabling us to manage our content (by reducing the volume of variables created when we simply change data types).

    2. Suggestion is that the intended data type be at the represented level, and the implemented data type at the instance level (see the sketch below).
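
A minimal sketch of option 2, with hypothetical class names (illustrative only, not the DDI4 objects themselves): the intended data type is carried at the represented level, while each instance variable records the data type a given platform actually implements.

    from dataclasses import dataclass

    # Hypothetical names for illustration - not the DDI4 objects themselves.
    @dataclass
    class RepresentedVariable:
        name: str
        intended_datatype: str    # e.g. "decimal with two digits" for currency

    @dataclass
    class InstanceVariable:
        represented: RepresentedVariable
        physical_datatype: str    # what the platform actually stores

    income = RepresentedVariable("household_income", "decimal, two digits")
    income_sas = InstanceVariable(income, "SAS numeric (8-byte float)")
    income_spss = InstanceVariable(income, "SPSS F10.2")
    # Only the physical data type changes per platform, so changing data types
    # does not multiply represented variables.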

Jay gives an example of the conflicts that have occurred within the date type within CDISC.

http://www.nesug.org/proceedings/nesug06/cc/cc17.pdf


There remained one outstanding item from the previous meeting: which version of the Sentinel value domain is preferred?

  • Suggestion here is that there be two domains: the Substantive domain at the Represented level and the Sentinel domain at the Instance level (need to make a revision of Jay’s Variable Cascade document).

  • Alternative was the Sentinel list and a map (or filter) - as per the Variable Cascade 09/22/2014 Figure 1.

Larry makes the point that the pooled “Sentinel Value Domain” is likely to have a combination of data types (eg. SAS characters, SPSS real numbers, etc.)

It looks therefore like the better solution will need to be to manage a set of Sentinel Value domains instead of the single domain.

Dan and Jay resolved to update the Variable Cascade to address this over the next week following the meeting, with discussion to occur via email. The next formal discussion of the group will occur at Dagstuhl.


Next steps

Resolution: Dan and Jay to provide a revision of the recommended Sentinel + Substantive Value domains structure for updating the Variable Cascade model.

Additional activities:

  • Group to continue to consider the problem of reconciling data types - with suggestions for discussion at Dagstuhl

Next Meeting: Dagstuhl Workshop 2014


 Meeting 22 September 2014

Attendees: Steve McEachern, Dan Gillman, Larry Hoyle, Jay Greenfield, Ornulf Risnes, Justin Lynch

Meeting notes:

Meeting commenced 10.10pm (AEST)

Dan provided an overview of the updated variable cascade model distributed by Jay on Sept. 20 - see A Variable Cascade 20140922.docx. He highlighted the key approach now adopted in the model, focussing on sentinel values, but also the key remaining issue:

  1. Whether the sentinel values should be managed as a separate domain similar to the substantive domain, OR

  2. available to be selected from a broad (unmanaged) list - possibly along the lines of the category set (i.e. a “master sentinel category set”)

Approach 1 produces a (potential) exponential growth in value domains if we manage each domain in turn - consider the example in use case three in Jay’s document.

Approach 2 uses a simpler mechanism by providing (basically) a single “one big list” - using a map to link the sentinel values and the codes used in the instance variable. This is simpler, but does not really allow for the management of the sentinel code list - which may be important to us.

Question raised - Is there value in managing the sentinel value domain in the same way that we would manage the substantive value domain (in the represented variable)?

Point raised by Larry - we may wish to manage common sentinel value domains (e.g. SAS missings, SPSS missings, etc.). In particular, this might be necessitated if different studies or software use different data types (e.g. SAS vs SPSS missing values, date formats). This tends towards Approach 1.


Next steps

Resolution: team members are to explore the cascade paper further in light of the discussion today and to (hopefully) identify their preferred option, to be brought to the next meeting.

Additional activities:

  • Dan and Jay to explore the problem of reconciling data types within the proposed cascade

  • Steve to review the Physical (PHDD) and Logical (Variable Cascade) models to assess the points of intersection of the two sides and highlight any outstanding issues for Dagstuhl, then review the overview status of the Simple Data Description package/library/view to determine its readiness for discussion at the Dagstuhl sprint.

Next Meeting: Monday October 6th, 2014, 1400 Central European Time


 Meeting 8 September

Attendees: Steve McEachern, Dan Gillman, Larry Hoyle, Jay Greenfield

Meeting notes:

To open, Steve reviewed the activity from the last meeting

The meeting primarily consisted of discussion of the "straw man" model of the conceptual/represented/instance variable model developed by Jay in collaboration with Dan. The visual representation of the model is represented below:

Jay's notes on the model are as follows:

Note that a “Conceptual Variable” here maps to the GSIM “Variable”. Also, note that an Instance Variable inherits its value domain from the Represented Variable that it takes its meaning from.

Dan Gillman has an example:

The conceptual variable marital status might be measured with two different sets of categories (in separate studies) as follows:

    1. Single, Married
    2. Single, Married, Widowed, Divorced 


These 2 categorizations result in 2 represented variables [mstat_simple, mstat_ex] in my mind. I was saying some people (outside the DDI community) want to say that even the conceptual variable has to change in this case. I think that makes little sense, and I hope everyone in our group agrees the conceptual variable does not change in situations such as this.

Continuing along these lines, represented variables like mstat_simple and mstat_ex may get sentinel values “along the processing cascade”. In this instance each new value set does NOT necessitate a new represented variable. Instead there may be multiple instance variables associated with one represented variable. In a process model we would reference one or more of these instance variables at different points along the processing cascade.

A Master Sentinel List (MSL) facilitates this arrangement. Again, quoting from Dan:

MSL should be structured so that categories are separated from designations (codes or other). The links are between the designations and the instance variables. It might go something like this:

IV -> MSL-codes <- MSL-categories, where the -> symbol indicates a one-to-many relationship, in the direction of the arrow. 


Thus, the MSL-codes structure resolves a many-to-many relationship between IVs and SVs, and the SVs are categories, not the designations. An IV uses possibly many SVs, and each SV may be used by possibly many IVs. 

There probably needs to be more discussion around the Conceptual Variable Unit Type and the Instance Variable Population. It would be neat and I would like to argue that the difference between Unit Type and Population is a function of sentinel values.
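
As one way of reading the IV -> MSL-codes <- MSL-categories arrangement, here is a minimal sketch using hypothetical Python classes (an illustration of the idea, not the modelled objects): sentinel categories are held once in the master list, codes tie those categories to particular instance variables, and each instance variable reuses its represented variable's substantive domain.

    from dataclasses import dataclass, field

    # Illustrative sketch only; the names are assumptions, not the DDI4 model.
    @dataclass
    class RepresentedVariable:
        name: str
        categories: list          # substantive value domain

    @dataclass
    class SentinelCategory:       # an entry in the Master Sentinel List
        label: str                # e.g. "Refused", "Don't know"

    @dataclass
    class SentinelCode:           # MSL-codes: ties an IV to a sentinel category
        code: str
        category: SentinelCategory

    @dataclass
    class InstanceVariable:
        name: str
        represented: RepresentedVariable            # inherits the substantive domain
        sentinel_codes: list = field(default_factory=list)

    refused = SentinelCategory("Refused")
    mstat_simple = RepresentedVariable("mstat_simple", ["Single", "Married"])
    iv_wave1 = InstanceVariable("mstat_w1", mstat_simple, [SentinelCode("-9", refused)])
    iv_wave2 = InstanceVariable("mstat_w2", mstat_simple, [SentinelCode("99", refused)])
    # One sentinel category (Refused) is reused by many IVs, each with its own code,
    # so new sentinel values do not force new represented variables.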


Discussion of the Variable model:

The discussion of the model was largely supportive of the model as presented, with agreement among the attendees regarding the basic conceptual/represented/instance variable distinction.

There was some discussion over the role of the "Master Sentinel Category Set" and Extension Code List, particularly with regard to respondent-driven responses such as "Refused" or "Don't know". Additional use cases are to be considered to explore this set of objects in more fine-grained detail - Dan and Jay will consider this further before the next meeting.

There was general agreement on the distinction between population and unit type - where unit type is the general unit being observed, and the population is that set of units within a given temporal and spatial context - eg. voters is the Unit Type, where voters enrolled to vote in Australia as at 1 January 2014 would be the Population.

There was some short discussion of the two data point and datum classes in the model. Dan identified that the two needed to be reversed - "Datum" is the appropriate class to link to the Instance Variable. "Data Point" was held to be a non-specific class that likely connects this package to others (potentially a Cell in a table or in a Physical Data Set). This class should be reconsidered at a later point when this work is integrated into the broader DDI4 model.

Next steps

At the end of the meeting, there were two further actions:

  • Jay and Dan will complete further work to finalise the "Master Sentinel Category Set" modelling
  • Steve will review the overview status of the Simple Data Description package/library/view to determine its readiness for discussion at the Dagstuhl sprint.

Next Meeting: Monday September 22nd, 1400 Central European Time


 Meeting 25 August

Attendees: Steve McEachern, Dan Gillman, Larry Hoyle, Jay Greenfield, Ornulf Risnes

Action items from 11 August:

  • Everyone to review PHDD
  • Everyone to review Dan's document
  • Achim to provide some information on the issues around the complexity of data description in DDI 3


Meeting notes:

To open, Steve reviewed the activity from the last meeting

Continuation of discussion of the distinction between the represented variable and instance variable.

The following is a summary of the various lines of discussion that occurred.

  • Where do we draw the line between represented and instance? Eg. Larry’s case of sentinel values.
  • Do we need to split the GSIM “Instance Variable” into a Logical Instance Variable and a Physical Instance Variable?
  • What do we want to view in the Instance Variable? Physical - Quasi-physical - Logical

Examples/use cases for consideration:

  • How do we manage missing values?
  • What do we do when data is managed in different systems – e.g. 32-bit vs 64-bit systems – which may not allow certain data formats (e.g. double format)

In the data management example – if the data type changes, both the instance and represented variables change. We may have a more complex case of conceptual/represented/instance than GSIM accounts for – characteristics may be changing at more than one level here, which makes reuse much more challenging.

What are alternative approaches here?

  • Ornulf pointed out that it may be possible to manage the variable by changing the represented and instance level, but maintaining the conceptual level. 
  • Dan’s concern was that the tying of the categories and the codes representing them can be done poorly (e.g. he noted this was the case in 11179). 
  • Jay noted that some of the harmonisation of longitudinal content can be achieved by thinking of some categories as concepts (e.g. certain missing categories are the same over time), and then merging/harmonising on those concepts over time. This may be a reflection of the represented variable.

Discussion centred around the issue that the core of the problem is ensuring that the reuse needs to be of concepts (e.g. conceptual variables or categories) rather than of codes. i.e. We need to ensure that we have semantic interoperability at the conceptual level, rather than necessarily at the representation level. Or in other words – need to clarify the relationship between the category and the code.

Representing cells or a "datum"

Continuing the discussion: Larry asked whether we may have a problem because we don’t have the notion of the representation of a cell (as opposed to a variable). There may be a need for representing an individual data point or datum (in the GSIM sense??) within a data file.

Next steps

At the end of the meeting, we noted that we now have two points of confusion to clarify:

  • Instance variable clarification
  • The need for datum as a class

The concern raised by Steve was that we have two important discussions, but need to find a way to “get out of the weeds”. To this end, two actions were proposed:

  • Jay will follow up with a “straw man” proposal around the instance/represented/conceptual framework to frame our next discussion. 
  • Larry will (time permitting) develop a similar idea for the “datum”.

Dan suggested increasing the frequency of meetings. For the next two weeks the other DDI meetings and US Labour Day make this difficult, but we will look at this possibility at our next meeting. We will also aim to continue to discuss out of session via email, with a summary to be posted to the wiki ahead of the next teleconference.

Next Meeting: Monday September 8th, 1400 Central European Time


 Meeting 11 August

Attendees: Larry, Steve, Achim, Dan, Jay

Action items from 30 July:

  • Everyone to review PHDD
  • Everyone to review Dan's document
  • Achim to provide some information on the issues around the complexity of data description in DDI 3
  • Thérèse will provide a box and arrow diagram for Dan's work
  • Steve will take an initial look at how the two fit together

Discussion of PHDD and SCOPE (Dan Gillman) documents

Larry and Achim gave an overview of the PHDD framework, outlining the original intent and the basic elements of the model, which focus on physical data descriptions. Dan then followed with a similar overview of the SCOPE draft model - which focuses on the logical data description. Dan noted that the focus of the SCOPE group resulted from coordination among U.S. statistical agencies - the “Statistical Community of Practice and Engagement” group (SCOPE) - which was intended to coordinate on metadata for agency activities, including the data.gov initiatives. It was particularly noted that data dictionaries were undefined, and this might form a particularly useful starting point for the SCOPE group - hence the proposed model.

What is in scope?

The group then continued on to consider the question of what should be in scope for a data description. In particular, we wanted to consider whether the focus should be on the Physical or the Logical - or potentially both. Steve provided a short overview of how he saw the intersection of the two models - see "Notes in advance of team meeting" below - noting particularly that the point of interaction appears to be at the variable: the physical representation of the variable within the data file, and the logical characteristics of that variable within the data description. i.e. Logical = what it means; Physical = how it is laid out.

There was general acceptance that the group should continue to consider both physical and logical at this time - although the two may become separate packages/views at some appropriate point later in the Moving Forward process.

Which variable do we mean?

Given that the intersection of the physical and logical was seen to be the variable, there was then extended discussion regarding the characterisation of "variable" within the model. (This is also something that has been discussed without resolution in the Simple Instrument and Conceptual teams).

The focus was particularly on the logical variable representation:

  • Jay asked about which variable are we talking about: Represented? Instance? 
  • Jay felt that the emphasis should be on the Logical at the intensional level (with an s) 
  • Logical level might have to have two parts to it ( represented and instance) 
  • Achim asked about where within the description we might represent the variable name – for example in different physical representations for the same study
  • A third level on the logical side is instance variable 
  • Other considerations were characteristics such as Unit type , sentinel values (name), and how the population may be different for unit type (time and space)

It was noted that there is some consideration needed of the equivalence within this discussion to the GSIM framework - which includes an instance variable.

Several use case examples were discussed.

Jay's use case: 

  • Recode of age collapse values – representation changes 
  • New represented variable (new value domain) and instance variable 
  • It was noted that the Idea behind instance variable is as a "variable in use" – i.e. variable in a file somewhere 
  • Question whether Data from one format to another format is this the same instance variable? 
  • Dan's position was that a copy of the data should be the same instance variable. This includes changing format. 
  • Jay argued for further specialisation of the instance variable to make it useful (e.g. by adding attributes to GSIM instance variable)

Continuation of instance variable discussion: Dan argued that the physical side should be purely a map to logical

Larry's use case

  • Copy from SPSS to SAS – must change sentinel values
  • keep .d = 99 -> don’t know, .r=999 -> refused
  • SAS/Stata = SPSS
  • Missing at represented level (categories) vs Missing at instance level with different codes.
  • Managed missing representation at the instance level (sketched below)
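
As a concrete reading of this use case, a small sketch follows (the codes are taken from the bullet points above; the mapping function itself is hypothetical): the same sentinel categories are re-expressed in each package's own codes when data are copied.

    # Illustrative only: the same sentinel categories expressed with
    # package-specific codes, using the values from the example above.
    SPSS_CODES = {"dont_know": 99, "refused": 999}     # SPSS user-missing values
    SAS_CODES = {"dont_know": ".d", "refused": ".r"}   # SAS special missing values

    def recode_spss_to_sas(value):
        """Map an SPSS user-missing value to its SAS special-missing equivalent."""
        for category, spss_code in SPSS_CODES.items():
            if value == spss_code:
                return SAS_CODES[category]
        return value  # substantive values pass through unchanged

    assert recode_spss_to_sas(99) == ".d"
    assert recode_spss_to_sas(2) == 2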

Other properties of variables:

  • Data type – physical(realized) or logical(envisioned)
  • Logical integer – physical number of bytes

Use case: Currency

  • real vs real with two digits of precision 
  • Cents are truncated, which differs from the rounding of reals (see the small example below).
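
A tiny arithmetic illustration of the point (not a model proposal): treating currency as a real number and rounding gives a different result from the truncation behaviour of an integer scale of cents.

    from decimal import Decimal, ROUND_HALF_UP, ROUND_DOWN

    amount = Decimal("10.016")  # dollars, with a fraction of a cent

    as_real = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)  # 10.02
    as_cents = amount.quantize(Decimal("0.01"), rounding=ROUND_DOWN)    # 10.01, remainder truncated

    print(as_real, as_cents)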

Given the number of examples and the increasing complexity of the discussion on instance variables, it was felt that some further articulation of a proposed approach would need to take place between the meetings. Dan Gillman agreed to provide a first cut of a possible bridge between the physical and logical. The group would then reconvene on August 25th to further the discussion.


Next meeting

The next regular meeting takes place on August 25th at 2pm CET.

 (Steve McEachern: Notes in advance of team meeting 11/8/2014)



The following is aiming to represent the relationship between several of the core elements across the DDI4 packages/views.

The final two columns are the likely relationships that exist between the PHDD and SCOPE (and their equivalents in other packages/views)

"Unit"          Conceptual                Questionnaire                 DataDictionary (Logical)             DataFile (Physical)
Basic unit      Concept                   Question (Capture??)          Variable (SCOPE)                     Column (PHDD)
Aggregate       ConceptScheme             Questionnaire (Instrument)    LogicalDataset (DISCO)               Table (PHDD)
Value domain    ConceptScheme (which??)   ValueDomain                   RepresentedValueDomain (source??)    Not Applicable??



 Logical description picture (derived from Dan's doc)

First attempt at creating a box and arrow diagram from Dan's document. I have made some things properties where it seemed appropriate. I had Alistair check to make sure it was not crazy. Feel free to modify it as you please.

 PHDD class diagram

 Meeting 30 July

Attendees: Larry, Steve, Achim, Dan, Thérèse

The group needed to nominate a new team leader. Steve agreed to do this.

After a break in meetings, the group needed to remind themselves of what was being achieved. The team is creating a view called Simple Data Description. A view is the subset of information that is important to a use case. Thérèse created the view during the meeting and added a random object to make the view appear (this object should be removed). See: http://lion.ddialliance.org/view/simpledatadescriptio

We expect that we will need to add some objects to the library, as not everything that is important for our view already exists. The new objects should be added to the package called New objects for Simple Data Description (http://lion.ddialliance.org/package/newobjectsforsimpledatadescription). Note: there are objects already existing in this package, presumably from previous work of this group. These objects should be reviewed!

The use case for Simple Data Description says:

Purpose: To develop a robust model that can describe all aspects of a simple, rectangular data file in our domain.
Description of view: The model must include bridges from the physical representation of a rectangular data file to high-level conceptual objects in the model.

There was some discussion about this. Are we only talking about physical (the layout of data in a file)? Is PHDD in scope? It was agreed that physical should be distinct from logical. The logical is often reused. PHDD has some links to high level conceptual objects - column in PHDD = Variable, rows = Data Records and Table = data file.

How do we know where to draw the line for simple data description? The simple group should create something that caters for the simplest use cases. The complex team that follows will extend this. Following this approach, we have something that everyone can understand and use quickly. It is important to have something for the simple use case. A criticism of DDI 3 is that it was too complex to use for those who just have a rectangular data file. The logical description in 3 was just too complex to easily understand.

The group does not need to start from scratch. There is PHDD (http://www.ddialliance.org/Specification/RDF/PHDD), DDI 3... Dan told us about a specification for data dictionaries that he has recently created with other US statistical agencies. This specification gives less than 20 objects that describe a data file at a basic logical level. See: Simple Data Description meeting minutes

The simple data description should include physical and logical. The group should use PHDD as a start for the physical and the work by Dan as a start for the logical. This would give someone a schema that would be fairly complete. We should then also look at DDI 3.

Action items:

  • Everyone to review PHDD
  • Everyone to review Dan's document
  • Achim to provide some information on the issues around the complexity of data description in DDI 3
  • Thérèse will provide a box and arrow diagram for Dan's work
  • Steve will take an initial look at how the two fit together

Next meeting

The next meeting takes place in the week starting 11 August. A poll will be circulated to find the best meeting time.

 March 17 meeting

2014-03-17 Meeting Minutes

Time:

15:00 CET


 

Meeting URL:

https://www3.gotomeeting.com/join/685990342 


 

Agenda:

1) Status update. Where are we now with SimpleDataDescription? (ØR)

 

2) Clarify relationship between domain experts and modeler. Define role responsibilities, desired workflow in group (ØR, AW?)

 

Domain expert adds object descriptions and relationships

Modeler puts them into the overall model

Then iteration


What is the status of round trip?

Drupal to xmi to EA? Yes.

Is there machine-actionable feedback into Drupal? No. It is possible, but some work is required. It is not yet clear whether there are resources for this task. Furthermore, there are different positions on whether the round trip makes sense.

 

3) Identified issues with the current version (ØR/all)

a) Model is sparse on properties for InstanceVariable, RepresentedVariable, ConceptualVariable. Out of scope for this group?

Comments: These objects currently exist only in the SimpleDataDescription package. Discussion about GSIM/DDI 3.2 and who is responsible for the “core variable objects”. 

b) Do we need DataSerialisation (the physical counterpart of DataDescription)? DataDescription already relates to InstanceVariable, which relates to Field (column) in the RectangularDataFile. Because of this, a path exists from the Fields in the RectangularDataFile via InstanceVariable up to DataDescription and “TOFKAS”

c) DataSerialisation has no relationship to RectangularDataFile. If we decide to keep DataSerialisation, surely the relationship to RectangularDataFile must be added.

 

4) TODO; Identify outstanding tasks (ØR/all)

  • Dan shares info on data.gov-Data dictionary

  • Dan shares a set of example data descriptions

  • Ørnulf pulls info from GSIM to produce candidate objects/properties for InstanceVariable, RepresentedVariables, ConceptualVariables

  • Larry shares findings/glossary for terms in extended attributes for SAS Enterprise Guide tool (below)

  • Ørnulf to suggest some “benchmark datasets” that can be used to document our work, and to “prove” that we are able to model a set of different data sets with our new model

  • Barry to flag potential issues from fieldwork with 3.2

    • Still a couple of months down the road

  • Ørnulf to harmonize minutes document and bring Larry’s notes in the right place

  • Ørnulf to try to arrange a meeting in April

  • Larry remembers to invite Ørnulf in case he’s needed for a virtual meeting during the NADDI sprint.


 5) Assign responsibilities for outstanding tasks (ØR/all)

See above.

 

6) Plan milestones (based upon TODO-list, goals and availability) (ØR/all)

Overall milestone plan/timelines to be clarified during NADDI sprint. Thérèse Lalor (ABS) is currently the project manager for DDI4 - but only until July 2014.


 

Other notes: