Proof of concept and recommendations of thesaurus, vocabularies, ontologies for given case studies; Recommendations and examples for good practice and libraries of bridging / transformations between ontologies
NOTES: https://docs.google.com/document/d/1Xdj8lIYJa2PiVpNW2u4mii6ZxI69kAjeOhNVSb576VM/edit?usp=sharing
First PM Session, Tue 8 Oct
In which cells in the matrix are we working?
In relation to which stage in processing does this relate?
What level of guidelines are we working on?
Topic 4:
Proof of concept and recommendations of thesaurus, vocabularies, ontologies for given case studies; Recommendations and examples for good practice and libraries of bridging / transformations between ontologies
Guidelines for terminology repositories
Guidelines for terminology developers
Guidelines for selecting a terminology resource
First PM Session, Tue 8 Oct
In which cells in the matrix are we working?
- Provider of Codelists and Classifications, I1, I2 (cells F-12, F-13)
In relation to which stage in processing does this relate?
- discovery
- ETL
What level of guidelines are we working on?
- High-level
- Practitioner
- Technical
- Ideal output: plan for working in the next day and a half
- Handling variable levels of expertise
- How do we ensure that the right solutions get to the right
- Start the conversation - architecture diagram
- Framework based on the use cases and then generalise
- Infant mortality data - ways of crawling to get the data
- Three levels:
- Architecture diagram - will speak to the high level
- Descriptions on each component - second level
- Show some code on how it happens - third level
Low-level demos
- Building (or choosing and assessing) vocabularies / ontologies - demo how to spin up a high-fidelity, specific vocab that can be mapped out to generalised community resources
- Mapping between formats - webpage/pdf/csv
- Bridging / mapping ontologies - can we come up with common methods?
- Existing automated mappings: https://www.ebi.ac.uk/spot/oxo/
- Differentiate between expert-curated and automated mappings
- Use of the Evidence & Conclusions Ontology - it has terms for annotating whether a mapping was made manually or automatically: https://www.ebi.ac.uk/ols/ontologies/eco
- Filling gaps in terminologies when they are discovered during mapping
- Maintenance & versioning
- Provenance of classes and of object properties
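A minimal sketch of how a mapping could carry its own provenance, so that expert-curated and automated mappings stay distinguishable. The `Mapping` record, the local `example.org` IRI, the curator/tool names, the date and the choice of `skos:closeMatch` are all assumptions for illustration. The two evidence strings are meant to correspond to ECO class labels and should be checked against ECO before use.

```python
from dataclasses import dataclass
from datetime import date

# Intended to match ECO class labels; verify labels and IRIs in ECO before real use.
MANUAL = "manual assertion"
AUTOMATIC = "automatic assertion"


@dataclass(frozen=True)
class Mapping:
    source_iri: str     # term in the local, high-fidelity vocabulary
    target_iri: str     # term in the community resource
    relation: str       # e.g. a SKOS mapping property
    evidence: str       # MANUAL or AUTOMATIC
    asserted_by: str    # curator or tool that made the mapping
    asserted_on: date   # supports maintenance and versioning


def curated_only(mappings):
    """Keep only the mappings that an expert asserted by hand."""
    return [m for m in mappings if m.evidence == MANUAL]


LOCAL_TERM = "https://example.org/vocab/infant-mortality"  # hypothetical local IRI

mappings = [
    Mapping(LOCAL_TERM, "http://purl.obolibrary.org/obo/NCIT_C16729",
            "skos:closeMatch", MANUAL, "curator-1", date(2020, 1, 1)),
    Mapping(LOCAL_TERM, "http://purl.obolibrary.org/obo/OMIT_0008353",
            "skos:closeMatch", AUTOMATIC, "hypothetical-matcher", date(2020, 1, 1)),
]

for m in curated_only(mappings):
    print(m.target_iri, m.evidence)
```

Keeping the evidence type, author and date on every mapping means a consumer can filter by trust level, and a maintainer can see which mappings predate a new version of either terminology.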
Use cases
- Infant mortality
- Social exclusion / poverty
- Urban planning + sustainability + modelling
Data sources - different formats: Excel, TSV, etc
HTTP
ETL
Unknowns:
- How to transform data sources or represent them in high fidelity?
- What is the transformation process?
- What is the process to link up the data?
- What is the ultimate representation for a knowledge graph?
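One possible answer to the transformation questions above, as a rough sketch only: read a CSV source row by row and emit subject/predicate/object tuples that point at a shared term IRI. The column names, the `example.org` namespaces and the sample rows are placeholders (the values are not real statistics). The NCIT IRI is the one listed for "Infant mortality" in these notes.

```python
import csv
import io

# Placeholder rows for illustration only; these are not real statistics.
SAMPLE_CSV = """country,year,rate
AAA,2000,10.0
BBB,2000,20.5
"""

OBS = "https://example.org/obs/"      # hypothetical namespace for observations
PROP = "https://example.org/prop/"    # hypothetical namespace for properties
INFANT_MORTALITY = "http://purl.obolibrary.org/obo/NCIT_C16729"


def rows_to_triples(text, variable_iri):
    """Turn each CSV row into four (subject, predicate, object) tuples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"{OBS}{row['country']}-{row['year']}"
        triples += [
            (subject, PROP + "country", row["country"]),
            (subject, PROP + "year", int(row["year"])),
            (subject, PROP + "variable", variable_iri),
            (subject, PROP + "value", float(row["rate"])),
        ]
    return triples


triples = rows_to_triples(SAMPLE_CSV, INFANT_MORTALITY)
for triple in triples:
    print(triple)
```

The tuples could later be serialised to RDF with a library of choice. The sketch stops short of that to keep the transformation step itself visible.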
Portal/DataService/SPARQL endpoint as a data source
- How to maintain the link: nano-crediting a class with information about the portal vs adding information about the API rules/calls
Annotation resources
- Datasets / data files - formats, versions, topics, phenomena, time/space; provenance metadata (where it came from)
- Variables and parameters
- API rules/calls
- Provenance layer
Discussion about linking relational databases to linked data / terminology
- Conversion to triples (advanced) - beware lossiness in conversions, e.g. footnotes etc. are lost
- SQL tables linking elements (e.g. attributes) to IRIs (simple)
- Wrapping the relational database, e.g. with D2RQ: http://d2rq.org/
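The "simple" option above (SQL tables linking attributes to IRIs) might look roughly like this. The table and column names (`observation`, `imr`, `column_term`) are invented for the sketch, and an in-memory SQLite database stands in for whatever relational system is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The existing data table stays untouched.
    CREATE TABLE observation (country TEXT, year INTEGER, imr REAL);

    -- A side table records which term each column refers to.
    CREATE TABLE column_term (
        table_name  TEXT,
        column_name TEXT,
        term_iri    TEXT,
        PRIMARY KEY (table_name, column_name)
    );
""")
conn.execute(
    "INSERT INTO column_term VALUES (?, ?, ?)",
    ("observation", "imr", "http://purl.obolibrary.org/obo/NCIT_C16729"),
)


def term_for(table, column):
    """Look up the IRI annotated on a column, or None if there is none."""
    row = conn.execute(
        "SELECT term_iri FROM column_term"
        " WHERE table_name = ? AND column_name = ?",
        (table, column),
    ).fetchone()
    return row[0] if row else None


print(term_for("observation", "imr"))
print(term_for("observation", "year"))
```

Nothing is converted, so nothing is lost. The trade-off is that the semantics live beside the data rather than in it.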
General issues
- Understanding object properties across resources (e.g. from OBO federation OPs to other systems)
Data
https://data.unicef.org/country/deu/
https://data.oecd.org/healthstat/infant-mortality-rates.htm (CSV)
https://data.unicef.org/resources/dataset/child-mortality/ (XLS)
https://data.worldbank.org/indicator/SP.DYN.IMRT.IN (CSV, XML, XLS)
From the whiteboard:
- Infant Mortality
- Census
- Vital statistics
- DHS
- Other admin
- … region / infant mortality
LONDON examples
- https://knoema.com/atlas/United-Kingdom/London/Infant-mortality (can't find download)
- https://data.gov.uk/dataset/d8834899-63d8-42fe-9488-36afb7f3e806/focus-on-london-health (xls, pdf)
- DanB has just published the xls here:
- ...
Data cubes survey: https://colab.research.google.com/drive/1TCZKR7jL9whWkK5uCrQQsj2O5uq_ZZ9f#
Term | Search | IRIs |
Infant mortality | http://purl.obolibrary.org/obo/NCIT_C16729 http://purl.obolibrary.org/obo/OMIT_0008353 | |
Infant mortality rate | Infant mortality rate | |
Infant mortality | https://lov.linkeddata.es/dataset/lov/terms?q=infant+mortality | |
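The LOV search link in the table can be rebuilt for any term with a small helper. This is only a sketch: the function name is made up, and the URL pattern is inferred from the single example link above, not from LOV documentation.

```python
from urllib.parse import urlencode

# Base path taken from the search link recorded in the table above.
LOV_TERM_SEARCH = "https://lov.linkeddata.es/dataset/lov/terms"


def lov_search_url(term):
    """Build a LOV term-search URL; spaces are encoded as '+'."""
    return f"{LOV_TERM_SEARCH}?{urlencode({'q': term})}"


print(lov_search_url("infant mortality"))
```

For "infant mortality" this reproduces the link in the table. No request is made, so the sketch runs offline.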
Term requests
Guidelines for terminology repositories
Users of terminology repositories
Developers / maintainers of repositories
Guidelines for terminology developers
- Settle on some metadata conventions to add to terminologies such that portals etc can auto-populate metadata fields accurately with low overhead. Examples:
- OBO Foundry
- https://github.com/OBOFoundry/OBOFoundry.github.io/tree/master/ontology (index)
- https://github.com/OBOFoundry/OBOFoundry.github.io/edit/master/ontology/stato.md (example)
- https://github.com/OBOFoundry/OBOFoundry.github.io/edit/master/ontology/envo.md (example)
- Clement Jonquet’s approach: https://link.springer.com/article/10.1007/s13740-018-0091-5
- https://hal-lirmm.ccsd.cnrs.fr/lirmm-01605783/document
- LOV ontology descriptions https://lov.linkeddata.es/Recommendations_Vocabulary_Design.pdf
- DCTerms: https://lov.linkeddata.es/dataset/lov/vocabs/dcterms
- Vocabulary of a friend: https://lov.linkeddata.es/vocommons/voaf/v2.3/
- Ontology Metadata Vocabulary https://bioportal.bioontology.org/ontologies/OMV
- MIRO: Matentzoglu N, Malone J, Mungall C, Stevens R (2018) MIRO: guidelines for minimum information for the reporting of an ontology. J Biomed Sem 9:6 https://doi.org/10.1186/s13326-017-0172-7
- Dutta B, Nandini D, Kishore G (2015) MOD : metadata for ontology description and publication. In: International conference on Dublin core & metadata applications, DC’15, Sao Paulo, Brazil, pp 1–9 https://www.isibang.ac.in/~bisu/paper/MOD_CameraReady_ResearchGate_Version_Final.pdf
- Use continuous integration to check the consistency of your ontology
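A lightweight check of the kind a continuous integration job could run on every commit, sketched under assumptions: the list of required metadata fields is invented here and would need to follow whichever convention above is chosen, and the terminology is assumed to be already loaded into plain dictionaries. This is not a logical consistency check, which needs a reasoner; it only catches missing metadata and missing definitions.

```python
# Assumed field list; replace with the agreed metadata convention.
REQUIRED_FIELDS = ("title", "description", "license", "version", "contact")


def missing_metadata(metadata):
    """Return the required fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not str(metadata.get(f, "")).strip()]


def undefined_terms(terms):
    """Return the IRIs of terms that lack a natural-language definition."""
    return [iri for iri, term in terms.items()
            if not term.get("definition", "").strip()]


# Hypothetical vocabulary used only to exercise the checks.
metadata = {"title": "Demo vocabulary", "license": "CC0-1.0", "version": ""}
terms = {
    "https://example.org/vocab/a": {"label": "A", "definition": "A defined term."},
    "https://example.org/vocab/b": {"label": "B"},
}

problems = missing_metadata(metadata) + undefined_terms(terms)
print(problems)
```

In a CI job the script would exit with a non-zero status when `problems` is not empty, so that incomplete releases are flagged before portals try to harvest them.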
Guidelines for selecting a terminology resource
Technical focus. Break up the master list into user stories for varying degrees of expertise and different points in the workflow:
- parties publishing datasets
- parties parsing datasets into something useful that faithfully captures what the dataset says (e.g. using ddi, rdf, etc.)
- parties integrating extracted data into common integrative representations, e.g. knowledge graphs
Example:
Ten Simple Rules for Selecting a Bio-ontology
James Malone, Robert Stevens, Simon Jupp, Tom Hancocks, Helen Parkinson, Cath Brooksbank
Published: February 11, 2016 https://doi.org/10.1371/journal.pcbi.1004743
- Licensing
- Does the resource follow an open licence, e.g. CC-0? (can’t practically cite an IRI)
- Can you fork it and develop independently?
- Adoption
- Is the resource used effectively by several adopters (for their specific purposes)? (quality of usage over numerics)
- Is there a contribution policy?
- Interoperability
- Communities of interoperation - which one(s) do you need your resource to talk to?
- No attempt to “lock in” users
- Are they reaching outside their comfort zone? When there is no natural technical bridge, do they also consider the approach?
- Expressivity:
- Is the expressivity checked? (Using OWL is no guarantee of meaningful expressivity)
- How much machine-readable expressivity do you need?
- Do you need to future-proof? Your work may only need a vocab now (encode as SKOS), but do you plan to do more in the future? Start conversations along the semantic gradient if needed
- Maintainability
- How responsive are the maintainers of the ontology with term requests?
- What is the date of the last commit?
- How well documented?
- Are there example queries/competency questions (e.g. http://stato-ontology.org/)?
- Is there a term deprecation/obsolescence policy?
- Is it sustainable? E.g. sustained funding or plurality of developers
- Are there automated quality checks (e.g. continuous integration)?
- Governance / Editorial policies
- Are new editors welcomed / trained?
- Is the process open?
- Tooling available?
- Are there communities developing tools to use the terminology?
- Quality
- Are there natural language definitions for the terms?
- Are there axiomatic definitions?
- Coverage
- Does the ontology have the required terms? What are the gaps?
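The coverage question can be given a rough first pass in code: compare the terms a use case needs against the labels a candidate resource offers and list what is missing. The sketch assumes exact, case-insensitive label matching, which will miss synonyms; the label list is invented for illustration.

```python
def coverage_gaps(required_terms, ontology_labels):
    """Return the required terms with no case-insensitive label match."""
    known = {label.strip().lower() for label in ontology_labels}
    return [t for t in required_terms if t.strip().lower() not in known]


# Terms drawn from the use cases in these notes.
required = ["Infant mortality", "Infant mortality rate", "Social exclusion"]

# Hypothetical labels of a candidate resource.
labels = ["infant mortality", "Poverty"]

gaps = coverage_gaps(required, labels)
print(gaps)
```

The resulting gap list is a natural input for term requests to the maintainers.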
Caveats
- Be aware / ask about legacy issues (advanced)
Scope of the Guidelines
- The Problem Space – A general statement positioning the challenge in terms of FAIR and any other frameworks which would be approachable from a cross-domain perspective. A discussion of the issues identified to which the guidelines offer solutions, structured along the typology suggested above: technical, methodological, other…
- Domain Relevance – What domains are covered by the specific guideline?
- Stakeholders – a discussion of the intended audience(s) for the guidelines. Researchers? Data Managers? Funders/strategists? Systems implementers?
- Specifications/Standards/Technologies – a description and explanation for the selection of the relevant resources applied to the problem. Domain standards? Generic technologies?
- Methodological Considerations – a discussion of the methodological implications of the guideline. Are there best practices in a business sense which would change to help provide a solution?
- Proposal – A detailed description of the overall guideline being recommended, and its business justification
- Elaboration of the Use Case – a description of the concrete case(s) analyzed in the formulation of the guidelines/solutions
- Exemplary Data and Metadata – Concrete examples of the kind of data and metadata being discussed.
- Application of Standards/Specifications/Technologies - “Code” examples of how the approach being advocated can be realized, in each of the identified standards/technologies and data/metadata examples.