Section: Vision

Editor's Notes

This section is meant to illustrate three overall strategies (to be covered later):

  • Creating cross-cutting infrastructure (such as global search services or a subscription network)
  • Encouraging data providers to support standards that enable automated discovery and integration
  • Creating key aggregated datasets that will be important as a foundation for answering data questions

Further, this example is meant to illustrate some features of how these strategies can be applied, including:

  • The existence of a global data discovery service and the integration of that service into multiple, independently maintained portals
  • The support by multiple data providers for common export formats
  • The use of common metadata identification to allow data from different sources to be sensibly combined
  • The support of common standards for geolocation to enable mapping of data from different sources in a single (overlaid) mapping visualization
  • A subscription/publishing service for pushing machine-readable information to subscribers about the availability of new data


As we will see, we have quite a wealth of data resources today that can be mined for answers to difficult but timely questions about how infectious diseases affect the world.  However, as we have explained, the current state of those resources and the lack of tools for automating data discovery and integration make answering these questions in a timely way impossible.  Nevertheless, the resources we do have at hand give us a view of how our job could be easier.

...text describing some of the questions and analyses we want to be able to address...

We envision a future where all of the key data resources and their providers appear to us as a coordinated team ready to take on our question of the day.  For instance, as we become aware of an emerging disease outbreak, we might visit the Global Health Observatory (GHO) from the WHO to get an overview of the latest reported infections.  We wish to quickly explore a hypothesis regarding infection propagation and vectors.  In this world, the GHO would have links to other portals, such as the Gridded Population of the World and the Global Biodiversity Information Facility (GBIF), which we can access to refine our hypotheses.  As we discover the social factors that play a role in the spread of the disease, we will be able to access and integrate information about roads, schools, places of employment, and so on.  In particular, we have an idea of the kinds of observations or measurements we need, say, both the locations of schools and their populations.  Starting from the GHO site, we can trigger a query for the appropriate data available from anywhere in the world on these subjects, restricted geographically to the area around the emerging outbreak.
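
The global discovery service described in the Editor's Notes does not exist yet, so the following is only a rough sketch of the kind of request the GHO portal might issue on our behalf, written in R with the httr and jsonlite packages.  The search endpoint, its query parameters, and the coordinates are hypothetical placeholders.

    # Sketch only: the endpoint and its query parameters are invented stand-ins
    # for the envisioned global data discovery service.
    library(httr)
    library(jsonlite)

    # Illustrative coordinates for the area of the emerging outbreak
    outbreak_lat <- -1.3
    outbreak_lon <- 36.8

    resp <- GET(
      "https://search.globaldata.example.org/datasets",
      query = list(
        subject   = "schools",
        lat       = outbreak_lat,
        lon       = outbreak_lon,
        radius_km = 100,
        format    = "annotated-table"   # request the common export format
      )
    )

    hits <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    hits$results[, c("title", "provider", "download_url")]   # candidate datasets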

We might pause a moment in this story of an emerging outbreak to imagine how a university researcher might pose similar questions a year before the outbreak in an effort to predict its occurrence.  Perhaps she is browsing the GBIF portal, exploring populations of reptiles as a possible disease vector, and realizes that she needs data regarding roads, schools, or labor statistics.  From the GBIF website, she clicks a button that submits a query for such datasets available from anywhere in the world that overlap with the range of particular reptiles.  Indeed, there might be a dozen different data portals she could visit that can tap into a global data search engine to find data that correlates with the data provided by that portal.
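
Seen from the researcher's own R session, that button click might amount to something like the sketch below: pull occurrence records for a reptile species from GBIF with the existing rgbif package, take their bounding box, and hand it to the same hypothetical discovery service.  The species name and the search endpoint are illustrative only.

    library(rgbif)
    library(httr)
    library(jsonlite)

    # Occurrence records for an illustrative reptile species (placeholder name)
    occ <- occ_search(scientificName = "Varanus niloticus",
                      hasCoordinate  = TRUE,
                      limit          = 500)$data

    # Bounding box of the observed range
    range_bbox <- c(min(occ$decimalLongitude), min(occ$decimalLatitude),
                    max(occ$decimalLongitude), max(occ$decimalLatitude))

    # Ask the (hypothetical) global discovery service for datasets overlapping this range
    resp <- GET("https://search.globaldata.example.org/datasets",
                query = list(subject = "roads,schools,labor",
                             bbox    = paste(range_bbox, collapse = ",")))
    hits <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))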

But now, in our present, we have an actual outbreak to understand.  We have done some initial browsing and searching to the point of having a refined hypothesis about the infection propagation, and we have identified some candidate datasets we can leverage.  Through either refined searches or filtering of our search results, we are able to find a database that can tell us the locations of schools near the outbreaks, as well as a second data source that lists the schools' populations.  Both of these data sources can export their data in a common annotated table format, so we download these data directly into our R Studio environment on our local laptop, where we can combine the school locations and populations into a single table.  Our R environment already has a module for pulling the range data from GBIF, so from R Studio we create a table containing range information for our suspect reptile species.  Quick exploratory plots of the data show some interesting correlations, so now we need to see how this data relates to current infection incidents.
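
A minimal sketch of that integration step, assuming the two providers export the common annotated table format as CSV and share a school identifier column; the URLs, column names, and species are invented for illustration, and the GBIF pull again uses the rgbif package.

    library(rgbif)

    # Hypothetical annotated-table exports from the two data sources
    school_locs <- read.csv("https://schools.example.org/export/locations.csv")
    school_pops <- read.csv("https://education.example.net/export/enrollment.csv")

    # Common metadata identification: both tables carry the same school_id
    # column, so they can be combined directly
    schools <- merge(school_locs, school_pops, by = "school_id")

    # Range information for the suspect reptile species (placeholder name)
    reptile_range <- occ_search(scientificName = "Varanus niloticus",
                                hasCoordinate  = TRUE,
                                limit          = 1000)$data

    # Quick exploratory plot of school locations and reptile occurrences
    plot(schools$longitude, schools$latitude, pch = 16, col = "blue",
         xlab = "Longitude", ylab = "Latitude")
    points(reptile_range$decimalLongitude, reptile_range$decimalLatitude,
           pch = 1, col = "darkgreen")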

We return to the GHO, which provides visualizations of the reported infections overlaid on a Google map.  We upload our integrated dataset containing schools, populations, and reptile ranges, which allows us to map this information on top of the view of infection incidents.  From that view, a prediction of how the disease will progress jumps out.  We download the incident data into our local R analysis environment.
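
Because every source uses common geolocation standards, the same overlay can also be reproduced locally.  The sketch below uses the leaflet R package and continues the schools and reptile_range objects from the integration sketch above, with a hypothetical incident export from the GHO.

    library(leaflet)

    # Hypothetical incident export from the GHO, with standard latitude/longitude columns
    incidents <- read.csv("https://gho.example.org/export/incidents.csv")

    leaflet() %>%
      addTiles() %>%                                   # base map
      addCircleMarkers(data = incidents, lng = ~longitude, lat = ~latitude,
                       color = "red", radius = 4, group = "Incidents") %>%
      addCircleMarkers(data = schools, lng = ~longitude, lat = ~latitude,
                       color = "blue", radius = 3, group = "Schools") %>%
      addCircleMarkers(data = reptile_range,
                       lng = ~decimalLongitude, lat = ~decimalLatitude,
                       color = "darkgreen", radius = 3, group = "Reptile range") %>%
      addLayersControl(overlayGroups = c("Incidents", "Schools", "Reptile range"))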

To see if this prediction will bear out, we need to integrate new infection reports as they occur and use that information for planning the deployment of doctors and supplies.  Through the GHO web site, we sign up for email alerts about new data; at the same time, through our local R environment, we subscribe to receive the new data itself.  The next day, when we connect again to our R environment, the new incidents are automatically downloaded so that we can plot the progression.
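
As a final sketch, assuming the envisioned subscription service exposes a feed of records added since a given timestamp, the R-side subscription could be as simple as the following poll on startup.  The feed URL and its parameters are hypothetical, and incidents continues the mapping sketch above.

    library(httr)
    library(jsonlite)

    feed_url   <- "https://gho.example.org/api/incidents/updates"   # hypothetical feed
    last_check <- readRDS("last_check.rds")   # timestamp saved at the end of the last session

    # Fetch only the incident records reported since we last checked
    resp <- GET(feed_url,
                query = list(since = format(last_check, "%Y-%m-%dT%H:%M:%SZ")))
    new_incidents <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

    # Append the new records to the local incident table and remember this check
    incidents <- rbind(incidents, new_incidents)
    saveRDS(Sys.time(), "last_check.rds")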