Metadata Provenance Task-Group

Use Cases

Rationale for new properties, as defined here: Vocabulary

Automatic Indexing

  • Distinguish automatically generated metadata statements from manually created ones.
  • Further qualify automatically generated statements.
  • Proposed properties: dcprov:creationType, dcprov:rank

OAI-PMH

This use case deals with the representation of OAI-PMH data in the DC-PROV model, especially the provenance-related information that may or may not be part of an OAI-PMH dataset.

The provenance features of OAI-PMH are briefly described here:

The following example illustrates an origin description in OAI-PMH. After each arrow, we indicate a possible mapping to DC vocabulary.

  • originDescription
    • harvestDate=“2002-02-08T08:55:46Z” altered=“true” → dc:modified
    • identifier = oai:odd.oa.org:z1x2y3 → dc:identifier
    • datestamp = 1999-08-07T06:05:04Z → dc:date
    • metadataNamespace = http://odd.oa.org/odd_fmt

A straightforward approach seems to be to implicitly create another description set for the original description. In this case, according to its definition, we can use dc:source ("A related resource from which the described resource is derived.").

According to OAI-PMH, the identifier identifies the record, not the described resource! That means that we use it as the URI for the description set. The contents of the description sets are arbitrary, i.e. we are not concerned with their representation in RDF. Interestingly, with this approach the provenance chain stays intact if every party provides information in that way, i.e. we have a quite natural fit between the OAI-PMH model and the DC-PROV model.
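The crosswalk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the URI of the harvested description set, and the choice of which description set each statement is attached to are ours, not part of the crosswalk. The provenance namespace is the one used by the OAI-PMH provenance schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH provenance container schema.
PROV_NS = "{http://www.openarchives.org/OAI/2.0/provenance}"

# The origin description from the example above (values copied from the text).
ORIGIN_XML = """
<originDescription xmlns="http://www.openarchives.org/OAI/2.0/provenance"
                   harvestDate="2002-02-08T08:55:46Z" altered="true">
  <identifier>oai:odd.oa.org:z1x2y3</identifier>
  <datestamp>1999-08-07T06:05:04Z</datestamp>
  <metadataNamespace>http://odd.oa.org/odd_fmt</metadataNamespace>
</originDescription>
"""


def map_origin_description(xml_text, harvested_set_uri):
    """Return (subject, property, value) statements for one originDescription.

    Assumption: the OAI identifier is used as the URI of the original
    description set, and the harvested description set points to it
    via dc:source.
    """
    root = ET.fromstring(xml_text)
    original_set_uri = root.findtext(PROV_NS + "identifier")
    datestamp = root.findtext(PROV_NS + "datestamp")
    statements = [
        (harvested_set_uri, "dc:source", original_set_uri),
        (original_set_uri, "dc:identifier", original_set_uri),
        (original_set_uri, "dc:date", datestamp),
    ]
    # Assumption: harvestDate is only recorded as dc:modified when the
    # harvester declares that it altered the record.
    if root.get("altered") == "true":
        statements.append(
            (harvested_set_uri, "dc:modified", root.get("harvestDate"))
        )
    return statements


if __name__ == "__main__":
    # "http://example.org/harvested/set1" is a made-up URI for illustration.
    for statement in map_origin_description(
        ORIGIN_XML, "http://example.org/harvested/set1"
    ):
        print(statement)
```

baseUrl and metadataNamespace are deliberately ignored here, in line with the second of the additional notes.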

The following graph illustrates the implementation in DC-PROV:

Additional Notes:

  1. The dotted owl:sameAs relationship is something that should hold, but it is not part of our crosswalk. A crosswalk for the actual metadata encapsulated in OAI-PMH would probably contain such a reference to the original identifier, if it is not reused anyway.
  2. Regarding the example dataset, the information about baseUrl and metadataNamespace is not represented in DC-PROV. If it were needed, it would be assigned to the description set. In linked data settings it is not necessary, as the information can be derived from the contents of the description sets.

OAI-ORE

OAI-ORE aims to provide a standard, machine-readable way of describing the constituents or the boundary of aggregations. Whereas OAI-PMH is metadata-centric, OAI-ORE is resource-centric!

Further information about OAI-ORE is available here.

from: http://www.openarchives.org/ore/1.0/primer#Nutshell

Use Case / Crosswalk from OAI-ORE to our Domain Model

The figure above shows an RDF Graph expressed by a Resource Map that includes metadata properties about Resource Map and Aggregation. Note that aspects of the graph already described are grayed-out to emphasize the concepts introduced by the figure.

from: http://www.openarchives.org/ore/1.0/datamodel#Metadata_about_the_ReM

The following graph illustrates the implementation in DC-PROV:

(The original) ReM-1 contains everything, i.e. every triple (except the triple ReM-1 - ore:describes - A-1). In our model we separated ReM-1 into two sets: Annotation Set and Description Set.

A-1 is an aggregation of 'something' with creator 'Y'. The whole aggregation is contained in the Description Set ('graph for itself'). A-1 is no metadata resource!

ReM-1 (Resource Map) was created by 'X' and essentially is the Description Set in our Domain Model. The Resource Map can contain metadata about itself and about the aggregation it describes. The Annotation Set consists of parts of the Resource Map (4 triples in our example).

In our model the content of the Description Set does not make sense, it does not include any additional information. ore:describes does not make sense in our model either.
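The separation of a Resource Map into the two sets can be sketched as a simple partition of its triples. The following Python fragment is a hedged illustration, not the crosswalk itself: the function name and the sample triples are hypothetical, and the rule "statements whose subject is the Resource Map go to the Annotation Set" is our reading of the description above.

```python
def split_resource_map(triples, rem_uri):
    """Partition the triples of a Resource Map into two sets.

    Assumption: statements about the Resource Map itself form the
    Annotation Set, everything else forms the Description Set, and the
    ore:describes triple is dropped because it has no counterpart in
    the domain model.
    """
    annotation_set = []
    description_set = []
    for subject, predicate, obj in triples:
        if subject == rem_uri and predicate == "ore:describes":
            continue  # not represented in the domain model
        if subject == rem_uri:
            annotation_set.append((subject, predicate, obj))
        else:
            description_set.append((subject, predicate, obj))
    return annotation_set, description_set


# Hypothetical sample data, loosely following the figure (ReM-1, A-1, X, Y).
REM_TRIPLES = [
    ("ReM-1", "ore:describes", "A-1"),
    ("ReM-1", "dcterms:creator", "X"),
    ("ReM-1", "dcterms:modified", "2008-10-03T07:30:34Z"),
    ("A-1", "dcterms:creator", "Y"),
    ("A-1", "ore:aggregates", "AR-1"),
]

if __name__ == "__main__":
    annotations, descriptions = split_resource_map(REM_TRIPLES, "ReM-1")
    print("Annotation Set:", annotations)
    print("Description Set:", descriptions)
```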

Old Use Case

The following example illustrates provenance as it is used in OAI-ORE. As one can see OAI-ORE already uses some dc vocabulary.

  <!-- About the Aggregation for the ArXiv document -->
  <rdf:Description rdf:about="http://arxiv.org/aggregation/astro-ph/0601007">
    <!-- The Resource is an ORE Aggregation -->
    <rdf:type rdf:resource="http://www.openarchives.org/ore/terms/Aggregation"/>
    <!-- The Aggregation aggregates ... -->
    <ore:aggregates rdf:resource="http://arxiv.org/abs/astro-ph/0601007"/>
    <ore:aggregates rdf:resource="http://arxiv.org/ps/astro-ph/0601007"/>
    <ore:aggregates rdf:resource="http://arxiv.org/pdf/astro-ph/0601007"/>
    <!-- Metadata about the Aggregation: title and authors -->
    <dc:title>Parametrization of K-essence and Its Kinetic Term</dc:title>
    <dcterms:creator rdf:parseType="Resource">
      <foaf:name>Hui Li</foaf:name>
      <foaf:mbox rdf:resource="mailto:lihui@somewhere.cn"/>
    </dcterms:creator>
    <dcterms:creator rdf:parseType="Resource">
      <foaf:name>Zong-Kuan Guo</foaf:name>
    </dcterms:creator>
    <dcterms:creator rdf:parseType="Resource">
      <foaf:name>Yuan-Zhong Zhang</foaf:name>
    </dcterms:creator>
  </rdf:Description>

  <!-- About the Resource Map (this RDF/XML document) that describes the Aggregation -->
  <rdf:Description rdf:about="http://arxiv.org/rem/atom/astro-ph/0601007">
    <!-- The Resource is an ORE Resource Map -->
    <!-- The Resource Map describes a specific Aggregation -->
    <ore:describes rdf:resource="http://arxiv.org/aggregation/astro-ph/0601007"/>
    <!-- Metadata about the Resource Map: datetimes, rights, and author -->
    <dcterms:modified>2008-10-03T07:30:34Z</dcterms:modified>  <!-- → dc:modified -->
    <dcterms:created>2008-10-01T18:30:02Z</dcterms:created>  <!-- → dc:created -->
    <dc:rights>This Resource Map is available under the Creative Commons Attribution-Noncommercial Generic license</dc:rights>
    <dcterms:rights rdf:resource="http://creativecommons.org/licenses/by-nc/2.5/rdf"/>  <!-- → dc:rights -->
    <dcterms:creator rdf:parseType="Resource">  <!-- → dc:creator -->
      <foaf:page rdf:resource="http://arxiv.org"/>
      <foaf:name>arXiv.org e-Print Repository</foaf:name>
    </dcterms:creator>
  </rdf:Description>

Pubby Example

  • Description: Pubby is a Linked Data frontend for SPARQL endpoints. It allows exploring and navigating the links of the endpoint, solves the 303 redirection problem and handles content negotiation. It also includes a metadata extension that adds metadata (provenance information) to the provided data. By default this metadata is described with the Provenance Vocabulary, but this can be configured in the metadata.ttl file.
  • Problem: If we try to publish metadata about provenance information, the current modeling can be quite confusing (is the publisher of the anonymous graph Anon_0 the publisher of the resource or the publisher of the metadata about the resource?). Modeling this example with our domain model might be a good way to illustrate its use in a real use case, as well as an alternative to the current representation.
  • Use case: The published provenance information is about the travel guides, images and videos of “El Viajero” (a section of the Spanish newspaper “El País”). There is also additional descriptive metadata about the guides, such as the length of the guide in words, the size of the picture, etc., but we leave it out of the scope of the example modeling. The endpoint can be found here. The model and its rationale are described below, selecting only a few provenance statements about a travel guide identifier.
  • Original Pubby modeling: The metadata section refers to the statements that appear in the Pubby file. The metadata is divided into anonymous graphs, some of them with the type rdf:Graph and some not (being concepts of the Provenance Vocabulary, such as DataCreation). By clicking on the “More” links we can unfold the subgraphs contained in the Anon_0 graph and explore their contents. A snapshot of the metadata section can be seen in the next figure.

  • Modeling with our domain model:

  • Rationale: The original provenance statements of the guide (date, creator and generation) have been grouped in DescriptionSet1, according to our domain model. As shown in the previous snapshot, the description set is “described” by the “creator”, “rights” and “date” properties, while the “performedBy” property, although it sits at the same level as the previous three, refers to the creation of the whole RDF serialization shown to the user (i.e. the creation of the Anon_0 graph). Therefore the “creator”, “rights” and “date” properties are grouped in annotationSet1 and the “performedBy” property in annotationSet2. The rest of the unfoldable fields in the graph have been modelled as description sets themselves, since they don't further describe the previous annotation sets. They are just subgraphs with common elements (like DataCreation1, DataCretionService1, etc.).
  • RDF:

The modelling using TriG Syntax and following the discussion of the last teleconf:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dcprov: <http://namespaceNotYetKnown/> .
@prefix opmo: <http://openprovenance.org/model/opmo#> .
@prefix prv: <http://purl.org/net/provenance/ns#> .
@prefix example: <http://example.org/data/> .
#default graph
	{ 
		#this naming would allow to automatically refer to the description set from the resource
		<http://example.org/data/guideIdentifier/prov> a dcprov:DescriptionSet .  
		<http://example.org/data/AnnotationSet/annSet1> a dcprov:AnnotationSet .
		<http://example.org/data/AnnotationSet/annSet2> a dcprov:AnnotationSet .
		<http://example.org/data/AnnotationSet/annSet1> dcprov:describes <http://example.org/data/guideIdentifier/prov> .
		<http://example.org/data/AnnotationSet/annSet2> dcprov:describes <http://example.org/data/AnnotationSet/annSet1> .
	}
#DescriptionSet1
<http://example.org/data/guideIdentifier/prov> 
	{ 
		example:guideIdentifier dc:date "2011-05-27"^^xsd:date .
		example:guideIdentifier dc:creator example:Paco_Nadal .
		example:WasGeneratedBy1123344 opmo:cause example:guideIdentifier .
	}
#AnnotationSet1
<http://example.org/data/AnnotationSet/annSet1> 
	{ 
		#the values of dc:creator and dc:date were swapped in the earlier version
		<http://example.org/data/guideIdentifier/prov> dc:creator example:Prisa_Digital .
		<http://example.org/data/guideIdentifier/prov> dc:date "2011-05-28"^^xsd:date .
		<http://example.org/data/guideIdentifier/prov> dc:publisher example:UPM .
	}
#AnnotationSet2
<http://example.org/data/AnnotationSet/annSet2> 
	{ 
		<http://example.org/data/AnnotationSet/annSet1> prv:dataCreation example:DataCreation1 .
	}
  • Concerns:
  1. The graph URI is made from the resource's URI, not via a direct assertion
  2. The necessity of an additional graph per resource → scalability problems?

Once the problem is solved, another interesting question is how to access the metadata provenance of a resource, given its URI. Should I see the graph relationship when accessing example:guideIdentifier? How do I know that a resource has provenance without asking for it explicitly? (How do I know that a triple belongs to a graph?)
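To make the two questions concrete, here is a minimal Python sketch over an in-memory list of quads. It is only an illustration under assumptions: the "/prov" suffix is the naming convention used in the TriG example, the function names are hypothetical, and a real deployment would ask a quad store or SPARQL endpoint instead of scanning a list.

```python
def provenance_graph_uri(resource_uri):
    """Derive the description-set graph URI from a resource URI.

    This relies purely on the naming convention (resource URI + "/prov");
    nothing in the data asserts the link, which is exactly concern 1.
    """
    return resource_uri.rstrip("/") + "/prov"


def has_provenance(quads, resource_uri):
    """True if at least one quad lives in the derived provenance graph."""
    graph = provenance_graph_uri(resource_uri)
    return any(quad[0] == graph for quad in quads)


def graphs_containing(quads, triple):
    """Return the sorted graph URIs in which the given triple is asserted."""
    return sorted({g for g, s, p, o in quads if (s, p, o) == triple})


GUIDE = "http://example.org/data/guideIdentifier"
ANN_SET_1 = "http://example.org/data/AnnotationSet/annSet1"

# Quads are (graph, subject, property, value); values follow the TriG example.
QUADS = [
    (GUIDE + "/prov", GUIDE, "dc:date", "2011-05-27"),
    (GUIDE + "/prov", GUIDE, "dc:creator", "http://example.org/data/Paco_Nadal"),
    (ANN_SET_1, GUIDE + "/prov", "dc:publisher", "http://example.org/data/UPM"),
]

if __name__ == "__main__":
    print(has_provenance(QUADS, GUIDE))
    print(graphs_containing(
        QUADS, (GUIDE, "dc:creator", "http://example.org/data/Paco_Nadal")
    ))
```

The sketch also shows the cost of the convention: a client that does not know the "/prov" rule has no way to discover the description set from the resource alone.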

OPMV

Use case inspired by the old use case, but not based on real data.

We start from Named Graph 1, which contains all the provenance statements about the RDF graph of school 1. We assume that Agent 123 was the one who created the metadata, while Agent 124 was the one who validated the process. The process started at one date and ended at another date, and as a result we obtain NamedGraph1 (generated at one specific hour).

Now, after some days, Agent 123 adds/modifies the previous set of statements in NamedGraph 1, resulting in another artifact (NamedGraph 2), derived from the previous one. Using OPMV, the representation of the example is as in this figure:

Original nodes (from the OPM specification):

  • Artifact: Artifacts are the representations of physical or digital objects, always as immutable pieces of state. Represented by the ovals in the figure.
  • Process: Action or series of actions that affect artifacts and produce new artifacts. Represented by the rectangular boxes in the figure.
  • Agent: Catalyst of the process, which controls and influences it. Represented by the hexagons in the figure.

Original edges:

  • WasControlledBy: A process is controlled by an agent.
  • WasGeneratedBy: An artifact is generated by a process
  • Used: A process uses an artifact
  • WasTriggeredBy: A process is triggered by another process.
  • WasDerivedFrom: An artifact is derived from another artifact.

Extended nodes (to model the example):

  • Metadata Creation Process: Specific process that creates metadata. Subclass of Process.

Extended edges (to model the example):

  • WasCreatedBy, WasValidatedBy, wasModifiedBy: subproperties of the edge wasControlledBy, in order to determine the role that the agent played in the process.
  • UsedReference: subproperty of the edge Used, to specify that the used information was used as a reference for the process.
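The node and edge lists above amount to a small typed graph model. As a hedged sketch (the dictionaries, the function names and the sample node identifiers are our own, chosen to mirror the NamedGraph example), the following Python code records the source and target type of each edge and checks a set of edges against them, resolving the extended edges and node types to the originals they specialise.

```python
# Source and target node type for each original edge, as listed above.
EDGE_SIGNATURES = {
    "used": ("Process", "Artifact"),
    "wasGeneratedBy": ("Artifact", "Process"),
    "wasControlledBy": ("Process", "Agent"),
    "wasTriggeredBy": ("Process", "Process"),
    "wasDerivedFrom": ("Artifact", "Artifact"),
}

# Extended edges and node types, mapped to what they specialise.
SUBPROPERTY_OF = {
    "wasCreatedBy": "wasControlledBy",
    "wasValidatedBy": "wasControlledBy",
    "wasModifiedBy": "wasControlledBy",
    "usedReference": "used",
}
SUBCLASS_OF = {"MetadataCreationProcess": "Process"}


def invalid_edges(node_types, edges):
    """Return the edges whose endpoints do not match the edge signature."""
    bad = []
    for source, edge, target in edges:
        base_edge = SUBPROPERTY_OF.get(edge, edge)
        domain, range_ = EDGE_SIGNATURES[base_edge]
        source_type = SUBCLASS_OF.get(node_types[source], node_types[source])
        target_type = SUBCLASS_OF.get(node_types[target], node_types[target])
        if source_type != domain or target_type != range_:
            bad.append((source, edge, target))
    return bad


# Hypothetical identifiers for the example with two named graphs.
NODE_TYPES = {
    "NamedGraph1": "Artifact",
    "NamedGraph2": "Artifact",
    "creation1": "MetadataCreationProcess",
    "modification1": "MetadataCreationProcess",
    "Agent123": "Agent",
    "Agent124": "Agent",
}
EDGES = [
    ("NamedGraph1", "wasGeneratedBy", "creation1"),
    ("creation1", "wasCreatedBy", "Agent123"),
    ("creation1", "wasValidatedBy", "Agent124"),
    ("modification1", "usedReference", "NamedGraph1"),
    ("modification1", "wasModifiedBy", "Agent123"),
    ("NamedGraph2", "wasGeneratedBy", "modification1"),
    ("NamedGraph2", "wasDerivedFrom", "NamedGraph1"),
]

if __name__ == "__main__":
    print(invalid_edges(NODE_TYPES, EDGES))
```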

Using our domain model, the representation is as in the next figure. It is similar to OPMV, but much simpler:

Advantages/disadvantages of each model:

  • OPMV better represents the process and the evolution of the metadata (how it was created, the roles of the agents which participated, etc.).
  • OPMV is more complex, but it allows asserting more information.
  • Our domain model provides a very simple approach to the problem and deals with it, but loses some information.
  • Our domain model is easier to use and model than OPMV.

How to Translate the source model into our model

  • First, separate the provenance from the provenance metadata at 2 (or more) different levels. In this case it is clear because the source model already does it.
  • Group the provenance statements in a Description Set, and the metadata provenance in Annotation Sets.
  • The translation of the edges to dc properties is not completely clear, since we may not find all the equivalences and may have to extend the dcterms vocabulary. For this use case I would recommend following the mapping proposed at the W3C Incubator (http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings). It would imply removing the processes and losing some of the information about them, but it would gain a lot of simplicity. In the second figure we may have to extend the property contributor by adding one called “validator” if considered necessary.
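The last step, removing the processes and attaching the agents directly to the artifacts, can be sketched as follows. This is a rough illustration only: the mapping table is a hypothetical choice made for this example and is not taken from the W3C Incubator mapping, and "dcprov:validator" stands in for the possible “validator” extension mentioned above.

```python
# Hypothetical mapping from role edges to DC-style properties.
ROLE_TO_DC = {
    "wasCreatedBy": "dc:creator",
    "wasModifiedBy": "dc:contributor",
    "wasValidatedBy": "dcprov:validator",  # assumed extension, not an existing term
}


def collapse_processes(edges):
    """Translate OPMV-style edges into flat statements about artifacts.

    Processes disappear: each agent that controlled a process is attached
    directly to the artifact the process generated. Derivation between
    artifacts is kept as dc:source. All other edges are dropped, which is
    where information about the processes is lost.
    """
    roles_by_process = {}
    for source, edge, target in edges:
        if edge in ROLE_TO_DC:
            roles_by_process.setdefault(source, []).append(
                (ROLE_TO_DC[edge], target)
            )

    statements = []
    for source, edge, target in edges:
        if edge == "wasGeneratedBy":
            for prop, agent in roles_by_process.get(target, []):
                statements.append((source, prop, agent))
        elif edge == "wasDerivedFrom":
            statements.append((source, "dc:source", target))
    return statements


# Hypothetical identifiers following the NamedGraph example.
OPMV_EDGES = [
    ("NamedGraph1", "wasGeneratedBy", "creation1"),
    ("creation1", "wasCreatedBy", "Agent123"),
    ("creation1", "wasValidatedBy", "Agent124"),
    ("NamedGraph2", "wasDerivedFrom", "NamedGraph1"),
]

if __name__ == "__main__":
    for statement in collapse_processes(OPMV_EDGES):
        print(statement)
```

The resulting statements would then be placed in an Annotation Set, since they describe the named graphs rather than the school data.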

Old use case (expanding an example from the OPMV guide; too simple)

There are some real-world use cases in the OPMV Guide (although it is still a draft). The available use cases are:

  • Data transformation by querying database
  • Updates to the RDF graph
  • Time related to data
  • Time related to data transformation

We are going to focus on the first example, expanding it to treat the metadata provenance from 2 perspectives: OPMV and our domain model.

Edubase (register of all educational establishments in England and Wales) starts publishing its data straight from its database, one page per school. The RDF for a school is generated on demand from the database by some code that formats the result of a query on the database as RDF/XML. The generated provenance graph is described below (extracted from the OPMV Guide):

  eg:school1
      rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
      rdf:type opmv:Artifact, prv:DataItem ;
      opmv:wasDerivedFrom _:queryResult ;
      opmv:wasGeneratedBy [
          rdf:type opmv:Process ;
          opmv:used _:queryResult ;
          opmv:wasPerformedBy _:netcode ;    ### sub-property of opmv:wasControlledBy
          opmv:wasControlledBy <http://www.jenitennison.com/#me>
      ]
  .

  _:queryResult rdf:type opmv:Artifact ;
      opmv:wasGeneratedBy [
          rdf:type opmv:Process ;
          opmv:used <http://example.edu/edubase> ;
          opmv:used _:query
      ]
  .

  _:netcode rdf:type opmv:Agent ;
      rdfs:label ".NET code that formats the result of a SQL query on the database as RDF/XML"
  .

  <http://example.edu/edubase> rdf:type opmv:Artifact, opmvTypes:SQLDatabase ;
      rdfs:label "Edubase: the database about schools and education." .

  _:query rdf:type opmv:Artifact, prvTypes:SQLQuery ;
      rdfs:comment "select * from schools where ***" .

In RDF we can see how the provenance from the RDF graph of the school (eg:school1) has been modeled. It has been derived from a query result, and generated by a process controlled by <http://www.jenitennison.com/#me>.
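To illustrate how such a provenance graph can be read, here is a small Python sketch that follows the wasGeneratedBy and used edges backwards from eg:school1. It is an assumption-laden illustration: the triples are transcribed by hand from the Turtle above, the labels _:p1 and _:p2 are made-up names for the two anonymous processes, and the function name is hypothetical.

```python
# Hand-transcribed subset of the graph; _:p1 and _:p2 name the two
# anonymous opmv:Process nodes and are not part of the original data.
TRIPLES = [
    ("eg:school1", "opmv:wasGeneratedBy", "_:p1"),
    ("_:p1", "opmv:used", "_:queryResult"),
    ("_:p1", "opmv:wasControlledBy", "<http://www.jenitennison.com/#me>"),
    ("_:queryResult", "opmv:wasGeneratedBy", "_:p2"),
    ("_:p2", "opmv:used", "<http://example.edu/edubase>"),
    ("_:p2", "opmv:used", "_:query"),
]


def upstream_artifacts(artifact, triples):
    """Collect every artifact that directly or indirectly fed into `artifact`.

    For each artifact we look up the processes that generated it and then
    the artifacts those processes used, repeating until nothing new is found.
    """
    found = []
    pending = [artifact]
    while pending:
        current = pending.pop()
        processes = [
            o for s, p, o in triples
            if s == current and p == "opmv:wasGeneratedBy"
        ]
        for process in processes:
            for s, p, o in triples:
                if s == process and p == "opmv:used" and o not in found:
                    found.append(o)
                    pending.append(o)
    return found


if __name__ == "__main__":
    for source in upstream_artifacts("eg:school1", TRIPLES):
        print(source)
```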

Expanding the use case example

Now let's assume we want to assert who the creator of the provenance information is (this can be taken as a reference for other properties too).

In OPMV, it would be modeled as in the figure below:

The example detailed in the previous code can be seen in named graph 1. Artifacts are represented as ovals, processes as boxes and agents as hexagons.

The meta-level is represented in named graph 2. We had to extend some of OPMV's properties (in this example, we extended the opmv:wasGeneratedBy property, creating the opmv:metadataGeneratedBy subproperty). We did not reuse opmv:wasGeneratedBy, to avoid confusion with the normal provenance. The named graph is necessary for asserting more levels of metadata provenance. (The named graphs are not declared in the RDF code. However, according to the OPMV specification, this is how it would be done. OPMO would be modeled slightly differently, using opmo:account.)

With our domain model, it would be modeled as in the figure below:

In this case, instead of the named graphs we would use the description set for grouping the previous provenance statements, and the annotation set to describe the meta-level. Each of the statements in the annotation set would also be a dcprov:annotation.

use_cases.txt · Last modified: 2011/06/22 16:04 by daniel