New Term - parentOccurrenceID #413

Open
fja062 opened this issue Jun 3, 2022 · 15 comments

Comments

@fja062

fja062 commented Jun 3, 2022

New term

  • Submitter: Anne-Sophie Archambeau, Guillaume Body, Francesca Jaroszynska, Sophie Pamerlon
  • Efficacy Justification (why is this term necessary?): Occurrence records often contain partial data (i.e. different splits of the same data), most especially when the main occurrence record corresponds to a group of individuals. For example, a single occurrence of individuals may include multiple combinations of age and sex. A hierarchical occurrence structure is necessary to represent this complexity by allowing complete or partial child information to be provided for any parent occurrence, thus providing further precision on the main occurrence and avoiding losing information (for example for precision variables such as age and sex).
  • Demand Justification (name at least two organizations that independently need this term): European Food Safety Authority (Enetwild project), French Biodiversity Agency, potentially OBIS
  • Stability Justification (what concerns are there that this might affect existing implementations?): Could replace some instances where resourceRelationship is currently implemented
  • Implications for dwciri: namespace (does this change affect a dwciri term version)?: None

Proposed attributes of the new term:

  • Term name (in lowerCamelCase for properties, UpperCamelCase for classes): parentOccurrenceID
  • Organized in Class (e.g., Occurrence, Event, Location, Taxon): Occurrence
  • Definition of the term (normative): An identifier for the broader Occurrence that groups this and potentially other Occurrences
  • Usage comments (recommendations regarding content, etc., not normative): Use a globally unique identifier for a dwc:Occurrence or an identifier for a dwc:Occurrence that is specific to the data set.
  • Examples (not normative): Occ1-1-1-1; Occ1-4-26-1
  • Refines (identifier of the broader term this term refines; normative): individualCount | occurrenceStatus
  • Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative): None
  • ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative):
@tucotuco
Member

tucotuco commented Jun 3, 2022

Thank you @fja062 for this submission. I have added labels and feel the demand justification has been met. Now we'll have to figure out if this is the best solution. Would you be willing to submit one or more use cases highlighting why you arrived at this solution? It will be interesting to see if instead what you need can be modeled with the Events and the parentEventID, for example.

@deepreef

deepreef commented Jun 3, 2022

We have dealt with this before in our data modelling and implementation. Our conclusion was that it mostly applies in cases where an instance of Organism has individual constituents (e.g. a wolf pack or whale pod or school of fish or something has individual organism constituents). Thus, we ended up implementing a parentOrganismID property to track hierarchical relationships among organisms. As such, only the top-level Organism instance in the hierarchy would need to participate in a single Occurrence instance, and all the child Organism instances would inherit the Occurrence participation.

But this has limitations. First of all, you can't always assume that all constituents of a compound Organism instance participated in a particular Occurrence (e.g., if one or more members of a particular wolf pack were absent during a particular documented Occurrence). Second, there are properties of Occurrence (e.g., sex, lifeStage, behavior, reproductiveCondition, etc.) that are particular to individual members of a compound Organism instance. Our solution to this was simply to mint additional Occurrence instances for each member of a compound Organism instance, and apply the properties to the individuals accordingly.

In that context, these are not so much parent-child Occurrence relationships, but rather a set of related/parallel Occurrence instances, that happen to share the same Location and Time, and/or involve Organism instances that are collectively part of a broader-scope Organism instance.

I'm not opposed to adding this term, but I agree with @tucotuco that it would need to be fleshed out with specific use cases to clarify when people would use this term, and clarify what the "parent" Occurrence instance is, and what the "child"/"children" instances are (and how many "generations" are allowable).

It also needs to tease out the implications for Organisms as participants in Occurrence instances, vs. MaterialSample instances participating directly in Occurrence instances. I can elaborate more on what I mean by that, but this post is already too long so will refrain unless asked.

@dr-shorthair

I really don't like this use of the word 'parent'. It is a metaphor in a context where the original usage is not alien, thus prone to confusion or misuse sooner or later.

Furthermore, 'collection of which this individual is a member' and 'structure of which this individual is an element' are rather different as well.

I would recommend carefully distinguishing the different cases that you appear to be trying to aggregate here.

@fja062
Author

fja062 commented Jun 7, 2022

The term that we propose applies to instantaneous, or snapshot, occurrences of an aggregation of individuals and can therefore be used in situations where the ID of a group is unknown. For our proposition, the term ‘group’ recorded in individualCount refers to a temporal aggregation of individuals, and not a social group.

One such use case is hunting bag data, where the total number of individuals hunted throughout a hunting season corresponds to the individualCount of a parent Occurrence. Information that may be partially available on sex and lifeStage may be detailed in nested child Occurrences.

Imagine a hunting season where 30 individuals are shot, and we know that 10 are female, 10 are male, and 3 of the females are adults. This partial information of the Occurrence demographics is hard to store in the main Occurrence, but can be described cleanly in child Occurrences.
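As a rough illustration of the hunting-bag case, the sketch below stores the season total as a parent record and the partial sex and lifeStage information as nested children. The identifiers and the two-level nesting (adult females under females) are assumptions made for the example, not part of the proposal.

```python
# Illustrative records; identifiers and the two-level nesting are assumptions.
occurrences = [
    {"occurrenceID": "occ-1", "parentOccurrenceID": None, "individualCount": 30},
    {"occurrenceID": "occ-1-1", "parentOccurrenceID": "occ-1",
     "sex": "male", "individualCount": 10},
    {"occurrenceID": "occ-1-2", "parentOccurrenceID": "occ-1",
     "sex": "female", "individualCount": 10},
    {"occurrenceID": "occ-1-2-1", "parentOccurrenceID": "occ-1-2",
     "sex": "female", "lifeStage": "adult", "individualCount": 3},
]


def children_of(records, parent_id):
    """Return the direct children of the occurrence with the given ID."""
    return [r for r in records if r["parentOccurrenceID"] == parent_id]


def undescribed_count(records, parent_id):
    """Individuals in the parent that no direct child accounts for."""
    parent = next(r for r in records if r["occurrenceID"] == parent_id)
    described = sum(c["individualCount"] for c in children_of(records, parent_id))
    return parent["individualCount"] - described


unsexed = undescribed_count(occurrences, "occ-1")            # 30 - (10 + 10)
unaged_females = undescribed_count(occurrences, "occ-1-2")   # 10 - 3
print(unsexed, unaged_females)
```

The parent keeps the full total of 30 while the children only describe what is actually known, so the 10 unsexed individuals and the 7 females of unknown life stage remain implicit rather than being invented.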

Another use case is when at least some individuals observed in the individualCount of an Occurrence contain individual ID tags. For example, a group of Ibex were observed in France, and 4 of the 7 individuals were tagged. We can record this partial information in nested child Occurrences of the parent Occurrence ID.

In the example provided by @deepreef, the ‘group’ refers to a social group, which requires prior knowledge on the number of members etc. contained in the social group. By making the individualCount more generic, irrespective of the genetic and social relationship between the individuals observed, the parent Occurrence refers to the total number of individuals observed (and not necessarily the proportion of the social group observed).

This structuring of child-parent Occurrences is therefore very general, and applicable to any group occurrence where partial information is available. This will be particularly useful for snapshot occurrences such as camera trap or hunting bag data.
The use of ‘parent’ would be identical grammatically to that of ‘parentEventID’, i.e. ‘An identifier for the broader Event that groups this and potentially other Events’. It would also align with the grammar for the proposed term ‘parentMeasurementID’ #362.

@albenson-usgs

albenson-usgs commented Jun 7, 2022

I'll attempt to describe a use case I see for this. A regular survey cruise is conducted every spring. It uses a trawl method. On a particular day at a particular location 614 Hippoglossoides platessoides are caught. It is not possible to weigh that many fish, so a subsample is weighed (5.5 kg) and a weight of 27 kg is calculated for all 614 fish. Then 22 of those fish are measured for length, with four being 14 cm, one being 15 cm, two being 17 cm, and so on. I don't see a way to represent this well in an occurrence table.

| eventID | occurrenceID | scientificName | individualCount | measurementType | measurementValue | measurementUnit |
| --- | --- | --- | --- | --- | --- | --- |
| event_1 | occ_1 | Hippoglossoides platessoides | 614 | weight | 27 | kg |
| event_1 | occ_2 | Hippoglossoides platessoides | | subsample weight | 5.5 | kg |
| event_1 | occ_3 | Hippoglossoides platessoides | 4 | length | 14 | cm |
| event_1 | occ_4 | Hippoglossoides platessoides | 1 | length | 15 | cm |
| event_1 | occ_5 | Hippoglossoides platessoides | 2 | length | 17 | cm |

To me these are all the same occurrence of a taxon at a place and time but just subsets of the occurrence to facilitate measurements. OBIS's answer is to call all the subsets an event subset (third table) but I find this a bit confusing myself because these are subsets of the 614 individuals caught and I think it makes it more abstract to call these event subsets. A concern I have is that downstream users will see these as separate occurrences and think there were 636 individuals at that location and time. I would argue for having a parentOccurrenceID for the 614 individuals and then nested occurrences for the subsample weight and 22 length measurements.
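The double-counting concern can be sketched as follows. The rows mirror the table above, with a hypothetical parentOccurrenceID column added. That column is the proposed term, not something in the original table. Only the three length classes listed are included, so the naive sum here is 621 rather than the 636 reached when all 22 measured fish are counted.

```python
# Rows from the trawl example, plus a hypothetical parentOccurrenceID column.
rows = [
    {"occurrenceID": "occ_1", "parentOccurrenceID": None, "individualCount": 614},
    {"occurrenceID": "occ_2", "parentOccurrenceID": "occ_1", "individualCount": None},
    {"occurrenceID": "occ_3", "parentOccurrenceID": "occ_1", "individualCount": 4},
    {"occurrenceID": "occ_4", "parentOccurrenceID": "occ_1", "individualCount": 1},
    {"occurrenceID": "occ_5", "parentOccurrenceID": "occ_1", "individualCount": 2},
]

# Treating every row as an independent occurrence inflates the total.
naive_total = sum(r["individualCount"] or 0 for r in rows)

# Counting only rows without a parent recovers the number actually caught.
top_level_total = sum(
    r["individualCount"] or 0 for r in rows if r["parentOccurrenceID"] is None
)

print(naive_total, top_level_total)
```

A downstream user who knows to filter on an empty parentOccurrenceID gets 614, whereas without any such link there is nothing in the table to say which rows are subsets.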

@deepreef

deepreef commented Jun 7, 2022

Thanks, @fja062 ! This is very helpful!

> Imagine a hunting season where 30 individuals are shot, and we know that 10 are female, 10 are male, and 3 of the females are adults. This partial information of the Occurrence demographics is hard to store in the main Occurrence, but can be described cleanly in child Occurrences.

In this case, I would generate three Occurrence records, one with individualCount=10, sex=male; one with individualCount=3, sex=female, lifeStage=adult; and one with individualCount=7, sex=female.

I assume I would need to do likewise if we had a term for parentOccurrenceID, and then I would create a fourth Occurrence instance with individualCount=20 (and nothing for the other terms), then each of the original three would be aggregated by virtue of the fact that they all share the same value of parentOccurrenceID, which refers to the fourth Occurrence record.

I can certainly see the value in that, but the question is whether the new term allowing for this simple aggregation is sufficiently more effective or efficient than the alternative (i.e., aggregating them by eventID and associated Taxon instance).
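The two aggregation routes being compared can be sketched side by side. All identifiers, the event "ev-1" and the placeholder name "Taxon A" are assumptions; the child counts 10, 3 and 7 are chosen so that they sum to the fourth record's 20.

```python
from collections import defaultdict

# Three detailed records plus the summarising "fourth" record (all values illustrative).
records = [
    {"occurrenceID": "occ-1", "parentOccurrenceID": "occ-4", "eventID": "ev-1",
     "scientificName": "Taxon A", "sex": "male", "individualCount": 10},
    {"occurrenceID": "occ-2", "parentOccurrenceID": "occ-4", "eventID": "ev-1",
     "scientificName": "Taxon A", "sex": "female", "lifeStage": "adult",
     "individualCount": 3},
    {"occurrenceID": "occ-3", "parentOccurrenceID": "occ-4", "eventID": "ev-1",
     "scientificName": "Taxon A", "sex": "female", "individualCount": 7},
    {"occurrenceID": "occ-4", "parentOccurrenceID": None, "eventID": "ev-1",
     "scientificName": "Taxon A", "individualCount": 20},
]


def group_by_parent(rows):
    """Aggregate children under the record they name as parent."""
    groups = defaultdict(list)
    for row in rows:
        if row["parentOccurrenceID"] is not None:
            groups[row["parentOccurrenceID"]].append(row["occurrenceID"])
    return dict(groups)


def group_by_event_and_taxon(rows):
    """Aggregate every record that shares an eventID and a name."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["eventID"], row["scientificName"])
        groups[key].append(row["occurrenceID"])
    return dict(groups)


print(group_by_parent(records))
print(group_by_event_and_taxon(records))
```

Grouping by eventID and name puts the summary record in the same bucket as the three detailed ones, so something else would have to mark it as the total. The parent link makes that role explicit.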

Of course, the alternative method of aggregating requires more "work", and breaks if the associated Identification values don't correspond. But the same is true in the reverse: What happens when multiple child Occurrence instances refer to eventID values or taxonomic identifications that are either different from each other, or different from the parentOccurrenceID instance?

> Another use case is when at least some individuals observed in the individualCount of an Occurrence contain individual ID tags. For example, a group of Ibex were observed in France, and 4 of the 7 individuals were tagged. We can record this partial information in nested child Occurrences of the parent Occurrence ID.

Similar solutions and questions for this use case as well.

Again, I see the potential advantage of this, but I worry whether the advantage offsets the potential cost (including costs of dealing with logical inconsistencies of differing associated Event and Identification instances -- not to mention potentially incongruous Organism instances -- within a "family" (parent+children[+grandchildren?]) set of Occurrence instances).

I suppose the most common use case would involve my "fourth" Occurrence instance being minted first (i.e., a single Occurrence instance with general property values, that is later parsed into more granular Occurrence instances with more specific property values). In that situation, it would be a bad idea to retain the original Occurrence instance but arbitrarily refine it to represent one of the would-be children, then mint however many additional instances are needed to accommodate each unique combination of properties. So it would be better to "deprecate" the original general one and replace it with the individual parsed ones (rather than leave it in alongside the more specific ones, which would generate redundant records and pseudo-replication).

However, having a parentOccurrenceID property would allow both the general one and the specific one to co-exist, with avoidance of redundancy.

I realize this is just sort of a long and winding ramble, but I'm still trying to get my head around the relative costs and benefits of introducing this new term.

@albenson-usgs :

> To me these are all the same occurrence of a taxon at a place and time

Yeah, I guess that's the crux. In my mind, an Occurrence instance unambiguously represents the same Organism at an Event (= place and time), not the same Taxon at an Event. This is why I see all Occurrence records in the context of which Organism instances are in play. Adding Taxon to the mix gets complicated because that is (ideally) inherited through Identification instances, and is of course potentially dynamic over time.

On a related note in response to @fja062 :

> In the example provided by @deepreef, the ‘group’ refers to a social group,

Yes, that is true for the example I gave, but it's not necessarily the case for all Organism instances (and their associated Occurrence instances). dwc:Organism , when applied to compound instances (more than one individual) is not restricted to defined social groups:

"A particular organism or defined group of organisms considered to be taxonomically homogeneous."

This includes any aggregation of taxonomically homogeneous individuals, regardless of whether they are defined in the context of social groups, kin, or any kind of ephemeral sets of individuals (e.g., flocks of birds, schools of fish, etc.). It even potentially includes "every individual identifiable to a particular taxon that has ever lived, or ever will live."

Granted, (almost) no one in TDWG-land thinks of it this way, but going by the definitions (and the pure logic of structuring information), it's technically true.

@albenson-usgs

albenson-usgs commented Jun 8, 2022

@deepreef

> In my mind, an Occurrence instance unambiguously represents the same Organism at an Event (= place and time),

Then what is the purpose of individualCount? In my mind an organism at an event is a single individual. So then for my example would you have 614 unique rows in the occurrence table? And then how do you include information that is a weight for the entire 614 organisms together?

@deepreef

deepreef commented Jun 8, 2022

> Then what is the purpose of individualCount?

This term predates the establishment of the Organism class. In an ideal world, this term would be organized in the Organism class, rather than the Occurrence class. But in general, it's probably better that DwC evolves at an appropriate pace towards a more ontology-like formulation.

Also, I should point out that the definition of an instance of the Occurrence class very explicitly establishes it as being rooted in an Organism, rather than a Taxon:

"An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time."

Ref: https://dwc.tdwg.org/terms/#occurrence

> In my mind an organism at an event is a single individual. So then for my example would you have 614 unique rows in the occurrence table? And then how do you include information that is a weight for the entire 614 organisms together?

An instance of Organism can and often does consist of more than one individual (hence the reason the term individualCount is better organized in that class). Thus, if all 614 individuals belonged to the same Taxon, then they could be bundled into the same collective instance of Organism, and if they all participated in an event (=particular place at a particular time), then they could all be represented by a single instance of Occurrence.

If you wanted to record the fact that 311 of them were adult male, 209 of them were adult female, and 94 of them were subadults (sex indeterminate), then ideally you'd create three instances of Organism, and associate them with the three separate Occurrence instances that captured the respective values for sex and lifestage.

I use the term "ideally" above because in a practical sense, almost no one actually does it this way. The question of whether we all should be doing it this way is open to debate. But we probably ought to at least adhere to the existing DwC definitions of terms as closely as possible.

@fja062
Author

fja062 commented Jun 13, 2022

@deepreef thanks for the detailed reply!

> Of course, the alternative method of aggregating requires more "work", and breaks if the associated Identification values don't correspond. But the same is true in the reverse: What happens when multiple child Occurrence instances refer to eventID values or taxonomic identifications that are either different from each other, or different from the parentOccurrenceID instance?

Could you give an example of where you might expect to see such a situation?

> If you wanted to record the fact that 311 of them were adult male, 209 of them were adult female, and 94 of them were subadults (sex indeterminate), then ideally you'd create three instances of Organism, and associate them with the three separate Occurrence instances that captured the respective values for sex and lifestage.

If we define an Event as "An action that occurs at some location during some time" and Occurrence as "An existence of an Organism at a particular place at a particular time", then the 614 individuals in the example given by @albenson-usgs correspond to one Occurrence at one Event. This, I would say, then refers to one parent Occurrence of 614. I find reporting the same observation as multiple Occurrence instances somewhat misleading and less intuitive somehow.

| parentEventID | parentOccurrenceID | occurrenceID | scientificName | individualCount |
| --- | --- | --- | --- | --- |
| event1-1 | | Occ1-1 | Hippoglossoides platessoides | 614 |
| event1-1 | Occ1-1 | Occ1-1-1 | Hippoglossoides platessoides | 4 |
| event1-1 | Occ1-1 | Occ1-1-2 | Hippoglossoides platessoides | 1 |
| event1-1 | Occ1-1 | Occ1-1-3 | Hippoglossoides platessoides | 2 |

and

| measurementID | parentOccurrenceID | parentEventID | measurementType | measurementValue | measurementUnit |
| --- | --- | --- | --- | --- | --- |
| meas1 | Occ1-1 | event1-1 | weight | 27 | kg |
| meas2 | Occ1-1 | event1-1 | subsample weight | 5.5 | kg |
| meas3 | Occ1-1-1 | event1-1 | length | 14 | cm |
| meas4 | Occ1-1-2 | event1-1 | length | 15 | cm |
| meas5 | Occ1-1-3 | event1-1 | length | 17 | cm |

The beauty I see in the structure of nested occurrences is that each primary (parent) occurrence supplies basic observation information (in this case individualCount, though it could also be occurrenceStatus information). This singular Occurrence can be used immediately without any need to aggregate multiple occurrences that are associated only at the eventID and Taxon / Organism instance to reach the total number of individuals observed at a given Event and Occurrence (i.e. the total number of observations is always reported, thus reducing potential for error, I would think). If further details are required/interesting, they can be sought from the nested instances:

| parentEventID | parentOccurrenceID | occurrenceID | scientificName | sex | lifeStage | individualCount |
| --- | --- | --- | --- | --- | --- | --- |
| Event1-16 | | Occ1-16-1 | Cervus nippon | | | 75 |
| Event1-16 | Occ1-16-1 | Occ2-16-1-1 | Cervus nippon | male | juvenile | 4 |
| Event1-16 | Occ1-16-1 | Occ2-16-1-2 | Cervus nippon | female | juvenile | 12 |
| Event1-16 | Occ1-16-1 | Occ2-16-1-3 | Cervus nippon | male | adult | 28 |
| Event1-16 | Occ1-16-1 | Occ2-16-1-4 | Cervus nippon | female | adult | 31 |

> So it would be better to "deprecate" the original general one and replace it with the individual parsed ones (rather than leave it in alongside the more specific ones, which would generate redundant records and pseudo-replication).

I do see though the potential for confusion if the parent and child individualCount values are not respected. However, given that the parent-child formulation is not alien to DwC and is already in use for eventID and parentEventID without major misunderstandings, I would imagine the application to occurrenceID and parentOccurrenceID could be possible.
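One way to guard against parent and child counts drifting apart is a simple consistency check. The sketch below uses the Cervus nippon rows from the table above. The function name and the rule it applies (direct children may not sum to more than their parent) are assumptions for illustration, not part of the proposal.

```python
# Cervus nippon rows from the example table (identifiers as given there).
records = [
    {"occurrenceID": "Occ1-16-1", "parentOccurrenceID": None, "individualCount": 75},
    {"occurrenceID": "Occ2-16-1-1", "parentOccurrenceID": "Occ1-16-1", "individualCount": 4},
    {"occurrenceID": "Occ2-16-1-2", "parentOccurrenceID": "Occ1-16-1", "individualCount": 12},
    {"occurrenceID": "Occ2-16-1-3", "parentOccurrenceID": "Occ1-16-1", "individualCount": 28},
    {"occurrenceID": "Occ2-16-1-4", "parentOccurrenceID": "Occ1-16-1", "individualCount": 31},
]


def overcounted_parents(rows):
    """Return IDs of parents whose direct children sum to more than the parent."""
    counts = {r["occurrenceID"]: r["individualCount"] for r in rows}
    child_totals = {}
    for r in rows:
        parent_id = r["parentOccurrenceID"]
        if parent_id is not None:
            child_totals[parent_id] = child_totals.get(parent_id, 0) + r["individualCount"]
    # Parents missing from the data are ignored here rather than reported.
    return [p for p, total in child_totals.items() if p in counts and total > counts[p]]


print(overcounted_parents(records))  # children sum to exactly 75, so nothing is flagged
```

Children summing to less than the parent are allowed on purpose, since the use cases above rely on partial information.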

@ymgan

ymgan commented Jun 13, 2022

I felt that the word parent is a little confusing indeed. They are not really parent or child but perhaps more of a group or subset.

According to my understanding, the concept seems to be similar to MaterialGroup for the new data model, see Environmental and community measurements? Would it make more sense to name it in a similar way?

@fja062 I am a little confused about the tables. Why are they referring to parentEventID but not eventID?

@dr-shorthair

@ymgan Yep that was exactly my concern too - see my comment above #413 (comment)

@deepreef

I share the concerns about the term "parent" in this context, but I'll focus my comments here in response to @fja062 :

> Could you give an example of where you might expect to see such a situation? [mismatched EventID or associated Taxon]

> If we define an Event as "An action that occurs at some location during some time" and Occurrence as "An existence of an Organism at a particular place at a particular time", then the 614 individuals in the example given by @albenson-usgs correspond to one Occurrence at one Event. This, I would say, then refers to one parent Occurrence of 614. I find reporting the same observation as multiple Occurrence instances somewhat misleading and less intuitive somehow.

My point was less about the logic than about the mechanics. Suppose we have a "parent" Occurrence record with three "child" Occurrence records, but the EventID is NOT the same for all four records? In some cases, this could make sense -- especially if the differing EventID values are linked to each other via parentEventID. But in other cases, one would be left to make inferences about there being an error in one (or more) of the records.
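The mechanical problem can be made concrete with a small check. It accepts a child whose eventID equals the parent occurrence's eventID or descends from it via parentEventID, and flags anything else. The records, identifiers and function names are hypothetical, and this acceptance rule is only one possible reading of "could make sense".

```python
# Hypothetical events: ev-1a is a sub-event of ev-1, ev-2 is unrelated.
parent_event_of = {"ev-1": None, "ev-1a": "ev-1", "ev-2": None}

occurrences = [
    {"occurrenceID": "P", "parentOccurrenceID": None, "eventID": "ev-1"},
    {"occurrenceID": "c1", "parentOccurrenceID": "P", "eventID": "ev-1"},
    {"occurrenceID": "c2", "parentOccurrenceID": "P", "eventID": "ev-1a"},
    {"occurrenceID": "c3", "parentOccurrenceID": "P", "eventID": "ev-2"},
]


def event_ancestors(event_id, parents):
    """Yield the event itself, then each ancestor, stopping on a cycle."""
    seen = set()
    while event_id is not None and event_id not in seen:
        seen.add(event_id)
        yield event_id
        event_id = parents.get(event_id)


def mismatched_children(rows, parents):
    """Return child occurrenceIDs whose event is unrelated to the parent's event."""
    by_id = {o["occurrenceID"]: o for o in rows}
    problems = []
    for occ in rows:
        parent = by_id.get(occ["parentOccurrenceID"])
        if parent is None:
            continue  # top-level record, or parent not present in this data set
        if parent["eventID"] not in event_ancestors(occ["eventID"], parents):
            problems.append(occ["occurrenceID"])
    return problems


print(mismatched_children(occurrences, parent_event_of))
```

Here c2 passes because its event is nested under the parent's event, while c3 is reported and would need a human to decide whether it is an error.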

Another way of looking at this is coming up with an elegant way (other than instances of ResourceRelationship) that allows us to aggregate multiple Occurrences as sharing both the same Event and same Taxon (as well as potentially inheriting other term values from the "parent"). To me, the main value of any parentXXXXXID term boils down to inheritance, meaning that the properties of the "parent" are understood to be identical to, or a superset of, the properties of the "children".
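Inheritance of this kind could be sketched as a lookup that fills a child's missing values from its nearest ancestor. Which terms are inheritable is an assumption here (eventID and scientificName only); individualCount is deliberately left out because it describes the child itself.

```python
# Assumption: only these terms are inherited from a parent occurrence.
INHERITED_TERMS = ("eventID", "scientificName")

by_id = {
    "p": {"occurrenceID": "p", "parentOccurrenceID": None, "eventID": "ev-1",
          "scientificName": "Cervus nippon", "individualCount": 75},
    "c": {"occurrenceID": "c", "parentOccurrenceID": "p", "sex": "male",
          "individualCount": 4},
}


def resolve(occ_id, records):
    """Return a copy of the record with missing inherited terms filled from ancestors."""
    record = dict(records[occ_id])
    parent_id = record.get("parentOccurrenceID")
    seen = {occ_id}  # guards against accidental cycles in the parent links
    while parent_id is not None and parent_id not in seen:
        seen.add(parent_id)
        parent = records[parent_id]
        for term in INHERITED_TERMS:
            if not record.get(term):
                record[term] = parent.get(term)
        parent_id = parent.get("parentOccurrenceID")
    return record


resolved = resolve("c", by_id)
print(resolved["eventID"], resolved["scientificName"], resolved["individualCount"])
```

The child's own values always win, so a child that states a different eventID keeps it. Whether that should be allowed is exactly the open question raised above.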

I guess I'm biased, because I think we should abandon ALL xxxxxID terms from DwC that function as the equivalent of "foreign keys" to instances associated with other DwC classes, and instead represent all of these through ResourceRelationship instances (with appropriate and corresponding controlled vocabulary for relationshipOfResource/relationshipOfResourceID values).

@fja062
Author

fja062 commented Jun 27, 2022

Thanks for the interesting discussion points raised. A common theme seems to be an issue with the usage of the vocabulary 'parent' and 'child'. However, this terminology is already in place elsewhere in DwC, for example for the eventID and the parentEventID, and is proposed for the term parentMeasurementID (see #362). I see the current proposal as being about generalising the possibility of creating hierarchies in the occurrence extension, as with the existing hierarchy found in the event core. The use of 'parent' in 'parentXXXID' is in my opinion arbitrary (i.e. could be replaced with an alternative noun), though given its existing usage for parentEventID it would seem logical to keep the naming standardised. I thus see the discussion of the word 'parent' as separate from the one at hand.

> My point was less about the logic than about the mechanics. Suppose we have a "parent" Occurrence record with three "child" Occurrence records, but the EventID is NOT the same for all four records? In some cases, this could make sense -- especially if the differing EventID values are linked to each other via parentEventID. But in other cases, one would be left to make inferences about there being an error in one (or more) of the records.

I'm not sure that this scenario could arise if every entry of a child occurrence makes reference both to the eventID (=parentEventID in the tables above) and the so-called parentOccurrenceID, in the same way that every entry of a measurementOrFact would refer to both the eventID and the occurrenceID. In this way, there can never be an error in the relationship between the parent and child entries, and their associated event(s) or measurement(s).

> Another way of looking at this is coming up with an elegant way (other than instances of ResourceRelationship) that allows us to aggregate multiple Occurrences as sharing both the same Event and same Taxon (as well as potentially inheriting other term values from the "parent"). To me, the main value of any parentXXXXXID term boils down to inheritance, meaning that the properties of the "parent" are understood to be identical to, or a superset of, the properties of the "children".

Indeed, the parentXXXID configuration allows us to build a hierarchical structure.

> I guess I'm biased, because I think we should abandon ALL xxxxxID terms from DwC that function as the equivalent of "foreign keys" to instances associated with other DwC classes, and instead represent all of these through ResourceRelationship instances (with appropriate and corresponding controlled vocabulary for relationshipOfResource/relationshipOfResourceID values).

I find the resourceRelationship extension quite heavy, and not elegant for the use cases described above. Use of the resourceRelationship would result in the splitting of a single data set into several, with potential for the loss of the understanding of the 'superset' effect captured in a hierarchy. I imagine it would also be more complicated to associate the measurement or facts to the corresponding parent and child occurrences, for example where a hierarchical structure could allow measurements or facts to apply at differing levels in the hierarchy without repetition of the MoF information.
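For comparison, the same parent links can be written out as ResourceRelationship rows. The sketch below uses the existing terms resourceRelationshipID, resourceID, relatedResourceID and relationshipOfResource. The relationship label "part of" and the generated IDs are assumptions, since no controlled vocabulary for this case has been agreed in this thread.

```python
occurrences = [
    {"occurrenceID": "Occ1-1", "parentOccurrenceID": None},
    {"occurrenceID": "Occ1-1-1", "parentOccurrenceID": "Occ1-1"},
    {"occurrenceID": "Occ1-1-2", "parentOccurrenceID": "Occ1-1"},
    {"occurrenceID": "Occ1-1-3", "parentOccurrenceID": "Occ1-1"},
]


def to_resource_relationships(rows, relationship="part of"):
    """Turn each child-to-parent link into one ResourceRelationship row."""
    relationships = []
    for row in rows:
        parent_id = row.get("parentOccurrenceID")
        if parent_id:
            relationships.append({
                "resourceRelationshipID": f"rel-{len(relationships) + 1}",
                "resourceID": row["occurrenceID"],         # subject: the child
                "relationshipOfResource": relationship,    # hypothetical label
                "relatedResourceID": parent_id,            # object: the parent
            })
    return relationships


for rel in to_resource_relationships(occurrences):
    print(rel)
```

Each link that parentOccurrenceID expresses as one extra column becomes a separate row in a separate table, which is the extra weight being described.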

> According to my understanding, the concept seems to be similar to MaterialGroup for the new data model, see Environmental and community measurements? Would it make more sense to name it in a similar way?

This is an interesting suggestion @ymgan. Could you give an example of how you think this might work in practice?

@Mesibov

Mesibov commented Nov 1, 2022

I'm sorry I missed this discussion when it began, because the "nested occurrences" idea is a novel and interesting solution to a difficult problem.

However, while it suits relational databasing, it has the unfortunate effect of greatly multiplying record numbers in flat-file datasets, which is what GBIF users expect. It also seems a bit like a "slippery slope". If occurrences can be nested in order to disaggregate individual data items, why not do the same for recordedBy with multiple recorders and multiple recordedByIDs? And identifiedBy?

Getting back to the "hunting bag" example, what would be the objections to the following entries in a single record?

occurrenceID = [something unique]
eventID = [something unique]
sex = "male | female"
lifeStage = "adult | juvenile"
organismQuantity = "12"
organismQuantityType = "individuals"
organismRemarks = "3 males, 4 females, 5 juveniles"

@ymgan

ymgan commented May 4, 2023

Hi,

@guillaumebody suggested the following as a potential solution for a dataset from Antarctic GBIF/OBIS. (Thanks Guillaume) I thought this could work~

For the context, the researchers were assessing the diet of Pachyptila belcheri (occ_001), which preyed on Crustacea (occ_002) and Euphausia vallentini (occ_003) in this example.

Event core and eventID are omitted here for simplicity.

occurrence

| occurrenceID | parentOccurrenceID | scientificName | basisOfRecord | preparations |
| --- | --- | --- | --- | --- |
| occ_001 | | Pachyptila belcheri | HumanObservation | |
| occ_002 | occ_001 | Crustacea | MaterialSample | regurgitate content of occ_001 |
| occ_003 | occ_001 | Euphausia vallentini | MaterialSample | regurgitate content of occ_001 |

eMoF

| measurementID | occurrenceID | measurementType | measurementValue |
| --- | --- | --- | --- |
| mea_001 | occ_002 | fraction diet | 0.997 |
| mea_002 | occ_003 | fraction diet | 0.002 |
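A short sketch of how a consumer might read these two tables together: find the prey occurrences that point at the predator, then pick up their eMoF values. The function name is hypothetical and the data are copied from the tables above.

```python
occurrences = [
    {"occurrenceID": "occ_001", "parentOccurrenceID": None,
     "scientificName": "Pachyptila belcheri"},
    {"occurrenceID": "occ_002", "parentOccurrenceID": "occ_001",
     "scientificName": "Crustacea"},
    {"occurrenceID": "occ_003", "parentOccurrenceID": "occ_001",
     "scientificName": "Euphausia vallentini"},
]

emof = [
    {"measurementID": "mea_001", "occurrenceID": "occ_002",
     "measurementType": "fraction diet", "measurementValue": "0.997"},
    {"measurementID": "mea_002", "occurrenceID": "occ_003",
     "measurementType": "fraction diet", "measurementValue": "0.002"},
]


def diet_fractions(predator_id):
    """Map each prey name to its diet fraction for the given predator occurrence."""
    prey = {
        o["occurrenceID"]: o["scientificName"]
        for o in occurrences
        if o["parentOccurrenceID"] == predator_id
    }
    return {
        prey[m["occurrenceID"]]: float(m["measurementValue"])
        for m in emof
        if m["occurrenceID"] in prey and m["measurementType"] == "fraction diet"
    }


print(diet_fractions("occ_001"))
```

Note that here the parent link expresses "found in the regurgitate of", which is a different relationship from the subset-of-a-count cases discussed earlier in the thread.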

7 participants