Sex - curation before uploading first vocabulary version #83

Closed
ManonGros opened this issue Mar 18, 2021 · 37 comments
Labels: content, occurrence, priority:high

Comments

@ManonGros
Collaborator

Here is a file to edit: https://drive.google.com/file/d/1qyBLQnpLyF3qXlNJ2oP0QN6uoHbIbhxd/view?usp=sharing

It contains:

  • the list of existing concepts
  • a list of the values already mapped to the concepts (they are all in the Hidden sheet/tab for now)
  • the GBIF verbatim values for this field that appear more than 10,000 times or in 5 or more datasets

NB: We can make sure that the matching doesn't take into account any numbers for the values. For example, no need to match 1 Macho | 1 Hembra and 1 Macho | 2 Hembra, the data can simply be Macho | Hembra. We just need to notify @marcos-lg before he does the import.
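A minimal sketch of what number-insensitive matching could look like, assuming Python and a hypothetical helper named `strip_counts`. The actual import logic is not described in this thread, so treat this only as an illustration of the idea:

```python
import re

def strip_counts(verbatim: str) -> str:
    """Drop digit runs, then tidy the spacing they leave behind."""
    without_numbers = re.sub(r"\d+", "", verbatim)
    return re.sub(r"\s+", " ", without_numbers).strip()

# Both verbatim values collapse to the same string, so one mapping covers them.
print(strip_counts("1 Macho | 1 Hembra"))
print(strip_counts("1 Macho | 2 Hembra"))
```

With this kind of normalisation only `Macho | Hembra` needs to be present in the sheet.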

Please check instructions here: #70

@marcos-lg added the content label Mar 18, 2021
@pzermoglio

I'd like to work on this one, have requested access.
Thanks

@timrobertson100
Member

As we define this vocabulary, Isabel (DanBIF) writes:

I have a nice dataset with bees in GBIF https://www.gbif.org/dataset/0528c11e-4074-43bb-829f-13b161bd0a56 .
I notice that the sex “worker” is excluded in records, see e.g.
https://www.gbif.org/occurrence/1587094798
Is there a way to change that, so that Worker is accepted in the controlled vocabulary?

@pzermoglio

pzermoglio commented Dec 27, 2021

Nice dataset indeed :)

But to the question, I'd say no.
"worker" is not a value for sex, it's a caste. No matter how obvious it may be for some areas of expertise that all(?) hymenopteran workers are females, terms should not be used to express concepts that are not contemplated in the definition of such terms. If all workers are females, then it'd be easy enough to fill in the sex field with "female" programmatically.

I've seen similar cases happen with sex / reproductive condition / life stage.
E.g., "pregnant", for a vertebrate. We humans -biologists, experts in blabla, etc.- (note that there is some knowledge required anyway to make the inferences), can deduce that the animal in question is an adult female. However, "pregnant" is a reproductive condition, it is clearly not a sex nor a lifestage.

Back to castes, I understand that there is no specific field to capture "worker" right now, and that we don't want to lose information. However, misusing another term (like sex) does not help solve the problem. An alternative would be to try to find community support for creating a new "caste" term in Darwin Core. Otherwise, if just using Simple Darwin Core, "worker" would be best shared under dwc:dynamicProperties, encoded as key:value. (and yes, sadly a bit obscured...)
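As a rough illustration of the key:value encoding, here is a sketch that writes a caste into dwc:dynamicProperties as JSON. The key name `caste`, the helper `with_caste` and the sample record are assumptions made for this example, not an agreed convention:

```python
import json

def with_caste(record: dict, caste: str) -> dict:
    """Return a copy of the record with the caste stored in dynamicProperties."""
    # Keep any properties that are already there; start empty otherwise.
    props = json.loads(record.get("dynamicProperties") or "{}")
    props["caste"] = caste  # hypothetical key name
    return {**record, "dynamicProperties": json.dumps(props, sort_keys=True)}

record = {"occurrenceID": "example-1"}  # made-up record for illustration
print(with_caste(record, "worker")["dynamicProperties"])
```

The sex field is left untouched here, so no caste value ends up in a term that is defined for something else.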

@CecSve
Collaborator

CecSve commented Sep 5, 2022

I will begin preparing this vocabulary for production.

@CecSve self-assigned this Sep 5, 2022
@CecSve
Collaborator

CecSve commented Sep 5, 2022

  • Add English labels to the concepts.
  • Check the already mapped values in the Hidden sheet/tab:
    Correct any errors.
    When needed, move the mapped values from the Hidden sheet to the Concept sheet as an alternative label but do not add any new concept.
  • Map as many verbatim values as possible and add them to the Concepts and Hidden sheet/tabs:
  • Incorporate VertNet mapping to the Concepts and Hidden sheet/tabs if possible: https://github.com/VertNet/DwCVocabs/blob/master/vocabs/
  • Incorporate the ALA mapping to the Concepts and Hidden sheet/tabs if possible.

@CecSve
Collaborator

CecSve commented Sep 7, 2022

I have removed mapping of a sex + ? to the sex, as I would interpret this as unknown/best guess, and it is now mapped to unknown instead.

@DanBIF

DanBIF commented Sep 7, 2022

W.r.t. comment #83 (comment): the argumentation in the comment is valid, but I really think it is a pity to lose information on caste for datasets on Hymenoptera. How about "redefining" the term "Sex" to be "Sex/Caste"? Then we could include "worker" as a valid value.

@CecSve
Collaborator

CecSve commented Sep 7, 2022

W.r.t. comment #83 (comment): the argumentation in the comment is valid, but I really think it is a pity to lose information on caste for datasets on Hymenoptera. How about "redefining" the term "Sex" to be "Sex/Caste"? Then we could include "worker" as a valid value.

We would want to keep the vocabularies strict to only account for one specific term to avoid confusion for publishers as well as users. Values that refer to e.g. 'queen', 'worker', 'drone' etc. are present in the sex field in 0.34% of the datasets and in 0.05% of occurrences, so it is not a prevalent issue. It would still be possible to get the values from the verbatim data if you carry out a full download.

Caste systems/eusociality and other sorts of hierarchical social structures extend beyond Hymenoptera and Insecta, and it might be interesting to try to capture the variation in a DwC term (although for practical purposes it may be too complex and specific to capture in a standard). For now, I will map queen = female, but workers are not necessarily female (e.g. in termites), so they will not be interpreted as female (worker would still be in the data as the verbatim sex). I am not aware of whether the term drone is used beyond Hymenoptera(?) but will leave it unmapped for sex, to avoid any misinterpretation.

@tucotuco

tucotuco commented Sep 7, 2022

@DanBIF A proposal has been made for a term "caste" in Darwin Core. It would be good to lend support for the term in this issue.

@CecSve
Collaborator

CecSve commented Sep 8, 2022

Ok @jhnwllr over to you. We have >3,500 hidden values for four concepts so please only

  • check the mappings in the Hidden tab

I have not added any new concepts, but have decided to translate VertNet's in question and indeterminable concepts to unknown. In case multiple sexes were mapped in VertNet (separated by pipe), I have translated them into our mixed concept. In cases where VertNet mapped e.g. numbers to unknown, I have opted to remove the mapping since it is not clearly related to the sex concept and most likely a misplacement. Also, in some cases the VertNet mapping was to either female or male even though the sex was followed by a ? - in these cases I have changed the mapping to unknown. Juvenile, juv, j and other immature life stages submitted as a sex term have been mapped to unknown. Values of e.g. 1F2J have been mapped to the sex we know is present, in this case female (this is assuming that F cannot relate to fledgling; multiple sexes are mapped to mixed). Values with variations of 7 males female have been mapped to mixed, even though there is no number for how many females there were, whereas e.g. 8M 0F has been mapped to male.
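The handling of compact count codes described above can be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the function name `sex_from_counts` is hypothetical, it only understands tokens of the form number + M/F/J, and it does not cover free text such as 7 males female. The real mapping was done by hand in the sheet:

```python
import re

# A count followed by a one-letter code, e.g. "8M", "0F", "2J".
TOKEN = re.compile(r"(\d+)\s*([MFJ])", re.IGNORECASE)

def sex_from_counts(verbatim: str) -> str:
    """Suggest a concept for values such as '1F2J' or '8M 0F'."""
    present = set()
    for count, code in TOKEN.findall(verbatim):
        code = code.upper()
        # Juveniles say nothing about sex; a zero count means "none present".
        if code == "J" or int(count) == 0:
            continue
        present.add("male" if code == "M" else "female")
    if len(present) == 2:
        return "mixed"
    if len(present) == 1:
        return present.pop()
    return "unknown"

for value in ["1F2J", "8M 0F", "5M1J", "2M3F", "2J"]:
    print(value, "->", sex_from_counts(value))
```

Like the manual mapping, the sketch assumes that F stands for female and never for fledgling.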

@CecSve
Collaborator

CecSve commented Sep 8, 2022

NB: We can make sure that the matching doesn't take into account any numbers for the values. For example, no need to match 1 Macho | 1 Hembra and 1 Macho | 2 Hembra, the data can simply be Macho | Hembra. We just need to notify @marcos-lg before he does the import.

I missed this and have accounted for all the numbers after all... they are the reason for the high number of hidden values.

@CecSve
Collaborator

CecSve commented Oct 27, 2022

The vocabulary is ready for you to check now @jhnwllr. Please use the Hidden_OpenRefine tab to check - you should compare the Hidden label to the Concept_dropdown.

@jhnwllr

jhnwllr commented Nov 1, 2022

@CecSve this is my review. Feel free to disagree or question my comments.

might mean MALE https://en.wikipedia.org/wiki/XO_sex-determination_system
O

I think all of these should be Unknown

O"[=F]		
t		
upside down F symbol		

Why is this marked Unknown?
(Female)

Seems fair to call this Female
[size Female]

This might mean Male https://en.wikipedia.org/wiki/Hemipenis
1 hemipene, throat white, scales

This should be Unknown
1 juvenile

I bet this should be Female.
1 OV.F
1OV. F
2 OV. F
2 O.F

(later these are all marked as Female)

OV F
OV. F
OV.F
OVF
OVIG F.
OVIG.F

Should be Unknown.
1juv.

Should be Unknown.
2?

Previously such combinations were Male.
5M1J

Should be Unknown.
A

I think this means Adult Male.
A M

Should be Female.
Adult [Female]

Should be Unknown.
adult hooks

All Exx ex ex. ect have to be some system... I think this is Male.
Ex

This should be Mixed.
F (M juv.)

Interesting that this one is considered Female but other similar are marked Unknown.
F [M listed on tag]

Should be Mixed or Unknown
Femalemale

I bet this stands for female gonads or something.
FG

Should be Hermaphrodite
hermafrodita

There are a lot of the bracketed reclassifications. I think we need some standard way of dealing with them.
M [cataloged as F] (marked Male)
M [F]! (marked unknown)

Others with question mark have been marked as unknown, but these are marked as male.
M by call [?]
M imm.?
M juv.?
M S.Adult?
M, juvenile?
M/adult (?)

I think this should be marked Unknown
male (?)/female

Should be mixed
Male female

These are all marked as Unknown but in other places similar are marked as Male or Female.

Prob. F		
Prob. Female		
probable adult female		
Probably [Male]		
probably F		
PROBABLY FEMALE		
PROBABLY MALE		
prop. F.		

Should be Female.
una hembra

@tucotuco

tucotuco commented Nov 1, 2022

In my experience the parenthetical and bracketed entries often signify uncertainty. Whether uncertainty should result in "unknown" or in the suggested probable interpretation should be, at the very least, consistent.

@CecSve
Collaborator

CecSve commented Mar 14, 2023

Be aware that a null interpreted Sex field can be populated based on data in dynamicProperties (gbif/pipelines#478). Both interpretations use the same vocabulary for the mapping.

@CecSve
Collaborator

CecSve commented Feb 21, 2024

Thank you for the thorough breakdown @jhnwllr - I will go through them one by one. Based on discussions with the NAOC group, the indeterminate concept will be applied on verbatim values that include ?, but values with () or [] will be mapped to the sex supplied - ? will overrule () and []. Prop., prob. and probably etc. will be treated as ?.

I will let you know if I have any questions.

@CecSve
Collaborator

CecSve commented Feb 21, 2024

After consulting with the NAOC work group a while back, I have switched the unknown category to the concept indeterminate following the BODC vocabulary http://vocab.nerc.ac.uk/collection/S10/current/S105/, since unknown implies that no effort was made to determine the sex.

@CecSve
Collaborator

CecSve commented Feb 21, 2024

In cases where there is no information on sex, e.g. only values relating to age: ? age days, ? 2nd yr., ? IMM, ? JUV and ? young I will map to indeterminate.

@CecSve
Collaborator

CecSve commented Feb 22, 2024

Non-binary sexes will be mapped to the concept Other instead of Atypical. Atypical is a loaded term, and for some taxonomic groups, e.g. Mollusca, hermaphroditism is quite common and therefore not atypical. The suggested definition of Other is: "Includes hermaphrodites (in which both sexes are manifested in a single individual) as well as other sex values, such as gynandromorphs, but which may not have been studied in detail, and so are lumped into a single class."

@CecSve
Collaborator

CecSve commented Feb 22, 2024

The Not_specified concept will be dropped as no values are mapped to it. If the value does not represent a sex, e.g. numbers, then the value will not be mapped

@CecSve
Collaborator

CecSve commented Mar 1, 2024

The vocabulary is now uploaded to PROD: https://registry.gbif.org/vocabulary/Sex

@CecSve closed this as completed Mar 1, 2024
@CecSve
Collaborator

CecSve commented Jul 2, 2024

When the vocabulary is ready to be implemented in the pipeline, the following clean-up of the verbatim values should be carried out before mapping the values:

remove trailing

,
.
'
*
*
"
'
white space
dash
!
!
]
)
[
numbers (not zeros)

remove within text string

,
/
numbers (not zeros)
;
.
+
|
&

remove (leading)

numbers (not zeros)
[
,
white space

recode:

[] and {} and {] and [} to ()

@marcos-lg
Contributor

I was taking a look at the hidden values that we have and I think we need to redefine the cleanup rules.

We have hidden labels that contain [ and ], for example:

[FEMALE]
--?[=F]
?[illeg] [M]

So if we apply the suggested cleanup they won't match with any concept. For example, a verbatim value ?[illeg] [M] would be converted to ?(illeg) [M.

Also, if there are several rules to apply it's important to consider the order. For example, removing leading or trailing [ might clash with the recode rule - in the example above, if I had applied the recode rule first the result would have been ?(illeg) (M)
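The ordering problem can be reproduced with a small Python sketch. It is deliberately simplified and is not the pipeline code: only two of the proposed rules are modelled (stripping a leading `[` or trailing `]`, and recoding matched bracket pairs to parentheses), and both function names are hypothetical:

```python
import re

def strip_edge_brackets(value: str) -> str:
    """Remove a leading '[' and a trailing ']' (a subset of the proposed rules)."""
    return re.sub(r"^\[|\]$", "", value)

def recode_brackets(value: str) -> str:
    """Rewrite matched [..] or {..} pairs, including mixed ones like {..], as (..)."""
    return re.sub(r"[\[{]([^\[\]{}]*)[\]}]", r"(\1)", value)

verbatim = "?[illeg] [M]"

# Stripping first leaves an unmatched '[' that the recode rule cannot pair up.
print(recode_brackets(strip_edge_brackets(verbatim)))  # ?(illeg) [M

# Recoding first turns every pair into parentheses before anything is stripped.
print(strip_edge_brackets(recode_brackets(verbatim)))  # ?(illeg) (M)
```

Neither result matches the existing hidden label `?[illeg] [M]`, which is the underlying concern.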

@CecSve what are the cases that we want to solve with the cleanup so I can get a better understanding of what we need to clean?

@CecSve
Collaborator

CecSve commented Aug 30, 2024

Thank you for checking! I was unsure whether the clean-up made 100% sense (I redid it 3 times).

I see the two comments I made here are conflicting #83 (comment) and #83 (comment). Let us stick to this (probably need to remap some values so the vocabulary is consistent with what users see, though it won't affect interpretation):

  1. Anything with a ? should be mapped as indeterminate. prob., probably, () etc. will be treated as ? and will also be mapped as indeterminate.
  2. I think we would want to recode after step 1 and then clean the rest.
  3. () or [] is mapped to the sex supplied.

I wonder if it is worth me going over it all first with these rules and checking to see if it still makes sense with the clean-up suggested in #83 (comment)? I can't find my script from last time, but then I can share it with you this time?

@tucotuco

@CecSve There are conventions in Mammalogy and probably in other disciplines by association that parentheses signify uncertainty, so strings that contain them should be mapped to indeterminate also. The square brackets are different, as the convention for that is to signify information that was not recorded originally, but that does not carry with it the uncertainty of the parentheses, so that one should be fine to interpret by dropping the square brackets.

@CecSve
Collaborator

CecSve commented Aug 30, 2024

@CecSve There are conventions in Mammalogy and probably in other disciplines by association that parentheses signify uncertainty, so strings that contain them should be mapped to indeterminate also. The square brackets are different, as the convention for that is to signify information that was not recorded originally, but that does not carry with it the uncertainty of the parentheses, so that one should be fine to interpret by dropping the square brackets.

Thanks for the reminder John! You have mentioned this to me before and I forgot. I will correct the suggestion.

I have still opted to have ? overrule [] so values containing both, for example ? [M] is mapped to indeterminate. Does this seem reasonable, please @tucotuco?
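To make the precedence concrete, here is a sketch of the rules as they stand at this point in the thread: ?, parentheses and prob./probably signal uncertainty, square brackets are dropped, and uncertainty wins over brackets. It is meant as a curation-time helper for suggesting mappings in the sheet, not as pipeline code. The names `suggest_concept` and `LABELS` and the tiny label table are assumptions for illustration:

```python
import re

# Markers treated as uncertainty: '?', parentheses, prob./prop./probable/probably.
UNCERTAIN = re.compile(r"[?()]|\bpro[bp]\b|\bprobabl[ey]\b", re.IGNORECASE)

# Hypothetical stand-in for the labels in the Concepts and Hidden tabs.
LABELS = {"m": "male", "male": "male", "f": "female", "female": "female"}

def suggest_concept(verbatim: str):
    """Suggest a concept for a verbatim value, or None if nothing applies."""
    if UNCERTAIN.search(verbatim):
        return "indeterminate"  # uncertainty overrules square brackets
    cleaned = re.sub(r"[\[\]]", "", verbatim).strip().lower()
    return LABELS.get(cleaned)

for value in ["? [M]", "[FEMALE]", "(Female)", "Prob. F", "PROBABLY MALE", "worker"]:
    print(repr(value), "->", suggest_concept(value))
```

Values that match nothing, such as worker, stay unmapped rather than being forced into a concept.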

@marcos-lg
Contributor

marcos-lg commented Aug 30, 2024

It's important to remember that the mappings should be in the labels so the interpretation is transparent and there are no rules in the Java code that can change the mappings of the labels. The only thing we can do is to clean up the values before the interpretation so we don't have to create a hidden label for each possible case. But the cleanup should be for characters that don't have any value. For example, in LifeStage we remove the leading numbers because there were a lot of verbatim values like:

  • 2 adults
  • 3 adults
  • 4 adults
  • ...

And we can't add a hidden label for every number.
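A rough Python sketch of that split between cleanup and mapping (the real implementation is in Java and is not shown in this thread; `HIDDEN_LABELS`, `clean` and `interpret` are hypothetical names, and the two labels are made up for the example):

```python
import re

# The vocabulary owns every mapping: label -> concept.
HIDDEN_LABELS = {"adult": "Adult", "adults": "Adult"}

def clean(verbatim: str) -> str:
    """Remove characters that carry no meaning: here, only leading numbers."""
    return re.sub(r"^\d+\s*", "", verbatim.strip()).lower()

def interpret(verbatim: str):
    """Look the cleaned value up in the labels; the code never picks a concept itself."""
    return HIDDEN_LABELS.get(clean(verbatim))

print(interpret("2 adults"))   # found through the single label "adults"
print(interpret("17 adults"))  # same label, no extra hidden value needed
print(interpret("subadult"))   # not a label, so it stays uninterpreted
```

The cleanup only shrinks the set of strings that need labels; which concept a string maps to is still decided entirely in the vocabulary.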

Therefore, I can't do things like this because it's a mapping done in java that doesn't take the vocabulary into account:

Anything with a ? should be mapped as indeterminate. prob., probably, () etc. will be treated as ? and will also be mapped as indeterminate.

Removing the brackets is fine although I see that we have many hidden labels that contain brackets so it doesn't feel right to me.

If we are not sure about the need for the cleanup, I suggest not doing it and adding it at a later point if we see it's necessary.

@CecSve
Collaborator

CecSve commented Aug 30, 2024

If we are not sure about the need for the cleanup, I suggest not doing it and adding it at a later point if we see it's necessary.

I agree. I have decided to take one last look to see if any cleanup would make sense. So I am remapping everything and will share the JSON with the steps I took.

@tucotuco

@marcos-lg I think I am missing something. If you can clean up by removing things like square brackets, why can you not clean up by substituting "indeterminate", which is in the vocabulary, for anything with the patterns @CecSve mentions?

@marcos-lg
Contributor

My concern is that the cleanup was intended to remove characters that don't have any significant value but a substitution might override the labels that are present in the vocabulary.

And I see that we have hidden labels like ?MALE which makes me think that we have already covered the cases with ?. In other words, if we do those substitutions these hidden labels will never be used and that doesn't feel right to me.

@CecSve
Collaborator

CecSve commented Aug 30, 2024

And I see that we have hidden labels like ?MALE which makes me think that we have already covered the cases with ?. In other words, if we do those substitutions these hidden labels will never be used and that doesn't feel right to me.

To put it in context what I am preparing with the example you gave.

  1. I will map ?M to indeterminate
  2. I am cleaning any values that will lead to the same value (?M) if:
  • some spaces are removed (leading, trailing, whitespace)
  • numbers are removed
  • special characters are removed (e.g., ',!"# etc.)

I do this to have a shorter list than 3000+ verbatim values to map, leaving me with approximately half the amount of verbatim values to map. I will upload the original sheet and the JSON for the cleaning steps soon.
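The effect of this kind of cleaning on the size of the list can be sketched in Python. The actual work was done in OpenRefine; the function `normalise`, the character set and the sample values below are assumptions for illustration only:

```python
import re

def normalise(verbatim: str) -> str:
    """Drop numbers and a few special characters, then tidy whitespace."""
    value = re.sub(r"\d+", "", verbatim)
    value = re.sub(r"[',!\"#]", "", value)  # note: '?' is kept on purpose
    return re.sub(r"\s+", " ", value).strip()

verbatim_values = ["?M", " ?M ", "?M!", "2 ?M", "?M 1", "'?M'"]
distinct = {normalise(v) for v in verbatim_values}

print(len(verbatim_values), "verbatim values ->", len(distinct), "value to map:", distinct)
```

All six variants reduce to `?M`, so only that one value needs a mapping to indeterminate.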

@CecSve
Collaborator

CecSve commented Aug 30, 2024

@marcos-lg I have attached the edits I made in OpenRefine using GREL in JSON format. I haven't attached the full history of what I did since it is only relevant for standardizing concepts and hidden values before vocabulary mapping. We end up with 1,447 verbatim values mapped to 5 concepts.

If you think the following makes sense, then I will update the sex vocabulary based on what I just did:

  1. Pipelines will cleanup the verbatim values based on the steps included in this file (maybe a little restructuring is needed, but cleanup result should be the same)
  2. Cleaned-up values will get mapped to the vocabulary
  3. The cleanup steps should be documented in the technical documentation (probably in the data processing section)

Does this plan make sense, please? I hope this makes it more transparent what type of cleanup leads to constructing a vocabulary.

value_edits_history.json

@marcos-lg
Contributor

What does this expression mean? value.replace(/[\\p{Zs}\\s]+/,' ')
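Read as a regular expression, `[\p{Zs}\s]+` matches one or more characters that are either Unicode space separators (category Zs, for example the no-break space U+00A0 or the ideographic space U+3000) or ordinary whitespace such as tabs and newlines, and the expression replaces each such run with a single plain space. The doubled backslashes are most likely just JSON escaping from the export. A rough Python equivalent, as an approximation rather than a translation of the GREL: Python's `re` module has no `\p{Zs}`, but to the best of my knowledge `\s` on text patterns already covers those characters. The function name is hypothetical:

```python
import re

def collapse_whitespace(value: str) -> str:
    """Replace every run of (Unicode) whitespace with one plain space."""
    return re.sub(r"\s+", " ", value)

# Two no-break spaces, a tab and an ideographic space between the words.
sample = "probably\u00a0\u00a0male\t(?)\u3000juv."
print(collapse_whitespace(sample))  # probably male (?) juv.
```

The point of the step is that values differing only in the kind or number of spaces end up as the same string before they are mapped.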

We can try with that cleanup. I still see many hidden labels that won't be used (for example all the labels that contain numbers) but I checked some in prod and don't seem to be used anymore.

@tucotuco

My concern is that the cleanup was intended to remove characters that don't have any significant value but a substitution might override the labels that are present in the vocabulary.

And I see that we have hidden labels like ?MALE which makes me think that we have already covered the cases with ?. In other words, if we do those substitutions these hidden labels will never be used and that doesn't feel right to me.

This tells me that the approach is not right. The method isn't sacred, the result is. So what is the goal? I would have thought that the goal is to do the best matching with the least vocabulary maintenance. I know vocabulary maintenance is important, or GBIF would not have decided to discard hidden labels that apply to very few records or very few data sets. @CecSve also mentions it in this issue.

So what happens when "male, maybe, sort of?" starts getting published and isn't in the list of hidden values. The proposed approach means that "male, maybe, sort of" has to be in the hidden values and mapped or it will not be improved. If processing turned everything with a '?' in it into "indeterminate", every appropriate improvement would be made immediately without any reliance on vocabulary maintenance. That seems like a serious win to me. Again, if I misunderstand something, my apologies.

@marcos-lg
Contributor

marcos-lg commented Sep 2, 2024

AFAIK the goal was to move the interpretation from Java code (enums and parsers) to the vocabulary so it's users who decide how values should be mapped. That's why hardcoding that rule of replacing ? with indeterminate breaks this approach. The main problem with this is that if we realize at a later point that we need to cover more cases to map them to indeterminate (for example, the ? in other alphabets) we'd have to change the Java code (and release and deploy the changes) and that's what we want to avoid with the vocabularies.

One thing we can do is to extend the vocabulary to allow regular expressions (in the hidden labels or as a new field) although this will bring the case where 2 regular expressions in 2 different concepts might overlap (right now this doesn't happen because I check that the labels are unique within the vocabulary).
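A sketch of the overlap problem, in Python and purely hypothetical: the vocabulary server does not currently support regular expressions as far as this thread says, and the structure `REGEX_LABELS`, the patterns and the function name are all made up for illustration:

```python
import re

# Hypothetical regex labels per concept.
REGEX_LABELS = {
    "indeterminate": [r".*\?.*"],    # anything containing a question mark
    "male": [r"\??\s*m(ale)?"],      # m / male, optionally prefixed by '?'
}

def matching_concepts(value: str) -> list:
    """Return every concept with at least one pattern matching the whole value."""
    return sorted(
        concept
        for concept, patterns in REGEX_LABELS.items()
        if any(re.fullmatch(p, value, re.IGNORECASE) for p in patterns)
    )

for value in ["?MALE", "male", "worker"]:
    concepts = matching_concepts(value)
    status = "OVERLAP" if len(concepts) > 1 else "ok"
    print(repr(value), concepts, status)
```

With plain labels a uniqueness check is a simple set comparison, whereas with patterns an overlap such as the one for `?MALE` only shows up when concrete values are tested, so some precedence rule or validation against known verbatim values would probably be needed.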

@tucotuco

tucotuco commented Sep 2, 2024 via email

@marcos-lg
Contributor

marcos-lg commented Sep 2, 2024

Ok, I understand. It just seemed that if you were doing things in code (which it still seems like you are), why not do something even more useful?

Yeah, that's true. To be honest I'm not 100% sure that adding the cleanup in the code was the best decision but it was convenient at that time.

I'll give some thought to the regular expressions to see if we can allow them without introducing many more problems.


8 participants