Sex - curation before uploading first vocabulary version #83

ManonGros · 2021-03-18T09:43:35Z

Here is a file to edit: https://drive.google.com/file/d/1qyBLQnpLyF3qXlNJ2oP0QN6uoHbIbhxd/view?usp=sharing

It contains:

the list of existing concepts
a list of the values already mapped to the concepts (they are all in the Hidden sheet/tab for now)
the GBIF verbatim values for this field that appear more than 10,000 times or in 5 or more datasets

NB: We can make sure that the matching doesn't take into account any numbers for the values. For example, no need to match 1 Macho | 1 Hembra and 1 Macho | 2 Hembra, the data can simply be Macho | Hembra. We just need to notify @marcos-lg before he does the import.
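A minimal sketch of the number-agnostic matching described above, assuming the counts are plain integers placed in front of each pipe-separated value. The function name `strip_counts` is hypothetical and not taken from any GBIF code.

```python
import re

def strip_counts(verbatim: str) -> str:
    """Drop integer counts that precede each pipe-separated value."""
    parts = [part.strip() for part in verbatim.split("|")]
    # Remove a leading run of digits (and any spaces after it) from each part.
    cleaned = [re.sub(r"^\d+\s*", "", part) for part in parts]
    return " | ".join(cleaned)

print(strip_counts("1 Macho | 1 Hembra"))
print(strip_counts("1 Macho | 2 Hembra"))
```

Both calls yield the same string, so a single label would cover every count combination.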

Please check instructions here: #70


pzermoglio · 2021-03-18T15:16:02Z

I'd like to work on this one, have requested access.
Thanks

timrobertson100 · 2021-12-21T15:53:37Z

As we define this vocabulary, Isabel (DanBIF) writes:

I have a nice dataset with bees in GBIF https://www.gbif.org/dataset/0528c11e-4074-43bb-829f-13b161bd0a56 .
I notice that the sex “worker” is excluded in records, see e.g.
https://www.gbif.org/occurrence/1587094798
Is there a way to change that, so that Worker is accepted in the controlled vocabulary?

pzermoglio · 2021-12-27T16:18:12Z

Nice dataset indeed :)

But to the question, I'd say no.
"worker" is not a value for sex, it's a caste. No matter how obvious it may be for some areas of expertise that all(?) hymenopteran workers are females, terms should not be used to express concepts that are not contemplated in the definition of such terms. If all workers are females, then it'd be easy enough to fill in the sex field with "female" programmatically.

I've seen similar cases happen with sex / reproductive condition / life stage.
E.g., "pregnant", for a vertebrate. We humans -biologists, experts in blabla, etc.- (note that there is some knowledge required anyway to make the inferences), can deduce that the animal in question is an adult female. However, "pregnant" is a reproductive condition, it is clearly not a sex nor a lifestage.

Back to castes, I understand that there is no specific field to capture "worker" right now, and that we don't want to lose information. However, misusing another term (like sex) does not help solve the problem. An alternative would be to try to find community support for creating a new "caste" term in Darwin Core. Otherwise, if just using Simple Darwin Core, "worker" would be best shared under dwc:dynamicProperties, encoded as key:value. (and yes, sadly a bit obscured...)
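As an illustration of the key:value suggestion, here is a small sketch that serialises a caste as a JSON object suitable for dwc:dynamicProperties. The key name `caste` and the function name are assumptions made for this example; the thread does not prescribe either.

```python
import json

def caste_as_dynamic_properties(caste: str) -> str:
    """Encode a caste as a JSON key:value string (the key name is an assumption)."""
    return json.dumps({"caste": caste})

print(caste_as_dynamic_properties("worker"))
```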

CecSve · 2022-09-05T14:15:17Z

I will begin preparing this vocabulary for production.

CecSve · 2022-09-05T14:21:40Z

Add English labels to the concepts.
Check the already mapped values in the Hidden sheet/tab:
Correct any errors.
When needed, move the mapped values from the Hidden sheet to the Concept sheet as an alternative label but do not add any new concept.
Map as many verbatim values as possible and add them to the Concepts and Hidden sheet/tabs:
Incorporate VertNet mapping to the Concepts and Hidden sheet/tabs if possible: https://github.com/VertNet/DwCVocabs/blob/master/vocabs/
Incorporate the ALA mapping to the Concepts and Hidden sheet/tabs if possible.

CecSve · 2022-09-07T10:44:45Z

I have removed mapping of a sex + ? to the sex, as I would interpret this as unknown/best guess, and it is now mapped to unknown instead.

DanBIF · 2022-09-07T10:49:43Z

W.r.t. comment #83 (comment) The argumentation in the comment is valid, but I really think it is a pity to lose information on caste for datasets on Hymenoptera. How about "redefining" the term "Sex" to be "Sex/Caste"? - then we could include "worker" as a valid value

CecSve · 2022-09-07T12:57:48Z

> W.r.t. comment #83 (comment) The argumentation in the comment is valid, but I really think it is a pity to lose information on caste for datasets on Hymenoptera. How about "redefining" the term "Sex" to be "Sex/Caste"? - then we could include "worker" as a valid value

We would want to keep the vocabularies strict to only account for one specific term to avoid confusion for publishers as well as users. Values that refer to e.g. 'queen', 'worker', 'drone' etc. are present in the sex field in 0.34% of the datasets and in 0.05% of occurrences, so it is not a prevalent issue. It would still be possible to get the values from the verbatim data if you carry out a full download.

Caste systems/eusociality and other sorts of hierarchical social structures extend beyond Hymenoptera and Insecta, and it might be interesting to try to capture the variation in a DwC term (although for practical purposes it may be too complex and specific to capture in a standard). For now, I will map queen = female, but workers are not necessarily females (e.g. in termites), so they will not be interpreted as female (worker would still be in the data as the verbatim sex). I am not aware of whether the term drone is used beyond Hymenoptera(?), but I will leave it unmapped for sex to avoid any misinterpretation.

tucotuco · 2022-09-07T13:12:59Z

@DanBIF A proposal has been made for a term "caste" in Darwin Core. It would be good to lend support for the term in this issue.

CecSve · 2022-09-08T12:59:04Z

Ok @jhnwllr over to you. We have >3,500 hidden values for four concepts, so please only check the mappings in the Hidden tab.

I have not added any new concepts, but I have made the following mapping decisions:

VertNet's in question and indeterminable concepts are translated to unknown.
Where VertNet mapped multiple sexes (separated by a pipe), I have translated them into our mixed concept.
Where VertNet mapped e.g. numbers to unknown, I have removed the mapping, since a number is not clearly related to the sex concept and is most likely a misplaced value.
Where VertNet mapped a sex followed by a ? to either female or male, I have changed the mapping to unknown.
Juvenile, juv, j and other immature life stages submitted as a sex term have been mapped to unknown.
Values such as 1F2J have been mapped to the sex we know is present, in this case female (this assumes that F cannot stand for fledgling; multiple sexes are mapped to mixed).
Values with variations of 7 males female have been mapped to mixed, even though the number of females is not given, whereas e.g. 8M 0F has been mapped to male.
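The handling of compact count codes described above (1F2J, 8M 0F) can be sketched as follows. This is only an illustration of the stated decisions, assuming the codes consist solely of optional counts followed by the letters M, F or J; the mappings in the sheet were curated by hand, and the function name is hypothetical.

```python
import re

# One token is an optional count followed by a one-letter code.
TOKEN = re.compile(r"(\d*)\s*([MFJ])")
# The whole value must consist of such tokens for the rule to apply.
WHOLE = re.compile(r"(?:\s*\d*\s*[MFJ])+\s*")

def sex_from_count_code(verbatim):
    """Return 'male', 'female', 'mixed' or 'unknown' for codes such as
    '1F2J'; return None when the value is not a compact count code."""
    value = verbatim.upper()
    if not WHOLE.fullmatch(value):
        return None
    sexes = set()
    for count, code in TOKEN.findall(value):
        if count and int(count) == 0:
            continue  # '0F' says that no females were present
        if code == "M":
            sexes.add("male")
        elif code == "F":
            sexes.add("female")
        # 'J' (juvenile) is a life stage, so it contributes no sex
    if len(sexes) > 1:
        return "mixed"
    return sexes.pop() if sexes else "unknown"

for example in ["1F2J", "8M 0F", "2M3F", "2J", "Female"]:
    print(example, "->", sex_from_count_code(example))
```

Spelled-out values such as Female are deliberately left alone (the function returns None) so that they can be matched against ordinary labels instead.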

CecSve · 2022-09-08T13:01:09Z

> NB: We can make sure that the matching doesn't take into account any numbers for the values. For example, no need to match 1 Macho | 1 Hembra and 1 Macho | 2 Hembra, the data can simply be Macho | Hembra. We just need to notify @marcos-lg before he does the import.

I missed this and have accounted for all the numbers after all... they are the reason for the high number of hidden values.

To do: remove hidden values with counts/numbers from the Hidden tab (see Integration of new vocabularies into the interpretation step, gbif/pipelines#747)

CecSve · 2022-10-27T13:19:09Z

The vocabulary is ready for you to check now @jhnwllr. Please use the Hidden_OpenRefine tab to check - you should compare the Hidden label to the Concept_dropdown.

jhnwllr · 2022-11-01T13:55:15Z

@CecSve this is my review. Feel free to disagree or question my comments.

might mean MALE https://en.wikipedia.org/wiki/XO_sex-determination_system
O

I think all of these should be Unknown

O"[=F]		
t		
upside down F symbol

Why is this marked Unknown?
(Female)

Seems fair to call this Female
[size Female]

This might mean Male https://en.wikipedia.org/wiki/Hemipenis
1 hemipene, throat white, scales

This should be Unknown
1 juvenile

I bet this should be Female.
1 OV.F
1OV. F
2 OV. F
2 O.F

(later these are all marked as Female)

OV F
OV. F
OV.F
OVF
OVIG F.
OVIG.F

Should be Unknown.
1juv.

Should be Unknown.
2?

Previously such combinations were Male.
5M1J

Should be Unknown.
A

I think this means Adult Male.
A M

Should be Female.
Adult [Female]

Should be Unknown.
adult hooks

All the Exx, ex, ex. etc. values have to belong to some system... I think this is Male.
Ex

This should be Mixed.
F (M juv.)

Interesting that this one is considered Female but other similar are marked Unknown.
F [M listed on tag]

Should be Mixed or Unknown
Femalemale

I bet this stands for female gonads or something.
FG

Should be Hermaphrodite
hermafrodita

There are a lot of the bracketed reclassifications. I think we need some standard way of dealing with them.
M [cataloged as F] (marked Male)
M [F]! (marked unknown)

Others with question mark have been marked as unknown, but these are marked as male.
M by call [?]
M imm.?
M juv.?
M S.Adult?
M, juvenile?
M/adult (?)

I think this should be marked Unknown
male (?)/female

Should be mixed
MaleÂ female

These are all marked as Unknown but in other places similar are marked as Male or Female.

Prob. F		
Prob. Female		
probable adult female		
Probably [Male]		
probably F		
PROBABLY FEMALE		
PROBABLY MALE		
prop. F.

Should be Female.
una hembra

tucotuco · 2022-11-01T14:05:14Z

In my experience the parenthetical and bracketed entries often signify uncertainty. Whether uncertainty should result in "unknown" or in the suggested probable interpretation should be, at the very least, consistent.

CecSve · 2023-03-14T12:28:26Z

Be aware that a null interpreted Sex field can be populated based on data in dynamicProperties (gbif/pipelines#478). Both interpretations use the same vocabulary for the mapping.

CecSve · 2024-02-21T15:19:41Z

Thank you for the thorough breakdown @jhnwllr - I will go through them one by one. Based on discussions with the NAOC group, the indeterminate concept will be applied to verbatim values that include ?, but values with () or [] will be mapped to the sex supplied; ? will overrule () and []. Prop., prob., probably etc. will be treated as ?.

I will let you know if I have any questions.

CecSve · 2024-02-21T15:32:32Z

After consulting with the NAOC work group a while back, I have switched the unknown category to the concept indeterminate following the BODC vocabulary http://vocab.nerc.ac.uk/collection/S10/current/S105/, since unknown implies that no effort was made to determine the sex.

CecSve · 2024-02-21T15:51:12Z

In cases where there is no information on sex, e.g. only values relating to age: ? age days, ? 2nd yr., ? IMM, ? JUV and ? young I will map to indeterminate.

CecSve · 2024-02-22T10:25:34Z

Non-binary sexes will be mapped to the concept Other instead of Atypical. Atypical is a loaded term, and for some taxonomic groups, e.g. Mollusca, hermaphroditism is quite common and therefore not atypical. The suggested definition of Other is: "Includes hermaphrodites (in which both sexes are manifested in a single individual) as well as other sex values, such as gynandromorphs, but which may not have been studied in detail, and so are lumped into a single class."

CecSve · 2024-02-22T10:28:27Z

The Not_specified concept will be dropped as no values are mapped to it. If a value does not represent a sex, e.g. a number, then it will not be mapped.

CecSve · 2024-03-01T08:46:43Z

The vocabulary is now uploaded to PROD: https://registry.gbif.org/vocabulary/Sex

CecSve · 2024-07-02T08:18:19Z

When the vocabulary is ready to be implemented in the pipeline, the following clean-up of the verbatim values should be carried out before mapping the values:

remove trailing

,
.
'
*
*
"
'
white space
dash
!
!
]
)
[
numbers (not zeros)

remove within text string

,
/
numbers (not zeros)
;
.
+
|
&

remove (leading)

numbers (not zeros)
[
,
white space

recode:

[] and {} and {] and [} to ()

marcos-lg · 2024-08-29T13:56:45Z

I was taking a look at the hidden values that we have and I think we need to redefine the cleanup rules.

We have hidden labels that contain [ and ], for example:

[FEMALE]
--?[=F]
?[illeg] [M]

So if we apply the suggested cleanup they won't match with any concept. For example, a verbatim value ?[illeg] [M] would be converted to ?(illeg) [M.

Also, if there are several rules to apply, it's important to consider the order. For example, removing leading or trailing [ might clash with the recode rule - in the example above, if I had applied the recode rule first, the result would have been ?(illeg) (M)

@CecSve what are the cases that we want to solve with the cleanup so I can get a better understanding of what we need to clean?
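The ordering problem described above can be reproduced with a small sketch. It models only the square-bracket part of the leading/trailing rules and assumes that the recode rule applies to paired brackets; both function names are hypothetical.

```python
import re

def strip_square_brackets(value):
    """Remove leading '[' and trailing ']' (a subset of the listed rules)."""
    return value.lstrip("[").rstrip("]")

def recode_brackets(value):
    """Recode paired [], {}, {] and [} to ()."""
    return re.sub(r"[\[{]([^\[\]{}]*)[\]}]", r"(\1)", value)

verbatim = "?[illeg] [M]"
print(recode_brackets(strip_square_brackets(verbatim)))  # strip first
print(strip_square_brackets(recode_brackets(verbatim)))  # recode first
```

Stripping first leaves an unpaired [ that the recode step can no longer fix, while recoding first keeps both pairs intact, so the two orders produce different strings for the same input.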

CecSve · 2024-08-30T07:05:51Z

Thank you for checking! I was unsure whether the clean-up made 100% sense (I redid it 3 times).

I see the two comments I made here are conflicting #83 (comment) and #83 (comment). Let us stick to this (probably need to remap some values so the vocabulary is consistent with what users see, though it won't affect interpretation):

1. Anything with a ? should be mapped as indeterminate. prob., probably, () etc. will be treated as ? and will also be mapped as indeterminate.
2. I think we would want to recode after step 1 and then clean the rest.
3. ~~() or~~ [] is mapped to the sex supplied.

I wonder if it is worth me going over it all first with these rules and checking whether it still makes sense with the clean-up suggested in #83 (comment)? I can't find my script from last time, but this time I can share it with you.

tucotuco · 2024-08-30T07:42:53Z

@CecSve There are conventions in Mammalogy and probably in other disciplines by association that parentheses signify uncertainty, so strings that contain them should be mapped to indeterminate also. The square brackets are different, as the convention for that is to signify information that was not recorded originally, but that does not carry with it the uncertainty of the parentheses, so that one should be fine to interpret by dropping the square brackets.

CecSve · 2024-08-30T07:48:32Z

> @CecSve There are conventions in Mammalogy and probably in other disciplines by association that parentheses signify uncertainty, so strings that contain them should be mapped to indeterminate also. The square brackets are different, as the convention for that is to signify information that was not recorded originally, but that does not carry with it the uncertainty of the parentheses, so that one should be fine to interpret by dropping the square brackets.

Thanks for the reminder John! You have mentioned this to me before and I forgot. I will correct the suggestion.

I have still opted to have ? overrule [] so values containing both, for example ? [M] is mapped to indeterminate. Does this seem reasonable, please @tucotuco?
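A sketch of how these conventions could be applied while curating hidden labels, assuming that ?, parentheses and "prob."-style prefixes all signal uncertainty and that square brackets are simply dropped. The pattern and the function name are illustrative, not agreed rules or pipeline code.

```python
import re

# Markers treated as uncertainty: ?, parentheses, prob./probably/probable, prop.
UNCERTAIN = re.compile(r"[?()]|\bprob|\bprop\.", re.IGNORECASE)

def apply_conventions(verbatim):
    """Return 'indeterminate' for uncertain values; otherwise return the
    value without square brackets, ready to be mapped to the sex supplied."""
    if UNCERTAIN.search(verbatim):
        return "indeterminate"  # ? overrules [] because it is checked first
    return re.sub(r"[\[\]]", "", verbatim).strip()

for example in ["? [M]", "[FEMALE]", "(Female)", "Prob. F"]:
    print(example, "->", apply_conventions(example))
```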

marcos-lg · 2024-08-30T08:38:11Z

It's important to remember that the mappings should be in the labels so the interpretation is transparent and there are no rules in the Java code that can change the mappings of the labels. The only thing we can do is to clean up the values before the interpretation so we don't have to create a hidden label for each possible case. But the cleanup should be for characters that don't have any value. For example, in LifeStage we remove the leading numbers because there were a lot of verbatim values like:

2 adults
3 adults
4 adults
...

And we can't add a hidden label for every number.
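A minimal sketch of that clean-then-look-up flow: the code only removes characters that carry no meaning, and the mapping itself stays in the labels. The label table below is a made-up stand-in for the vocabulary, and the lower-casing is an assumption of this example.

```python
import re

# Illustrative stand-in for hidden labels in a vocabulary (not the real content).
HIDDEN_LABELS = {"adult": "Adult", "adults": "Adult"}

def interpret(verbatim):
    """Strip a leading count, then look the cleaned value up in the labels."""
    cleaned = re.sub(r"^\d+\s*", "", verbatim.strip()).lower()
    return HIDDEN_LABELS.get(cleaned)  # None when no label matches

for example in ["2 adults", "3 adults", "4 adults"]:
    print(example, "->", interpret(example))
```

With the count removed before the lookup, one hidden label covers every number.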

Therefore, I can't do things like this, because it would be a mapping done in Java that doesn't take the vocabulary into account:

> Anything with a ? should be mapped as indeterminate. prob., probably, () etc. will be treated as ? and will also be mapped as indeterminate.

Removing the brackets is fine although I see that we have many hidden labels that contain brackets so it doesn't feel right to me.

If we are not sure about the need for the cleanup, I suggest not doing it and adding it at a later point if we see it's necessary.

CecSve · 2024-08-30T09:06:18Z

> If we are not sure about the need for the cleanup, I suggest not doing it and adding it at a later point if we see it's necessary.

I agree. I have decided to take one last look to see if any cleanup would make sense. So I am remapping everything and will share the JSON with the steps I took.

tucotuco · 2024-08-30T09:10:19Z

@marcos-lg I think I am missing something. If you can clean up by removing things like square brackets, why can you not clean up by substituting "indeterminate", which is in the vocabulary, for anything with the patterns @CecSve mentions?

marcos-lg · 2024-08-30T09:28:05Z

My concern is that the cleanup was intended to remove characters that don't have any significant value but a substitution might override the labels that are present in the vocabulary.

And I see that we have hidden labels like ?MALE which makes me think that we have already covered the cases with ?. In other words, if we do those substitutions these hidden labels will never be used and that doesn't feel right to me.

CecSve · 2024-08-30T10:41:03Z

> And I see that we have hidden labels like ?MALE which makes me think that we have already covered the cases with ?. In other words, if we do those substitutions these hidden labels will never be used and that doesn't feel right to me.

To put what I am preparing into context, using the example you gave:

I will map ?M to indeterminate
I am cleaning any values that will lead to the same value (?M) if:

some spaces are removed (leading, trailing, whitespace)
numbers are removed
special characters are removed (e.g., ',!"# etc.)

I do this to have a shorter list than 3000+ verbatim values to map, leaving me with approximately half the amount of verbatim values to map. I will upload the original sheet and the JSON for the cleaning steps soon.
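The reduction described above can be sketched as grouping verbatim values by a cleaned key. The exact character set and the choice to remove all whitespace are assumptions made for this example; the real steps are the ones recorded in the OpenRefine history.

```python
import re
from collections import defaultdict

def clean_key(value):
    """Reduce a verbatim value to the key it would be mapped under."""
    key = re.sub(r"\d+", "", value)         # drop numbers
    key = re.sub(r"[',!\"#.]", "", key)     # drop a few special characters
    return re.sub(r"\s+", "", key)          # drop whitespace

def group_by_key(values):
    """Group verbatim values that collapse to the same cleaned key."""
    groups = defaultdict(list)
    for value in values:
        groups[clean_key(value)].append(value)
    return dict(groups)

verbatim_values = ["?M", " ?M ", "2 ?M", "?M!", "F"]
for key, members in group_by_key(verbatim_values).items():
    print(repr(key), "<-", members)
```

Only one mapping decision is then needed per key rather than per verbatim value.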

CecSve · 2024-08-30T11:24:21Z

@marcos-lg I have attached the edits I made in OpenRefine using GREL in JSON format. I haven't attached the full history of what I did since it is only relevant for standardizing concepts and hidden values before vocabulary mapping. We end up with 1,447 verbatim values mapped to 5 concepts.

If you think the following makes sense, then I will update the sex vocabulary based on what I just did:

Pipelines will cleanup the verbatim values based on the steps included in this file (maybe a little restructuring is needed, but cleanup result should be the same)
Cleaned-up values will get mapped to the vocabulary
The cleanup steps should be documented in the technical documentation (probably in the data processing section)

Does this plan make sense, please? I hope this makes it more transparent what type of cleanup leads to constructing a vocabulary.

value_edits_history.json

marcos-lg · 2024-08-30T13:44:29Z

What does this expression mean? value.replace(/[\\p{Zs}\\s]+/,' ')

We can try with that cleanup. I still see many hidden labels that won't be used (for example all the labels that contain numbers) but I checked some in prod and don't seem to be used anymore.
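Read as a regular expression, `[\p{Zs}\s]+` matches a run of one or more characters that are either Unicode space separators (category Zs, which includes the no-break space) or ordinary whitespace, and the call replaces such a run with a single space. The doubled backslashes are presumably JSON string escaping. A rough Python equivalent, assuming the replacement is applied to every match; Python's `re` module has no `\p{Zs}`, but for `str` patterns `\s` already covers those characters:

```python
import re

def collapse_spaces(value):
    """Replace each run of (Unicode) whitespace with one plain space."""
    return re.sub(r"\s+", " ", value)

# A no-break space (U+00A0) followed by ordinary spaces collapses to one space.
print(repr(collapse_spaces("Male\u00a0  female")))
```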

tucotuco · 2024-08-30T17:33:14Z

> My concern is that the cleanup was intended to remove characters that don't have any significant value but a substitution might override the labels that are present in the vocabulary.
>
> And I see that we have hidden labels like ?MALE which makes me think that we have already covered the cases with ?. In other words, if we do those substitutions these hidden labels will never be used and that doesn't feel right to me.

This tells me that the approach is not right. The method isn't sacred, the result is. So what is the goal? I would have thought that the goal is to do the best matching with the least vocabulary maintenance. I know vocabulary maintenance is important, or GBIF would not have decided to discard hidden labels that apply to very few records or very few data sets. @CecSve also mentions it in this issue.

So what happens when "male, maybe, sort of?" starts getting published and isn't in the list of hidden values? The proposed approach means that "male, maybe, sort of?" has to be in the hidden values and mapped, or it will not be improved. If processing turned everything with a '?' in it into "indeterminate", every appropriate improvement would be made immediately without any reliance on vocabulary maintenance. That seems like a serious win to me. Again, if I misunderstand something, my apologies.

marcos-lg · 2024-09-02T07:28:04Z

AFAIK the goal was to move the interpretation from Java code (enums and parsers) to the vocabulary so that it's users who decide how values should be mapped. That's why hardcoding the rule of replacing ? with indeterminate breaks this approach. The main problem is that if we realize at a later point that we need to cover more cases that map to indeterminate (for example, the ? in other alphabets), we'd have to change the Java code (and release and deploy the changes), and that's what we want to avoid with the vocabularies.

One thing we can do is to extend the vocabulary to allow regular expressions (in the hidden labels or as a new field) although this will bring the case where 2 regular expressions in 2 different concepts might overlap (right now this doesn't happen because I check that the labels are unique within the vocabulary).
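The overlap concern can be illustrated with a sketch in which each concept carries regular-expression labels. The patterns and the structure are hypothetical; the vocabulary does not currently support regular expressions.

```python
import re

# Hypothetical regex labels per concept (not part of the real vocabulary).
REGEX_LABELS = {
    "indeterminate": [r"\?"],
    "male": [r"^\??\s*m(ale)?$"],
}

def matching_concepts(value):
    """Return every concept that has at least one pattern matching the value."""
    return sorted(
        concept
        for concept, patterns in REGEX_LABELS.items()
        if any(re.search(p, value, re.IGNORECASE) for p in patterns)
    )

print(matching_concepts("male"))
print(matching_concepts("?MALE"))
```

Exact labels can be checked for uniqueness when they are added, but with patterns an ambiguity like the second call only surfaces when a particular value is tested, so some tie-breaking rule or overlap check would be needed.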

tucotuco · 2024-09-02T12:35:54Z

Ok, I understand. It just seemed that if you were doing things in code (which it still seems like you are), why not do something even more useful? Regular expressions in the vocabulary would be interesting, but would be beyond most vocabulary maintenance participants. if re.search(r'.*[\?\(].*', input_string): return "indeterminate". :-)


marcos-lg · 2024-09-02T12:54:46Z

> Ok, I understand. It just seemed that if you were doing things in code (which it still seems like you are), why not do something even more useful?

Yeah, that's true. To be honest I'm not 100% sure that adding the cleanup in the code was the best decision but it was convenient at that time.

I'll give some thought to the regular expressions to see if we can allow them without introducing many more problems.

marcos-lg added the content label Mar 18, 2021

timrobertson100 added the occurrence label Jun 2, 2022

timrobertson100 mentioned this issue Aug 9, 2022: Use unknown instead of undetermined for sex vocabulary gbif/rs.gbif.org#94 (Open)

CecSve self-assigned this Sep 5, 2022

CecSve assigned jhnwllr Sep 8, 2022

CecSve added the priority:high label Sep 22, 2022

CecSve mentioned this issue Sep 29, 2022: CountryName - curation before uploading first vocabulary version #73 (Open)

CecSve mentioned this issue Jan 27, 2023: Request for new fields to index and expose in search and download gbif/pipelines#666 (Open)

ymgan mentioned this issue Feb 22, 2024: TG2-AMENDMENT_SEX_STANDARDIZED tdwg/bdq#284 (Open)

CecSve closed this as completed Mar 1, 2024

marcos-lg mentioned this issue Aug 28, 2024: Integration of new vocabularies into the interpretation step gbif/pipelines#747 (Open, 2 tasks)
