Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Add Check]: Check that if experimenters string does not have comma #33

Closed
rly opened this issue Feb 15, 2022 · 9 comments · Fixed by #227
Closed

[Add Check]: Check that if experimenters string does not have comma #33

rly opened this issue Feb 15, 2022 · 9 comments · Fixed by #227
Assignees
Labels
category: new check a new best practices check to apply to all NWBFiles and their contents

Comments

@rly
Copy link
Contributor

rly commented Feb 15, 2022

If a user provides a comma-separated scalar string for nwbfile.experimeter, the inspector should recommend that they instead store an array of experimenter names. Same for keywords and related publications.

@CodyCBakerPhD CodyCBakerPhD added the category: new check a new best practices check to apply to all NWBFiles and their contents label Feb 24, 2022
@bendichter
Copy link
Contributor

This is a bit tricky for experimenters, because they might have listed the name Last, First

@CodyCBakerPhD CodyCBakerPhD changed the title Check that if experimenters string does not have comma [Add Check]: Check that if experimenters string does not have comma Mar 15, 2022
@CodyCBakerPhD
Copy link
Collaborator

The BP section for experimenters does not describe the format in which they ought to write these things down: should this be updated to some standard, such as

Last Name, M. I., First Name

M. I. - 'middle initial'

And how should multiple experimenters be handled?

@bendichter
Copy link
Contributor

multiple experimenters should be handled as a vector of strings

@CodyCBakerPhD
Copy link
Collaborator

CodyCBakerPhD commented Apr 11, 2022

OK so, as I understand the heart of this original issue is to try to detect people encoding multiple experimenters with comma-separation instead of being added as different elements of the vector.

However, there are several different ways people can encode a single experimenter name, such as

(a) Firstname Lastname
(b) Lastname, Firstname
(c) [Optional]: Firstname, M.I., Lastname
(di) [Optional]: Lastname, Firstname, M.I.
(dii) [Optional]: Lastname, Firstname M.I.
(e) [Optional]: Lastname, M.I., Firstname

Now, we can still try to catch the heart of this issue by triggering the violation if any((experimenter.count(",") for experimenter in nwbfile.experimenters)) > 3

However, since the Best Practices make many references to trying to make NWBFiles machine-readable, I wonder if it might be in our best interest to encourage a single standard for encoding names (from a-e above), which would also make this kind of check easier and more reliable to catch.

Note: the experimenter schema also mentions that roles can be specified. I'd propose that a Best Practice for that could be optionally including NameOfRole: before the name of the individual to make this easier to parse.

@bendichter
Copy link
Contributor

@CodyCBakerPhD
Copy link
Collaborator

Here's DANDI's regex for name: https://github.com/dandi/dandi-schema/blob/ffa49ef5dc84bc5a234fbc9ca9ba04b872539893/dandischema/models.py#L48

Then their regex matches option (dii).

That could regex could be improved however; note that it allows numbers included in names (try re.fullmatch(pattern=r"^([\w\s\-]+),\s+([\w\s\-\.]+)$", string="Baker2, Cody1")) and disallows apostrophes (likere.fullmatch(pattern=r"^([\w\s\-]+),\s+([\w\s\-\.]+)$", string="O'Baker, Cody")).

Also note it does not accommodate role assignment of each experimenter, which the NWB Schema indicates is allowed

@bendichter So do you think we should make it a Best Practice to enforce structure (dii)? How should we deal with roles?

@bendichter
Copy link
Contributor

@CodyCBakerPhD good points. Maybe we should raise these issues on the dandi schema repo

@CodyCBakerPhD CodyCBakerPhD self-assigned this Apr 11, 2022
@CodyCBakerPhD
Copy link
Collaborator

Turns out that was an older version (not most recent master): https://github.com/dandi/dandi-schema/blob/master/dandischema/models.py#L60

And they may continue to think numbers in names are OK.

One thing to note though as I look through their code and how they use it, I'm not 100% sure they actually require the NWBAsset experimenter metadata to follow that, that's just what they enforce for their dandi assets after extraction. So whatever format the NWB team decides to support for this as a 'best practice' can always be mapped into what dandi expects via their intermediate metadata handling functions.

@bendichter
Copy link
Contributor

@CodyCBakerPhD yes, exactly. This is a dandiset metadata requirement. I'd like to have the option of automatically pulling the experimenter metadata into the dandiset metadata, which is why I thought it might be convenient to have the same regex as a best practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: new check a new best practices check to apply to all NWBFiles and their contents
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants