Data(base) licenses #102

Closed
merkys opened this issue Jun 13, 2019 · 25 comments · Fixed by #414
Labels
topic/response-format Issue discussing changes and improvements to the API response format type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.

Comments

@merkys
Member

merkys commented Jun 13, 2019

OPTiMaDe responses should contain the data license indicators. The implementation details depend on the scope of data licensing we want to use:

  1. Per-database license. This means that one license covers all the data in a database. Example: Wikipedia, which uses CC-BY-SA for all its content.
  2. Per-entry license. This allows data of heterogeneous licenses to coexist in a database without breaching licensing requirements. Example: Wikimedia Commons, which specifies licenses per-entry.

Which option we choose depends on the nature of databases that will use OPTiMaDe. For instance, COD and TCOD contain only public-domain data, so option 1) would be sufficient. What do others think?

To specify licenses in a standard way, I suggest using license names (abbreviations) from the SPDX list of commonly used licenses, if we deem it exhaustive enough.

Edit: SPDX list has only free licenses, and misses public-domain, so not exhaustive enough.

@merkys merkys added topic/response-format Issue discussing changes and improvements to the API response format type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. labels Jun 13, 2019
@merkys merkys added this to the 1.0 release milestone Jun 13, 2019
@merkys merkys self-assigned this Jun 13, 2019
@giovannipizzi
Contributor

If we are not sure, maybe we should postpone this until after 1.0.

@rartino
Contributor

rartino commented Jun 14, 2020

@merkys, @giovannipizzi This is marked milestone v1.0. Are we ok to push this to v1.1? I say 'yes'. Right now, if you want to redistribute data you've obtained via OPTIMADE you will have to check with the originator database for license information (their website, email, ...). It would be great to be more helpful than this, but IMO not crucial for 1.0.

@merkys
Member Author

merkys commented Jun 15, 2020

Yes, I agree with @giovannipizzi and @rartino that we can solve this after v1.0.

@merkys merkys modified the milestones: 1.0 release, 1.1 release Jun 15, 2020
@merkys merkys mentioned this issue Jan 24, 2022
@merkys
Member Author

merkys commented Jan 24, 2022

I am revisiting this issue after looking at #364, where licensing of archival data is discussed.

I would like to build on my initial proposal, accommodating @rartino's subsequent comment. Thus:

  • Base Info Endpoint could have attributes.license field providing a license for all its data;
  • Single Entry Endpoint could have attributes.license field providing a license for this particular entry, overriding the license at Base Info Endpoint;
  • If none of the above are present, then provider's website is to be consulted for license information.

As for how the license identifier is to be given, one solution would be to make attributes.license a URL (see this discussion for why I do not suggest a JSON:API links object here) pointing to the full text of the license. For licenses in the SPDX list, URLs SHOULD take the form https://spdx.org/licenses/<SPDX Identifier>, for example, https://spdx.org/licenses/CC0-1.0 for CC0-1.0. This way license names could in principle be queryable.
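
A minimal sketch of the lookup order proposed above, assuming responses have been parsed into plain dictionaries shaped like JSON:API documents. `attributes.license` is the field proposed in this comment, not an existing part of the specification, and `spdx_license_url` and `effective_license` are illustrative helper names.

```python
from typing import Optional

SPDX_BASE_URL = "https://spdx.org/licenses/"


def spdx_license_url(spdx_identifier: str) -> str:
    """Build a license URL in the form suggested in the proposal."""
    return SPDX_BASE_URL + spdx_identifier


def effective_license(base_info: dict, entry: dict) -> Optional[str]:
    """Return the license URL that applies to `entry`.

    A per-entry value overrides the database-wide one. None means neither
    is present, so the provider's website has to be consulted.
    """
    entry_license = entry.get("attributes", {}).get("license")
    if entry_license is not None:
        return entry_license
    return base_info.get("attributes", {}).get("license")


# Hypothetical responses, reduced to the fields that matter here.
base_info = {"attributes": {"license": spdx_license_url("CC0-1.0")}}
plain_entry = {"id": "1", "attributes": {}}
special_entry = {
    "id": "2",
    "attributes": {"license": spdx_license_url("CC-BY-4.0")},
}

print(effective_license(base_info, plain_entry))
print(effective_license(base_info, special_entry))
print(effective_license({"attributes": {}}, plain_entry))
```

Because the value is a plain URL string, a client can compare it directly against the SPDX URLs it is prepared to accept.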

@rartino
Contributor

rartino commented Feb 3, 2022

@merkys You may also want to include a link to your earlier PR on this (that you eventually closed): #107

To echo my comment there, I'm skeptical of "spreading out" license information - in particular licenses of individual entries, as this creates a way to trick users by, e.g., hiding a single more strictly licensed entry among billions of unrestricted ones. I strongly prefer a model where there is only one single place per database to communicate everything license-related. The licensing communicated there could still be complicated and difficult to deal with, e.g., "all bcc structures are under CC-BY, and fcc under GPL3", but as long as that is communicated in a single place, no surprises are hiding in the data.

In case someone doubts that anyone would try to use a feature like this maliciously, I'll link this interesting entry by Cory Doctorow on a similar issue that has appeared due to the stipulations in the earlier CC licenses.

@merkys
Member Author

merkys commented Feb 6, 2022

@rartino

> To echo my comment there, I'm skeptical of "spreading out" license information - in particular licenses of individual entries, as this creates a way to trick users by, e.g., hiding a single more strictly licensed entry among billions of unrestricted ones. I strongly prefer a model where there is only one single place per database to communicate everything license-related. The licensing communicated there could still be complicated and difficult to deal with, e.g., "all bcc structures are under CC-BY, and fcc under GPL3", but as long as that is communicated in a single place, no surprises are hiding in the data.

IMO, having clear machine-readable means to identify licenses in finest possible grain is meant to solve exactly the problem you are describing. Surely a top-level licensing file is great to have. However, for aggregate databases this might be difficult to achieve (take Wikimedia Commons for example which has per-file licenses). Thus my proposal has provisions for both the top-level license and per-record licenses.

A license which says "all bcc structures are under CC-BY, and fcc under GPL3" hides surprises. Suppose the software or a human misidentifies some corner cases: that provides food for copyright/copyleft trolls.

> In case someone doubts that anyone would try to use a feature like this maliciously, I'll link this interesting entry by Cory Doctorow on a similar issue that has appeared due to the stipulations in the earlier CC licenses.

Thanks for the link, really interesting. However, my take-away from this story is that we as a community need better-worded licenses.

@rartino
Contributor

rartino commented Feb 6, 2022

But, how should end-users handle the per-record licenses when fetching big data sets? Doesn't that mean that we have to spend CPU cycles and bandwidth to verify for every individual entry that the license is as expected?

I think there is no way to stop people from using dodgy licenses when publishing data (with OPTIMADE or otherwise) like "all bcc structures are under CC-BY, and fcc under GPL3". But, if there is just one place for such dodginess, I can manually check that place, accept or reject it, and act accordingly.

In my opinion, aggregate databases should export data under the strictest subset license and reference the sources for more permissive use.

@JPBergsma
Contributor

I do not think I have a strong opinion one way or the other, but if we do work with licences we should also think about how to handle attribution for each Optimade entry.

@rartino
Contributor

rartino commented Feb 10, 2022

@JPBergsma Individual attribution is indeed very important and - as far as I can see:

  • Doesn't create a potential legal minefield for users of the data.
  • Doesn't have to take up bandwidth/storage for those who do bulk analysis and do not intend to re-share the data.

However, the /references endpoint and relationships with those entries already exist for this use. Is there anything missing with what is presently possible to do?

@JPBergsma
Contributor

I guess we could use the /references endpoint. There could however be cases where the authors of an article and the authors of an entry are different. Or there may not be a publication associated with the entry. How would you do the attribution in that case?

@merkys
Member Author

merkys commented Feb 10, 2022

@rartino

> But, how should end-users handle the per-record licenses when fetching big data sets? Doesn't that mean that we have to spend CPU cycles and bandwidth to verify for every individual entry that the license is as expected?

Sure, but I do not think checking a couple of million strings is much nowadays.

> I think there is no way to stop people from using dodgy licenses when publishing data (with OPTIMADE or otherwise) like "all bcc structures are under CC-BY, and fcc under GPL3". But, if there is just one place for such dodginess, I can manually check that place, accept or reject it, and act accordingly.

Agreed, but I would like to assume good faith here. Surely someone may have a database where "all structures with prime UUIDs are under CC-BY, and proprietary otherwise", but if they provide per-entry licenses, the user will not have to rely on a prime sieve to see what they can use.

> In my opinion, aggregate databases should export data under the strictest subset license and reference the sources for more permissive use.

This is surely a safe option, but in my opinion this may drive away users from otherwise permissive data. Moreover, the wording has to be really clear to convey the relation between this encompassing strict and overriding permissive license.

@rartino
Contributor

rartino commented Feb 17, 2022

@merkys

> Sure, but I do not think checking a couple of million strings is much nowadays.

You don't see a problem with saying that the recommended practice for perfectly normal OPTIMADE use, like fetching 1M structures to use in an ML project, is to retrieve the structures with the individual license field included (which adds 1 million duplicated copies of a possibly quite long string, perhaps doubling the total size of the data) and then verify that every such string is equal?

Do you think any users of OPTIMADE will actually do this in practice?

> Agree, but I would like to assume good faith here.

This is what the Cory Doctorow link was meant to show: this is the one place where we cannot assume everyone is acting in good faith. The original formulations of the CC licenses assumed copyright holders would deal with misattributed copies in good faith - but in response, a whole business popped up trying to get people to misattribute CC-licensed works so they can be extorted/sued. My argument is that it is equally believable that we will one day see a business pop up for extorting OPTIMADE users who have accidentally broken a single odd per-entry license.

@merkys
Member Author

merkys commented Feb 17, 2022

@rartino

> You don't see a problem with saying that the recommended practice for perfectly normal OPTIMADE use, like fetching 1M structures to use in an ML project, is to retrieve the structures with the individual license field included (which adds 1 million duplicated copies of a possibly quite long string, perhaps doubling the total size of the data) and then verify that every such string is equal?

No, I do not. Having a computer check 1M strings is still cheaper than the person-time spent reading and sorting out complicated license texts (I am not advocating for software lawyers, but the most popular licenses should be easy to cite/check).

As for the size, we may define the license field as excluded from the response by default, and only included on request. By the way, my original proposal was to use per-entry licenses only if they differ from the top-level license. This is usual in software projects: there is a top-level LICENSE file which rarely mentions all embedded third-party files. A user would have to scan each file individually to make sure there are no lingering files under different licenses.
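
The scan described here could look roughly like the sketch below. It assumes entries have already been fetched as dictionaries with the license field requested explicitly, and that the proposed (not standardized) `attributes.license` field is present only when an entry deviates from the top-level license. The helper names `entry_license`, `license_counts` and `deviating_entry_ids` are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def entry_license(entry: dict, top_level_license: str) -> str:
    """Per-entry license if one is given, otherwise the top-level one."""
    return entry.get("attributes", {}).get("license") or top_level_license


def license_counts(entries: Sequence[dict], top_level_license: str) -> Counter:
    """Count how many entries fall under each license."""
    return Counter(entry_license(e, top_level_license) for e in entries)


def deviating_entry_ids(entries: Sequence[dict], top_level_license: str) -> List[str]:
    """IDs of entries whose license differs from the top-level license."""
    return [
        e["id"]
        for e in entries
        if entry_license(e, top_level_license) != top_level_license
    ]


# Hypothetical data: only entry "b" carries its own license.
TOP_LEVEL = "https://spdx.org/licenses/CC0-1.0"
entries = [
    {"id": "a", "attributes": {}},
    {"id": "b", "attributes": {"license": "https://spdx.org/licenses/CC-BY-4.0"}},
    {"id": "c", "attributes": {}},
]

print(license_counts(entries, TOP_LEVEL))
print(deviating_entry_ids(entries, TOP_LEVEL))
```

Each function makes a single pass over the entries, so the check itself is a string comparison per entry.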

> Do you think any users of OPTIMADE will actually do this in practice?

In practice, anyone is free to ignore any license. But I would not recommend doing that.

> > Agree, but I would like to assume good faith here.
>
> This is what the Cory Doctorow link was meant to show: this is the one place where we cannot assume everyone is acting in good faith. The original formulations of the CC licenses assumed copyright holders would deal with misattributed copies in good faith - but in response, a whole business popped up trying to get people to misattribute CC-licensed works so they can be extorted/sued. My argument is that it is equally believable that we will one day see a business pop up for extorting OPTIMADE users who have accidentally broken a single odd per-entry license.

To me, Cory Doctorow's story tells that that particular CC license was a buggy one. Extortion businesses piggybacking on OPTIMADE may arise regardless of whether we add licenses to OPTIMADE responses or not.

I believe some (large?) part of OPTIMADE users do not know licenses of individual databases. "Open" does not imply "free", and this is a great opportunity for the cited extortion businesses. A lack of license must not be understood as equivalent to public domain/CC0 as well. By having standardized means to display licenses along OPTIMADE data we would raise the awareness in both users and providers.

@merkys
Member Author

merkys commented May 31, 2022

During an in-person discussion with @rartino and @ml-evs I became convinced that entries of more restrictive licenses than the main body of data of an implementation belong to a different "sibling" OPTIMADE implementation.

There was also a suggestion to add a binary property is_compatible_with_cc_by_4_0 to indicate the data could be aggregated.
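
As a rough illustration of how an aggregator might use such a flag: the sketch below assumes the property sits in each provider's base info attributes, which the comment does not specify, and treats a missing flag as "not compatible". The provider names and the `aggregatable_providers` helper are hypothetical.

```python
from typing import Dict, List


def aggregatable_providers(base_infos: Dict[str, dict]) -> List[str]:
    """Names of providers that explicitly declare CC-BY 4.0 compatibility."""
    compatible = []
    for name, info in base_infos.items():
        flag = info.get("attributes", {}).get("is_compatible_with_cc_by_4_0")
        # Only an explicit True counts; a missing or false flag is skipped.
        if flag is True:
            compatible.append(name)
    return sorted(compatible)


# Hypothetical base info documents keyed by provider name.
base_infos = {
    "provider_a": {"attributes": {"is_compatible_with_cc_by_4_0": True}},
    "provider_b": {"attributes": {"is_compatible_with_cc_by_4_0": False}},
    "provider_c": {"attributes": {}},
}

print(aggregatable_providers(base_infos))
```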

@merkys
Member Author

merkys commented Jun 1, 2022

@ml-evs has pointed out that there might be a need to indicate file licenses in /files endpoint (#360) once we have it. For now I would say that all files in an implementation should as well be covered by the same root license. Should there be exceptions, either the root license has to spell them out, or these exceptions should belong to different "sibling" implementations.

@rartino
Contributor

rartino commented Jun 7, 2022

@blokhin posted this relevant addition to this discussion in #414

> For the case of the MPDS, we might have multiple licenses for the different sections of our data: CC BY 4.0, commercial/proprietary, per-vendor custom license, etc. We currently use per-entry custom field _mpds_data_license, which I would strongly recommend as a standard data_license / entry_license field.

and @merkys responded with:

> > For the case of the MPDS, we might have multiple licenses for the different sections of our data: CC BY 4.0, commercial/proprietary, per-vendor custom license, etc. We currently use per-entry custom field _mpds_data_license, which I would strongly recommend as a standard data_license / entry_license field.
>
> This is a scenario quite closely fitting my reasoning which I have presented in #102 in my discussions with @rartino. If MPDS uses per-entry custom field, why not promote it to the standard? But this probably could be introduced in a follow-up PR in order not to block the current one.
>
> According to this PR, MPDS licensing situation could be solved in two ways:
>
>   1. Separate databases (one per license);
>   2. Specify all licenses and their governed domains in the top-level license file.
>
> @blokhin Is any of these solutions suitable for MPDS? If so, maybe per-entry licensing could wait for the follow-up PR?

@blokhin
Member

blokhin commented Jun 7, 2022

Separating the database into several sections according to a license is not really the best option for the MPDS (losing the holistic view). I’d rather support (2.) Specify all licenses and their governed domains in the top-level license file, but this still unfortunately remains ambiguous and not really useful for the consumer. I can create an additional PR for per-entry licensing as an extension of this thread as well as #414.

@JPBergsma
Contributor

I agree with Evgeny here. For databases that obtain their data from multiple sources, it should be possible to set a per entry licence field. It could be just a key that refers to a licence defined at a higher level.

@ml-evs
Member

ml-evs commented Jun 7, 2022

> I agree with Evgeny here. For databases that obtain their data from multiple sources, it should be possible to set a per entry licence field. It could be just a key that refers to a licence defined at a higher level.

I think we somewhat touched on this with our solution @merkys /@rartino , that the overall database license can be complicated (i.e. describing subsets under different licences with a full-text description), which cannot be excluded from any OPTIMADE meta response.

I would not be against also having per entry licenses (to cover the use case of @blokhin) provided this overarching license already describes the caveats (and has a field for cc-by compatibility as discussed above).

@rartino
Contributor

rartino commented Jun 7, 2022

@blokhin @JPBergsma
In line with what @ml-evs says, the main problem for me is that I do not want a design where a database can trick clients by not declaring its split-licensing up front, and can instead "gotcha" clients by unexpectedly having some entries indicated as licensed differently.

Are not the relevant use cases covered by implementing PR #414 with a database-wide license specification, and having databases with per-entry split licensing describe this license setup at that link? That also means they get a clear place to explain the terms for a database-specific licensing field such as _mpds_data_license. Still, I prefer such license fields to remain database-specific, since a common standard field for this would make it easier for a database to ambiguously communicate license info only via that standard per-entry license field.

@blokhin
Member

blokhin commented Jun 7, 2022

Let's put that the data provider MAY use a per-entry data_license field, and if it does, that field contents MUST belong to the ENUM declared in the obligatory top-level database-wide license specification.
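
A minimal sketch of that rule, assuming the top-level specification has already been reduced to a set of allowed identifiers and entries are plain dictionaries. The field name `data_license` comes from the comment; `undeclared_licenses` and the sample values are illustrative.

```python
from typing import Dict, Sequence, Set


def undeclared_licenses(entries: Sequence[dict], declared: Set[str]) -> Dict[str, str]:
    """Map entry ID to any data_license value missing from the declared set.

    Entries without a data_license field are accepted, since the field
    is optional (MAY) in this proposal.
    """
    problems = {}
    for entry in entries:
        value = entry.get("attributes", {}).get("data_license")
        if value is not None and value not in declared:
            problems[entry["id"]] = value
    return problems


# Hypothetical enum declared at the top level, and three sample entries.
DECLARED = {"CC-BY-4.0", "proprietary"}
entries = [
    {"id": "1", "attributes": {"data_license": "CC-BY-4.0"}},
    {"id": "2", "attributes": {}},
    {"id": "3", "attributes": {"data_license": "vendor-custom"}},
]

print(undeclared_licenses(entries, DECLARED))
```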

@blokhin
Member

blokhin commented Jun 7, 2022

We might even add an additional validation procedure taking 10 random entries from a provider and checking if their license is the same as declared in the top-level introspection.
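
Such a spot check might be sketched as follows. This is not how any existing validator works; it assumes entries are already available as a list of dictionaries with a `data_license` attribute, and `spot_check_licenses` is a hypothetical name. The optional seed only makes the sample reproducible.

```python
import random
from typing import List, Optional, Sequence, Set


def spot_check_licenses(
    entries: Sequence[dict],
    declared: Set[str],
    sample_size: int = 10,
    seed: Optional[int] = None,
) -> List[str]:
    """Sample entries at random and return IDs with undeclared licenses."""
    rng = random.Random(seed)
    sample = rng.sample(list(entries), min(sample_size, len(entries)))
    failures = []
    for entry in sample:
        value = entry.get("attributes", {}).get("data_license")
        if value is not None and value not in declared:
            failures.append(entry["id"])
    return failures


# Hypothetical provider data: one consistent set, one inconsistent set.
DECLARED = {"CC-BY-4.0"}
good = [{"id": str(i), "attributes": {"data_license": "CC-BY-4.0"}} for i in range(100)]
bad = [{"id": str(i), "attributes": {"data_license": "unknown"}} for i in range(100)]

print(spot_check_licenses(good, DECLARED, seed=0))
print(len(spot_check_licenses(bad, DECLARED, seed=0)))
```

A clean sample does not prove that the whole database is consistent; it only catches providers whose entries commonly disagree with the declaration.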

@rartino
Contributor

rartino commented Sep 2, 2022

I'm re-opening this because it was closed automatically with #414, but I think there are aspects remaining that were not completely settled in the discussions here and there.

And, to add to the discussion here - having drilled down into the question of what we are going to put in the fields added with #414 for some of our own datasets, we are going to have some datasets (so, OPTIMADE databases) where:

  • We are happy to let individual "datums" be open and free, e.g., CC-BY 4.0. We do not want to block aggregators, etc., from retransmission of these results.
  • We still want to retain the database copyright, i.e., we will write out that you cannot redistribute our whole database as a collection or a significant fraction of it.

If nothing changes (i.e. #414 remains in place as it is now) I guess we'll just put the above info as our database-wide license. I note however that there will be no way for aggregators to know that (reasonable) retransmissions of our results are fine.

@merkys
Member Author

merkys commented Jun 7, 2023

During the 2023 workshop a question surfaced about how CC-BY 4.0 requirements are supposed to be met by OPTIMADE aggregators. In particular, there is a need to formally attribute the database from which individual entries are "re-translated". A possible solution is to say that retaining the original self-links suffices (is this OK with CC-BY 4.0 terms?), but then self-links have to be either REQUIRED, or added by aggregators.
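
The "added by aggregators" option could be sketched as below. It assumes entries are JSON:API-style resource dictionaries with `type` and `id`, that a single entry is reachable at `<base URL>/<type>/<id>` on the source provider, and that an existing `links.self` should be left untouched. The base URL and the `with_self_link` helper are hypothetical, and whether this satisfies CC-BY 4.0 is the open question raised above.

```python
import copy


def with_self_link(entry: dict, provider_base_url: str) -> dict:
    """Return a copy of `entry` that is guaranteed to carry links.self.

    An existing self-link from the source database is kept as-is; a
    missing one is reconstructed from the entry's type and id.
    """
    result = copy.deepcopy(entry)
    links = result.setdefault("links", {})
    if "self" not in links:
        base = provider_base_url.rstrip("/")
        links["self"] = f"{base}/{result['type']}/{result['id']}"
    return result


# Hypothetical source provider and entries.
BASE_URL = "https://example.org/optimade/v1/"
bare = {"type": "structures", "id": "42", "attributes": {}}
linked = {
    "type": "structures",
    "id": "43",
    "links": {"self": "https://example.org/optimade/v1/structures/43"},
}

print(with_self_link(bare, BASE_URL)["links"]["self"])
print(with_self_link(linked, BASE_URL)["links"]["self"])
```

The input dictionary is copied rather than modified, so the aggregator can keep the response exactly as it was received.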

@rartino
Contributor

rartino commented Mar 22, 2024

This has now been handled by #414 and #497.

@rartino rartino closed this as completed Mar 22, 2024