Work on translations to substitute into the quick reference guide #242

Open
baskaufs opened this issue Oct 25, 2019 · 14 comments
Labels
Docs - Quick Reference Guide Maintenance Issues related to the management of standards documents.

Comments

@baskaufs

We should create tables for the basic DwC metadata in various languages that could be substituted into the Quick Reference Guide. It would probably require some changes to the build script, or a different build script.

@kcopas

kcopas commented Oct 26, 2019

Might take some digging to find them, but we do have some earlier translations in the files at the @gbif secretariat. Plus all the terms included in the eight languages now on GBIF.org, if it’s helpful.

Sent with GitHawk

@baskaufs
Author

Yes, I have a copy of translations somewhere, but I think they are outdated. They would need to be checked against the current term list, and their definitions checked to make sure they are up to date. But I think first I need to work out how to get the DwC infrastructure to make use of them. I'll make a more serious effort to track down existing work after that. Thanks!!

@baskaufs
Author

baskaufs commented Mar 5, 2020

OK, the table I have is here. I don't remember where they came from. Definitions are in en, es, zh_hans, and ja. Labels are in the same languages except for ja.

@MattBlissett
Member

> OK, the table I have is here. I don't remember where they came from. Definitions are in en, es, zh_hans, and ja. Labels are in the same languages except for ja.

I think that might have come from http://rs.gbif.org/terms/dwc/dwc_translations.rdf .

The GBIF IPT would benefit from translations. Should we investigate setting up something to generate these translations (terms + definitions) as part of our translation system? (i.e. Crowdin). If it's given some XML with English attribute values, it would produce similar XML with translated attribute values (one file per language).

@tucotuco
Member

I think that would be awesome.

@baskaufs
Author

@MattBlissett Two questions:

  1. We have scheduled a workshop at the TDWG conference to get volunteers to work on translations of controlled vocabularies. Would it be possible to run the existing labels and definitions through your system and have the translators correct them rather than having to start from scratch?
  2. Are we wedded to XML? When representing multilingual labels and definitions, I have been trying to use a consistent JSON-LD format as demonstrated here: https://heardlibrary.github.io/digital-scholarship/lod/json_ld_test/establishmentMeans.json This form allows the data to be both born Linked Data-ready (it ingests into a triple store as RDF consistent with the Standards Documentation Specification) and easily read by JavaScript and scripting languages like Python (for example, it was used to generate this: https://heardlibrary.github.io/digital-scholarship/lod/json_ld_test/display-cv.html?nl). As you know, reading XML isn't impossible, but I've found it to be a lot more clunky than JSON.
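
To illustrate the point about JSON-LD being easy to read from a script, here is a minimal Python sketch. The embedded sample document, its property names (`rdfs:label`, `skos:definition`), the example.org IRI and the helper names `pick_language` and `labels_for` are all assumptions made for illustration; the real establishmentMeans.json may be structured differently.

```python
import json

# Hypothetical JSON-LD fragment with language-tagged values.
# This is NOT copied from establishmentMeans.json; the real layout may differ.
SAMPLE = """
{
  "@context": {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#"
  },
  "@graph": [
    {
      "@id": "http://example.org/vocab/native",
      "rdfs:label": [
        {"@language": "en", "@value": "native"},
        {"@language": "es", "@value": "nativo"}
      ],
      "skos:definition": [
        {"@language": "en", "@value": "An example English definition."}
      ]
    }
  ]
}
"""


def pick_language(values, lang, fallback="en"):
    """Return the @value tagged with lang, else the fallback language, else None."""
    by_lang = {v["@language"]: v["@value"] for v in values}
    return by_lang.get(lang, by_lang.get(fallback))


def labels_for(doc, lang):
    """Map each term IRI in the graph to its label in the requested language."""
    return {
        node["@id"]: pick_language(node.get("rdfs:label", []), lang)
        for node in doc["@graph"]
    }


doc = json.loads(SAMPLE)
print(labels_for(doc, "es"))
```

Falling back to English when a translation is missing is a design choice of this sketch, not something the thread specifies.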

@MattBlissett
Member

MattBlissett commented Aug 25, 2021

  1. Existing translations could be imported. About 80% of the labels exist, but not many of the definitions match — the translations are very old, and pre-date the splitting out of comments/examples, and lots of smaller changes.

  2. No; in fact CSV might be easiest. If we're to use the Crowdin system (under GBIF's account or a new one) then we'll need some scripting to structure the data in a way Crowdin can make best use of it, i.e. presenting it in a reasonable way for the translators. The benefit of the system is in tracking changes — an English definition can be changed, and translators are then prompted to check/retranslate.

Here's a rough example. I've taken the DwC term labels, definitions, comments and examples and split them into 4 separate CSV files, keyed on the short name. These are the English labels: https://github.com/gbif/crowdin-asciidoctor-testing/blob/translation_master/dwc_labels.en.csv (the columns are key, translation context and translation source string). Crowdin picks that up from GitHub, and the translators input the translations in this interface:
[Image: screenshot of the Crowdin translation interface]

(I imported the existing Spanish translations.)

Crowdin then generates this file, which has the English strings replaced by Spanish: https://github.com/gbif/crowdin-asciidoctor-testing/blob/translation_master/dwc_labels.es.csv
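
The split described above (one CSV per field, keyed on the short name, with key, context and source string columns) could be sketched roughly as follows. The input field names (`name`, `label`, `definition`, `comments`, `examples`), the placeholder term and the wording of the context column are assumptions for illustration, not the actual layout of the files linked in this comment.

```python
import csv
import io

# Fields that each get their own translation file (assumed names).
FIELDS = ("label", "definition", "comments", "examples")


def split_terms(rows, fields=FIELDS):
    """Return {field: CSV text}; each CSV row is key, context, source string."""
    outputs = {}
    for field in fields:
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        for row in rows:
            text = row.get(field, "")
            if text:  # skip terms with nothing to translate for this field
                context = f"{field} of Darwin Core term {row['name']}"
                writer.writerow([row["name"], context, text])
        outputs[field] = buf.getvalue()
    return outputs


# Placeholder term, not a real Darwin Core entry.
terms = [
    {
        "name": "exampleTerm",
        "label": "Example Term",
        "definition": "An example definition.",
        "comments": "",
        "examples": "example value",
    },
]

files = split_terms(terms)
print(files["label"])
```

In a real setup each value in `files` would be written to its own file (for example one per field and language) so that Crowdin can pick it up from the repository.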

@baskaufs
Author

Oh, I hadn't picked up on the fact that Crowdin is crowd-sourcing and not AI. Then it's probably not the best for the controlled vocab translating since the idea was to have humans who were content experts do the translating.

CSV is awesome. That's what I've been using to generate the JSON-LD anyway.

What I've been drawing on to generate the JSON is something like this: https://github.com/tdwg/rs.tdwg.org/blob/master/dwc-translations/dwcTranslations.csv

Aside from the first three columns, the rest of the columns are all just label/definition pairs for each language, with column mappings to properties and language attributes here: https://github.com/tdwg/rs.tdwg.org/blob/master/dwc-translations/dwcTranslations-column-mappings.csv
But separate files would also be fine and I hadn't been handling the examples and comments since they don't exist for controlled vocabularies at this time.

Anything similar that's on GitHub could be used as a common source for GBIF (or anyone else) as well as by the DwC team to generate translations of the various term lists/guides.
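
As a rough illustration of the wide-table approach described above (label/definition column pairs per language, plus a separate column-mapping table), here is a Python sketch that turns such a table into language-tagged values. The column names (`label_en`, `definition_es`, ...), the mapping layout (`column`, `property`, `language`) and the sample row are hypothetical; the real dwcTranslations.csv and dwcTranslations-column-mappings.csv may use different headers.

```python
import csv
import io

# Hypothetical wide table: one row per term, one column per (field, language).
WIDE = """term,label_en,definition_en,label_es,definition_es
native,native,An example English definition.,nativo,
"""

# Hypothetical mapping of each column to a property and a language tag.
MAPPING = """column,property,language
label_en,rdfs:label,en
definition_en,skos:definition,en
label_es,rdfs:label,es
definition_es,skos:definition,es
"""


def to_language_tagged(wide_csv, mapping_csv, key_column="term"):
    """Return {term: {property: [{'@language': ..., '@value': ...}, ...]}}."""
    mapping = list(csv.DictReader(io.StringIO(mapping_csv)))
    result = {}
    for row in csv.DictReader(io.StringIO(wide_csv)):
        node = {}
        for m in mapping:
            value = row.get(m["column"], "")
            if value:  # empty cells mean "not translated yet"
                node.setdefault(m["property"], []).append(
                    {"@language": m["language"], "@value": value}
                )
        result[row[key_column]] = node
    return result


tagged = to_language_tagged(WIDE, MAPPING)
print(tagged["native"]["rdfs:label"])
```

Keeping the mapping in its own table means a new language only needs new columns and mapping rows, with no change to the script.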

@MattBlissett
Member

Just quickly: we can easily choose the crowd for Crowdin, and allocate only experts to the project.

I'll absorb the rest of what you wrote tomorrow.

@MattBlissett
Member

The main advantage of Crowdin (or a similar tool, like Weblate) is — with some scripts — it automates the process of distributing files to translators, and integrating the resulting translations.

Translators would be chosen however we'd like, e.g. invited experts, paid translators, volunteers. Translators can discuss or comment on any translation string. New or changed translations are fed back into Git as a pull request from Crowdin.

> CSV is awesome. That's what I've been using to generate the JSON-LD anyway.
>
> What I've been drawing on to generate the JSON is something like this: https://github.com/tdwg/rs.tdwg.org/blob/master/dwc-translations/dwcTranslations.csv

This is also what I used to import the Spanish translations in the demo. I used https://github.com/tdwg/rs.tdwg.org/blob/master/terms/terms.csv for the source strings.

> But separate files would also be fine and I hadn't been handling the examples and comments since they don't exist for controlled vocabularies at this time.

Crowdin supports either, but I chose separate files since I think they're easier for people to review – a changed line in one file clearly affects only a single language.

MattBlissett added a commit to tdwg/rs.tdwg.org that referenced this issue Oct 29, 2021
@MattBlissett
Member

Hi @tucotuco, @baskaufs, @debpaul, @pzermoglio

I've improved the Crowdin integration for translation of Darwin Core term labels, definitions etc.

I've enabled it only for Darwin Core terms, and although the examples, comments etc are also available for translation, so far only the labels and definitions are used -- the same as Steve did for establishmentMeans, degreeOfEstablishment etc in the conference workshop last year.

The process runs on GBIF's Jenkins server. It runs whenever prompted by a change on GitHub, and after a ~10-30 minute delay from a change on Crowdin. It

  1. finds any changes in terms/terms.csv, and sends these changes to Crowdin
  2. takes any new or updated translations from Crowdin and adds them to terms/terms-translations.csv.
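
The two steps above can be sketched in Python as pure functions over in-memory data. This is only an illustration under stated assumptions: it does not call Crowdin or Jenkins, and the function names, the dictionary shapes and the sample strings are hypothetical rather than taken from the actual job.

```python
def changed_source_strings(previous, current):
    """Step 1: keys whose English source string is new or differs from the last run."""
    return {key: text for key, text in current.items() if previous.get(key) != text}


def merge_translations(existing, incoming):
    """Step 2: return a copy of existing with new or updated translations applied."""
    merged = {key: dict(langs) for key, langs in existing.items()}
    for key, langs in incoming.items():
        merged.setdefault(key, {}).update(langs)
    return merged


# Hypothetical English source strings from two successive runs.
previous = {"termA": "Old definition.", "termB": "Unchanged definition."}
current = {
    "termA": "New definition.",
    "termB": "Unchanged definition.",
    "termC": "Added definition.",
}

# Hypothetical translations: what is already stored, and what translators produced.
existing = {"termB": {"es": "Definición sin cambios."}}
incoming = {"termB": {"fr": "Définition inchangée."}, "termC": {"es": "Definición añadida."}}

to_send = changed_source_strings(previous, current)
merged = merge_translations(existing, incoming)
print(sorted(to_send))
print(merged)
```

Because `merge_translations` works on a copy, the stored translations are left untouched until the merged result is written back.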

Steve's script generates files like establishmentMeans.json from establishmentMeans/establishmentMeans-translations.csv, and I'm using exactly the same structure for terms/terms-translations.csv. I haven't tested if Steve's script works on terms-translations.csv, as the script is still on a branch and I've not looked at how it works.

If you go to https://crowdin.com/translate/darwin-core/58/en-fr?filter=basic&value=3 (or your preferred language) and add another translation (copy one from the old translations or the GBIF portal French translation) and wait about 10-30 minutes, you should see the update on terms/terms-translations.csv.

There are settings within Crowdin to add a review step before the new/changed translation is exported, but for simplicity it's not enabled at present.

@baskaufs
Author

This is so cool, Matt! I've put it on my todo list to try running my script with the file you generated.

The reason the script is on a branch is that I set up a branch (gh-pages) for GitHub Pages so that the generated JSON files would get served with the correct Content-Type headers. So that branch won't ever get merged. That may not be the best setup in the long run -- I didn't use the usual "docs" folder option because I already had a docs folder that I was using for something else.

@MattBlissett
Member

I have made improvements to the process, and (I think) imported all the results from the TDWG workshop a couple of years ago. I haven't tried to import any other definitions — I think it's best if translators do that themselves, one at a time, as many of the definitions have changed.

Translations done in Crowdin will end up in these CSV files, around 1-2 hours after the change is made in Crowdin:

Parts of this repository used for the website (e.g. terms.tmpl) can also be translated; this is easily set up in Crowdin. That could mean that terms.es.tmpl and so on appear.

@peterdesmet
Member

Ping @ben-norton (in case he is not subscribed to this issue).

@tucotuco tucotuco added the Maintenance Issues related to the management of standards documents. label Mar 1, 2024

5 participants