Convert formats to feed alternative usage #4

Open
hugolpz opened this issue Feb 6, 2023 · 4 comments

@hugolpz
Owner

hugolpz commented Feb 6, 2023

SPARQL2JSON:

JSON:

To be reused in:

Processing

Example :

# For Lingua Libre Bot
cat LL-LanguagesRecordsData.json | jq --raw-output '.[] | [.language, .records, .languageLabel] | @tsv'
# For Operations
cat LL-LanguagesActive.json | jq --raw-output '.[] | [.language, .records, .languageLabel] | @tsv'

CSV

For CSV: convert the JSON to CSV, or download the CSV directly.
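
The JSON-to-CSV route can reuse the jq filter from the example above with `@csv` in place of `@tsv`. This is a minimal sketch, assuming the JSON is an array of objects with the `language`, `records` and `languageLabel` keys; the function name `json_to_csv` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: read a JSON array on stdin, print one CSV row per item.
# Assumes each item has .language, .records and .languageLabel keys.
json_to_csv() {
  jq --raw-output '.[] | [.language, .records, .languageLabel] | @csv'
}

# Usage (file name taken from the example above):
#   json_to_csv < LL-LanguagesActive.json > LL-LanguagesActive.csv
```

Note that `@csv` wraps string values in double quotes, so a consumer should parse the output as CSV rather than split on commas.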

@hugolpz hugolpz self-assigned this Feb 6, 2023
@hugolpz
Owner Author

hugolpz commented Feb 6, 2023

@pamputt, which file format would you prefer to replace your list of Lingua Libre languages (LiLi Qids) that feeds the bot: JSON, CSV, or TSV?

I would prefer to save both the Qid and the number of recordings, so that languages with over 50k records get their SPARQL queries split.

@pamputt

pamputt commented Feb 7, 2023

The less complicated the better, so TSV is the best (or CSV), but definitely not JSON.

@hugolpz
Owner Author

hugolpz commented Feb 7, 2023

EDIT: Ouch. The default Blazegraph API, as used by the Lingualibre.org endpoint, only supports XML and JSON. So the best option is to return JSON and use jq to format it. (I'm on it.)

JSON via sparql2data

sparql2data has built-in data validation and only saves the response if it is valid.
Given the query LL-LanguagesRecordsData.sparql, one can use Sparql2data as a module with a command such as:

bash script.sh -q ./path/to/LL-LanguagesRecordsData.sparql -s lingualibre -f json
# Output response in ./data/LL-LanguagesRecordsData.json

JSON via Lingualibre API direct call

We can borrow the core code from sparql2data and integrate it into the Lingua-libre-bot's code:

# Sparql query
query=$(cat "${sparql}")
# echo "QUERY= ${query}" | head -n 5

# CURL SPARQL query on the Lingua Libre endpoint
response=$(curl -G --data-urlencode query="${query}" "https://lingualibre.org/sparql?format=json")
echo "RESPONSE: ${response}" | head -n 20

# First cleanup
clean=$(echo "${response}" | jq '.results.bindings' | jq 'map(map_values(.value))' | sed -E "s/https?:\/\/.*\/entity\///g" )

## IF is valid response, THEN print to local file, ELSE error message.
firstline=$(echo "${clean}" | head -n 1)
if [[ ${firstline:0:1} == "[" ]]; then
    echo "${clean}" > "./data/list_languages.json"
else
    echo "Response appears invalid; it was NOT written to ./data/list_languages.json" >&2
fi
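
The first-character check above accepts any text that starts with `[`. A stricter alternative is to let jq parse the cleaned text and confirm it is a non-empty array. This is a sketch, assuming jq is available (it is already used in the cleanup step); the function name `is_valid_bindings` is hypothetical and not part of sparql2data.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if stdin is valid JSON holding a
# non-empty array. jq --exit-status returns non-zero when the filter
# yields false or null, and jq also fails when the input is not JSON.
is_valid_bindings() {
  jq --exit-status 'type == "array" and length > 0' >/dev/null 2>&1
}

# Usage, with ${clean} as built in the snippet above:
#   if echo "${clean}" | is_valid_bindings; then
#     echo "${clean}" > "./data/list_languages.json"
#   fi
```

Unlike the original check, this variant also rejects an empty result set, which may or may not be wanted.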

JSON via Github Sparql2Json

Sparql2Data is also published as a GitHub Pages site with nightly builds, so its data files can be queried like an API.

response=$(curl "https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json")
echo "${response}"

TSV from JSON via JQ

jq is a well-known tool for processing and reformatting JSON data, e.g.:

curl -G https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json | \
    jq --raw-output '.[] | .language+"   "+.records+"   "+.languageLabel'

Or

curl -G https://hugolpz.github.io/Sparql2Data/data/LL-LanguagesActive.json | \
    jq --raw-output '.[] | [.language, .records, .languageLabel] | @tsv'

Output:

...
Q25	33462	Esperanto
Q336	56756	Odia
Q307	62224	Bengali
Q298	92252	Polish
Q21	255005	French

Loop

Then you will have to load the file (with cat?) and loop over the data, which provides several values per language, such as Qid, number of records, ISO code, ..., to do what you want.

#!/bin/bash
# USAGE: bash loop.sh file.tsv
# Expects tab-separated columns: Qid, number of records, language label.

filepath="$1"
while IFS=$'\t' read -r llqid records label; do
  # Run a command with the columns as parameters
  echo "Running command with parameters: $llqid $records"
  if [[ "$records" -ge 50000 ]]; then
    : # yearly python run $llqid
  else
    : # minimal python run $llqid
  fi
done < "$filepath"
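
For languages above the threshold, one possible way to divide the query is fixed-size paging. This is an assumption, not what Lingua-Libre-Bot does: the placeholder in the loop hints at a per-year split instead. The hypothetical `chunk_offsets` helper below prints an offset and a page size for each sub-query, which could feed SPARQL `OFFSET` and `LIMIT` clauses.

```bash
#!/bin/bash
# Hypothetical helper: print one "offset<TAB>limit" line per sub-query
# needed to cover `records` rows in pages of `chunk` rows.
chunk_offsets() {
  local records="$1" chunk="${2:-50000}" offset=0
  while (( offset < records )); do
    printf '%s\t%s\n' "$offset" "$chunk"
    offset=$(( offset + chunk ))
  done
}

# Usage: 255005 records in pages of 50000 gives six sub-queries.
#   chunk_offsets 255005 50000
```

Paging with `OFFSET` and `LIMIT` only gives stable results if the query has a deterministic `ORDER BY`, so a split by year may still be the simpler choice.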

@hugolpz
Owner Author

hugolpz commented Feb 7, 2023

@pamputt, I see the way ahead.
lingua-libre/Lingua-Libre-Bot#22 has been merged, so I can move forward and refine your file into a documented bash script.

@hugolpz hugolpz changed the title Convert to to feed alternative usage Convert to feed alternative usage Feb 9, 2023
@hugolpz hugolpz changed the title Convert to feed alternative usage Convert formats to feed alternative usage Apr 28, 2023