Fixes to download articles thumbnails #1794

pavel-karatsiuba · 2023-02-23T21:10:12Z

You can see random article thumbnails on the main page of ZIM with the top 1 million Wikipedia pages. But these articles should be sorted in order from the most popular.

The problem in the process of retrieving article details. In most cases, mwoffliner cant get articles details from API with one request. Wikimedia API provides a mechanism with which we need to send multiple requests to retrieve all articles details.

Mwoffliner takes thumbnails only for articles that are retrieved with one request. But for articles that retrieve with multiple requests, the mwoffliner does not take thumbnails.
The mwoffliner has a piece of code that allows us to retrieve thumbnails of the articles with multiple requests, but this code was not used for a while.
With this PR I am turning on a mechanism to take thumbnails for articles that are retrieved with multiple requests.
These changes can't be affected much by the performance because it does not generate additional API requests.

fixes: #1786

codecov · 2023-02-23T21:22:42Z

Codecov Report

Patch coverage: 50.00% and project coverage change: +0.40 🎉

Comparison is base (ad56c6b) 70.42% compared to head (d5f7a24) 70.83%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1794      +/-   ##
==========================================
+ Coverage   70.42%   70.83%   +0.40%     
==========================================
  Files          23       23              
  Lines        2597     2582      -15     
  Branches      594      591       -3     
==========================================
  Hits         1829     1829              
+ Misses        661      648      -13     
+ Partials      107      105       -2

Impacted Files	Coverage Δ
src/Downloader.ts	`62.78% <33.33%> (+2.90%)`	⬆️
src/mwoffliner.lib.ts	`72.18% <100.00%> (ø)`
src/util/misc.ts	`72.35% <0.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

kelson42

Always put a PR description with:

explanation about the problem (no code details)
explanation about solution (no code details)
impacts (for users, devs)

kelson42

The function handles many different details. The thumbnail is only one of them. I wonder why we need to handle the thumbnail in a special manner?!

pavel-karatsiuba · 2023-02-26T18:50:44Z

In 2019 was added code with filtering of the details field in the response. After filtering the details contains only 'subCategories', 'revisions', and 'redirects' fields and in some cases can be also 'thumbnail' and others.
But then the code was changed. And the 'thumbnail' option was removed from the response.

I returned the 'thumbnail' option to the details response. I made fixes with minimum impact to not touch other functionality.

Maybe right now we also have a problem with other fields which were removed from the response: 'coordinates' and 'categories'.

kelson42 · 2023-03-01T11:05:11Z

@pavel-karatsiuba Thank you for the explanation. Actually this is not the first time we have a problem around this (not all options been "continued"). This is a serious problem because the impact is really subtil and we don't see it immediatly. Please work on this PR so all details requested are automatically continued and that we won't create a new problem by adding or removing or renaming one of this detail.

I'm also concerned by the memory usage of the retrieving of these image URL (representing an article) because:

We download it for all articles (even if we need less than 100 of them on the welcome page)
I suspect we download them even if we don't need them for a "tiled welcome page" (typically with a selection).

Therefore I believe we should isolate the retrieval of these image URL in a dedicated method (working similarly than the one you have modified) but which would:

Be called only if we need a tiled welcome page
Be run only to retrieve 100 pictures... and not on all articles

pavel-karatsiuba · 2023-03-02T18:04:14Z

@kelson42
This PR is not so impacting memory usage or script execution time. With this PR will be saved additional text fields to the details array.
Thumbnails are downloaded later, only in case we need a tiled home page. The script downloads not more than 100 thumbnails.

kelson42 · 2023-03-03T12:15:33Z

@kelson42 This PR is not so impacting memory usage or script execution time.

I didn't say this is a problem with this PR, I said there is a concern about the memory usage regarding how we proceed. I don't have talked anytime about execution time/speed. Think about it like this. on WPEN, currently, we are retrieving a tile image URL for almost 7.000.000 articles (potentially), but in best case we use only 100 of them.... and actually most probably none of them! One URL is 100 bytes in average, so make 80x7.000.000 bytes=560MB. This is my concern.

With this PR will be saved additional text fields to the details array.

Yes,

Thumbnails are downloaded later, only in case we need a tiled home page.

I don't have written about Thumbnails, I have written about "image URL".

The script downloads not more than 100 thumbnails.

That was clear.

Please reconsider the problem and the solution "Therefore I believe we should isolate the retrieval of these image URL in a dedicated method"

pavel-karatsiuba · 2023-03-06T20:20:45Z

URLs for images which you can see on the welcome page are built from thumbnails URL. So we don't have additional data for images in details object. Also, we have strict restrictions about retrieving thumbnail data with details. We are requesting thumbnail data only for 100 articles. So in response, we will get additional fields with thumbnail data only for 100 articles.
So make 80x100 bytes=8Kb.
This is not really true because we are executing parallel requests and for my 8core processor it is 8 parallel requests. And we are getting details for 50 articles by one request. This is 88050=32Kb additional data.
Of course, I can fix this: I can execute a request for the first 100 articles, wait for it, and then continue executing. But this exception will look not really good and will save only some Kb of traffic.

I think this PR is fixing the bug from the ticket. I propose to create a new ticket if we need to make something else.

… retrieves data from API with continuation.

pavel-karatsiuba requested a review from kelson42 February 23, 2023 21:20

kelson42 requested changes Feb 24, 2023

View reviewed changes

kelson42 reviewed Feb 25, 2023

View reviewed changes

pavel-karatsiuba requested a review from kelson42 February 26, 2023 18:51

kelson42 force-pushed the downloading-articles-thumbnails branch from 5d5ec92 to 03c29dc Compare March 1, 2023 10:57

kelson42 approved these changes Mar 6, 2023

View reviewed changes

pavel-karatsiuba added 2 commits March 6, 2023 22:13

Fixes to download articles thumbnails

da425a2

Do not remove any properties from the details object when the request…

f70e9b1

… retrieves data from API with continuation.

kelson42 force-pushed the downloading-articles-thumbnails branch from d5f7a24 to f70e9b1 Compare March 6, 2023 21:13

kelson42 merged commit 5470f6a into main Mar 6, 2023

kelson42 deleted the downloading-articles-thumbnails branch March 6, 2023 21:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes to download articles thumbnails #1794

Fixes to download articles thumbnails #1794

pavel-karatsiuba commented Feb 23, 2023 •

edited

Loading

codecov bot commented Feb 23, 2023 •

edited

Loading

kelson42 left a comment

kelson42 left a comment

pavel-karatsiuba commented Feb 26, 2023

kelson42 commented Mar 1, 2023

pavel-karatsiuba commented Mar 2, 2023

kelson42 commented Mar 3, 2023

pavel-karatsiuba commented Mar 6, 2023

Fixes to download articles thumbnails #1794

Fixes to download articles thumbnails #1794

Conversation

pavel-karatsiuba commented Feb 23, 2023 • edited Loading

codecov bot commented Feb 23, 2023 • edited Loading

Codecov Report

kelson42 left a comment

Choose a reason for hiding this comment

kelson42 left a comment

Choose a reason for hiding this comment

pavel-karatsiuba commented Feb 26, 2023

kelson42 commented Mar 1, 2023

pavel-karatsiuba commented Mar 2, 2023

kelson42 commented Mar 3, 2023

pavel-karatsiuba commented Mar 6, 2023

pavel-karatsiuba commented Feb 23, 2023 •

edited

Loading

codecov bot commented Feb 23, 2023 •

edited

Loading