Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to download articles thumbnails #1794

Merged
merged 2 commits into from
Mar 6, 2023
Merged

Conversation

pavel-karatsiuba
Copy link
Contributor

@pavel-karatsiuba pavel-karatsiuba commented Feb 23, 2023

You can see random article thumbnails on the main page of ZIM with the top 1 million Wikipedia pages. But these articles should be sorted in order from the most popular.

The problem in the process of retrieving article details. In most cases, mwoffliner cant get articles details from API with one request. Wikimedia API provides a mechanism with which we need to send multiple requests to retrieve all articles details.

Mwoffliner takes thumbnails only for articles that are retrieved with one request. But for articles that retrieve with multiple requests, the mwoffliner does not take thumbnails.
The mwoffliner has a piece of code that allows us to retrieve thumbnails of the articles with multiple requests, but this code was not used for a while.
With this PR I am turning on a mechanism to take thumbnails for articles that are retrieved with multiple requests.
These changes can't be affected much by the performance because it does not generate additional API requests.

fixes: #1786

@codecov
Copy link

codecov bot commented Feb 23, 2023

Codecov Report

Patch coverage: 50.00% and project coverage change: +0.40 🎉

Comparison is base (ad56c6b) 70.42% compared to head (d5f7a24) 70.83%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1794      +/-   ##
==========================================
+ Coverage   70.42%   70.83%   +0.40%     
==========================================
  Files          23       23              
  Lines        2597     2582      -15     
  Branches      594      591       -3     
==========================================
  Hits         1829     1829              
+ Misses        661      648      -13     
+ Partials      107      105       -2     
Impacted Files Coverage Δ
src/Downloader.ts 62.78% <33.33%> (+2.90%) ⬆️
src/mwoffliner.lib.ts 72.18% <100.00%> (ø)
src/util/misc.ts 72.35% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always put a PR description with:

  • explanation about the problem (no code details)
  • explanation about solution (no code details)
  • impacts (for users, devs)

Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function handles many different details. The thumbnail is only one of them. I wonder why we need to handle the thumbnail in a special manner?!

@pavel-karatsiuba
Copy link
Contributor Author

In 2019 was added code with filtering of the details field in the response. After filtering the details contains only 'subCategories', 'revisions', and 'redirects' fields and in some cases can be also 'thumbnail' and others.
But then the code was changed. And the 'thumbnail' option was removed from the response.

I returned the 'thumbnail' option to the details response. I made fixes with minimum impact to not touch other functionality.

Maybe right now we also have a problem with other fields which were removed from the response: 'coordinates' and 'categories'.

@kelson42
Copy link
Collaborator

kelson42 commented Mar 1, 2023

@pavel-karatsiuba Thank you for the explanation. Actually this is not the first time we have a problem around this (not all options been "continued"). This is a serious problem because the impact is really subtil and we don't see it immediatly. Please work on this PR so all details requested are automatically continued and that we won't create a new problem by adding or removing or renaming one of this detail.

I'm also concerned by the memory usage of the retrieving of these image URL (representing an article) because:

  • We download it for all articles (even if we need less than 100 of them on the welcome page)
  • I suspect we download them even if we don't need them for a "tiled welcome page" (typically with a selection).

Therefore I believe we should isolate the retrieval of these image URL in a dedicated method (working similarly than the one you have modified) but which would:

  • Be called only if we need a tiled welcome page
  • Be run only to retrieve 100 pictures... and not on all articles

@pavel-karatsiuba
Copy link
Contributor Author

@kelson42
This PR is not so impacting memory usage or script execution time. With this PR will be saved additional text fields to the details array.
Thumbnails are downloaded later, only in case we need a tiled home page. The script downloads not more than 100 thumbnails.

@kelson42
Copy link
Collaborator

kelson42 commented Mar 3, 2023

@kelson42 This PR is not so impacting memory usage or script execution time.

I didn't say this is a problem with this PR, I said there is a concern about the memory usage regarding how we proceed. I don't have talked anytime about execution time/speed. Think about it like this. on WPEN, currently, we are retrieving a tile image URL for almost 7.000.000 articles (potentially), but in best case we use only 100 of them.... and actually most probably none of them! One URL is 100 bytes in average, so make 80x7.000.000 bytes=560MB. This is my concern.

With this PR will be saved additional text fields to the details array.

Yes,

Thumbnails are downloaded later, only in case we need a tiled home page.

I don't have written about Thumbnails, I have written about "image URL".

The script downloads not more than 100 thumbnails.

That was clear.

Please reconsider the problem and the solution "Therefore I believe we should isolate the retrieval of these image URL in a dedicated method"

@pavel-karatsiuba
Copy link
Contributor Author

URLs for images which you can see on the welcome page are built from thumbnails URL. So we don't have additional data for images in details object. Also, we have strict restrictions about retrieving thumbnail data with details. We are requesting thumbnail data only for 100 articles. So in response, we will get additional fields with thumbnail data only for 100 articles.
So make 80x100 bytes=8Kb.
This is not really true because we are executing parallel requests and for my 8core processor it is 8 parallel requests. And we are getting details for 50 articles by one request. This is 88050=32Kb additional data.
Of course, I can fix this: I can execute a request for the first 100 articles, wait for it, and then continue executing. But this exception will look not really good and will save only some Kb of traffic.

I think this PR is fixing the bug from the ticket. I propose to create a new ticket if we need to make something else.

@kelson42 kelson42 force-pushed the downloading-articles-thumbnails branch from d5f7a24 to f70e9b1 Compare March 6, 2023 21:13
@kelson42 kelson42 merged commit 5470f6a into main Mar 6, 2023
@kelson42 kelson42 deleted the downloading-articles-thumbnails branch March 6, 2023 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automated landing page does not pick obvious winners
2 participants