Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clearing Redis media storage after each dump #1826

Merged
merged 3 commits into from
May 23, 2023
Merged

Conversation

pavel-karatsiuba
Copy link
Contributor

Clearing Redis media storage after each dump

fixes: #1825

@codecov
Copy link

codecov bot commented Apr 4, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.49 🎉

Comparison is base (c045f2d) 70.91% compared to head (b2e901b) 71.40%.

❗ Current head b2e901b differs from pull request most recent head a1c0eb7. Consider uploading reports for the commit a1c0eb7 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1826      +/-   ##
==========================================
+ Coverage   70.91%   71.40%   +0.49%     
==========================================
  Files          24       24              
  Lines        2623     2623              
  Branches      595      594       -1     
==========================================
+ Hits         1860     1873      +13     
+ Misses        657      644      -13     
  Partials      106      106              
Impacted Files Coverage Δ
src/mwoffliner.lib.ts 72.15% <100.00%> (ø)
src/util/dump.ts 72.41% <100.00%> (ø)

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@kelson42
Copy link
Collaborator

kelson42 commented Apr 4, 2023

@pavel-karatsiuba please explain clearly the nature of the problem and the nature of the solution and its consequences. Why you need a new redis store?

@pavel-karatsiuba
Copy link
Contributor Author

All paths to files are stored in the Redis store before downloading. Then the script is reading file paths one by one from Redis and downloads them. If we use a parameter --format then we don't save to Redis some type of files. So they are filtered before being saved to the Redis.
If we don't use --format parameter or if we use one value for the parameter --format then all works well.

In the case if we use two '--format' parameters then all media files are filtered with the first '--format' parameter. And paths for filtered media will be stored in Redis storage. Before using the next '--format' parameter we are not clearing Redis storage and all media data from the first iteration will be used also for the second iteration.

For example, the page contains pdf and jpg files. We are using a script with parameters --format=nopic --format=nopdf. In the first iteration, the script will filter the jpg file and save into a Redis store only the path to pdf. Then for the next iteration script will save in the Redis store only the path to the jpg file, this is right, but in the Redis, we already have a pdf from the first iteration. So as result in the nopdf ZIM file we will see jpg and pdf files.

My fix is clearing the Redis store after each dump. And I used an additional Redis store to separate media files and other files for example CSS. So my solution removes only paths to media files but does not remove paths to CSS.

@pavel-karatsiuba
Copy link
Contributor Author

Why we I need a new Redis store:
The scraper stores all paths to files in the Redis store. And then reads paths from this store and downloads files. In this store, the script is saving paths to all media files like ogg, png, pdf, and plus CSS. This is a good solution if we use the command without the --format parameter or if we use one time --format parameter.
But if we use more than one time --format then we need to remove all paths to media files after each iteration. But we don't need to remove paths to CSS. That's why I am using one store for media files and another store for other files. After each iteration, I am clearing the store with paths to media files and did not touch the store for other files.

@kelson42
Copy link
Collaborator

@pavel-karatsiuba Thank you for your explanations, the problem is clear now to me and the solution seems appropriate. This is a major bug and thank you very much for detecting and finxing it.

One point is still unclear to me. Populating the medias happens based on the parsing of HTML, so if we remove everything, then the CSS should be repopulated?! I also don't understand why you called somethin media when it seems it was only for CSS?!

Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still few points to clarify/fix.

@pavel-karatsiuba
Copy link
Contributor Author

For storing CSS I used storage with the "Files" prefix. For media files, I used storage with the "Media" prefix.
I also found that subtitles also should be stored with media files because the treatment of the media files also takes effect on subtitles. I made a commit with fixes.

@kelson42
Copy link
Collaborator

kelson42 commented Apr 12, 2023

@pavel-karatsiuba You don’t have answered to one of my questions.

how article subtitles do impact media scraping? Actually what are subtitles? What about the js dependencies?

I don’t really see why we should have a new redis storage, because either it has to be refreshed (which is the case of css/js IMO) for each article (for each —format) or it does not have to and then it should belong the details redis store.

@pavel-karatsiuba
Copy link
Contributor Author

The first thing I made is clearing the Redis store which contains media files and CSS after finishing the current iteration. On the next iteration media files were repopulated but not CSS. In the source code, I see a condition that is not allowing to do this: https://github.com/openzim/mwoffliner/blob/main/src/util/dump.ts#L61
That's why I made another storage for Media files and I am clearing it after each iteration independent from Files storage.

Subtitles are files in .vtt format. They are downloading near media files and can be skipped by the --format=novid parameter.

@pavel-karatsiuba
Copy link
Contributor Author

I understand what you mean and changed the logic. Right now storage with media files and CSS will be cleared after each iteration.

@kelson42 kelson42 changed the title To download media files used additional redis storage. Clearing Redis media storage after each dump Apr 13, 2023
@kelson42
Copy link
Collaborator

Subtitles are files in .vtt format. They are downloading near media files and can be skipped by the --format=novid parameter.

Now I remember, subtitles need clearly to be removed after each dump indeed.

src/util/dump.ts Outdated Show resolved Hide resolved
src/util/dump.ts Outdated Show resolved Hide resolved
Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments

@kelson42 kelson42 force-pushed the format-parameter-fixes branch 2 times, most recently from bfec274 to 082c542 Compare May 23, 2023 15:43
Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kelson42 kelson42 merged commit f177d2f into main May 23, 2023
@kelson42 kelson42 deleted the format-parameter-fixes branch May 23, 2023 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

--format parameter is not working properly
2 participants