Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix main page name for wikis with base URL rewritten #2025

Merged
merged 2 commits into from
Jul 14, 2024

Conversation

TripleCamera
Copy link
Contributor

@TripleCamera TripleCamera commented May 11, 2024

This is a very small change that has been tested on my computer, so I'll just use the web editor.

Fixes #1995.

@kelson42
Copy link
Collaborator

@TripleCamera Tahnk your for your PR and your patience, we will finally review your PR

@audiodude
Copy link
Member

Hi @TripleCamera, thanks for the PR. Can you look into the failing tests? Also, please add additional tests for the code you changed, or otherwise verify that the code is covered by existing tests.

Thanks!

@TripleCamera
Copy link
Contributor Author

Hi @TripleCamera, thanks for the PR. Can you look into the failing tests? Also, please add additional tests for the code you changed, or otherwise verify that the code is covered by existing tests.

Thanks!

OK. I am running the tests on my computer.

First, I need to run the saveArticles test on git main (npm test -- test/unit/saveArticles.test.ts). Unfortunately, it fails. I am investigating about this.

 FAIL  test/unit/saveArticles.test.ts (238.36 s)
  ● saveArticles › --customFlavour using VisualEditor renderer

    thrown: "Exceeded timeout of 40000 ms for a test.
    Add a timeout value to this test to increase the timeout, if this is a long-running test. See https://jestjs.io/docs/api#testname-fn-timeout."

      151 |     })
      152 |
    > 153 |     test(`--customFlavour using ${renderer} renderer`, async () => {
          |     ^
      154 |       const { downloader, dump } = await setupScrapeClasses({ format: 'nopic' }) // en wikipedia
      155 |       downloader.setUrlsDirectors(rendererInstance, rendererInstance)
      156 |       class CustomFlavour implements CustomProcessor {

      at test/unit/saveArticles.test.ts:153:5
      at test/unit/saveArticles.test.ts:18:1

After fixing this, I will test my patch.

@XeroAlpha
Copy link

Hi @TripleCamera, thanks for the PR. Can you look into the failing tests? Also, please add additional tests for the code you changed, or otherwise verify that the code is covered by existing tests.
Thanks!

OK. I am running the tests on my computer.

First, I need to run the saveArticles test on git main (npm test -- test/unit/saveArticles.test.ts). Unfortunately, it fails. I am investigating about this.

 FAIL  test/unit/saveArticles.test.ts (238.36 s)
  ● saveArticles › --customFlavour using VisualEditor renderer

    thrown: "Exceeded timeout of 40000 ms for a test.
    Add a timeout value to this test to increase the timeout, if this is a long-running test. See https://jestjs.io/docs/api#testname-fn-timeout."

      151 |     })
      152 |
    > 153 |     test(`--customFlavour using ${renderer} renderer`, async () => {
          |     ^
      154 |       const { downloader, dump } = await setupScrapeClasses({ format: 'nopic' }) // en wikipedia
      155 |       downloader.setUrlsDirectors(rendererInstance, rendererInstance)
      156 |       class CustomFlavour implements CustomProcessor {

      at test/unit/saveArticles.test.ts:153:5
      at test/unit/saveArticles.test.ts:18:1

After fixing this, I will test my patch.

It seems that you are in an area that network is restricted. You may need to use proxy.

@XeroAlpha
Copy link

By the way, this pr is just a revert for 687b97ee.

I think it is necessary to check this commit before merging this pr.

@TripleCamera
Copy link
Contributor Author

It seems that you are in an area that network is restricted. You may need to use proxy.

Yes, I fixed this.

Besides, thanks @XeroAlpha: The command of running a single test file should be npm run test:pattern 'test/unit/saveArticles.test.ts'.

Now I can pass the test without the patch, and fail the test with the patch.

I printed out the value of mainPage:

  • Before: decodeURIComponent(entries.base.split('/').pop())Main_Page
  • After: entries.mainpageMain Page

So I should submit another patch that replaces spaces with underscores in mainPage.

By the way, this pr is just a revert for 687b97ee.

I think it is necessary to check this commit before merging this pr.

Indeed. 687b97e belongs to #1197, which "should resolve #1180". I am getting confused:

  1. Why is spaceDelimiter a variable? In MediaWiki, page names in URLs always have spaces replaced by underscores. I have read the issue description but still couldn't understand.
  2. In Removing concept of spaceDelimiter from MWOffliner #1197, why is the step of "replacing spaces with underscores" removed?
  3. Originally, the value of mainPage was entries.mainpage.replace(/ /g, self.spaceDelimiter). In an intermediate commit, this was changed to entries.mainpage.replace(/ /g, '_'). However, in the final commit, this was changed again to decodeURIComponent(entries.base.split('/').pop()). Why? The intermediate version is good enough.

@kelson42 Do you know the reason?

@audiodude
Copy link
Member

  1. Why is spaceDelimiter a variable? In MediaWiki, page names in URLs always have spaces replaced by underscores. I have read the issue description but still couldn't understand.

Not sure what you mean here. I don't see any references to spaceDelimiter at HEAD.

  1. In Removing concept of spaceDelimiter from MWOffliner #1197, why is the step of "replacing spaces with underscores" removed?

What do you mean by "step"? I believe it was successfully removed.

  1. Originally, the value of mainPage was entries.mainpage.replace(/ /g, self.spaceDelimiter). In an intermediate commit, this was changed to entries.mainpage.replace(/ /g, '_'). However, in the final commit, this was changed again to decodeURIComponent(entries.base.split('/').pop()). Why? The intermediate version is good enough.

According to https://www.mediawiki.org/wiki/API:Siteinfo :

base: The absolute path to the main page. 1.8+

The example is:

"base": "https://en.wikipedia.org/wiki/Main_Page",

So this has the main page with the spaces already replaced. I believe the code was likely set up this way specifically so that spaces didn't have to be replaced. As you pointed out in #1995, MediaWiki sometimes reports that the "base" URL is not the "Main Page", which actually makes sense logically.

Basically, I think you just need to do space replacement on your current PR and we should be good. Thanks again!

@TripleCamera
Copy link
Contributor Author

TripleCamera commented Jul 7, 2024

  1. Why is spaceDelimiter a variable? In MediaWiki, page names in URLs always have spaces replaced by underscores. I have read the issue description but still couldn't understand.

Not sure what you mean here. I don't see any references to spaceDelimiter at HEAD.

I was talking about #1180. I couldn't understand kelson's description.

  1. In Removing concept of spaceDelimiter from MWOffliner #1197, why is the step of "replacing spaces with underscores" removed?

What do you mean by "step"? I believe it was successfully removed.

I inspected #1197 and couldn't understand why the "replace" step was removed, perhaps because the author thought they were unnecessary? Here is an example:

@@ -96,11 +94,11 @@ class MediaWiki {
       const entries = json.query[type];
       Object.keys(entries).forEach((key) => {
         const entry = entries[key];
-        const name = entry['*'].replace(/ /g, self.spaceDelimiter);
+        const name = entry['*'];
         const num = entry.id;
         const allowedSubpages = ('subpages' in entry);
         const isContent = !!(entry.content !== undefined || util.contains(addNamespaces, num));
-        const canonical = entry.canonical ? entry.canonical.replace(/ /g, self.spaceDelimiter) : '';
+        const canonical = entry.canonical ? entry.canonical : '';
         const details = { num, allowedSubpages, isContent };
         /* Namespaces in local language */
         self.namespaces[util.lcFirst(name)] = details;

In my opinion, the author of the pull request simply needs to replace self.spaceDelimiter with '_'. However, they removed .replace(/ /g, self.spaceDelimiter), which changed the behavior of the program.


Basically, I think you just need to do space replacement on your current PR and we should be good. Thanks again!

Done. Now I can pass the tests on my computer. Please run Actions again.

@audiodude
Copy link
Member

I inspected #1197 and couldn't understand why the "replace" step was removed, perhaps because the author thought they were unnecessary? Here is an example:

@@ -96,11 +94,11 @@ class MediaWiki {
       const entries = json.query[type];
       Object.keys(entries).forEach((key) => {
         const entry = entries[key];
-        const name = entry['*'].replace(/ /g, self.spaceDelimiter);
+        const name = entry['*'];
         const num = entry.id;
         const allowedSubpages = ('subpages' in entry);
         const isContent = !!(entry.content !== undefined || util.contains(addNamespaces, num));
-        const canonical = entry.canonical ? entry.canonical.replace(/ /g, self.spaceDelimiter) : '';
+        const canonical = entry.canonical ? entry.canonical : '';
         const details = { num, allowedSubpages, isContent };
         /* Namespaces in local language */
         self.namespaces[util.lcFirst(name)] = details;

In my opinion, the author of the pull request simply needs to replace self.spaceDelimiter with '_'. However, they removed .replace(/ /g, self.spaceDelimiter), which changed the behavior of the program.

I think you're right, this is probably a bug, but I assume it doesn't affect the output ZIM. Re-running CI now.

Copy link
Member

@audiodude audiodude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kelson42
Copy link
Collaborator

AFAIK we can not assume that articleIds replaces spaces by undescores, this depends how Mediwiki is configured.

@kelson42
Copy link
Collaborator

Actually, I can not find the reference anymore...

This is a very small change that has been tested on my computer, so I'll just use the web editor.
Copy link

codecov bot commented Jul 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.93%. Comparing base (7f8ebd2) to head (42d6569).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2025      +/-   ##
==========================================
- Coverage   74.76%   71.93%   -2.83%     
==========================================
  Files          41       41              
  Lines        3146     3146              
  Branches      689      689              
==========================================
- Hits         2352     2263      -89     
- Misses        675      748      +73     
- Partials      119      135      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@audiodude audiodude merged commit c09bc92 into openzim:main Jul 14, 2024
5 of 6 checks passed
@kelson42
Copy link
Collaborator

@audiodude This operation of replacing space by underscore is very standard. Very surprised there is no primitive already there for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Troubleshooting minecraftwiki_zh_all recipe
4 participants