Some taken some given #202

Open
wants to merge 110 commits into
base: master
HjalmarrSv

Some ideas and fixes that may not suit everyone.

    ##
    # Whether to preserve lists
    keepLists = False,

    ##
    # Whether to omit the empty line after a header
    noLineAfterHeader = False,

    ##
    # Whether to add a header and footer
    headersfooters = False,

    ##
    # Whether to remove empty lines and lines holding a single space
    spacefree = False,

    ##
    # Whether to omit article titles (inferred from the option name)
    titlefree = False,
The spacefree option now also removes lines that consist of a single space (line = " ") in addition to empty lines (line = ""). Both are otherwise frequent in the output. Clean the output afterwards with sed or a similar tool of your choice, but carefully: it will do _exactly_ as told.
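For illustration, the filtering described above can be sketched as follows. This is a minimal sketch, not the code from this fork; the helper name `drop_blank_lines` is hypothetical.

```python
def drop_blank_lines(lines):
    """Drop lines that are empty ("") or hold exactly one space (" ").

    Hypothetical helper, only meant to show the rule described above.
    """
    return [line for line in lines if line not in ("", " ")]


sample = ["First paragraph.", "", " ", "Second paragraph."]
print(drop_blank_lines(sample))
```

Lines with two or more spaces are left alone here, since the description only mentions the empty line and the one-space line.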
Updated to reflect the change to --squeeze-blank; both should work the same.
As suggested by operator-name in the pull request "wants to merge 1 commit into attardi:master from operator-name:master".
Removed the duplicate of '__TOC__'.
Added:
        '__NOEDITSECTION__', #added 191229
        '__EXPECTUNUSEDCATEGORY__', #added 191229 
as found in private $mDoubleUnderscoreIDs here: https://doc.wikimedia.org/mediawiki-core/master/php/MagicWordFactory_8php_source.html
from
total of page: %d, total of articl page: %d; total of used articl page:
to
total of pages: %d, total of article pages: %d; total of used article pages:
New option. One sentence per line. Nothing but article text. One empty line between articles. Some cleaning.
Adapted from work by josecannete in wikiextractorforBERT, here on GitHub.

Also some small edits.
Use  --min_text_length 100 for removing very short articles.
cat --squeeze-blank wiki/*/* > wiki/wiki.txt

Notice the use of a wildcard on two levels, i.e. folders and files. Not possible usually.
Fix as proposed by chaojiang06:
"Hi, I try to fix the template expansion function based on the current latest 2.75 version."
"Basically, two lines got changed, #1269 and #1868." [not the same lines in this fork]
Maybe overcomplicated.
It could be just: if the option is chosen, exchange every . for ,
now, as in better late than never
with content within tags
Leftovers from templates, when guessing.
at row 3528, if you want to rid output of '( )', whether empty or full of text.
as seen in json-output
if they are not already cleaned
https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it
Using a raw string does nothing for re itself; it only stops Python from interpreting escape sequences in the string literal before the string is passed to re, as explained in the link above.
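A small example of the difference, using only the standard re module. The pattern and the sample text are made up for illustration.

```python
import re

plain = "\bword\b"   # Python turns each \b into a backspace character (0x08)
raw = r"\bword\b"    # backslashes are kept, so re sees \b as a word boundary

text = "a word here"
plain_match = re.search(plain, text)  # looks for backspace + "word" + backspace
raw_match = re.search(raw, text)      # looks for "word" at word boundaries

print(len(plain), len(raw))
print(plain_match, raw_match)
```

The two literals differ in length before re ever sees them, which is the whole effect of the r prefix.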
may contain errors, testing needed
looks like there is a nested bug that needs to be hunted down also
if line 53 is changed to False. Note that if you want something other than ./ab/abcd/abcd as the directory structure, you need to change the code. I have commented where (lines 117-119). Please also look at line 247 for file name variations.
Either change "/" to e.g. "\", or call the Python os.path.join function once per directory level created. I have not tested either.
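A sketch of the os.path.join variant, untested against this fork. The helper name `nested_path` and the rule of taking the first two and first four characters of the name are assumptions, chosen only to mimic the ./ab/abcd/abcd layout. Note that os.path.join accepts several components in a single call, so one call can cover all directory levels.

```python
import os


def nested_path(root, name):
    """Build root/<first 2 chars>/<first 4 chars>/<name> with the OS separator.

    Hypothetical helper; the prefix lengths are an assumption for illustration.
    """
    return os.path.join(root, name[:2], name[:4], name)


path = nested_path(".", "abcd")
print(path)

# Creating the directories (not the file itself) could then look like:
# os.makedirs(os.path.dirname(path), exist_ok=True)
```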
Nicer library structure.
Slight improvement. Note that sentences are assumed to end with . ! or ?. Other sentence-ending punctuation marks have to be added if they occur.
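To make the limitation concrete, here is a minimal sketch of splitting on those three marks. It is not the implementation in this fork; `SENTENCE_END` and `split_sentences` are hypothetical names.

```python
import re

# Split at whitespace that follows . ! or ? (extend the character class
# to support other sentence-ending marks).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences of text, one per list item."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


for sentence in split_sentences("Hello world. How are you? Fine!"):
    print(sentence)
```

A rule this simple also splits after abbreviations such as "e.g.", so the output still needs checking.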
2 participants