I want parallel processing results by file #93

Open
Ribo-Py opened this issue Oct 11, 2021 · 0 comments

Ribo-Py commented Oct 11, 2021

I have the code below to process multiple files in parallel. I want the results ordered by file (the first loop), but they currently come out ordered by clause (the third loop), as shown in the image. How can I fix this?

```python
import multiprocessing as mp
import os
import re

import pdfplumber

# myPath, keywords, clause_cap and attention() are not shown in this snippet.

def extractFile(file):
    pdf = pdfplumber.open(os.path.join(myPath, file))
    n_clause = 0

    for i in range(len(pdf.pages)):
        page = pdf.pages[i]
        text = page.extract_text()
        # tables = tabula.read_pdf(os.path.join(myPath, file), pages=i+1, multiple_tables=True)

        if re.search(keywords.casefold(), text.casefold()):
            highlights = text.split('.')
            for sentence in highlights:
                if re.search(keywords.casefold(), sentence.casefold()):
                    n_clause += 1
                    if n_clause <= clause_cap:
                        print(f'[Contract Name: {file}] \n Page {i+1} - Clause {n_clause}: {attention(sentence, keywords)}')
                    else:
                        break

files = [x for x in os.listdir(myPath) if x.endswith(".pdf")]

with mp.Pool(6) as pool:
    print(pool.map(extractFile, files))
```

image
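
One possible direction, sketched under assumptions rather than tested against the real PDFs: printing inside the worker lets output from different processes interleave, whereas `Pool.map` returns its results in the same order as the input list. If each worker returns its lines instead of printing them, the parent process can print them file by file. In the sketch below, `collect_clauses`, `run_grouped` and the `(name, pages)` job shape are hypothetical stand-ins; the page texts replace what `pdfplumber` would extract, and the keyword is treated as a single literal string rather than a regular expression.

```python
import multiprocessing as mp
import re
from functools import partial


def collect_clauses(job, keyword, cap):
    """Return (not print) the matching lines for one file.

    `job` is a hypothetical (name, pages) pair; `pages` is a list of page
    texts standing in for the text pdfplumber would extract.
    """
    name, pages = job
    # Assumption: one literal keyword, matched case-insensitively.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    lines = []
    for page_no, text in enumerate(pages, start=1):
        for sentence in text.split('.'):
            if pattern.search(sentence):
                if len(lines) == cap:
                    return lines  # cap reached: stop scanning this file
                lines.append(f'[Contract Name: {name}] Page {page_no} - '
                             f'Clause {len(lines) + 1}: {sentence.strip()}')
    return lines


def run_grouped(jobs, keyword, cap, processes=2):
    """Process jobs in parallel; return one list of lines per job, in input order."""
    worker = partial(collect_clauses, keyword=keyword, cap=cap)
    with mp.Pool(processes) as pool:
        # pool.map preserves the order of `jobs` in its result list.
        return pool.map(worker, jobs)
```

To use it, call `run_grouped(...)` from under an `if __name__ == "__main__":` guard and loop over the returned lists, printing each file's lines in turn. All clauses for one file then appear together, regardless of which worker finished first.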
