FIX #2617 Cheerio Web Crawler doesn't work with large sites #2678

Merged: 2 commits, Jul 5, 2024

ahmosman
Contributor

Hi,

This PR fixes issue #2617. There were two problems.

First, cheerioLoader sometimes fails to download the content of a page and returns undefined, so I now assign empty arrays whenever the result is undefined.
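A minimal sketch of that fallback, for illustration only. The `Doc` type and the helpers `docsOrEmpty` and `collectDocs` are hypothetical names, not the code in this PR; the idea is just that an undefined result is replaced by an empty array before it is merged with the other pages.

```typescript
// Hypothetical shape of a loaded document; not the actual LangChain type.
interface Doc {
    pageContent: string
    metadata: Record<string, unknown>
}

// Hypothetical helper: normalise a loader result so callers always get an array.
function docsOrEmpty(result: Doc[] | undefined | null): Doc[] {
    return result ?? []
}

// Merge the results of several page loads, tolerating pages that yielded nothing.
function collectDocs(results: Array<Doc[] | undefined>): Doc[] {
    const all: Doc[] = []
    for (const result of results) {
        all.push(...docsOrEmpty(result))
    }
    return all
}
```

With this in place, one page that fails to load no longer breaks the whole crawl; it simply contributes no documents.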

Second, CheerioWebBaseLoader doesn’t support loading PDF files. Loading a PDF takes a long time and the content comes back encoded, so I believe it shouldn’t be loaded as a Document. I added a condition to skip PDF files.
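One way such a condition could look, as a sketch under assumptions: the actual check in this PR is not shown here, and `isPdfUrl` and `filterCrawlableUrls` are hypothetical names. This version decides by file extension only, after stripping any query string or fragment.

```typescript
// Hypothetical check: treat a URL as a PDF if its path ends in ".pdf".
// Assumption: the extension is a good enough signal; the Content-Type header is not inspected.
function isPdfUrl(url: string): boolean {
    // Drop the query string and fragment before looking at the extension.
    const path = url.split(/[?#]/)[0] ?? ''
    return path.toLowerCase().endsWith('.pdf')
}

// Keep only the URLs that should be passed on to the HTML loader.
function filterCrawlableUrls(urls: string[]): string[] {
    return urls.filter((url) => !isPdfUrl(url))
}
```

An extension check is cheap because it needs no extra request, but it will miss PDFs served from URLs without a `.pdf` suffix.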

I’ve tried it on the same Chatflow as in the issue.

@HenryHengZJ
Contributor

do you have an example site that we can test before and after this PR?

@ahmosman
Contributor Author

ahmosman commented Jun 19, 2024

Sure, I tested on this site: https://www.cupraofficial.pl. It takes about 10 minutes to scrape everything and upload it to Postgres VectorDB.

@HenryHengZJ HenryHengZJ merged commit 90558ca into FlowiseAI:main Jul 5, 2024
2 checks passed