FIX #2617 Cherio Web Crawler doesn't work with large sites #2678

ahmosman · 2024-06-19T16:51:14Z

Hi,

This PR fixes issue #2617. There were two problems.

Firstly, cheerioLoader sometimes fails to download the content of a page and returns undefined, so I added an assignment of empty arrays wherever the result is undefined.
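A minimal sketch of the empty-array fallback, not the actual diff. The `Doc` type and the `mergeDocs` helper are hypothetical names used only for illustration; the idea is that a page whose load returned undefined contributes nothing instead of throwing when its result is spread into the collected documents.

```typescript
// Hypothetical document shape, assumed for this illustration only.
type Doc = { pageContent: string; metadata: { source: string } };

// Merge per-page results, treating an undefined result
// (a failed download) as an empty array.
function mergeDocs(results: Array<Doc[] | undefined>): Doc[] {
    const docs: Doc[] = [];
    for (const pageDocs of results) {
        // Fall back to [] so the spread below never sees undefined.
        const safeDocs = pageDocs ?? [];
        docs.push(...safeDocs);
    }
    return docs;
}
```

With this guard, one failed page no longer aborts the scan of the remaining pages.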

Secondly, CheerioWebBaseLoader doesn’t support loading PDF files. Loading a PDF file takes a long time and the content is encoded, so I believe it shouldn’t be downloaded as a Document. I added a condition to skip PDF files.
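A rough sketch of such a condition, assuming PDFs are recognised by a `.pdf` extension in the URL path (the exact check in the diff may differ). `isPdfUrl` and `filterCrawlableUrls` are hypothetical helper names, not part of CheerioWebBaseLoader.

```typescript
// Return true when the URL path ends in ".pdf" (case-insensitive).
// The query string and fragment are stripped first so that
// "file.pdf?download=1" is still recognised.
function isPdfUrl(url: string): boolean {
    const withoutQueryOrHash = url.replace(/[?#].*$/, "");
    return /\.pdf$/i.test(withoutQueryOrHash);
}

// Keep only the URLs that should be passed to the HTML loader.
function filterCrawlableUrls(urls: string[]): string[] {
    return urls.filter((url) => !isPdfUrl(url));
}
```

An extension check will miss PDFs served from URLs without a `.pdf` suffix; inspecting the `Content-Type` response header would catch those at the cost of an extra request.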

I’ve tried it on the same Chatflow as in the issue.

HenryHengZJ · 2024-06-19T17:37:14Z

do you have an example site that we can test before and after this PR?

ahmosman · 2024-06-19T21:33:06Z

Sure, I tested on this site: https://www.cupraofficial.pl. It takes about 10 minutes to scrape everything and upload it to Postgres VectorDB.

packages/components/nodes/documentloaders/Cheerio/Cheerio.ts

FIX FlowiseAI#2617 Big sites scan error

dca904c

HenryHengZJ reviewed Jun 21, 2024

packages/components/nodes/documentloaders/Cheerio/Cheerio.ts (outdated)

FIX FlowiseAI#2617 Big sites scan error - review fix

387147f

HenryHengZJ approved these changes Jul 5, 2024

HenryHengZJ merged commit 90558ca into FlowiseAI:main Jul 5, 2024
2 checks passed
