Confluence loader "keep_newlines" not always passed to "process_pages" #20086

Closed
5 tasks done
KevinHubert-Dev opened this issue Apr 5, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: chroma Primarily related to ChromaDB integrations Ɑ: doc loader Related to document loader module (not documentation)

Comments

@KevinHubert-Dev
Contributor

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

libs/community/langchain_community/document_loaders/confluence.py
@@ -359,6 +359,7 @@ def _lazy_load(self, **kwargs: Any) -> Iterator[Document]:
content_format,
ocr_languages,
keep_markdown_format,
keep_newlines=keep_newlines
)

Error Message and Stack Trace (if applicable)

No response

Description

I use LangChain's Confluence loader to download the content of a specific page from my Confluence instance. While text-splitting/chunking the pages, I noticed that the newlines were missing in non-markdown format. While debugging, I saw that the keep_newlines parameter is not forwarded to every call of the process_pages function inside of
libs/community/langchain_community/document_loaders/confluence.py
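
To illustrate the kind of bug described above, here is a minimal, self-contained sketch. It is not the actual LangChain code: `process_pages` below is a simplified stand-in with an assumed signature, and `load_buggy` / `load_fixed` are hypothetical names. The point is only that when one call site omits `keep_newlines`, the callee's default silently wins.

```python
from typing import List


def process_pages(
    pages: List[str],
    keep_markdown_format: bool = False,
    keep_newlines: bool = False,
) -> List[str]:
    """Simplified stand-in: collapse whitespace unless keep_newlines is set."""
    if keep_newlines:
        return list(pages)
    return [" ".join(page.split()) for page in pages]


def load_buggy(pages: List[str], keep_newlines: bool = False) -> List[str]:
    # Hypothetical call site: keep_newlines is accepted but never forwarded,
    # so process_pages falls back to its default (False).
    return process_pages(pages, False)


def load_fixed(pages: List[str], keep_newlines: bool = False) -> List[str]:
    # Same call site with the parameter forwarded explicitly.
    return process_pages(pages, False, keep_newlines=keep_newlines)
```

With this sketch, `load_buggy(["a\nb"], keep_newlines=True)` drops the newline even though the caller asked to keep it, while `load_fixed` preserves it.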

System Info

langchain=0.1.14
windows 11
python 3.10

@KevinHubert-Dev
Contributor Author

I've forked the repo and will open a pull request in a few minutes.

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🔌: chroma Primarily related to ChromaDB integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Apr 5, 2024
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 5, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 12, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 12, 2024
ccurme added a commit that referenced this issue Aug 23, 2024
…to 'process_pages' function in confluence loader (#20086) (#20087)

- **Description:** Fixed missing `keep_newlines` parameter forward-pass
in confluence-loader
- **Issue:** #20086 
- **Dependencies:** None

---------

Co-authored-by: ccurme <[email protected]>