Indexing pipeline creates unexpected fields in opensearch #7966

Open
1 task done
ArzelaAscoIi opened this issue Jul 2, 2024 · 0 comments
Labels
P1 High priority, add to the next sprint type:bug Something isn't working

ArzelaAscoIi commented Jul 2, 2024

Describe the bug
When we create an indexing pipeline (see below) and index a file, the resulting document in OpenSearch contains unexpected keys. The OpenSearch response is included below as well.

components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html

  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  pdf_converter:
    type: haystack.components.converters.pypdf.PyPDFToDocument
    init_parameters:
      converter_name: default

  markdown_converter:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters:
      table_to_single_line: false

  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      extraction_kwargs:
        output_format: txt
        target_language: null
        include_tables: true
        include_links: false

  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          similarity: cosine
      policy: OVERWRITE

connections: # Defines how the components are connected
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.application/pdf
    receiver: pdf_converter.sources
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: pdf_converter.documents
    receiver: joiner.documents
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: writer.documents
max_loops_allowed: 100

OpenSearch response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "0a4680f5-f96c-4315-96a0-1c61373077ba-2e2643ba-73ed-42e8-bd81-d54a49a36beb-2e2643ba-73ed-42e8-bd81-d54a49a36beb",
        "_id": "4f2b3c1af05e21324d4bdb92153c16cf",
        "_score": 1,
        "_source": {
          "id": "4f2b3c1af05e21324d4bdb92153c16cf",
          "content": """...some content""",
          "dataframe": null,
          "blob": null,
          "score": null,
          "embedding": null,
          "sparse_embedding": null,
          "file_id": "d2e595c8-d8d4-4efd-8d00-6d8c495b685a",
          "file_name": "7_Post Checkout repository.txt",
          "_file_size": 1451,
          "_file_created_at": "2024-07-02T15:13:02.207122+00:00",
          "content_type": "text/plain"
        }
      }
    ]
  }
}

Error message

  • no error

Expected behavior

  • The "score" key should not be part of the indexed document.
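
The expectation above can be phrased as a small check over a search response. This is only a sketch: the helper name `find_unexpected_keys` and the `RETRIEVAL_ONLY_KEYS` set are hypothetical, and the set holds only "score" because that is the only key this issue names.

```python
import json

# Assumption: keys that only make sense at retrieval time. The issue names
# only "score"; extend this set if other keys turn out to be unwanted.
RETRIEVAL_ONLY_KEYS = frozenset({"score"})


def find_unexpected_keys(search_response: dict) -> dict:
    """Map each hit's _id to the retrieval-only keys present in its _source."""
    offenders = {}
    for hit in search_response["hits"]["hits"]:
        found = sorted(hit["_source"].keys() & RETRIEVAL_ONLY_KEYS)
        if found:
            offenders[hit["_id"]] = found
    return offenders


# Trimmed copy of the response shown in this issue.
response = json.loads("""
{
  "hits": {
    "hits": [
      {
        "_id": "4f2b3c1af05e21324d4bdb92153c16cf",
        "_source": {
          "id": "4f2b3c1af05e21324d4bdb92153c16cf",
          "score": null,
          "embedding": null,
          "file_name": "7_Post Checkout repository.txt",
          "content_type": "text/plain"
        }
      }
    ]
  }
}
""")

print(find_unexpected_keys(response))
```

Note that the check looks at key presence, not at the value, since the key is stored even though its value is null.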

Additional context
My guess is that at some point we send the whole serialized document, including the "score" key, to OpenSearch. This key is misleading, however, because a score should only be assigned at retrieval time.
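
If that guess is right, one possible direction is to drop the key from the serialized document before it is sent to the index. The sketch below is a plain-Python illustration under that assumption; `to_index_payload` and `KEYS_TO_DROP` are hypothetical names and not the document store's actual code.

```python
from typing import Any, Dict

# Assumption: "score" is the only key that should not reach the index.
KEYS_TO_DROP = ("score",)


def to_index_payload(document_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a serialized document without retrieval-time keys."""
    return {key: value for key, value in document_dict.items() if key not in KEYS_TO_DROP}


# Serialized document shaped like the _source in this issue (trimmed).
serialized = {
    "id": "4f2b3c1af05e21324d4bdb92153c16cf",
    "content": "...some content",
    "score": None,
    "embedding": None,
    "file_name": "7_Post Checkout repository.txt",
}

payload = to_index_payload(serialized)
print(sorted(payload))
```

Returning a copy leaves the in-memory document untouched, so a score set later by a retriever would not be affected.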

To Reproduce
Run the pipeline above to index a file, then inspect the stored document in OpenSearch.

FAQ Check

System:

  • OS:
  • GPU/CPU:
  • Haystack version (commit or version number):
  • DocumentStore:
  • Reader:
  • Retriever:
ArzelaAscoIi added the type:bug label Jul 2, 2024
mrm1001 added the P1 label Jul 5, 2024