Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Audio Metadata #7490

Merged
merged 10 commits into from
Nov 6, 2023
Merged

Audio Metadata #7490

merged 10 commits into from
Nov 6, 2023

Conversation

dschmidt
Copy link
Member

@dschmidt dschmidt commented Oct 15, 2023

CURRENT STATE

We now extract audio metadata, write it into the index and into the arbitrary metadata storage.
There are a bunch of tests for testing the marshalling and unmarshalling between the different data structures in play.
Audio Metadata is added to Graph Api listings as Audio Facet of DriveItems.

The search service responds with audio metadata, but neither Graph Api (as it has no search yet) nor WebDAV Api currently make it part of their search responses. I would like to tackle that in a followup.

INITIAL PR

Description

This PR is a draft for adding audio metadata to the search index and the metadata storage. It uses that metadata to implement the audio facet (added in owncloud/libre-graph-api#120).

Moreover the metadata is added to search results and WebDAV propfinds as custom <oc:audio.*> props.

I had to partially update libegraph-api-go to make the Audio model available and had to update reva to add the props to propfinds, I've done that in the vendor/ folder just to get started and will obviously upstream those changes once we agree we want to go down this road.

On the one hand I've been surprised how easy it was to implement this in the end (kudos for having everything in place for me!), on the other hand I've found it quite cumbersome to do lots and lots of error-prone (de)serializing manually:

In search service:

  • tika result -> libregraph.Audio struct (okay, this makes total sense, this does some actual semantic mapping)
  • libregraph.Audio struct -> metadata (map[string]string) for storage
  • blevee entity (which was auto-serialized from libregraph.Audio) -> protobuf searchMessage.Audio

Then on the consuming end:

  • metadata to libregraph.Audio
  • libregraph.Audio to WebDAV props
  • searchMessage.Audio -> WebDAV props
  • possibly later: searchMessage.Audio to libregraph.Audio (then we could reuse that when serializing to webdav)

All of this is almost the same but not quite - any ideas how to handle this more automatically or at least have all logic in a more central place? Right now one has to touch a lot of places to add a new facet to items.
This could probably be handled almost automatically using reflection, I'm just not sure of the performance penalty this might bring.

Used prefixes etc are totally up for debate, this is basically a PoC - not knowing the oCIS codebase too well I'm happy about any constructive feedback.

Related Issue

Motivation and Context

This is a personal pet peeve of mine, there's been no company request to implement this and as such there's no priority to deal with this PR - I would just personally appreciate it ;-)

I want to query my personal music collection by metadata.

Screenshots (if appropriate):

Regular PROPFIND:

http --verify no --auth admin:admin -v PROPFIND https://localhost:9200/remote.php/dav/spaces/c2f4c71c-0e6d-4c58-b82f-8337ef789579%24a6f58a38-de7d-4c1a-849e-5072faa4295b
PROPFIND /remote.php/dav/spaces/c2f4c71c-0e6d-4c58-b82f-8337ef789579%24a6f58a38-de7d-4c1a-849e-5072faa4295b 
HTTP/1.1 207 Multi-Status
...

<d:multistatus xmlns:s="http://sabredav.org/ns" xmlns:d="DAV:" xmlns:oc="http://owncloud.org/ns">
...
  <d:response>
    <d:href>/remote.php/dav/spaces/c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b/01%20-%20Sucker.mp3</d:href>
    <d:propstat>
      <d:prop>
        <oc:id>c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b!20667376-d173-47b8-98af-c8efdfa842ad</oc:id>
        <oc:fileid>c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b!20667376-d173-47b8-98af-c8efdfa842ad</oc:fileid>
        <oc:spaceid>a6f58a38-de7d-4c1a-849e-5072faa4295b</oc:spaceid>
        <oc:file-parent>c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b!a6f58a38-de7d-4c1a-849e-5072faa4295b</oc:file-parent>
        <oc:name>01 - Sucker.mp3</oc:name>
        <d:getetag>&quot;1f6aa11d5792d9fb69aeedeaa47a838f&quot;</d:getetag>
        <oc:permissions>RDNVWZP</oc:permissions>
        <d:resourcetype/>
        <d:getcontentlength>5848543</d:getcontentlength>
        <d:getcontenttype>audio/mpeg</d:getcontenttype>
        <d:getlastmodified>Fri, 13 Oct 2023 16:00:21 GMT</d:getlastmodified>
        <oc:checksums>
          <oc:checksum>SHA1:28b306e048d26df4c53e66263959a5c0dc267204 MD5:4aaad82c633c17db5f94fd38c1c89082 ADLER32:072717cd</oc:checksum>
        </oc:checksums>
        <oc:tags/>
        <oc:favorite>0</oc:favorite>
        <oc:audio.album>Kiss Of Death</oc:audio.album>
        <oc:audio.artist>Motörhead</oc:audio.artist>
        <oc:audio.title>Sucker</oc:audio.title>
      </d:prop>
      <d:status>HTTP/1.1 200 OK</d:status>
    </d:propstat>
  </d:response>

WebDAV search:

curl 'https://localhost:9200/remote.php/dav/spaces/' -X 'REPORT'  --data-raw $'<?xml version="1.0"?>\n<oc:search-files  xmlns:d="DAV:" xmlns:oc="http://owncloud.org/ns">\n  <d:prop /><oc:search><oc:pattern>mimetype:audio/*</oc:pattern></oc:search>\n</oc:search-files>' --insecure --user admin:admin 


<d:multistatus xmlns:s="http://sabredav.org/ns" xmlns:d="DAV:" xmlns:oc="http://owncloud.org/ns">
    <d:response>
        <d:href>
            /remote.php/dav/spaces/c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b/01%20-%20Sucker.mp3</d:href>
        <d:propstat>
            <d:prop>
                <oc:fileid>
                    c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b!20667376-d173-47b8-98af-c8efdfa842ad</oc:fileid>
                <oc:file-parent>
                    c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b!a6f58a38-de7d-4c1a-849e-5072faa4295b</oc:file-parent>
                <oc:name>01 - Sucker.mp3</oc:name>
                <d:getlastmodified>2023-10-13T16:00:21Z</d:getlastmodified>
                <d:getcontenttype>audio/mpeg</d:getcontenttype>
                <oc:permissions>RDNVW</oc:permissions>
                <oc:highlights></oc:highlights>
                <oc:tags></oc:tags>
                <d:getetag></d:getetag>
                <d:resourcetype></d:resourcetype>
                <d:getcontentlength>5848543</d:getcontentlength>
                <oc:audio.album>Kiss Of Death</oc:audio.album>
                <oc:audio.artist>Motörhead</oc:audio.artist>
                <oc:audio.title>Sucker</oc:audio.title>
                <oc:score>3.1045825481414795</oc:score>
            </d:prop>
            <d:status>HTTP/1.1 200 OK</d:status>
        </d:propstat>
    </d:response>
</d:multistatus>

Graph API Listing:

//http --verify no --auth admin:admin GET  'https://localhost:9200/graph/v1.0/drives/c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b/items/c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b/children'
{
    "value": [
        {
            "audio": {
                "album": "Kiss Of Death",
                "artist": "Motörhead",
                "title": "Sucker"
            },
            "eTag": "\"1f6aa11d5792d9fb69aeedeaa47a838f\"",
            "file": {
                "mimeType": "audio/mpeg"
            },
            "id": "c2f4c71c-0e6d-4c58-b82f-8337ef789579$a6f58a38-de7d-4c1a-849e-5072faa4295b!20667376-d173-47b8-98af-c8efdfa842ad",
            "lastModifiedDateTime": "2023-10-13T18:00:21.000000225+02:00",
            "name": "01 - Sucker.mp3",
            "size": 5848543
        }
    ]
}

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation ticket raised:

@update-docs
Copy link

update-docs bot commented Oct 15, 2023

Thanks for opening this pull request! The maintainers of this repository would appreciate it if you would create a changelog item based on your changes.

"album": "Audio.album",
"albumArtist": "Audio.albumArtist",
"artist": "Audio.artist",
"title": "Audio.title",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. seems to have a specific meaning in kql, so e.g. audio.artist cannot be used?

Copy link
Member

@butonic butonic Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what exactly are you trying to search for?

GET /drives/{drive-id}/root/search(q='Motörhead') or GET /me/drive/search(q='Motörhead') to include all drives the user has access to is a full text search that includes content.

I think GET /drives/{drive-id}/items?$filter=audio/artist eq Motörhead better captures what you are0 trying to do ...

More info on $filter https://learn.microsoft.com/en-us/graph/filter-query-parameter?tabs=http

now therte also is a search endpoint in ms graph thet can be used to query ... everything:

POST https://graph.microsoft.com/beta/search/query
Content-Type: application/json

{
  "requests": [
    {
      "entityTypes": [
        "driveItem"
      ],
      "query": {
        "queryString": "Motörhead"
      },
      "from": 0,
      "size": 25
    }
  ]
}

The queryString is a KQL query, but I cant test it on ms graph because AFAICT I don't have enough permissions to call that endpoint. Nevertheless, xamples for queries can be found in https://learn.microsoft.com/en-us/graph/search-concept-files#example-5-use-filters-in-search-queries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I want to search for "artist:Motörhead" and get all songs that have Motörhead as artist and none that only have "(Motörhead Cover)" in the title

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yes, that filter seems reasonable... is that already implemented in oCIS? Should that just work automatically or do I need to implement it somehow?

Copy link
Contributor

@dragotin dragotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it ❤️ How could we have lived without it?

Awesome proof that it can be done for specific meta data, although this patch is big.

@dschmidt
Copy link
Member Author

dschmidt commented Oct 15, 2023

Awesome proof that it can be done for specific meta data, although this patch is big.

There are multiple aspects why this looks big:

  1. I updated libregraph-api-go, this could/should happen in a standalone pr
  2. I updated the protobuf spec for search messages, that generated a lot of code
  3. the manual (de)serialization/translation of datatypes for a rather big struct

This PR could be split up, I just wanted to send it in total first, so we can discuss the big picture and don't add small incremental changes that lead nowhere.
If we agree on something, I'm happy to do the split

edit:
Anyhow I would argue it's quite a bit of code, yes, but the complexity is extremely low, the only really interesting bit is storing the metadata in storage. Everything else is boilerplate

@dschmidt
Copy link
Member Author

I've pushed new commits with a more dynamic approach using reflection, which is probably slightly slower but more maintainable. It also makes it much easier to add more data types and is less error prone.

A third approach could be to autogenerate ToStringMap(prefix string) and FromStringMap(prefix string) methods in libregraph-client-go...

I'm not insisting on any approach, do you have any thoughts?

@dschmidt
Copy link
Member Author

dschmidt commented Oct 20, 2023

Can probably be simplified with https://pkg.go.dev/github.com/mitchellh/mapstructure

which is used by reva anyway

@dschmidt dschmidt force-pushed the audio-metadata branch 2 times, most recently from b955f48 to 801566d Compare October 24, 2023 19:54
@dschmidt
Copy link
Member Author

The state right now extracts audio metadata, writes them into the index and into the arbitrary metadata storage.
There are a bunch of tests for testing the marshalling and unmarshalling between the different data structures in play.

What I've removed for now although it was part of the initial PR:
mapping fields in KQL (i.e. make it possible to search for artist:"head" and other fields) and WebDAV props in propfinds and search, as I would like to tackle those in a follow up.

This is ready for another round of review from my point of view.

@dschmidt dschmidt changed the title WIP: Audio Metadata Audio Metadata Oct 24, 2023
@dschmidt dschmidt marked this pull request as ready for review October 24, 2023 19:57
@dschmidt
Copy link
Member Author

Can probably be simplified with https://pkg.go.dev/github.com/mitchellh/mapstructure

which is used by reva anyway

I don't think it's worth porting as that would likely mean more loop iterations because it doesn't exactly do what we need and our methods using reflection are reasonable enough...

@dschmidt
Copy link
Member Author

Changelog added

services/search/pkg/content/content.go Show resolved Hide resolved
services/search/pkg/engine/bleve.go Show resolved Hide resolved
services/search/pkg/search/service.go Show resolved Hide resolved
services/search/pkg/search/service.go Outdated Show resolved Hide resolved
services/graph/pkg/service/v0/driveitems.go Show resolved Hide resolved
@dschmidt
Copy link
Member Author

Updated, I think I addressed everything from @kobergj 's review

@sonarcloud
Copy link

sonarcloud bot commented Oct 31, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 9 Code Smells

71.2% 71.2% Coverage
0.0% 0.0% Duplication

@micbar micbar requested a review from kobergj November 3, 2023 10:35
Copy link
Contributor

@micbar micbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for me.

@dschmidt dschmidt enabled auto-merge (squash) November 3, 2023 10:39
@dschmidt dschmidt merged commit db32fb4 into owncloud:master Nov 6, 2023
2 checks passed
ownclouders pushed a commit that referenced this pull request Nov 6, 2023
* Add audio facet to search protobuf message

* Add audio metadata to search index

* Return audio facet from search if available

* Store audio metadata in arbitrary metadata

* Add audio facet to driveItems listings

* Make tests coding style more consistent

* Fix tests

* Add changelog

* Make valueToString code more defensive

* Log status code as well
@dschmidt
Copy link
Member Author

dschmidt commented Nov 6, 2023

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants