Search #26

Open
gerbrent opened this issue Jul 5, 2022 · 23 comments · May be fixed by #576
Labels
enhancement New feature, enhancement, or request JB - need decision decision/consult needed from JB Team question Further information is requested
Milestone

Comments

@gerbrent
Collaborator

gerbrent commented Jul 5, 2022

Figure out a good way to integrate search. Client-side search like Lunr.js will probably not perform well due to index size.

@theZMC recommends:

Since golang seems to be a theme here (which it should be), maybe zinc would fit the bill?

@gerbrent gerbrent added the enhancement New feature, enhancement, or request label Jul 5, 2022
@gerbrent gerbrent added this to the Hugo Website milestone Jul 5, 2022
@elreydetoda
Collaborator

If zinc is used (from my understanding) it looks like it'll need to be implemented on the server side (at least for the server which holds all the references to the objects/records).

Also of note: it appears that zinc is still in beta (src):

Project Status:
ZincSearch is in Pre GA (General Availability) and will be marked as production ready at v1.0.0 .

So, it's possible they'll have major breaking changes before the 1.0 release, which means we'll need to make sure we pin the version and read the changelog before upgrading to see if anything's going to break.

While it's not a self-hosted option (and will probably cost money because of how many episodes JB has), as a temporary solution we could use Algolia. I've used them before, and overall it was pretty easy. I've actually got a GH action for doing CI with Hugo content as well: https://github.com/Climate-Refugee-Stories/crs-website/blob/c82f394a620b4631bb43de5ca4433d33a51bb292/.github/workflows/cd.yml#L91-L126
(figure we probably won't go with this, but just figured I'd mention it).

@elreydetoda
Collaborator

Meta: BTW, @gerbrent you might want to add a "JB - action needed" tag to this issue since it's discussing cost of running an extra service on a server specifically for search, and if that's something they want to contemplate (because that'll be another service to maintain).

@gerbrent gerbrent added JB - need decision decision/consult needed from JB Team question Further information is requested labels Jul 22, 2022
@reesericci
Collaborator

Typesense might be a good option

@gerbrent gerbrent modified the milestones: JB.com 1.0, JB.com 2.0 Aug 6, 2022
@ironicbadger
Collaborator

@RealOrangeOne usually has some pretty strong opinions on search.

@gerbrent
Collaborator Author

I have opinions as well, from a functionality and end-user perspective.

The search results at notes.jupiterbroadcasting.com are not at all to my liking pretty much every single time I try to use it, which is generally to answer a question like "I remember we mentioned that in the last few months, let's see which episode that was from". The results come back sorted by "relevance", which never gets me what I want (and yet I keep trying.....)

I would far prefer chronologically sorted search results.
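
A minimal sketch of that ordering, assuming the search backend returns hits with a relevance score and a publish date. The `Hit` type, its fields, and the sample episodes are hypothetical and not taken from any particular search tool:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    title: str
    published: date
    score: float  # relevance score from the search engine (hypothetical field)


def newest_first(hits: list[Hit], min_score: float = 0.0) -> list[Hit]:
    """Drop weak matches, then order the remainder by date, newest first."""
    relevant = [h for h in hits if h.score >= min_score]
    return sorted(relevant, key=lambda h: h.published, reverse=True)


# Made-up sample data for illustration only.
hits = [
    Hit("Episode A", date(2022, 3, 1), 0.9),
    Hit("Episode B", date(2022, 7, 1), 0.4),
    Hit("Episode C", date(2021, 11, 5), 0.7),
]

for hit in newest_first(hits, min_score=0.3):
    print(hit.published, hit.title)
```

Relevance is used here only as a cut-off, so the "which recent episode was that?" case is answered by the first few rows.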

Also, on a slow connection, the UX of the current search is poor: it isn't obvious that results are presented as a pop-up in the search bar until they finally load. It's annoying, and slow enough to make me wonder more than once "is this working?"

@theZMC

theZMC commented Aug 15, 2022

I'm willing to start putting some serious development work into this. Some clarifying questions:

  • Do we need full text search?
  • Any plans for some sort of transcription process (automatic or manual) and if so is there a timeline for when that would be in place?

@ironicbadger
Collaborator

See the search at notes.jupiterbroadcasting.com - that's the type of thing I think is needed. @ChrisLAS @noblepayne or @gerbrent feel free to jump in here.

@gerbrent
Collaborator Author

see here why search via recency is not supported nor desired by mkdocs:

selfhostedshow/show-notes#16 (comment)

@RealOrangeOne

I agree lunr probably isn't ideal, as the index will be huge. I've written client-side search with Hugo, and it's very simple, but the index may be large given the show history

mkdocs's search is lunr-based. The issue with mkdocs is that its pages have no sense of date, rather than any technical limitation. Hugo, however, does have dates as a concept, so it could be done there.

For search, I suspect we'd want something server-side to do it. For ease (of local dev and hosting), scraping the content into SQLite and using its full-text search would probably be very simple, very powerful and scalable.
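
A rough sketch of the SQLite idea, assuming the local SQLite build includes the FTS5 extension (most Python distributions ship it). The table name, columns, and episode rows are made up for illustration; a real version would scrape them from the built Hugo content:

```python
import sqlite3

# In-memory database for the sketch; a real setup would use a file on disk.
conn = sqlite3.connect(":memory:")

# FTS5 virtual table; `published` is stored but not full-text indexed.
conn.execute(
    "CREATE VIRTUAL TABLE episodes USING fts5(title, body, published UNINDEXED)"
)
conn.executemany(
    "INSERT INTO episodes (title, body, published) VALUES (?, ?, ?)",
    [
        ("Hypothetical episode 1", "We talk about ZFS snapshots.", "2022-01-10"),
        ("Hypothetical episode 2", "More on ZFS and backups.", "2022-06-02"),
        ("Hypothetical episode 3", "Home automation tips.", "2022-07-15"),
    ],
)


def search(query: str, limit: int = 10) -> list[tuple[str, str]]:
    """Full-text match, newest episode first (ISO dates sort as text)."""
    rows = conn.execute(
        "SELECT title, published FROM episodes "
        "WHERE episodes MATCH ? ORDER BY published DESC LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


print(search("zfs"))
```

Ordering by `published` instead of FTS5's built-in `rank` gives chronological results; swapping the `ORDER BY` clause would give relevance ordering.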

Elasticsearch etc are definitely options, but they're very heavy for what we need. As are hosted tools like Algolia, but given the name of one of our shows, that's a less desirable option.

@ironicbadger
Collaborator

Could we run some mock ups with elastic and get a sense of just how heavy? We have the infra to do it I'd wager.

@RealOrangeOne

It's not just heavy in terms of resources. It also makes local development much more of a pain, not to mention that it's more complex to set up and work with anyway. The container alone is ~550 MB compressed.

@theZMC

theZMC commented Aug 17, 2022

It's not just heavy in terms of resources. It also makes local development much more of a pain, not to mention that it's more complex to set up and work with anyway. The container alone is ~550 MB compressed.

Unfortunately when it comes to search, I think it's the classic pick two between fast, good, and inexpensive. Though I do agree that full fat elastic is a bit too heavy-handed for our needs.

@CGBassPlayer
Collaborator

CGBassPlayer commented Sep 8, 2022

So I just found a tool that might be worth using if search is still something we are after. It's called Pagefind, and it is a single binary that indexes the site after it is built. There is a video on the home page of their site showing how it works, along with a basic example.

It's also written in 🎉 Rust 🎉

@elreydetoda
Collaborator

That looks pretty awesome! It's nice that we could just bundle that in an artifact as well. It just comes with the new site build! 🥳

@gerbrent
Collaborator Author

gerbrent commented Sep 8, 2022

this DOES look fascinating!

The demo at the top of the page at https://pagefind.app/ is fast - much faster than our current notes.jupiterbroadcasting.com for me on a low end internet connection and low-end hardware.

Pagefind can run a full-text search on a 10,000 page site with a total network payload under 300KB, including the Pagefind library itself. For most sites, this will be closer to 100KB.

that sounds like us ; )

🎯 Another lovely demo: https://xkcd.pagefind.app/

@gerbrent
Collaborator Author

gerbrent commented Sep 8, 2022

My big question - can results be sorted by date/recency? I see Pagefind has the concept of "date"

image

@CGBassPlayer
Collaborator

CGBassPlayer commented Sep 8, 2022

I don't see why not since it is content on the page. I wonder if we will need a piece of metadata for the date.

But I found this tool about 15 minutes before I commented (Just long enough to watch the video)

@gerbrent
Collaborator Author

gerbrent commented Sep 8, 2022

That can be very handy for the JB Archive (a distinct Hugo instance):

Pagefind can be configured to search across multiple sites, merging results and filters into a single response. Multisite search configuration happens entirely in the browser, by pointing one Pagefind instance at multiple search bundles.

The following examples reflect Pagefind running on a website at blog.example.com that wants to include pages from docs.example.com in the search results.

https://pagefind.app/docs/multisite/

Changing the weighting of individual indexes

When searching across multiple sites you may want to rank each index higher or lower than the others. This can be achieved by passing an indexWeight option for each index:

https://pagefind.app/docs/multisite/#changing-the-weighting-of-individual-indexes

@FlakM

FlakM commented Oct 30, 2022

Hello all! Have you considered https://www.meilisearch.com/ ? It's also an open source project with a valid source of income (they have recently received a 15M round of funding). It is very easy to deploy. I'd be more than happy to write a backend RSS watcher and some mockups for the front end.
As for costs, times are crazy, but I think I can commit to covering a year of runtime and on-call support as value for value 🥰

@gerbrent
Collaborator Author

gerbrent commented Nov 1, 2022

oh wow, very generous @FlakM !!!

I'll be curious to hear what others think of MeiliSearch - def worth considering!

@FlakM

FlakM commented Nov 2, 2022

I've recently been reviewing alternatives to the more traditional ELK stack and hosted options for my employer. Meilisearch came up on This Week in Rust, so I have also looked into it. Here are some reasons why I think it would be a good fit here:

  • It uses very solid backing technologies, i.e. LMDB, which was designed as an embedded database for OpenLDAP by very smart people. If you prefer podcasts here is a great episode about it.
  • It has a REST API, so it could be used without any other backend services apart from the component that keeps the data in sync (and maybe some nginx to add rate limiting, TLS, etc.)
  • It is dead simple to deploy and maintain - just a single container
  • It has a front-end code already written so including it is also very simple
  • It has all of those nice features like typo safety, synonyms etc
  • It is blazingly fast 🚀 🦀
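
To illustrate the REST API point, here is a hedged sketch that only builds a Meilisearch search request and does not send it. The host, the index name `episodes`, and the sortable `published` attribute are assumptions for this example. Sorting also assumes `published` has been configured as a sortable attribute on the index, and the `Authorization: Bearer` header assumes a reasonably recent Meilisearch version:

```python
import json
import urllib.request


def build_search_request(
    base_url: str, index: str, api_key: str, query: str, limit: int = 20
) -> urllib.request.Request:
    """Build (but do not send) a POST to Meilisearch's search route."""
    body = json.dumps(
        {
            "q": query,
            "limit": limit,
            # Assumes `published` is a sortable attribute on this index.
            "sort": ["published:desc"],
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/indexes/{index}/search",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Hypothetical local instance; 7700 is Meilisearch's default port.
req = build_search_request("http://localhost:7700", "episodes", "MASTER_KEY", "zfs")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the search against a running instance. In practice a search-only key would be used from the browser rather than the master key.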

For your convenience, I've deployed a sample service and loaded the index with the contents of all the RSS feeds; it's available here (BTW, it is a proper use of Linode credits). The secret key is MASTER_KEY. Keep in mind that it is the result of a quick and dirty effort. For a full-blown index, I think it would be useful to also add transcriptions (I have experience with DeepSpeech, so that's probably not a big problem) and more complete show notes. The current showcase version of the code that loads data from the RSS feeds is available here
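
As an illustration of the feed-loading step (not the showcase code referred to above), here is a small sketch that turns RSS items into JSON-ready documents of the kind a search index could ingest. The sample feed, the field names, and the integer `id` scheme are all hypothetical:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up feed for illustration; a real loader would fetch each show's RSS.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Hypothetical Show</title>
  <item>
    <title>Episode 1</title>
    <link>https://example.com/1</link>
    <pubDate>Tue, 05 Jul 2022 12:00:00 +0000</pubDate>
    <description>First episode notes.</description>
  </item>
</channel></rss>"""


def rss_to_documents(xml_text: str) -> list[dict]:
    """Convert each <item> of an RSS 2.0 feed into a flat search document."""
    channel = ET.fromstring(xml_text).find("channel")
    show = channel.findtext("title", default="")
    docs = []
    for n, item in enumerate(channel.iter("item")):
        # RSS dates use the RFC 822 format, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        docs.append(
            {
                "id": n,  # hypothetical id scheme; must be unique per document
                "show": show,
                "title": item.findtext("title", default=""),
                "url": item.findtext("link", default=""),
                "published": published.isoformat(),
                "body": item.findtext("description", default=""),
            }
        )
    return docs


docs = rss_to_documents(SAMPLE_RSS)
print(docs[0]["title"], docs[0]["published"])
```

Transcriptions or fuller show notes could be added as extra fields on the same documents once they exist.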

@FlakM FlakM mentioned this issue Nov 3, 2022
@gerbrent
Collaborator Author

gerbrent commented Nov 3, 2022

amazing again @FlakM !! Will look at this further in a few days.. thank you!!

@elreydetoda
Collaborator

@kylepotts suggests start of convo & end of convo:

What things have we tried for searching transcriptions? I wonder if taking the output of the transcription and putting it inside something like ElasticSearch/Opensearch and exposing it via an API is overkill? Or if a product like that already exists. Definitely will require a unique way to have a "dynamic" results page in Hugo from where you search.

@CGBassPlayer CGBassPlayer linked a pull request Jan 17, 2024 that will close this issue
8 participants