Lexical search vs semantic search


Summary of results

1.

I think there are a couple things worth noting here:

Semantic search performs well at capturing documents that keyword search misses. As noted in the article, when searching for exact keywords, keyword search will outperform semantic search. It is when users do not know the exact phrase they are looking for that semantic search shines.

Semantic search should only be a part of your search system, not your entire search system. We find that combining keyword search + semantic search and then using a reranker gives the best results. It is best if the reranker is fine-tuned on your search history, but general cross-encoders perform surprisingly well.
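
A minimal sketch of that combination (assuming the rank_bm25 and sentence-transformers packages; the model names and documents are illustrative, not from the comment):

    # Hybrid retrieval: BM25 + dense embeddings fused with reciprocal
    # rank fusion (RRF), then a general-purpose cross-encoder rerank.
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    docs = [
        "How to reset your home router",
        "Troubleshooting wireless network connectivity",
        "Best pasta recipes for beginners",
    ]
    query = "reset home wifi"

    # Lexical leg: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_rank = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])

    # Semantic leg: cosine similarity between query and document embeddings.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    sims = util.cos_sim(encoder.encode(query), encoder.encode(docs))[0]
    dense_rank = sorted(range(len(docs)), key=lambda i: -float(sims[i]))

    # Reciprocal rank fusion: reward documents that rank well in either list.
    k = 60
    fused = {i: 1 / (k + bm25_rank.index(i) + 1) + 1 / (k + dense_rank.index(i) + 1)
             for i in range(len(docs))}
    candidates = sorted(fused, key=fused.get, reverse=True)[:10]

    # Final pass: the cross-encoder scores (query, document) pairs directly.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, docs[i]) for i in candidates])
    for i, s in sorted(zip(candidates, scores), key=lambda t: -t[1]):
        print(f"{s:.3f}  {docs[i]}")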

2.

Isn't the big problem that BM25 (and friends) will help you find (and rank) exact search terms (or stemmed varieties of that search term), whereas semantic search can typically find items out-of-dictionary but "close" semantically? SPLADE, on my reading of it, seems to do a "pre-materialization" of the out-of-dictionary part.

3.

Traditional search engines such as Elasticsearch and Lucene rely primarily on bag-of-words retrieval and keyword matching, e.g. BM25, TF-IDF, etc. Vector databases such as Milvus allow for _semantic_ search.

Here's a highly simplified example: if my query was "fields related to computer science", a semantic engine could return "statistics" and "electrical engineering" while avoiding things such as "social science" and "political science".

4.

You can first do a keyword-based search (inverted index, etc.) and then do semantic search over the results of that first step.
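
A rough sketch of that two-stage idea (inverted index for recall, embeddings for ranking; the package, model, and documents are illustrative):

    # Stage 1: a tiny inverted index narrows the candidate set by keyword.
    # Stage 2: an embedding model ranks only that shortlist semantically.
    from collections import defaultdict
    from sentence_transformers import SentenceTransformer, util

    docs = ["intro to relational databases",
            "vector databases for semantic search",
            "gardening tips for spring"]

    index = defaultdict(set)            # token -> set of doc ids
    for i, d in enumerate(docs):
        for tok in d.lower().split():
            index[tok].add(i)

    query = "databases for search"
    candidates = sorted(set().union(*(index.get(t, set())
                                      for t in query.lower().split())))

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sims = util.cos_sim(model.encode(query),
                        model.encode([docs[i] for i in candidates]))[0]
    for i, s in sorted(zip(candidates, sims.tolist()), key=lambda t: -t[1]):
        print(f"{s:.3f}  {docs[i]}")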

5.

Keyword or text-matching-based search is likely still more popular because of how long it's been around, its simplicity, and the tooling built around it. Most companies that have internal search are probably not using vector / semantic search, but are doing something more basic.

6.

The point of vector search is to support semantic search. It makes sense that grep will outperform if you're just looking for verbatim occurrences of a string.

7.

Is there something (besides the convenience of being already in the proper vector formats) unique about the combination of semantic search and LLMs that makes it a better fit?

I guess I'm surprised that the approach is to use semantic search here, whereas in regular search, completely different algorithms (Lucene?) have "won".

8.

This sounds a lot like semantic web search? Can anyone share how it differs?

9.

In my experience semantic search is great for finding implicit relationships (bad guy => villain) but sometimes fails in unpredictable ways for more elementary matches (friends => friend). That's why it can be good to combine semantic search with something like BM25, which is what I use in my blog search [1]. N-gram text frequency algorithms like TF-IDF and BM25 are also lightning fast compared to semantic search.

[1] https://lukesalamone.github.io/posts/rolling-my-own-blog-sea...

10.

We’ve had good results with semantic search. We use it because keyword search doesn’t handle minor changes in words gracefully and semantic search does.

11.

I remember when "semantic search" was the Next Big Thing (back when all we had were simple keyword searches).

I don't know enough about the internals of Google's search engine to know if it could be called a "semantic search engine", but if not, it gets close enough to fool me.

But I feel like I'm still stuck on keyword searches for a lot of other things, like email (Outlook and mutt), grepping IRC logs, searching for products in small online stores, and sometimes even things like searching for text in a long webpage.

I'm sure people have thought about these things: what technical challenges exist in improving search in these areas? Is it just a matter of integrating engines like the one that's linked here? Or maybe keyword searches are often Good Enough, so no one is really clamoring for something better.

12.

Keyword search won for commodity use because semantic search was weaker and more complicated. More sophisticated orgs like Google switched to semantic/hybrid years ago as both problems got addressed.

Now that OpenSearch/Elasticsearch have vector indexes, and high-quality text embedding is a lot easier...

13.

Vector search is incredibly powerful at matching on context or similarity. For example, automobile and car are semantically similar, and one will rank well for the other in a search.

Vector search, though, isn't as good at handling typos and not good at all when it comes to as-you-type searching. "Vehic" won't match "auto", for example.

We believe there is a use for each of these approaches, and for combining them in a single search, rather than choosing one ahead of time or through heuristics after the fact.

(I'm a Principal PM for Semantic Search and Search Ranking at Algolia.)

14.

In a strict "one question / one response" search, raw semantic search results are a great solution, and they consume far fewer tokens.

In conversational AI, providing search results appended to a long-memory context produces "human-like" results.

15.

You are right, most LLM queries won't be exact matches, so you will need semantic search to hit those similar-but-not-identical questions.

16.

They've been doing this for years. This can be done using semantic search, which has been around since the dawn of AI. Using LLMs for this is just overkill and way too expensive.

17.

The problem of semantic search, particularly in the context of question answering over documents, is examined in this paper. Semantic search currently assumes that the answer to a question is semantically similar to the query, a supposition that can result in lower accuracy and reliability for the search engine. A novel solution, “Semantic Enrichment,” is proposed as an improvement over the conventional method, with an in-depth comparison between the two strategies provided.

18.

It's not obvious to me that vector size isn't relevant in semantic search. What is it about the training process for semantic search that makes that the case?

19.

Thanks! I missed that part.

The semantic search approach seems to focus the answers better than fine-tuning; at the cost of preloading the prompt with a lot of tokens, but with the benefit of a more constrained response.

20.

The part at the beginning where employees manually loaded possible search terms sounds like the semantic web to me. This project uses deep learning to automatically direct a whole space of possible search terms to an item, instead of just missing all the terms that were not manually added.

22.

Semantic search. The current search feels like it is barely doing something smarter than substring matching / basic regex. The search should be more like Google and less like matching substrings.

23.

To the OP: you can do both semantic search (embeddings) and keyword search. Some RAG techniques call for using both for better results. Nice product, by the way!

24.

Yes, I don't think the author fully thought through what they wrote. In essence they are saying they just want semantic search.

25.

I've worked in scaled enterprise search, both with lexical (Lucene-based, e.g. Elasticsearch) and semantic search engines (vector retrieval).

Vector retrieval that isn't contextualized in the domain is usually bad (RAG solutions call this "naive RAG" and make up for it with funky chunking and retrieval ensembles). Training custom retrievers and rerankers is often key, but it is quite an effort and still hard to generalize in a domain with broad knowledge.

Lexical-based searching provides nice guarantees and deterministic control over results (depending on how you index). Advanced querying capability is certainly useful here. Constructing/enriching queries with transformers is cool.

Reranking is often a nice ensemble addition, and it can be done with smaller models.

26.

Vector DBs have much higher 'semantic' recall than classical search engines if you want to ask questions about your documents or previous discussions.

27.

Elastic/OpenSearch have classically used BM25, but the latter recently added semantic search capabilities using neural embeddings in 2.4. Not sure about the former.

[1] https://opensearch.org/blog/opensearch-2-4-is-available-toda...

28.

I wonder how legacy search players like Elastic / Solr compete against the new-age startups combining semantic and regular search?

29.

Most 'traditional' search engines are likely to already be 'AI' search engines anyway - there is a good chance you are transforming your query into a vector in a latent space, and searching for documents that are themselves annotated with a vector approximately close to your search vector (so they are already semantic search).

The difference is more just what output format they give you the results in - bespoke text, or a set of links most relevant to the query. The extra layer of text2text processing probably doesn't add much if the top result already answers your query.

What people really want to do though, is to set custom criteria that synthesise data across multiple different sets to draw novel conclusions that aren't in any webpage. That is probably very expensive computationally though, hence there still being a need for bespoke sites that semantically index data for certain types of queries.

30.

Well semantic web search is less about traditional web search and more about semantic relationships between terms using ontologies.

If you search for "pipe A500", the search engine would deconstruct that. It would see that pipes are made of steel, with a coating and a grade for that steel. It would see A500 is a grade of steel. Pipes don't have a grade A500, but tubes do, and tubes share the other classifications that pipes have. It may then conclude that while 'A500' and 'pipe' are not linked directly, a different term may be very similar and a more direct match ('tube'), and thus return 'A500 tube' results.
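
A toy version of that expansion logic (the triples are hypothetical, just to make the "pipe A500" -> "A500 tube" rewrite concrete):

    # Minimal ontology-style query rewriting over hand-written triples.
    triples = {
        ("A500", "is_grade_of", "steel"),
        ("tube", "has_grade", "A500"),
        ("pipe", "made_of", "steel"),
        ("tube", "made_of", "steel"),
    }

    def grades_for(product):
        return {o for s, p, o in triples if s == product and p == "has_grade"}

    def related_products(product):
        # Products sharing a material are candidates for substitution.
        materials = {o for s, p, o in triples if s == product and p == "made_of"}
        return {s for s, p, o in triples
                if p == "made_of" and o in materials and s != product}

    product, grade = "pipe", "A500"
    if grade not in grades_for(product):
        # "pipe A500" has no direct link, so fall back to a related product
        # that does carry this grade.
        rewrites = [p for p in related_products(product) if grade in grades_for(p)]
        print(rewrites)  # ['tube'] -> return "A500 tube" results instead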

It seemed like the machine learning model was building relations between these different concepts and using them to improve the search, but without the intentional taxonomic mapping that semantic web uses. Semantic web is essentially more of a curated database of relationships, whereas the machine learning appears to be using another method to establish the relationships. I wonder if they're not doing basically the same thing.

31.

You might want to try semantic search instead of fiddling with keywords. Disclaimer: I'm building a plug-and-play semantic search API at https://kailualabs.com

32.

By definition, semantic search works best by similarity. Thus, the interface you are looking for is one that facilitates selecting one or multiple objects (i.e. video games).

33.

Typical retrieval methods break up documents into chunks and perform semantic search on relevant chunks to answer the question.
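
A bare-bones version of that chunk-then-retrieve flow (file name, question, and model are placeholders):

    # Split a document into overlapping chunks, embed them, and pull the
    # chunks most similar to the question as context for the answer.
    from sentence_transformers import SentenceTransformer, util

    def chunk(text, size=500, overlap=100):
        # Character-based chunking; real systems often split on sentences
        # or tokens instead.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    document = open("manual.txt").read()     # hypothetical source document
    chunks = chunk(document)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_emb = model.encode(chunks, normalize_embeddings=True)

    question = "How do I reset the device?"
    scores = util.cos_sim(model.encode(question, normalize_embeddings=True),
                          chunk_emb)[0]
    top = scores.topk(min(3, len(chunks))).indices.tolist()
    context = "\n\n".join(chunks[i] for i in top)   # fed to the LLM with the question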

34.

You can use semantic search, then feed that into the LLM.

There are many solutions already, look into Haystack by deepset, or if you are up for a challenge, you could make something in Langchain.

35.

A case study on how to easily create a search system with txtai, Qdrant and pretrained language models. The cool thing about semantic search is that none of the words used in a query have to appear in any document in our dataset, as the model is already capable of capturing synonyms. This is a huge advantage over conventional search algorithms like BM25.

36.

Part practical demonstration of differences between lexical and semantic search in Elasticsearch, with a Sentence Transformer model and the Quora dataset, part text embeddings explanation using Saussure's theory of language.

37.

The app performs full-text search as well as semantic search. The full-text search results are presented first, so if you scroll down to the bottom of the page you should see the semantically related results under the header "Semantic Search Results".

38.

There is a place for both:

If you are searching for a specific ID/ISBN or some random token, keyword search will always be useful and easy to implement.

If the goal of the search is more semantically ambiguous and cannot be expressed by a unique phrase, then neural search will be the way to go.

Most of the interesting applications of search will be semantically driven and therefore neural search has a big role to play.

39.

In particular because they are more or less using a term index I believe. This new world relies on vector indexes for semantic search.

40.

:D :D

Sorry, my question was about the quality of the results. Simply put: how do players who have good semantic search fare against "legacy" players who had good text search?

41.

> In a sense traditional search is just a really, really dumb LLM. They both ingest a bunch of text and build an "index" based on the text.

Not really... the significant difference is that a search engine is backed by a web crawler that runs continuously, discovering new content, whereas the LLM has a fixed training set that will only be updated infrequently (retraining is very slow and expensive). Also, once a new web page, or a new version of a web page, has been indexed by the crawler, you should be able to find it if your search terms match, while with an LLM all bets are off as to whether you can coax it into generating something based on the training set.

42.

Here's an example of semantic search:

Let's say your dataset has the words "Oceans are blue" in it.

With keyword search, if someone searches for "Ocean", they'll see that record, since it's a close match. But if they search for "sea" then that record won't be returned.

This is where semantic search comes in. It can automatically deduce semantic / conceptual relationships between words and return a record with "Ocean" even if the search term is "sea", because the two words are conceptually related.

The way semantic search works under the hood is using these things called embeddings, which are just big arrays of floating-point numbers, one for each record. It's an alternate way to represent words, in an N-dimensional space created by a machine learning model. Here's more information about embeddings: https://typesense.org/docs/0.25.0/api/vector-search.html#wha...
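
To make the "Oceans are blue" example concrete, here is roughly what the embedding comparison looks like (using a generic sentence-embedding model for illustration, not Typesense's internals):

    # Embed the record once, then compare query terms against it by cosine
    # similarity; related terms score high even without a keyword match.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    record = model.encode("Oceans are blue")      # a 384-dimensional float array
    print(record.shape)                           # (384,)

    for term in ["sea", "ocean", "stock market"]:
        sim = float(util.cos_sim(model.encode(term), record))
        print(f"{term:12s} cosine similarity = {sim:.2f}")
    # "sea" and "ocean" score well above "stock market", which is why the
    # record can be returned even though the literal keyword never appears.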

With the latest release, you essentially don't have to worry about embeddings (except maybe picking one of the model names to use and experimenting with it), and Typesense will do the semantic search for you by generating embeddings automatically.

43.

Isn't this as straightforward as semantic search over an embedded corpus? Unless I'm missing something, I don't think the backend engineering would take much.

44.

I see it as translation, not search. Search is already done fantastically well at small scale with simple indexing, and (before adversarial reactions from SEO) at internet scale with PageRank.

Asking if LLMs are really reasoning or not feels like an argument about terminology, like asking if A* really is route planning.

45.

A lot of the semantic web has evolved, spurred on by SEO and the need to accurately scrape data from web pages. The old semantic web seemed to be more of a solution in search of every problem. I'm not surprised that searches for "semantic web" are down - as most interest now is focused on structured data via microformats, LD-JSON and standards published at schema.org.

46.

Are there any semantic search implementations focused on small, local deploys?

E.g. I'm interested in local serverless setups (on desktop, mobile, etc.) that yield quality search results in the ~instant~ time frame, but that are also complete and accurate. I.e. I threw out investigating ANN because I wanted complete results, given the smaller datasets.
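
For small datasets, brute-force (exact) similarity is often fast enough that ANN isn't needed at all. A sketch with numpy and a small local model (model and documents are placeholders):

    # Exact nearest-neighbour semantic search: normalize embeddings once,
    # then one matrix-vector product per query scans every document, so
    # results are complete and deterministic.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # small enough for CPU
    docs = ["notes on 2021 taxes", "trip itinerary for Japan",
            "recipe: lentil soup", "meeting notes: search project"]
    doc_emb = model.encode(docs, normalize_embeddings=True)   # shape (N, 384)

    def search(query, k=3):
        q = model.encode(query, normalize_embeddings=True)
        scores = doc_emb @ q                  # cosine similarity via dot product
        order = np.argsort(-scores)[:k]       # full scan, no approximation
        return [(docs[i], float(scores[i])) for i in order]

    print(search("vacation plans"))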

47.

They don't mention BM25, which still outperforms much of semantic search. A fun exercise is to watch the benchmarks of the latest semantic embedding models and see that they still struggle to match good ol' BM25.

BM25 uses the relative statistical frequency of words to identify relevant material, along with some adjustments. It doesn't use ML at all, but it works very well, especially for technical content.
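
For reference, those "adjustments" are the two tunable terms in the standard BM25 scoring function:

    \mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t)\,
        \frac{f(t, D)\,(k_1 + 1)}
             {f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

where f(t, D) is the frequency of term t in document D, |D| is the document length, avgdl is the average document length in the corpus, and k_1 (term-frequency saturation) and b (length normalization) are typically around 1.2 and 0.75.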

SPLADE is capable in some areas but slow, and oftentimes it doesn't provide much of a benefit (or is worse) versus BM25 for technical searches, where specific technical words don't have many synonyms for it to pull in.

The best search systems today use a mix of semantic search and BM25 or SPLADE, depending on the type of material and the speed required.

