“lexical search vs semantic search”
Summary of results
Semantic search is a powerful tool that can greatly improve the relevance and quality of search results by understanding the intent and contextual meaning of search terms. However, it’s not without its limitations. One of the key challenges with semantic search is the assumption that the answer to a query is semantically similar to the query itself. This is not always the case, and it can lead to less than optimal results in certain situations.
https://fsndzomga.medium.com/the-problem-with-semantic-searc...
The point of vector search is to support semantic search. It makes sense that grep will outperform if you're just looking for verbatim occurrences of a string.
Isn't the big problem that BM25 (and friends) will help you find (and rank) exact search terms (or stemmed varieties of that search term), whereas semantic search can typically find items out-of-dictionary but "close" semantically? SPLADE, on my reading of it, seems to do a "pre-materialization" of the out-of-dictionary part.
Traditional search engines such as ES and Lucene rely primarily on bag-of-words retrieval and keyword matching, e.g. BM25, TF-IDF, etc. Vector databases such as Milvus allow for _semantic_ search.
Here's a highly simplified example: if my query was "fields related to computer science", a semantic engine could return "statistics" and "electrical engineering" while avoiding things such as "social science" and "political science".
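To make that concrete, here's a minimal sketch of how an embedding model could produce that ranking; the sentence-transformers package and model name are illustrative choices, and actual scores depend on the model:

```python
# Rank candidate fields by cosine similarity to the query embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, not a recommendation

query = "fields related to computer science"
candidates = ["statistics", "electrical engineering", "social science", "political science"]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Higher cosine similarity = semantically closer to the query.
scores = util.cos_sim(query_emb, cand_embs)[0]
for cand, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {cand}")
```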
Is there something (besides the convenience of being already in the proper vector formats) unique about the combination of semantic search and LLMs that makes it a better fit?
I guess I'm surprised that the approach is to use semantic search here, where in regular search, completely different algorithms (Lucene?) have "won".
I think there are a couple things worth noting here:
Semantic search performs well at capturing documents that keyword search misses. As noted in the article, when searching for exact keywords, keyword search will outperform semantic search. It is when users do not know the exact phrase they are looking for that semantic search shines.
Semantic search should only be a part of your search system, not your entire search system. We find that combining keyword search + semantic search and then using a reranker gives the best results. It is best if the reranker is fine-tuned on your search history, but general cross-encoders perform surprisingly well.
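A rough sketch of that hybrid pipeline, assuming the rank_bm25 and sentence-transformers packages (model names are illustrative, and a production system would use a proper index rather than in-memory scoring):

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = ["..."]  # your corpus of documents

bm25 = BM25Okapi([d.lower().split() for d in docs])
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = encoder.encode(docs, convert_to_tensor=True)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hybrid_search(query, k=10):
    # Candidate generation: top-k from each retriever, then union.
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_top = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])[:k]
    sem_scores = util.cos_sim(encoder.encode(query, convert_to_tensor=True), doc_embs)[0]
    sem_top = sorted(range(len(docs)), key=lambda i: -float(sem_scores[i]))[:k]
    candidates = list(set(bm25_top) | set(sem_top))
    # Rerank: the cross-encoder scores each (query, doc) pair jointly,
    # which is slower than either retriever but usually more accurate.
    rerank_scores = reranker.predict([(query, docs[i]) for i in candidates])
    ranked = sorted(zip(candidates, rerank_scores), key=lambda x: -x[1])
    return [(docs[i], float(s)) for i, s in ranked[:k]]
```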
you can first do a keyword-based search (inverted index, etc.) and then run semantic search over the results of that first step
Keyword or text-matching-based search is likely still more popular due to how long it's been around, its simplicity, and the tooling built around it. Most companies who have internal search are most likely not using vector/semantic search, but are doing something more basic.
This sounds a lot like semantic web search? Can anyone share how it differs?
In my experience semantic search is great for finding implicit relationships (bad guy => villain) but sometimes fails in unpredictable ways on more elementary matches (friends => friend). That's why it can be good to combine semantic search with something like BM25, which is what I use in my blog search [1]. Term-frequency algorithms like TF-IDF and BM25 are also lightning fast compared to semantic search.
[1] https://lukesalamone.github.io/posts/rolling-my-own-blog-sea...
We’ve had good results with semantic search. We use it because keyword search doesn’t handle minor changes in words gracefully and semantic search does.
The problem I find with semantic search is that first I have to read and understand somebody else's definitions before I can search within the confines of the ontology.
The problem I have with ML-guided search is that the ML takes the web-average view of what I mean, which I sometimes need to understand and then try to work around when it's wrong. It can become impossible to find stuff off the beaten track.
The nice thing about keyword and exact text searching with fast iteration is it's my mental model that is driving the results. However if it's an area I don't know much about there is a chicken and egg problem of knowing which words to use.
I remember when "semantic search" was the Next Big Thing (back when all we had were simple keyword searches).
I don't know enough about the internals of Google's search engine to know if it could be called a "semantic search engine", but if not, it gets close enough to fool me.
But I feel like I'm still stuck on keyword searches for a lot of other things, like email (outlook and mutt), grepping IRC logs, searching for products in small online stores, and sometimes even things like searching for text in a long webpage.
I'm sure people have thought about these things: what technical challenges exist in improving search in these areas? Is it just a matter of integrating engines like the one linked here? Or maybe keyword searches are often Good Enough, so no one is really clamoring for something better.
Keyword search won for commodity use because semantic search was weaker and more complicated. More sophisticated orgs like Google switched to semantic/hybrid years ago as both problems got addressed.
Now that opensearch/elasticsearch have vector indexes, and high quality text embedding is a lot easier...
Vector search is incredibly powerful at matching on context or similarity. For example, automobile and car are semantically similar, and one will rank well for the other in a search.
Vector search, though, isn't as good at handling typos and not good at all when it comes to as-you-type searching. "Vehic" won't match on "auto", for example.
We believe there is a use for each of these approaches, combined in a single search, rather than choosing ahead of time (or through heuristics after the fact) which one to use.
(I'm a Principal PM for Semantic Search and Search Ranking at Algolia.)
Given the recent advances in semantic search, what's the SOTA search stack that people are using for combined keyword + semantic search these days?
In a strict "one question / one response" search, raw semantic search results are a great solution. And consumes far fewer tokens.
In conversational AI, providing search results appended to a long-memory context produces "human-like" results.
You are right, most LLM queries won't be exact matches, so you will need semantic search to hit those similar-but-not-identical questions.
They've been doing this for years. This can be done using semantic search, which has been around since the dawn of AI. Using LLMs for this is just overkill and way too expensive.
The problem of semantic search, particularly in the context of question answering over documents, is examined in this paper. Semantic search currently assumes that the answer to a question is semantically similar to the query, a supposition that can result in lower accuracy and reliability for the search engine. A novel solution, “Semantic Enrichment,” is proposed as an improvement over the conventional method, with an in-depth comparison between the two strategies provided.
It's not obvious to me that vector size isn't relevant in semantic search. What is it about the training process for semantic search that makes that the case?
Thanks! I missed that part.
The semantic search approach seems to focus the answers better than fine-tuning, at the cost of preloading the prompt with a lot of tokens, but with the benefit of a more constrained response.
The part at the beginning where employees manually loaded possible search terms sounds like the semantic web to me. This project uses deep learning to automatically direct a whole space of possible search terms to an item, instead of just missing all the terms that were not manually added.
The use case is a specific type of search:
Semantic search. The current search feels like it is barely doing something smarter than substring matching / basic regex. The search should be more like Google and less like matching substrings.
To the OP: you can do both semantic search (embeddings) and keyword search. Some RAG techniques call for using both for better results. Nice product, by the way!
I’ve worked in scaled enterprise search, both with lexical (Lucene-based, e.g. Elasticsearch) and semantic search engines (vector retrieval).
Vector retrieval that isn’t contextualized in the domain is usually bad (RAG solutions call this “naive RAG” … and make up for it with funky chunking and retrieval ensembles). Training custom retrievers and rerankers is often key, but it's quite an effort and still hard to generalize in a domain with broad knowledge.
Lexical search provides nice guarantees and deterministic control over results (depending on how you index). Especially useful here is advanced querying capability. Constructing/enriching queries with transformers is cool.
Reranking is often a nice ensemble addition, and can be done with smaller models.
vector DBs have much higher 'semantic' recall than classical search engines if you want to ask questions about your documents or previous discussions.
Elastic/OpenSearch have classically used BM25, but the latter recently added semantic search capabilities using neural embeddings in 2.4 [1]. Not sure about the former.
[1] https://opensearch.org/blog/opensearch-2-4-is-available-toda...
I wonder how legacy search players like Elastic/Solr compete against the new-age startups combining semantic and regular search?
Most 'traditional' search engines are likely to already be 'AI' search engines anyway - there is a good chance you are transforming your query into a vector in a latent space, and searching for documents that are themselves annotated with a vector approximately close to your search vector (so they are already semantic search).
The difference is more just what output format they give you the results in - bespoke text, or a set of links most relevant to the query. The extra layer of text2text processing probably doesn't add much if the top result already answers your query.
What people really want to do though, is to set custom criteria that synthesise data across multiple different sets to draw novel conclusions that aren't in any webpage. That is probably very expensive computationally though, hence there still being a need for bespoke sites that semantically index data for certain types of queries.
I tested 2 semantic search services — Elastic and Table-Search. Elastic was much harder to set up, had more limitations, and required more manual input, but showed better results. Table-Search was free and super easy to use but showed somewhat worse results.
Well semantic web search is less about traditional web search and more about semantic relationships between terms using ontologies.
If you search for "pipe A500", the search engine would deconstruct that. It would see pipes have steel, and a coating for steel, and a grade of steel. It would see A500 is a grade of steel. Pipes don't have a grade A500, but tubes do, and tubes have the other classifications that pipes do. It may then conclude that while 'A500' and 'pipe' are not linked directly, a different term may be very similar and a more direct match ('tube'), and thus return 'A500 tube' results.
It seemed like the machine learning model was building relations between these different concepts and using them to improve the search, but without the intentional taxonomic mapping that semantic web uses. Semantic web is essentially more of a curated database of relationships, whereas the machine learning appears to be using another method to establish the relationships. I wonder if they're not doing basically the same thing.
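A toy sketch of the curated-relationship version, with a made-up three-entry "ontology", just to show the mechanics being described:

```python
# Tiny hand-curated graph; the data is invented for illustration.
ontology = {
    "pipe": {"type": "product", "attributes": {"steel_grade", "coating"}},
    "tube": {"type": "product", "attributes": {"steel_grade", "coating"}},
    "A500": {"type": "steel_grade", "applies_to": {"tube"}},
}

def rewrite(query_terms):
    """If a grade doesn't apply to the product in the query, substitute a
    sibling product (same attribute slots) that it does apply to."""
    terms = set(query_terms)
    for term in query_terms:
        applies_to = ontology.get(term, {}).get("applies_to")
        if applies_to and not (terms - {term}) & applies_to:
            for prod in list(terms - {term}):
                for alt in applies_to:
                    if ontology.get(alt, {}).get("attributes") == \
                       ontology.get(prod, {}).get("attributes"):
                        terms = (terms - {prod}) | {alt}
    return terms

print(rewrite(["pipe", "A500"]))  # -> {'tube', 'A500'}
```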
You might want to try semantic search instead of fiddling with keywords. Disclaimer: I'm building a plug-and-play semantic search API at https://kailualabs.com
As others mentioned, you want to combine both semantic and tf-idf-like searches. The thing is that searches that carry no semantic weight (e.g. a part number), or that have special meaning compared to the corpus used to train your embedding model (e.g. the average Joe thinks of the building when seeing the word "bank", but if you work at a city planning firm you might only consider the bank of a river), fail spectacularly when using only semantic search.
Alternatively you can fine-tune your embedding model so that it has been exposed to these words/meanings. However, the best approach (from personal experience) is doing both and using some sort of query rewriting on the full-text search to keep only the "keywords".
By definition, semantic search works best by similarity. Thus, the interface you are looking for is one that facilitates selecting one or multiple objects (i.e. video games).
Typical retrieval methods break up documents into chunks and perform semantic search on relevant chunks to answer the question.
You can use semantic search, then feed that into the LLM.
There are many solutions already; look into Haystack by deepset, or if you are up for a challenge, you could make something in LangChain.
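A minimal sketch of the chunk-then-retrieve step, assuming sentence-transformers and naive fixed-size chunking (real systems typically split on sentences or sections, with overlap):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model

def chunk(text, size=200):
    # Simplest possible chunking: fixed windows of `size` words.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def top_chunks(document, question, k=3):
    chunks = chunk(document)
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    q_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, chunk_embs)[0]
    best = scores.argsort(descending=True)[:k]
    # These chunks would then be stuffed into the LLM prompt as context.
    return [chunks[int(i)] for i in best]
```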
Can you talk about how you implemented search-as-you-type? Doing so with semantic search seems tricky given the roundtrips needed to compute embeddings on the fly (assuming the use of OpenAI embeddings)
Part practical demonstration of differences between lexical and semantic search in Elasticsearch, with a Sentence Transformer model and the Quora dataset, part text embeddings explanation using Saussure's theory of language.
The app performs full text search as well as semantic search. The full text search results are presented first, so if you scroll down to the bottom of the page you should see the semantically related results under the header "Semantic Search Results"
There is a place for both:
If you are searching for a specific ID/ISBN or some random token, keyword search will always be useful and easy to implement.
If the goal of the search is more semantically ambiguous and cannot be expressed by a unique phrase, then neural search will be the way to go.
Most of the interesting applications of search will be semantically driven and therefore neural search has a big role to play.
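A naive dispatch sketch for that split; the regex is a crude stand-in for a real ID detector:

```python
import re

# A single token containing a digit (part numbers, ISBNs, SKUs, ...).
ID_LIKE = re.compile(r"^[\w-]*\d[\w-]*$")

def route(query):
    tokens = query.split()
    if len(tokens) == 1 and ID_LIKE.match(tokens[0]):
        return "keyword"   # exact/keyword index
    return "neural"        # embedding-based search

print(route("978-3-16-148410-0"))        # keyword
print(route("cozy games like Stardew"))  # neural
```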
In particular because they are, I believe, more or less using a term index. This new world relies on vector indexes for semantic search.
Traditional search can become "spamming text" nowadays because search engines like Google are quite broken and are trying to do too many things at once. I like to think that LLM-based search may be better for direct questions but traditional search is better for search queries, akin to a version of grep for the web. If that is what you need, then traditional search is better. But these are different use cases, in my view, and it is easy to confuse the two when the only interface is a single search box that accepts both kinds of queries.
One issue is that Google and other search engines do not really have much of a query language anymore and they have largely moved away from the idea that you are searching for strings in a page (like the mental model of using grep). I kinda wish that modern search wasn't so overloaded and just stuck to a clearer approach akin to grep. Other specialty search engines have much more concrete query languages and it is much clearer what you are doing when you search a query. Consider JSTOR [1] or ProQuest [2], for example. Both have proximity operators, which are extremely useful when searching large numbers of documents for narrow concepts. I wish Google or other search engines like Kagi would have proximity operators or just more operators in general. That makes it much clearer what you are in fact doing when you submit a search query.
[1] https://support.jstor.org/hc/en-us/articles/115012261448-Sea...
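For illustration, a proximity operator is simple to express over tokenized text; this is a toy version of the NEAR/n operators those databases offer:

```python
def near(text, a, b, n=5):
    """True if terms a and b occur within n token positions of each other."""
    tokens = [t.lower().strip(".,;:!?") for t in text.split()]
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

print(near("the bank near the river was flooded", "bank", "river", n=4))  # True
print(near("the bank near the river was flooded", "bank", "river", n=1))  # False
```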
Sorry, my question was about the quality of the results. Simply put: how do players who have good semantic search stack up against "legacy" players who had good text search?
> In a sense traditional search is just a really, really dumb LLM. They both ingest a bunch of text and build an "index" based on the text.
Not really ... the significant difference is that a search engine is backed by a web crawler that runs continuously, discovering new content, whereas an LLM has a fixed training set that is only updated infrequently (retraining is very slow and expensive). Also, once a new web page, or a new version of a web page, has been indexed by the crawler, you should be able to find it if your search terms match, while with an LLM all bets are off as to whether you can coax it to generate something based on the training set.
Here's an example of semantic search:
Let's say your dataset has the words "Oceans are blue" in it.
With keyword search, if someone searches for "Ocean", they'll see that record, since it's a close match. But if they search for "sea" then that record won't be returned.
This is where semantic search comes in. It can automatically deduce semantic / conceptual relationships between words and return a record with "Ocean" even if the search term is "sea", because the two words are conceptually related.
The way semantic search works under the hood is using these things called embeddings, which are just a big array of floating point numbers for each record. It's an alternate way to represent words, in an N-dimensional space created by a machine learning model. Here's more information about embeddings: https://typesense.org/docs/0.25.0/api/vector-search.html#wha...
With the latest release, you essentially don't have to worry about embeddings (except maybe picking one of the model names to use and experimenting) and Typesense will do the semantic search for you by generating embeddings automatically.
Isn't this as straightforward as semantic search over an embedded corpus? Unless I'm missing something, I don't think the backend engineering would take much.
I see it as translation, not search. Search is already done fantastically well at small scale with simple indexing, and (before adversarial reactions from SEO) at internet scale with PageRank.
Asking if LLMs are really reasoning or not feels like an argument about terminology, like asking if A* really is route planning.
A lot of the semantic web has evolved, spurred on by SEO and the need to accurately scrape data from web pages. The old semantic web seemed to be more of a solution in search of every problem. I'm not surprised that searches for "semantic web" are down - as most interest now is focused on structured data via microformats, LD-JSON and standards published at schema.org.
Are there any semantic search implementations focused on small, local deploys?
E.g. I'm interested in local serverless setups (on desktop, mobile, etc.) that yield quality search results in the ~instant~ time frame, but that are also complete and accurate. I.e. I threw out investigating ANN because I wanted complete results, given the smaller datasets.
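For what it's worth, at small scale exact (non-ANN) semantic search is just a matrix multiply; a sketch with numpy, assuming you already have L2-normalized embeddings from some local model:

```python
import numpy as np

def exact_search(query_emb, doc_embs, k=10):
    """query_emb: shape (d,); doc_embs: shape (n, d); both L2-normalized."""
    scores = doc_embs @ query_emb        # cosine similarity via dot product
    top = np.argsort(-scores)[:k]        # full sort is fine for small n
    return top, scores[top]

# e.g. 100k docs x 384 dims is ~150 MB as float32 and searches in
# milliseconds, with no index build and no recall loss.
```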