Advanced keyword search is built upon natural language processing (NLP)

A search engine needs to “process” the language in a search bar before it can execute a query. The process could be as simple as comparing…

· January 10th, 2023 – 9 minute read

A search engine needs to “process” the language in a search bar before it can execute a query. The process could be as simple as comparing the query exactly as written to the content in the index. But classic keyword search is more advanced than that, because it involves tokenizing and normalizing the query into smaller pieces – i.e., words and keywords. This process can be easy (where the words are separated by spaces) or more complex (like Asian languages, which do not use spaces, so the machine needs to recognize the words).

Once the query is broken down into smaller pieces, the search engine can correct misspellings and typos, apply synonyms, reduce the words further into roots, manage multiple languages, and more – all of which enable the user to type a more “natural” query.

Natural language keywords

On average, people type single-word or short-phrase queries to describe the items they are searching for. That is, they use keywords, not whole sentences or questions. (Though that’s changing, due to voice technology and the success of Google’s question and answer results.)

This kind of keyword search, both the simple and more advanced versions of it, has been around since the beginning of search. The more natural it is, the more advanced the techniques become. Search engines need to structure incoming queries before they can look up results in the search index. This pre-processing technology falls into what we call Natural Language Processing, or NLP, which is an umbrella term for any technology that enables computers to understand human language, whether written or spoken.

Natural language processing (“NLP”) takes text and transforms it into pieces that are easier for computers to use. Some common NLP tasks are removing stop words, segmenting words, or splitting compound words. NLP can also identify parts of speech, or important entities within text.

— Dustin Coates, Product and GTM Manager at Algolia

We’ve written quite a lot about natural language processing (NLP) here at Algolia. We’ve defined NLP, compared NLP vs NLU, and described some popular NLP/NLU applications. Additionally, our engineers have explained how our engine processes language and handles multilingual search. In this article, we’ll look at how NLP drives keyword search, which is an essential piece of our hybrid search solution that also includes AI/ML-based vector embeddings and hashing.

To understand the nexus between keywords and NLP, it’s important to start off by diving deep into keyword search.

What is keyword search?

At its most basic, a keyword search engine compares the text of a query to the text of each record in a search index. Every record that matches (whether exact or similar) is returned by the search engine. Matching, as suggested, can be simple or advanced.

We use keywords to describe clothing, movies, toys, cars, and other objects. Most keyword search engines rely on structured data, where the objects in the index are clearly described with single words or simple phrases.

For example, a flower can be structured using tags, or “keys”, to form key-value pairs. The values (a large, red, summer, flower, with four petals) can be paired with their keys (size, color, season, type of object, and number of petals). The flower can also sell at a “price” of “4.99”.

We can represent this structure of keys and values as follows:

{ "name": "Meadow Beauty", "size": "large", "color": "red", "season": “summer”, "type of object": "flower", "number of petals": "4", "price": "4.99", "description": "Coming from the Rhexia family, the Meadow Beauty is a wildflower.” } All these key-value pairs make up a record that can be stored in a search index, so that a query like “red flower” will return all flower records where “type = flower” and “color = red”.

Additionally, a partial query like “red” can find flowers with “color = reddish-green”, because “red” is in “reddish”.

Many keyword search engines use manually-defined synonyms. Thus, a “blue” query can return “azure” flowers, if you explicitly tell the engine that “blue” and “azure” are synonyms.

Other techniques correct misspellings and typos. The query“4 pedels” contains a typo; a typo-tolerant engine will return correctly spelled flowers (“petals”). The engine can also treat “4” as a synonym for “four”. And It can also match the plural “petals” to the singular “petal”, based on them both having the same root “petal”.

Partial searches, like “4 pe”, can match “four petals” because most keyword search engines allow for prefix searching, which enables the important as you type feature, where a user can see search results or see query suggestions as they type.

One more example (among many others): transliteration. Transliteration maps the letters or sounds of one language to the letters or sounds of another language. For example, transliteration enables a user to type in Latin letters (e.g. a, b, c etc.) to search for Russian Cyrillic characters, or type in Japanese Hiragana to search for Katakana.

Keyword search’s relevance & ranking algorithm

A keyword search engines uses these language-processing techniques to create great relevance and ranking – the twin goals of a great search solution.

Relevance Relevance makes sure that all records that match a query are found:

Relevance relies on an intelligent matching that takes into account typo tolerance, partial word matching, spatial distance between matching words, the number of attributes that match, synonyms and Rules, natural language characteristics like stop words and plurals, geolocation, and many other intuitive aspects of what people would expect from search, especially in the Google era.

Ranking Ranking puts the records in order:

Ranking is how you order the records that a search returns so that the most accurate results appear earliest (on the first couple pages), while less accurate results appear later.

The keyword search algorithm To accomplish the best relevance and ranking, engineers need to design the best algorithm and data structure that will enable the best textual comparisons.

For relevance, there are many data structures and search algorithms available to choose from. However, keyword search is best served by using an inverted index and applying character-based comparisons. This means the search engine pre-processes the data so that each character (letters, numbers, and most characters on your keyboard) refers to one or more records that contain those characters. For example, when an engine compares “aardvark” to a search index, it will execute an upside-down lookup, attempting to find all records that contain the letters a-aa-aar-aard-aardv-aardva-aardvar-aardvark:

inverted search index

This inverted index can be adapted to allow for typos and other keyword search techniques.

Once the records are found, the final task is for the engine to rank the results, ensuring that the best matches show up at the top of the list. Again, there are different techniques, for example, statistical ranking based on the frequency of the words matched. The one we chose relies on a tie-breaking algorithm, which ranks records by applying a top-down tie-breaking, or testing, strategy similar to an elimination game.

A good example is to look at the first and second tests: typos and geolocation. A query for “theatre” in the US will return records with either “theater” or “theatre”, where the UK spelling is treated as a typo. Since the tie-breaking algorithm privileges the best matches, the records that match exactly (no typos = “theater”) will be selected as the best matches, and the UK spelling will be sent to the bottom of the list.

Next, the best records (with “theater”) will be subjected to the second test, geolocation: theaters located near the user (similar coordinates) will be selected as the best. The tie-breaking algorithm then continues to apply the next six tests (exact matches go before partial matches, distance between words in a two-word query, etc.) until all records are ranked.

Natural language processing A keyword-based relevance and ranking algorithm is built on natural language processing (NLP). NLP manages the complexity of language. For example, singular vs plural terms, verb inflections (present vs past tense, present participle, etc.), agglutinative or compound languages, and so forth. We’ll look at some NLP techniques below, but first we’ll define two NLP basics: tokenization and normalization.

What is tokenization? Tokenization breaks a larger text into smaller pieces. It may break a document into paragraphs, paragraphs into sentences, and sentences into “tokens.” Tokenization can be very difficult. For example, even something as simple as identifying a sentence in a paragraph is tricky. Is “Michael J. Fox” two separate sentences because it contains a period followed by a word with a capital letter?

What is normalization? A well known low-level step while doing NLP is normalization. The goal of this step is to standardize every query, to depend more on the letters than on the way it was typed. So instead of treating uppercase “Michael” different from lowercase “michael”, we normalize both to “michael”. We do similar normalizations with accents or special characters.

We find the canonical form of a letter: À → A and ß → ss. We find the lowercase form of the canonical form: À → A → a. Once parsed (tokenized and normalized), what can NLP do? Here’s a short list of some standard NLP techniques.

Transliteration: As mentioned above, transliteration allows you to search for characters of one language using the alphabet of another language. This is useful for languages like Russian (Latin to Cyrillic) and Japanese (Hiragana to Katakana), and Chinese (traditional to simplified). Stemming: Stemming is the process of converting words into their base forms by removing prefixes and suffixes. Stemming allows “run” and “running” to match, or “change” and “changing”, by stripping away the endings (“run”, “chang”). Keyword search is mostly concerned with noun stemming, such as the stem “pupp” for “puppy” and “puppies”. Lemmatization: Similar to stemming, lemmatization breaks words down into their base (or root) form, but does so by considering the context and morphological basis of each word. For example, lemmatization can convert irregular plurals, like “feet” to “foot”, or the French “œil” to “yeux”. (Note that the engine will first normalize “œ” to “oe”). Word segmentation: In English and many Latin-based languages, the space is a good approximation of a word divider (or word delimiter), although this concept has limits because of the variability in how each language combines and separates word parts. For example, many English compound nouns are variably written (ice box = ice-box = icebox). However, the space is not found in all written scripts, and without it, word segmentation becomes a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited. Agglutinating: In some languages (German, Dutch, Finnish, etc.) several nouns can be concatenated without space to form new words. For example, Baumhaus is a German word composed of Baum (tree), and Haus (house) which designates a “tree house”. A person searching for “Baumhaus” (tree house) may very well be interested in results containing something like “Haus in einem Baum” (house in a tree), not only “Baumhaus”. A more startling challenge is the Icelandic word “Vaðlaheiðarvegavinnuverkfærageymsluskúraútidyralyklakippuhringur”, which is a combined collection of smaller words with the meaning: “key ring of the key chain of the outer door to the storage tool shed of the road workers on the Vaðlaheiði plateau”. Very precise, and probably useful for road workers who need to open a tool shed in the Vaðlaheiði plateau. Parts of speech tagging (POS-tagging): Also called grammatical tagging, POS-tagging is the process of determining the part of speech of a particular word or piece of text based on its use and context. It identifies “make” as a verb in “I can make a paper plane” and as a noun in “What make of car do you own?”

Going further with NLP – Semantics

Keyword search technology, laced with a more AI-driven technology, including NLU (natural language understanding) and vector-based semantic search, can take search to a new level.

Here are only a few examples:

Summarizing news articles and blog posts Detecting the language of a webpage to offer a translation Identifying key topics in the transcript of a sales call Categorizing the emotions expressed in a tweet A bot to serve customer service requests Serving up the right products for a search request Smart voice assistants Here’s a sample list of semantic-based NLP techniques:

Entity extraction: Entity extraction has become particularly important for voice search. As the name might suggest, entity extraction is a way to identify different elements of a query — people, places, dates, frequencies, quantities, etc. — to help a machine “understand” the information it contains. Entity extraction is a very good solution for overcoming simple keyword search limitations. Word sense: Finding the multiple meanings of words, for example, the verb ‘make’ in ‘make the grade’ (achieve) is different from ‘make a bet’ (place). Co-reference resolution: Treating two words as the same entity (e.g., ‘she’ = ‘Mary’), or identifying a metaphor or an idiom in the text (e.g., ‘bear’ isn’t an animal but a large hairy person). Sentiment analysis: Attempting to extract subjective qualities — attitudes, emotions, sarcasm, confusion, suspicion — from text.

Conclusion – NLP reduces the ambiguity of keywords

Human language is filled with ambiguities that make it difficult to write software that accurately determines the intended meaning of text or voice data. Homonyms, homophones, sarcasm, idioms, metaphors, grammar and usage exceptions, variations in sentence structure — these are just a few of the irregularities of human language that took humans years to learn, but that programmers must teach natural language-driven applications to recognize and understand accurately from the start, if those applications are going to be useful.

To address the most complex aspects of language, NLP has changed with the times. Central to this change is artificial intelligence, in particular machine learning models like vectors and large language models (LLMs). In the area of translation and natural language understanding (NLU), machine learning has vastly simplified and improved the search process. Vector spaces have removed the need to manually create synonyms. In this article, we focused on the purposes and how-to of keyword search, and on certain essential NLP techniques. NLP continues to evolve, to empower the query-level functionality of keyword search – which will remain as the go-to method to handle the simple queries that we perform on a daily basis.

nlp