How NLP & NLU Work For Semantic Search

Natural language processing (NLP) and natural language understanding (NLU) are two often-confused technologies that make search more intelligent and ensure people can search and find what they want.

This intelligence is a core component of semantic search.

NLP and NLU are why you can type “dresses” and find that long-sought-after “NYE Party Dress” and why you can type “Matthew McConnahey” and get Mr. McConnaughey back.

With these two technologies, searchers can find what they want without having to type their query exactly as it’s found on a page or in a product.

NLP is one of those things that has built up such a large meaning that it’s easy to look past the fact that it tells you exactly what it is: NLP processes natural language, specifically into a format that computers can understand.

These kinds of processing can include tasks like normalization, spelling correction, or stemming, each of which we’ll look at in more detail.

NLU, on the other hand, aims to “understand” what a block of natural language is communicating.

It performs tasks that can, for example, identify verbs and nouns in sentences or important items within a text. People or programs can then use this information to complete other tasks.

Computers seem advanced because they can do a lot of actions in a short period of time. However, in a lot of ways, computers are quite daft.

They need the information to be structured in specific ways to build upon it. For natural language data, that’s where NLP comes in.

It takes messy data (and natural language can be very messy) and processes it into something that computers can work with.

Text Normalization

When searchers type text into a search bar, they are trying to find a good match, not play “guess the format.”

For example, to require a user to type a query in exactly the same format as the matching words in a record is unfair and unproductive.

We use text normalization to do away with this requirement so that the text will be in a standard format no matter where it’s coming from.

As we go through different normalization steps, we’ll see that there is no approach that everyone follows. Each normalization step generally increases recall and decreases precision.

A quick aside: “recall” means a search engine finds all of the results that are known to be good.

Precision means a search engine finds only good results.

Search results could have 100% recall by returning every document in an index, but precision would be poor.

Conversely, a search engine could have 100% precision by only returning documents that it knows to be a perfect fit, but it will likely miss some good results.

Again, normalization generally increases recall and decreases precision.

Whether that movement toward one end of the recall-precision spectrum is valuable depends on the use case and the search technology. It isn’t a question of applying all normalization techniques but deciding which ones provide the best balance of precision and recall.
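To make the trade-off concrete, here is a minimal sketch of how precision and recall could be computed for a single query. The function name, the document IDs, and the ten-document index are made up for illustration.

```python
def precision_and_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: IDs the engine returned.
    relevant: IDs known to be good results for the query.
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical index of ten documents, two of which are good results.
index = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
good = {"d1", "d2"}

everything = precision_and_recall(index, good)  # return the whole index
cautious = precision_and_recall({"d1"}, good)   # return one sure match
```

Returning the whole index finds every good result (recall of 1.0) but only two of the ten results are good (precision of 0.2). Returning the single sure match flips the picture: precision of 1.0, recall of 0.5.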

Letter Normalization

The simplest normalization you could imagine would be the handling of letter case.

In English, at least, words are generally capitalized at the beginning of sentences, occasionally in titles, and when they are proper nouns. (There are other rules, too, depending on whom you ask.)

But in German, all nouns are capitalized. Other languages have their own rules.

These rules are useful. Otherwise, we wouldn’t follow them.

For example, capitalizing the first words of sentences helps us quickly see where sentences begin.

That usefulness, however, is diminished in an information retrieval context.

The meanings of words don’t change simply because they are in a title and have their first letter capitalized.

Even trickier is that there are rules, and then there is how people actually write.

If I text my wife, “SOMEONE HIT OUR CAR!” we all know that I’m talking about a car and not something different simply because the word is capitalized.

We can see this clearly by reflecting on how many people don’t use capitalization when communicating informally – which is, incidentally, how most case-normalization works.

Of course, we know that sometimes capitalization does change the meaning of a word or phrase. We can see that “cats” are animals, and “Cats” is a musical.

In most cases, though, the precision gained by not normalizing on case is outweighed by the recall it costs.

The difference between the two is easy to tell via context, too, which we’ll be able to leverage through natural language understanding.

While less common in English, handling diacritics is also a form of letter normalization.

Diacritics are the marks, or “glyphs,” attached to letters, as in á, ë, or ç.

Words can otherwise be spelled the same, but added diacritics can change the meaning. In French, “élève” means “student,” while “élevé” means “elevated.”

Nonetheless, many people will not include the diacritics when searching, and so another form of normalization is to strip all diacritics, leaving behind the simple (and now ambiguous) “eleve.”
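As a rough sketch, both kinds of letter normalization can be done with Python’s standard library. The function name is an assumption for illustration, and a production engine may well handle language-specific cases differently.

```python
import unicodedata


def normalize_letters(text: str) -> str:
    """Fold case and strip diacritics from a piece of text."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more aggressive lower() meant for caseless matching.
    return without_marks.casefold()


student = normalize_letters("Élève")
elevated = normalize_letters("élevé")
```

Both French words come out as “eleve,” which is exactly the gain in recall and loss in precision described above.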

Tokenization

The next normalization challenge is breaking down the text the searcher has typed in the search bar and the text in the document.

This step is necessary because word order does not need to be exactly the same between the query and the document text, except when a searcher wraps the query in quotes.

Breaking queries, phrases, and sentences into words may seem like a simple task: Just break up the text at each space.

Problems show up quickly with this approach. Again, let’s start with English.

Separating on spaces alone means that the phrase “Let’s break up this phrase!” yields us let’s, break, up, this, and phrase! as words.

For search, we almost surely don’t want the exclamation point at the end of the word “phrase.”

Whether we want to keep the contracted word “let’s” together is not as clear.

Some software will break the word down even further (“let” and “’s”) and some won’t.

Some will not break down “let’s” while breaking down “don’t” into two pieces.

This process is called “tokenization.”

We call it tokenization for reasons that should now be clear: What we end up with are not words but discrete groups of characters. This is even more true for languages other than English.

German speakers, for example, can merge words (more accurately “morphemes,” but close enough) together to form a larger word. The German word for “dog house” is “Hundehütte,” which contains the words for both “dog” (“Hund”) and “house” (“Hütte”).
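Here is a small sketch contrasting a split on spaces with a simple regular-expression tokenizer. The pattern is just one possible choice (it keeps contractions together and drops surrounding punctuation), not the rule any particular engine uses, and it does nothing to split compounds like “Hundehütte.”

```python
import re

# Runs of letters or digits, optionally joined by internal apostrophes.
TOKEN_PATTERN = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def split_on_spaces(text: str) -> list[str]:
    """The naive approach: break the text at each space."""
    return text.split(" ")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only word-like groups of characters."""
    return TOKEN_PATTERN.findall(text.lower())


naive = split_on_spaces("Let’s break up this phrase!")
tokens = tokenize("Let’s break up this phrase!")
```

The naive split leaves “phrase!” with its exclamation point attached, while the tokenizer yields “let’s,” “break,” “up,” “this,” and “phrase.”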

Nearly all search engines tokenize text, but there are further steps an engine can take to normalize the tokens. Two related approaches are stemming and lemmatization.

Stemming And Lemmatization

Stemming and lemmatization take different forms of tokens and break them down for comparison.

For example, take the words “calculator” and “calculation,” or “slowing” and “slowly.”

We can see there are some clear similarities.

Stemming breaks a word down to its “stem,” the base that the word and its variants share. Stemming is fairly straightforward; you could do it on your own.

What’s the stem of “stemming?”

You can probably guess that it’s “stem.” Often stemming means removing prefixes or suffixes, as in this case.

There are multiple stemming algorithms, and the most popular is the Porter Stemming Algorithm, which has been around since the 1980s. It is a series of steps applied to a token to get to the stem.

Stemming can sometimes lead to results that you wouldn’t foresee.

Looking at the words “carry” and “carries,” you might expect that the stem of each of these is “carry.”

The actual stem, at least according to the Porter Stemming Algorithm, is “carri.”

This is because stemming attempts to compare related words and break down words into their smallest possible parts, even if that part is not a word itself.
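To show the flavor of rule-based stemming, here is a toy suffix stripper. It is not the Porter Stemming Algorithm, which has many more steps and conditions. The handful of rules below are assumptions chosen only to cover the examples in this section.

```python
# (suffix, replacement) pairs, tried in order. Hypothetical toy rules.
SUFFIX_RULES = [
    ("ies", "i"),  # carries -> carri
    ("ing", ""),   # slowing -> slow
    ("ly", ""),    # slowly -> slow
    ("s", ""),     # says -> say
]


def toy_stem(token: str) -> str:
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    token = token.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Leave at least three characters so short words survive.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)] + replacement
            # Undouble a trailing consonant: stemm -> stem.
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    # A final "y" after a consonant becomes "i": carry -> carri.
    if len(token) > 2 and token.endswith("y") and token[-2] not in "aeiou":
        return token[:-1] + "i"
    return token


stems = [toy_stem(word) for word in ("carry", "carries", "stemming", "slowly")]
```

Even this toy version maps “carry” and “carries” to the same non-word, “carri,” which is all a search engine needs in order to treat them as a match.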

On the other hand, if you want an output that will always be a recognizable word, you want lemmatization. Again, there are different lemmatizers, such as NLTK using WordNet.

Lemmatization breaks a token down to its “lemma,” or the word which is considered the base for its derivations. The lemma from WordNet for “carry” and “carries,” then, is what we expected before: “carry.”

Lemmatization will generally not break down words as much as stemming, nor will as many different word forms be considered the same after the operation.

The stems for “say,” “says,” and “saying” are all “say,” while the lemmas from WordNet are “say,” “say,” and “saying.” To get these lemmas, lemmatizers are generally corpus-based.
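A minimal sketch of the idea behind a dictionary-backed lemmatizer: look for known exceptions first, then try removing a suffix, and accept the result only if it is a word the dictionary knows. The vocabulary, exception list, and rules here are tiny, hypothetical stand-ins for a real resource like WordNet.

```python
# Hypothetical, tiny stand-ins for a real lexical database.
VOCABULARY = {"carry", "say", "saying", "mouse", "potato"}
EXCEPTIONS = {"mice": "mouse"}
DETACHMENT_RULES = [("ies", "y"), ("es", ""), ("s", "")]


def lemmatize(token: str) -> str:
    """Return a lemma that is always a known word, or the token unchanged."""
    token = token.lower()
    if token in EXCEPTIONS:
        return EXCEPTIONS[token]
    if token in VOCABULARY:
        return token
    for suffix, replacement in DETACHMENT_RULES:
        if token.endswith(suffix):
            candidate = token[: -len(suffix)] + replacement
            # Only accept candidates that are real words.
            if candidate in VOCABULARY:
                return candidate
    return token


lemmas = [lemmatize(word) for word in ("carry", "carries", "say", "says", "saying")]
```

Because “saying” is itself in the vocabulary, it stays “saying,” which is why fewer word forms collapse together than with stemming.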

If you want the broadest recall possible, you’ll want to use stemming. If you want the best possible precision, use neither stemming nor lemmatization.

Which you go with ultimately depends on your goals, but most searches can generally perform very well with neither stemming nor lemmatization, retrieving the right results, and not introducing noise.

Plurals

If you decide not to include lemmatization or stemming in your search engine, there is still one normalization technique that you should consider.

That is the normalization of plurals to their singular form.

Generally, ignoring plurals is done through the use of dictionaries.

Even if “de-pluralization” seems as simple as chopping off an “-s,” that’s not always the case. The first problem is with irregular plurals, such as “deer,” “oxen,” and “mice.”

A second problem is pluralization with an “-es” suffix, such as “potatoes.” Finally, there are simply the words that end in an “s” but aren’t plural, like “always.”

A dictionary-based approach will ensure that you increase recall without introducing incorrect matches.

Just as with lemmatization and stemming, whether you normalize plurals is dependent on your goals.

Cast a wider net by normalizing plurals, a more precise one by avoiding normalization.

Usually, normalizing plurals is the right choice, and you can remove normalization pairs from your dictionary when you find them causing problems.
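The difference between chopping off an “-s” and using a dictionary can be sketched in a few lines. The plural-to-singular pairs below are a hypothetical sample, not a real dictionary.

```python
# Hypothetical sample of plural -> singular pairs.
PLURAL_TO_SINGULAR = {
    "dresses": "dress",
    "potatoes": "potato",
    "mice": "mouse",
    "oxen": "ox",
}


def naive_singular(token: str) -> str:
    """Chop a trailing "s", right or wrong."""
    return token[:-1] if token.endswith("s") else token


def dictionary_singular(token: str, pairs=PLURAL_TO_SINGULAR) -> str:
    """Only change tokens that the dictionary knows to be plurals."""
    return pairs.get(token, token)


# Removing a pair that causes problems is a one-line change.
without_oxen = {p: s for p, s in PLURAL_TO_SINGULAR.items() if p != "oxen"}
```

The naive version turns “always” into “alway” and “potatoes” into “potatoe,” and leaves “mice” alone. The dictionary version handles all three as intended.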

One area, however, where you will almost always want to introduce increased recall is when handling typos.

Typo Tolerance And Spell Check

We have all encountered typo tolerance and spell check within search, but it’s useful to think about why it’s present.

Sometimes, there are typos because fingers slip and hit the wrong key.

Other times, the searcher thinks a word is spelled differently than it is.

Increasingly, “typos” can also result from poor speech-to-text understanding.

Finally, words can seem like they have typos but really don’t, such as in comparing “scream” and “cream.”

The simplest way to handle these typos, misspellings, and variations is to avoid trying to correct them at all. Instead, some algorithms can compare tokens and measure how similar they are.

One of these is the Damerau-Levenshtein Distance algorithm.

This measure looks at how many edits are needed to go from one token to another.

You can then filter out all tokens with a distance that is too high.

(Two is generally a good threshold, but you will probably want to adjust this based on the length of the token.)

After filtering, you can use the distance for sorting results or feeding into a ranking algorithm.
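The filter-then-sort approach described above can be sketched as follows. The distance function implements the “optimal string alignment” variant of Damerau-Levenshtein, a common simplification that counts insertions, deletions, substitutions, and swaps of adjacent characters. The helper name `fuzzy_matches` and the sample tokens are assumptions for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped, as in "teh" for "the".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_matches(query_token, index_tokens, max_distance=2):
    """Keep tokens within max_distance, closest first."""
    scored = [(osa_distance(query_token, t), t) for t in index_tokens]
    return sorted((dist, t) for dist, t in scored if dist <= max_distance)


matches = fuzzy_matches("dres", ["dress", "dresses", "shoes"])
```

Here “dress” is one edit away from “dres” and is kept, while “dresses” and “shoes” are three edits away and are filtered out. Note that “scream” and “cream” are also only one edit apart, which is why distance alone cannot tell a typo from a different word.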

Many times, context can matter when determining if a word is misspelled or not. The word “scream” is probably correct after “I,” but not after “ice.”

Machine learning can be a solution for this by bringing context to this NLP task.

This spell check software can use the context around a word to identify whether it is likely to be misspelled and its most likely correction.
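A trained model is beyond the scope of a short example, but the idea of using context can be sketched with simple word-pair counts. The counts below are invented for illustration. A real system would learn them, or something far richer, from a large body of text.

```python
from collections import Counter

# Invented counts of (previous word, word) pairs. Purely illustrative.
BIGRAM_COUNTS = Counter({
    ("i", "scream"): 40,
    ("i", "cream"): 1,
    ("ice", "cream"): 120,
    ("ice", "scream"): 2,
})


def pick_in_context(previous_word: str, candidates: list[str]) -> str:
    """Choose the candidate seen most often after the previous word."""
    previous_word = previous_word.lower()
    # Counter returns 0 for pairs it has never seen.
    return max(candidates, key=lambda word: BIGRAM_COUNTS[(previous_word, word)])


after_i = pick_in_context("I", ["scream", "cream"])
after_ice = pick_in_context("ice", ["scream", "cream"])
```

The same two candidates get different answers depending on the word before them: “scream” after “I,” and “cream” after “ice.”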

Typos In Documents

One thing that we skipped over before is that words may not only have typos when a user types them into a search bar.

Words may also have typos inside a document.

This is especially true when the documents are made of user-generated content.

This detail is relevant because if a search engine is only looking at the query for typos, it is missing half of the information.

The best typo tolerance should work across both query and document, which is why edit distance generally works best for retrieving and ranking results.

Spell check can be used to craft a better query or provide feedback to the searcher, but it is often unnecessary and should never stand alone.

Natural Language Understanding

While NLP is all about processing text and natural language, NLU is about understanding that text.

Named Entity Recognition

A task that can aid in search is that of named entity recognition, or NER. NER identifies key items, or “entities,” inside of text.

While some people will call NER natural language processing and others will call it natural language understanding, what’s clear is that it can find what’s important within a text.

For the query “NYE party dress” you would perhaps get back an entity of “dress” that is mapped to a type of “category.”

NER will always map an entity to a type, from as generic as “place” or “person,” to as specific as your own facets.

NER can also use context to identify entities.

A query of “white house” may refer to a place, while “white house paint” might refer to a color of “white” and a product category of “paint.”
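Production NER usually relies on trained models, but a rule-based toy is enough to show the inputs and outputs. In this sketch, the lists of places and facet terms and the one context rule (a product category in the query means we read the other words as facets) are all assumptions made for the example.

```python
# Hypothetical lookup tables standing in for a trained model.
PLACES = {"white house"}
FACETS = {
    "white": "color",
    "red": "color",
    "paint": "category",
    "dress": "category",
    "shoes": "category",
}


def tag_entities(query: str) -> list[tuple[str, str]]:
    """Return (entity, type) pairs found in the query."""
    tokens = query.lower().split()
    facet_hits = [(t, FACETS[t]) for t in tokens if t in FACETS]
    has_category = any(kind == "category" for _, kind in facet_hits)
    # Context rule: without a product category, prefer the place reading.
    if not has_category:
        normalized = " ".join(tokens)
        for place in PLACES:
            if place in normalized:
                return [(place, "place")]
    return facet_hits


place_query = tag_entities("white house")
paint_query = tag_entities("white house paint")
```

The first query comes back as a single “place” entity, while the second comes back as a “color” and a “category.”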

Query Categorization

Named entity recognition is valuable in search because it can be used in conjunction with facet values to provide better search results.

Recalling the “white house paint” example, you can use the “white” color and the “paint” product category to filter down your results to only show those that match those two values.

This would give you high precision.

If you don’t want to go that far, you can simply boost all products that match one of the two values.

Query categorization can also help with recall.

For searches with few results, you can use the entities to include related products.

Imagine that there are no products that match the keywords “white house paint.”

In this case, leveraging the product category of “paint” can return other paints that might be a decent alternative, such as that nice eggshell color.
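Here is a minimal sketch of that behavior: filter strictly on every recognized entity, then fall back to the category alone when nothing matches. The product records and function names are hypothetical.

```python
# Hypothetical catalog with no white paint in stock.
PRODUCTS = [
    {"name": "Eggshell interior paint", "color": "eggshell", "category": "paint"},
    {"name": "Red exterior paint", "color": "red", "category": "paint"},
    {"name": "White picket fence", "color": "white", "category": "fencing"},
]


def filter_by_entities(entities: dict, products: list[dict]) -> list[dict]:
    """Keep products whose facets match every recognized entity."""
    return [
        p for p in products
        if all(p.get(facet) == value for facet, value in entities.items())
    ]


def search(entities: dict, products: list[dict]) -> list[dict]:
    """High precision first; widen to the category if nothing matches."""
    strict = filter_by_entities(entities, products)
    if strict:
        return strict
    category = entities.get("category")
    if category is None:
        return []
    return filter_by_entities({"category": category}, products)


results = search({"color": "white", "category": "paint"}, PRODUCTS)
names = [p["name"] for p in results]
```

With no white paint in the catalog, the strict filter finds nothing, so the search falls back to the “paint” category and returns the eggshell and red paints rather than the white fence.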

Document Tagging

Another way that named entity recognition can help with search quality is by moving the task from query time to ingestion time (when the document is added to the search index).

When ingesting documents, NER can use the text to tag those documents automatically.

These documents will then be easier to find for the searchers.

Searchers can then reach the right products through facet values, either by filtering explicitly or by letting the search engine apply query-categorization filters automatically.
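Tagging at ingestion time might look something like this sketch, which uses a simple term lookup in place of a real NER model. The facet terms and the document fields (`title`, `description`, `tags`) are assumptions for illustration.

```python
import re

# Hypothetical term -> facet lookup standing in for a trained model.
FACET_TERMS = {
    "white": "color",
    "red": "color",
    "paint": "category",
    "dress": "category",
}


def tag_document(document: dict) -> dict:
    """Return a copy of the document with facet tags found in its text."""
    text = f"{document['title']} {document.get('description', '')}".lower()
    tags: dict[str, set[str]] = {}
    for word in re.findall(r"[^\W_]+", text):
        facet = FACET_TERMS.get(word)
        if facet:
            tags.setdefault(facet, set()).add(word)
    # Sort the values so the stored tags are stable across runs.
    return {**document, "tags": {f: sorted(v) for f, v in tags.items()}}


tagged = tag_document({
    "title": "Red NYE Party Dress",
    "description": "A sequined dress for the big night.",
})
```

The work happens once, when the document is indexed, so filtering at query time is a cheap lookup on the stored tags.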

Intent Detection

Related to entity recognition is intent detection, or determining the action a user wants to take.

Intent detection is not the same as what we talk about when we say “identifying searcher intent.”

Identifying searcher intent is getting people to the right content at the right time.

Intent detection maps a request to a specific, pre-defined intent.

It then takes action based on that intent. A user searching for “how to make returns” might trigger the “help” intent, while “red shoes” might trigger the “product” intent.

In the first case, you could route the search to your help desk search.

In the second one, you could route it to the product search. This isn’t so different from what you see when you search for the weather on Google.

Look, and notice that you get a weather box at the very top of the page. (Newly launched web search engine Andi takes this concept to the extreme, bundling search in a chatbot.)
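A keyword-based sketch shows the shape of intent detection: map the query to one of a fixed set of intents, then route on it. Real systems often use trained classifiers instead. The cue phrases and route names here are made up.

```python
# Hypothetical cue phrases that suggest the searcher wants help content.
HELP_CUES = ("how to", "return", "refund", "shipping", "contact")

# Hypothetical names for the search backends to route to.
ROUTES = {"help": "help_desk_search", "product": "product_search"}


def detect_intent(query: str) -> str:
    """Map a query to one of two pre-defined intents."""
    normalized = query.lower()
    if any(cue in normalized for cue in HELP_CUES):
        return "help"
    return "product"


def route(query: str) -> str:
    """Pick the search backend for the detected intent."""
    return ROUTES[detect_intent(query)]


help_route = route("how to make returns")
product_route = route("red shoes")
```

The first query is routed to the help desk search and the second to the product search.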

For most search engines, intent detection, as outlined here, isn’t necessary.

Most search engines only have a single content type on which to search at a time.

When there are multiple content types, federated search can perform admirably by showing multiple search results in a single UI at the same time.

Other NLP And NLU tasks

There are plenty of other NLP and NLU tasks, but these are usually less relevant to search.

Tasks like sentiment analysis can be useful in some contexts, but search isn’t one of them.

You could imagine using translation to search multi-language corpora, but it rarely happens in practice, and is just as rarely needed.

Question answering is an NLU task that is increasingly implemented into search, especially search engines that expect natural language searches.

Once again, you can see this on major web search engines.

Google, Bing, and Kagi will all immediately answer the question “how old is the Queen of England?” without needing to click through to any results.

Some search engine technologies have explored implementing question answering for more limited search indices, but outside of help desks or long, action-oriented content, the usage is limited.

Few searchers are going to an online clothing store and asking questions to a search bar.

Summarization is an NLU task that is more useful for search.

Much like with the use of NER for document tagging, automatic summarization can enrich documents. Summaries can be used to match documents to queries, or to provide a better display of the search results.

This better display can help searchers be confident that they have gotten good results and get them to the right answers more quickly.

Even including newer search technologies using images and audio, the vast, vast majority of searches happen with text. To get the right results, it’s important to make sure the search is processing and understanding both the query and the documents.

Semantic search brings intelligence to search engines, and natural language processing and understanding are important components.

NLP and NLU tasks like tokenization, normalization, tagging, typo tolerance, and others can help make sure that searchers don’t need to be search experts.

Instead, they can go from need to solution “naturally” and quickly.

