Retrieving keywords fuzzily from Japanese user search queries with the Smith-Waterman algorithm

IKKO Ohta
13 min read · Dec 26, 2022

TL;DR

For the search engine of our e-commerce (EC) service in Japan, I implemented approximate keyword matching based on the Smith-Waterman algorithm. It is robust enough against lengthy, fuzzy input, yet it is still dynamic programming rather than machine learning, so it is easier to manage.

Original: https://inside.luchegroup.com/entry/2022/12/15/135713
Self-translation by @samayotta with the permission of Giftmall Inc.

Introduction

Hi, I'm @samayotta, and I have been working at Giftmall Inc. in various software engineering fields, including search engine development.

Giftmall Inc. provides an e-commerce (EC) service in Japan. We have a lot of content, such as products and articles, tagged with specific keywords. We want to show the related content to the customer when the user query includes a tagged keyword. However, the query may contain the keyword in ways we do not expect. There is no clear-cut solution in this case, but we still want a better strategy than a naive one like str.Contains.

So, in this article, I will introduce approximate string matching algorithms. First, we review the classical Longest Common Subsequence (LCS) problem. Then we move to its more sophisticated relative, the Smith-Waterman algorithm.

  • This article is intended for readers who have heard of LCS. It also addresses issues specific to Japanese, which, unlike English, is not tokenized with spaces between words.

Requirement

We want to implement a function such that:

Given inputs

user query: string

keywords: string[]

Expectation

return type: string[]

  • Retrieve keywords from the user query and return the set of keywords whose total string length is maximal (see the signature sketch after this list).
  • The return value is a subset of the input keywords.
  • In our service, we will show content related to these keywords.
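As a sketch, the requirement above could be expressed as a Go signature like the following (the function and parameter names are mine, not a defined API):

// RetrieveKeywords returns the subset of keywords found (approximately) in
// the user query, maximizing the total string length of the returned keywords.
// The name and signature are illustrative.
func RetrieveKeywords(userQuery string, keywords []string) []string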

Limitation

  • The user query is mainly Japanese, which is often not tokenized and has no word breaks.
  • The user query may be an incomplete sentence.
  • The user query may contain spelling discrepancies.

Example

ex0.

user query = "Paul Smith wallet farthers' day"
keywords = ["Paul Smith", "wallet", "farthers' day", "father"]
# => ["Paul Smith", "wallet", "farthers' day"]

Take the lengthier father's day instead of father.

ex1.

user query = "ポール・スミス 財布 父の日"
keywords = ["ポール・スミス", "財布", "父の日", "父"]
# => ["ポール・スミス", "財布", "父の日"]

The user input is mainly in Japanese (this is a direct translation from ex0).

ex2.

user query = "ポールスミス 財布 父の日"
keywords = ["ポール・スミス", "財布", "父の日", "父"]
# => ["ポール・スミス", "財布", "父の日"]

“ポールスミス” should be resolved to “ポール・スミス” despite the spelling discrepancy. Japanese writers tend to use the interpunct “・” inconsistently.

ex3.

user query = "父の日のポールスミスの財布のプレゼントを教えて下さい。"
keywords = ["ポール・スミス", "財布", "父の日", "父"]
# => ["父の日", "ポール・スミス", "財布"]

This Japanese sentence has no word dividers. It means, “Please show me a good Father's Day present, like a wallet by Paul Smith.” ポールスミス should still be resolved to ポール・スミス. Also, we cannot simply split the sentence on the character の, one of the Japanese case-marker particles roughly equivalent to “of” in English, because 父の日 itself contains の and we want to maximize the sum of the lengths of the returned elements.

Considering available solutions

First, as an obvious baseline, we can use exact matching such as str.Contains. Adding synonyms to the given keywords by hand to cover spelling discrepancies improves the result, but it does not scale due to limited human resources.

Next, we could use modern natural language processing with deep learning. However, the high maintenance cost of a machine-learning solution can be an obstacle for a small team. Simple pre-trained tokenizers such as MeCab can also help, but our product's business logic should not depend on the unexpected behaviors of a pre-trained model.

Third, spelling discrepancies are typically handled with edit distance (Levenshtein distance). Edit distance measures how close two texts are by the number of edits: for example, the edit distance between color and colour is 1 because we only need to insert u. We can set a small threshold to treat such words as the same. However, this only works when we compare terms of similar length. In our case, the user input may be much longer than our keywords. For instance, the distance between 父の日のポールスミスの財布のプレゼントを教えて下さい。 and ポール・スミス is dominated by all the extra characters, so a threshold on the raw distance is useless.
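For reference, here is a minimal Go sketch of the Levenshtein distance (not part of our production code; the function name is mine):

// LevenshteinDistance returns the minimum number of single-character
// insertions, deletions, and substitutions needed to turn s into t.
func LevenshteinDistance(s, t string) int {
    sr, tr := []rune(s), []rune(t)
    dp := make([][]int, len(sr)+1)
    for i := range dp {
        dp[i] = make([]int, len(tr)+1)
        dp[i][0] = i // delete all characters of s[:i]
    }
    for j := 0; j <= len(tr); j++ {
        dp[0][j] = j // insert all characters of t[:j]
    }
    for i := 1; i <= len(sr); i++ {
        for j := 1; j <= len(tr); j++ {
            cost := 1
            if sr[i-1] == tr[j-1] {
                cost = 0
            }
            dp[i][j] = min(min(dp[i-1][j]+1, dp[i][j-1]+1), dp[i-1][j-1]+cost)
        }
    }
    return dp[len(sr)][len(tr)]
}

func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}

// ex. LevenshteinDistance("color", "colour") // => 1

Running it on the long Japanese query above yields a distance dominated by the characters that are simply not part of the keyword, which is why a raw distance threshold does not help here.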

So, we need a similar strategy that remains robust against lengthy input.

Longest Common Subsequence

The Longest Common Subsequence (LCS) of two strings is the longest sequence they share, allowing gaps. When I was a university student, I cannot say I liked LCS in class because it felt harder than the other introductory algorithms. Still, I cannot complain about it being taught in introductory courses, because the method is quite helpful for string matching problems.

We define the function ‘LCS’ as follows:

func LCS(s, t string) (lcsLength int, lcsString string)
// ex. LCS("abcdeeeef", "abcdefg") => 6, "abcdef"
// - LCS allows the missing 'g' and the redundant 'e's.
// - returns the length of the LCS and the LCS string itself

With LCS, we can handle the lengthy-user-input problem because it is more robust against long input than edit distance. The ratio ‘LCS length / full keyword length’ is a good measure for judging whether a keyword is valid under a given threshold. If the keyword is accepted, you can also remove the matched LCS subsequence from the user query.
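As a rough sketch (using the LCS signature above; the helper name and the 0.8 threshold are illustrative), the acceptance check could look like this:

// keywordMatches reports whether keyword is approximately contained in query,
// judged by the ratio of the LCS length to the keyword length.
func keywordMatches(query, keyword string, threshold float64) bool {
    lcsLength, _ := LCS(query, keyword)
    ratio := float64(lcsLength) / float64(len([]rune(keyword)))
    return ratio >= threshold
}

// ex. keywordMatches("Paul Smith wallet father's day", "Paul Smith", 0.8) // => true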

Let’s implement LCS with dynamic programming. Here is an example of the LCS DP table for the inputs above.

Suppose we have two queues, ["φ", "a", "b", "c", "d", "e", "e", "e", "e", "f"] and ["φ", "a", "b", "c", "d", "e", "f", "g"]. We get 1 point if we pop the same character from both queues at the same time. We can also pop from only one queue, without scoring a point. In the table, a diagonal move means popping from both queues and gaining 1 point, while a horizontal or vertical move means popping from only one queue. Including φ (the empty prefix) in the queues is a well-known trick; it represents the state before any queue operation. All we need to do is compute the best result achievable with these limited operations. If this feels difficult, filling the table in by hand in a spreadsheet makes it intuitive.
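As a small hand-worked example (not the figure from the original article), here is the filled table for LCS("ace", "abc"); each cell holds the best score reachable at that point, and the answer is the bottom-right cell:

        φ   a   b   c
    φ   0   0   0   0
    a   0   1   1   1
    c   0   1   1   2
    e   0   1   1   2

The LCS here is "ac", with length 2.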

I cannot explain this extensively due to limited space, but there are many excellent materials on this topic. For example, the well-known YouTube channel CS Dojo has videos about it.

By the way, to obtain the actual LCS string, we need to trace back through the DP table. Multiple LCSs may be valid, but they all have the same length, so we do not care which one we get.

Here is a Go implementation example.

// ex. ("abcdeeeef", "abcdefg") => 6, "abcdef"
func LongestCommonSubsequence(s, t string) (int, string) {
    sAsRunes := []rune(s)
    tAsRunes := []rune(t)
    sLen := len(sAsRunes)
    tLen := len(tAsRunes)

    dp := make([][]int, sLen+1)
    for i := 0; i < sLen+1; i++ {
        dp[i] = make([]int, tLen+1)
    }

    // fill the DP table
    for i := 1; i <= sLen; i++ {
        for j := 1; j <= tLen; j++ {
            if sAsRunes[i-1] == tAsRunes[j-1] {
                dp[i][j] = dp[i-1][j-1] + 1
            } else {
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
            }
        }
    }

    // construct lcs by backtracking
    lcsLen := dp[sLen][tLen]
    lcs := make([]rune, lcsLen)
    backtrack(sAsRunes, tAsRunes, &dp, sLen, tLen, &lcs, lcsLen)
    return lcsLen, string(lcs)
}

func backtrack(s, t []rune, dp *[][]int, i, j int, lcs *[]rune, lcsIndex int) {
    // basecase
    if i == 0 || j == 0 {
        return
    }

    if s[i-1] == t[j-1] {
        backtrack(s, t, dp, i-1, j-1, lcs, lcsIndex-1)
        (*lcs)[lcsIndex-1] = s[i-1]
    } else {
        if (*dp)[i-1][j] >= (*dp)[i][j-1] {
            backtrack(s, t, dp, i-1, j, lcs, lcsIndex)
        } else {
            backtrack(s, t, dp, i, j-1, lcs, lcsIndex)
        }
    }
}

func max(a, b int) int {
    if a < b {
        return b
    }
    return a
}

Smith-Waterman algorithm

However, LCS still has corner cases. Long user input can create accidental coincidences between strings, as in the following snippet.

LCS("Time flies like an arrow", "mellow") =>  "mellow" // tiME fLies Like an arrOW
LCS("エルメスのバッグ", "エコバッグ") // => "エバッグ"

Under a threshold of 0.8, エコバッグ (eco bag) wrongly matches エルメスのバッグ (Hermès bag), because 4 of its 5 characters match. The desirable behavior of our function is:

("エルメスのバッグ", "エコバッグ") => "バッグ"

The Smith-Waterman algorithm improves our result by penalizing overly long gaps and mismatches. Its result string is called the ‘local alignment’ because it is confined to a limited part of the original strings. The Smith-Waterman algorithm and its local alignment are robust tools for various real-world problems. For instance, bioinformatics scientists use it to examine the local similarity of two DNA sequences, and fzf, the famous CLI fuzzy finder, is based on it.

We define the following function based on it:

// ex. ("エルメスのバッグ", "エコバッグ", (penaltyConfig)) => "バッグ"
func GetLocalAlignment(s, t string, config *LocalAlignmentConfig) string

Here is a Smith-Waterman DP table example (MatchScore=3, UnmatchPenalty=3, GapPenalty=2).

As the table shows, this is essentially a weighted version of LCS.

type LocalAlignmentConfig struct {
    MatchScore     int
    UnmatchPenalty int
    GapPenalty     int
}

We define three hyperparameters: a match score (when we pop both queues and the characters match), an unmatch penalty (when we pop both but the characters differ), and a gap penalty (when we pop only one queue). Instead of the simple LCS scoring rule, we apply these parameters and look for the highest-scoring thread in the DP table.

This snippet is the pseudo-code of the DP[i][j] score.

var diagonalScore int
if sAsRunes[i-1] == tAsRunes[j-1] {
    diagonalScore = dp[i-1][j-1] + config.MatchScore // diagonal on match
} else {
    diagonalScore = dp[i-1][j-1] - config.UnmatchPenalty // diagonal on unmatch
}
dp[i][j] = maxOfFour(
    0,
    diagonalScore,                 // diagonal
    dp[i-1][j]-config.GapPenalty,  // down
    dp[i][j-1]-config.GapPenalty,  // right
)

The local alignment is the thread of non-zero cells in the table. Note that backtracking starts from the cell with the maximum score.

By the way, we need to find an appropriate set of weights through experiments.

config := util.NewLocalAlignmentConfig(3, 3, 2)
GetLocalAlignment("エルメスのバッグ", "エコバッグ", config) // => "バッグ"

In the above example, (3, 3, 2) works well. However, after examining more inputs and outputs, we find that the penalties should be larger than the match score. So we will use (3, 10, 10) in the following part.

Customizing for our project

Let’s try to solve another corner case.

config := util.NewLocalAlignmentConfig(3, 10, 10)
GetLocalAlignment("イヴサンローラン", "イヴ・サンローラン", config) // => "サンローラン"

Interruption by ‘・’

Now, with the scores (3, 10, 10), the first part of イヴサンローラン (Yves Saint Laurent) is cut off, even though our goal is to match almost the whole string. Unfortunately, this yields only 6 hits out of the keyword's 9 characters, which is not acceptable under a threshold of 0.8. The table shows the ‘・’ blocking our thread. Can we allow such forgivable mismatches?

For each state transition, we can assign a different score depending on the character involved. For example, by giving certain characters an exceptional score, we can effectively forbid skipping them, or make skipping them cheaper than skipping others. In our task, skipping ‘・’ or a space is quite common, so we make those gaps penalty-free. Conversely, we may want to forbid skipping の by giving it a specially heavy penalty, because of the grammatical problem. An exact match on の is still allowed under this rule, so 父の日 can still be matched efficiently.

Here is our improvement against the problem.

type LocalAlignmentConfig struct {
    defaultMatchScore     int
    defaultUnmatchPenalty int
    defaultGapPenalty     int
    penaltyMap            map[rune]int
}

// ex.
// config := util.NewLocalAlignmentConfig(3, 10, 10, map[rune]int{
//     'の': 100, // heavy penalty
//     ' ': 0,    // penalty-free
//     '・': 0,
// })
// GetLocalAlignment("イヴサンローラン", "イヴ・サンローラン", config) => "イヴサンローラン"
func GetLocalAlignment(s, t string, config *LocalAlignmentConfig) string

After getting the local alignment, the new ratio, ‘local alignment length / full keyword length’, is a better metric than the simple LCS ratio.
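A minimal sketch of this acceptance check (the helper name and threshold are illustrative; the logic that picks the final keyword set with the maximum total length is omitted here):

// keywordMatchesByAlignment reports whether keyword is approximately contained
// in query, judged by the local alignment length relative to the keyword length.
func keywordMatchesByAlignment(query, keyword string, config *LocalAlignmentConfig, threshold float64) bool {
    alignment := GetLocalAlignment(query, keyword, config)
    ratio := float64(len([]rune(alignment))) / float64(len([]rune(keyword)))
    return ratio >= threshold
}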

Lastly, we need to describe the disadvantages of the solution.

  • Time complexity. str.Contains (implemented with KMP) is O(m+n) in the worst case, while LCS and our improved version are O(mn).
  • Hyperparameter tuning is required. We need to find a good score set by testing, and the constant score values slightly reduce code readability and maintainability.

Here is our final version:

type LocalAlignmentConfig struct {
    defaultMatchScore     int
    defaultUnmatchPenalty int
    defaultGapPenalty     int
    penaltyMap            map[rune]int
}

// ex.
// util.NewLocalAlignmentConfig(3, 3, 2, map[rune]int{
//     'の': 100,
// })
func NewLocalAlignmentConfig(
    defaultMatchScore, defaultUnmatchPenalty, defaultGapPenalty int, penaltyMap map[rune]int,
) *LocalAlignmentConfig {
    return &LocalAlignmentConfig{
        defaultMatchScore:     defaultMatchScore,
        defaultUnmatchPenalty: defaultUnmatchPenalty,
        defaultGapPenalty:     defaultGapPenalty,
        penaltyMap:            penaltyMap,
    }
}

// ex.
// config := util.NewLocalAlignmentConfig(3, 10, 10, map[rune]int{
//     'の': 100, // heavy penalty
//     ' ': 0,    // penalty-free
//     '・': 0,
// })
// GetLocalAlignment("イヴサンローラン", "イヴ・サンローラン", config) => "イヴサンローラン"
func GetLocalAlignment(s, t string, config *LocalAlignmentConfig) string {
    // setting
    matchScore := 1
    if config != nil {
        matchScore = config.defaultMatchScore
    }
    unmatchPenalty := 0
    if config != nil {
        unmatchPenalty = config.defaultUnmatchPenalty
    }
    getPenalty := func(r rune) int {
        if config == nil {
            return 0
        }
        penaltyValue, ok := config.penaltyMap[r]
        if ok {
            return penaltyValue
        }
        return config.defaultGapPenalty
    }

    sAsRunes := []rune(s)
    tAsRunes := []rune(t)
    sLen := len(sAsRunes)
    tLen := len(tAsRunes)

    dp := make([][]int, sLen+1)
    for i := 0; i < sLen+1; i++ {
        dp[i] = make([]int, tLen+1)
    }

    // fill the DP table
    highestScore := 0
    highestI, highestJ := 0, 0
    for i := 1; i <= sLen; i++ {
        for j := 1; j <= tLen; j++ {
            var diagonalScore int
            if sAsRunes[i-1] == tAsRunes[j-1] {
                diagonalScore = dp[i-1][j-1] + matchScore
            } else {
                diagonalScore = dp[i-1][j-1] - unmatchPenalty
            }

            dp[i][j] = maxOfFour(
                0,
                diagonalScore,
                dp[i-1][j]-getPenalty(sAsRunes[i-1]),
                dp[i][j-1]-getPenalty(tAsRunes[j-1]),
            )
            // find the highest cell
            if highestScore < dp[i][j] {
                highestScore = dp[i][j]
                highestI, highestJ = i, j
            }
        }
    }
    // construct localAlignment by backtracking
    localAlignment := []rune{}
    backtrack(sAsRunes, tAsRunes, &dp, highestI, highestJ, &localAlignment)
    return string(localAlignment)
}

func backtrack(s, t []rune, dp *[][]int, i, j int, lcs *[]rune) {
    // basecase
    if i == 0 || j == 0 {
        return
    }
    if (*dp)[i][j] == 0 {
        return
    }

    if s[i-1] == t[j-1] {
        backtrack(s, t, dp, i-1, j-1, lcs)
        (*lcs) = append((*lcs), s[i-1])
    } else {
        if (*dp)[i-1][j] >= (*dp)[i][j-1] {
            backtrack(s, t, dp, i-1, j, lcs)
        } else {
            backtrack(s, t, dp, i, j-1, lcs)
        }
    }
}

func max(a, b int) int {
    if a < b {
        return b
    }
    return a
}

func maxOfFour(a, b, c, d int) int {
    return max(max(max(a, b), c), d)
}

Comparison

We summarize the pros and cons of the mentioned algorithms.

In order to consider complexity, we define l, m, n as follows:

  • m is the length of the user input
  • n is the length of a keyword
  • l is the number of keywords

VS machine learning

Machine learning systems have heavy maintenance costs and are not a good fit for small teams. Using a pre-trained model to reduce development costs can introduce unexpected behaviors into our business logic. The Smith-Waterman method, on the other hand, is just dynamic programming, so its maintenance cost is low. Furthermore, its behavior is more testable because the system has no stochastic component.

  • Let’s briefly consider an implementation using a machine learning model. A sentence would be morphologically analyzed, vectorized, and then classified for each keyword. If we embed words into a semantic space as Word2Vec does, it may solve corner cases that are impossible for the methods we have seen so far, such as matching Xmas with Christmas.
  • In terms of time complexity, we would compute the matrix product after morphological analysis of the user query, which requires O(m²) once. Whether this is faster than the Smith-Waterman implementation depends on the matrix design and computational efficiency; we would need to check it experimentally.
  • Downsides: developing such a stochastic model requires custom annotations matched to our keyword list. Additionally, we would need to invest in the validation process, including keeping the model fresh.

Comparison of string matching algorithms

  • Exact string matching is assumed to be implemented with the KMP algorithm.
  • Regarding time complexity, all the methods above eventually have to be applied to each keyword. As a result, exact matching takes O(l(m+n)) and the Smith-Waterman method takes O(lmn) (a rough numerical illustration follows this list).
  • In terms of space complexity, the methods other than exact string matching use extra memory for their DP tables, although far less than the memory available on a server.
  • LCS may work well for long input strings, but its matching range is not controllable, which leads to unexpected chance matches. The Smith-Waterman method, on the other hand, uses hyperparameters to control the matching range.
  • By the way, edit distance can also be implemented with dynamic programming similar to LCS, with the same complexities as above.
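As a rough, purely illustrative calculation: with, say, l = 1,000 keywords, a query of m = 30 characters, and keywords of n = 10 characters on average, the Smith-Waterman pass fills on the order of l × m × n = 300,000 DP cells per query, while exact matching performs on the order of l × (m + n) = 40,000 character comparisons. Both are cheap in absolute terms; the difference matters mainly as these sizes grow.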

Application in service

To bring this into our service, we created a list of top search terms for Giftmall and conducted a quality check to see whether it returned the expected results.

For the cases that did not give the expected results, we analyzed line by line why they misbehaved. During this iteration, we also set up some fallbacks and expanded the keyword list itself. The idea of using the Smith-Waterman algorithm came out of this trial-and-error process. Finally, our team decided we had reached release-ready quality, and we shipped it.

The search engine with approximate keyword extraction has run within reasonable computational time so far. We have also combined it with minor optimizations, such as pruning obviously unmatched keywords ahead of time.
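One possible pre-filter of this kind (shown here only as an illustrative sketch, not necessarily what we ship) is to skip any keyword that shares no character with the query before running the O(mn) DP:

// sharesAnyRune is a hypothetical pre-filter: if the query contains none of the
// keyword's characters, the alignment ratio can never reach a useful threshold,
// so we can skip building the DP table for that keyword entirely.
func sharesAnyRune(query, keyword string) bool {
    seen := make(map[rune]bool, len(query))
    for _, r := range query {
        seen[r] = true
    }
    for _, r := range keyword {
        if seen[r] {
            return true
        }
    }
    return false
}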

My personal impression is that it is helpful in business because of the flexibility of its detailed settings. In particular, when operating a real service, we often receive tuning requests from within the company for specific cases, and all I need to do is change the config parameters.

Lastly, a note on scalability: if the number of keywords increases tenfold from the current level, the required time also increases linearly. If that happens, we will need to clean up the keyword list or find another matching algorithm.

Conclusion

Classical string matching algorithms are helpful, even if they look complicated. In this article, I introduced approximate keyword extraction based on the Smith-Waterman algorithm. First, it is robust against both spelling discrepancies and lengthy user input. Second, it is flexible enough to encode special rules for our situation. This algorithm is worth considering in real business settings.
