大约一年前,我遇到了这个问题,当时我要在一个杂项信息数据库中查找用户输入的关于石油钻井平台的信息。目标是执行某种模糊字符串搜索,以识别具有最常见元素的数据库条目。
研究的一部分涉及实现Levenshtein距离算法,该算法确定必须对一个字符串或短语进行多少更改才能将其变成另一个字符串或短语。
我提出的实现相对简单,涉及两个短语的长度、每个短语之间的变化数量以及是否可以在目标条目中找到每个单词的加权比较。
这篇文章在一个私人网站上,所以我会尽我所能在这里附上相关内容:
模糊字符串匹配是对两个单词或短语的相似度进行类似人类估计的过程。在许多情况下,它包括识别彼此最相似的单词或短语。本文描述了模糊字符串匹配问题的内部解决方案,以及它在解决各种问题方面的有用性,这些问题可以使我们将以前需要冗长的用户参与的任务自动化。
简介
The need to do fuzzy string matching originally came about while developing the Gulf of Mexico Validator tool. What existed was a database of known gulf of Mexico oil rigs and platforms, and people buying insurance would give us some badly typed out information about their assets and we had to match it to the database of known platforms. When there was very little information given, the best we could do is rely on an underwriter to "recognize" the one they were referring to and call up the proper information. This is where this automated solution comes in handy.
我花了一天的时间研究模糊字符串匹配的方法,最终在维基百科上偶然发现了非常有用的Levenshtein距离算法。
实现
在阅读了它背后的理论之后,我实现了它,并找到了优化它的方法。这是我的代码在VBA中的样子:
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, cost As Long 'loop counters and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
L1 = Len(S1): L2 = Len(S2)
ReDim D(0 To L1, 0 To L2)
For i = 0 To L1: D(i, 0) = i: Next i
For j = 0 To L2: D(0, j) = j: Next j
For j = 1 To L2
For i = 1 To L1
cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
cI = D(i - 1, j) + 1
cD = D(i, j - 1) + 1
cS = D(i - 1, j - 1) + cost
If cI <= cD Then 'Insertion or Substitution
If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
Else 'Deletion or Substitution
If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
End If
Next i
Next j
LevenshteinDistance = D(L1, L2)
End Function
Simple, speedy, and a very useful metric. Using this, I created two separate metrics for evaluating the similarity of two strings. One I call "valuePhrase" and one I call "valueWords". valuePhrase is just the Levenshtein distance between the two phrases, and valueWords splits the string into individual words, based on delimiters such as spaces, dashes, and anything else you'd like, and compares each word to each other word, summing up the shortest Levenshtein distance connecting any two words. Essentially, it measures whether the information in one 'phrase' is really contained in another, just as a word-wise permutation. I spent a few days as a side project coming up with the most efficient way possible of splitting a string based on delimiters.
valueWords, valuePhrase和Split函数:
Public Function valuePhrase#(ByRef S1$, ByRef S2$)
valuePhrase = LevenshteinDistance(S1, S2)
End Function
Public Function valueWords#(ByRef S1$, ByRef S2$)
Dim wordsS1$(), wordsS2$()
wordsS1 = SplitMultiDelims(S1, " _-")
wordsS2 = SplitMultiDelims(S2, " _-")
Dim word1%, word2%, thisD#, wordbest#
Dim wordsTotal#
For word1 = LBound(wordsS1) To UBound(wordsS1)
wordbest = Len(S2)
For word2 = LBound(wordsS2) To UBound(wordsS2)
thisD = LevenshteinDistance(wordsS1(word1), wordsS2(word2))
If thisD < wordbest Then wordbest = thisD
If thisD = 0 Then GoTo foundbest
Next word2
foundbest:
wordsTotal = wordsTotal + wordbest
Next word1
valueWords = wordsTotal
End Function
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' SplitMultiDelims
' This function splits Text into an array of substrings, each substring
' delimited by any character in DelimChars. Only a single character
' may be a delimiter between two substrings, but DelimChars may
' contain any number of delimiter characters. It returns a single element
' array containing all of text if DelimChars is empty, or a 1 or greater
' element array if the Text is successfully split into substrings.
' If IgnoreConsecutiveDelimiters is true, empty array elements will not occur.
' If Limit greater than 0, the function will only split Text into 'Limit'
' array elements or less. The last element will contain the rest of Text.
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function SplitMultiDelims(ByRef Text As String, ByRef DelimChars As String, _
Optional ByVal IgnoreConsecutiveDelimiters As Boolean = False, _
Optional ByVal Limit As Long = -1) As String()
Dim ElemStart As Long, N As Long, M As Long, Elements As Long
Dim lDelims As Long, lText As Long
Dim Arr() As String
lText = Len(Text)
lDelims = Len(DelimChars)
If lDelims = 0 Or lText = 0 Or Limit = 1 Then
ReDim Arr(0 To 0)
Arr(0) = Text
SplitMultiDelims = Arr
Exit Function
End If
ReDim Arr(0 To IIf(Limit = -1, lText - 1, Limit))
Elements = 0: ElemStart = 1
For N = 1 To lText
If InStr(DelimChars, Mid(Text, N, 1)) Then
Arr(Elements) = Mid(Text, ElemStart, N - ElemStart)
If IgnoreConsecutiveDelimiters Then
If Len(Arr(Elements)) > 0 Then Elements = Elements + 1
Else
Elements = Elements + 1
End If
ElemStart = N + 1
If Elements + 1 = Limit Then Exit For
End If
Next N
'Get the last token terminated by the end of the string into the array
If ElemStart <= lText Then Arr(Elements) = Mid(Text, ElemStart)
'Since the end of string counts as the terminating delimiter, if the last character
'was also a delimiter, we treat the two as consecutive, and so ignore the last elemnent
If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) = 0 Then Elements = Elements - 1
ReDim Preserve Arr(0 To Elements) 'Chop off unused array elements
SplitMultiDelims = Arr
End Function
相似度度量
Using these two metrics, and a third which simply computes the distance between two strings, I have a series of variables which I can run an optimization algorithm to achieve the greatest number of matches. Fuzzy string matching is, itself, a fuzzy science, and so by creating linearly independent metrics for measuring string similarity, and having a known set of strings we wish to match to each other, we can find the parameters that, for our specific styles of strings, give the best fuzzy match results.
最初,度量的目标是为精确匹配提供一个低搜索值,并为不断排列的度量增加搜索值。在不实际的情况下,使用一组定义良好的排列来定义这是相当容易的,并设计最终公式,使它们具有所需的增加搜索值的结果。
In the above screenshot, I tweaked my heuristic to come up with something that I felt scaled nicely to my perceived difference between the search term and result. The heuristic I used for Value Phrase in the above spreadsheet was =valuePhrase(A2,B2)-0.8*ABS(LEN(B2)-LEN(A2)). I was effectively reducing the penalty of the Levenstein distance by 80% of the difference in the length of the two "phrases". This way, "phrases" that have the same length suffer the full penalty, but "phrases" which contain 'additional information' (longer) but aside from that still mostly share the same characters suffer a reduced penalty. I used the Value Words function as is, and then my final SearchVal heuristic was defined as =MIN(D2,E2)*0.8+MAX(D2,E2)*0.2 - a weighted average. Whichever of the two scores was lower got weighted 80%, and 20% of the higher score. This was just a heuristic that suited my use case to get a good match rate. These weights are something that one could then tweak to get the best match rate with their test data.
正如您所看到的,最后两个指标是模糊字符串匹配指标,已经有一种自然的倾向,即对打算匹配的字符串给予低分(对角线向下)。这很好。
应用程序
为了优化模糊匹配,我对每个指标进行了加权。因此,模糊字符串匹配的每种应用都可以对不同的参数进行不同的加权。定义最终分数的公式是指标及其权重的简单组合:
value = Min(phraseWeight*phraseValue, wordsWeight*wordsValue)*minWeight
+ Max(phraseWeight*phraseValue, wordsWeight*wordsValue)*maxWeight
+ lengthWeight*lengthValue
Using an optimization algorithm (neural network is best here because it is a discrete, multi-dimentional problem), the goal is now to maximize the number of matches. I created a function that detects the number of correct matches of each set to each other, as can be seen in this final screenshot. A column or row gets a point if the lowest score is assigned the the string that was meant to be matched, and partial points are given if there is a tie for the lowest score, and the correct match is among the tied matched strings. I then optimized it. You can see that a green cell is the column that best matches the current row, and a blue square around the cell is the row that best matches the current column. The score in the bottom corner is roughly the number of successful matches and this is what we tell our optimization problem to maximize.
这个算法非常成功,解的参数说明了很多关于这类问题的信息。您将注意到优化的分数是44,而最好的分数是48。最后的5列是假的,与行值没有任何匹配。诱饵越多,自然就越难找到最佳匹配。
在这种特殊的匹配情况下,字符串的长度是不相关的,因为我们期望表示较长的单词的缩写,因此长度的最佳权重是-0.3,这意味着我们不会惩罚长度变化的字符串。我们降低了对这些缩写的预期得分,为部分单词匹配提供了更多的空间,以取代非单词匹配,因为字符串更短,只需要更少的替换。
单词的权重是1.0,而短语的权重只有0.5,这意味着我们会惩罚一个字符串中缺失的整个单词,而更重视整个短语的完整。这很有用,因为很多字符串都有一个共同的单词(危险),真正重要的是是否保持组合(区域和危险)。
最后,最小权重优化为10,最大权重优化为1。这意味着,如果两个分数中最好的分数(价值短语和价值词)不是很好,比赛将受到很大的惩罚,但我们不会对两个分数中最差的分数进行很大的惩罚。从本质上讲,这强调要求valueWord或valuePhrase有一个好分数,而不是两者都有。一种“能得到什么就拿什么”的心态。
这5个权重的优化值说明了发生的模糊字符串匹配,这真的很有趣。对于完全不同的模糊字符串匹配的实际情况,这些参数是非常不同的。到目前为止,我已经在3个不同的应用程序中使用了它。
虽然在最终优化中未使用,但建立了一个基准表,它将列与自己相匹配,以获得对角线上的所有完美结果,并让用户改变参数来控制得分从0偏离的比率,并注意搜索短语之间的固有相似性(理论上可以用于抵消结果中的假阳性)
进一步的应用
这种解决方案有潜力用于任何用户希望计算机系统识别一组字符串中没有完美匹配的字符串的地方。(类似于字符串的近似匹配vlookup)。
因此,您应该从中得到的是,您可能想要使用高级启发式的组合(从一个短语中找到另一个短语中的单词,两个短语的长度,等等)以及Levenshtein距离算法的实现。因为决定哪个是“最佳”匹配是一种启发式(模糊)决定——你必须为你想到的任何指标提出一组权重来确定相似性。
有了合适的启发式和权重集,比较程序就能快速做出你本应做出的决定。