我需要一种方法来比较多个字符串到一个测试字符串,并返回与它非常相似的字符串:
TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW
CHOICE A : THE RED COW JUMPED OVER THE GREEN CHICKEN
CHOICE B : THE RED COW JUMPED OVER THE RED COW
CHOICE C : THE RED FOX JUMPED OVER THE BROWN COW
(如果我做对了)最接近“TEST string”的字符串应该是“CHOICE C”。最简单的方法是什么?
我计划将其实现为多种语言,包括VB.net、Lua和JavaScript。此时,伪代码是可以接受的。如果你能提供一个特定语言的例子,这也是非常感谢的!
下面是一个不依赖于任何库的快速解决方案,并且可以很好地处理自动完成表单之类的事情:
function compare_strings(str1, str2) {
arr1 = str1.split("");
arr2 = str2.split("");
res = arr1.reduce((a, c) => a + arr2.includes(c), 0);
return(res)
}
可以像这样在自动完成输入中使用:
HTML:
<div id="wrapper">
<input id="tag_input" placeholder="add tags..."></input>
<div id="hold_tags"></div>
</div>
CSS:
body {
background: #2c2c54;
display: flex;
justify-content: center;
align-items: center;
}
input {
height: 40px;
width: 400px;
border-radius: 4px;
outline: 0;
border: none;
padding-left: 5px;
font-size: 18px;
}
#wrapper {
height: auto;
background: #40407a;
}
.tag {
background: #ffda79;
margin: 4px;
padding: 5px;
border-radius: 4px;
box-shadow: 2px 2px 2px black;
font-size: 18px;
font-family: arial;
cursor: pointer;
}
JS:
const input = document.getElementById("tag_input");
const wrapper = document.getElementById("wrapper");
const hold_tags = document.getElementById("hold_tags");
const words = [
"machine",
"data",
"platform",
"garbage",
"twitter",
"knowledge"
];
input.addEventListener("input", function (e) {
const value = document.getElementById(e.target.id).value;
hold_tags.replaceChildren();
if (value !== "") {
words.forEach(function (word) {
if (compare_strings(word, value) > value.length - 1) {
const tag = document.createElement("div");
tag.className = "tag";
tag.innerText = word;
hold_tags.append(tag);
}
});
}
});
function compare_strings(str1, str2) {
arr1 = str1.split("");
arr2 = str2.split("");
res = arr1.reduce((a, c) => a + arr2.includes(c), 0);
return res;
}
结果:
还有一个相似度测量,我曾经在我们的系统中实施,并给出了令人满意的结果:-
用例
有一个用户查询需要与一组文档进行匹配。
算法
从用户查询中提取关键字(相关POS TAGS -名词,专有名词)。
现在根据下面的公式计算分数,用于测量用户查询和给定文档之间的相似性。
对于从用户查询中提取的每个关键字:-
开始在文档中搜索给定的单词,并在文档中每出现一次该单词就减少奖励点数。
从本质上讲,如果第一个关键字在文档中出现了4次,则得分将计算为:-
第一次出现将获取'1'点。
第二次出现将在计算分数上加1/2
第三次会增加总数的1/3
第四次得到1/4
总相似度= 1 + 1/2 + 1/3 + 1/4 = 2.083
类似地,我们为用户查询中的其他关键字计算它。
最后,总分将表示用户查询与给定文档之间的相似程度。
Lua实现,为子孙后代:
function levenshtein_distance(str1, str2)
local len1, len2 = #str1, #str2
local char1, char2, distance = {}, {}, {}
str1:gsub('.', function (c) table.insert(char1, c) end)
str2:gsub('.', function (c) table.insert(char2, c) end)
for i = 0, len1 do distance[i] = {} end
for i = 0, len1 do distance[i][0] = i end
for i = 0, len2 do distance[0][i] = i end
for i = 1, len1 do
for j = 1, len2 do
distance[i][j] = math.min(
distance[i-1][j ] + 1,
distance[i ][j-1] + 1,
distance[i-1][j-1] + (char1[i] == char2[j] and 0 or 1)
)
end
end
return distance[len1][len2]
end
这里有一个使用c#的示例。
public static void Main()
{
Console.WriteLine("Hello World " + LevenshteinDistance("Hello","World"));
Console.WriteLine("Choice A " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW","THE RED COW JUMPED OVER THE GREEN CHICKEN"));
Console.WriteLine("Choice B " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW","THE RED COW JUMPED OVER THE RED COW"));
Console.WriteLine("Choice C " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW","THE RED FOX JUMPED OVER THE BROWN COW"));
}
public static float LevenshteinDistance(string a, string b)
{
var rowLen = a.Length;
var colLen = b.Length;
var maxLen = Math.Max(rowLen, colLen);
// Step 1
if (rowLen == 0 || colLen == 0)
{
return maxLen;
}
/// Create the two vectors
var v0 = new int[rowLen + 1];
var v1 = new int[rowLen + 1];
/// Step 2
/// Initialize the first vector
for (var i = 1; i <= rowLen; i++)
{
v0[i] = i;
}
// Step 3
/// For each column
for (var j = 1; j <= colLen; j++)
{
/// Set the 0'th element to the column number
v1[0] = j;
// Step 4
/// For each row
for (var i = 1; i <= rowLen; i++)
{
// Step 5
var cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
// Step 6
/// Find minimum
v1[i] = Math.Min(v0[i] + 1, Math.Min(v1[i - 1] + 1, v0[i - 1] + cost));
}
/// Swap the vectors
var vTmp = v0;
v0 = v1;
v1 = vTmp;
}
// Step 7
/// The vectors were swapped one last time at the end of the last loop,
/// that is why the result is now in v0 rather than in v1
return v0[rowLen];
}
输出结果为:
Hello World 4
Choice A 15
Choice B 6
Choice C 8