I need a way to compare several strings against a test string and return the one that most closely resembles it:
TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW
CHOICE A : THE RED COW JUMPED OVER THE GREEN CHICKEN
CHOICE B : THE RED COW JUMPED OVER THE RED COW
CHOICE C : THE RED FOX JUMPED OVER THE BROWN COW
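(If I did this right) the string closest to "TEST STRING" should be "CHOICE C". What is the easiest way to do this?
I plan to implement this in several languages, including VB.net, Lua, and JavaScript. At this point, pseudocode is acceptable. If you can provide an example in a specific language, that would be much appreciated too!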
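If the input data is very large (say, millions of strings), comparing every string against the query becomes impractical. I used Elasticsearch to solve this.
Quick start: https://www.elastic.co/guide/en/elasticsearch/client/net-api/6.x/elasticsearch-net.html
Just index all of your input data, and you can then quickly search for any string within a small edit distance. Here is a C# snippet that returns a list of results ranked by relevance, so closer matches generally come first. Note that the fuzziness is applied to each term of the query, and as far as I know Elasticsearch only accepts an edit distance of 0, 1, or 2 (or AUTO).
var res = client.Search<ClassName>(s => s
    .Query(q => q
        .Match(m => m
            .Field(f => f.VariableName)
            .Query("SAMPLE QUERY")
            // Maximum edit distance per term; values above 2 are rejected
            .Fuzziness(Fuzziness.EditDistance(2))
        )
    ));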
Here is a quick solution that does not depend on any library and works well for things like autocomplete forms:
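// Counts how many characters of str1 also occur somewhere in str2.
// Character order and position are ignored.
function compare_strings(str1, str2) {
    const arr1 = str1.split("");
    const arr2 = str2.split("");
    // `includes` returns a boolean, which is added as 0 or 1
    return arr1.reduce((a, c) => a + arr2.includes(c), 0);
}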
It can be used in an autocomplete input like this:
HTML:
<div id="wrapper">
    <input id="tag_input" placeholder="add tags...">
    <div id="hold_tags"></div>
</div>
CSS:
body {
    background: #2c2c54;
    display: flex;
    justify-content: center;
    align-items: center;
}
input {
    height: 40px;
    width: 400px;
    border-radius: 4px;
    outline: 0;
    border: none;
    padding-left: 5px;
    font-size: 18px;
}
#wrapper {
    height: auto;
    background: #40407a;
}
.tag {
    background: #ffda79;
    margin: 4px;
    padding: 5px;
    border-radius: 4px;
    box-shadow: 2px 2px 2px black;
    font-size: 18px;
    font-family: arial;
    cursor: pointer;
}
JS:
const input = document.getElementById("tag_input");
const wrapper = document.getElementById("wrapper");
const hold_tags = document.getElementById("hold_tags");
const words = [
    "machine",
    "data",
    "platform",
    "garbage",
    "twitter",
    "knowledge"
];
input.addEventListener("input", function (e) {
    const value = e.target.value;
    // Clear the previous suggestions before rebuilding the list
    hold_tags.replaceChildren();
    if (value !== "") {
        words.forEach(function (word) {
            // Suggest the word when its score reaches the length of the typed text
            if (compare_strings(word, value) > value.length - 1) {
                const tag = document.createElement("div");
                tag.className = "tag";
                tag.innerText = word;
                hold_tags.append(tag);
            }
        });
    }
});
// Counts how many characters of str1 also occur somewhere in str2
function compare_strings(str1, str2) {
    const arr1 = str1.split("");
    const arr2 = str2.split("");
    return arr1.reduce((a, c) => a + arr2.includes(c), 0);
}
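To make the score easier to reason about, here is a small sketch that runs the same function outside the browser. The function is repeated so the snippet runs on its own; the `matches` helper is a hypothetical wrapper around the condition used in the input handler, not part of the original code.

```javascript
// Same scoring function as in the autocomplete example.
function compare_strings(str1, str2) {
    const arr1 = str1.split("");
    const arr2 = str2.split("");
    return arr1.reduce((a, c) => a + arr2.includes(c), 0);
}

// Hypothetical helper: the filter condition from the input handler.
function matches(word, value) {
    return compare_strings(word, value) > value.length - 1;
}

// "m", "a" and "c" from "machine" all occur in "mac"
console.log(compare_strings("machine", "mac")); // 3

// Only the two "a" characters of "data" occur in "mac"
console.log(compare_strings("data", "mac")); // 2

console.log(matches("machine", "mac")); // true
console.log(matches("data", "mac"));    // false

// Character order is ignored, so a scrambled prefix still matches
console.log(matches("machine", "cam")); // true
```

Because the score only counts shared characters, it is cheap and forgiving for short typed prefixes, but it is not an edit distance: it cannot tell "cam" from "mac", and repeated letters in the word are counted every time they appear.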
Here is an example using C#.
using System;

public class Program
{
    public static void Main()
    {
        Console.WriteLine("Hello World " + LevenshteinDistance("Hello", "World"));
        Console.WriteLine("Choice A " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW", "THE RED COW JUMPED OVER THE GREEN CHICKEN"));
        Console.WriteLine("Choice B " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW", "THE RED COW JUMPED OVER THE RED COW"));
        Console.WriteLine("Choice C " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW", "THE RED FOX JUMPED OVER THE BROWN COW"));
    }

    public static int LevenshteinDistance(string a, string b)
    {
        var rowLen = a.Length;
        var colLen = b.Length;
        var maxLen = Math.Max(rowLen, colLen);

        // Step 1: if either string is empty, the distance is the other's length
        if (rowLen == 0 || colLen == 0)
        {
            return maxLen;
        }

        // Create the two vectors
        var v0 = new int[rowLen + 1];
        var v1 = new int[rowLen + 1];

        // Step 2: initialize the first vector (v0[0] is already 0)
        for (var i = 1; i <= rowLen; i++)
        {
            v0[i] = i;
        }

        // Step 3: for each column
        for (var j = 1; j <= colLen; j++)
        {
            // Set the 0'th element to the column number
            v1[0] = j;

            // Step 4: for each row
            for (var i = 1; i <= rowLen; i++)
            {
                // Step 5: substitution is free when the characters match
                var cost = (a[i - 1] == b[j - 1]) ? 0 : 1;

                // Step 6: minimum of deletion, insertion, and substitution
                v1[i] = Math.Min(v0[i] + 1, Math.Min(v1[i - 1] + 1, v0[i - 1] + cost));
            }

            // Swap the vectors
            var vTmp = v0;
            v0 = v1;
            v1 = vTmp;
        }

        // Step 7: the vectors were swapped one last time at the end of the
        // last loop, which is why the result is now in v0 rather than in v1
        return v0[rowLen];
    }
}
The output is:
Hello World 4
Choice A 15
Choice B 6
Choice C 8
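Since the question also mentions JavaScript, here is a minimal sketch of the same two-row approach in that language, with a small helper that sorts the choices by distance. The names `levenshteinDistance` and `rankByDistance` are my own, and this is only one way to port it, not a tuned implementation.

```javascript
// Two-row Levenshtein distance, following the C# example step by step.
function levenshteinDistance(a, b) {
    if (a.length === 0 || b.length === 0) {
        return Math.max(a.length, b.length);
    }
    // prev holds the previous column of the distance table, curr the current one
    let prev = Array.from({ length: a.length + 1 }, (_, i) => i);
    let curr = new Array(a.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
        curr[0] = j;
        for (let i = 1; i <= a.length; i++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[i] = Math.min(
                prev[i] + 1,        // deletion
                curr[i - 1] + 1,    // insertion
                prev[i - 1] + cost  // substitution (or match)
            );
        }
        // Swap the two rows; the finished column is now in prev
        [prev, curr] = [curr, prev];
    }
    return prev[a.length];
}

// Hypothetical helper: candidates sorted from closest to farthest.
function rankByDistance(target, candidates) {
    return candidates
        .map((text) => ({ text, distance: levenshteinDistance(target, text) }))
        .sort((x, y) => x.distance - y.distance);
}

const testString = "THE BROWN FOX JUMPED OVER THE RED COW";
const choices = [
    "THE RED COW JUMPED OVER THE GREEN CHICKEN",
    "THE RED COW JUMPED OVER THE RED COW",
    "THE RED FOX JUMPED OVER THE BROWN COW"
];
const ranking = rankByDistance(testString, choices);
console.log(ranking);
console.log("Closest:", ranking[0].text);
```

Because this is a direct port, it should report the same distances as the C# output above. By character-level edit distance, Choice B therefore ranks first rather than Choice C, since swapping whole words costs more edits than changing "BROWN FOX" into "RED COW".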