"你是什么意思?"算法的工作吗?

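I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names, etc. I've been really impressed by the ability of some search engines to respond very quickly with "Did you mean: xxxx".

I need to be able to intelligently take a user query and respond not only with the raw search results, but also with a "Did you mean?" suggestion when there is a highly likely alternative answer.

I'm developing in ASP.NET (VB - don't hold it against me!)

UPDATE: OK, how can I mimic this without millions of "paying users"?

Generate typos for each "known" or "correct" term and perform lookups? Is there some other, more elegant method?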

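Current answer

This is an old question, and I'm surprised that nobody suggested Apache Solr to the OP.

Apache Solr is a full-text search engine that, among many other features, also provides spellchecking and query suggestions. From the documentation:

By default, the Lucene spell checkers sort suggestions first by the score from the string distance calculation and second by the frequency (if available) of the suggestion in the index.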

2012-03-06 20:29:54

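Other answers

There is a specific data structure, the ternary search tree, that naturally supports partial matches and near-neighbor matches.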
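
As a rough illustration of the near-neighbor idea, here is a minimal ternary search tree sketch. It is not taken from the answer: the names `Node`, `insert`, and `near` and the sample words are assumptions, and "near" here means same-length words that differ in at most a given number of positions (Hamming distance), which is only one of several possible definitions.

```python
class Node:
    """One character of a ternary search tree (hypothetical helper class)."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False  # True if a stored word ends at this node


def insert(node, word, i=0):
    """Insert a non-empty word and return the (possibly new) subtree root."""
    ch = word[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.is_end = True
    return node


def near(node, word, budget, i=0, prefix="", out=None):
    """Collect stored words with len(word) characters that differ from
    `word` in at most `budget` positions."""
    if out is None:
        out = []
    if node is None or i >= len(word):
        return out
    ch = word[i]
    # Side branches are worth visiting if a mismatch is still affordable
    # or if the exact character could live there.
    if budget > 0 or ch < node.ch:
        near(node.lo, word, budget, i, prefix, out)
    cost = 0 if ch == node.ch else 1
    if budget - cost >= 0:
        if i + 1 == len(word):
            if node.is_end:
                out.append(prefix + node.ch)
        else:
            near(node.eq, word, budget - cost, i + 1, prefix + node.ch, out)
    if budget > 0 or ch > node.ch:
        near(node.hi, word, budget, i, prefix, out)
    return out


root = None
for w in ["hello", "help", "hallo", "jello", "world"]:  # sample data
    root = insert(root, w)

print(near(root, "hwllo", 1))  # words within one substitution of "hwllo"
```

Insertions and deletions are not handled by this sketch; covering them would need a different traversal or a separate edit-distance check on the candidates.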

2009-09-07 11:24:45

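The simplest way to approach this is dynamic programming.

It is an algorithm borrowed from information retrieval and used heavily in modern bioinformatics to measure how similar two gene sequences are.

The optimal solution uses dynamic programming and recursion.

This is a well-solved problem with many solutions. Just search around until you find some open source code.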
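
The answer does not name a specific algorithm, so the following is an assumption: a minimal sketch of the classic dynamic-programming edit distance (Levenshtein distance), which is the usual way to score how similar two strings are. The function name `edit_distance` is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


print(edit_distance("hwllo", "hello"))
```

Keeping only two rows uses memory proportional to the length of `b` rather than the full table; the running time is still proportional to `len(a) * len(b)`.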

2008-11-21 01:05:37

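Some time ago I found an article: "How to Write a Spelling Corrector", written by Peter Norvig (Director of Research at Google Inc.).

It is an interesting read on the topic of spelling correction. The examples are in Python, but they are clear and easy to understand, and I think the algorithm can easily be translated to other languages.

Below is a short description of the algorithm. It consists of two steps: preparation and word checking.

Step 1: Preparation - setting up the word database

Ideally, you would use actual search words and their occurrence counts. If you don't have those, a large body of text can be used instead. Count the occurrences (popularity) of each word.

Step 2: Word checking - finding words that are similar to the one being checked

Similar means that the edit distance is low (typically 0-1 or 0-2). The edit distance is the minimum number of inserts/deletes/changes/swaps needed to transform one word into another.

Choose the most popular word from the previous step and suggest it as the correction (if it is not the word itself).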
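
The two steps above can be sketched roughly as follows. This is a simplified illustration in the spirit of the article, not a copy of it: the tiny `CORPUS` string stands in for real search logs or a large text, and the names `WORDS`, `edits1`, `known`, and `suggest` are assumptions. The sketch also assumes that closer candidates are preferred, so distance-2 words are considered only when no distance-1 word is known.

```python
import re
from collections import Counter

# Step 1: preparation. A tiny made-up corpus; in practice this would be
# real query logs or a large body of text.
CORPUS = "apple apple apple apply maple bank bank banking portfolio portfolio fund"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one insert/delete/change/swap away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    changes = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + swaps + changes + inserts)


def known(words):
    """Keep only the candidates that occur in the word database."""
    return {w for w in words if w in WORDS}


def suggest(word):
    """Step 2: return the most popular known word within edit distance 2,
    or the input itself if it is known or nothing similar is found."""
    if word in WORDS:
        return word
    candidates = known(edits1(word)) or known(
        e2 for e1 in edits1(word) for e2 in edits1(e1)
    )
    if not candidates:
        return word
    # Popularity decides; the word itself breaks ties deterministically.
    return max(candidates, key=lambda w: (WORDS[w], w))


print(suggest("aple"))
print(suggest("portfolo"))
```

A "Did you mean?" prompt would then be shown only when `suggest(query)` differs from the query.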

2008-11-20 23:41:37

Normally, a production spelling corrector uses several methods to provide spelling suggestions. Some of them are:

1. Decide on a way to determine whether spelling correction is required. These may include insufficient results, results which are not specific or accurate enough (according to some measure), etc. Then:

2. Use a large body of text or a dictionary, where all, or most, words are known to be correctly spelled. These are easily found online, in places such as LingPipe. Then, to determine the best suggestion, look for a word which is the closest match based on several measures. The most intuitive one is similar characters. What has been shown through research and experimentation is that two- or three-character sequence matches (bigrams and trigrams) work better. To further improve results, give extra weight to a match at the beginning or end of the word. For performance reasons, index all these words as trigrams or bigrams, so that when you perform a lookup, you convert the query to n-grams and look them up via a hashtable or trie.

3. Use heuristics related to potential keyboard mistakes based on character location, so that "hwllo" should be "hello" because 'w' is close to 'e'.

4. Use a phonetic key (Soundex, Metaphone) to index the words and look up possible corrections. In practice this normally returns worse results than using n-gram indexing, as described above.

5. In each case you must select the best correction from a list. This may be based on a distance metric such as Levenshtein, the keyboard metric, etc.

6. For a multi-word phrase, only one word may be misspelled, in which case you can use the remaining words as context in determining a best match.
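
The n-gram indexing idea described above can be sketched as follows. Everything specific here is an assumption for illustration: the function names, the `$` padding character, the Jaccard-style score, and the `min_score` threshold. Padding both ends of the word is one simple way to give matches at the beginning and end of a word more influence, since they then contribute extra shared trigrams.

```python
from collections import defaultdict


def ngrams(word, n=3):
    """Set of character n-grams of `word`, padded so that the first and
    last characters each appear in n distinct n-grams."""
    padded = "$" * (n - 1) + word.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(vocabulary, n=3):
    """Map each n-gram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def candidates(query, index, n=3, min_score=0.3):
    """Return vocabulary words ranked by n-gram overlap with `query`."""
    query_grams = ngrams(query, n)
    shared = defaultdict(int)  # word -> number of n-grams shared with query
    for gram in query_grams:
        for word in index.get(gram, ()):
            shared[word] += 1
    scored = []
    for word, count in shared.items():
        union = len(query_grams) + len(ngrams(word, n)) - count
        score = count / union  # Jaccard similarity of the n-gram sets
        if score >= min_score:
            scored.append((score, word))
    # Best score first; alphabetical order breaks ties.
    return [w for _, w in sorted(scored, key=lambda t: (-t[0], t[1]))]


index = build_index(["hello", "help", "yellow", "world"])  # sample vocabulary
print(candidates("hwllo", index))
```

In a fuller pipeline, the short candidate list returned here would then be re-ranked with a distance metric such as Levenshtein or a keyboard-distance heuristic, as the list above suggests.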

2009-04-16 18:07:37

"你是什么意思?"算法的工作吗?
