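I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names and so on. I've been impressed by how quickly some search engines can respond to a query with "Did you mean: xxxx".
I need to be able to take a user's query intelligently and respond not only with the raw search results but also with a "Did you mean?" suggestion when there is a highly likely alternative answer.
I'm developing in ASP.NET (VB - don't hold it against me!)
UPDATE: OK, how can I mimic this without the millions of "paying users"?
Generate typos for each "known" or "correct" term and perform lookups? Is there some other, more elegant method?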
Current answer
Normally a production spelling corrector uses several methods to provide a spelling suggestion. Some are:

Decide on a way to determine whether spelling correction is required. Signals may include insufficient results, results that are not specific or accurate enough (according to some measure), etc. Then:

Use a large body of text or a dictionary in which all, or most, words are known to be correctly spelled. These are easily found online, in places such as LingPipe. To determine the best suggestion, look for the word that is the closest match based on several measures. The most intuitive one is similar characters. Research and experimentation have shown that matching on two- or three-character sequences (bigrams and trigrams) works better. To further improve results, give extra weight to a match at the beginning or end of the word. For performance reasons, index all these words by their trigrams or bigrams, so that a lookup converts the query to n-grams and finds candidates via a hash table or trie.

Use heuristics related to potential keyboard mistakes based on character location, so that "hwllo" is corrected to "hello" because 'w' is next to 'e' on the keyboard.

Use a phonetic key (Soundex, Metaphone) to index the words and look up possible corrections. In practice this normally returns worse results than the n-gram indexing described above.

In each case you must select the best correction from a list of candidates. This may be done with a distance metric such as Levenshtein, the keyboard metric, etc.

For a multi-word phrase, only one word may be misspelled, in which case you can use the remaining words as context in determining the best match.
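
As a rough illustration of the n-gram index plus distance-metric steps described above, here is a minimal Python sketch. The function names (`ngrams`, `build_index`, `levenshtein`, `suggest`), the `$` padding character, the `max_distance` threshold, and the sample vocabulary are all assumptions made for this example, not part of any library. The padding is one simple way to make matches at the start and end of a word count for more, since those positions produce extra n-grams.

```python
from collections import defaultdict


def ngrams(word, n=3):
    """Return the set of character n-grams of a word.

    The word is padded with '$' (an arbitrary choice) so that its first and
    last characters appear in additional n-grams.
    """
    padded = "$" * (n - 1) + word.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(vocabulary, n=3):
    """Map each n-gram to the set of known terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest(query, index, n=3, max_distance=2):
    """Return the closest known term, or None if nothing is close enough."""
    query = query.lower()
    # Candidate terms are those sharing at least one n-gram with the query.
    shared = defaultdict(int)
    for gram in ngrams(query, n):
        for term in index.get(gram, ()):
            shared[term] += 1
    if not shared:
        return None
    # Rank by edit distance, then by n-gram overlap, then alphabetically.
    best = min(shared,
               key=lambda t: (levenshtein(query, t.lower()), -shared[t], t))
    if levenshtein(query, best.lower()) > max_distance:
        return None
    return best


# Hypothetical vocabulary; in practice this would be your own company names.
index = build_index(["Microsoft", "Micron", "Alphabet", "Amazon", "Apple"])
print(suggest("Microsfot", index))
```

Note that `suggest` returns the query's own term when it is already in the vocabulary, so a caller would only show "Did you mean?" when the result differs from what the user typed. A keyboard-proximity cost or a phonetic key could be swapped in for the ranking step in the same structure.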
Other answers
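Do you mean a spell checker? If it is a spell checker for single words rather than whole phrases, I have a link about spell checking in which the algorithm is developed in Python. Check this link
Meanwhile, I am also working on a project that involves searching a database using text. I think this would solve your problem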
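Here is the best answer I found: a spelling corrector implemented and described by Google's Director of Research, Peter Norvig.
If you want to read more about the theory behind it, you can read the chapter in his book.
The idea of this algorithm is based on statistical machine learning.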
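
For readers who want a feel for that style of corrector, below is a compressed Python sketch along the same lines: generate every string within one or two edits of the query, keep the ones that are known words, and pick the most frequent. It is an illustration under stated assumptions, not a copy of the published code; `CORPUS` is a made-up placeholder, and the helper names (`edits1`, `known`, `correction`) are chosen for this example. It also bears on the question's update: rather than generating misspellings of every known term up front, it generates edits of the query at lookup time.

```python
import re
from collections import Counter

# Hypothetical tiny corpus; a real deployment would count words from its own
# documents (company names and so on).
CORPUS = "apple amazon amazon alphabet microsoft microsoft microsoft micron"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """The subset of words that appear in the corpus."""
    return {w for w in words if w in WORD_COUNTS}


def correction(word):
    """Most frequent known word within two edits, else the word itself."""
    word = word.lower()
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or {word})
    # Prefer higher corpus frequency; break ties alphabetically.
    return max(candidates, key=lambda w: (WORD_COUNTS[w], w))


print(correction("microsfot"))
```

The word counts stand in for a probability model: with only a dictionary and no frequencies, every known candidate would tie and the choice would fall back to the tie-break.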
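For the theory behind the "did you mean" algorithm, you can refer to Chapter 3 of Introduction to Information Retrieval. It is available online for free. Section 3.3 (page 52) answers your question exactly. And to answer your update specifically: you only need a dictionary of words and nothing else (including millions of users).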
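Regarding your question of how to mimic the behavior without having tons of data: why not use the tons of data collected by Google? Download the Google search results for the misspelled word and search for "Did you mean:" in the HTML.
I guess that's called a mashup nowadays :-)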