

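Decision trees, support vector machines, Bayesian, neural networks, k-nearest neighbours, Q-learning, genetic algorithms, Markov decision processes, convolutional neural networks, linear regression or logistic regression, boosting, bagging, sampling, random hill climbing or simulated annealing ...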






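What other guidelines are there? Even an answer like "if you have to explain your model to some upper-management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" would be good. I care less about implementation/library issues, though.

Also, as a somewhat separate question: besides standard Bayesian classifiers, are there "standard state-of-the-art" methods for detecting comment spam (as opposed to email spam)?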


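Sam Roweis used to say that you should try naive Bayes, logistic regression, k-nearest neighbours and Fisher's linear discriminant before anything else.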








Do you need to train incrementally (as opposed to in batches)? If you need to update your classifier with new data frequently (or you have tons of data), you'll probably want to use Bayesian, since a naive Bayes model is just a set of counts that can be updated one example at a time. Standard SVM training works on the training data in one go, and neural nets usually need many passes over it.

Is your data composed of categorical only, or numeric only, or both? I think Bayesian works best with categorical/binomial data. Plain classification trees can't predict numerical values; for that you need a regression variant.

Do you or your audience need to understand how the classifier works? Use Bayesian or decision trees, since these can be easily explained to most people. Neural networks and SVMs are "black boxes" in the sense that you can't really see how they are classifying data.

How much classification speed do you need? SVMs are fast when it comes to classifying, since they only need to determine which side of the "line" your data is on. Decision trees can be slow, especially when they're complex (e.g. lots of branches).

Complexity. Neural nets and SVMs can handle complex non-linear classification.
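To make the incremental-training point concrete, here is a minimal sketch of a categorical naive Bayes classifier that is updated one example at a time. The class name, the `update`/`predict` method names and the Laplace smoothing constant are my own assumptions for illustration, not taken from any particular library.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Categorical naive Bayes that learns one example at a time (illustrative sketch)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # Laplace smoothing constant (assumed default)
        self.class_counts = defaultdict(int)
        # feature_counts[label][feature_index][value] -> number of times seen
        self.feature_counts = defaultdict(
            lambda: defaultdict(lambda: defaultdict(int))
        )
        # Distinct values seen per feature index, needed for smoothing.
        self.values_seen = defaultdict(set)

    def update(self, features, label):
        """Fold a single labelled example into the counts; no retraining needed."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[label][i][value] += 1
            self.values_seen[i].add(value)

    def predict(self, features):
        """Return the label with the highest log posterior under the current counts."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            for i, value in enumerate(features):
                count = self.feature_counts[label][i][value]
                n_values = len(self.values_seen[i]) or 1
                score += math.log(
                    (count + self.smoothing) / (n + self.smoothing * n_values)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = IncrementalNaiveBayes()
    stream = [
        (("sunny", "hot"), "no"),
        (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"),
        (("rainy", "cool"), "yes"),
    ]
    for features, label in stream:
        model.update(features, label)  # each new example is just a few counter bumps
    print(model.predict(("rainy", "mild")))
```

Because the whole model is a table of counts, new data can be folded in at any time without revisiting old examples, which is the property the first question above is getting at.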



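Boosting: often effective when a large amount of training data is available.
Random trees: often very effective, and can also perform regression.
K-nearest neighbours: the simplest thing you can do; often effective, but slow and requires lots of memory.
Neural networks: slow to train but very fast to run; still the best performer for letter recognition.
SVM: among the best with limited data, losing out to boosting or random trees only when large data sets are available.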

As Prof. Andrew Ng often says: always begin by implementing a rough, dirty algorithm, and then iteratively refine it.

For classification, Naive Bayes is a good starter: it performs well, is highly scalable and can adapt to almost any kind of classification task. 1NN (k-nearest neighbours with only 1 neighbour) is also a no-hassle choice, because the data itself is the model, so you don't have to care about how well the shape of a decision boundary fits your data. The only issue is the computation cost: a brute-force search compares each query against every training point, which is quadratic if you classify a whole dataset against itself (you effectively compute the distance matrix), so it may not be a good fit for large or high-dimensional data.
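As a rough illustration of why 1NN is "no-hassle", here is a brute-force sketch in plain Python. The function name and the choice of Euclidean distance are assumptions made for the example; any other distance could be swapped in.

```python
import math


def nearest_neighbour_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query`.

    Brute force: one distance computation per training point, so the
    cost of a single prediction grows linearly with the training set.
    """
    best_index = min(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),  # Euclidean distance
    )
    return train_labels[best_index]


if __name__ == "__main__":
    points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
    labels = ["a", "a", "b", "b"]
    print(nearest_neighbour_predict(points, labels, (1.1, 0.9)))
    print(nearest_neighbour_predict(points, labels, (4.8, 5.2)))
```

There is no training step at all: the stored points are the model, which is also why memory use and prediction time grow with the amount of data.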

Another good starter algorithm is Random Forests (ensembles of decision trees), which scale to a large number of dimensions and generally perform quite acceptably. Finally, there are genetic algorithms, which scale admirably well to any dimension and any data with minimal knowledge of the data itself. The most minimal and simplest implementation is the microbial genetic algorithm (only one line of C code! by Inman Harvey in 1996), and among the most complex are CMA-ES and MOGA/e-MOEA.
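For readers who have not seen it, the idea behind the microbial genetic algorithm can be sketched in a few lines of Python. This is not Harvey's one-line C version, only an illustration of the tournament scheme as I understand it: pick two individuals at random, keep the winner untouched, and let the loser copy some of the winner's genes and then mutate. The function name, parameter names and default rates are assumptions.

```python
import random


def microbial_ga(fitness, genome_length, pop_size=20, tournaments=2000,
                 crossover_rate=0.5, mutation_rate=0.05, rng=None):
    """Evolve bit-string genomes; returns the fittest individual found."""
    if rng is None:
        rng = random.Random()
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]
    for _ in range(tournaments):
        # Pick two distinct individuals and compare them.
        a, b = rng.sample(range(pop_size), 2)
        if fitness(population[a]) >= fitness(population[b]):
            winner, loser = population[a], population[b]
        else:
            winner, loser = population[b], population[a]
        # The winner is left unchanged; only the loser is overwritten in place.
        for i in range(genome_length):
            if rng.random() < crossover_rate:
                loser[i] = winner[i]   # "infection": copy a gene from the winner
            if rng.random() < mutation_rate:
                loser[i] ^= 1          # mutation: flip the bit
    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy problem (OneMax): fitness is simply the number of 1 bits.
    best = microbial_ga(sum, genome_length=16, rng=random.Random(0))
    print(best, sum(best))
```

Since the winner of each tournament is never modified, the best fitness in the population cannot decrease, which gives a form of elitism without any extra bookkeeping.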

