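Can someone explain the difference between classification and clustering in data mining?
If you can, please give examples of both to convey the main idea.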
Current answer
I am sure a number of you have heard about machine learning. A dozen of you might even know what it is. And a couple of you might have worked with machine learning algorithms too. You see where this is going? Not a lot of people are familiar with the technology that will be absolutely essential 5 years from now. Siri is machine learning. Amazon’s Alexa is machine learning. Ad and shopping item recommender systems are machine learning. Let’s try to understand machine learning with a simple analogy of a 2-year-old boy. Just for fun, let’s call him Kylo Ren.
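Let’s assume Kylo Ren sees an elephant. What will his brain tell him? (Remember, he has minimal thinking capacity, even if he is the successor to Vader.) His brain will tell him that he saw a big moving creature which was grey in color. He sees a cat next, and his brain tells him that it is a small moving creature which is golden in color. Finally, he sees a light saber, and his brain tells him that it is a non-living object which he can play with!
His brain at this point knows that the saber is different from the elephant and the cat, because the saber is something to play with and doesn’t move on its own. His brain can figure this much out even if Kylo doesn’t know what "moving" means. This simple phenomenon is called clustering.
Machine learning is nothing but the mathematical version of this process. A lot of people who study statistics realized that they can make some equations work the same way the brain works. The brain can cluster similar objects, the brain can learn from mistakes, and the brain can learn to identify things.
All of this can be represented with statistics, and the computer-based simulation of this process is called machine learning. Why do we need the computer-based simulation? Because computers can do heavy math faster than human brains. I would love to go into the mathematical/statistical part of machine learning, but you don’t want to jump into that without clearing up some concepts first.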
Let’s get back to Kylo Ren. Let’s say Kylo picks up the saber and starts playing with it. He accidentally hits a stormtrooper and the stormtrooper gets injured. He doesn’t understand what’s going on and continues playing. Next he hits a cat and the cat gets injured. This time Kylo is sure he has done something bad, and tries to be somewhat careful. But given his bad saber skills, he hits the elephant and is absolutely sure that he is in trouble. He becomes extremely careful thereafter, and only hits his dad on purpose, as we saw in The Force Awakens!
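This whole process of learning from mistakes can be mimicked with equations, where the feeling of doing something wrong is represented by an error or cost. This process of identifying what not to do with a saber is called classification. Clustering and classification are the absolute basics of machine learning. Let’s look at the difference between them.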
Kylo differentiated between animals and the light saber because his brain decided that light sabers can’t move by themselves and are therefore different. The decision was based solely upon the objects present (the data), and no external help or advice was provided. In contrast to this, Kylo learned the importance of being careful with the light saber by first observing what hitting an object can do. The decision wasn’t based on the saber alone, but on what it could do to different objects. In short, there was some help here.
Because of this difference in learning, clustering is called an unsupervised learning method and classification is called a supervised learning method. They are very different in the machine learning world, and the choice between them is often dictated by the kind of data present. Obtaining labelled data (or things that help us learn, like the stormtrooper, elephant and cat in Kylo’s case) is often not easy and becomes very complicated when the data to be differentiated is large. On the other hand, learning without labels has its own disadvantages, like not knowing what the label titles are. If Kylo were to learn to be careful with the saber without any examples or help, he wouldn’t know what it would do. He would just know that it is not supposed to be done. It’s kind of a lame analogy, but you get the point!
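To make the labelled versus unlabelled contrast concrete, here is a minimal sketch in plain Python. The data, the "small"/"large" labels and the function names are all made up for illustration. The supervised half uses a nearest-centroid rule and the unsupervised half simply cuts the sorted values at the widest gap; both are stand-ins chosen for brevity, not the only way to classify or cluster.

```python
from statistics import mean

def fit_centroids(labelled):
    """Supervised: average the feature value of each labelled class."""
    by_label = {}
    for value, label in labelled:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def classify(centroids, value):
    """Assign the label whose class average is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def split_at_largest_gap(values):
    """Unsupervised: cut the sorted values into two groups at the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

# Hypothetical sizes with labels supplied by a "teacher".
labelled = [(1.0, "small"), (1.2, "small"), (3.0, "large"), (3.3, "large")]
centroids = fit_centroids(labelled)
prediction = classify(centroids, 2.9)  # uses the labels learned above

# The same sizes with no labels at all.
unlabelled = [3.3, 1.0, 3.0, 1.2]
groups = split_at_largest_gap(unlabelled)  # two groups, but no names for them
```

Note that the clustering half returns two groups without names: that is the "not knowing what the label titles are" drawback mentioned above.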
We are just getting started with machine learning. What gets predicted can be a continuous number or a label. For instance, if Kylo had to predict each stormtrooper’s height, there would be a lot of possible answers, because the heights can be 5.0, 5.01, 5.011, etc. But a simple classification like types of light sabers (red, blue, green) would have very limited answers. In fact, they can be represented with simple numbers: red can be 0, blue can be 1 and green can be 2.
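If you know basic math, you know that 0, 1, 2 and 5.0, 5.01, 5.011 are different kinds of numbers: discrete and continuous, respectively. Predicting a discrete label is classification proper (logistic regression, despite its name, is one common method for it), while predicting a continuous number is called regression. You may also see the discrete case called categorical classification, so don’t get confused when you read that term elsewhere.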
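A small sketch of the two kinds of target, in plain Python. The colour encoding follows the 0/1/2 scheme above. The arm-span and height numbers are invented purely for illustration, and the straight-line least-squares fit is just one simple way to predict a continuous value.

```python
# Discrete target: a handful of labels, encoded as small integers.
SABER_COLOURS = ["red", "blue", "green"]
colour_to_code = {colour: code for code, colour in enumerate(SABER_COLOURS)}

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Continuous target: hypothetical arm spans and heights, both in feet.
arm_spans = [4.0, 5.0, 6.0]
heights = [4.1, 5.1, 6.1]
slope, intercept = fit_line(arm_spans, heights)

# The prediction can be any real number, not one of a few codes.
predicted_height = slope * 5.5 + intercept
```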
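This was a very basic introduction to machine learning. I will dive into the statistical side in my next post. Please let me know if I need any corrections :)
The second part is posted here.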
Other answers
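Clustering is a method of grouping objects in such a way that objects with similar features come together and objects with dissimilar features stay apart. It is a common technique for statistical data analysis used in machine learning and data mining.
Classification is a process of categorization in which objects are recognized, differentiated, and understood on the basis of a training data set. Classification is a supervised learning technique in which a training set and correctly defined observations are available.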
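In general, in classification you have a set of predefined classes and want to know which class a new object belongs to.
Clustering tries to group a set of objects and find out whether there is some relationship between the objects.
In the context of machine learning, classification is supervised learning and clustering is unsupervised learning.
Also have a look at Classification and Clustering on Wikipedia.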
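I am new to data mining, but as my textbook says, classification is supposed to be supervised learning and clustering unsupervised learning. The difference between supervised and unsupervised learning can be found here.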
There are two terms in data mining: "supervised" and "unsupervised". When someone tells the computer (the algorithm, the code, ...) that this thing is an apple and that thing is an orange, that is supervised learning, and using supervised learning (for example, a tag on each sample in the data set) to classify the data gives you classification. If, on the other hand, you let the computer work out what is what and differentiate between the features of the given data set on its own, that is, learn unsupervised, then grouping the data set this way is called clustering. In this case the data fed to the algorithm have no tags, and the algorithm has to find the different classes itself.
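Classification
is assigning predefined classes to new observations, based on learning from examples.
It is one of the key tasks in machine learning.
Clustering (or cluster analysis)
While popularly dismissed as "unsupervised classification", it is quite different.
Contrary to what many machine learners will teach you, it is not about assigning "classes" to objects without having them predefined. That is the very limited view of people who have done too much classification; a typical case of "if you have a hammer (classifier), everything looks like a nail (classification problem) to you". But it is also why people who work on classification don't get the hang of clustering.
Instead, consider it as structure discovery. The task of clustering is to find structure (e.g. groups) in your data that you did not know before. Clustering has been successful if you learned something new. It failed if you only got the structure you already knew.
Cluster analysis is a key task of data mining (and the ugly duckling in machine learning, so don't listen to machine learners dismissing clustering).
"Unsupervised learning" is somewhat of an oxymoron
This has been repeated up and down the literature, but unsupervised learning is nonsense. It does not exist; it is an oxymoron like "military intelligence".
Either the algorithm learns from examples (then it is "supervised learning"), or it does not learn. If all clustering methods are "learning", then computing the minimum, maximum, and average of a data set is "unsupervised learning", too. Then any computation "learned" its output. Thus the term "unsupervised learning" is totally meaningless: it means everything and nothing.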
Some "unsupervised learning" algorithms do, however, fall into the optimization category. For example, k-means is a least-squares optimization. Such methods are all over statistics, so I don't think we need to label them "unsupervised learning"; we should instead continue to call them "optimization problems". It's more precise, and more meaningful. There are plenty of clustering algorithms that do not involve optimization and that do not fit well into machine-learning paradigms. So stop squeezing them in there under the umbrella of "unsupervised learning".
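To illustrate the "k-means is a least-squares optimization" point, here is a rough sketch of the standard Lloyd iteration in plain Python, recording the within-cluster sum of squared distances after every pass. The points and the deliberately poor starting centroids are made up, and the function names are mine. The quantity being minimized is the recorded sum of squares, which should never go up from one pass to the next.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm; also returns the sum of squares after each pass."""
    centroids = list(centroids)
    history = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = mean_point(members)
        # Least-squares objective for this pass.
        history.append(
            sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
        )
    return centroids, assignment, history

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = [(1.0, 1.0), (1.5, 2.0)]  # poor start: both centroids in one group
centroids, assignment, history = kmeans(points, initial)
```

On this toy data the objective drops sharply after the first pass and then stays flat, which is what you would expect from an optimization procedure that has converged.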
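There is some "learning" associated with clustering, but it is not the program that learns. It is the user who is supposed to learn new things about his data set.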