是否可以使用scikit-learn K-Means聚类来指定自己的距离函数?
当前回答
只要在可以这样做的地方使用nltk即可,例如:
from nltk.cluster.kmeans import KMeansClusterer
NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
其他回答
Sklearn Kmeans使用欧几里得距离。它没有度量参数。也就是说,如果你在聚类时间序列,你可以使用tslearn python包,当你可以指定一个度量(dtw, softdtw,欧几里得)。
只要在可以这样做的地方使用nltk即可,例如:
from nltk.cluster.kmeans import KMeansClusterer
NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
python/ c++中有pyclustering(所以它很快!),可以让你指定一个自定义度量函数
from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric
user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)
# create K-Means algorithm with specific distance metric
start_centers = [[4.7, 5.9], [5.7, 6.5]];
kmeans_instance = kmeans(sample, start_centers, metric=metric)
# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()
实际上,我还没有测试这段代码,但它拼凑在一起从一个票和示例代码。
The Affinity propagation algorithm from the sklearn library allows you to pass the similarity matrix instead of the samples. So, you can use your metric to compute the similarity matrix (not the dissimilarity matrix) and pass it to the function by setting the "affinity" term to "precomputed".https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation.fit In terms of the K-Mean, I think it is also possible but I have not tried it. However, as the other answers stated, finding the mean using a different metric will be the issue. Instead, you can use PAM (K-Medoids) algorthim as it calculates the change in Total Deviation (TD), thus it does not rely on the distance metric. https://python-kmedoids.readthedocs.io/en/latest/#fasterpam
def distance_metrics(dist_metrics):
kmeans_instance = kmeans(trs_data, initial_centers, metric=dist_metrics)
label = np.zeros(210, dtype=int)
for i in range(0, len(clusters)):
for index, j in enumerate(clusters[i]):
label[j] = i
推荐文章
- DeprecationWarning:无效转义序列-用什么代替\d?
- 如何改变日期时间格式在熊猫
- 使用Conda进行批量包更新
- Ipython笔记本清除单元格输出代码
- ImportError: numpy.core.multiarray导入失败
- 有办法在Python中使用PhantomJS吗?
- 如何在Python中将if/else压缩成一行?
- 如何在Python 3中使用pip。Python 2.x
- 如何让IntelliJ识别常见的Python模块?
- Django:“projects”vs“apps”
- 如何列出导入的模块?
- 转换Python程序到C/ c++代码?
- 如何从gmtime()的时间+日期输出中获得自epoch以来的秒数?
- 在python模块文档字符串中放入什么?
- 我如何在Django中过滤一个DateTimeField的日期?