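Is it possible to specify your own distance function using scikit-learn K-Means clustering?
Current answer
There is pyclustering, which is implemented in Python/C++ (so it's fast!) and lets you specify a custom metric function.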
from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

# toy 2-D data (example values)
sample = [[4.5, 5.8], [4.9, 6.0], [5.6, 6.4], [5.8, 6.6]]

# placeholder metric; swap in your own distance between two points
user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# create K-Means algorithm with specific distance metric
start_centers = [[4.7, 5.9], [5.7, 6.5]]
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()
Actually, I haven't tested this code; I pieced it together from a ticket and the example code.
Other answers
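Yes, in the current stable version of sklearn (scikit-learn 1.1.3) you can easily use your own distance metric. All you have to do is create a class that inherits from sklearn.cluster.KMeans and override its _transform method.
The example below uses the IOU distance from the YOLOv2 paper.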
import sklearn.cluster
import numpy as np


def anchor_iou(box_dims, centroid_box_dims):
    # IOU between every (width, height) box and every centroid box;
    # returns an array of shape (num_boxes, num_centroids)
    box_w, box_h = box_dims[..., 0], box_dims[..., 1]
    centroid_w, centroid_h = centroid_box_dims[..., 0], centroid_box_dims[..., 1]
    inter_w = np.minimum(box_w[..., np.newaxis], centroid_w[np.newaxis, ...])
    inter_h = np.minimum(box_h[..., np.newaxis], centroid_h[np.newaxis, ...])
    inter_area = inter_w * inter_h
    centroid_area = centroid_w * centroid_h
    box_area = box_w * box_h
    return inter_area / (
        centroid_area[np.newaxis, ...] + box_area[..., np.newaxis] - inter_area
    )


class IOUKMeans(sklearn.cluster.KMeans):
    def __init__(
        self,
        n_clusters=8,
        *,
        init="k-means++",
        n_init=10,
        max_iter=300,
        tol=1e-4,
        verbose=0,
        random_state=None,
        copy_x=True,
        algorithm="lloyd",
    ):
        super().__init__(
            n_clusters=n_clusters,
            init=init,
            n_init=n_init,
            max_iter=max_iter,
            tol=tol,
            verbose=verbose,
            random_state=random_state,
            copy_x=copy_x,
            algorithm=algorithm,
        )

    def _transform(self, X):
        # private hook behind transform(); it may change between scikit-learn versions
        return anchor_iou(X, self.cluster_centers_)


rng = np.random.default_rng(12345)
num_boxes = 10
num_clusters = 3  # number of anchor clusters (example value)
# low=1 avoids zero-width or zero-height boxes, which would give 0/0 in anchor_iou
bboxes = rng.integers(low=1, high=100, size=(num_boxes, 2))
kmeans = IOUKMeans(num_clusters).fit(bboxes)
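For reference, the YOLOv2 paper defines its clustering distance as one minus the IOU, so a larger overlap means a smaller distance. Written out for boxes described only by width and height and assumed to share a corner (the same assumption anchor_iou makes), a sketch of the formula is below; the symbols b (box) and c (centroid) are my own notation:

```latex
\mathrm{IOU}(b, c) =
  \frac{\min(w_b, w_c)\,\min(h_b, h_c)}
       {w_b h_b + w_c h_c - \min(w_b, w_c)\,\min(h_b, h_c)},
\qquad
d(b, c) = 1 - \mathrm{IOU}(b, c)
```

Note that anchor_iou returns the IOU itself, so 1 - anchor_iou(...) is the quantity that corresponds to d.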
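Unfortunately no: scikit-learn's current implementation of k-means only uses Euclidean distance.
Extending k-means to other distances is not trivial, and denis's answer above is not the correct way to implement k-means for other metrics.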
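One way to see why the extension is not trivial: the k-means update step uses the mean because the mean minimizes the sum of squared Euclidean distances, and that property does not carry over to other metrics. A minimal sketch on a toy 1-D cluster, using only the standard library (the helper names are made up for illustration):

```python
from statistics import mean, median


def total_cost(points, center, dist):
    """Sum of distances from every point to a candidate center."""
    return sum(dist(p, center) for p in points)


def squared_euclidean(a, b):
    return (a - b) ** 2


def manhattan(a, b):
    return abs(a - b)


points = [0.0, 0.0, 10.0]  # toy 1-D cluster
m, med = mean(points), median(points)

# Under squared Euclidean distance the mean is the better center...
sq_at_mean = total_cost(points, m, squared_euclidean)
sq_at_median = total_cost(points, med, squared_euclidean)

# ...but under Manhattan distance the median beats the mean.
l1_at_mean = total_cost(points, m, manhattan)
l1_at_median = total_cost(points, med, manhattan)

print(sq_at_mean < sq_at_median, l1_at_median < l1_at_mean)
```

So swapping only the distance in the assignment step while still averaging in the update step no longer minimizes the objective you actually care about.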
The Affinity Propagation algorithm from the sklearn library lets you pass a similarity matrix instead of the samples. So you can use your own metric to compute the similarity matrix (not the dissimilarity matrix) and pass it to fit by setting the "affinity" parameter to "precomputed": https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation.fit
As for K-Means, I think it is also possible, but I have not tried it. However, as the other answers stated, finding the mean under a different metric will be the issue. Instead, you can use the PAM (K-Medoids) algorithm: it optimizes the change in Total Deviation (TD) computed from pairwise dissimilarities, so it does not depend on any particular distance metric. https://python-kmedoids.readthedocs.io/en/latest/#fasterpam
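To illustrate the K-Medoids idea, here is a minimal pure-Python sketch that clusters from a precomputed dissimilarity matrix built with an arbitrary user function. It is a plain alternating assign/update loop, not FasterPAM and not the python-kmedoids API; all function names are hypothetical.

```python
import random


def pairwise_dissimilarity(points, dist):
    """Build the full n x n dissimilarity matrix from a user-supplied function."""
    return [[dist(p, q) for q in points] for p in points]


def assign(diss, medoids):
    """Label each point with the index (into `medoids`) of its closest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: diss[i][medoids[c]])
        for i in range(len(diss))
    ]


def k_medoids(diss, k, max_iter=100, seed=0):
    """Simple alternating k-medoids; returns (medoid_indices, labels)."""
    n = len(diss)
    medoids = random.Random(seed).sample(range(n), k)
    for _ in range(max_iter):
        labels = assign(diss, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster went empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total dissimilarity
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(diss[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(diss, medoids)


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
diss = pairwise_dissimilarity(points, manhattan)
medoids, labels = k_medoids(diss, k=2)
print(medoids, labels)
```

Because the cluster centers are always actual data points, only the matrix of pairwise values is needed, which is why any dissimilarity function can be plugged in.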
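Sklearn KMeans uses Euclidean distance. It has no metric parameter. That said, if you are clustering time series, you can use the tslearn Python package, where you can specify a metric (dtw, softdtw, euclidean).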
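To give a feel for what a DTW-style metric measures, here is a minimal pure-Python sketch of dynamic time warping between two 1-D sequences. It assumes absolute difference as the local cost and no window constraint; tslearn's own implementation may define the cost differently, so treat this as an illustration rather than a reimplementation.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping cost between two 1-D sequences.

    Uses absolute difference as the local cost and no window constraint.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # repeat b[j-1]
                cost[i][j - 1],      # repeat a[i-1]
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


# A stretched copy of the same shape aligns at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
print(same_shape, shifted)
```

In tslearn the corresponding choice is made through the metric argument of TimeSeriesKMeans, for example metric="dtw".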