Is it possible to specify your own distance function using scikit-learn K-Means clustering?
Current answer
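Yes, you can use a different metric function; however, by definition, the k-means clustering algorithm relies on the Euclidean distance from the mean of each cluster.
You could use a different metric, so even though you are still computing the mean, you could use something like the Mahalanobis distance.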
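The idea in this answer (keep the mean as the cluster center, but measure closeness with another metric) can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: `kmeans_custom` and `manhattan` are hypothetical names, and because the update step still uses the arithmetic mean, convergence is only guaranteed when the distance is squared Euclidean.

```python
import random


def kmeans_custom(points, k, distance, n_iter=100, seed=0):
    """Lloyd-style k-means whose assignment step uses `distance`.

    The update step still takes the coordinate-wise mean, so this is a
    heuristic for any distance other than squared Euclidean.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center under the custom distance.
        new_labels = [
            min(range(k), key=lambda j: distance(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: plain arithmetic mean of each cluster's members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def manhattan(a, b):
    """L1 distance, used here only as an example of a non-Euclidean metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans_custom(points, k=2, distance=manhattan)
```

Any callable that takes two points and returns a number can be passed as `distance`; whether the result is meaningful depends on how well the mean represents a cluster under that metric.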
Other answers
There is pyclustering, which is Python/C++ (so it's fast!) and lets you specify a custom metric function:
from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

# placeholder callable taking two points; substitute your own distance here
user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# create K-Means algorithm with specific distance metric
# (`sample` is your dataset, a list of points; it must be defined beforehand)
start_centers = [[4.7, 5.9], [5.7, 6.5]]
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()
Admittedly, I haven't tested this code; I cobbled it together from a ticket and example code.
The Affinity Propagation algorithm in the sklearn library allows you to pass a similarity matrix instead of the samples. So you can use your metric to compute the similarity matrix (not the dissimilarity matrix) and pass it to the fit function after setting the "affinity" parameter to "precomputed". https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation.fit As for K-Means, I think it is also possible, but I have not tried it. However, as the other answers stated, finding the mean under a different metric will be the issue. Instead, you can use the PAM (K-Medoids) algorithm: it evaluates the change in Total Deviation (TD) and picks cluster centers from the data points themselves, so it never has to compute a mean and works with an arbitrary dissimilarity. https://python-kmedoids.readthedocs.io/en/latest/#fasterpam
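To illustrate why medoid-based clustering sidesteps the mean problem, here is a minimal pure-Python sketch that works only from a precomputed dissimilarity matrix. It uses the simple alternating assign/update heuristic, not the PAM or FasterPAM swap procedure from the linked library, and the name `k_medoids` and the naive "first k points" initialization are assumptions made for this example.

```python
def k_medoids(diss, k, n_iter=100):
    """Alternating k-medoids on a precomputed dissimilarity matrix.

    `diss[i][j]` may come from any dissimilarity function; no mean is
    ever computed, because every center is one of the data points.
    """
    n = len(diss)

    def assign(medoids):
        # Label each point with the position of its closest medoid.
        return [min(range(k), key=lambda c: diss[i][medoids[c]]) for i in range(n)]

    medoids = list(range(k))  # naive initialization: the first k points
    for _ in range(n_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total dissimilarity
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(diss[m][i] for i in members))
            )
        if new_medoids == medoids:
            break  # medoids are stable
        medoids = new_medoids
    return assign(medoids), medoids


# Toy data: 1-D values with absolute difference as the dissimilarity.
values = [1, 2, 3, 20, 21, 22]
diss = [[abs(a - b) for b in values] for a in values]
labels, medoids = k_medoids(diss, k=2)
```

On this toy input the two groups separate cleanly and the medoids are the middle value of each group. For real data, a dedicated implementation such as the one linked above is likely the better choice.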
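Yes, in the current stable version of sklearn (scikit-learn 1.1.3), you can easily use your own distance metric. All you have to do is create a class that inherits from sklearn.cluster.KMeans and override its _transform method.
The example below uses the IoU distance from the YOLOv2 paper.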
import sklearn.cluster
import numpy as np


def anchor_iou(box_dims, centroid_box_dims):
    # box_dims: (n_boxes, 2) widths/heights; centroid_box_dims: (n_centroids, 2)
    box_w, box_h = box_dims[..., 0], box_dims[..., 1]
    centroid_w, centroid_h = centroid_box_dims[..., 0], centroid_box_dims[..., 1]
    # boxes are treated as sharing a corner, so the overlap is min(w) * min(h)
    inter_w = np.minimum(box_w[..., np.newaxis], centroid_w[np.newaxis, ...])
    inter_h = np.minimum(box_h[..., np.newaxis], centroid_h[np.newaxis, ...])
    inter_area = inter_w * inter_h
    centroid_area = centroid_w * centroid_h
    box_area = box_w * box_h
    # IoU of every box with every centroid, shape (n_boxes, n_centroids)
    return inter_area / (
        centroid_area[np.newaxis, ...] + box_area[..., np.newaxis] - inter_area
    )


class IOUKMeans(sklearn.cluster.KMeans):
    def __init__(
        self,
        n_clusters=8,
        *,
        init="k-means++",
        n_init=10,
        max_iter=300,
        tol=1e-4,
        verbose=0,
        random_state=None,
        copy_x=True,
        algorithm="lloyd",
    ):
        super().__init__(
            n_clusters=n_clusters,
            init=init,
            n_init=n_init,
            max_iter=max_iter,
            tol=tol,
            verbose=verbose,
            random_state=random_state,
            copy_x=copy_x,
            algorithm=algorithm,
        )

    def _transform(self, X):
        # private hook behind transform(); returns IoU with each centroid
        # in place of the default Euclidean distances
        return anchor_iou(X, self.cluster_centers_)


rng = np.random.default_rng(12345)
num_boxes = 10
num_clusters = 3  # example value; choose to suit your data
bboxes = rng.integers(low=0, high=100, size=(num_boxes, 2))

kmeans = IOUKMeans(num_clusters).fit(bboxes)
Spectral Python's k-means allows the use of L1 (Manhattan) distance.
Just use nltk instead, where you can do this, e.g.:
from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import cosine_distance

NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()

# cluster with cosine distance, keeping the best of 25 randomized restarts
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
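As a side note on the cosine distance used above: for vectors scaled to unit length, the squared Euclidean distance equals twice the cosine distance (defined here as 1 minus the cosine similarity). The short check below verifies that identity numerically. The helper names are made up for this sketch, and the identity only relates the two distances; it does not make Euclidean k-means on normalized rows identical to cosine k-means, since the cluster means are generally not unit length.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def normalize(u):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


u, v = [3.0, 4.0], [1.0, 0.0]
lhs = squared_euclidean(normalize(u), normalize(v))  # on unit vectors
rhs = 2.0 * cosine_distance(u, v)
print(math.isclose(lhs, rhs))
```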