我需要在numpy.array中找到唯一的行。
例如:
>>> a # I have
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
我知道我可以创建一个集和循环数组,但我正在寻找一个有效的纯numpy解决方案。我相信有一种方法可以将数据类型设置为void,然后我可以使用numpy。唯一的,但我不知道如何让它工作。
我们实际上可以将m x n数值numpy数组转换为m x 1 numpy字符串数组,请尝试使用以下函数,它提供了count, inverse_idx等,就像numpy.unique一样:
import numpy as np
def uniqueRow(a):
#This function turn m x n numpy array into m x 1 numpy array storing
#string, and so the np.unique can be used
#Input: an m x n numpy array (a)
#Output unique m' x n numpy array (unique), inverse_indx, and counts
s = np.chararray((a.shape[0],1))
s[:] = '-'
b = (a).astype(np.str)
s2 = np.expand_dims(b[:,0],axis=1) + s + np.expand_dims(b[:,1],axis=1)
n = a.shape[1] - 2
for i in range(0,n):
s2 = s2 + s + np.expand_dims(b[:,i+2],axis=1)
s3, idx, inv_, c = np.unique(s2,return_index = True, return_inverse = True, return_counts = True)
return a[idx], inv_, c
例子:
A = np.array([[ 3.17 9.502 3.291],
[ 9.984 2.773 6.852],
[ 1.172 8.885 4.258],
[ 9.73 7.518 3.227],
[ 8.113 9.563 9.117],
[ 9.984 2.773 6.852],
[ 9.73 7.518 3.227]])
B, inv_, c = uniqueRow(A)
Results:
B:
[[ 1.172 8.885 4.258]
[ 3.17 9.502 3.291]
[ 8.113 9.563 9.117]
[ 9.73 7.518 3.227]
[ 9.984 2.773 6.852]]
inv_:
[3 4 1 0 2 4 0]
c:
[2 1 1 1 2]
我比较了速度的建议替代方案,惊奇地发现,void视图唯一解决方案甚至比numpy的带有axis参数的本机唯一解决方案还要快一点。如果你想要速度,你会想要
numpy.unique(
a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
).view(a.dtype).reshape(-1, a.shape[1])
我已经在npx.unique_rows中实现了最快的变体。
在GitHub上也有一个bug报告。
代码重现情节:
import numpy
import perfplot
def unique_void_view(a):
return (
numpy.unique(a.view(numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))))
.view(a.dtype)
.reshape(-1, a.shape[1])
)
def lexsort(a):
ind = numpy.lexsort(a.T)
return a[
ind[numpy.concatenate(([True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)))]
]
def vstack(a):
return numpy.vstack([tuple(row) for row in a])
def unique_axis(a):
return numpy.unique(a, axis=0)
perfplot.show(
setup=lambda n: numpy.random.randint(2, size=(n, 20)),
kernels=[unique_void_view, lexsort, vstack, unique_axis],
n_range=[2 ** k for k in range(15)],
xlabel="len(a)",
equality_check=None,
)
我们实际上可以将m x n数值numpy数组转换为m x 1 numpy字符串数组,请尝试使用以下函数,它提供了count, inverse_idx等,就像numpy.unique一样:
import numpy as np
def uniqueRow(a):
#This function turn m x n numpy array into m x 1 numpy array storing
#string, and so the np.unique can be used
#Input: an m x n numpy array (a)
#Output unique m' x n numpy array (unique), inverse_indx, and counts
s = np.chararray((a.shape[0],1))
s[:] = '-'
b = (a).astype(np.str)
s2 = np.expand_dims(b[:,0],axis=1) + s + np.expand_dims(b[:,1],axis=1)
n = a.shape[1] - 2
for i in range(0,n):
s2 = s2 + s + np.expand_dims(b[:,i+2],axis=1)
s3, idx, inv_, c = np.unique(s2,return_index = True, return_inverse = True, return_counts = True)
return a[idx], inv_, c
例子:
A = np.array([[ 3.17 9.502 3.291],
[ 9.984 2.773 6.852],
[ 1.172 8.885 4.258],
[ 9.73 7.518 3.227],
[ 8.113 9.563 9.117],
[ 9.984 2.773 6.852],
[ 9.73 7.518 3.227]])
B, inv_, c = uniqueRow(A)
Results:
B:
[[ 1.172 8.885 4.258]
[ 3.17 9.502 3.291]
[ 8.113 9.563 9.117]
[ 9.73 7.518 3.227]
[ 9.984 2.773 6.852]]
inv_:
[3 4 1 0 2 4 0]
c:
[2 1 1 1 2]