Since this question has already been thoroughly answered, I thought a runtime analysis of the proposed methods would be of interest (it was for me, anyway). I will also look at the behavior of the methods at the center and at the edges of the noisy dataset.
TL;DR
                    | runtime in s | runtime in s
method              | python list  | numpy array
--------------------|--------------|--------------
kernel regression   |     23.93405 |     22.75967
lowess              |      0.61351 |      0.61524
naive average       |      0.02485 |      0.02326
others*             |      0.00150 |      0.00150
fft                 |      0.00021 |      0.00021
numpy convolve      |      0.00017 |      0.00015
*savgol with different fit functions and some numpy methods
Kernel regression scales badly, lowess is a bit faster, but both produce smooth curves. Savgol is a middle ground on speed and can produce both jumpy and smooth outputs, depending on the grade of the polynomial. FFT is extremely fast, but only works on periodic data.
Moving-average methods with numpy are faster, but obviously produce a graph with steps in it.
Setup
I generated 1000 data points in the shape of a sin curve:
import numpy as np

size = 1000
x = np.linspace(0, 4 * np.pi, size)
y = np.sin(x) + np.random.random(size) * 0.2
data = {"x": x, "y": y}
I pass these into a function to measure the runtime and plot the resulting fit:
from time import time

import matplotlib.pyplot as plt

def test_func(f, label):  # f: function handle to one of the smoothing methods
    start = time()
    for i in range(5):
        arr = f(data["y"], 20)
    print(f"{label:26s} - time: {time() - start:8.5f} s")
    plt.plot(data["x"], arr, "-", label=label)
I tested many different smoothing functions. arr is the array of y values to be smoothed and span the smoothing parameter. The lower span is, the closer the fit approaches the original data; the higher it is, the smoother the resulting curve will be.
import statsmodels.api as sm
from scipy import fftpack
from scipy.signal import savgol_filter
from statsmodels.nonparametric.kernel_regression import KernelReg

def smooth_data_convolve_my_average(arr, span):
    re = np.convolve(arr, np.ones(span * 2 + 1) / (span * 2 + 1), mode="same")

    # The "my_average" part: shrinks the averaging window on the side that
    # reaches beyond the data, keeps the other side the same size as given
    # by "span"
    re[0] = np.average(arr[:span])
    for i in range(1, span + 1):
        re[i] = np.average(arr[:i + span])
        re[-i] = np.average(arr[-i - span:])
    return re

def smooth_data_np_average(arr, span):  # my original, naive approach
    return [np.average(arr[val - span:val + span + 1]) for val in range(len(arr))]

def smooth_data_np_convolve(arr, span):
    return np.convolve(arr, np.ones(span * 2 + 1) / (span * 2 + 1), mode="same")

def smooth_data_np_cumsum_my_average(arr, span):
    cumsum_vec = np.cumsum(arr)
    moving_average = (cumsum_vec[2 * span:] - cumsum_vec[:-2 * span]) / (2 * span)

    # The "my_average" part again. Slightly different to before, because the
    # moving average from cumsum is shorter than the input and needs to be padded
    front, back = [np.average(arr[:span])], []
    for i in range(1, span):
        front.append(np.average(arr[:i + span]))
        back.insert(0, np.average(arr[-i - span:]))
    back.insert(0, np.average(arr[-2 * span:]))
    return np.concatenate((front, moving_average, back))

def smooth_data_lowess(arr, span):
    x = np.linspace(0, 1, len(arr))
    return sm.nonparametric.lowess(arr, x, frac=(5 * span / len(arr)), return_sorted=False)

def smooth_data_kernel_regression(arr, span):
    # "span" smoothing parameter is ignored. If you know how to
    # incorporate that with kernel regression, please comment below.
    kr = KernelReg(arr, np.linspace(0, 1, len(arr)), 'c')
    return kr.fit()[0]

def smooth_data_savgol_0(arr, span):
    return savgol_filter(arr, span * 2 + 1, 0)

def smooth_data_savgol_1(arr, span):
    return savgol_filter(arr, span * 2 + 1, 1)

def smooth_data_savgol_2(arr, span):
    return savgol_filter(arr, span * 2 + 1, 2)

def smooth_data_fft(arr, span):  # the scaling of "span" is open to suggestions
    w = fftpack.rfft(arr)
    spectrum = w ** 2
    cutoff_idx = spectrum < (spectrum.max() * (1 - np.exp(-span / 2000)))
    w[cutoff_idx] = 0
    return fftpack.irfft(w)
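A minimal driver along these lines produces the timings and the plots; the exact call order is a sketch, not the measured code itself:
plt.plot(data["x"], data["y"], ".", alpha=0.3, label="raw data")
for f, label in [
    (smooth_data_convolve_my_average, "numpy convolve + me"),
    (smooth_data_np_average, "numpy average"),
    (smooth_data_np_convolve, "numpy convolve"),
    (smooth_data_np_cumsum_my_average, "numpy cumsum + me"),
    (smooth_data_lowess, "lowess"),
    (smooth_data_kernel_regression, "kernel regression"),
    (smooth_data_savgol_0, "savgol 0"),
    (smooth_data_savgol_1, "savgol 1"),
    (smooth_data_savgol_2, "savgol 2"),
    (smooth_data_fft, "fft"),
]:
    test_func(f, label)
plt.legend()
plt.show()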
Results
Speed
Runtimes over 1000 elements, tested on a python list as well as a numpy array holding the values.
method              | python list | numpy array
--------------------|-------------|-------------
kernel regression   |  23.93405 s |  22.75967 s
lowess              |   0.61351 s |   0.61524 s
numpy average       |   0.02485 s |   0.02326 s
savgol 2            |   0.00186 s |   0.00196 s
savgol 1            |   0.00157 s |   0.00161 s
savgol 0            |   0.00155 s |   0.00151 s
numpy convolve + me |   0.00121 s |   0.00115 s
numpy cumsum + me   |   0.00114 s |   0.00105 s
fft                 |   0.00021 s |   0.00021 s
numpy convolve      |   0.00017 s |   0.00015 s
Especially kernel regression is very slow to compute over 1k elements, and lowess also fails when the dataset becomes much larger. numpy convolve and fft are especially fast. I did not investigate the runtime behavior (O(n)) with increasing or decreasing sample size.
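If you want to check the scaling yourself, a quick sketch like this (my assumption of how one might do it, not part of the benchmark above) times one method over growing inputs:
# Hypothetical scaling check: time one method over growing inputs
# to eyeball its O(n) behavior.
for n in [1_000, 10_000, 100_000, 1_000_000]:
    arr = np.sin(np.linspace(0, 4 * np.pi, n)) + np.random.random(n) * 0.2
    start = time()
    smooth_data_np_convolve(arr, 20)
    print(f"n={n:>9d}: {time() - start:8.5f} s")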
Behavior at the edges
I will split this part in two, to keep the images comprehensible.
Numpy based methods + savgol 0:
These methods calculate an average of the data, so the graph is not smoothed. They all (with the exception of numpy.cumsum) result in the same graph when the window used to calculate the average does not touch the edge of the data. The discrepancy of numpy.cumsum is most likely due to an 'off by one' error in the window size, as the toy check below illustrates.
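A toy check of the two window sizes (an illustration, not part of the measurements):
# The convolve kernel averages span * 2 + 1 points, while the cumsum
# difference averages only 2 * span points - hence the "off by one".
span = 2
arr = np.arange(10, dtype=float)
conv = np.convolve(arr, np.ones(span * 2 + 1) / (span * 2 + 1), mode="same")
cs = np.cumsum(arr)
cum = (cs[2 * span:] - cs[:-2 * span]) / (2 * span)
print(conv[span:-span])  # [2. 3. 4. 5. 6. 7.]          - centered windows of 5 values
print(cum)               # [2.5 3.5 4.5 5.5 6.5 7.5]    - windows of 4 values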
The edge behavior differs when the methods have to work with less data:
savgol 0: continues with a constant to the edge of the data (savgol 1 and savgol 2 end with a line and parabola respectively)
numpy average: stops when the window reaches the left side of the data and fills those places in the array with NaN; same behavior as the my_average method on the right side
numpy convolve: follows the data pretty accurately. I suspect the window size is reduced symmetrically when one side of the window reaches the edge of the data
my_average/me: my own method, which I implemented because I was not satisfied with the other ones. It simply shrinks the part of the window that reaches beyond the data down to the edge of the data, but keeps the window on the other side at the original size given by span (see the comparison below)
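A quick way to see this side by side, reusing the functions from above:
# Compare edge handling of plain convolve vs. the my_average variant.
span = 20
plain = smooth_data_np_convolve(data["y"], span)
mine = smooth_data_convolve_my_average(data["y"], span)
print("left edge, plain convolve:", plain[:3])  # pulled toward 0 by the implicit zero padding
print("left edge, my_average:    ", mine[:3])   # stays close to the actual data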
Complex methods:
These methods all end with a fit that follows the data nicely. savgol 1 ends with a line, savgol 2 with a parabola.
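This matches scipy's default mode="interp" for savgol_filter, which fits a polynomial of the chosen order to the edge windows; other padding modes change the endings. A short sketch:
# savgol_filter's default mode="interp" fits a polynomial of the given
# order to the edge windows, hence the line / parabola endings.
edge_interp = savgol_filter(data["y"], 41, 2)                   # default: mode="interp"
edge_mirror = savgol_filter(data["y"], 41, 2, mode="mirror")    # reflect data at the edges
edge_nearest = savgol_filter(data["y"], 41, 2, mode="nearest")  # repeat the edge value
print(edge_interp[:3], edge_mirror[:3], edge_nearest[:3])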
Behavior of the curves
To show the behavior of the different methods in the middle of the data:
The different savgol and average filters produce a rough line, while lowess, fft and kernel regression produce a smooth fit. lowess appears to cut corners when the data changes.
Motivation
I have a Raspberry Pi logging data for fun, and the visualization proved to be a small challenge. All data points, except RAM usage and ethernet traffic, are only recorded in discrete steps and/or are inherently noisy. For example, the temperature sensor only outputs whole degrees, but differs by up to two degrees between consecutive measurements. No useful information can be gained from such a scatter plot. To visualize the data I therefore needed a method that is not too computationally expensive and produces a moving average. I also wanted nice behavior at the edges of the data, as this especially impacts the latest info when looking at live data. I settled on the numpy convolve method with my_average to improve the edge behavior.