更新:到目前为止表现最好的算法是这个。
这个问题探讨了在实时时间序列数据中检测突然峰值的稳健算法。
考虑以下示例数据:
这个数据的例子是Matlab格式的(但这个问题不是关于语言,而是关于算法):
p = [1 1 1.1 1 0.9 1 1 1.1 1 0.9 1 1.1 1 1 0.9 1 1 1.1 1 1 1 1 1.1 0.9 1 1.1 1 1 0.9, ...
1 1.1 1 1 1.1 1 0.8 0.9 1 1.2 0.9 1 1 1.1 1.2 1 1.5 1 3 2 5 3 2 1 1 1 0.9 1 1, ...
3 2.6 4 3 3.2 2 1 1 0.8 4 4 2 2.5 1 1 1];
你可以清楚地看到有三个大峰和一些小峰。这个数据集是问题所涉及的时间序列数据集类的一个特定示例。这类数据集有两个一般特征:
有一种具有一般平均值的基本噪声
有很大的“峰值”或“更高的数据点”明显偏离噪声。
让我们假设以下情况:
峰的宽度不能事先确定
峰的高度明显偏离其他值
算法实时更新(因此每个新数据点都会更新)
对于这种情况,需要构造一个触发信号的边值。但是,边界值不能是静态的,必须通过算法实时确定。
我的问题是:什么是实时计算这些阈值的好算法?有没有针对这种情况的特定算法?最著名的算法是什么?
健壮的算法或有用的见解都受到高度赞赏。(可以用任何语言回答:这是关于算法的)
用现代c++实现的面向对象版z-score算法
template<typename T>
class FindPeaks{
private:
std::vector<T> m_input_signal; // stores input vector
std::vector<T> m_array_peak_positive;
std::vector<T> m_array_peak_negative;
public:
FindPeaks(const std::vector<T>& t_input_signal): m_input_signal{t_input_signal}{ }
void estimate(){
int lag{5};
T threshold{ 5 }; // set a threshold
T influence{ 0.5 }; // value between 0 to 1, 1 is normal influence and 0.5 is half the influence
std::vector<T> filtered_signal(m_input_signal.size(), 0.0); // placeholdered for smooth signal, initialie with all zeros
std::vector<int> signal(m_input_signal.size(), 0); // vector that stores where the negative and positive located
std::vector<T> avg_filtered(m_input_signal.size(), 0.0); // moving averages
std::vector<T> std_filtered(m_input_signal.size(), 0.0); // moving standard deviation
avg_filtered[lag] = findMean(m_input_signal.begin(), m_input_signal.begin() + lag); // pass the iteartor to vector
std_filtered[lag] = findStandardDeviation(m_input_signal.begin(), m_input_signal.begin() + lag);
for (size_t iLag = lag + 1; iLag < m_input_signal.size(); ++iLag) { // start index frm
if (std::abs(m_input_signal[iLag] - avg_filtered[iLag - 1]) > threshold * std_filtered[iLag - 1]) { // check if value is above threhold
if ((m_input_signal[iLag]) > avg_filtered[iLag - 1]) {
signal[iLag] = 1; // assign positive signal
}
else {
signal[iLag] = -1; // assign negative signal
}
filtered_signal[iLag] = influence * m_input_signal[iLag] + (1 - influence) * filtered_signal[iLag - 1]; // exponential smoothing
}
else {
signal[iLag] = 0; // no signal
filtered_signal[iLag] = m_input_signal[iLag];
}
avg_filtered[iLag] = findMean(filtered_signal.begin() + (iLag - lag), filtered_signal.begin() + iLag);
std_filtered[iLag] = findStandardDeviation(filtered_signal.begin() + (iLag - lag), filtered_signal.begin() + iLag);
}
for (size_t iSignal = 0; iSignal < m_input_signal.size(); ++iSignal) {
if (signal[iSignal] == 1) {
m_array_peak_positive.emplace_back(m_input_signal[iSignal]); // store the positive peaks
}
else if (signal[iSignal] == -1) {
m_array_peak_negative.emplace_back(m_input_signal[iSignal]); // store the negative peaks
}
}
printVoltagePeaks(signal, m_input_signal);
}
std::pair< std::vector<T>, std::vector<T> > get_peaks()
{
return std::make_pair(m_array_peak_negative, m_array_peak_negative);
}
};
template<typename T1, typename T2 >
void printVoltagePeaks(std::vector<T1>& m_signal, std::vector<T2>& m_input_signal) {
std::ofstream output_file("./voltage_peak.csv");
std::ostream_iterator<T2> output_iterator_voltage(output_file, ",");
std::ostream_iterator<T1> output_iterator_signal(output_file, ",");
std::copy(m_input_signal.begin(), m_input_signal.end(), output_iterator_voltage);
output_file << "\n";
std::copy(m_signal.begin(), m_signal.end(), output_iterator_signal);
}
template<typename iterator_type>
typename std::iterator_traits<iterator_type>::value_type findMean(iterator_type it, iterator_type end)
{
/* function that receives iterator to*/
typename std::iterator_traits<iterator_type>::value_type sum{ 0.0 };
int counter = 0;
while (it != end) {
sum += *(it++);
counter++;
}
return sum / counter;
}
template<typename iterator_type>
typename std::iterator_traits<iterator_type>::value_type findStandardDeviation(iterator_type it, iterator_type end)
{
auto mean = findMean(it, end);
typename std::iterator_traits<iterator_type>::value_type sum_squared_error{ 0.0 };
int counter{ 0 };
while (it != end) {
sum_squared_error += std::pow((*(it++) - mean), 2);
counter++;
}
auto standard_deviation = std::sqrt(sum_squared_error / (counter - 1));
return standard_deviation;
}
我允许自己创建一个javascript版本。也许会有帮助。javascript应该是上面给出的伪代码的直接转录。可用的npm包和github repo:
https://github.com/crux/smoothed-z-score
@joe_six / smoothed-z-score-peak-signal-detection
Javascript的翻译:
// javascript port of: https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/48895639#48895639
function sum(a) {
return a.reduce((acc, val) => acc + val)
}
function mean(a) {
return sum(a) / a.length
}
function stddev(arr) {
const arr_mean = mean(arr)
const r = function(acc, val) {
return acc + ((val - arr_mean) * (val - arr_mean))
}
return Math.sqrt(arr.reduce(r, 0.0) / arr.length)
}
function smoothed_z_score(y, params) {
var p = params || {}
// init cooefficients
const lag = p.lag || 5
const threshold = p.threshold || 3.5
const influence = p.influece || 0.5
if (y === undefined || y.length < lag + 2) {
throw ` ## y data array to short(${y.length}) for given lag of ${lag}`
}
//console.log(`lag, threshold, influence: ${lag}, ${threshold}, ${influence}`)
// init variables
var signals = Array(y.length).fill(0)
var filteredY = y.slice(0)
const lead_in = y.slice(0, lag)
//console.log("1: " + lead_in.toString())
var avgFilter = []
avgFilter[lag - 1] = mean(lead_in)
var stdFilter = []
stdFilter[lag - 1] = stddev(lead_in)
//console.log("2: " + stdFilter.toString())
for (var i = lag; i < y.length; i++) {
//console.log(`${y[i]}, ${avgFilter[i-1]}, ${threshold}, ${stdFilter[i-1]}`)
if (Math.abs(y[i] - avgFilter[i - 1]) > (threshold * stdFilter[i - 1])) {
if (y[i] > avgFilter[i - 1]) {
signals[i] = +1 // positive signal
} else {
signals[i] = -1 // negative signal
}
// make influence lower
filteredY[i] = influence * y[i] + (1 - influence) * filteredY[i - 1]
} else {
signals[i] = 0 // no signal
filteredY[i] = y[i]
}
// adjust the filters
const y_lag = filteredY.slice(i - lag, i)
avgFilter[i] = mean(y_lag)
stdFilter[i] = stddev(y_lag)
}
return signals
}
module.exports = smoothed_z_score
我在我的机器人项目中需要这样的东西。我想我可以归还Kotlin实现。
/**
* Smoothed zero-score alogrithm shamelessly copied from https://stackoverflow.com/a/22640362/6029703
* Uses a rolling mean and a rolling deviation (separate) to identify peaks in a vector
*
* @param y - The input vector to analyze
* @param lag - The lag of the moving window (i.e. how big the window is)
* @param threshold - The z-score at which the algorithm signals (i.e. how many standard deviations away from the moving mean a peak (or signal) is)
* @param influence - The influence (between 0 and 1) of new signals on the mean and standard deviation (how much a peak (or signal) should affect other values near it)
* @return - The calculated averages (avgFilter) and deviations (stdFilter), and the signals (signals)
*/
fun smoothedZScore(y: List<Double>, lag: Int, threshold: Double, influence: Double): Triple<List<Int>, List<Double>, List<Double>> {
val stats = SummaryStatistics()
// the results (peaks, 1 or -1) of our algorithm
val signals = MutableList<Int>(y.size, { 0 })
// filter out the signals (peaks) from our original list (using influence arg)
val filteredY = ArrayList<Double>(y)
// the current average of the rolling window
val avgFilter = MutableList<Double>(y.size, { 0.0 })
// the current standard deviation of the rolling window
val stdFilter = MutableList<Double>(y.size, { 0.0 })
// init avgFilter and stdFilter
y.take(lag).forEach { s -> stats.addValue(s) }
avgFilter[lag - 1] = stats.mean
stdFilter[lag - 1] = Math.sqrt(stats.populationVariance) // getStandardDeviation() uses sample variance (not what we want)
stats.clear()
//loop input starting at end of rolling window
(lag..y.size - 1).forEach { i ->
//if the distance between the current value and average is enough standard deviations (threshold) away
if (Math.abs(y[i] - avgFilter[i - 1]) > threshold * stdFilter[i - 1]) {
//this is a signal (i.e. peak), determine if it is a positive or negative signal
signals[i] = if (y[i] > avgFilter[i - 1]) 1 else -1
//filter this signal out using influence
filteredY[i] = (influence * y[i]) + ((1 - influence) * filteredY[i - 1])
} else {
//ensure this signal remains a zero
signals[i] = 0
//ensure this value is not filtered
filteredY[i] = y[i]
}
//update rolling average and deviation
(i - lag..i - 1).forEach { stats.addValue(filteredY[it]) }
avgFilter[i] = stats.getMean()
stdFilter[i] = Math.sqrt(stats.getPopulationVariance()) //getStandardDeviation() uses sample variance (not what we want)
stats.clear()
}
return Triple(signals, avgFilter, stdFilter)
}
带有验证图的示例项目可以在github上找到。
c++ (Qt)演示端口,交互式参数
我已经将这个算法的演示应用程序移植到c++ (Qt)上。
代码可以在GitHub上找到这里。带有安装程序的Windows(64位)构建在发布页面上。最后,我将添加一些文档和其他发布版本。
您不能绘制点,但可以从文本文件中导入它们(用空格分隔点——换行也算作空格)。您还可以调整算法参数,实时查看效果。这对于针对特定数据集调整算法以及探索参数如何影响结果非常有用。
上面的截图有些过时;从那以后,我添加了两个原始算法中没有的实验性选项:
反向处理数据集的选项(似乎至少改善了功率谱的结果)。
选项,为峰值设置硬性最小阈值。
我还在窗口中间添加了一个笨拙的缩放/平移条,只需用鼠标拖动它来缩放和平移。
模糊的构建指令:
在发布页面上有一个Windows安装程序(64位),但如果你想从源代码构建它,要点是:
安装Qt的构建工具,然后将qmake && make放在与.pro文件相同的目录下,或者
安装Qt Creator,打开.pro文件,选择任何默认的构建配置,然后按下构建和/或运行按钮(Creator的左下角)。
我只测试过Qt5。我有91%的信心,如果你手动配置组件,Qt Creator安装程序会让你安装Qt5(如果你手动配置组件,你还需要确认是否安装了Qt Charts)。Qt6可能是一个流畅的构建,也可能不是。有一天,我将测试Qt4和Qt6,使这些文档更好。也许吧。
在Palshikar(2009)中发现了另一个算法:
Palshikar, G.(2009)。时间序列中峰值检测的简单算法。在Proc. 1st Int。高级数据分析,商业分析和智能(卷122)。
论文可以从这里下载。
算法是这样的:
algorithm peak1 // one peak detection algorithms that uses peak function S1
input T = x1, x2, …, xN, N // input time-series of N points
input k // window size around the peak
input h // typically 1 <= h <= 3
output O // set of peaks detected in T
begin
O = empty set // initially empty
for (i = 1; i < n; i++) do
// compute peak function value for each of the N points in T
a[i] = S1(k,i,xi,T);
end for
Compute the mean m' and standard deviation s' of all positive values in array a;
for (i = 1; i < n; i++) do // remove local peaks which are “small” in global context
if (a[i] > 0 && (a[i] – m') >( h * s')) then O = O + {xi};
end if
end for
Order peaks in O in terms of increasing index in T
// retain only one peak out of any set of peaks within distance k of each other
for every adjacent pair of peaks xi and xj in O do
if |j – i| <= k then remove the smaller value of {xi, xj} from O
end if
end for
end
优势
本文提出了5种不同的峰值检测算法
算法在原始时间序列数据上工作(不需要平滑)
缺点
很难事先确定k和h
峰不能是平的(就像我测试数据中的第三个峰)
例子: