A Very Useful Metric for Change Point Detection Models

Abdur Rehman Khan
4 min readNov 4, 2023

--

There are numerous metrics used to evaluate the performance of classification models, of which the F family of measures has been widely used and is also used widely in evaluating ML models. In practice another benefit of single evaluation metrics is that it helps us to quickly develop our models, as we have a unified comparison metric across different iterations of model development. For example the F1 score below:

However when working with change point detection we have a window of points on a continuous time line that can be considered as equally correct. Taking this into consideration and also the view that change point detection is a classification problem between the “change point” and “non-change point” classes as mentioned in [1] and [2] - a new metric is given in [3] that can be used, in which the calculation for precision and recall is altered. Following from the intuition above one can see that this is important because:

  • This allows us to create a window in which a change point is valid, rather than a single point as we are working with continuous time intervals
  • Also many models can detect multiple change points in one window — so we need to be able to void this double counting

To get an intuitive understanding for the metric to be introduced later consider the graph below - the green lines are annotations , with an interval of black solid lines, where as the red dotted lines represent change points predicted by a detection model. The change points predicted inside the interval of black solid lines around the first change point annotation on the left hand side should be considered as one correct ‘prediction’ to avoid double counting. Also the change detected around the second change point annotation on the right hand side should not be considered as it does not fall inside the error boundary (interval M).

Following on from the graph above, from [3] we have that the metric is defined as follows: “let X denote the set of change point locations provided by a detection algorithm (predictions), let T *= U Tk be the combined set of all human annotations . For a set of ground truth annotations T we define the set of true positives TP(T , X ) from X to be those τ ∈ T for which ∃x ∈ X such that |τ − x| ≤ M, while ensuring to avoid the double counting mentioned previously, and M ≥ 0 is the allowed margin of error”. Here Tk stands for the set of annotations done by one annotator, where one annotation can be thought of as the green line in the diagram above and M in the diagram above would be the distance between the green dashed line and the solid black line. The precision (P) and recall (R) are then defined as follows from [3]:

Here |X| in ‘P’ can be considered a proxy for ‘true positives + false positives’ . Further the Tk in ‘R’ can be considered as a proxy for ‘true positives + false negatives’. Also from [3] we have: “With this definition of precision, we consider as false positives only those detections that correspond to no human annotation. Further, by defining the recall as the average of the recall for each individual annotator we encourage algorithms to explain all human annotations without favoring any annotator in particular. Note that precision and recall are undefined if the annotators or the algorithm declare that a series has no change points. To avoid this, we include t = 1 as a trivial change point in Tk and X . Since change points are interpreted as the start of a new segment this choice does not affect the meaning of the detection results.”

[1] Killick, R., Fearnhead, P., and Eckley, I. A. Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598, 2012.

[2] Aminikhanghahi, S. and Cook, D. J. A survey of methods for time series change point detection. Knowledge and Information Systems, 51(2):339–367, 2017.

[3] van den Burg, G.J.J. and Williams, C.K.I., 2022. An Evaluation of Change Point Detection Algorithms. _arXiv preprint arXiv:2003.06222_.

--

--

No responses yet