MASS: Mueen's Algorithm for Similarity Search

The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance and Correlation Coefficient

Similarity search for time series subsequences is THE most important subroutine for time series pattern mining. Subsequence similarity search has been scaled to trillions obsetvations under both DTW (Dynamic Time Warping) and Euclidean distances [a]. The algorithms are ultra fast and efficient. The key technique that makes the algorithms useful is the Early Abandoning technique [b,e] known since 1994. However, the algorithms lack few properties that are useful for many time series data mining algorithms.


1. Early abandoning depends on the dataset. The worst case complexity is still O(nm) where n is the length of the larger time series and m is the length of the short query.

2. The algorithm can produce the most similar subsequence to the query and cannot produce the Distance Profile to all the subssequences given the query.


MASS is an algorithm to create Distance Profile of a query to a long time series. In this page we share a code for The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance. Early abandoning can occasionally beat this algorithm on some datasets for some queries. This algorithm is independent of data and query. The underlying concept of the algorithm is known for a long time to the signal processing community. We have used it for the first time on time series subsequence search under z-normalization. The algorithm was used as a subroutine in our papers [c,d] and the code are given below. 


1. The algorithm has an overall time complexity of O(n log n) which does not depend on datasets and is the lower bound of similarity search over time series subsequences.

2. The algorithm produces all of the distances from the query to the subsequences of a long time series. In our recent paper, we generalize the usage of the distance profiles calculated using MASS in finding motifs, shapelets and discords. Check out the paper here. 


The MASS paper is available here..



To cite  this code:


Abdullah Mueen, Sheng Zhong, Yan Zhu, Michael Yeh, Kaveh Kamgar, Krishnamurthy Viswanathan, Chetan Kumar Gupta and Eamonn Keogh (2022), The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance, URL: http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html


or use the following bib snippet.


@misc{FastestSimilaritySearch,

title={The Fastest Similarity Search Algorithm for Time Series Subsequences under Euclidean Distance},

author={ Mueen, Abdullah and Zhong, Sheng and Zhu, Yan and Yeh, Michael and Kamgar, Kaveh and Viswanathan, Krishnamurthy and Gupta, Chetan and Keogh, Eamonn},

year={2022},

month={August},

note = {\url{http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html}}

}




Code:


The first version of MASS was developed in 2011 and published in this page in 2015. It was named findNN initially and later renamed as MASS.


The second version of MASS_V2 was developed in 2017 which is more than 2X faster than the first version. Several people contributed to this code including Yan Zhu, Michael Yeh, Kaveh Kamgar and Eamonn Keogh from UCR.


The third version of MASS_V3 is a piecewise version of MASS that performs better when the size of the pieces are well aligned with the hardware.


The fourth version of MASS_V4 is free of numerical errors due to aliasing in frequency domain. Sheng Zhong has exploited Discrete Cosine Transform (DCT) that does not use any imaginary number while maintaining O(n log n) complexity in the worst case. The implementation is a bit slower than MASS_V3, however, there is no numerical error due to leakage.



MASS_weighted is an extension to create weighted distance profiles.


MASS_absolute is a simplification of the original MASS algoeithm to search for absolute matches without normalizing the query and matches.


A python version written by Adnan Ibn Khair is available here. Another version written by Tyler Marrs is available here. 


A C++ version is available here.


If you have MATLAB Parallel Computing Toolbox and a GPU installed, MASS can be much faster for you. Consider the following few lines. 


>> tic, MASS_V3(gpuArray(randn(100000000,1)),gpuArray(rand(400,1)),2^23); toc;

Elapsed time is 1.600559 seconds.

>> tic, MASS_V3(randn(100000000,1),rand(400,1),2^23); toc;

Elapsed time is 8.881304 seconds.


If you find this code useful for your research, please drop me a line, I would love to know the impact of this code. 


Sample Case Studies using MASS:


These slides are created by Prof. Eamonn Keogh (many thanks). 


Simple Case Studies 


Datasets for the case studies. 


Penguin


Robot


Carpet




Tutorial:


MASS has been featured in several tutorials presented in top data mining venues. A self-contained version is available here.



[a] Thanawin Rakthanmanon, Bilson J. L. Campana, Abdullah Mueen, Gustavo E. A. P. A. Batista, M. Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn J. Keogh: Searching and mining trillions of time series subsequences under dynamic time warping. KDD 2012: 262-270

[b] Christos Faloutsos, M. Ranganathan, Yannis Manolopoulos: Fast Subsequence Matching in Time-Series Databases. SIGMOD Conference 1994: 419-429

[c] Abdullah Mueen, Hossein Hamooni, Trilce Estrada: Time Series Join on Subsequence Correlation. ICDM 2014: 450-459

[d] Jesin Zakaria, Abdullah Mueen, Eamonn J. Keogh: Clustering Time Series Using Unsupervised-Shapelets. ICDM 2012: 785-794

[e] Eamonn J. Keogh, Shruti Kasetty: On the need for time series data mining benchmarks: a survey and empirical demonstration. KDD 2002: 102-111



What people say about MASS: A pdf of quotes from published papers.


... we show that MASS is 20 times faster than brute force search on the UMAP embedding space. ---

G. Franch, G. Jurman, L. Coviello, M. Pendesini, and C. Furlanello, “MASS-UMAP: Fast and Accurate Analog Ensemble Search in Weather Radar Archives,” Remote Sensing, vol. 11, no. 24, 2019.


[MASS] is known to greatly speed up the calculation of distances between sequences. ---

P. Avogadro. and M. Dominoni., “Topological Approach for Finding Nearest Neighbor Sequence in Time Series,” in Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR,. SciTePress, 2019, pp. 233–244.


For computing the similarity, we use Mueen’s ultra-fast Algorithm for Similarity Search (MASS), because MASS is algorithmically efficient and free of hyperparameters. ---

S. Wilhelm and J. Kasbauer, “Exploiting Smart Meter Power Consumption Measurements for Human Activity Recognition (HAR) with a Motif-Detection-Based Non-Intrusive Load Monitoring (NILM) Approach,” Sensors, vol. 21, no. 23, 2021.


The strong point of MASS is the much faster indexing, ... ---

D. Piatov, S. Helmer, A. Dign ̈os, and J. Gamper, “Interactive and space-efficient multi-dimensional time series subsequence matching,” Information Systems, vol. 82, pp. 121–135, 2019.


The adoption of MASS essentially accelerates the extraction process compared to using pruning techniques based on brute force approach. ---

J. Zuo, K. Zeitouni, and Y. Taher, “Incremental and Adaptive Feature Exploration over Time Series Stream,” in 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 593–602.


This approach has the advantages of being easy to implement with the ready-to-use implementations of MASS. ---

D. Janka, F. Lenders, S. Wang, A. Cohen, and N. Li, “Detecting and locating patterns in time series using machine learning,” Control Engineering Practice, vol. 93, p. 104-169, 2019.


... the exact algorithm such as MASS by Mueen et al. can be used to efficiently retrieve the exact correlations. ---

H. Qiu, H. T. Lam, F. Fusco, and M. Sinn, “Learning Correlation Space for Time Series,” CoRR, vol. abs/1802.03628, 2018.


The distance profile for a single subsequence can be calculated very fast using the MASS algorithm which only needs O(n\log(n)) time, ... ---

Leonard Clauß, Humboldt University of Berlin.


In this paper we utilized the MASS algorithm to speed up the original SetFinder algorithm. ---

G. Yoshimura, A. Kanemura, and H. Asoh, “Enumerating hub motifs in time series based on the matrix profile.” 5th Workshop on Mining and Learning from Time Series (MiLeTS’19), 2019


The Mueen's ultra-fast Algorithm for Similarity Search (MASS) strategy is adopted to speed up the ED calculation to ensure the efficiency of the proposed scheme. ---

J. Hu and N. Chen, “Enhanced Feature Summarizing for Effective Cover Song Identification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2113–2126, 2019.


... we calculate the distances between this pattern and all subsequences in the series using MASS. ---

D. De Paepe and S. Van Hoecke, “Mining Recurring Patterns in Real-Valued Time Series using the Radius Profile,” in 2020 IEEE International Conference on Data Mining (ICDM), 2020, pp. 984–989.


After constructing a motif dictionary, we further apply the MASS algorithm to each sensory time series. ---

S. Lin, X. Wu, and N. V. Chawla, “motif2vec: Semantic-aware Representation Learning for Wearables’ Time Series Data,” in 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), 2021, pp. 1–10.


... conventional Mass-ts and Euclidean distance calculation methods showed similar high accuracy; ---

H.-J. Lee, Y. Kwon, and S.-Y. Ihm, “Dual-ISM: Duality-Based Image Sequence Matching for Similar Image Search,” Applied Sciences, vol. 12, no. 3, 2022.


The standard deviation for MASS and UDTW are zero, since they are deterministic retrieval algorithms. ---

V. Gupta, S. Bedathur, and A. De, “Learning Temporal Point Processes for Efficient Retrieval of Continuous Time Event Sequences,” 2022.


Like its matrix profile counterparts for pair-wise segment similarity computation, MASS has fast time performance. ---

J. Liu, J. Li, and L. Liu, “FastOPM—A practical method for partial match of time series,” Pattern Recognition, vol. 130, p. 108808, 2022.


... the number of neighbors closer to the candidate subsequences than the defined maximum distance is calculated with the MASS algorithm ... ---

L. Gr ̈unerbel, F. Heinrich, D. Diebolder, and M. Richter, “Wearable Decubitus Prophylaxis Tool Based on Machine Learning Methods,” in 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), 2022, pp. 730–734


... MASS method is utilized to construct the full distance matrix of the subsequence with length ... ---

X. Zheng, X. Zou, X. Ren, C. Ji, and M. Zhang, “ECG Prediction Based on Bidirectional Time Series Chain Discovery Algorithm,” in 5th International Conference on Crowd Science and Engineering, ser. ICCSE 2021.


We also use MASS directly for subsequence-similarity measurement, in addition to a similar feature-based method that we develop. ---

N. Singh, “Sifting sound - interactive extraction, exploration, and expressive recombination of large and heterogeneous audio collections,” Master’s thesis, Massachusetts Institute of Technology, Boston, MA, 2020.




Research and code development funded by NSF IIS - 1161997, NSF SHF 1527127 and an internship at HP Labs.