University of Nevada, Reno, United States of America
As they grow rapidly in scale and complexity, today's enterprise storage systems must cope with a large number of errors. Existing proactive methods mainly focus on machine learning techniques trained on SMART measurements. Such methods, however, are usually expensive to use in practice and can be applied only to a limited set of error types at a limited scale. We collected more than 23 million storage events over two years from 87 deployed NetApp-ONTAP systems managing 14,371 disks, and we propose SEFEE, a lightweight, training-free storage error forecasting method. SEFEE employs tensor decomposition to analyze storage error-event logs directly and to perform online error prediction for all error types on all storage nodes. SEFEE exploits hidden spatiotemporal information that is deeply embedded across the global scale of storage systems to achieve record-breaking error forecasting accuracy with minimal prediction overhead.