Timeseries Anomaly Detection¶

A common problem in computer networking is detecting anomalies in timeseries data: metrics are captured over time, and network operators want to know when those metrics deviate from "normal" behavior.

In this activity, you will perform timeseries anomaly detection on network performance data captured by a project at the University of Chicago that continually measures network performance metrics, including throughput, latency, and DNS lookup times.

In this hands-on, you will combine various anomaly detection approaches with timeseries models, including:

  • Moving Average with Seasonal Decomposition

You might also try some of the following methods:

  • ARIMA
  • Prophet
  • change point detection

and apply them to detect anomalous performance in latency and bandwidth measurements from 85 Chicago households. The data for this hands-on activity is available here.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

Load and Explore Data¶

This pickle file contains measurement data of various types, collected from different measurement IDs to different destinations. Each row is also aggregated in a particular way, as indicated by the Method column.

In [2]:
df = pd.read_pickle('data/5m_data.pkl')
df.sort_values(by='Time_clean', inplace=True)
df.head(5)
Out[2]:
         Time                 RTT  Measurement   ID                         Destination  Method  Time_clean
2532502  2021-09-30 19:00   8.775  ping_latency  nm-mngd-20210623-1b49a7fa  google       Avg     2021-09-30 19:00:00
2432051  2021-09-30 19:00  10.977  ping_latency  nm-mngd-20210511-9d9824d4  google       Avg     2021-09-30 19:00:00
2504516  2021-09-30 19:00   9.244  ping_latency  nm-mngd-20210519-84594993  google       Avg     2021-09-30 19:00:00
2451551  2021-09-30 19:00   9.510  ping_latency  nm-mngd-20210518-075ab2f0  google       Avg     2021-09-30 19:00:00
2536760  2021-09-30 19:00   9.634  ping_latency  nm-mngd-20210623-d9aab537  google       Avg     2021-09-30 19:00:00

Compute an Hour-by-Hour Average RTT¶

Compute an hour-by-hour average RTT for each destination.

In [48]:
def split(df):
    """Split the frame into one DataFrame per destination."""
    gb = df.groupby('Destination')
    return [gb.get_group(x) for x in gb.groups]

df_split = split(df)
df_split_clean = []
for frame in df_split:
    # Resample each destination's measurements to hourly means, then fill
    # gaps: linear interpolation (forward, up to 300 missing hours),
    # followed by forward- and back-fill for any remaining missing values.
    hourly = (frame.set_index('Time_clean')
                   .resample('H')
                   .mean(numeric_only=True)
                   .interpolate(method='linear', limit=300, limit_direction='forward')
                   .ffill()
                   .bfill())
    df_split_clean.append(hourly)

df_split_clean
Out[48]:
[                           RTT
 Time_clean                    
 2021-09-30 19:00:00  34.275405
 2021-09-30 20:00:00  34.631601
 2021-09-30 21:00:00  34.432748
 2021-09-30 22:00:00  35.768035
 2021-09-30 23:00:00  36.075542
 ...                        ...
 2022-02-28 13:00:00  38.242081
 2022-02-28 14:00:00  39.605813
 2022-02-28 15:00:00  39.191060
 2022-02-28 16:00:00  43.936399
 2022-02-28 17:00:00  41.158659
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00  34.802956
 2021-09-30 20:00:00  35.417021
 2021-09-30 21:00:00  35.777643
 2021-09-30 22:00:00  36.562641
 2021-09-30 23:00:00  37.100275
 ...                        ...
 2022-02-28 13:00:00  38.889235
 2022-02-28 14:00:00  39.203723
 2022-02-28 15:00:00  39.830655
 2022-02-28 16:00:00  45.174415
 2022-02-28 17:00:00  41.694703
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00  26.132487
 2021-09-30 20:00:00  27.109839
 2021-09-30 21:00:00  26.890154
 2021-09-30 22:00:00  28.306056
 2021-09-30 23:00:00  28.859042
 ...                        ...
 2022-02-28 13:00:00  30.168679
 2022-02-28 14:00:00  31.661348
 2022-02-28 15:00:00  31.664851
 2022-02-28 16:00:00  39.208561
 2022-02-28 17:00:00  33.262585
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00  10.662082
 2021-09-30 20:00:00  11.861280
 2021-09-30 21:00:00  12.066776
 2021-09-30 22:00:00  13.008655
 2021-09-30 23:00:00  12.227444
 ...                        ...
 2022-02-28 13:00:00  12.590375
 2022-02-28 14:00:00  13.207759
 2022-02-28 15:00:00  14.037368
 2022-02-28 16:00:00  20.439953
 2022-02-28 17:00:00  15.960016
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00   9.071586
 2021-09-30 20:00:00  10.307427
 2021-09-30 21:00:00   7.937359
 2021-09-30 22:00:00   9.504431
 2021-09-30 23:00:00  11.833100
 ...                        ...
 2022-02-28 13:00:00  12.578887
 2022-02-28 14:00:00  14.300915
 2022-02-28 15:00:00  13.528169
 2022-02-28 16:00:00  20.140555
 2022-02-28 17:00:00  16.066531
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00   8.486978
 2021-09-30 20:00:00   8.487558
 2021-09-30 21:00:00   8.707344
 2021-09-30 22:00:00  11.370707
 2021-09-30 23:00:00   9.904454
 ...                        ...
 2022-02-28 13:00:00  12.727393
 2022-02-28 14:00:00  13.166132
 2022-02-28 15:00:00  14.130768
 2022-02-28 16:00:00  23.193074
 2022-02-28 17:00:00  16.203313
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00   8.867304
 2021-09-30 20:00:00   8.332049
 2021-09-30 21:00:00   9.048469
 2021-09-30 22:00:00  10.902218
 2021-09-30 23:00:00  11.669923
 ...                        ...
 2022-02-28 13:00:00  12.163240
 2022-02-28 14:00:00  13.047743
 2022-02-28 15:00:00  12.813666
 2022-02-28 16:00:00  16.323061
 2022-02-28 17:00:00  15.214958
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00  11.843677
 2021-09-30 20:00:00  12.655091
 2021-09-30 21:00:00  13.644007
 2021-09-30 22:00:00  12.787838
 2021-09-30 23:00:00  12.416958
 ...                        ...
 2022-02-28 13:00:00  12.564133
 2022-02-28 14:00:00  14.010658
 2022-02-28 15:00:00  13.559028
 2022-02-28 16:00:00  17.598709
 2022-02-28 17:00:00  16.040162
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00  18.355639
 2021-09-30 20:00:00  28.862259
 2021-09-30 21:00:00  25.952895
 2021-09-30 22:00:00  20.599838
 2021-09-30 23:00:00  14.535092
 ...                        ...
 2022-02-28 13:00:00  13.455604
 2022-02-28 14:00:00  14.706076
 2022-02-28 15:00:00  15.228845
 2022-02-28 16:00:00  21.131463
 2022-02-28 17:00:00  16.582371
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00  28.298308
 2021-09-30 20:00:00  29.271809
 2021-09-30 21:00:00  29.655916
 2021-09-30 22:00:00  30.600200
 2021-09-30 23:00:00  30.250808
 ...                        ...
 2022-02-28 13:00:00  33.059379
 2022-02-28 14:00:00  33.396985
 2022-02-28 15:00:00  34.415056
 2022-02-28 16:00:00  41.848794
 2022-02-28 17:00:00  36.197371
 
 [3623 rows x 1 columns],
                            RTT
 Time_clean                    
 2021-09-30 19:00:00   8.692370
 2021-09-30 20:00:00   8.741290
 2021-09-30 21:00:00   8.531924
 2021-09-30 22:00:00  11.173869
 2021-09-30 23:00:00   9.863508
 ...                        ...
 2022-02-28 13:00:00  12.747330
 2022-02-28 14:00:00  13.564087
 2022-02-28 15:00:00  13.848568
 2022-02-28 16:00:00  22.951587
 2022-02-28 17:00:00  15.585128
 
 [3623 rows x 1 columns]]

Decompose Data and Return the Residuals¶

Write a function that performs seasonal decomposition using moving averages, and that also calculates the moving average and the one- and two-standard-deviation bands of the residual.
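
If you need a starting point, here is a minimal sketch of such a function. It assumes one of the hourly per-destination series built above (an entry of df_split_clean), a daily (24-hour) seasonal period, and a 24-hour rolling window; the name decompose_and_bands and both window choices are illustrative assumptions, not part of the original assignment.

from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_and_bands(series, period=24, window=24):
    # Seasonal decomposition using moving averages (additive model);
    # the 24-hour period and window are assumptions for hourly data.
    decomposition = seasonal_decompose(series, model='additive', period=period)
    residual = decomposition.resid

    # Moving average and rolling standard deviation of the residual
    moving_avg = residual.rolling(window=window).mean()
    moving_std = residual.rolling(window=window).std()

    bands = pd.DataFrame({
        'residual': residual,
        'moving_avg': moving_avg,
        'upper_1sd': moving_avg + moving_std,
        'lower_1sd': moving_avg - moving_std,
        'upper_2sd': moving_avg + 2 * moving_std,
        'lower_2sd': moving_avg - 2 * moving_std,
    })
    return decomposition, bands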

Anomaly Detection¶

  1. Perform decomposition
  2. Compute moving averages
  3. Declare anomalies where the residual is more than 2 standard deviations away from the moving average (see the sketch below)
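
Below is a minimal sketch of these three steps, assuming the hourly per-destination frames in df_split_clean and the hypothetical decompose_and_bands helper sketched in the previous section.

anomalies = []
for frame in df_split_clean:
    # Steps 1 and 2: decompose the series and compute the moving average / std bands
    decomposition, bands = decompose_and_bands(frame['RTT'])

    # Step 3: flag hours where the residual leaves the 2-standard-deviation band
    outside_band = (bands['residual'] > bands['upper_2sd']) | \
                   (bands['residual'] < bands['lower_2sd'])
    anomalies.append(frame.loc[outside_band])

# Example: inspect the flagged hours for the first destination
anomalies[0].head()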

Count the Number of Anomalies per Date¶

Count and plot the number of anomalies on each date.
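
A minimal sketch, assuming the anomalies list produced by the sketch in the previous section: concatenate the per-destination anomaly frames, group the flagged hours by calendar date, and plot the daily counts.

# Combine the per-destination anomaly frames and count flagged hours per date
all_anomalies = pd.concat(anomalies)
counts_per_date = all_anomalies.groupby(all_anomalies.index.date).size()

fig, ax = plt.subplots(figsize=(12, 4))
counts_per_date.plot(kind='bar', ax=ax)
ax.set_xlabel('Date')
ax.set_ylabel('Number of anomalies')
ax.set_title('Anomalies per date')
plt.tight_layout()
plt.show()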