A significant problem that arises in computer networking is the detection of time-series anomalies. Specifically, data is captured over time, and network operators may wish to detect when metrics derived from the data deviate from "normal" behavior.
This activity involves performing time-series anomaly detection on network performance data captured by a project at the University of Chicago that continually measures network performance metrics, including throughput, latency, DNS lookup times, and so forth.
In this hands-on, you will combine various anomaly detection approaches with time-series models including:
You might also try some of the following methods:
and apply them to detect anomalous performance in latency and bandwidth measurements from 85 Chicago households. The data for this hands-on activity is available here.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
This pickle file contains measurement data of various types, from different IDs, to different destinations. The data is also aggregated in various ways, as indicated in the `Method` column.
df = pd.read_pickle('data/5m_data.pkl')
df.sort_values(by='Time_clean', inplace=True)
df.head(5)
| | Time | RTT | Measurement | ID | Destination | Method | Time_clean |
|---|---|---|---|---|---|---|---|
| 2532502 | 2021-09-30 19:00 | 8.775 | ping_latency | nm-mngd-20210623-1b49a7fa | | Avg | 2021-09-30 19:00:00 |
| 2432051 | 2021-09-30 19:00 | 10.977 | ping_latency | nm-mngd-20210511-9d9824d4 | | Avg | 2021-09-30 19:00:00 |
| 2504516 | 2021-09-30 19:00 | 9.244 | ping_latency | nm-mngd-20210519-84594993 | | Avg | 2021-09-30 19:00:00 |
| 2451551 | 2021-09-30 19:00 | 9.510 | ping_latency | nm-mngd-20210518-075ab2f0 | | Avg | 2021-09-30 19:00:00 |
| 2536760 | 2021-09-30 19:00 | 9.634 | ping_latency | nm-mngd-20210623-d9aab537 | | Avg | 2021-09-30 19:00:00 |
Compute an hour-by-hour average RTT for each destination.
def split(df):
    # Group measurements by destination and return one DataFrame per group
    gb = df.groupby('Destination')
    return [gb.get_group(x) for x in gb.groups]

df_split = split(df)

# Resample each destination's measurements to hourly means, then fill gaps:
# linear interpolation (forward only, up to 300 consecutive missing hours),
# followed by forward/backward fill for any remaining gaps at the edges.
# Note: pandas' interpolate takes `limit_direction`, not `direction`.
df_split_clean = []
for frame in df_split:
    clean = (frame.set_index('Time_clean')
                  .resample('H')
                  .mean(numeric_only=True)
                  .interpolate(method='linear', limit=300, limit_direction='forward')
                  .ffill()
                  .bfill())
    df_split_clean.append(clean)
df_split_clean
[                           RTT
 Time_clean
 2021-09-30 19:00:00  34.275405
 2021-09-30 20:00:00  34.631601
 2021-09-30 21:00:00  34.432748
 2021-09-30 22:00:00  35.768035
 2021-09-30 23:00:00  36.075542
 ...                        ...
 2022-02-28 13:00:00  38.242081
 2022-02-28 14:00:00  39.605813
 2022-02-28 15:00:00  39.191060
 2022-02-28 16:00:00  43.936399
 2022-02-28 17:00:00  41.158659

 [3623 rows x 1 columns],
 ...]

(output truncated: a list of 11 DataFrames, one per destination, each with 3623 hourly RTT rows)
The function below performs seasonal decomposition using moving averages. It also calculates the moving average along with one- and two-standard-deviation bands.
Count and plot the number of anomalies to date.
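One way to approach this step is to flag points that fall outside the two-standard-deviation band from the previous step and take a cumulative sum, which gives the running anomaly count "to date". The sketch below makes these assumptions explicit; the function name, window, and threshold are illustrative, not the activity's prescribed solution.

```python
import pandas as pd

def count_anomalies(series, window=24, threshold=2.0):
    """Flag points more than `threshold` rolling standard deviations from
    the rolling mean, and return the cumulative anomaly count over time."""
    ma = series.rolling(window).mean()
    std = series.rolling(window).std()
    anomalies = (series - ma).abs() > threshold * std
    return anomalies.cumsum()  # anomalies observed up to each timestamp

# Example usage (assuming df_split_clean from the cells above):
#   import matplotlib.pyplot as plt
#   cum = count_anomalies(df_split_clean[0]['RTT'])
#   cum.plot(title='Cumulative anomaly count to date')
#   plt.ylabel('anomalies')
#   plt.show()
```

Plotting the cumulative count for each destination in `df_split_clean` makes it easy to compare when and how often different households experienced anomalous RTTs.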