Logistic regression is closely related to linear regression, except that each point's $y$-value can only be $0$ or $1$. This makes it useful for predicting whether something does or does not belong to a particular class. Instead of fitting a line (as in linear regression), logistic regression fits a probability curve.
For example, using our device traffic, let's see if we can predict whether a DNS packet is a request or a response from its length.
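To make the "probability curve" concrete, here is a minimal sketch of the logistic (sigmoid) function that such a model fits. The coefficients `w` and `b` are hypothetical values chosen for illustration, not parameters learned from the traffic data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical slope and intercept, chosen only for illustration.
w, b = 0.1, -10.0

# Example x-values (think of them as packet lengths in bytes).
x = np.array([50.0, 100.0, 150.0])

# Estimated probability that y = 1 for each x.
p = sigmoid(w * x + b)
print(p)
```

With these assumed coefficients the curve crosses a probability of 0.5 at $x = 100$; fitting a logistic regression amounts to choosing `w` and `b` so that the curve best matches the observed 0/1 labels.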
First, let's import the data, extract only the DNS packets, and view the first few packets.
# Pandas, Numpy
import numpy as np
import pandas as pd
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)
# Machine Learning
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
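import sys
# Uncomment and adjust the path below if netml is installed from a local checkout.
#sys.path.insert(1,"/Users/feamster/research/netml/src/")
import netml
from netml.pparser.parser import PCAP
# Load the capture and convert it to a pandas DataFrame with one row per packet.
hpcap = PCAP('data/http.pcap', flow_ptks_thres=2, verbose=10)
hpcap.pcap2pandas()
pcap = hpcap.df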
'_pcap2pandas()' starts at 2022-11-02 17:17:32 '_pcap2pandas()' ends at 2022-11-02 17:17:39 and takes 0.1196 mins.
Each row in the printed data is a packet and each column is a feature of the packet.
Next, let's divide the DNS packets into requests and responses, and convert them into points where the $x$-value is the length of the packet and the $y$-value is $0$ for requests and $1$ for responses. This will allow us to fit the data to a logistic regression curve.
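As a rough sketch of where this is heading, the following fits a logistic regression to (length, label) points. The data here is synthetic and is not taken from the capture; the assumption that requests tend to be shorter than responses is made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT from the capture): requests are assumed
# to be shorter than responses, purely for illustration.
request_lengths = rng.normal(loc=75, scale=8, size=100)
response_lengths = rng.normal(loc=140, scale=25, size=100)

# x: packet length; y: 0 for requests, 1 for responses.
X = np.concatenate([request_lengths, response_lengths]).reshape(-1, 1)
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

model = LogisticRegression()
model.fit(X, y)

# Estimated probability that packets of the given lengths are responses.
probs = model.predict_proba(np.array([[60.0], [200.0]]))[:, 1]
print(probs)
```

`predict_proba` returns one column per class, so column 1 holds the estimated probability of the label $1$ (a response).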
Let's see how many data points we have.
Next, we will convert the DNS response column into a 0/1 value so that it is amenable to logistic regression.
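One way this conversion could look in pandas is sketched below. The column names `length`, `dns_resp`, and `is_response` are hypothetical, and the toy rows are made up; the DataFrame produced by netml may name or encode these fields differently.

```python
import pandas as pd

# Toy frame with hypothetical column names and made-up values.
dns = pd.DataFrame({
    "length": [74, 90, 74, 150, 85, 212],
    "dns_resp": [None, "192.0.2.1", None, "192.0.2.2", None, "192.0.2.3"],
})

# Label is 1 when a response value is present, 0 otherwise.
dns["is_response"] = dns["dns_resp"].notna().astype(int)

# Feature matrix (one column: length) and 0/1 label vector.
X = dns[["length"]].to_numpy()
y = dns["is_response"].to_numpy()

# How many requests (0) and responses (1) we have.
print(dns["is_response"].value_counts())
```

Selecting with double brackets (`dns[["length"]]`) keeps `X` two-dimensional, which is the shape scikit-learn estimators expect for their input features.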