Training Autoencoders to Identify Normal and Abnormal Traffic Behavior in Netflow Data


Goal

Train an autoencoder neural network to develop lower-dimensional representations of netflow data in an attempt to identify categories of behavior present within the data and effectively differentiate between abnormal and normal behavior.

When I say a lower-dimensional representation (a.k.a. an embedding), what I mean is basically a fingerprint: a way to take all the features of raw netflow data and distill them down into the key components that identify and classify the data.

Intro to Unsupervised Learning and Autoencoders

An autoencoder is a type of unsupervised learning algorithm.

What do I mean by an unsupervised learning algorithm? An unsupervised learning algorithm takes in unlabeled or unclassified data and attempts to develop its own labels/classifications. An easy way to think of this is showing an algorithm a bunch of images of cartoon characters without telling the algorithm which character is which. During training, the algorithm attempts to place the characters into its own groups based on the similarities inherent in each image. Below, our algorithm has clearly discovered two beloved types of cartoon characters: Duck and not-Duck!

Specifically, an autoencoder is a type of neural network that attempts to reproduce the input that is passed into its input layer. Initially this may sound unimpressive, and indeed, if we allow our encoding to be the same size as the input, our algorithm could easily learn to simply memorize the input values!

However, by imposing a bottleneck in the network we can force a compressed knowledge representation of the original input (i.e. we can boil the input data down to its essential pieces, its fingerprint). If the input features were each independent of one another, this compression and subsequent reconstruction would be a very difficult task. However, if some sort of structure exists in the data (i.e. correlations between input features), this structure can be learned and consequently leveraged when forcing the input through the network's bottleneck.

Download the data

The dataset I'll be using for this project is the LANL 2017 netflow dataset, with my initial analysis focused on day-03. There are about 6.1 million entries, or 'events', on day-03, each with 11 fields of data, described below. First, download the compressed file for day-03:

%wget -c https://csr.lanl.gov/data/unified-host-network-dataset-2017/netflow/netflow_day-03.bz2
Field Name: Description
Time: The start time of the event in epoch time format.
Duration: The duration of the event in seconds.
SrcDevice: The device that likely initiated the event.
DstDevice: The receiving device.
Protocol: The protocol number.
SrcPort: The port used by the SrcDevice.
DstPort: The port used by the DstDevice.
SrcPackets: The number of packets the SrcDevice sent during the event.
DstPackets: The number of packets the DstDevice sent during the event.
SrcBytes: The number of bytes the SrcDevice sent during the event.
DstBytes: The number of bytes the DstDevice sent during the event.

Load the data

Each day of the LANL netflow data ends in an incomplete line, so attempting a simple bunzip2 will fail. To work around this issue we can run bzip2recover on the file, use bzip2 to unzip the individual recovered blocks and stitch them together into a single csv file, and finally use a quick bash command to remove the corrupted final line.

%bzip2recover netflow_day-03.bz2
%bzip2 -dc rec00* > netflow_day-03.csv
%head -n -1 netflow_day-03.csv > temp.txt ; mv temp.txt netflow_day-03.csv
%mkdir ../input
%mv netflow_day-03.csv ../input/

Data Exploration

Initial look

%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
DIR = "../input"

colnames = ['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort', 'DstPort', 'SrcPackets', 
       'DstPackets', 'SrcBytes', 'DstBytes']
cat_vars = ['SrcDevice', 'DstDevice','Protocol','SrcPort', 'DstPort']
cont_vars = ['Time', 'Duration', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes']
train_df = pd.read_csv(os.path.join(DIR, 'netflow_day-03.csv'),names = colnames)

We've got a little over 6.1 million events occurring during day-03 in our dataset.

train_df.shape
(6102087, 11)
train_df.head()
Time Duration SrcDevice DstDevice Protocol SrcPort DstPort SrcPackets DstPackets SrcBytes DstBytes
0 172800 0 Comp348305 Comp370444 6 Port02726 80 0 5 0 784
1 172800 0 Comp817584 Comp275646 17 Port97545 53 1 0 77 0
2 172800 0 Comp654013 Comp685925 6 Port26890 Port94857 6 5 1379 1770
3 172800 0 Comp500631 Comp275646 17 Port62938 53 1 0 64 0
4 172800 0 Comp500631 Comp275646 17 Port52912 53 1 0 64 0

Let's get an idea of the number of unique values for a few of the features. SrcDevice and DstDevice each have roughly 19K unique values. SrcPort has roughly 65K unique values while DstPort has roughly 40K unique values.

(Technical Aside: This high cardinality means it's difficult to take advantage of a one-hot-encoding scheme for these 4 variables, although we will be able to break apart SrcDevice and DstDevice for some more information, as I will show below. Question: Are Comp/IP/Port numbers near one another related in their uses? For now I'll treat them as categorical. Protocol only has 3 different values, so we can easily one-hot-encode that variable (1=ICMP Internet Control Message, 6=TCP Transmission Control, 17=UDP User Datagram).)
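A quick way to confirm these cardinalities (a rough sketch using the train_df loaded above):

# Number of unique values per categorical field; high-cardinality columns are
# poor candidates for one-hot encoding.
for col in cat_vars:
    print(col, train_df[col].nunique())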

SrcDevice and DstDevice

There are 6 general categories of Devices: Comp, IP, ActiveDirectory, EnterpriseAppServer, VPN and VScanner.

train_df['SrcDevice'].astype('category').cat.categories
Index(['ActiveDirectory', 'Comp000116', 'Comp000156', 'Comp000219',
       'Comp000566', 'Comp000573', 'Comp000577', 'Comp000688', 'Comp000696',
       'Comp000775',
       ...
       'IP998111', 'IP998491', 'IP998759', 'IP998823', 'IP999095', 'IP999633',
       'IP999839', 'IP999955', 'VPN', 'VScanner'],
      dtype='object', length=18950)
train_df['DstDevice'].astype('category').cat.categories
Index(['ActiveDirectory', 'Comp000190', 'Comp000244', 'Comp000332',
       'Comp000346', 'Comp000386', 'Comp000575', 'Comp000595', 'Comp000688',
       'Comp000944',
       ...
       'IP999711', 'IP999776', 'IP999780', 'IP999852', 'IP999875', 'IP999896',
       'IP999935', 'IP999949', 'VPN', 'VScanner'],
      dtype='object', length=18921)

There are a few other Devices beyond Comps and IPs.

noComp_orIP = train_df['SrcDevice'].loc[(train_df['SrcDevice'].str.contains('Comp')==False) &
                          (train_df['SrcDevice'].str.contains('IP')==False)].astype('category').cat.categories
print(noComp_orIP)
Index(['ActiveDirectory', 'EnterpriseAppServer', 'VPN', 'VScanner'], dtype='object')

The majority (98%) of events appear to have a SrcDevice that is a Comp. The next closest Device is the EnterpriseAppServer at 1.37% of events. It's also interesting to note that 8 Comps are the SrcDevice of nearly half the data.

print('Events\' SrcDevice Breakdown')
for i in ['Comp','IP','ActiveDirectory', 'EnterpriseAppServer', 'VPN', 'VScanner']:
    tmp = train_df['SrcDevice'].loc[(train_df['SrcDevice'].str.contains(f'{i}')==True)].count()/train_df['Time'].count()
    print(i,tmp)
SrcDevice Breakdown
Comp 0.9766106251844656
IP 0.0033899877205946097
ActiveDirectory 0.0014785105489318654
EnterpriseAppServer 0.013733498063859135
VPN 0.0011150283501366008
VScanner 0.0036723501320122115
counted = train_df1.groupby("SrcDevice_Comp").count().rename(columns={"Time":"SrcDevice_count"})
counted.sort_values(by=['SrcDevice_count'],ascending=False).head(10)
SrcDevice_count Duration SrcPort DstPort SrcPackets DstPackets SrcBytes DstBytes DstDevice_Comp SrcDevice_IP ... DstDevice_ActiveDirectory SrcDevice_EnterpriseAppServer DstDevice_EnterpriseAppServer SrcDevice_VPN DstDevice_VPN SrcDevice_VScanner DstDevice_VScanner Protocol_1 Protocol_6 Protocol_17
SrcDevice_Comp
44772 412110 412110 412110 412110 412110 412110 412110 412110 412110 412110 ... 412110 412110 412110 412110 412110 412110 412110 412110 412110 412110
296454 396393 396393 396393 396393 396393 396393 396393 396393 396393 396393 ... 396393 396393 396393 396393 396393 396393 396393 396393 396393 396393
623258 376489 376489 376489 376489 376489 376489 376489 376489 376489 376489 ... 376489 376489 376489 376489 376489 376489 376489 376489 376489 376489
73202 355912 355912 355912 355912 355912 355912 355912 355912 355912 355912 ... 355912 355912 355912 355912 355912 355912 355912 355912 355912 355912
30334 343163 343163 343163 343163 343163 343163 343163 343163 343163 343163 ... 343163 343163 343163 343163 343163 343163 343163 343163 343163 343163
866402 342651 342651 342651 342651 342651 342651 342651 342651 342651 342651 ... 342651 342651 342651 342651 342651 342651 342651 342651 342651 342651
257274 342549 342549 342549 342549 342549 342549 342549 342549 342549 342549 ... 342549 342549 342549 342549 342549 342549 342549 342549 342549 342549
965575 337894 337894 337894 337894 337894 337894 337894 337894 337894 337894 ... 337894 337894 337894 337894 337894 337894 337894 337894 337894 337894
0 142724 142724 142724 142724 142724 142724 142724 142724 142724 142724 ... 142724 142724 142724 142724 142724 142724 142724 142724 142724 142724
107130 97940 97940 97940 97940 97940 97940 97940 97940 97940 97940 ... 97940 97940 97940 97940 97940 97940 97940 97940 97940 97940

10 rows × 22 columns

While the majority of events have a Comp for their DstDevice (86.61%), ActiveDirectory and EnterpriseAppServer have a larger share than for the SrcDevice (7.14% and 5.85% respectively).

print('Events\' DstDevice Breakdown')
for i in ['Comp','IP','ActiveDirectory', 'EnterpriseAppServer', 'VPN', 'VScanner']:
    tmp = train_df['DstDevice'].loc[(train_df['DstDevice'].str.contains(f'{i}')==True)].count()/train_df['Time'].count()
    print(i,tmp)
DstDevice Breakdown
Comp 0.8661133477775719
IP 0.0035854946020926937
ActiveDirectory 0.07141753305057762
EnterpriseAppServer 0.05854111880082995
VPN 0.00013454413219608308
VScanner 0.00020796163673182633
counted = train_df1.groupby("DstDevice_Comp").count().rename(columns={"Time":"DstDevice_count"})
counted.sort_values(by=['DstDevice_count'],ascending=False).head(10)
DstDevice_count Duration SrcPort DstPort SrcPackets DstPackets SrcBytes DstBytes SrcDevice_Comp SrcDevice_IP ... DstDevice_ActiveDirectory SrcDevice_EnterpriseAppServer DstDevice_EnterpriseAppServer SrcDevice_VPN DstDevice_VPN SrcDevice_VScanner DstDevice_VScanner Protocol_1 Protocol_6 Protocol_17
DstDevice_Comp
0 816988 816988 816988 816988 816988 816988 816988 816988 816988 816988 ... 816988 816988 816988 816988 816988 816988 816988 816988 816988 816988
275646 624576 624576 624576 624576 624576 624576 624576 624576 624576 624576 ... 624576 624576 624576 624576 624576 624576 624576 624576 624576 624576
576031 218410 218410 218410 218410 218410 218410 218410 218410 218410 218410 ... 218410 218410 218410 218410 218410 218410 218410 218410 218410 218410
576843 193316 193316 193316 193316 193316 193316 193316 193316 193316 193316 ... 193316 193316 193316 193316 193316 193316 193316 193316 193316 193316
186884 145802 145802 145802 145802 145802 145802 145802 145802 145802 145802 ... 145802 145802 145802 145802 145802 145802 145802 145802 145802 145802
916004 66911 66911 66911 66911 66911 66911 66911 66911 66911 66911 ... 66911 66911 66911 66911 66911 66911 66911 66911 66911 66911
63824 63126 63126 63126 63126 63126 63126 63126 63126 63126 63126 ... 63126 63126 63126 63126 63126 63126 63126 63126 63126 63126
393033 44319 44319 44319 44319 44319 44319 44319 44319 44319 44319 ... 44319 44319 44319 44319 44319 44319 44319 44319 44319 44319
501516 43614 43614 43614 43614 43614 43614 43614 43614 43614 43614 ... 43614 43614 43614 43614 43614 43614 43614 43614 43614 43614
995183 37139 37139 37139 37139 37139 37139 37139 37139 37139 37139 ... 37139 37139 37139 37139 37139 37139 37139 37139 37139 37139

10 rows × 22 columns

There doesn't seem to be anything particularly interesting in the distribution of Comp numbers for SrcDevice or DstDevice. Again we see that a select few Comps carry the majority of the traffic.

num_bins = 100
plt.hist(train_df1['SrcDevice_Comp'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()

png

num_bins = 100
plt.hist(train_df1['DstDevice_Comp'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()

png

for i in np.logspace(0,4,5):
    if i != 1e4:
        counted.loc[(counted["SrcDevice_count"] > i) & (counted["SrcDevice_count"] < i*10),'SrcDevice_count'] = i
    else:
        counted.loc[counted["SrcDevice_count"] > i,'SrcDevice_count'] = i
plt.figure()
sns.countplot(data=counted, x="SrcDevice_count")
plt.show()

png

Looking at the distribution of SrcDevice, the majority (98.401%) of SrcDevices are used anywhere from 1 to 1,000 times but only account for about a quarter of the data! Another quarter of the data comes from the 1.557% of SrcDevices used between 1,000 and 100,000 times, and nearly half of the data originates from just 0.042% (8!!) of SrcDevices.

counted = train_df.groupby("DstDevice").count().rename(columns={"Time":"DstDevice_count"})
for i in np.logspace(0,4,5):
    if i != 1e4:
        counted.loc[(counted["DstDevice_count"] > i) & (counted["DstDevice_count"] < i*10),'DstDevice_count'] = i
    else:
        counted.loc[counted["DstDevice_count"] > i,'DstDevice_count'] = i
plt.figure()
sns.countplot(data=counted, x="DstDevice_count")
plt.show()

png

for i in np.logspace(1,3,3):
    print(i, len(counted['Duration'].loc[counted['Duration']>i].values))
10.0 10997
100.0 6131
1000.0 303
train_df.head()
Time Duration SrcDevice DstDevice Protocol SrcPort DstPort SrcPackets DstPackets SrcBytes DstBytes
0 172800 0 4420 2456 1 1926 224 0 5 0 784
1 172800 0 10428 1828 2 63323 155 1 0 77 0
2 172800 0 8268 4593 1 17434 37537 6 5 1379 1770
3 172800 0 6348 1828 2 40895 155 1 0 64 0
4 172800 0 6348 1828 2 34386 155 1 0 64 0

Attempt at showing clusters of devices through a flow graph (Work in progress!!)

import igraph as ig
train_df0 = pd.read_csv(os.path.join(DIR, 'netflow_day-03.csv'),names = colnames)
flow = train_df0.groupby(['SrcDevice','DstDevice']).size().reset_index().rename(columns={0:'count'})
computer_edgelist = dict(zip(map(tuple,flow[['SrcDevice','DstDevice']].values.tolist()),flow['count'].values.tolist()))
src_set = set([x[0][0] for x in computer_edgelist.items()])
dst_set = set([x[0][1] for x in computer_edgelist.items()])

print("Sources:     ", len(src_set))
print("Destinations: ", len(dst_set))
print("Union:       ", len(src_set.union(dst_set)))
print("Intersection: ", len(src_set.intersection(dst_set)))
Sources:      18950
Destinations:  18921
Union:        34019
Intersection:  3852
nv = len(src_set.union(dst_set))
vlookup = dict(zip(range(nv), list(src_set.union(dst_set))))
ilookup = dict(zip(list(src_set.union(dst_set)), range(nv)))
elist = [[ilookup[k[0]], ilookup[k[1]]] for k in computer_edgelist.keys()]

flow_graph = ig.Graph(nv)
flow_graph.add_edges(elist)

flow_graph.is_connected()
False


Protocol

Protocol 6 (TCP) accounts for 77.69% of events, Protocol 17 (UDP) accounts for 22.28%, while Protocol 1 (ICMP) only accounts for 0.03%.

train_df['Protocol'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f6a2f7b7c50>

png

train_df['Protocol'].loc[train_df['Protocol']==17].count()/train_df['Time'].count()
0.22279705287715498
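For reference, the share of all three protocols can be computed in one shot (a small sketch):

# Fraction of events per protocol value (6=TCP, 17=UDP, 1=ICMP)
train_df['Protocol'].value_counts(normalize=True)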

SrcPort and DstPort

Distribution across SrcPort seems nearly uniform.

num_bins = 1000
plt.hist(train_df1['SrcPort'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()

png

DstPort is another story: the top 10 DstPorts account for over half of the events.

num_bins = 1000
plt.hist(train_df1['DstPort'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()

png

counted = train_df1.groupby("DstPort").count().rename(columns={"Time":"DstPort_count"})
counted.sort_values(by=['DstPort_count'],ascending=False).head(10)
DstPort_count Duration SrcPort SrcPackets DstPackets SrcBytes DstBytes SrcDevice_Comp DstDevice_Comp SrcDevice_IP ... DstDevice_ActiveDirectory SrcDevice_EnterpriseAppServer DstDevice_EnterpriseAppServer SrcDevice_VPN DstDevice_VPN SrcDevice_VScanner DstDevice_VScanner Protocol_1 Protocol_6 Protocol_17
DstPort
53 750295 750295 750295 750295 750295 750295 750295 750295 750295 750295 ... 750295 750295 750295 750295 750295 750295 750295 750295 750295 750295
443 700175 700175 700175 700175 700175 700175 700175 700175 700175 700175 ... 700175 700175 700175 700175 700175 700175 700175 700175 700175 700175
80 607732 607732 607732 607732 607732 607732 607732 607732 607732 607732 ... 607732 607732 607732 607732 607732 607732 607732 607732 607732 607732
514 264663 264663 264663 264663 264663 264663 264663 264663 264663 264663 ... 264663 264663 264663 264663 264663 264663 264663 264663 264663 264663
389 213198 213198 213198 213198 213198 213198 213198 213198 213198 213198 ... 213198 213198 213198 213198 213198 213198 213198 213198 213198 213198
88 162582 162582 162582 162582 162582 162582 162582 162582 162582 162582 ... 162582 162582 162582 162582 162582 162582 162582 162582 162582 162582
23118 137832 137832 137832 137832 137832 137832 137832 137832 137832 137832 ... 137832 137832 137832 137832 137832 137832 137832 137832 137832 137832
427 113420 113420 113420 113420 113420 113420 113420 113420 113420 113420 ... 113420 113420 113420 113420 113420 113420 113420 113420 113420 113420
95765 99594 99594 99594 99594 99594 99594 99594 99594 99594 99594 ... 99594 99594 99594 99594 99594 99594 99594 99594 99594 99594
445 99020 99020 99020 99020 99020 99020 99020 99020 99020 99020 ... 99020 99020 99020 99020 99020 99020 99020 99020 99020 99020

10 rows × 22 columns

sum(counted.sort_values(by=['DstPort_count'],ascending=False)['DstPort_count'].values[:10])
3148511

SrcPackets and DstPackets

91.47% of events send between 0 and 10 packets from the SrcDevice. The highest number of packets sent is 87701075!

counted = train_df1.groupby("SrcPackets").count().rename(columns={"Time":"SrcPackets_count"})
counted.sort_values(by=['SrcPackets_count'],ascending=False).head(10)
SrcPackets_count Duration SrcPort DstPort DstPackets SrcBytes DstBytes SrcDevice_Comp DstDevice_Comp SrcDevice_IP ... DstDevice_ActiveDirectory SrcDevice_EnterpriseAppServer DstDevice_EnterpriseAppServer SrcDevice_VPN DstDevice_VPN SrcDevice_VScanner DstDevice_VScanner Protocol_1 Protocol_6 Protocol_17
SrcPackets
0 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 ... 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332
1 1181225 1181225 1181225 1181225 1181225 1181225 1181225 1181225 1181225 1181225 ... 1181225 1181225 1181225 1181225 1181225 1181225 1181225 1181225 1181225 1181225
5 600726 600726 600726 600726 600726 600726 600726 600726 600726 600726 ... 600726 600726 600726 600726 600726 600726 600726 600726 600726 600726
4 537638 537638 537638 537638 537638 537638 537638 537638 537638 537638 ... 537638 537638 537638 537638 537638 537638 537638 537638 537638 537638
3 245588 245588 245588 245588 245588 245588 245588 245588 245588 245588 ... 245588 245588 245588 245588 245588 245588 245588 245588 245588 245588
2 233467 233467 233467 233467 233467 233467 233467 233467 233467 233467 ... 233467 233467 233467 233467 233467 233467 233467 233467 233467 233467
6 184014 184014 184014 184014 184014 184014 184014 184014 184014 184014 ... 184014 184014 184014 184014 184014 184014 184014 184014 184014 184014
10 183998 183998 183998 183998 183998 183998 183998 183998 183998 183998 ... 183998 183998 183998 183998 183998 183998 183998 183998 183998 183998
11 171049 171049 171049 171049 171049 171049 171049 171049 171049 171049 ... 171049 171049 171049 171049 171049 171049 171049 171049 171049 171049
7 104048 104048 104048 104048 104048 104048 104048 104048 104048 104048 ... 104048 104048 104048 104048 104048 104048 104048 104048 104048 104048

10 rows × 22 columns

sum(counted.sort_values(by=['SrcPackets_count'],ascending=False)['SrcPackets_count'].values[:12])/sum(counted.sort_values(by=['SrcPackets_count'],ascending=False)['SrcPackets_count'].values)
0.9146632947055655
np.sort(train_df1['SrcPackets'].values)[-10:]
array([72759509, 74938675, 77148736, 79109660, 81076770, 82796324,
       84609078, 84760288, 86720597, 87701075])

93.40% of events send between 0 and 10 packets from the DstDevice. The highest number of packets sent is 77760718.

counted = train_df1.groupby("DstPackets").count().rename(columns={"Time":"DstPackets_count"})
counted.sort_values(by=['DstPackets_count'],ascending=False).head(10)
DstPackets_count Duration SrcPort DstPort SrcPackets SrcBytes DstBytes SrcDevice_Comp DstDevice_Comp SrcDevice_IP ... DstDevice_ActiveDirectory SrcDevice_EnterpriseAppServer DstDevice_EnterpriseAppServer SrcDevice_VPN DstDevice_VPN SrcDevice_VScanner DstDevice_VScanner Protocol_1 Protocol_6 Protocol_17
DstPackets
0 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 ... 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503
1 1746836 1746836 1746836 1746836 1746836 1746836 1746836 1746836 1746836 1746836 ... 1746836 1746836 1746836 1746836 1746836 1746836 1746836 1746836 1746836 1746836
3 596994 596994 596994 596994 596994 596994 596994 596994 596994 596994 ... 596994 596994 596994 596994 596994 596994 596994 596994 596994 596994
5 283068 283068 283068 283068 283068 283068 283068 283068 283068 283068 ... 283068 283068 283068 283068 283068 283068 283068 283068 283068 283068
2 218062 218062 218062 218062 218062 218062 218062 218062 218062 218062 ... 218062 218062 218062 218062 218062 218062 218062 218062 218062 218062
4 196227 196227 196227 196227 196227 196227 196227 196227 196227 196227 ... 196227 196227 196227 196227 196227 196227 196227 196227 196227 196227
7 189765 189765 189765 189765 189765 189765 189765 189765 189765 189765 ... 189765 189765 189765 189765 189765 189765 189765 189765 189765 189765
6 120250 120250 120250 120250 120250 120250 120250 120250 120250 120250 ... 120250 120250 120250 120250 120250 120250 120250 120250 120250 120250
9 68150 68150 68150 68150 68150 68150 68150 68150 68150 68150 ... 68150 68150 68150 68150 68150 68150 68150 68150 68150 68150
11 63692 63692 63692 63692 63692 63692 63692 63692 63692 63692 ... 63692 63692 63692 63692 63692 63692 63692 63692 63692 63692

10 rows × 22 columns

sum(counted.sort_values(by=['DstPackets_count'],ascending=False)['DstPackets_count'].values[:12])/sum(counted.sort_values(by=['DstPackets_count'],ascending=False)['DstPackets_count'].values)
0.9340995957612535
np.sort(train_df1['DstPackets'].values)[-10:]
array([64687095, 66650395, 68478386, 68610780, 70356457, 71882842,
       73626605, 75365457, 77107098, 77760718])

SrcBytes and DstBytes

Initial Look

counted = train_df1.groupby("SrcBytes").count().rename(columns={"Time":"SrcBytes_count"})
counted.sort_values(by=['SrcBytes_count'],ascending=False).head(10)
SrcBytes_count Duration SrcPort DstPort SrcPackets DstPackets DstBytes SrcDevice_Comp DstDevice_Comp SrcDevice_IP ... DstDevice_ActiveDirectory SrcDevice_EnterpriseAppServer DstDevice_EnterpriseAppServer SrcDevice_VPN DstDevice_VPN SrcDevice_VScanner DstDevice_VScanner Protocol_1 Protocol_6 Protocol_17
SrcBytes
0 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 ... 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332 1951332
277 268365 268365 268365 268365 268365 268365 268365 268365 268365 268365 ... 268365 268365 268365 268365 268365 268365 268365 268365 268365 268365
216 266771 266771 266771 266771 266771 266771 266771 266771 266771 266771 ... 266771 266771 266771 266771 266771 266771 266771 266771 266771 266771
60 247876 247876 247876 247876 247876 247876 247876 247876 247876 247876 ... 247876 247876 247876 247876 247876 247876 247876 247876 247876 247876
225 112601 112601 112601 112601 112601 112601 112601 112601 112601 112601 ... 112601 112601 112601 112601 112601 112601 112601 112601 112601 112601
164 99908 99908 99908 99908 99908 99908 99908 99908 99908 99908 ... 99908 99908 99908 99908 99908 99908 99908 99908 99908 99908
399 77662 77662 77662 77662 77662 77662 77662 77662 77662 77662 ... 77662 77662 77662 77662 77662 77662 77662 77662 77662 77662
152 74380 74380 74380 74380 74380 74380 74380 74380 74380 74380 ... 74380 74380 74380 74380 74380 74380 74380 74380 74380 74380
46 71063 71063 71063 71063 71063 71063 71063 71063 71063 71063 ... 71063 71063 71063 71063 71063 71063 71063 71063 71063 71063
64 62962 62962 62962 62962 62962 62962 62962 62962 62962 62962 ... 62962 62962 62962 62962 62962 62962 62962 62962 62962 62962

10 rows × 22 columns

np.sort(train_df['SrcBytes'].values)[-10:]
array([12122909246, 12459931268, 12832178150, 13209994822, 13545567320,
       13882204590, 14176496098, 14513284176, 14848224144, 15015712600])
counted = train_df1.groupby("DstBytes").count().rename(columns={"Time":"DstBytes_count"})
counted.sort_values(by=['DstBytes_count'],ascending=False).head(10)
DstBytes_count Duration SrcPort DstPort SrcPackets DstPackets SrcBytes SrcDevice_Comp DstDevice_Comp SrcDevice_IP ... DstDevice_ActiveDirectory SrcDevice_EnterpriseAppServer DstDevice_EnterpriseAppServer SrcDevice_VPN DstDevice_VPN SrcDevice_VScanner DstDevice_VScanner Protocol_1 Protocol_6 Protocol_17
DstBytes
0 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 ... 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503 2098503
46 1500887 1500887 1500887 1500887 1500887 1500887 1500887 1500887 1500887 1500887 ... 1500887 1500887 1500887 1500887 1500887 1500887 1500887 1500887 1500887 1500887
164 224058 224058 224058 224058 224058 224058 224058 224058 224058 224058 ... 224058 224058 224058 224058 224058 224058 224058 224058 224058 224058
3911 71773 71773 71773 71773 71773 71773 71773 71773 71773 71773 ... 71773 71773 71773 71773 71773 71773 71773 71773 71773 71773
479 57954 57954 57954 57954 57954 57954 57954 57954 57954 57954 ... 57954 57954 57954 57954 57954 57954 57954 57954 57954 57954
112 55094 55094 55094 55094 55094 55094 55094 55094 55094 55094 ... 55094 55094 55094 55094 55094 55094 55094 55094 55094 55094
106 53733 53733 53733 53733 53733 53733 53733 53733 53733 53733 ... 53733 53733 53733 53733 53733 53733 53733 53733 53733 53733
177 39591 39591 39591 39591 39591 39591 39591 39591 39591 39591 ... 39591 39591 39591 39591 39591 39591 39591 39591 39591 39591
1503 38017 38017 38017 38017 38017 38017 38017 38017 38017 38017 ... 38017 38017 38017 38017 38017 38017 38017 38017 38017 38017
853 32385 32385 32385 32385 32385 32385 32385 32385 32385 32385 ... 32385 32385 32385 32385 32385 32385 32385 32385 32385 32385

10 rows × 22 columns

np.sort(train_df['DstBytes'].values)[-10:]
array([11146902232, 11457059168, 11805619364, 12153021664, 12461839686,
       12732086028, 13040875936, 13348384456, 13657486438, 13773586368])

Question: As someone new to the field, I'm uncertain why the number of bytes for events jumps from 0 to 42.

unique_df = np.unique(train_df['DstBytes'].values)
np.sort(unique_df)[:10]
array([ 0, 42, 44, 46, 48, 49, 52, 55, 56, 58])
unique_df = np.unique(train_df['SrcBytes'].values)
np.sort(unique_df)[:10]
array([ 0, 42, 44, 46, 47, 48, 49, 50, 51, 52])

Distributions of Packets and Bytes seem to follow a power law, which appears to be normal (i.e. expected) behavior.

num_bins = 20
plt.figure(figsize=(10,4))
plt.xscale('log')
plt.yscale('log')
plt.hist(np.log(train_df['SrcPackets'].values+0.5), num_bins, facecolor='blue', alpha=0.5,log=True, label = 'SrcPackets')
plt.hist(np.log(train_df['DstPackets'].values+0.5), num_bins, facecolor='red', alpha=0.5,log=True, label = 'DstPackets')
plt.hist(np.log(train_df['SrcBytes'].values+0.5), num_bins, facecolor='green', alpha=0.5,log=True, label = 'SrcBytes', zorder=-1)
plt.hist(np.log(train_df['DstBytes'].values+0.5), num_bins, facecolor='black', alpha=0.5,log=True, label = 'DstBytes', zorder=-1)
plt.xlabel("Size") 
plt.ylabel("Count")
plt.legend( loc='upper right')
plt.xlim(1e0,1e2)
plt.show()

png

A 2D heatmap of Packets vs Bytes shows a clean linear correlation, with all flows exhibiting an average packet size (i.e. bytes/packets) between 32 and 1024 bytes.

from matplotlib.colors import LogNorm
tdf = train_df.copy()
tdf['SrcPackets'] = np.log(tdf['SrcPackets'].values+0.5)/np.log(2)
tdf['SrcBytes'] = np.log(tdf['SrcBytes'].values+0.5)/np.log(2)
sns.jointplot(x="SrcPackets", y="SrcBytes", data=tdf, kind="hex", norm=LogNorm())
<seaborn.axisgrid.JointGrid at 0x7f4f5d15d278>

png

tdf['DstPackets'] = np.log(tdf['DstPackets'].values+0.5)/np.log(2)
tdf['DstBytes'] = np.log(tdf['DstBytes'].values+0.5)/np.log(2)
sns.jointplot(x="DstPackets", y="DstBytes", data=tdf, kind="hex", norm=LogNorm())
<seaborn.axisgrid.JointGrid at 0x7f4f5f4c51d0>

png
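To sanity-check the claimed 32-1024 byte range, here's a minimal sketch that computes bytes-per-packet directly for flows that actually sent source packets:

# Bytes per packet on the source side; if the heatmap reading above is right,
# almost all of the mass should fall between roughly 32 and 1024 bytes.
mask = train_df['SrcPackets'] > 0
avg_pkt_size = train_df.loc[mask, 'SrcBytes'] / train_df.loc[mask, 'SrcPackets']
print(avg_pkt_size.quantile([0.001, 0.5, 0.999]))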

Time and Duration

80% of the events last a second or less. The longest event took 978035 seconds, or roughly 271 hours.

counted = train_df.groupby("Duration").count().rename(columns={"Time":"Duration_count"})
counted.sort_values(by=['Duration_count'],ascending=False).head(10)
Duration_count SrcDevice DstDevice Protocol SrcPort DstPort SrcPackets DstPackets SrcBytes DstBytes
Duration
0 2754172 2754172 2754172 2754172 2754172 2754172 2754172 2754172 2754172 2754172
1 2133800 2133800 2133800 2133800 2133800 2133800 2133800 2133800 2133800 2133800
2 80962 80962 80962 80962 80962 80962 80962 80962 80962 80962
9 68782 68782 68782 68782 68782 68782 68782 68782 68782 68782
5 59152 59152 59152 59152 59152 59152 59152 59152 59152 59152
3 36107 36107 36107 36107 36107 36107 36107 36107 36107 36107
12 34511 34511 34511 34511 34511 34511 34511 34511 34511 34511
6 28342 28342 28342 28342 28342 28342 28342 28342 28342 28342
4 28297 28297 28297 28297 28297 28297 28297 28297 28297 28297
15 25519 25519 25519 25519 25519 25519 25519 25519 25519 25519
sum(counted.sort_values(by=['Duration_count'],ascending=False)['Duration_count'].values[:2])/sum(counted.sort_values(by=['Duration_count'],ascending=False)['Duration_count'].values)
0.8010328269655939
np.sort(train_df1['Duration'].values)[-1:]/60/60
array([271.67638889])
num_bins = 1000
plt.hist(train_df['Duration'], num_bins, facecolor='blue', alpha=0.5,log=True)
plt.show()

png

num_bins = 1000
plt.hist(train_df1['Time'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()

png

train_df.keys()
Index(['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort',
       'DstPort', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes'],
      dtype='object')
# Counts of events per Duration value
duration_hist = train_df['Duration'].value_counts().sort_index()
plt.figure(figsize=(10,4))
plt.yscale('log')
plt.stem(duration_hist.index, duration_hist.values, markerfmt='b.')
plt.show()

Normalize data

In order for our autoencoder to train better, it will be useful to normalize the data. I initially skipped this step and was getting astronomically high losses!

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
colnames_tsa = ['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort', 'DstPort', 'SrcPackets', 
        'DstPackets', 'SrcBytes', 'DstBytes', 'SrcPackets_mean1s','SrcPackets_std1s','SrcPackets_mean10s',
        'SrcPackets_std10s','SrcPackets_mean60s','SrcPackets_std60s', 'DstPackets_mean1s','DstPackets_std1s','DstPackets_mean10s',
        'DstPackets_std10s','DstPackets_mean60s','DstPackets_std60s', 'SrcBytes_mean1s','SrcBytes_std1s','SrcBytes_mean10s',
        'SrcBytes_std10s','SrcBytes_mean60s','SrcBytes_std60s', 'DstBytes_mean1s','DstBytes_std1s','DstBytes_mean10s',
        'DstBytes_std10s','DstBytes_mean60s','DstBytes_std60s',]

scaler = StandardScaler()
scaled_df = scaler.fit_transform(train_df)
scaled_df = pd.DataFrame(scaled_df, columns=colnames)
scaler1 = MinMaxScaler()
scaled_df1 = scaler1.fit_transform(train_df2)
scaled_df1 = pd.DataFrame(scaled_df1, columns=colnames_tsa)
scaler2 = RobustScaler()
scaled_df2 = scaler2.fit_transform(train_df)
scaled_df2 = pd.DataFrame(scaled_df2, columns=colnames)

fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(16, 5))

ax1.set_title('Before Scaling')
sns.kdeplot(train_df['SrcDevice'], ax=ax1)
sns.kdeplot(train_df['DstDevice'], ax=ax1)
sns.kdeplot(train_df['SrcPort'], ax=ax1)
sns.kdeplot(train_df['DstPort'], ax=ax1)
ax2.set_title('After Standard Scaler')
sns.kdeplot(scaled_df['SrcDevice'], ax=ax2)
sns.kdeplot(scaled_df['DstDevice'], ax=ax2)
sns.kdeplot(scaled_df['SrcPort'], ax=ax2)
sns.kdeplot(scaled_df['DstPort'], ax=ax2)
ax3.set_title('After Min-Max Scaler')
sns.kdeplot(scaled_df1['SrcDevice'], ax=ax3)
sns.kdeplot(scaled_df1['DstDevice'], ax=ax3)
sns.kdeplot(scaled_df1['SrcPort'], ax=ax3)
sns.kdeplot(scaled_df1['DstPort'], ax=ax3)
ax4.set_title('After Robust Scaler')
sns.kdeplot(scaled_df2['SrcDevice'], ax=ax4)
sns.kdeplot(scaled_df2['DstDevice'], ax=ax4)
sns.kdeplot(scaled_df2['SrcPort'], ax=ax4)
sns.kdeplot(scaled_df2['DstPort'], ax=ax4)
plt.show()
/home/paperspace/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/sklearn/preprocessing/data.py:625: DataConversionWarning: Data with input dtype int8, int16, int32, int64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/home/paperspace/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/sklearn/base.py:462: DataConversionWarning: Data with input dtype int8, int16, int32, int64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
/home/paperspace/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/sklearn/preprocessing/data.py:323: DataConversionWarning: Data with input dtype int8, int16, int32, int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
/home/paperspace/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

png

Making the data richer with feature extraction

One-hot-encoding

Returning to the SrcDevice and DstDevice features, the 4 non-Comp/IP devices with no numbers associated with them are perfect for one-hot-encoding. The Protocol, with only 3 different values, is also a great candidate for one-hot-encoding.

train_df1 = train_df.copy()
cols = np.append(['Comp','IP'], noComp_orIP)
for i in cols:
    train_df1[f'SrcDevice_{i}'] = 0
    train_df1[f'DstDevice_{i}'] = 0
    ind_src = train_df['SrcDevice'].str.contains(f'{i}')
    ind_dst = train_df['DstDevice'].str.contains(f'{i}')
    train_df1.loc[ind_src, f'SrcDevice_{i}'] = 1
    train_df1.loc[ind_dst, f'DstDevice_{i}'] = 1
    if i == 'Comp' or i == 'IP':  # Replace the 1 with the actual Comp or IP number
        train_df1.loc[ind_src, f'SrcDevice_{i}'] = train_df1['SrcDevice'].loc[ind_src].map(lambda x: int(x.lstrip(f'{i}')))
        train_df1.loc[ind_dst, f'DstDevice_{i}'] = train_df1['DstDevice'].loc[ind_dst].map(lambda x: int(x.lstrip(f'{i}')))
for i in np.unique(train_df1['Protocol'].values):
    train_df1[f'Protocol_{i}'] = 0
    train_df1.loc[train_df['Protocol'] == i, f'Protocol_{i}'] = 1
train_df1.drop(columns=['SrcDevice','DstDevice','Protocol'], inplace=True)
train_df1['SrcPort'] = train_df1['SrcPort'].map(lambda x: int(x.lstrip('Port')))
train_df1['DstPort'] = train_df1['DstPort'].map(lambda x: int(x.lstrip('Port')))

Time-series Analysis (TSA)

So far we have not really encoded any idea of 'normal' or 'abnormal' into our model.
From Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection (Mirsky 2018)

Feature extraction is the process of obtaining or engineering a vector of values which describe a real world observation. In network anomaly detection, it is important to extract features which capture the context and purpose of each packet traversing the network. For example, consider a single TCP SYN packet. The packet may be a benign attempt to establish a connection with a server, or it may be one of millions of similar packets sent in an attempt to cause a denial of service attack (DoS).

This is just one example of attacks where temporal-statistical features could help detect anomalies. The challenge with extracting these kinds of features from network traffic is that (1) packets from different channels (conversations) are interleaved, (2) there can be many channels at any given moment, (3) the packet arrival rate can be very high.

We can begin to do this by adding rolling statistics of events, applying them to each feature. This way we will be able to check how each event and its features compare to a typical event in a window of x seconds.

Let's first look at the mean and standard deviation of the Packets and Bytes features for time windows of 1, 10 and 60 seconds into the past. This will add 24 features to our dataset.

def FeatureExtractor(df, features, time_windows):
    # For each feature and each time window, compute the rolling mean and standard
    # deviation over the previous `j` seconds and append them as new columns.
    dff = df.copy()
    dff['datetime'] = pd.to_datetime(dff['Time'], unit='s')
    dff.index = dff['datetime']
    for i in features:
        for j in time_windows:
            print(i, j)
            tmp_mean = dff[f'{i}'].rolling(f'{j}s', min_periods=1).mean()
            tmp_std = dff[f'{i}'].rolling(f'{j}s', min_periods=1).std().fillna(0)
            tmp_mean.index = df.index
            tmp_std.index = df.index
            df[f'{i}_mean{j}s'] = tmp_mean
            df[f'{i}_std{j}s'] = tmp_std

Future Work: One way to improve the FeatureExtractor is to initialize the statistics with a typical stream of data instead of just zero.
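A minimal sketch of that warm-start idea (the warmup_df argument and the function name here are hypothetical; the point is simply to prepend a block of 'typical' flows so the earliest rolling windows aren't computed from a cold start):

def FeatureExtractorWarmStart(df, warmup_df, features, time_windows):
    # Prepend typical flows, compute rolling stats over the combined frame,
    # then keep only the rows that belong to the original df.
    combined = pd.concat([warmup_df, df], ignore_index=True)
    combined['datetime'] = pd.to_datetime(combined['Time'], unit='s')
    combined.index = combined['datetime']
    n_warm = len(warmup_df)
    for i in features:
        for j in time_windows:
            tmp_mean = combined[i].rolling(f'{j}s', min_periods=1).mean()
            tmp_std = combined[i].rolling(f'{j}s', min_periods=1).std().fillna(0)
            df[f'{i}_mean{j}s'] = tmp_mean.values[n_warm:]
            df[f'{i}_std{j}s'] = tmp_std.values[n_warm:]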

train_df2 = train_df.copy()
features = ['SrcPackets','DstPackets','SrcBytes','DstBytes']
time_windows = [1,10,60]
FeatureExtractor(train_df2,features,time_windows)
SrcPackets 1
SrcPackets 10
SrcPackets 60
DstPackets 1
DstPackets 10
DstPackets 60
SrcBytes 1
SrcBytes 10
SrcBytes 60
DstBytes 1
DstBytes 10
DstBytes 60

Entity Embedding for categorical variables.

Future work: Since our categorical variables have such high cardinality, it might be useful to create an embedding vector for each of the ID numbers that is of a significantly lower dimension than the original cardinality. By combining all the vectors together we will have an embedding matrix.
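As a rough sketch of what this could look like in PyTorch (layer and embedding sizes here are illustrative, not tuned):

from torch import nn

class DeviceEmbedder(nn.Module):
    # Illustrative only: map each device ID (as an integer category code) to a small dense vector.
    def __init__(self, n_devices, emb_dim=16):
        super().__init__()
        self.emb = nn.Embedding(n_devices, emb_dim)

    def forward(self, device_ids):
        return self.emb(device_ids)

# e.g. compress the ~19K unique SrcDevice IDs down to 16-dimensional vectors
src_embedder = DeviceEmbedder(n_devices=18950, emb_dim=16)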

Feature Mapper

Future work: Let's use agglomerative hierarchical clustering to see if any of our features are redundant.
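A minimal sketch of that check, clustering the continuous features by Spearman rank correlation (assuming the train_df2 frame with the TSA features from above):

from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Features that merge at a very small distance are highly rank-correlated
# and therefore likely redundant.
corr = spearmanr(train_df2[cont_vars].values).correlation
dist = squareform(1 - np.abs(corr), checks=False)   # condensed distance matrix
z = hierarchy.linkage(dist, method='average')
hierarchy.dendrogram(z, labels=cont_vars)
plt.show()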

Testing various Autoencoder Architectures

Imports and Utility Functions

import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import os
import seaborn as sns

#if not os.path.exists('./mlp_img'):
#    os.mkdir('./mlp_img')

epochs = 30
bs = 128
lr = 1e-3
loss_func = F.binary_cross_entropy
dev = torch.device(
    "cuda") if torch.cuda.is_available() else torch.device("cpu")

def get_data(train_ds, valid_ds, bs):
    return (
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(valid_ds, batch_size=bs * 2),
    )

def loss_batch(model, loss_func, xb, yb, opt=None):
    loss = loss_func(model(xb), yb)

    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()

    return loss.item(), len(xb)

def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    val_loss_hist = np.zeros(epochs)
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_batch(model, loss_func, xb, yb, opt)

        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
            )
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
        val_loss_hist[epoch] = val_loss
        print(epoch, val_loss)
    return val_loss_hist

class Lambda(nn.Module):
    def __init__(self, func):
        super().__init__()
        self.func = func

    def forward(self, x):
        return self.func(x)

def preprocess(x, y):
    return x.view(-1,11).to(dev), y.to(dev)

class WrappedDataLoader:
    def __init__(self, dl, func):
        self.dl = dl
        self.func = func

    def __len__(self):
        return len(self.dl)

    def __iter__(self):
        batches = iter(self.dl)
        for b in batches:
            yield (self.func(*b))
def check_cats(data, cat_vars):
    for i in cat_vars:
        data[f'{i}'] = data[f'{i}'].astype('category').cat.codes
    return data
DIR = "../input"

colnames = ['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort', 'DstPort', 'SrcPackets', 
       'DstPackets', 'SrcBytes', 'DstBytes']
cat_vars = ['SrcDevice', 'DstDevice','Protocol','SrcPort', 'DstPort']
cont_vars = ['Time', 'Duration', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes']
train_df = pd.read_csv(os.path.join(DIR, 'netflow_day-03.csv'),names = colnames)
train_df = check_cats(train_df,cat_vars)

Just to get a feel for our data and model performance, I'll only be using ~2% of the data to allow for quick iterations. The goal is to see what works initially; when training a final model I'll use all of the data.

x_train, x_valid, y_train, y_valid = train_test_split(scaled_df1.iloc[:100000], scaled_df1.iloc[:100000], test_size=0.33, random_state=42)
x_train, y_train, x_valid, y_valid = map(torch.tensor, (x_train.values, y_train.values, x_valid.values, y_valid.values))
x_train, y_train, x_valid, y_valid = x_train.float(), y_train.float(), x_valid.float(), y_valid.float()
train_ds = TensorDataset(x_train, y_train)
valid_ds = TensorDataset(x_valid, y_valid)
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
train_dl = WrappedDataLoader(train_dl, preprocess)
valid_dl = WrappedDataLoader(valid_dl, preprocess)

Basic AutoEncoder (AE)

class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(11, 3),
            nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(3, 11),
            nn.Sigmoid()) 

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

For our initial run, let's try training the AE using the standard Stochastic Gradient Descent (SGD) optimization algorithm.

model = autoencoder()
model.to(dev)
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
simpleAE = fit(epochs, model, loss_func, opt, train_dl, valid_dl)
0 0.6153640248992226
1 0.5632328183867714
2 0.521731213497393
3 0.48840221228744046
4 0.4613625860214233
5 0.43919258066379663
6 0.42081355727080144
7 0.40540852949836037
8 0.392343591075955
9 0.3811262700918949
10 0.371357077107285
11 0.36269724823489335
12 0.3549058718753583
13 0.3477984815799829
14 0.34122927554448446
15 0.335087068752809
16 0.32927751234083463
17 0.32372928273316587
18 0.31840084727605183
19 0.3132779013893821
20 0.30835604749303874
21 0.30364818929180953
22 0.29918292765183885
23 0.29499714277729844
24 0.29111959272442445
25 0.28757020660602683
26 0.2843516824967933
27 0.28145706274292687
28 0.27886660063627994
29 0.2765542323661573

When utilizing the Adam optimization algorithm instead, we see that our loss starts off much lower and ultimately achieves a better result.

model = autoencoder()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
simpleAE1 = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.3132772430289875
1 0.27298690352295385
2 0.26548837280273435
3 0.2561405701709516
4 0.24500035896807007
5 0.23917841724554698
6 0.23640931301044696
7 0.22925565374981274
8 0.2199682547901616
9 0.21693617815321142
10 0.2149911564985911
11 0.21362458565740874
12 0.21265988235762626
13 0.21197906290039872
14 0.2114728673732642
15 0.21104865649974708
16 0.2106399353056243
17 0.21023554500305291
18 0.2098258723598538
19 0.2094674902973753
20 0.20916657164241328
21 0.20891935854247123
22 0.20872451052882454
23 0.20859452519633553
24 0.2084934107245821
25 0.20842122670014698
26 0.20837569244702658
27 0.20834395785042734
28 0.20831816785263294
29 0.20830102947986487

By changing the size of our bottleneck, we are expanding/limiting the space our network will use to find latent representations of our features. For now, increasing the size seems to only help marginally. However, we will want to return to this hyperparameter when we engineer more features in the future.

class autoencoder1(nn.Module):
    def __init__(self):
        super(autoencoder1, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(11, 6),
            nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(6, 11),
            nn.Sigmoid()) 

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
model = autoencoder1()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
SimpleAE2 = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.27842848107309054
1 0.2424553751078519
2 0.22034330931938056
3 0.2125227058836908
4 0.2091995555559794
5 0.20629277456890452
6 0.20423057547121337
7 0.20225441743749561
8 0.2000320770451517
9 0.19847749503814813
10 0.19774435007572175
11 0.19735746824741363
12 0.1971329365788084
13 0.19700006837555856
14 0.19692712545756139
15 0.19688325585018504
16 0.19685634627847962
17 0.19684289213744077
18 0.19683339035149777
19 0.19682246895269914
20 0.19681916312376657
21 0.19681775667450646
22 0.19682014381524288
23 0.19681429224664515
24 0.1968093104543108
25 0.19681130621288762
26 0.19680515278108193
27 0.19680367324930248
28 0.19680531917557573
29 0.19681047119877554

QUESTION: What size should we select for the latent representation of our data?
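One hedged way to answer this empirically is a small sweep over bottleneck sizes, retraining the same one-layer AE for each and comparing final validation losses (a sketch reusing the fit utilities and optimizer settings above; the class name is just for illustration):

class BottleneckAE(nn.Module):
    def __init__(self, n_in, n_latent):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_latent), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_latent, n_in), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Sweep a handful of latent sizes and record the final validation loss for each
for n_latent in [2, 3, 4, 6, 8]:
    model = BottleneckAE(11, n_latent).to(dev)
    opt = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    hist = fit(epochs, model, loss_func, opt, train_dl, valid_dl)
    print('latent size:', n_latent, 'final val loss:', hist[-1])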

Deep AutoEncoder (aka Stacked AutoEncoder, SAE)

class deep_autoencoder(nn.Module):
    def __init__(self):
        super(deep_autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(11, 9),
            nn.ReLU(True),
            nn.Linear(9, 6),
            nn.ReLU(True))
        self.decoder = nn.Sequential(
            nn.Linear(6, 9),
            nn.ReLU(True),
            nn.Linear(9, 11),
            nn.Sigmoid())

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

Now I'll create an AE with multiple hidden layers: a deep AE. Let's start with a model that has 2 hidden layers in the encoder. This doesn't seem to do much better than our one-layer AE but, again, it might be useful once we incorporate more features in the future.

model = deep_autoencoder()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
deepAE = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.2449271744598042
1 0.22742869142330055
2 0.21516548670060706
3 0.2077671935377699
4 0.2028120079979752
5 0.20213690534866216
6 0.200944096066735
7 0.19925501989595817
8 0.19780528188835492
9 0.1971274009401148
10 0.1968309831944379
11 0.19652490559852484
12 0.1962942510987773
13 0.19614683185562942
14 0.19603187960566895
15 0.19594012701872623
16 0.19582038521766662
17 0.19572626200589266
18 0.19565712436040242
19 0.19555748596335903
20 0.19550040659759985
21 0.19545797579577476
22 0.1954013790434057
23 0.19536583811947794
24 0.19534382763414673
25 0.19530811882019042
26 0.19530282218528516
27 0.19527743303414546
28 0.19524949955579007
29 0.1952263595046419

Another option is to use 'overcomplete' layers: we can balloon the size of the first layers and then shrink back down through the bottleneck. The simpler 2-layer AE above does a better job, so this doesn't seem like a promising avenue to pursue further.

class deep_autoencoder1(nn.Module):
    def __init__(self):
        super(deep_autoencoder1, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(11, 128),
            nn.ReLU(True),
            nn.Linear(128, 64),
            nn.ReLU(True), 
            nn.Linear(64, 12), 
            nn.ReLU(True), 
            nn.Linear(12, 6))
        self.decoder = nn.Sequential(
            nn.Linear(6, 12),
            nn.ReLU(True),
            nn.Linear(12, 64),
            nn.ReLU(True),
            nn.Linear(64, 128),
            nn.ReLU(True), 
            nn.Linear(128, 11), 
            nn.Sigmoid())

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
model = deep_autoencoder1()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
deepAE1 = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.2650241525100939
1 0.23482537920186014
2 0.21984392593123697
3 0.21350352708137396
4 0.20568819431825117
5 0.20219772437847022
6 0.20110394136110943
7 0.20012424728364656
8 0.19942686419053512
9 0.19796628739978328
10 0.19645627458529039
11 0.1958876735911225
12 0.19568070779424726
13 0.19561183846358096
14 0.1956282062964006
15 0.19557593305544418
16 0.1955740126985492
17 0.19555451732332058
18 0.19555348258668726
19 0.1955339559063767
20 0.1955322019042391
21 0.19553729743668527
22 0.19552998865011967
23 0.19553163955428385
24 0.1955169496716875
25 0.19551997378016964
26 0.19551090997276885
27 0.19550259295015623
28 0.19551545627189404
29 0.19549637895280664

Convolutional AutoEncoder (CAE)

Future work! See below for a rough schematic of a planned implementation.

class conv_autoencoder(nn.Module):
    # Rough schematic only: expects input reshaped to (batch, 1, n_features);
    # kernel/stride sizes are placeholders and not yet tuned to our feature count.
    def __init__(self):
        super(conv_autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, 3, stride=3, padding=1),
            nn.ReLU(True),
            nn.MaxPool1d(2, stride=2),
            nn.Conv1d(16, 8, 3, stride=2, padding=1),
            nn.ReLU(True),
            nn.MaxPool1d(2, stride=1)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(8, 16, 3, stride=2),
            nn.ReLU(True),
            nn.ConvTranspose1d(16, 8, 5, stride=3, padding=1),
            nn.ReLU(True),
            nn.ConvTranspose1d(8, 1, 2, stride=2, padding=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
# Note: preprocess would also need to reshape each batch to (batch, 1, n_features) for Conv1d.
model = conv_autoencoder()
model.to(dev)
opt_conv = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
convAE = fit(epochs, model, loss_func, opt_conv, train_dl, valid_dl)

Comparison of AE architectures with default dataset (no feature engineering)

fig = plt.figure()
sub = fig.add_subplot(111)
sub.plot(np.arange(epochs),simpleAE,  label = 'Simple w/ SGD')
sub.plot(np.arange(epochs),simpleAE1, label = 'Simple w/ Adam')
sub.plot(np.arange(epochs),deepAE, label = 'Deep (2-layer) w/ Adam')
sub.plot(np.arange(epochs),deepAE1, label = 'Deep (6-layer) w/ Adam')
fig.legend()
<matplotlib.legend.Legend at 0x7f6b42de8940>

png

Comparison of AE architectures with feature engineered dataset

Basic AE w/ TSA features

class autoencoder2(nn.Module):
    def __init__(self):
        super(autoencoder2, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(35, 13),
            nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(13, 35),
            nn.Sigmoid()) 

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
def preprocess(x, y):
    return x.view(-1,35).to(dev), y.to(dev)
x_train, x_valid, y_train, y_valid = train_test_split(scaled_df1.iloc[:100000], scaled_df1.iloc[:100000], test_size=0.33, random_state=42)
x_train, y_train, x_valid, y_valid = map(torch.tensor, (x_train.values, y_train.values, x_valid.values, y_valid.values))
x_train, y_train, x_valid, y_valid = x_train.float(), y_train.float(), x_valid.float(), y_valid.float()
train_ds = TensorDataset(x_train, y_train)
valid_ds = TensorDataset(x_valid, y_valid)
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
train_dl = WrappedDataLoader(train_dl, preprocess)
valid_dl = WrappedDataLoader(valid_dl, preprocess)
model = autoencoder2()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
simpleAE_ed = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.15542524325486384
1 0.1330017716595621
2 0.12358071852994687
3 0.12134318643627745
4 0.12013640852949836
5 0.11922035816040906
6 0.1186038601488778
7 0.11830568287769953
8 0.11817351314515778
9 0.11809839636629278
10 0.11803493072950479
11 0.118003826110652
12 0.11796236093477769
13 0.11794244261040832
14 0.11791596555709839
15 0.11789782470103466
16 0.11788164298642766
17 0.11786410925785701
18 0.1178481928651983
19 0.11783261449770494
20 0.11781676309217107
21 0.11780566596984864
22 0.11778696250373667
23 0.11777133213570623
24 0.11775690714156989
25 0.11774744177406485
26 0.11774040185863321
27 0.11772692710883689
28 0.11772219492630526
29 0.11770839673100096
class autoencoder3(nn.Module):
    def __init__(self):
        super(autoencoder3, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(35, 3),
            nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(3, 35),
            nn.Sigmoid()) 

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
def preprocess(x, y):
    return x.view(-1,35).to(dev), y.to(dev)
model = autoencoder3()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
simpleAE_ed1 = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.18049194024548387
1 0.15102098090359659
2 0.1473470046339613
3 0.143951437588894
4 0.13999235530333085
5 0.1366622529969071
6 0.13426986752134382
7 0.1325131246248881
8 0.13121682799223697
9 0.13023262498595498
10 0.12944078761158567
11 0.12880684867049708
12 0.1282730374841979
13 0.12780357990842878
14 0.12739020001165793
15 0.12700787126656735
16 0.1266632669802868
17 0.12634942066669463
18 0.12608273115663818
19 0.12587054110657084
20 0.12572116792201996
21 0.12558745316664377
22 0.1255006782141599
23 0.12543083710381478
24 0.12537497831113412
25 0.12533319084210828
26 0.12530713788307074
27 0.12527308274399152
28 0.1252579463900942
29 0.12523770618799962

Deep AutoEncoder w/ TSA features

class deep_autoencoder1(nn.Module):
    def __init__(self):
        super(deep_autoencoder1, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(35, 26),
            nn.ReLU(True),
            nn.Linear(26, 13),
            nn.ReLU(True),
            nn.Linear(13, 5),
            nn.ReLU(True))
        self.decoder = nn.Sequential(
            nn.Linear(5, 13),
            nn.ReLU(True),
            nn.Linear(13, 26),
            nn.ReLU(True),
            nn.Linear(26, 35),
            nn.Sigmoid())

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
model = deep_autoencoder1()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
deepAE_ed = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.14267463051550316
1 0.13841716249422595
2 0.12931717374469295
3 0.12356641965923887
4 0.12296843826228922
5 0.12272117885856917
6 0.1225782561211875
7 0.12246075998472444
8 0.1223881917180437
9 0.12213705055641405
10 0.12176700847076648
11 0.1211470121939977
12 0.11976834000421292
13 0.11921178272818074
14 0.11903793139349331
15 0.11892726133447705
16 0.11888989595030293
17 0.11885944359230273
18 0.11882691781990457
19 0.11880519064267477
20 0.11877735575943282
21 0.11873244151743975
22 0.11870291563055732
23 0.11867264552911122
24 0.11859710764523708
25 0.11848349631193912
26 0.11834015336903658
27 0.11821681090918454
28 0.11815993258447358
29 0.11813407987717427

Just by including our engineered TSA features, we've lowered the loss by nearly half!

fig = plt.figure()
sub = fig.add_subplot(111)
sub.plot(np.arange(epochs),simpleAE_ed,  label = 'SimpleAE w/ eng. feat.')
sub.plot(np.arange(epochs),simpleAE_ed1, label = 'SimpleAE w/ eng. feat., diff. size')
sub.plot(np.arange(epochs),deepAE_ed, label = 'Deep (6-layer) w/ Adam')
fig.legend()
<matplotlib.legend.Legend at 0x7f539722af28>

png

Improving the learning rate

Future work

I want to implement Smith's one-cycle learning schedule (Smith 2015), which incorporates cyclical learning rate (and momentum) annealing. This is already implemented in the fastai wrapper.
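Outside of fastai, newer PyTorch versions also provide a built-in one-cycle scheduler; a minimal sketch of wiring it into a training loop like the one in fit above (the only real change is the per-batch scheduler.step() call):

# Sketch: one-cycle learning-rate schedule with PyTorch's built-in scheduler,
# using the 35-feature basic AE and the loaders currently in scope.
model = autoencoder2().to(dev)
opt = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
scheduler = optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-2, epochs=epochs, steps_per_epoch=len(train_dl))

for epoch in range(epochs):
    model.train()
    for xb, yb in train_dl:
        loss = loss_func(model(xb), yb)
        loss.backward()
        opt.step()
        scheduler.step()   # advance the cyclical learning-rate/momentum schedule once per batch
        opt.zero_grad()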

Future work

  • Entity Embedding for categorical values
  • 2D Statistics (magnitude, radius, approx. covariance and correlation coefficient of 2 different streams). Question: How best to define different streams? Are these simply different time windows of the same stream?
  • Feature Mapping with agglomerative hierarchical clustering
  • Implement a Convolutional AutoEncoder
  • Apply learning rate/momentum annealing
  • Incorporate host events dataset

Resources

Good resource for autoencoders
Good resource for variational autoencoders (VAEs)
Outlook inspired by Kitsune paper (Mirsky 2018)
Insight from Cyber Reboot's work on using RNN autoencoders on network traffic data.
Good resource for embeddings of structured data.