Goal
Train an autoencoder neural network to develop lower-dimensional representations of netflow data in an attempt to identify categories of behavior present within the data and effectively differentiate between abnormal and normal behavior.
When I say lower-dimensional representation (aka an embedding), what I mean is basically a fingerprint: a way to take all the features of raw netflow data and distill them down into the key components that identify and classify the data.
Intro to Unsupervised Learning and Autoencoders
An autoencoder is a type of unsupervised learning algorithm.
What do I mean by an unsupervised learning algorithm? An unsupervised learning algorithm takes in unlabeled or unclassified data and attempts to develop its own labels/classifications. An easy way to think of this is showing an algorithm a bunch of images of cartoon characters without telling the algorithm which character is which. While training, the algorithm attempts to place the characters into its own groups based on the similarities inherent in each image. Below our algorithm has clearly discovered two beloved types of cartoon characters: Duck and not-Duck!
Specifically, an autoencoder is a type of neural network algorithm that attempts to reproduce the input that is passed into its input layer. Initially this may sound unimpressive and indeed, if we allow our encoding to be the same size as the input, our algorithm could easily learn to simply memorize the input values!
However, by imposing a bottleneck in the network we can force a compressed knowledge representation of the original input (i.e. we can boil the input data down to its essential pieces, its fingerprint). If the input features were each independent of one another, this compression and subsequent reconstruction would be a very difficult task. However, if some sort of structure exists in the data (i.e. correlations between input features), that structure can be learned and consequently leveraged when forcing the input through the network's bottleneck.
Download the data
The dataset I'll be using for this project is the LANL 2017 netflow dataset, with my initial analysis focusing on day-03. There are about 6.1 million entries, or 'events', on day-03, each with the 11 fields of data described below:
%wget -c https://csr.lanl.gov/data/unified-host-network-dataset-2017/netflow/netflow_day-03.bz2
Field Name | Description |
---|---|
Time | The start time of the event in epoch time format. |
Duration | The duration of the event in seconds. |
SrcDevice | The device that likely initiated the event. |
DstDevice | The receiving device. |
Protocol | The protocol number. |
SrcPort | The port used by the SrcDevice. |
DstPort | The port used by the DstDevice. |
SrcPackets | The number of packets the SrcDevice sent during the event. |
DstPackets | The number of packets the DstDevice sent during the event. |
SrcBytes | The number of bytes the SrcDevice sent during the event. |
DstBytes | The number of bytes the DstDevice sent during the event. |
Load the data
Each day of the LANL netflow data ends in an incomplete line, so a simple bunzip2 will fail. To work around this we can run bzip2recover on the file, use bzip2 to decompress the recovered blocks and stitch them together into a single csv file, and finally use a quick bash command to remove the corrupted final line.
%bzip2recover netflow_day-03.bz2
%bzip2 -dc rec00* > netflow_day-03.csv
%head -n -1 netflow_day-03.csv > temp.txt ; mv temp.txt netflow_day-03.csv
%mkdir ../input
%mv netflow_day-03.csv ../input/
Data Exploration
Initial look
%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
DIR = "../input"
colnames = ['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort', 'DstPort', 'SrcPackets',
'DstPackets', 'SrcBytes', 'DstBytes']
cat_vars = ['SrcDevice', 'DstDevice','Protocol','SrcPort', 'DstPort']
cont_vars = ['Time', 'Duration', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes']
train_df = pd.read_csv(os.path.join(DIR, 'netflow_day-03.csv'),names = colnames)
We've got a little over 6.1 million events occurring during day-03 in our dataset.
train_df.shape
(6102087, 11)
train_df.head()
Time | Duration | SrcDevice | DstDevice | Protocol | SrcPort | DstPort | SrcPackets | DstPackets | SrcBytes | DstBytes | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 172800 | 0 | Comp348305 | Comp370444 | 6 | Port02726 | 80 | 0 | 5 | 0 | 784 |
1 | 172800 | 0 | Comp817584 | Comp275646 | 17 | Port97545 | 53 | 1 | 0 | 77 | 0 |
2 | 172800 | 0 | Comp654013 | Comp685925 | 6 | Port26890 | Port94857 | 6 | 5 | 1379 | 1770 |
3 | 172800 | 0 | Comp500631 | Comp275646 | 17 | Port62938 | 53 | 1 | 0 | 64 | 0 |
4 | 172800 | 0 | Comp500631 | Comp275646 | 17 | Port52912 | 53 | 1 | 0 | 64 | 0 |
Let's get an idea of the number of unique values for a few of the features. SrcDevice
and DstDevice
each have roughly 19K unique values. SrcPort
has roughly 65K unique values while DstPort
has roughly 40K unique values.
(Technical Aside: This high cardinality makes it difficult to take advantage of a one-hot-encoding scheme for these 4 variables, although we will be able to break apart SrcDevice and DstDevice for some more information, as I will show below. Question: Are Comp/IP/Port numbers that are near one another related in their uses? For now I'll treat them as categorical. Protocol only has 3 different values, so we can easily one-hot-encode that variable (1=ICMP Internet Control Message, 6=TCP Transmission Control, 17=UDP User Datagram).)
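As a quick sketch of that last point (this mirrors the encoding done later in the notebook; pd.get_dummies is just one convenient way to do it):
import pandas as pd
# Sketch: one-hot-encode the 3-valued Protocol field, producing columns
# like Protocol_1, Protocol_6 and Protocol_17.
proto_dummies = pd.get_dummies(train_df['Protocol'], prefix='Protocol')
train_df_ohe = pd.concat([train_df.drop(columns=['Protocol']), proto_dummies], axis=1)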
SrcDevice
and DstDevice
There are 6 general categories of Devices: Comp, IP, ActiveDirectory, EnterpriseAppServer, VPN and VScanner.
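As a quick (hedged) check of that breakdown, the alphabetic prefix of each device name can be pulled out with a regex and counted; the assumption here is that every device name is a letter prefix optionally followed by digits:
# Sketch: strip the digits off names like 'Comp348305' and count each device category
src_type = train_df['SrcDevice'].str.extract(r'^([A-Za-z]+)', expand=False)
print(src_type.value_counts())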
train_df['SrcDevice'].astype('category').cat.categories
Index(['ActiveDirectory', 'Comp000116', 'Comp000156', 'Comp000219',
'Comp000566', 'Comp000573', 'Comp000577', 'Comp000688', 'Comp000696',
'Comp000775',
...
'IP998111', 'IP998491', 'IP998759', 'IP998823', 'IP999095', 'IP999633',
'IP999839', 'IP999955', 'VPN', 'VScanner'],
dtype='object', length=18950)
train_df['DstDevice'].astype('category').cat.categories
Index(['ActiveDirectory', 'Comp000190', 'Comp000244', 'Comp000332',
'Comp000346', 'Comp000386', 'Comp000575', 'Comp000595', 'Comp000688',
'Comp000944',
...
'IP999711', 'IP999776', 'IP999780', 'IP999852', 'IP999875', 'IP999896',
'IP999935', 'IP999949', 'VPN', 'VScanner'],
dtype='object', length=18921)
There are a few other Devices beyond Comps and IPs.
noComp_orIP = train_df['SrcDevice'].loc[(train_df['SrcDevice'].str.contains('Comp')==False) &
(train_df['SrcDevice'].str.contains('IP')==False)].astype('category').cat.categories
print(noComp_orIP)
Index(['ActiveDirectory', 'EnterpriseAppServer', 'VPN', 'VScanner'], dtype='object')
The vast majority (about 97.7%) of events have a SrcDevice that is a Comp. The next closest device type is the EnterpriseAppServer at 1.37% of events. It's also interesting to note that just 8 Comps are the SrcDevice of nearly half the data.
print('Events\' SrcDevice Breakdown')
for i in ['Comp','IP','ActiveDirectory', 'EnterpriseAppServer', 'VPN', 'VScanner']:
tmp = train_df['SrcDevice'].loc[(train_df['SrcDevice'].str.contains(f'{i}')==True)].count()/train_df['Time'].count()
print(i,tmp)
Events' SrcDevice Breakdown
Comp 0.9766106251844656
IP 0.0033899877205946097
ActiveDirectory 0.0014785105489318654
EnterpriseAppServer 0.013733498063859135
VPN 0.0011150283501366008
VScanner 0.0036723501320122115
counted = train_df1.groupby("SrcDevice_Comp").count().rename(columns={"Time":"SrcDevice_count"})
counted.sort_values(by=['SrcDevice_count'],ascending=False).head(10)
SrcDevice_count | Duration | SrcPort | DstPort | SrcPackets | DstPackets | SrcBytes | DstBytes | DstDevice_Comp | SrcDevice_IP | ... | DstDevice_ActiveDirectory | SrcDevice_EnterpriseAppServer | DstDevice_EnterpriseAppServer | SrcDevice_VPN | DstDevice_VPN | SrcDevice_VScanner | DstDevice_VScanner | Protocol_1 | Protocol_6 | Protocol_17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SrcDevice_Comp | |||||||||||||||||||||
44772 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | ... | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 | 412110 |
296454 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | ... | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 | 396393 |
623258 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | ... | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 | 376489 |
73202 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | ... | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 | 355912 |
30334 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | ... | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 | 343163 |
866402 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | ... | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 | 342651 |
257274 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | ... | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 | 342549 |
965575 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | ... | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 | 337894 |
0 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | ... | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 | 142724 |
107130 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | ... | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 | 97940 |
10 rows × 22 columns
While the majority of events have a Comp for their DstDevice
(86.61%), ActiveDirectory
and EnterpriseAppServer
have a larger share than for the SrcDevice
(7.14% and 5.85% respectively).
print('Events\' DstDevice Breakdown')
for i in ['Comp','IP','ActiveDirectory', 'EnterpriseAppServer', 'VPN', 'VScanner']:
tmp = train_df['DstDevice'].loc[(train_df['DstDevice'].str.contains(f'{i}')==True)].count()/train_df['Time'].count()
print(i,tmp)
Events' DstDevice Breakdown
Comp 0.8661133477775719
IP 0.0035854946020926937
ActiveDirectory 0.07141753305057762
EnterpriseAppServer 0.05854111880082995
VPN 0.00013454413219608308
VScanner 0.00020796163673182633
counted = train_df1.groupby("DstDevice_Comp").count().rename(columns={"Time":"DstDevice_count"})
counted.sort_values(by=['DstDevice_count'],ascending=False).head(10)
DstDevice_count | Duration | SrcPort | DstPort | SrcPackets | DstPackets | SrcBytes | DstBytes | SrcDevice_Comp | SrcDevice_IP | ... | DstDevice_ActiveDirectory | SrcDevice_EnterpriseAppServer | DstDevice_EnterpriseAppServer | SrcDevice_VPN | DstDevice_VPN | SrcDevice_VScanner | DstDevice_VScanner | Protocol_1 | Protocol_6 | Protocol_17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DstDevice_Comp | |||||||||||||||||||||
0 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | ... | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 | 816988 |
275646 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | ... | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 | 624576 |
576031 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | ... | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 | 218410 |
576843 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | ... | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 | 193316 |
186884 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | ... | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 | 145802 |
916004 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | ... | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 | 66911 |
63824 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | ... | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 | 63126 |
393033 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | ... | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 | 44319 |
501516 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | ... | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 | 43614 |
995183 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | ... | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 | 37139 |
10 rows × 22 columns
There doesn't seem to be anything particularly interesting in the distribution of Comp numbers for SrcDevice or DstDevice. Again we see that a select few Comps carry a majority of the traffic.
num_bins = 100
plt.hist(train_df1['SrcDevice_Comp'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()
num_bins = 100
plt.hist(train_df1['DstDevice_Comp'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()
counted = train_df.groupby("SrcDevice").count().rename(columns={"Time":"SrcDevice_count"})
for i in np.logspace(0,4,5):
if i != 1e4:
counted.loc[(counted["SrcDevice_count"] > i) & (counted["SrcDevice_count"] < i*10),'SrcDevice_count'] = i
else:
counted.loc[counted["SrcDevice_count"] > i,'SrcDevice_count'] = i
plt.figure()
sns.countplot(data=counted, x="SrcDevice_count")
plt.show()
Looking at the distribution of SrcDevice, it appears that the vast majority (98.401%) of SrcDevices are used anywhere from 1 to 1000 times, yet they only account for about a quarter of the data! Another quarter of the data comes from the 1.557% of SrcDevices used between 1000 and 1e5 times, and nearly half of the data originates from just 0.042% (8!!) of SrcDevices.
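A short sketch to sanity-check those fractions (the k values below are just illustrative cut-offs):
# Sketch: fraction of all events generated by the k most active SrcDevices
src_counts = train_df['SrcDevice'].value_counts()
for k in (8, 100, 1000):
    share = src_counts.iloc[:k].sum() / len(train_df)
    print(f"top {k} SrcDevices -> {share:.2%} of events")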
counted = train_df.groupby("DstDevice").count().rename(columns={"Time":"DstDevice_count"})
for i in np.logspace(0,4,5):
if i != 1e4:
counted.loc[(counted["DstDevice_count"] > i) & (counted["DstDevice_count"] < i*10),'DstDevice_count'] = i
else:
counted.loc[counted["DstDevice_count"] > i,'DstDevice_count'] = i
plt.figure()
sns.countplot(data=counted, x="DstDevice_count")
plt.show()
for i in np.logspace(1,3,3):
print(i, len(counted['Duration'].loc[counted['Duration']>i].values))
10.0 10997
100.0 6131
1000.0 303
train_df.head()
Time | Duration | SrcDevice | DstDevice | Protocol | SrcPort | DstPort | SrcPackets | DstPackets | SrcBytes | DstBytes | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 172800 | 0 | 4420 | 2456 | 1 | 1926 | 224 | 0 | 5 | 0 | 784 |
1 | 172800 | 0 | 10428 | 1828 | 2 | 63323 | 155 | 1 | 0 | 77 | 0 |
2 | 172800 | 0 | 8268 | 4593 | 1 | 17434 | 37537 | 6 | 5 | 1379 | 1770 |
3 | 172800 | 0 | 6348 | 1828 | 2 | 40895 | 155 | 1 | 0 | 64 | 0 |
4 | 172800 | 0 | 6348 | 1828 | 2 | 34386 | 155 | 1 | 0 | 64 | 0 |
Attempt at showing clusters of devices through a flow graph (Work in progress!!)
import igraph as ig
train_df0 = pd.read_csv(os.path.join(DIR, 'netflow_day-03.csv'),names = colnames)
flow = train_df0.groupby(['SrcDevice','DstDevice']).size().reset_index().rename(columns={0:'count'})
computer_edgelist = dict(zip(map(tuple,flow[['SrcDevice','DstDevice']].values.tolist()),flow['count'].values.tolist()))
src_set = set([x[0][0] for x in computer_edgelist.items()])
dst_set = set([x[0][1] for x in computer_edgelist.items()])
print("Sources: ", len(src_set))
print("Destinations: ", len(dst_set))
print("Union: ", len(src_set.union(dst_set)))
print("Intersection: ", len(src_set.intersection(dst_set)))
Sources: 18950
Destinations: 18921
Union: 34019
Intersection: 3852
nv = len(src_set.union(dst_set))
vlookup = dict(zip(range(nv), list(src_set.union(dst_set))))
ilookup = dict(zip(list(src_set.union(dst_set)), range(nv)))
elist = [[ilookup[k[0]], ilookup[k[1]]] for k in computer_edgelist.keys()]
flow_graph = ig.Graph(nv)
flow_graph.add_edges(elist)
flow_graph.is_connected()
False
Protocol
Protocol 6 (TCP) accounts for 77.69% of events, Protocol 17 (UDP) accounts for 22.28%, while Protocol 1 (ICMP) accounts for only 0.03%.
train_df['Protocol'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f6a2f7b7c50>
train_df['Protocol'].loc[train_df['Protocol']==17].count()/train_df['Time'].count()
0.22279705287715498
SrcPort
and DstPort
Distribution across SrcPort
seems nearly uniform.
num_bins = 1000
plt.hist(train_df1['SrcPort'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()
DstPort
is another story. The top 10 DstPorts
account for over half of the events.
num_bins = 1000
plt.hist(train_df1['DstPort'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()
counted = train_df1.groupby("DstPort").count().rename(columns={"Time":"DstPort_count"})
counted.sort_values(by=['DstPort_count'],ascending=False).head(10)
DstPort_count | Duration | SrcPort | SrcPackets | DstPackets | SrcBytes | DstBytes | SrcDevice_Comp | DstDevice_Comp | SrcDevice_IP | ... | DstDevice_ActiveDirectory | SrcDevice_EnterpriseAppServer | DstDevice_EnterpriseAppServer | SrcDevice_VPN | DstDevice_VPN | SrcDevice_VScanner | DstDevice_VScanner | Protocol_1 | Protocol_6 | Protocol_17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DstPort | |||||||||||||||||||||
53 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | ... | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 | 750295 |
443 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | ... | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 | 700175 |
80 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | ... | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 | 607732 |
514 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | ... | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 | 264663 |
389 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | ... | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 | 213198 |
88 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | ... | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 | 162582 |
23118 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | ... | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 | 137832 |
427 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | ... | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 | 113420 |
95765 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | ... | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 | 99594 |
445 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | ... | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 | 99020 |
10 rows × 22 columns
sum(counted.sort_values(by=['DstPort_count'],ascending=False)['DstPort_count'].values[:10])
3148511
SrcPackets
and DstPackets
91.47% of events send between 0-10 packets from the SrcDevice
. The highest number of packets sent in a single event is 87701075!
counted = train_df1.groupby("SrcPackets").count().rename(columns={"Time":"SrcPackets_count"})
counted.sort_values(by=['SrcPackets_count'],ascending=False).head(10)
SrcPackets_count | Duration | SrcPort | DstPort | DstPackets | SrcBytes | DstBytes | SrcDevice_Comp | DstDevice_Comp | SrcDevice_IP | ... | DstDevice_ActiveDirectory | SrcDevice_EnterpriseAppServer | DstDevice_EnterpriseAppServer | SrcDevice_VPN | DstDevice_VPN | SrcDevice_VScanner | DstDevice_VScanner | Protocol_1 | Protocol_6 | Protocol_17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SrcPackets | |||||||||||||||||||||
0 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | ... | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 |
1 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | ... | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 | 1181225 |
5 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | ... | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 | 600726 |
4 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | ... | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 | 537638 |
3 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | ... | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 | 245588 |
2 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | ... | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 | 233467 |
6 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | ... | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 | 184014 |
10 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | ... | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 | 183998 |
11 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | ... | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 | 171049 |
7 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | ... | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 | 104048 |
10 rows × 22 columns
sum(counted.sort_values(by=['SrcPackets_count'],ascending=False)['SrcPackets_count'].values[:12])/sum(counted.sort_values(by=['SrcPackets_count'],ascending=False)['SrcPackets_count'].values)
0.9146632947055655
np.sort(train_df1['SrcPackets'].values)[-10:]
array([72759509, 74938675, 77148736, 79109660, 81076770, 82796324,
84609078, 84760288, 86720597, 87701075])
93.40% of events send between 0-10 packets from the DstDevice
. The highest number of packets sent in a single event is 77760718.
counted = train_df1.groupby("DstPackets").count().rename(columns={"Time":"DstPackets_count"})
counted.sort_values(by=['DstPackets_count'],ascending=False).head(10)
DstPackets_count | Duration | SrcPort | DstPort | SrcPackets | SrcBytes | DstBytes | SrcDevice_Comp | DstDevice_Comp | SrcDevice_IP | ... | DstDevice_ActiveDirectory | SrcDevice_EnterpriseAppServer | DstDevice_EnterpriseAppServer | SrcDevice_VPN | DstDevice_VPN | SrcDevice_VScanner | DstDevice_VScanner | Protocol_1 | Protocol_6 | Protocol_17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DstPackets | |||||||||||||||||||||
0 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | ... | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 |
1 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | ... | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 | 1746836 |
3 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | ... | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 | 596994 |
5 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | ... | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 | 283068 |
2 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | ... | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 | 218062 |
4 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | ... | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 | 196227 |
7 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | ... | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 | 189765 |
6 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | ... | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 | 120250 |
9 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | ... | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 | 68150 |
11 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | ... | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 | 63692 |
10 rows × 22 columns
sum(counted.sort_values(by=['DstPackets_count'],ascending=False)['DstPackets_count'].values[:12])/sum(counted.sort_values(by=['DstPackets_count'],ascending=False)['DstPackets_count'].values)
0.9340995957612535
np.sort(train_df1['DstPackets'].values)[-10:]
array([64687095, 66650395, 68478386, 68610780, 70356457, 71882842,
73626605, 75365457, 77107098, 77760718])
SrcBytes
and DstBytes
Initial Look
counted = train_df1.groupby("SrcBytes").count().rename(columns={"Time":"SrcBytes_count"})
counted.sort_values(by=['SrcBytes_count'],ascending=False).head(10)
SrcBytes_count | Duration | SrcPort | DstPort | SrcPackets | DstPackets | DstBytes | SrcDevice_Comp | DstDevice_Comp | SrcDevice_IP | ... | DstDevice_ActiveDirectory | SrcDevice_EnterpriseAppServer | DstDevice_EnterpriseAppServer | SrcDevice_VPN | DstDevice_VPN | SrcDevice_VScanner | DstDevice_VScanner | Protocol_1 | Protocol_6 | Protocol_17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SrcBytes | |||||||||||||||||||||
0 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | ... | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 | 1951332 |
277 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | ... | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 | 268365 |
216 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | ... | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 | 266771 |
60 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | ... | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 | 247876 |
225 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | ... | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 | 112601 |
164 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | ... | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 | 99908 |
399 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | ... | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 | 77662 |
152 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | ... | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 | 74380 |
46 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | ... | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 | 71063 |
64 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | ... | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 | 62962 |
10 rows × 22 columns
np.sort(train_df['SrcBytes'].values)[-10:]
array([12122909246, 12459931268, 12832178150, 13209994822, 13545567320,
13882204590, 14176496098, 14513284176, 14848224144, 15015712600])
counted = train_df1.groupby("DstBytes").count().rename(columns={"Time":"DstBytes_count"})
counted.sort_values(by=['DstBytes_count'],ascending=False).head(10)
DstBytes_count | Duration | SrcPort | DstPort | SrcPackets | DstPackets | SrcBytes | SrcDevice_Comp | DstDevice_Comp | SrcDevice_IP | ... | DstDevice_ActiveDirectory | SrcDevice_EnterpriseAppServer | DstDevice_EnterpriseAppServer | SrcDevice_VPN | DstDevice_VPN | SrcDevice_VScanner | DstDevice_VScanner | Protocol_1 | Protocol_6 | Protocol_17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DstBytes | |||||||||||||||||||||
0 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | ... | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 | 2098503 |
46 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | ... | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 | 1500887 |
164 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | ... | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 | 224058 |
3911 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | ... | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 | 71773 |
479 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | ... | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 | 57954 |
112 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | ... | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 | 55094 |
106 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | ... | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 | 53733 |
177 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | ... | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 | 39591 |
1503 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | ... | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 | 38017 |
853 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | ... | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 | 32385 |
10 rows × 22 columns
np.sort(train_df['DstBytes'].values)[-10:]
array([11146902232, 11457059168, 11805619364, 12153021664, 12461839686,
12732086028, 13040875936, 13348384456, 13657486438, 13773586368])
Question: As someone new to the field, I'm uncertain why the number of bytes for events jumps from 0 to 42.
unique_df = np.unique(train_df['DstBytes'].values)
np.sort(unique_df)[:10]
array([ 0, 42, 44, 46, 48, 49, 52, 55, 56, 58])
unique_df = np.unique(train_df['SrcBytes'].values)
np.sort(unique_df)[:10]
array([ 0, 42, 44, 46, 47, 48, 49, 50, 51, 52])
Distributions of Packets
and Bytes
seem to follow a power law, which appears to be typical for network traffic.
num_bins = 20
plt.figure(figsize=(10,4))
plt.xscale('log')
plt.yscale('log')
plt.hist(np.log(train_df['SrcPackets'].values+0.5), num_bins, facecolor='blue', alpha=0.5,log=True, label = 'SrcPackets')
plt.hist(np.log(train_df['DstPackets'].values+0.5), num_bins, facecolor='red', alpha=0.5,log=True, label = 'DstPackets')
plt.hist(np.log(train_df['SrcBytes'].values+0.5), num_bins, facecolor='green', alpha=0.5,log=True, label = 'SrcBytes', zorder=-1)
plt.hist(np.log(train_df['DstBytes'].values+0.5), num_bins, facecolor='black', alpha=0.5,log=True, label = 'DstBytes', zorder=-1)
plt.xlabel("Size")
plt.ylabel("Count")
plt.legend( loc='upper right')
plt.xlim(1e0,1e2)
plt.show()
Looking at a 2d heatmap of Packets
vs Bytes
shows a clean linear correlation, with all flows exhibiting an average packet size (i.e. bytes/packets) between 32 and 1024 bytes.
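As a rough sanity check of that claim (restricting to flows where the source actually sent packets, so the ratio is defined):
# Sketch: distribution of average packet size (bytes per packet) on the source side
mask = train_df['SrcPackets'] > 0
avg_pkt_size = train_df.loc[mask, 'SrcBytes'] / train_df.loc[mask, 'SrcPackets']
print(avg_pkt_size.describe(percentiles=[0.01, 0.5, 0.99]))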
from matplotlib.colors import LogNorm
tdf = train_df.copy()
tdf['SrcPackets'] = np.log(tdf['SrcPackets'].values+0.5)/np.log(2)
tdf['SrcBytes'] = np.log(tdf['SrcBytes'].values+0.5)/np.log(2)
sns.jointplot(x="SrcPackets", y="SrcBytes", data=tdf, kind="hex", norm=LogNorm())
<seaborn.axisgrid.JointGrid at 0x7f4f5d15d278>
tdf['DstPackets'] = np.log(tdf['DstPackets'].values+0.5)/np.log(2)
tdf['DstBytes'] = np.log(tdf['DstBytes'].values+0.5)/np.log(2)
sns.jointplot(x="DstPackets", y="DstBytes", data=tdf, kind="hex",Norm = LogNorm())
<seaborn.axisgrid.JointGrid at 0x7f4f5f4c51d0>
Time
and Duration
80% of the events last 1 second or less. The longest event took 978035 seconds, or roughly 271 hours.
counted = train_df.groupby("Duration").count().rename(columns={"Time":"Duration_count"})
counted.sort_values(by=['Duration_count'],ascending=False).head(10)
Duration_count | SrcDevice | DstDevice | Protocol | SrcPort | DstPort | SrcPackets | DstPackets | SrcBytes | DstBytes | |
---|---|---|---|---|---|---|---|---|---|---|
Duration | ||||||||||
0 | 2754172 | 2754172 | 2754172 | 2754172 | 2754172 | 2754172 | 2754172 | 2754172 | 2754172 | 2754172 |
1 | 2133800 | 2133800 | 2133800 | 2133800 | 2133800 | 2133800 | 2133800 | 2133800 | 2133800 | 2133800 |
2 | 80962 | 80962 | 80962 | 80962 | 80962 | 80962 | 80962 | 80962 | 80962 | 80962 |
9 | 68782 | 68782 | 68782 | 68782 | 68782 | 68782 | 68782 | 68782 | 68782 | 68782 |
5 | 59152 | 59152 | 59152 | 59152 | 59152 | 59152 | 59152 | 59152 | 59152 | 59152 |
3 | 36107 | 36107 | 36107 | 36107 | 36107 | 36107 | 36107 | 36107 | 36107 | 36107 |
12 | 34511 | 34511 | 34511 | 34511 | 34511 | 34511 | 34511 | 34511 | 34511 | 34511 |
6 | 28342 | 28342 | 28342 | 28342 | 28342 | 28342 | 28342 | 28342 | 28342 | 28342 |
4 | 28297 | 28297 | 28297 | 28297 | 28297 | 28297 | 28297 | 28297 | 28297 | 28297 |
15 | 25519 | 25519 | 25519 | 25519 | 25519 | 25519 | 25519 | 25519 | 25519 | 25519 |
sum(counted.sort_values(by=['Duration_count'],ascending=False)['Duration_count'].values[:2])/sum(counted.sort_values(by=['Duration_count'],ascending=False)['Duration_count'].values)
0.8010328269655939
np.sort(train_df1['Duration'].values)[-1:]/60/60
array([271.67638889])
num_bins = 1000
plt.hist(train_df['Duration'], num_bins, facecolor='blue', alpha=0.5,log=True)
plt.show()
num_bins = 1000
plt.hist(train_df1['Time'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.show()
train_df.keys()
Index(['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort',
'DstPort', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes'],
dtype='object')
plt.figure(figsize=(10,4))
plt.yscale('log')
plt.stem(counted.index.values, counted['Duration_count'].values, markerfmt='b.')
plt.show()
Normalize data
In order for our autoencoder to train better, it will be useful to normalize the data. I initially skipped this step and was getting astronomically high losses!
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
colnames_tsa = ['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort', 'DstPort', 'SrcPackets',
'DstPackets', 'SrcBytes', 'DstBytes', 'SrcPackets_mean1s','SrcPackets_std1s','SrcPackets_mean10s',
'SrcPackets_std10s','SrcPackets_mean60s','SrcPackets_std60s', 'DstPackets_mean1s','DstPackets_std1s','DstPackets_mean10s',
'DstPackets_std10s','DstPackets_mean60s','DstPackets_std60s', 'SrcBytes_mean1s','SrcBytes_std1s','SrcBytes_mean10s',
'SrcBytes_std10s','SrcBytes_mean60s','SrcBytes_std60s', 'DstBytes_mean1s','DstBytes_std1s','DstBytes_mean10s',
'DstBytes_std10s','DstBytes_mean60s','DstBytes_std60s',]
scaler = StandardScaler()
scaled_df = scaler.fit_transform(train_df)
scaled_df = pd.DataFrame(scaled_df, columns=colnames)
scaler1 = MinMaxScaler()
scaled_df1 = scaler1.fit_transform(train_df2)
scaled_df1 = pd.DataFrame(scaled_df1, columns=colnames_tsa)
scaler2 = RobustScaler()
scaled_df2 = scaler2.fit_transform(train_df)
scaled_df2 = pd.DataFrame(scaled_df2, columns=colnames)
fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(16, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(train_df['SrcDevice'], ax=ax1)
sns.kdeplot(train_df['DstDevice'], ax=ax1)
sns.kdeplot(train_df['SrcPort'], ax=ax1)
sns.kdeplot(train_df['DstPort'], ax=ax1)
ax2.set_title('After Standard Scaler')
sns.kdeplot(scaled_df['SrcDevice'], ax=ax2)
sns.kdeplot(scaled_df['DstDevice'], ax=ax2)
sns.kdeplot(scaled_df['SrcPort'], ax=ax2)
sns.kdeplot(scaled_df['DstPort'], ax=ax2)
ax3.set_title('After Min-Max Scaler')
sns.kdeplot(scaled_df1['SrcDevice'], ax=ax3)
sns.kdeplot(scaled_df1['DstDevice'], ax=ax3)
sns.kdeplot(scaled_df1['SrcPort'], ax=ax3)
sns.kdeplot(scaled_df1['DstPort'], ax=ax3)
ax4.set_title('After Robust Scaler')
sns.kdeplot(scaled_df2['SrcDevice'], ax=ax4)
sns.kdeplot(scaled_df2['DstDevice'], ax=ax4)
sns.kdeplot(scaled_df2['SrcPort'], ax=ax4)
sns.kdeplot(scaled_df2['DstPort'], ax=ax4)
plt.show()
Making the data richer with feature extraction
One-hot-encoding
Returning to the SrcDevice
and DstDevice
features, the 4 non-Comp/IP devices with no numbers associated with them are perfect for one-hot-encoding. The Protocol
, with only 3 different values, is also a great candidate for one-hot-encoding.
train_df1 = train_df.copy()
cols = np.append(['Comp','IP'],noComp_orIP)
for i in cols:
train_df1[f'SrcDevice_{i}'] = 0
train_df1[f'DstDevice_{i}'] = 0
ind_src = (train_df['SrcDevice'].str.contains(f'{i}')==True)
ind_dst = (train_df['DstDevice'].str.contains(f'{i}')==True)
train_df1[f'SrcDevice_{i}'].loc[ind_src] = 1
train_df1[f'DstDevice_{i}'].loc[ind_dst] = 1
if i == 'Comp' or i == 'IP': #Replace 1 with actual Comp or IP value
train_df1[f'SrcDevice_{i}'].loc[ind_src] = train_df1['SrcDevice'].loc[ind_src].map(lambda x: int(x.lstrip(f'{i}')))
train_df1[f'DstDevice_{i}'].loc[ind_dst] = train_df1['DstDevice'].loc[ind_dst].map(lambda x: int(x.lstrip(f'{i}')))
for i in np.unique(train_df1['Protocol'].values):
train_df1[f'Protocol_{i}'] = 0
train_df1[f'Protocol_{i}'].loc[(train_df['Protocol'] == i)] = 1
train_df1.drop(columns = ['SrcDevice','DstDevice','Protocol'],inplace = True)
train_df1['SrcPort'] = train_df1['SrcPort'].map(lambda x: int(x.lstrip(f'Port')))
train_df1['DstPort'] = train_df1['DstPort'].map(lambda x: int(x.lstrip(f'Port')))
Time-series Analysis (TSA)
So far we have not really encoded any idea of 'normal' or 'abnormal' into our model.
From Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection (Mirsky 2018)
Feature extraction is the process of obtaining or engineering a vector of values which describe a real world observation. In network anomaly detection, it is important to extract features which capture the context and purpose of each packet traversing the network. For example, consider a single TCP SYN packet. The packet may be a benign attempt to establish a connection with a server, or it may be one of millions of similar packets sent in an attempt to cause a denial of service attack (DoS).
This is just one example of attacks where temporal-statistical features could help detect anomalies. The challenge with extracting these kinds of features from network traffic is that (1) packets from different channels (conversations) are interleaved, (2) there can be many channels at any given moment, (3) the packet arrival rate can be very high.
We can begin to do this by computing rolling statistics of events and applying them to each feature. This way we will be able to check how each event and its features compare to a typical event within a window of the previous x seconds.
Let's first look at the mean and standard deviation of the Packets
and Bytes
features for time windows of 1, 10 and 60 seconds into the past. This will add 24 features to our dataset.
def FeatureExtractor(df,features,time_windows):
for i in features:
for j in time_windows:
print(i,j)
dff = df.copy()
dff['datetime'] = pd.to_datetime(dff['Time'],unit = 's')
dff.index = dff['datetime']
tmp_mean = dff[f'{i}'].rolling(f'{j}s', min_periods=1).mean()
tmp_std = dff[f'{i}'].rolling(f'{j}s', min_periods=1).std().fillna(0)
tmp_mean.index = df.index
tmp_std.index = df.index
df[f'{i}_mean{j}s'] = tmp_mean
df[f'{i}_std{j}s'] = tmp_std
Future Work: One way to improve the Feature Extractor is to initialize the statistics with a typical stream of data instead of just zeros.
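A hedged sketch of what that warm start could look like: prepend a sample of 'typical' events (here a hypothetical prev_day_df) before computing the rolling statistics, then drop the warm-up rows afterwards:
def warm_start_feature_extractor(df, warmup, features, time_windows):
    # Sketch: compute rolling stats over warm-up + data so the earliest events
    # are compared against typical traffic rather than an empty window.
    combined = pd.concat([warmup, df], ignore_index=True)
    combined.index = pd.to_datetime(combined['Time'], unit='s')
    out = df.copy()
    for col in features:
        for w in time_windows:
            rolled = combined[col].rolling(f'{w}s', min_periods=1)
            out[f'{col}_mean{w}s'] = rolled.mean().values[len(warmup):]
            out[f'{col}_std{w}s'] = rolled.std().fillna(0).values[len(warmup):]
    return out
# e.g. train_df_ws = warm_start_feature_extractor(train_df, prev_day_df.iloc[:1000], features, time_windows)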
train_df2 = train_df.copy()
features = ['SrcPackets','DstPackets','SrcBytes','DstBytes']
time_windows = [1,10,60]
FeatureExtractor(train_df2,features,time_windows)
SrcPackets 1
SrcPackets 10
SrcPackets 60
DstPackets 1
DstPackets 10
DstPackets 60
SrcBytes 1
SrcBytes 10
SrcBytes 60
DstBytes 1
DstBytes 10
DstBytes 60
Entity Embedding for categorical variables.
Future work: Since our categorical variables have such high cardinality, it might be useful to create an embedding vector for each of the ID numbers with a dimension significantly lower than the original cardinality. Combining all of the vectors together gives us an embedding matrix.
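A minimal sketch of that idea in PyTorch (the embedding dimension and layer sizes here are arbitrary placeholders):
import torch
from torch import nn

class EmbeddingEncoder(nn.Module):
    # Sketch: embed a high-cardinality categorical (e.g. SrcDevice codes) and
    # concatenate it with the continuous features before the bottleneck.
    def __init__(self, n_devices, emb_dim=16, n_cont=6, bottleneck=6):
        super().__init__()
        self.emb = nn.Embedding(n_devices, emb_dim)
        self.encoder = nn.Sequential(
            nn.Linear(emb_dim + n_cont, bottleneck),
            nn.ReLU())
    def forward(self, device_ids, cont_feats):
        x = torch.cat([self.emb(device_ids), cont_feats], dim=1)
        return self.encoder(x)
# e.g. enc = EmbeddingEncoder(n_devices=train_df['SrcDevice'].nunique())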
Feature Mapper
Future work: Let's use agglomerative hierarchical clustering to see whether any of our features are redundant.
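A hedged sketch of that clustering using scipy on the absolute feature correlations (the feature subset and the 0.2 distance threshold are arbitrary choices):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Sketch: features whose absolute correlation is high end up in the same cluster
# and are candidates for being redundant.
feats = ['SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes',
         'SrcPackets_mean60s', 'SrcBytes_mean60s']
corr = train_df2[feats].corr().abs().values
Z = linkage(squareform(1.0 - corr, checks=False), method='average')
clusters = fcluster(Z, t=0.2, criterion='distance')
print(dict(zip(feats, clusters)))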
Testing various Autoencoder Architectures
Imports and Utility Functions
import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import os
import seaborn as sns
#if not os.path.exists('./mlp_img'):
# os.mkdir('./mlp_img')
epochs = 30
bs = 128
lr = 1e-3
loss_func = F.binary_cross_entropy
dev = torch.device(
"cuda") if torch.cuda.is_available() else torch.device("cpu")
def get_data(train_ds, valid_ds, bs):
return (
DataLoader(train_ds, batch_size=bs, shuffle=True),
DataLoader(valid_ds, batch_size=bs * 2),
)
def loss_batch(model, loss_func, xb, yb, opt=None):
loss = loss_func(model(xb), yb)
if opt is not None:
loss.backward()
opt.step()
opt.zero_grad()
return loss.item(), len(xb)
def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
val_loss_hist = np.zeros(epochs)
for epoch in range(epochs):
model.train()
for xb, yb in train_dl:
loss_batch(model, loss_func, xb, yb, opt)
model.eval()
with torch.no_grad():
losses, nums = zip(
*[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
)
val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
val_loss_hist[epoch] = val_loss
print(epoch, val_loss)
return val_loss_hist
class Lambda(nn.Module):
def __init__(self, func):
super().__init__()
self.func = func
def forward(self, x):
return self.func(x)
def preprocess(x, y):
return x.view(-1,11).to(dev), y.to(dev)
class WrappedDataLoader:
def __init__(self, dl, func):
self.dl = dl
self.func = func
def __len__(self):
return len(self.dl)
def __iter__(self):
batches = iter(self.dl)
for b in batches:
yield (self.func(*b))
def check_cats(data, cat_vars):
for i in cat_vars:
data[f'{i}'] = data[f'{i}'].astype('category').cat.codes
return data
DIR = "../input"
colnames = ['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort', 'DstPort', 'SrcPackets',
'DstPackets', 'SrcBytes', 'DstBytes']
cat_vars = ['SrcDevice', 'DstDevice','Protocol','SrcPort', 'DstPort']
cont_vars = ['Time', 'Duration', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes']
train_df = pd.read_csv(os.path.join(DIR, 'netflow_day-03.csv'),names = colnames)
train_df = check_cats(train_df,cat_vars)
Just to get a feel for our data and model performance, I'll only be using ~2% of the data to allow for quick iterations. The goal is to see what works initially; when training a final model I'll use all of the data.
x_train, x_valid, y_train, y_valid = train_test_split(scaled_df1.iloc[:100000], scaled_df1.iloc[:100000], test_size=0.33, random_state=42)
x_train, y_train, x_valid, y_valid = map(torch.tensor, (x_train.values, y_train.values, x_valid.values, y_valid.values))
x_train, y_train, x_valid, y_valid = x_train.float(), y_train.float(), x_valid.float(), y_valid.float()
train_ds = TensorDataset(x_train, y_train)
valid_ds = TensorDataset(x_valid, y_valid)
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
train_dl = WrappedDataLoader(train_dl, preprocess)
valid_dl = WrappedDataLoader(valid_dl, preprocess)
Basic AutoEncoder (AE)
class autoencoder(nn.Module):
def __init__(self):
super(autoencoder, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(11, 3),
nn.ReLU())
self.decoder = nn.Sequential(
nn.Linear(3, 11),
nn.Sigmoid())
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
For our initial run, let's try training the AE using the standard Stochastic Gradient Descent (SGD) optimization algorithm.
model = autoencoder()
model.to(dev)
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
simpleAE = fit(epochs, model, loss_func, opt, train_dl, valid_dl)
0 0.6153640248992226
1 0.5632328183867714
2 0.521731213497393
3 0.48840221228744046
4 0.4613625860214233
5 0.43919258066379663
6 0.42081355727080144
7 0.40540852949836037
8 0.392343591075955
9 0.3811262700918949
10 0.371357077107285
11 0.36269724823489335
12 0.3549058718753583
13 0.3477984815799829
14 0.34122927554448446
15 0.335087068752809
16 0.32927751234083463
17 0.32372928273316587
18 0.31840084727605183
19 0.3132779013893821
20 0.30835604749303874
21 0.30364818929180953
22 0.29918292765183885
23 0.29499714277729844
24 0.29111959272442445
25 0.28757020660602683
26 0.2843516824967933
27 0.28145706274292687
28 0.27886660063627994
29 0.2765542323661573
When utilizing the Adam optimization algorithm instead, we see that our loss starts off much lower and ultimately achieves a better result.
model = autoencoder()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
simpleAE1 = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.3132772430289875
1 0.27298690352295385
2 0.26548837280273435
3 0.2561405701709516
4 0.24500035896807007
5 0.23917841724554698
6 0.23640931301044696
7 0.22925565374981274
8 0.2199682547901616
9 0.21693617815321142
10 0.2149911564985911
11 0.21362458565740874
12 0.21265988235762626
13 0.21197906290039872
14 0.2114728673732642
15 0.21104865649974708
16 0.2106399353056243
17 0.21023554500305291
18 0.2098258723598538
19 0.2094674902973753
20 0.20916657164241328
21 0.20891935854247123
22 0.20872451052882454
23 0.20859452519633553
24 0.2084934107245821
25 0.20842122670014698
26 0.20837569244702658
27 0.20834395785042734
28 0.20831816785263294
29 0.20830102947986487
By changing the size of our bottleneck, we are expanding/limiting the space our network will use to find latent representations of our features. For now, increasing the size seems to only help marginally. However, we will want to return to this hyperparameter when we engineer more features in the future.
class autoencoder1(nn.Module):
def __init__(self):
super(autoencoder1, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(11, 6),
nn.ReLU())
self.decoder = nn.Sequential(
nn.Linear(6, 11),
nn.Sigmoid())
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
model = autoencoder1()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
SimpleAE2 = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.27842848107309054
1 0.2424553751078519
2 0.22034330931938056
3 0.2125227058836908
4 0.2091995555559794
5 0.20629277456890452
6 0.20423057547121337
7 0.20225441743749561
8 0.2000320770451517
9 0.19847749503814813
10 0.19774435007572175
11 0.19735746824741363
12 0.1971329365788084
13 0.19700006837555856
14 0.19692712545756139
15 0.19688325585018504
16 0.19685634627847962
17 0.19684289213744077
18 0.19683339035149777
19 0.19682246895269914
20 0.19681916312376657
21 0.19681775667450646
22 0.19682014381524288
23 0.19681429224664515
24 0.1968093104543108
25 0.19681130621288762
26 0.19680515278108193
27 0.19680367324930248
28 0.19680531917557573
29 0.19681047119877554
QUESTION: What size should we select for the latent representation of our data?
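One hedged way to approach that question empirically is to sweep the bottleneck width and compare validation losses, reusing the fit helper above (the candidate sizes are arbitrary):
# Sketch: train the simple AE with several bottleneck sizes and keep the final validation loss
def make_ae(n_in, n_latent):
    return nn.Sequential(
        nn.Linear(n_in, n_latent), nn.ReLU(),
        nn.Linear(n_latent, n_in), nn.Sigmoid())

results = {}
for n_latent in (2, 3, 4, 6, 8):
    model = make_ae(11, n_latent).to(dev)
    opt = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    results[n_latent] = fit(epochs, model, loss_func, opt, train_dl, valid_dl)[-1]
print(results)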
Deep AutoEncoder (aka Stacked AutoEncoder, SAE)
class deep_autoencoder(nn.Module):
def __init__(self):
super(deep_autoencoder, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(11, 9),
nn.ReLU(True),
nn.Linear(9, 6),
nn.ReLU(True))
self.decoder = nn.Sequential(
nn.Linear(6, 9),
nn.ReLU(True),
nn.Linear(9, 11),
nn.Sigmoid())
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
Now I'll create an AE that has multiple hidden layers, a deep AE. Let's first start with a model that has 2 hidden layers. This doesn't seem to do much better than our one-layer AE but again, might be useful if we incorporate more features in the future.
model = deep_autoencoder()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
deepAE = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.2449271744598042
1 0.22742869142330055
2 0.21516548670060706
3 0.2077671935377699
4 0.2028120079979752
5 0.20213690534866216
6 0.200944096066735
7 0.19925501989595817
8 0.19780528188835492
9 0.1971274009401148
10 0.1968309831944379
11 0.19652490559852484
12 0.1962942510987773
13 0.19614683185562942
14 0.19603187960566895
15 0.19594012701872623
16 0.19582038521766662
17 0.19572626200589266
18 0.19565712436040242
19 0.19555748596335903
20 0.19550040659759985
21 0.19545797579577476
22 0.1954013790434057
23 0.19536583811947794
24 0.19534382763414673
25 0.19530811882019042
26 0.19530282218528516
27 0.19527743303414546
28 0.19524949955579007
29 0.1952263595046419
Another option is to use 'overcomplete' layers: we can balloon the size of the first layer and then shrink it back down. The simpler 2-layer AE above does a better job, so this doesn't seem like a promising avenue to pursue further.
class deep_autoencoder1(nn.Module):
def __init__(self):
super(deep_autoencoder1, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(11, 128),
nn.ReLU(True),
nn.Linear(128, 64),
nn.ReLU(True),
nn.Linear(64, 12),
nn.ReLU(True),
nn.Linear(12, 6))
self.decoder = nn.Sequential(
nn.Linear(6, 12),
nn.ReLU(True),
nn.Linear(12, 64),
nn.ReLU(True),
nn.Linear(64, 128),
nn.ReLU(True),
nn.Linear(128, 11),
nn.Sigmoid())
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
model = deep_autoencoder1()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
deepAE1 = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.2650241525100939
1 0.23482537920186014
2 0.21984392593123697
3 0.21350352708137396
4 0.20568819431825117
5 0.20219772437847022
6 0.20110394136110943
7 0.20012424728364656
8 0.19942686419053512
9 0.19796628739978328
10 0.19645627458529039
11 0.1958876735911225
12 0.19568070779424726
13 0.19561183846358096
14 0.1956282062964006
15 0.19557593305544418
16 0.1955740126985492
17 0.19555451732332058
18 0.19555348258668726
19 0.1955339559063767
20 0.1955322019042391
21 0.19553729743668527
22 0.19552998865011967
23 0.19553163955428385
24 0.1955169496716875
25 0.19551997378016964
26 0.19551090997276885
27 0.19550259295015623
28 0.19551545627189404
29 0.19549637895280664
Convolutional AutoEncoder (CAE)
Future work! See below for a rough schematic of a planned implementation.
class conv_autoencoder(nn.Module):
def __init__(self):
super(conv_autoencoder, self).__init__()
self.encoder = nn.Sequential(
nn.Conv1d(1, 1, 3, stride=3, padding=1), # b, 16, 10, 10
nn.ReLU(True),
nn.MaxPool1d(2, stride=2), # b, 16, 5, 5
nn.Conv1d(16, 8, 3, stride=2, padding=1), # b, 8, 3, 3
nn.ReLU(True),
nn.MaxPool1d(2, stride=1) # b, 8, 2, 2
)
self.decoder = nn.Sequential(
nn.ConvTranspose1d(8, 16, 3, stride=2), # b, 16, 5, 5
nn.ReLU(True),
nn.ConvTranspose1d(16, 8, 5, stride=3, padding=1), # b, 8, 15, 15
nn.ReLU(True),
nn.ConvTranspose1d(8, 1, 2, stride=2, padding=1), # b, 1, 28, 28
nn.Sigmoid()
)
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
train_dl_w = WrappedDataLoader(train_dl, preprocess)
valid_dl_w = WrappedDataLoader(valid_dl, preprocess)
model = conv_autoencoder()
model.to(dev)
convAE = fit(epochs, model, loss_func, opt1, train_dl_w, valid_dl_w)
Comparison of AE architectures with default dataset (no feature engineering)
fig = plt.figure()
sub = fig.add_subplot(111)
sub.plot(np.arange(epochs),simpleAE, label = 'Simple w/ SGD')
sub.plot(np.arange(epochs),simpleAE1, label = 'Simple w/ Adam')
sub.plot(np.arange(epochs),deepAE, label = 'Deep (2-layer) w/ Adam')
sub.plot(np.arange(epochs),deepAE1, label = 'Deep (overcomplete) w/ Adam')
fig.legend()
<matplotlib.legend.Legend at 0x7f6b42de8940>
Comparison of AE architectures with feature engineered dataset
Basic AE w/ TSA features
class autoencoder2(nn.Module):
def __init__(self):
super(autoencoder2, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(35, 13),
nn.ReLU())
self.decoder = nn.Sequential(
nn.Linear(13, 35),
nn.Sigmoid())
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
def preprocess(x, y):
return x.view(-1,35).to(dev), y.to(dev)
x_train, x_valid, y_train, y_valid = train_test_split(scaled_df1.iloc[:100000], scaled_df1.iloc[:100000], test_size=0.33, random_state=42)
x_train, y_train, x_valid, y_valid = map(torch.tensor, (x_train.values, y_train.values, x_valid.values, y_valid.values))
x_train, y_train, x_valid, y_valid = x_train.float(), y_train.float(), x_valid.float(), y_valid.float()
train_ds = TensorDataset(x_train, y_train)
valid_ds = TensorDataset(x_valid, y_valid)
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
train_dl = WrappedDataLoader(train_dl, preprocess)
valid_dl = WrappedDataLoader(valid_dl, preprocess)
model = autoencoder2()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
simpleAE_ed = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.15542524325486384
1 0.1330017716595621
2 0.12358071852994687
3 0.12134318643627745
4 0.12013640852949836
5 0.11922035816040906
6 0.1186038601488778
7 0.11830568287769953
8 0.11817351314515778
9 0.11809839636629278
10 0.11803493072950479
11 0.118003826110652
12 0.11796236093477769
13 0.11794244261040832
14 0.11791596555709839
15 0.11789782470103466
16 0.11788164298642766
17 0.11786410925785701
18 0.1178481928651983
19 0.11783261449770494
20 0.11781676309217107
21 0.11780566596984864
22 0.11778696250373667
23 0.11777133213570623
24 0.11775690714156989
25 0.11774744177406485
26 0.11774040185863321
27 0.11772692710883689
28 0.11772219492630526
29 0.11770839673100096
class autoencoder3(nn.Module):
def __init__(self):
super(autoencoder3, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(35, 3),
nn.ReLU())
self.decoder = nn.Sequential(
nn.Linear(3, 35),
nn.Sigmoid())
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
def preprocess(x, y):
return x.view(-1,35).to(dev), y.to(dev)
model = autoencoder3()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
simpleAE_ed1 = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.18049194024548387
1 0.15102098090359659
2 0.1473470046339613
3 0.143951437588894
4 0.13999235530333085
5 0.1366622529969071
6 0.13426986752134382
7 0.1325131246248881
8 0.13121682799223697
9 0.13023262498595498
10 0.12944078761158567
11 0.12880684867049708
12 0.1282730374841979
13 0.12780357990842878
14 0.12739020001165793
15 0.12700787126656735
16 0.1266632669802868
17 0.12634942066669463
18 0.12608273115663818
19 0.12587054110657084
20 0.12572116792201996
21 0.12558745316664377
22 0.1255006782141599
23 0.12543083710381478
24 0.12537497831113412
25 0.12533319084210828
26 0.12530713788307074
27 0.12527308274399152
28 0.1252579463900942
29 0.12523770618799962
Deep AutoEncoder w/ TSA features
class deep_autoencoder1(nn.Module):
def __init__(self):
super(deep_autoencoder1, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(35, 26),
nn.ReLU(True),
nn.Linear(26, 13),
nn.ReLU(True),
nn.Linear(13, 5),
nn.ReLU(True))
self.decoder = nn.Sequential(
nn.Linear(5, 13),
nn.ReLU(True),
nn.Linear(13, 26),
nn.ReLU(True),
nn.Linear(26, 35),
nn.Sigmoid())
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
model = deep_autoencoder1()
model.to(dev)
opt1 = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
deepAE_ed = fit(epochs, model, loss_func, opt1, train_dl, valid_dl)
0 0.14267463051550316
1 0.13841716249422595
2 0.12931717374469295
3 0.12356641965923887
4 0.12296843826228922
5 0.12272117885856917
6 0.1225782561211875
7 0.12246075998472444
8 0.1223881917180437
9 0.12213705055641405
10 0.12176700847076648
11 0.1211470121939977
12 0.11976834000421292
13 0.11921178272818074
14 0.11903793139349331
15 0.11892726133447705
16 0.11888989595030293
17 0.11885944359230273
18 0.11882691781990457
19 0.11880519064267477
20 0.11877735575943282
21 0.11873244151743975
22 0.11870291563055732
23 0.11867264552911122
24 0.11859710764523708
25 0.11848349631193912
26 0.11834015336903658
27 0.11821681090918454
28 0.11815993258447358
29 0.11813407987717427
Just by including our engineered TSA features, we've lowered the loss by nearly half!
fig = plt.figure()
sub = fig.add_subplot(111)
sub.plot(np.arange(epochs),simpleAE_ed, label = 'SimpleAE w/ eng. feat.')
sub.plot(np.arange(epochs),simpleAE_ed1, label = 'SimpleAE w/ eng. feat., diff. size')
sub.plot(np.arange(epochs),deepAE_ed, label = 'Deep (6-layer) w/ Adam')
fig.legend()
<matplotlib.legend.Legend at 0x7f539722af28>
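Looking ahead to the original goal, once a model is trained its per-event reconstruction error can be used as an anomaly score. A hedged sketch using the last trained (TSA-feature) model and the validation tensors, with an arbitrary 99.9th-percentile cut-off:
# Sketch: score each event by its reconstruction error and flag the largest ones
model.eval()
with torch.no_grad():
    recon = model(x_valid.view(-1, 35).to(dev))
    errors = F.binary_cross_entropy(recon, y_valid.to(dev), reduction='none').mean(dim=1)
threshold = np.quantile(errors.cpu().numpy(), 0.999)
anomalies = (errors > threshold).nonzero().squeeze()
print(f"{anomalies.numel()} candidate anomalous events above error {threshold:.4f}")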
Improving the learning rate
Future work
I want to implement Smith 2015's one-cycle learning schedule, which incorporates cyclical learning rate (and momentum) annealing. This is already implemented in the fastai wrapper.
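Here's a hedged sketch of that schedule in plain PyTorch using torch.optim.lr_scheduler.OneCycleLR (available in recent PyTorch versions; max_lr and the other hyperparameters are placeholders), stepped once per batch inside a training loop like the one in fit:
from torch.optim.lr_scheduler import OneCycleLR

# Sketch: one-cycle learning-rate (and momentum) annealing, stepped every batch
model = deep_autoencoder1().to(dev)
opt = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
sched = OneCycleLR(opt, max_lr=1e-2, epochs=epochs, steps_per_epoch=len(train_dl))
for epoch in range(epochs):
    model.train()
    for xb, yb in train_dl:
        loss = loss_func(model(xb), yb)
        loss.backward()
        opt.step()
        sched.step()
        opt.zero_grad()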
Future work
- Entity Embedding for categorical values
- 2D Statistics (magnitude, radius, approx. covariance and correlation coefficient of 2 different streams). Question: How best to define different streams? Are these simply different time windows of the same stream?
- Feature Mapping with agglomerative hierarchical clustering
- Implement a Convolutional AutoEncoder
- Apply learning rate/momentum annealing
- Incorporate host events dataset
Resources
Good resource for autoencoders
Good resource for variational autoencoders (VAEs)
Outlook inspired by Kitsune paper (Mirsky 2018)
Insight from Cyber Reboot's work on using RNN autoencoders on network traffic data.
Good resource for embeddings of structured data.