Let's Talk: ZAT

 

What Is ZAT?

ZAT (Zeek Analysis Tools) is a Python package from SuperCowPowers that processes Zeek (formerly Bro) log data so it can be fed into tools like Pandas and scikit-learn. The Full Code Can Be Found Here: https://github.com/SuperCowPowers/zat/blob/master/examples/anomaly_detection.py Full credit to SuperCowPowers and the team there. They’re a very cool group that’s flying under the radar of many.

I’m gonna showcase some excerpts of their code for those who are unfamiliar and would like to know what exactly they’re getting into. I’m not going to be able to do a full code analysis in this blog post but if requested, I can make another post with a breakdown. The code is already very well commented, but I’m going to add some commentary to explain certain concepts in more detail, and to showcase some very cool elements of the toolset.

First – The Imports:

from __future__ import print_function
#This brings the print function from Python 3 into Python 2 code, replacing the old print statement. Since print is now a function, it is called with arguments (including keyword arguments) instead of being interpreted as special statement syntax.
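To illustrate the difference: once print is a function, you can pass it keyword arguments like sep and end, which the old statement form couldn't accept. A quick example using standard Python 3 behavior:

```python
# With print as a function, keyword arguments control formatting
print('zeek', 'zat', 'pandas', sep=' | ')  # zeek | zat | pandas
print('no trailing newline here', end='')  # end='' suppresses the newline
print()
```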

import os
#This allows Python to leverage miscellaneous operating-system interfaces, such as reading or writing files.

import sys
#This allows Python to use system-specific parameters and functions, useful for things like memory debugging or pointing Python at an executable it can't find on its own. The script below uses sys.exit() to bail out if the logfile can't be read.

import argparse
#This is a parser for command-line options, arguments, and sub-commands. It conveniently handles the script's arguments, such as the path to the Zeek log you want to analyze.
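As a hypothetical sketch of how the script's args.bro_log ends up populated (the argument name here mirrors the excerpt further down, but the exact definition lives in the full script):

```python
import argparse

parser = argparse.ArgumentParser(description='Anomaly detection on a Zeek log')
parser.add_argument('bro_log', help='path to a Zeek log file, e.g. dns.log')

# Normally you'd call parser.parse_args() to read sys.argv;
# passing a list here just demonstrates the result
args = parser.parse_args(['dns.log'])
print(args.bro_log)  # dns.log
```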

import math
#Imports Python's standard math functions (logarithms and the like).

from collections import Counter
#This leverages the collections module, which provides specialized container datatypes beyond Python’s general-purpose built-ins like dict (dictionary) and list. Here it brings in Counter, a dict subclass for counting hashable objects, which makes convenient, rapid tallies easy.
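Counter is exactly the kind of tally you'd want when, say, counting characters in a DNS query (a quick illustration with a made-up hostname):

```python
from collections import Counter

# Tally how often each character appears in a hostname
counts = Counter('mail.example.com')
print(counts['m'])            # the letter 'm' appears 3 times
print(counts.most_common(3))  # the three most frequent characters
```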

# Third Party Imports
#This is where it gets fun! Time to utilize some machine learning libraries!

import pandas as pd
#Pandas is a software library for data manipulation and analysis, especially powerful when combined with other libraries like sklearn.

from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
#sklearn is scikit-learn, a machine learning library. It features various classification, regression, and clustering algorithms, for example IsolationForest and KMeans!

# Local imports
from zat import log_to_dataframe
from zat import dataframe_to_matrix

#These local imports come from zat itself. They let us convert Zeek logs to Pandas DataFrames, and DataFrames to matrices, respectively, for use with tools like scikit-learn.

Inline Quickchat – Isolation Forest:

I’m not even gonna try to act like I can simplify a machine learning algorithm, but I’ll place this link here if you’re willing to fall down a rabbit hole: https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e. In short, the Isolation Forest algorithm works by isolating values in the data it receives: it randomly selects a dimension (feature) of a data point, then randomly selects a split value between the maximum and minimum values of that feature. Anomalous points take fewer random splits to isolate, which is what makes them stand out.
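A tiny sketch of that idea, using made-up numbers where one point sits obviously far from the rest (the contamination value here is just for the toy data, not the script's setting):

```python
from sklearn.ensemble import IsolationForest

# Nine "normal" points clustered near the origin, plus one obvious outlier
points = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
          [0.1, 0.1], [0.0, 0.2], [0.2, 0.0], [0.1, 0.3], [9.0, 9.0]]

clf = IsolationForest(contamination=0.1, random_state=42)
labels = clf.fit_predict(points)  # -1 marks anomalies, 1 marks inliers
print(labels)
```

The far-away point gets isolated in very few random splits, so it's the one flagged with -1.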

Inline Quickchat – KMeans:

KMeans is another machine learning algorithm. It works by splitting n observations into k clusters, assigning each observation to the cluster with the nearest mean (the mean is the cluster center, serving as a prototype of the cluster).

Possibly the simplest way to explain the K-Means algorithm
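A minimal sketch with two obvious, made-up groups of points, just to show the mechanics:

```python
from sklearn.cluster import KMeans

# Two clearly separated groups of 2-D points
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)
print(km.labels_)           # each point's cluster assignment
print(km.cluster_centers_)  # the two means serving as cluster prototypes
```

The first three points land in one cluster and the last three in the other, with each cluster center sitting at the mean of its group.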

Lastly, the leveraging of the machine learning imports:

# Create a Pandas dataframe from a Zeek log
try:
    log_to_df = log_to_dataframe.LogToDataFrame()
    bro_df = log_to_df.create_dataframe(args.bro_log)
    print(bro_df.head())
except IOError:
    print('Could not open or parse the specified logfile: %s' % args.bro_log)
    sys.exit(1)
print('Read in {:d} Rows...'.format(len(bro_df)))

# Using Pandas we can easily and efficiently compute additional data metrics
# Here we use the vectorized operations of Pandas/Numpy to compute query length
# We'll also compute entropy of the query
if log_type == 'dns':
    bro_df['query_length'] = bro_df['query'].str.len()
    bro_df['answer_length'] = bro_df['answers'].str.len()
    bro_df['entropy'] = bro_df['query'].map(lambda x: entropy(x))
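The snippet above calls entropy(x), a helper defined elsewhere in the full script. As a rough sketch of what a Shannon-entropy helper along those lines can look like (my own version for illustration, not necessarily byte-for-byte what the script does):

```python
import math
from collections import Counter

def entropy(string):
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(string)
    total = len(string)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(entropy('aaaa'))        # 0.0 -- a repeated character carries no surprise
print(entropy('a8f3k29zq1'))  # log2(10), about 3.32 -- looks much more random
```

High-entropy DNS queries are interesting precisely because algorithmically generated domains tend to look random.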

# Use the zat DataframeToMatrix class
to_matrix = dataframe_to_matrix.DataFrameToMatrix()
bro_matrix = to_matrix.fit_transform(bro_df[features])
print(bro_matrix.shape)

# Train/fit and Predict anomalous instances using the Isolation Forest model
odd_clf = IsolationForest(contamination=0.2)  # Marking 20% as odd
odd_clf.fit(bro_matrix)

# Now we create a new dataframe using the prediction from our classifier
predictions = odd_clf.predict(bro_matrix)
odd_df = bro_df[features][predictions == -1]
display_df = bro_df[predictions == -1].copy()

# Now we're going to explore our odd observations with help from KMeans
odd_matrix = to_matrix.fit_transform(odd_df)
num_clusters = min(len(odd_df), 4)  # 4 clusters unless we have less than 4 observations
display_df['cluster'] = KMeans(n_clusters=num_clusters).fit_predict(odd_matrix)
print(odd_matrix.shape)

# Now group the dataframe by cluster
if log_type == 'dns':
    features += ['query']
else:
    features += ['host']
cluster_groups = display_df[features+['cluster']].groupby('cluster')

# Now print out the details for each cluster
print('<<< Outliers Detected! >>>')
for key, group in cluster_groups:
    print('\nCluster {:d}: {:d} observations'.format(key, len(group)))
    print(group.head())
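One detail worth calling out from the excerpt: IsolationForest's predictions are 1 for inliers and -1 for anomalies, so bro_df[predictions == -1] is plain Pandas boolean masking. A standalone illustration with made-up data (the domain names here are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'query': ['mail.example.com',
                             'x9f2kq.evil.biz',
                             'www.example.com']})
predictions = np.array([1, -1, 1])  # pretend classifier output

odd_df = df[predictions == -1]      # keep only rows flagged as anomalous
print(odd_df)
```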

How To Install ZAT:

james@zeek:~$ sudo pip install zat

Yes, it’s a simple but very powerful one-liner! It downloads all the packages you need through the power of pip!