Unsupervised Classification of Unknown Traffic in a Campus Network

Ryan Baker, University of Utah

Classifying internet traffic by application is an important security measure in many network settings. It can aid in detecting intrusions and other anomalies, as well as in identifying misuse associated with prohibited applications. This is particularly relevant for networks used by universities, industrial enterprises, governments, and other organizations. Considerable effort has gone into creating models that classify internet traffic using machine learning techniques. While research so far has proven useful, most studies have focused on supervised machine learning techniques that require explicit application labels, which must be assigned by hand at significant effort. In addition, previous studies have focused largely on known application traffic (e.g., HTTP on port 80), and some have focused only on particular transport layer traffic (e.g., TCP traffic only). In contrast, unknown traffic is much more difficult to classify and can appear as previously unseen applications or as established applications exhibiting abnormal behavior. In this work, we present methods to address these gaps. Using a large amount of unfiltered, unlabeled, and realistic data gathered from a university campus network, we use unsupervised machine learning techniques, such as Gaussian mixture models, to identify and classify unknown application traffic. These models produce clusters represented by Gaussian distributions, which are then used with Kullback-Leibler divergence to compare unknown traffic flows against existing application clusters. Traffic flows are analyzed using source and destination port numbers and IP addresses, transport layer protocol type, and overall packet and byte counts.
The expected result is the ability to distinguish, with high probability, among never-before-seen applications, well-known applications with typical behavior, and known applications with atypical behavior.
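
The comparison step described above can be illustrated with a minimal sketch. It is not the method used in this work: it assumes each cluster is summarized by a single Gaussian with diagonal covariance, because the Kullback-Leibler divergence between two such Gaussians has a closed form, whereas the divergence between full Gaussian mixtures does not and must be approximated. The function names, the two features (log packet count, log byte count), and all numeric values are hypothetical and chosen only for illustration.

```python
import math
from statistics import fmean, pvariance


def fit_diagonal_gaussian(flows):
    """Estimate a per-feature mean and variance from a list of flow feature vectors."""
    columns = list(zip(*flows))
    means = [fmean(column) for column in columns]
    # A small floor keeps every variance strictly positive.
    variances = [max(pvariance(column), 1e-9) for column in columns]
    return means, variances


def kl_divergence(p, q):
    """Closed-form KL(p || q) for two Gaussians with diagonal covariance."""
    mu_p, var_p = p
    mu_q, var_q = q
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
    return 0.5 * total


def closest_cluster(unknown, clusters):
    """Return the name of the cluster with the smallest divergence from `unknown`."""
    return min(clusters, key=lambda name: kl_divergence(unknown, clusters[name]))


# Hypothetical feature vectors: [log packet count, log byte count].
clusters = {
    "web_like": fit_diagonal_gaussian([[3.0, 9.0], [3.2, 9.4], [2.8, 8.9]]),
    "dns_like": fit_diagonal_gaussian([[0.7, 4.5], [0.9, 4.8], [0.6, 4.4]]),
}
unknown = fit_diagonal_gaussian([[2.9, 9.1], [3.1, 9.2], [3.0, 9.3]])
best_match = closest_cluster(unknown, clusters)
```

In a sketch like this, a small divergence to an existing cluster suggests typical behavior of a known application, while a large divergence to every cluster suggests a previously unseen application. Any threshold separating the two cases would have to be chosen empirically.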