The Retrospect challenge had the following story: “Our website got attacked but we don’t know what kind of attacks they are. There’s no room for error.”
The challenge provides three CSV files: a training data set, a feature description file, and the prediction data set from which our flag is derived.
The training data set contains 17253 rows and 44 columns. The first 43 columns hold the features of a particular attack, and the final column describes what kind of attack it is.
The feature data file explains the column headings of the training and prediction data files. The prediction data set contains 138 rows and 43 columns; it lacks the 44th column that the training data set has.
The challenge is to predict the 44th column for the prediction data set, i.e., which kind of attack each row of features corresponds to.
To solve this challenge, we first remove the unnecessary columns; in our case that is the ‘id’ column. Then we split the data frame into input and output frames. The input is all the features and the output is the corresponding attack category.
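A minimal sketch of this first step with pandas. The tiny frame below stands in for the training CSV, and the column ‘dur’ and the category names are made up for illustration; only ‘id’, ‘proto’ and ‘attack_cat’ are named in the challenge.

```python
import pandas as pd

# Toy stand-in for the training CSV (hypothetical values and 'dur' column).
train = pd.DataFrame({
    "id": [1, 2, 3],
    "proto": ["tcp", "udp", "tcp"],
    "dur": [0.1, 0.5, 0.2],
    "attack_cat": ["DoS", "Fuzzers", "DoS"],
})

train = train.drop(columns=["id"])      # 'id' is only a row identifier
X = train.drop(columns=["attack_cat"])  # input frame: all feature columns
y = train["attack_cat"]                 # output frame: the attack category
```

With the real files, the frame would come from reading the training CSV instead of being built by hand.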
The data contains int, float and string/nominal values. Most ML algorithms cannot work with categorical values directly, so the nominal values must be converted into numbers using an encoding technique. One-hot encoding is one such method and we can use it here.
Before applying one-hot encoding, the nominal value columns have to be separated from the remaining columns. This gives us a total of 3 data frames. The first is the nominal value data frame, which contains the 3 columns ‘proto’, ‘service’ and ‘state’.
The second data frame contains all the remaining numerical value columns, and the third contains the output column ‘attack_cat’.
The prediction data set does not have the last column ‘attack_cat’, so it yields only two data frames.
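The split into nominal, numerical and output frames might look like the sketch below. The helper `split_frames` is a hypothetical name, and the numeric columns ‘dur’ and ‘sbytes’ and the category names are placeholders; the three nominal column names and ‘attack_cat’ come from the challenge.

```python
import pandas as pd

NOMINAL = ["proto", "service", "state"]

def split_frames(df, target="attack_cat"):
    """Return (nominal, numeric, output); output is None if the target is absent."""
    nominal = df[NOMINAL]
    has_target = target in df.columns
    output = df[target] if has_target else None
    drop_cols = NOMINAL + ([target] if has_target else [])
    numeric = df.drop(columns=drop_cols)
    return nominal, numeric, output

# Toy data; 'dur' and 'sbytes' are hypothetical numeric features.
train = pd.DataFrame({
    "proto": ["tcp", "udp"], "service": ["http", "dns"], "state": ["FIN", "INT"],
    "dur": [0.1, 0.5], "sbytes": [100, 2000],
    "attack_cat": ["DoS", "Fuzzers"],
})
pred = train.drop(columns=["attack_cat"])  # the prediction set has no target

train_nom, train_num, train_out = split_frames(train)  # three frames
pred_nom, pred_num, pred_out = split_frames(pred)      # two frames, output is None
```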
One-hot encoding is done in two steps: first ‘label encoding’ and then ‘OneHotEncoding’. Each step consists of two processes, “fit” and “transform”.
Since we have two different data sets, one-hot encoding each of them separately won’t work. A category that appears in only one of the two sets changes the number of encoded columns, which results in a shape difference between the training data and the prediction data.
To avoid this, we fit on the training data first and then transform both the training and the prediction data with that same fit. To avoid unnecessary refitting, we create a separate encoding object for each of the 3 columns.
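The fit-once, transform-both idea can be sketched as follows. This is a simplification rather than the exact procedure described above: it assumes a recent scikit-learn, where `OneHotEncoder` accepts string columns directly, so one encoder object covers all nominal columns and the separate label-encoding step is skipped. The toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy nominal frames; 'INT' and 'CON' never occur in the prediction set.
train_nom = pd.DataFrame({"proto": ["tcp", "udp", "tcp"],
                          "state": ["FIN", "INT", "CON"]})
pred_nom = pd.DataFrame({"proto": ["udp", "tcp"],
                         "state": ["FIN", "FIN"]})

# Fit on the training data only, then reuse the same fit for both sets.
# handle_unknown="ignore" encodes unseen categories as all zeros.
enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train_nom).toarray()
pred_enc = enc.transform(pred_nom).toarray()

# Both results have 5 columns (2 for proto + 3 for state).
# Encoding pred_nom on its own would give only 3 columns.
print(train_enc.shape, pred_enc.shape)
```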
In the numerical value data frame, the values vary widely, ranging from 0 to 10000, which makes prediction difficult. So we normalize the numerical values to a fixed range. Here ‘MinMaxScaler’ is used to scale the data.
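A short sketch of the scaling step on made-up numbers. As with the encoder, the scaler is fitted on the training data and then reused for the prediction data, which is an assumption about how the scaling was applied.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy numeric features on very different scales.
train_num = np.array([[0.0, 10.0],
                      [5000.0, 20.0],
                      [10000.0, 30.0]])
pred_num = np.array([[2500.0, 15.0]])

scaler = MinMaxScaler()                       # default range is [0, 1]
train_scaled = scaler.fit_transform(train_num)
pred_scaled = scaler.transform(pred_num)      # same min/max as the training fit
```

Each column is mapped as (x - min) / (max - min), using the minimum and maximum seen in the training data.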
After scaling and encoding, we concatenate the data frames together into a single data frame. For the ‘attack_cat’ column, we apply label encoding. There are 8 attack categories in total, so the label encoding assigns each of them a value from 0 to 7.
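Concatenation and label encoding could be sketched like this. The arrays stand in for the scaled and one-hot encoded frames, and the category names are placeholders, not the challenge's real labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-ins for the scaled numeric part and the one-hot encoded part.
scaled = np.array([[0.0], [0.5], [1.0]])
encoded = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

# Join them column-wise into one input matrix.
X = np.hstack([scaled, encoded])

# LabelEncoder numbers the classes 0..n-1 in sorted order.
le = LabelEncoder()
y = le.fit_transform(["Fuzzers", "DoS", "Fuzzers"])  # hypothetical categories
```

`le.inverse_transform` maps the integers back to category names, which is handy for reading the predictions later.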
Now, select a classifier as an algorithm for training the data. Logistic Regression, Naive Bayes, SVM, Random Forest, KNN, Decision tree are the main algorithms for classification.
We have to choose a classifier which gives an f1-score of 1 with the prediction data set; otherwise the prediction will be wrong. Here SVM is used. The input to the classifier must be one-hot encoded and the output column must be label encoded.
After fitting the data in the SVM classifier, a model is generated. This model is used for the prediction, and it has to give an f1-score of 1. The predicted outcome has a total of 8 categories, label encoded from 0 to 7, where each value corresponds to one attack category. There are 138 rows for prediction, so the predicted output is a column of 138 values.
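A rough sketch of the training and prediction step on synthetic data. The default `SVC` settings are an assumption, since the hyperparameters are not stated here. The f1-score is computed on labelled rows because the prediction file has no labels to score against; in this toy example the training rows themselves are reused for that check.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import SVC

# Two well-separated synthetic clusters standing in for the processed features.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.2, 0.02, (20, 2)),
                     rng.normal(0.8, 0.02, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)   # label-encoded classes
X_pred = np.array([[0.21, 0.19], [0.79, 0.82]])

clf = SVC()                # default RBF kernel; settings are assumed
clf.fit(X_train, y_train)

# Check the score on labelled rows before trusting the predictions.
score = f1_score(y_train, clf.predict(X_train), average="macro")
y_pred = clf.predict(X_pred)  # one label per prediction row
```

With the real data, `X_pred` would have 138 rows and `y_pred` would hold 138 labels.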
The 138 values are hashed with MD5 into a 32-character hexadecimal digest, and this is our flag.
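The hashing step might look like the sketch below. How the 138 values are serialized before hashing is not specified here, so joining the labels into one string with no separator is purely an assumption, and the four labels are placeholders.

```python
import hashlib

# Placeholder predictions; the real list would hold 138 labels.
y_pred = [3, 0, 7, 1]

# Assumed serialization: labels concatenated with no separator.
payload = "".join(str(v) for v in y_pred)

# MD5 yields 16 bytes, i.e. a 32-character hex digest.
flag = hashlib.md5(payload.encode()).hexdigest()
```

A different separator or ordering changes the digest completely, so the serialization has to match whatever the challenge expects.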