Predicting Behaviour of Bug Finding Tools KLEE and AFL

Last updated: 23rd July, 2018

Project Members:

  • Kunal Ranjan
  • Pranjal Pratik Lal

The aim of this project was to analyse and predict the behaviour of tools such as KLEE and AFL (which are used to find bugs) using machine learning. The project was started in summer '18 (1st June) and is still ongoing under the supervision of Prof. Subhajit Roy and the guidance of Awanish Pandey.

Dataset available

A set of programs (written in the C programming language) with injected bugs, obtained by generating several versions of each program, together with the output produced by running KLEE or AFL on each version to detect the presence of the bug.

Aim

  • We aimed to predict whether KLEE or AFL detects the bug on new sets of programs.

  • Furthermore, in cases where KLEE failed to detect the bug, we analyse the various parameters (through which the bug is injected) and try to predict the relationships between them.

Data Processing

Libraries used

  • Numpy
  • Pickle
  • Pandas
  • Scikit-Learn
  • Keras
  • XGBoost
  • MLXtend

Steps for Preprocessing

NOTE: All preprocessing is done in Python.

We are given a folder containing various C programs, and each program has 4 versions, each with a bug injected in a different way. Each file gives the path to the bug through line numbers, along with parameters for each line.

Definition of various line parameters of the programs

  • trace_length : number of for loops or if-else statements encountered before the given line
  • global_nesting_depth : nesting depth when if-else / for / while constructs are used inside other for / if-else / while constructs
  • call_depth : recursive depth of the function at this line
  • no_of_local_variable : total number of local variables
  • taint_count : number of times the condition inside the if-else (or similar construct) is updated; if there are several updates, the taint count is the maximum among them

So we first extract the above 5 bug parameters from the dataset.


Step 1: Load the output generated by KLEE and AFL (see files AFLSingle.csv and KLEESingle.csv)

import pandas as pd

# Load the run results of AFL and KLEE; column 2 records whether the
# tool found the bug in that program version
aflsingle = pd.read_csv('AFLSingle.csv', delimiter=' ', header=None)
kleesingle = pd.read_csv('KLEEsingle.csv', delimiter=' ', header=None)

# Convert the counts into boolean labels (True = bug detected)
y_afl = aflsingle.iloc[:, 2].values >= 1
y_klee = kleesingle.iloc[:, 2].values >= 1

Step 2: Importing the dataset (bug files) and extracting their important parameters

class tosent:
	def __init__(self, command, version):
		self.command = command
		self.version = version

	def get_data(self):
		# Read the per-line parameters for this command/version pair
		dataset = pd.read_csv('tosent/' + self.command + '/' + self.version + '/AvailableVariablesFiltered.txt', header=None)
		dataset = dataset.iloc[:, 3:-1]
		dataset.columns = ["trace_length", "global_nesting_depth", "call_depth", "no_of_local_variable", "taint_count"]
		return dataset

	def get_train_data(self):
		return self.get_data().iloc[:, :].values

commands = ['CAT', 'CHMOD', 'CUT', 'DF', 'DU', 'ECHO', 'EXPAND', 'EXPR', 'FMT', 'HEAD', 'ID',
            'KILL', 'LN', 'LS', 'NL', 'PATHCHK', 'PINKY', 'PR', 'PRINTF', 'PTX', 'SEQ', 'STAT',
            'SUM', 'TAIL', 'TEST', 'TOUCH', 'UNEXPAND', 'UNIQ', 'WC', 'WHO']

data_dict = []
for name in commands:
	for i in range(1, 5):  # 4 bug-injected versions per command
		scan_data = tosent(name, str(i))
		temp = scan_data.get_train_data()
		# Summarise the per-line features by their mean over the whole file
		data_dict.append(temp.mean(0))

Note: taint_count is ignored in the pipeline above, as some further information needed to interpret it was lacking.

Step 3: Splitting the dataset into train and test sets

import numpy as np
from sklearn.model_selection import train_test_split

# Stack the per-file mean features into a single (samples x 5) matrix
X = np.array(data_dict)
X_train, X_test, y_train, y_test = train_test_split(X, y_afl, test_size=0.2)

Implementations

Various machine learning models used:

  • Neural Networks
  • Decision Trees
  • Gradient Boosted Trees
  • Association Rules Mining algorithms

Note: For each path (i.e. version) we are given the output of KLEE and AFL (1 if the bug was detected, otherwise 0), while every feature takes multiple values (one per line of the file). We therefore have to aggregate each feature over the file, trying different statistics such as mean, median, max, and min (as sketched below) to see which describes the model most appropriately.
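As a minimal illustration (our addition, not code from the repository), the candidate aggregates for one file's feature matrix temp, as returned by tosent.get_train_data() above, can be computed like this:

import numpy as np

# temp is a (num_lines x 5) matrix of per-line feature values; each
# statistic collapses it into a single 5-dimensional point per file
aggregates = {
    'mean':   temp.mean(axis=0),
    'median': np.median(temp, axis=0),
    'max':    temp.max(axis=0),
    'min':    temp.min(axis=0),
}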

Neural Networks

Intro:

In a neural network, adjusting the node-to-feature ratios of the connected layers helps reduce the dimensionality, which then lets us see which features are more important.

Implementation Code:

from keras.models import Sequential
from keras.layers import Dense

classifier = Sequential()
# Two hidden layers of 1000 units over the 5 line parameters
classifier.add(Dense(units=1000, kernel_initializer='uniform', activation='relu', input_dim=5))
classifier.add(Dense(units=1000, kernel_initializer='uniform', activation='relu'))
# Single sigmoid output: probability that the tool detects the bug
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, batch_size=10, epochs=100)
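A short evaluation sketch (our addition, assuming the X_test / y_test split from Step 3):

# Threshold the sigmoid outputs at 0.5 and measure accuracy on the test set
y_pred = classifier.predict(X_test) > 0.5
print('Test accuracy:', (y_pred.ravel() == y_test).mean())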

Conclusion

We were not able to reduce the number of features: if we shrink the node-to-feature ratio, the number of features drops, but so does the model's efficiency on the test set; if we enlarge it (to 1000, as above), accuracy on the test cases is about 75 to 85 percent, which is most probably a case of overfitting, as we do not have enough data.

Gradient Boosted Trees

Note: we consider mean values of all the features over all lines as input for this model.

Intro:

We used the XGBoost library to train a Boosted Tree model and Bayesian Optimization via the Scikit-Optimize library to tune the tree parameters for the best results on our data.

Code:

import xgboost as xgb
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

bayes_cv_tuner = BayesSearchCV(
    estimator=xgb.XGBClassifier(
        n_jobs=1,
        objective='binary:logistic',
        eval_metric='map',
        silent=1,
        tree_method='exact'
    ),
    # Search space for Bayesian optimisation of the tree parameters
    search_spaces={
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'min_child_weight': (0, 10),
        'max_depth': (0, 50),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'n_estimators': (50, 100),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    n_jobs=1,
    n_iter=10,
    verbose=0,
    refit=True,
    random_state=42
)
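A minimal usage sketch (our addition, assuming the X_train / y_train split from Step 3):

# Run the Bayesian search; refit=True retrains the best model on the
# full training data afterwards
bayes_cv_tuner.fit(X_train, y_train)
print('Best CV accuracy:', bayes_cv_tuner.best_score_)
print('Best parameters:', bayes_cv_tuner.best_params_)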

Conclusion

The optimized models gave an accuracy score (fraction of correctly classified samples) of 0.73 for KLEE and 0.80 for AFL under 3-fold cross validation. The feature importance plots also showed trace_length to be the most important feature of the five.

(AFL feature-importance plot) (KLEE feature-importance plot)
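One way such plots can be produced (our sketch, using XGBoost's built-in helper on the tuner's refitted estimator; features trained from a plain array appear as f0...f4 unless a DataFrame with column names is used):

import matplotlib.pyplot as plt
from xgboost import plot_importance

# Plot how often each of the five line parameters is used for splits
plot_importance(bayes_cv_tuner.best_estimator_)
plt.show()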

Since Gradient Boosted Trees are fairly complicated, the models we made had 70 trees. We were later told that we needed to know exactly how our classifier was working, i.e. the exact function over our features, which was very difficult to extract from Gradient Boosted Trees. Hence, we started looking for simpler models.

Note: Before implementing a simpler model, we tried to assess feature importance with XGBoost by expanding the inputs to all the aggregates, i.e. the mean, median, maximum, and minimum, for KLEE.

(Feature-importance plot)

Decision Trees

Note: we consider mean values of all the features over all lines as input for this model.

Intro:

We started working on Decision Trees because they are much simpler than Boosted Trees and can easily be rendered as an image with the library we were using. We used the Scikit-Learn library to implement Decision Trees and Scikit-Optimize to tune the tree parameters for the best results on our data.

Code:

from skopt import BayesSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold

bayes_cv_tuner = BayesSearchCV(
    estimator=DecisionTreeClassifier(
        criterion='gini',
        splitter='best',
        presort=True
    ),
    # Search space for Bayesian optimisation of the tree parameters
    search_spaces={
        'max_features': (1, 4),
        'max_depth': (1, 15),
        'min_samples_split': (2, 10),
        'min_samples_leaf': (1, 10),
        'min_weight_fraction_leaf': (1e-9, 0.5, 'log-uniform'),
        'min_impurity_decrease': (1e-9, 0.5, 'log-uniform'),
        'max_leaf_nodes': (2, 50)
    },
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    n_jobs=1,
    n_iter=10,
    verbose=0,
    refit=True,
    random_state=42
)
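Since the trees were inspected as images, here is a minimal export sketch (our addition; the feature names listed are assumed to be the four aggregates remaining after taint_count was dropped):

import graphviz
from sklearn.tree import export_graphviz

# Render the best tree found by the tuner as a PNG via Graphviz
dot = export_graphviz(
    bayes_cv_tuner.best_estimator_,
    feature_names=['trace_length', 'global_nesting_depth',
                   'call_depth', 'no_of_local_variable'],  # assumed order
    class_names=['bug not found', 'bug found'],
    filled=True,
    out_file=None
)
graphviz.Source(dot, format='png').render('klee_tree')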

Conclusion

The trees gave an accuracy score of 0.78 for KLEE and 0.71 for AFL under 3-fold cross validation.

(AFL output plot) (KLEE output plot)

We weren't able to get any good result, because the tree we obtained for KLEE did not match the known algorithm on which KLEE works. During this time we also learned that the taint_count values on individual lines were taint counts of different variables at different lines, so taking their mean was not a good way to summarise them over the entire file; we therefore stopped using taint_count in our models from here onwards. We also tried the maximum values of the features in a file as inputs instead of the mean and got similar accuracy scores (~0.7 for both KLEE and AFL), but the trees for KLEE still did not match the algorithm on which KLEE works (according to our guide Awanish Pandey).

Note: We also tried using the mean, maximum, and median data together in the same model for KLEE, but the resulting Decision Tree became very hard to comprehend. These trees also used the same feature multiple times, since the mean, median, and max are correlated.

Output tree for KLEE

Output table for feature importance for KLEE

Parameter                    Weightage
max_trace_length             0.238
min_global_nesting_depth     0.147
median_no_of_local_variable  0.145
mean_no_of_local_variable    0.125
mean_trace_length            0.088
max_global_nesting_depth     0.082
median_trace_length          0.079
min_trace_length             0.061
max_no_of_local_variable     0.031

Association Rule Mining

Intro:

The Association Rule Mining algorithm is used to find relations among variables in large datasets. It works on binary data, i.e. data in the form of 1s and 0s (present or not present), so we needed to convert our numeric data into 0s and 1s. We used thresholding to do this: if the value at a point was greater than a certain threshold it was assigned the value 1, and 0 otherwise. We tried thresholding at the mean and the median of the features over the whole dataset, as sketched below.
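A minimal thresholding sketch (our addition; df is assumed to be the aggregated feature DataFrame and y_afl the labels from Step 1):

# Binarise each feature: 1 if the value exceeds the column mean, else 0
X = (df > df.mean()).astype(int)

# Append the label column so rules with 'AFL_data' as the consequent
# can be mined below
X['AFL_data'] = y_afl.astype(int)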

Code:

See the MLXtend API docs for frequent_patterns and association_rules.

from mlxtend.frequent_patterns import apriori, association_rules

# Find itemsets that occur in at least 5% of the samples
frequent_itemsets = apriori(X, min_support=0.05, use_colnames=True)

# Generate rules and keep only those predicting the AFL label
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules = rules[rules['consequents'] == frozenset({'AFL_data'})]
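The surviving rules can then be ranked, e.g. by lift (our addition):

# Show the strongest rules that predict AFL's behaviour
print(rules.sort_values('lift', ascending=False)[['antecedents', 'support', 'confidence', 'lift']])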

Conclusion

Even after trying several combinations of thresholds and introducing new variables (assigned the value 1 if the value at a point was smaller than a certain threshold, and 0 otherwise), we didn't get any strong rule from our data. All the rules generated had a low value for support, confidence, or lift. One of the reasons might be that Association Rule Mining is intended for finding patterns in large datasets, and our dataset only had about 120 points.

Note: For any rule A -> B,

	Support = P(A)
	Confidence = P(B|A)
	Lift = P(A∩B) / (P(A)·P(B))

Conclusion (Overall):

  • After trying several algorithms and methods, we weren't able to devise any rule or formula describing how KLEE and AFL work.

  • While looking at the data files manually, we found a few abnormalities in the data given to us.

  • In one of the folders, FMT, three files had exactly the same values of all the features, yet KLEE found the bug in two of them but not in the third.

  • A few files had negative values of global_nesting_depth.

  • Another reason why our models didn't do very well is that there are many instances where two files are very similar (hence having similar values of all the features) but the bug was found in one and not in the other. Such conflicting data might have caused our models to misclassify such similar files.

Note: We are working further on SVMs to see whether they can solve our problem, as sketched below.
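A minimal sketch of that planned experiment (our addition, using Scikit-Learn's SVC on the train/test split from Step 3):

from sklearn.svm import SVC

# Train an RBF-kernel SVM on the same aggregated features
svm_classifier = SVC(kernel='rbf')
svm_classifier.fit(X_train, y_train)
print('SVM test accuracy:', svm_classifier.score(X_test, y_test))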
