ASTra

Overview

ASTra is a project designed to analyze C++ source files (.cpp) by reading their contents, generating Abstract Syntax Trees (ASTs), computing similarity matrices using TF-IDF vectors, and visualizing the results. The project evolved through multiple versions, each adding new features and improvements.

Version History

Combined Version

Reading .cpp Files: Reads files from a specified directory, checks for empty files, and stores full file paths.
Preprocessing: Initially a placeholder step; later versions use it to remove comments and whitespace.
TF-IDF Computation: Computes TF-IDF vectors for the file contents and combined content (file content + AST).
AST Generation: Uses libclang to generate ASTs for each .cpp file, including detailed tokenization and filtering out comments and preprocessor directives.
Similarity Matrix: Computes a cosine similarity matrix from the TF-IDF vectors, which include AST information.
Clustering and Visualization: Clusters files by similarity and visualizes the clusters using PCA, t-SNE, and dendrograms.
AST Visualization: Converts ASTs to Graphviz format for visualization.
Output: Prints the similarity matrix, pairwise similarity scores, AST information, and token details, and displays the visualizations.
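
To make the TF-IDF and cosine similarity steps concrete, here is a minimal sketch that uses only the Python standard library. It is not the project's implementation (the presence of tfidf_vectorizer.pkl suggests a library vectorizer is used instead). The function names, the smoothed IDF formula, and the toy token lists are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption), so tokens found in every document keep a nonzero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors):
    """Pairwise cosine similarity between all vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy token lists standing in for tokenized .cpp files.
docs = [
    "int main ( ) { return 0 ; }".split(),
    "int main ( ) { return 1 ; }".split(),
    "void f ( int x ) { x ++ ; }".split(),
]
vocab, vectors = tfidf_vectors(docs)
matrix = similarity_matrix(vectors)
for row in matrix:
    print(" ".join(f"{score:.2f}" for score in row))
```

The first two toy files differ by a single token, so their score should come out higher than either file's score against the third.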

Usage

Run the main() function to:

Read .cpp files from the specified directory.
Generate ASTs for each file.
Preprocess the file contents.
Compute TF-IDF vectors, including AST information.
Compute and print the similarity matrix.
Print AST information for each file.
Visualize the results using PCA.
Visualize ASTs using Graphviz.
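
The AST information and Graphviz steps can be sketched as follows. The project builds its trees with libclang. Here a hypothetical Node class stands in for a libclang cursor so the example runs without any dependencies. The definition of average depth (the mean depth over all nodes, with the root at depth 0) is an assumption, since the exact definition used by the project is not stated.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical stand-in for an AST node (e.g. a libclang cursor)."""
    kind: str
    children: list = field(default_factory=list)

def walk(node, depth=0):
    """Yield (node, depth) pairs in pre-order."""
    yield node, depth
    for child in node.children:
        yield from walk(child, depth + 1)

def ast_stats(root):
    """Return (node count, mean node depth) with the root at depth 0."""
    depths = [depth for _, depth in walk(root)]
    return len(depths), sum(depths) / len(depths)

def to_dot(root):
    """Render the tree as Graphviz DOT text."""
    lines = ["digraph AST {"]
    ids = {}
    # First pass: declare one labelled DOT node per AST node.
    for node, _ in walk(root):
        ids[id(node)] = len(ids)
        lines.append(f'  n{ids[id(node)]} [label="{node.kind}"];')
    # Second pass: add one edge per parent-child link.
    for node, _ in walk(root):
        for child in node.children:
            lines.append(f"  n{ids[id(node)]} -> n{ids[id(child)]};")
    lines.append("}")
    return "\n".join(lines)

# A toy tree roughly shaped like `int f(int x) { return 0; }`.
tree = Node("FUNCTION_DECL", [
    Node("PARM_DECL"),
    Node("COMPOUND_STMT", [Node("RETURN_STMT", [Node("INTEGER_LITERAL")])]),
])
count, avg_depth = ast_stats(tree)
dot = to_dot(tree)
print(count, avg_depth)
print(dot)
```

The DOT text can be saved to a file and rendered with the Graphviz dot command, for example `dot -Tpng ast.dot -o ast.png`.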

Code Sections

Part 1: Reading and Preprocessing C++ Files, Generating ASTs, and Computing Similarity Matrix

In the first part of the code, the program reads C++ files from a specified directory, ensuring that empty files are skipped. It then tokenizes the C++ code while filtering out comments and preprocessor directives. Next, the code generates ASTs using libclang and calculates the number of nodes and the average depth of these ASTs. The tokenized code and ASTs are combined, and TF-IDF vectors are computed for this combined content. The cosine similarity matrix is then computed using these vectors, and a dataset is created that includes file names, similarity scores, TF-IDF vectors, and AST features. Finally, this dataset is saved to a JSON file.
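
The reading and tokenizing steps described above might look like the sketch below. It is a simplified illustration, not the project's code: the project tokenizes through libclang, whereas this version uses regular expressions, which can misfire on comment markers inside string literals and on directives continued across lines. The names read_cpp_files and tokenize are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Naive patterns (assumptions): they do not understand string literals.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
DIRECTIVE_RE = re.compile(r"^[ \t]*#[^\n]*$", re.MULTILINE)
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def read_cpp_files(directory):
    """Map full path -> contents for every non-empty .cpp file in a directory."""
    files = {}
    for path in sorted(Path(directory).glob("*.cpp")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if text.strip():  # skip empty (or whitespace-only) files
            files[str(path.resolve())] = text
    return files

def tokenize(source):
    """Drop comments and preprocessor directives, then split into tokens."""
    source = COMMENT_RE.sub(" ", source)
    source = DIRECTIVE_RE.sub("", source)
    return TOKEN_RE.findall(source)

# Demonstration on a temporary directory with one real file and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.cpp").write_text(
        "#include <cstdio>\n// entry point\nint main() { return 0; } /* done */\n",
        encoding="utf-8",
    )
    Path(tmp, "empty.cpp").write_text("", encoding="utf-8")
    files = read_cpp_files(tmp)
    tokens = {Path(p).name: tokenize(src) for p, src in files.items()}

print(tokens)
```

The resulting token lists are the kind of input that would then be combined with AST information and passed to the TF-IDF step.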

Part 2: Loading, Preprocessing, and Training the Model

In the second part of the code, the program loads the dataset from the JSON file and preprocesses it for machine learning. The TF-IDF vectors are expanded into separate columns, and the features and labels are separated. The data is then split into training and testing sets. A Random Forest classifier is trained on the training data, and predictions are made on the test set. The model's accuracy and a classification report are printed. Additionally, the code prints the similarity scores along with predictions and actual labels for each pair of files in the test set.
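
The preprocessing described above (expanding TF-IDF vectors into columns, separating features from labels, and splitting the data) can be sketched with the standard library alone. The record keys (file_a, file_b, similarity, tfidf_vector, ast_nodes, label), the label values, and the helper names are all hypothetical, since the actual JSON schema is not documented here. The sketch stops before training. The resulting rows could then be handed to a classifier such as scikit-learn's RandomForestClassifier.

```python
import json
import random

# Hypothetical dataset contents; the real JSON schema may differ.
RAW = """[
  {"file_a": "a.cpp", "file_b": "b.cpp", "similarity": 0.91,
   "tfidf_vector": [0.1, 0.4], "ast_nodes": 12, "label": 1},
  {"file_a": "a.cpp", "file_b": "c.cpp", "similarity": 0.20,
   "tfidf_vector": [0.0, 0.7], "ast_nodes": 30, "label": 0},
  {"file_a": "b.cpp", "file_b": "c.cpp", "similarity": 0.25,
   "tfidf_vector": [0.3, 0.2], "ast_nodes": 28, "label": 0},
  {"file_a": "c.cpp", "file_b": "d.cpp", "similarity": 0.88,
   "tfidf_vector": [0.5, 0.5], "ast_nodes": 31, "label": 1}
]"""

NON_FEATURES = ("file_a", "file_b", "tfidf_vector", "label")

def expand_record(record):
    """Copy numeric fields and spread the TF-IDF vector into tfidf_0, tfidf_1, ..."""
    row = {k: v for k, v in record.items() if k not in NON_FEATURES}
    for i, weight in enumerate(record["tfidf_vector"]):
        row[f"tfidf_{i}"] = weight
    return row

def split_features_labels(records):
    """Return (feature rows, labels) in matching order."""
    return [expand_record(r) for r in records], [r["label"] for r in records]

def split_train_test(features, labels, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly and hold out a fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [features[i] for i in train_idx],
        [features[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

records = json.loads(RAW)
features, labels = split_features_labels(records)
X_train, X_test, y_train, y_test = split_train_test(features, labels)
print(len(X_train), len(X_test))
print(sorted(X_train[0]))
```

Fixing the seed keeps the split reproducible, which makes accuracy figures comparable between runs.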
