takingstock/CodeSapper

CodeSapper - pre-emptive analysis for a hassle-free build!

Thanks for stopping by! For a very basic overview of what the project is about, kindly visit https://takingstock.github.io/codesapper.ai/. To add a few more details: the project was born out of all the pain I went through reviewing code, fixing build failures and ensuring test coverage. If you have gone through the architecture diagram, you probably have an idea of what the critical components in this project are going to be, but let me summarize.

Steps to use the app

  • So far I have tested the app on Python and JavaScript only, and there are known issues. You will also have to put all your Python code into code_db/python and your JS code into code_db/js. Apologies for that; I am working on making it more real-world :)
  • You only need to do 2 things (though there are 4 steps :) )
    1. Change the ENV variables specified in .github/workflows/analyze_changes.yml, specifically NETWORKX_S3, which holds the name of your S3 bucket. Also configure remote access to the S3 bucket from whichever environment you use; I don't think I can cover that part programmatically!
    2. Ensure your git repo has GitHub Actions enabled.
    3. For the LLM I am currently using the Groq API (note: this is not Elon Musk's Grok) serving Llama 3 70B, and it is lightning fast. If you already have an API that serves Llama / Claude / OpenAI models, go for it (you will have to implement a custom method for these).
    4. Once the above is set up, simply make changes to the code base and check in. Once the workflow ends, search the logs for the keywords "BASE CHANGE IMPACT" and "DOWNSTREAM". Sorry, I will provide a cleaner way to access this, but for now it gives you an end-to-end idea!
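Until there is a cleaner way to read the results, the keyword search in step 4 can be scripted. The snippet below is a minimal sketch, assuming you have saved the workflow log as a text file; the function name `find_impact_lines` is hypothetical and not part of the project.

```python
from typing import Iterable, List

# Keywords to look for in the workflow logs (taken from step 4 above).
KEYWORDS = ("BASE CHANGE IMPACT", "DOWNSTREAM")


def find_impact_lines(log_lines: Iterable[str], keywords=KEYWORDS) -> List[str]:
    """Return the log lines that mention any keyword, without trailing newlines."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(keyword in line for keyword in keywords)
    ]


# Small in-memory example; for a real log, pass an open file object instead.
sample_log = [
    "step 1 ok\n",
    "BASE CHANGE IMPACT: foo changed\n",
    "noise\n",
    "DOWNSTREAM: bar uses foo\n",
]
for hit in find_impact_lines(sample_log):
    print(hit)
```

Since an open file iterates line by line, `find_impact_lines(open("workflow.log"))` would work the same way on a downloaded log (the file name here is only an example).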

Summary of critical components

AST parsers

  • Every language has its own idiosyncrasies when it comes to defining methods, variables, package declarations etc. Though LLMs are reasonably OK at parsing these details in a language-agnostic way, getting this ~100% correct is critical, so we rely on language-specific AST parsers. I have already written the parsers for Python and JavaScript, and a parser for Java is in the pipeline.
  • Once we have parsers for Python, JS and Java, the idea is to test them at scale on open source projects that use these languages. That's probably the only way we can assure our community that all the details we seek are being extracted.
  • Each parser sticks to a predefined format that's available in local_db/python and local_db/js. The outputs are called <language_extn>_graph_entity_summary.json and are stored in a data repo (S3, for now).
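To give a feel for what a language-specific AST parser extracts, here is a minimal sketch using Python's built-in `ast` module. It is not the project's parser: the function name `summarize_python_source` and the record fields (`file_path`, `method_name`, `parameters`, `method_begin`, `method_end`) are assumptions for illustration, and the real format is the one defined in local_db.

```python
import ast
import json


def summarize_python_source(source: str, file_path: str) -> list:
    """Collect one record per function or method defined in `source`."""
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "file_path": file_path,          # hypothetical field names
                "method_name": node.name,
                "parameters": [arg.arg for arg in node.args.args],
                "method_begin": node.lineno,
                "method_end": node.end_lineno,   # available on Python 3.8+
            })
    # ast.walk does not promise source order, so sort by starting line.
    return sorted(records, key=lambda record: record["method_begin"])


SAMPLE = '''
def load(path):
    return open(path).read()

class Parser:
    def parse(self, text, strict=False):
        return text.split()
'''

summary = summarize_python_source(SAMPLE, "code_db/python/sample.py")
print(json.dumps(summary, indent=2))
```

Relying on the parser rather than an LLM here means line numbers and parameter lists come straight from the syntax tree, which is the kind of exactness the bullets above are after.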

Graph operations

  • Once the above JSON files are generated, an algorithm defined in the respective utils/ast_utils folder finds the usages of (i.e. the calls to) each of these methods within the local file.
  • Then the match_inter_service_calls method defined in the utils folder starts the discovery of inter-module / inter-service calls.
  • After updating the respective input JSON files, we invoke the graph code defined in utils/graph_utils/networkx. The choice of NetworkX was driven by 2 considerations: a) minimal memory footprint, b) zero setup for anyone using this project (though I feel Neo4j is a much more versatile platform for graphs).
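The flow above (find calls, then walk the graph) can be sketched in a few lines. This is an illustration only, not the code in utils/: it uses the standard library instead of NetworkX so it runs with no installs, the function names `calls_by_function` and `downstream_consumers` are hypothetical, and it matches calls by bare name, which is cruder than what a real inter-service matcher needs.

```python
import ast
from collections import defaultdict, deque


def calls_by_function(source: str) -> dict:
    """Map each function name to the set of simple names it calls."""
    calls = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            called = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):         # foo(...)
                        called.add(func.id)
                    elif isinstance(func, ast.Attribute):  # obj.foo(...)
                        called.add(func.attr)
            calls[node.name] = called
    return calls


def downstream_consumers(calls: dict, changed: str) -> set:
    """All functions that directly or transitively call `changed`."""
    # Reverse the mapping: callee -> set of callers.
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)
    # Breadth-first walk over the reversed edges.
    seen, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


SAMPLE = '''
def read_config(path):
    return open(path).read()

def build_client(path):
    return connect(read_config(path))

def main():
    client = build_client("cfg.json")
    client.send("ping")

def unrelated():
    return 42
'''

calls = calls_by_function(SAMPLE)
print(sorted(downstream_consumers(calls, "read_config")))
```

With NetworkX, the same traversal could be expressed by adding callee-to-caller edges to an `nx.DiGraph` and calling `nx.descendants` on the changed method.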

LLM ops

  • Since the whole process is triggered by a code commit into git, we take the diff file, pre-process it, and then traverse the graph to find the code snippets of the base file that are impacted, along with all the downstream consumers of the changed method.
  • We extract the code snippets from all the respective files, combine them with the prompts defined in utils/LLM_INTERFACE/llm_config.json and call the models.
  • We then extract the outputs and display them. These can also be emailed or sent on group messaging channels (depending on the availability of APIs).
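As a rough sketch of the first two steps, the snippet below reads the hunk headers of a unified diff, maps the changed line ranges onto known method spans, and fills a prompt template. Everything named here is an assumption for illustration: the helper names, the method record fields and the template text are hypothetical (the real prompts live in llm_config.json), and no model is actually called. Note that hunk ranges include context lines, so the overlap check is deliberately generous.

```python
import re

# Matches unified-diff hunk headers such as "@@ -2,2 +2,3 @@".
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def changed_line_ranges(diff_text: str) -> list:
    """Return (start, end) line ranges in the new file, one per hunk."""
    ranges = []
    for line in diff_text.splitlines():
        match = HUNK_RE.match(line)
        if match:
            start = int(match.group(1))
            # A missing count means the hunk covers exactly one line.
            count = int(match.group(2)) if match.group(2) is not None else 1
            if count > 0:
                ranges.append((start, start + count - 1))
    return ranges


def impacted_methods(ranges: list, methods: list) -> list:
    """Names of methods whose line span overlaps any changed range."""
    hits = []
    for method in methods:
        for start, end in ranges:
            if method["method_begin"] <= end and start <= method["method_end"]:
                hits.append(method["method_name"])
                break
    return hits


def build_prompt(template: str, base_snippet: str, downstream_snippets: list) -> str:
    """Fill a prompt template with the changed code and its consumers."""
    return template.format(base=base_snippet,
                           downstream="\n\n".join(downstream_snippets))


DIFF = """\
--- a/code_db/python/sample.py
+++ b/code_db/python/sample.py
@@ -2,2 +2,3 @@
 def load(path):
-    return open(path).read()
+    with open(path) as fh:
+        return fh.read()
"""

# Hypothetical method spans, e.g. as produced by an AST parser.
METHODS = [
    {"method_name": "load", "method_begin": 2, "method_end": 4},
    {"method_name": "parse", "method_begin": 7, "method_end": 8},
]

# Hypothetical template text.
TEMPLATE = ("Base change:\n{base}\n\nDownstream consumers:\n{downstream}\n\n"
            "Describe the impact of this change.")

ranges = changed_line_ranges(DIFF)
impacted = impacted_methods(ranges, METHODS)
prompt = build_prompt(
    TEMPLATE,
    "def load(path):\n    with open(path) as fh:\n        return fh.read()",
    ["def main():\n    text = load('cfg.json')"],
)
print(ranges, impacted)
print(prompt)
```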

Short term roadmap

  • Fine-tune the Llama 3 70B model on the https://github.com/github/CodeSearchNet dataset for more refined and accurate impact analysis.
  • Integrate Java parsers and everything that comes with them (e.g. frameworks like Spring and JSF), since a huge amount of the world's code today is in Java.
  • Integrate with Slack / Discord or other popular channels for sending out notifications.
  • Integrate a feedback module (yet to be designed; I just have a basic idea) with RLHF components to minimize false positives and false negatives in the notifications generated by the system.

Contributions

I would love for people to reach out. The skill sets I am looking for:

  • Devs who have worked in complicated code bases that span at least 2 programming languages, OR the same programming language but different frameworks
  • LLM fine-tuning enthusiasts
  • UX designers (to improve workflows)
  • Shoot an email to [email protected] if you would like to contribute, OR open issues, whatever floats your boat
