Finished code cleanup, readme is mostly done.

2025-04-20 20:22:57 -04:00
parent 9dcd31dd04
commit 1621023958
6 changed files with 171 additions and 25 deletions

139
README.md

@@ -4,6 +4,34 @@ This repository contains a proof-of-concept of an AI coding assistant capable of
examining the repository looking for the most appropriate place to update, and then generating code appropriate to the
prompt, and then a unit test and commit message.
## Solutions Approach
The following approach is taken by this implementation:
* The repository is scanned and chunked, and embeddings are generated for all supported source code.
* The user-provided prompt is also embedded, and a vector search is performed to find the most appropriate chunk of code to modify.
* The identified chunk and its surrounding context are provided to a code-appropriate LLM, which is asked to write code that solves the prompt.
* The generated code is extracted from the LLM response and a diff patch is produced; the diff is then applied to the source file.
* The generated code and the original prompt are provided to the code LLM, which is asked to generate an appropriate unit test.
* The unit test is either appended to an existing test file or written to a new test file.
* The changed files are staged and committed to Git, with a commit message generated by a chat-appropriate LLM.
This implementation is very simple and has numerous drawbacks (see the Limitations / Assumptions section below), but it
does work and could be iterated on to produce better results.
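The vector search step above can be illustrated with a minimal, in-memory sketch. This is not the project's code: the project delegates the search to PostgreSQL with pgvector, and the distance metric it uses is not stated here. The `Chunk` type, the `cosine` and `mostRelevant` functions, and the toy three-dimensional embeddings are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Chunk is a hypothetical stand-in for an indexed piece of source code.
type Chunk struct {
	File      string
	Embedding []float64
}

// cosine returns the cosine similarity of two vectors of equal length,
// or 0 if either vector has zero magnitude.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// mostRelevant returns the index and score of the chunk whose embedding is
// closest to the query. It returns -1 when chunks is empty.
func mostRelevant(query []float64, chunks []Chunk) (int, float64) {
	best, bestScore := -1, math.Inf(-1)
	for i, c := range chunks {
		if s := cosine(query, c.Embedding); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best, bestScore
}

func main() {
	chunks := []Chunk{
		{File: "main.go", Embedding: []float64{0.9, 0.1, 0.0}},
		{File: "store.go", Embedding: []float64{0.0, 0.2, 0.9}},
	}
	// query stands in for the embedded user prompt.
	query := []float64{1.0, 0.0, 0.1}
	i, score := mostRelevant(query, chunks)
	fmt.Printf("best chunk: %s (score %.3f)\n", chunks[i].File, score)
}
```

A linear scan like this is only reasonable for toy inputs; a database-side vector index is what makes the same idea practical for a whole repository.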
## Code Structure
| Package | Description |
|-------------------------|-------------------------------------------------------------------|
| cmd | Main entrypoint |
| cmd/autopatch | Command to generate a git patch automatically from a prompt. |
| cmd/indexer | Command to (re-)index a repository and regenerate embeddings. |
| cmd/chunks | Command to search chunks based on an embeddings prompt. |
| pkg/config | Configuration file definition. |
| pkg/database | Database abstraction, used to perform bootstrapping / migrations. |
| pkg/database/migrations | Database schema. |
| pkg/indexer | Embedding generator for Git repositories. |
| pkg/llm | Abstraction layer over LLM implementations. |
| pkg/llm/prompt | Prompt templates for the LLM. |
## Requirements
To use this application you must provide the following:
* A PostgreSQL database server with the pgvector extension installed and configured.
@@ -25,15 +53,6 @@ To see what got generated:
git log --full-diff -p -n 1 git log --full-diff -p -n 1
``` ```
## Models Used in Testing
The following models were used while developing the application.
| Model | Purpose |
|------------------|---------------------------------------------------------------------|
| nomic-embed-text | Used for generating embeddings from source chunks. |
| gemma2-9b-it | Used for generating code. |
| llama3.2 | Used for conversational prompts and generating git commit messages. |
## Limitations / Assumptions
The following shortcuts have been taken to reduce time to implementation:
* The application does not use an autonomous agentic approach as this would have taken implementing verification tools
@@ -54,3 +73,105 @@ The following shortcuts have been taken to reduce time to implementation:
modify in order to produce a set of patches.
* No attempt was made to tune the models to the task; there are almost certainly better models to use than the ones
selected. This was the first set of models that produced a workable result.
## Models Used in Testing
The following models were used while developing the application.
| Model | Purpose |
|------------------|---------------------------------------------------------------------|
| nomic-embed-text | Used for generating embeddings from source chunks. |
| gemma2-9b-it | Used for generating code. |
| llama3.2 | Used for conversational prompts and generating git commit messages. |
## Run Example
An example of an execution:
```
git clone https://github.com/andyjessop/simple-go-server
go run ./cmd auto-patch \
  --repo /path/to/simple-go-server \
  --task "add a new api route to main that responds to GETs to /echo with 'hello world' with the current time appended at the end" \
  --execute
```
Log:
```json
{"time":"2025-04-20T10:16:53.742200259-04:00","level":"INFO","msg":"repo already indexed, skipping"}
{"time":"2025-04-20T10:16:53.793235826-04:00","level":"INFO","msg":"found most relevant file chunk","file":"~/Projects/simple-go-server/main.go","start":2048,"end":2461,"score":0.4782574772834778,"id":1}
{"time":"2025-04-20T10:16:55.908966322-04:00","level":"INFO","msg":"applying generated patch to file","file":"~/Projects/simple-go-server/main.go"}
{"time":"2025-04-20T10:16:56.412154285-04:00","level":"INFO","msg":"applying generated unit test to file","file":"~/Projects/simple-go-server/main_test.go"}
{"time":"2025-04-20T10:16:56.758200341-04:00","level":"INFO","msg":"committed changes to git repo","repo":"~/Projects/simple-go-server"}
```
Diff Generated:
```
commit 29c5b5a2a2b71a05b4b782e94d482938cf51ba2b (HEAD -> main)
Author: Michael Powers <swedishborgie@gmail.com>
Date:   Sun Apr 20 10:16:56 2025 -0400

    "Added new API route to echo 'Hello World' with current time in response to GET requests to /echo"

diff --git a/main.go b/main.go
index 9acae7f..b475345 100644
--- a/main.go
+++ b/main.go
@@ -1,3 +1,4 @@
+
 package main
 
 import (
@@ -8,6 +9,7 @@ import (
 	"net/http"
 	"strconv"
 	"sync"
+	"time"
 )
 
 type Post struct {
@@ -24,6 +26,7 @@ var (
 func main() {
 	http.HandleFunc("/posts", postsHandler)
 	http.HandleFunc("/posts/", postHandler)
+	http.HandleFunc("/echo", echoHandler)
 
 	fmt.Println("Server is running at http://localhost:8080")
 	log.Fatal(http.ListenAndServe(":8080", nil))
@@ -121,3 +124,7 @@ func handleDeletePost(w http.ResponseWriter, r *http.Request, id int) {
 	delete(posts, id)
 	w.WriteHeader(http.StatusOK)
 }
+
+func echoHandler(w http.ResponseWriter, r *http.Request) {
+	w.Write([]byte("hello world " + time.Now().Format(time.RFC3339)))
+}
diff --git a/main_test.go b/main_test.go
index 4ca1d78..a051498 100644
--- a/main_test.go
+++ b/main_test.go
@@ -81,3 +81,25 @@ func TestHandlePostAndDeletePosts(t *testing.T) {
 		t.Errorf("handler returned wrong status code for delete: got %v want %v", status, http.StatusOK)
 	}
 }
+
+
+func TestEchoHandler(t *testing.T) {
+	req, err := http.NewRequest("GET", "/echo", nil)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	rr := httptest.NewRecorder()
+	echoHandler(rr, req)
+
+	if rr.Code != http.StatusOK {
+		t.Errorf("handler returned wrong status code: got %v want %v",
+			rr.Code, http.StatusOK)
+	}
+
+	expected := "hello world " + time.Now().Format(time.RFC3339)
+	if rr.Body.String() != expected {
+		t.Errorf("handler returned unexpected body: got %v want %v",
+			rr.Body.String(), expected)
+	}
+}
```


@@ -21,6 +21,7 @@ import (
 	"strings"
 )
 
+// Command defines the autopatch command along with command line flags.
 func Command() *cli.Command {
 	return &cli.Command{
 		Name: "auto-patch",
@@ -34,7 +35,7 @@ func Command() *cli.Command {
 			},
 			&cli.StringFlag{
 				Name:     "task",
-				Usage:    "task to perform, e.g. \"add a test for a function\"",
+				Usage:    "task to perform, e.g. \"add a http route for a new health check endpoint\"",
 				Required: true,
 			},
 			&cli.BoolFlag{
@@ -45,11 +46,13 @@ func Command() *cli.Command {
 	}
 }
 
+// autoPatch is a struct implementing the auto patcher.
 type autoPatch struct {
 	llm     *llm.LLM
 	execute bool
 }
 
+// run gets executed when the command is run.
 func (a *autoPatch) run(ctx context.Context, cmd *cli.Command) error {
 	llmRef := llm.FromContext(ctx)
 	a.llm = llmRef
@@ -70,6 +73,8 @@ func (a *autoPatch) run(ctx context.Context, cmd *cli.Command) error {
 	return nil
 }
 
+// generateGitCommit will generate the code patch, the unit test, and will stage and commit the changes with an
+// appropriate commit message.
 func (a *autoPatch) generateGitCommit(ctx context.Context, repoPath, prompt string) error {
 	var affectedFiles []string
 
@@ -97,6 +102,7 @@ func (a *autoPatch) generateGitCommit(ctx context.Context, repoPath, prompt stri
 	return nil
 }
 
+// commit stages the changed files and commits them with a commit message.
 func (a *autoPatch) commit(ctx context.Context, prompt, repoPath string, files ...string) error {
 	gitPath := osfs.New(filepath.Join(repoPath, ".git"))
 
@@ -146,6 +152,7 @@ func (a *autoPatch) commit(ctx context.Context, prompt, repoPath string, files .
 	return nil
 }
 
+// generateCodePatch generates an appropriate code patch given a repository and prompt.
 func (a *autoPatch) generateCodePatch(ctx context.Context, repoPath, prompt string) (string, string, error) {
 	db := database.FromContext(ctx)
 	cfg := config.FromContext(ctx)
@@ -209,6 +216,8 @@ func (a *autoPatch) generateCodePatch(ctx context.Context, repoPath, prompt stri
 	return fileName, codeBlock, err
 }
 
+// generateUnitTest passes the new code and a prompt to the LLM to generate an appropriate unit test. It will add the
+// new test to the bottom of an existing test file, or generate a new one if no unit test file already exists.
 func (a *autoPatch) generateUnitTest(ctx context.Context, prompt, fileName, newCode string) (string, error) {
 	// Check to see if a test file for this already exists.
 	testFileExists := false
@@ -263,6 +272,8 @@ func (a *autoPatch) generateUnitTest(ctx context.Context, prompt, fileName, newC
 	return testFile, nil
 }
 
+// generateCode takes a prompt and tries to extract code from it. The prompt should try to get the LLM to structure the
+// code as a single yaml chunk.
 func (a *autoPatch) generateCode(ctx context.Context, prompt string) (string, error) {
 	rsp, err := a.llm.CodePrompt(ctx, prompt)
 	if err != nil {
@@ -284,6 +295,7 @@ func (a *autoPatch) generateCode(ctx context.Context, prompt string) (string, er
 	return codeBlock, nil
 }
 
+// patchFile attempts to write the merged diff to the original file so it can be staged and committed.
 func patchFile(fileName string, diffs []diffmatchpatch.Diff) error {
 	var buff bytes.Buffer
 	for _, diff := range diffs {
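
The body of `patchFile` is cut off in this view, so the following is only a guess at the general technique its comment describes: rebuilding the new file contents from a list of diff segments by keeping unchanged and inserted text and dropping deleted text. The `op` and `diff` types are hypothetical local stand-ins for the diff library's types, and `mergedText` is an illustrative name, not a function from the project.

```go
package main

import (
	"fmt"
	"strings"
)

// op is a hypothetical stand-in for a diff operation type.
type op int

const (
	opDelete op = -1 // text present only in the old version
	opEqual  op = 0  // text present in both versions
	opInsert op = 1  // text present only in the new version
)

// diff is a hypothetical stand-in for a single diff segment.
type diff struct {
	Type op
	Text string
}

// mergedText rebuilds the new version of a text from its diff segments:
// equal and inserted segments are kept, deleted segments are skipped.
func mergedText(diffs []diff) string {
	var b strings.Builder
	for _, d := range diffs {
		if d.Type == opDelete {
			continue
		}
		b.WriteString(d.Text)
	}
	return b.String()
}

func main() {
	diffs := []diff{
		{opEqual, "func main() {\n"},
		{opDelete, "\toldCall()\n"},
		{opInsert, "\tnewCall()\n"},
		{opEqual, "}\n"},
	}
	fmt.Print(mergedText(diffs))
}
```

Writing the result back to disk would then be a single `os.WriteFile` call with the original file's permissions.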


@@ -16,6 +16,9 @@ import (
 	"os"
 )
 
+// main is the entry point to the application. It's mostly responsible for bootstrapping the database, configuration,
+// logging, and llms. It passes everything to subcommands by injecting state into the context to avoid coupling during
+// initialization.
 func main() {
 	app := &cli.Command{
 		Name: "ai-coding-assistant",
@@ -60,6 +63,7 @@ func main() {
 	}
 }
 
+// readConfig attempts to read the configuration file.
 func readConfig(ctx context.Context, cmd *cli.Command) (context.Context, error) {
 	cfgFile := cmd.String("config")
 	cfgHandle, err := os.Open(cfgFile)


@@ -15,6 +15,7 @@ const (
 	contextKeyConfig contextKey = "config"
 )
 
+// Configuration is a simple configuration that can be loaded from a YAML file.
 type Configuration struct {
 	Database struct {
 		ConnString string `yaml:"conn_string"`


@@ -18,12 +18,14 @@ const contextKeyLLM contextKey = "llm"
 //go:embed prompts
 var prompts embed.FS
 
+// LLM is responsible for abstracting the configuration and implementations of the LLMs used.
 type LLM struct {
 	code     llms.Model
 	chat     llms.Model
 	embedder embeddings.Embedder
 }
 
+// FromConfig bootstraps the LLM from a passed in configuration.
 func FromConfig(cfg *config.Configuration) (*LLM, error) {
 	embedLLM, err := cfg.Embedding.GetEmbedding()
 	if err != nil {
@@ -52,30 +54,32 @@ func FromConfig(cfg *config.Configuration) (*LLM, error) {
 	}, nil
 }
 
+// FromContext retrieves an LLM from a passed in context wrapped with WrapContext.
 func FromContext(ctx context.Context) *LLM {
 	return ctx.Value(contextKeyLLM).(*LLM)
 }
 
+// WrapContext embeds an LLM inside a context so it can be retrieved with FromContext.
 func WrapContext(ctx context.Context, llmRef *LLM) context.Context {
 	return context.WithValue(ctx, contextKeyLLM, llmRef)
 }
 
-func (llm *LLM) GetEmbedding(ctx context.Context, texts ...string) ([][]float32, error) {
-	return llm.embedder.EmbedDocuments(ctx, texts)
-}
-
+// Embedder gets an embedder that can be used to store and retrieve embeddings.
 func (llm *LLM) Embedder() embeddings.Embedder {
 	return llm.embedder
 }
 
+// CodePrompt passes a prompt to the code LLM and returns the response.
 func (llm *LLM) CodePrompt(ctx context.Context, prompt string) (string, error) {
 	return llm.code.Call(ctx, prompt)
 }
 
+// ChatPrompt passes a prompt to the chat LLM and returns the response.
 func (llm *LLM) ChatPrompt(ctx context.Context, prompt string) (string, error) {
 	return llm.chat.Call(ctx, prompt)
 }
 
+// GetPrompt loads a LLM prompt template and injects variables into it. Uses the go template format.
 func GetPrompt(name string, data any) (string, error) {
 	tmplText, err := prompts.ReadFile("prompts/" + name + ".tmpl")
 	if err != nil {


@@ -9,6 +9,7 @@ import (
 	"strconv"
 )
 
+// RelevantDocs attempts to find the most relevant file chunks based on context from the prompt.
 type RelevantDocs struct {
 	CallbacksHandler callbacks.Handler
 	db *database.Database
@@ -17,6 +18,7 @@ type RelevantDocs struct {
 	size int
 }
 
+// FileChunkID is a pointer to a repository file chunk that has been indexed.
 type FileChunkID struct {
 	Name string
 	ChunkID int
@@ -26,6 +28,7 @@ type FileChunkID struct {
 	Doc *schema.Document
 }
 
+// NewGetRelevantDocs creates a new RelevantDocs scanner.
 func NewGetRelevantDocs(db *database.Database, llm *LLM, repoID string, size int) *RelevantDocs {
 	return &RelevantDocs{
 		db: db,
@@ -35,6 +38,7 @@ func NewGetRelevantDocs(db *database.Database, llm *LLM, repoID string, size int
 	}
 }
 
+// GetRelevantFileChunks will scan for relevant documents based on a prompt.
 func (rd *RelevantDocs) GetRelevantFileChunks(ctx context.Context, query string) ([]*FileChunkID, error) {
 	vectorStore, closeFunc, err := rd.db.GetVectorStore(ctx, rd.llm.Embedder())
 	if err != nil {