Adding design document and diagrams.
DESIGN.md (new file, +98 lines)
# System Design

This document describes the design of a product that helps software engineers by accepting prompts
from a variety of sources (GitHub, Jira, IDEs, etc.) and autonomously retrieving context, creating a plan to solve the
prompt, executing the plan, verifying the results, and then pushing the results back to the most appropriate tool.

At a very high level the system looks like this: \


Components:

* API - A REST-based API capable of accepting incoming webhooks from a variety of tools. The webhooks generate
  tasks that will require context gathering, planning, execution, and testing to resolve. These get passed to the
  Indexer to gather context (a minimal sketch of this webhook intake appears after this list).
* Indexer - An event-based processor that accepts tasks from the API and gathers context (e.g. Git files / commits, Jira
  ticket status, etc.) that will need to be indexed and stored for efficient retrieval by later stages.
* Agent Runner - Takes the task and associated context generated by the Indexer and generates an execution plan. Works
  synchronously with the Task Executor to execute the plan. As tasks are executed the plan should be adjusted to ensure
  the task is accomplished.
* Task Executor - A protected process running with [gVisor](https://gvisor.dev/) (a container sandboxing
  mechanism) that has a set of tools available to it that the Agent Runner can invoke to perform its task. The executor
  will have a Network Policy applied such that network access is restricted to the bare minimum required to use the
  tools. \
  Example tools include:
  * Code Context - Select appropriate context for code generators.
  * Code Generators - Generate code for given tasks.
  * Compilers - Ensure code compiles.
  * Linters - Ensure code is well formatted and doesn't violate any defined coding standards.
  * Run Tests - Ensure existing tests (unit, integration, system) continue to pass after changes are made, and that new
    tests pass.
* Result Processor - Responsible for receiving the result of a task from the Agent Runner and disseminating it to
  interested parties through the API and directly to integrated services.
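
To make the hand-off from the API to the Indexer concrete, here is a minimal sketch of a webhook endpoint that normalizes an incoming GitHub issue event into a task and publishes it for the Indexer. FastAPI, confluent-kafka, the `tasks.created` topic name, and the task fields are illustrative assumptions rather than part of the design.

```python
# Minimal sketch (assumptions: FastAPI for the REST API, confluent-kafka for
# publishing, a "tasks.created" topic, and an illustrative task schema).
import json
import uuid

from confluent_kafka import Producer
from fastapi import FastAPI, Request

app = FastAPI()
producer = Producer({"bootstrap.servers": "kafka:9092"})  # assumed broker address


@app.post("/webhooks/github")
async def github_webhook(request: Request):
    event = await request.json()
    # Normalize the webhook payload into an internal task for the Indexer.
    task = {
        "id": str(uuid.uuid4()),
        "source": "github",
        "repo": event.get("repository", {}).get("full_name"),
        "prompt": event.get("issue", {}).get("body", ""),
    }
    # Hand the task to the Indexer asynchronously via Kafka.
    producer.produce("tasks.created", json.dumps(task).encode("utf-8"))
    producer.flush()
    return {"status": "accepted", "task_id": task["id"]}
```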

## System Dependencies

* The solution sits on top of `Amazon Web Services` as an industry standard compute provider. We intentionally will
  not use AWS products that do not have good analogs with other compute providers (e.g. Kinesis, Nova, Bedrock, etc.) to
  avoid vendor lock-in.
* The solution is built and deployed on top of `Elastic Kubernetes Service` to provide a flexible orchestration layer
  that will allow us to deploy, scale, monitor, and repair the application with relatively low effort. Updates with EKS
  can be orchestrated such that they are delivered without downtime to consumers.
* For asynchronous event flows we'll use `Apache Kafka`. This will allow us to handle a very large event volume with low
  performance overhead. Events can be processed as cluster capacity allows and will not be dropped in the event
  of an application availability issue.
* For observability, we'll use `Prometheus` and `Grafana` in the application to provide metrics. For logs we'll use `Grafana
  Loki`. This will allow us to see how the application is performing as well as identify any issues as they arise.
* To provide large language and embedding models we can host `ollama` on top of GPU-equipped EKS nodes. Models can be
  distributed via persistent volumes and pre-loaded into vRAM with an init container. Autoscalers can be
  used to scale specific model versions up and down based on demand. This doesn't preclude using LLM-as-a-service
  providers.
* Persistent data storage will be done via `PostgreSQL` hosted on top of `Amazon Relational Database Service`. The
  `pgvector` extension will be used to provide efficient vector searches over embeddings (a sketch of how this could
  look follows this list).
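
A minimal sketch of how the embedding store could look with `pgvector` follows. The `documents` table, its columns (including a `model` column, which the embeddings-migration section below relies on), the 768-dimension vectors, and `psycopg` as the driver are all assumptions made for illustration.

```python
# Minimal sketch (assumptions: psycopg 3 as the driver, a 768-dimension
# embedding model, and an illustrative "documents" table; not a final schema).
import psycopg


def init_schema(conn: psycopg.Connection) -> None:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id        bigserial PRIMARY KEY,
            source    text NOT NULL,    -- e.g. a git path or Jira ticket key
            content   text NOT NULL,
            model     text NOT NULL,    -- embedding model that produced the vector
            embedding vector(768) NOT NULL
        )
        """
    )


def search(conn: psycopg.Connection, query_embedding: list[float], model: str, k: int = 5):
    # Cosine-distance search (the <=> operator), restricted to rows produced by
    # the same embedding model that produced the query vector.
    literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return conn.execute(
        """
        SELECT id, source, content
        FROM documents
        WHERE model = %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (model, literal, k),
    ).fetchall()
```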

## Scaling Considerations

The system can dynamically scale based on load. A horizontal pod autoscaler can be used on each component of the system
to allow the system to scale up or down based on the current load. For "compute" instances CPU utilization can be used
to determine when to scale. For "GPU" instances an external metric measuring GPU utilization can be used to determine
when scaling operations are appropriate. For efficient usage of GPU vRAM and load spreading, models can be packaged
together such that they saturate most of the available vRAM, while each package can still be scaled independently.
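
As a sketch of the GPU-driven scaling described above, the snippet below uses the Kubernetes Python client to create an autoscaling/v2 HPA driven by an external GPU-utilization metric. The metric name (`DCGM_FI_DEV_GPU_UTIL`), the `agent-model-server` Deployment, and the thresholds are assumptions; the real names would depend on the metrics adapter that ends up being deployed.

```python
# Minimal sketch (assumptions: the Kubernetes Python client, an external
# metrics adapter exposing "DCGM_FI_DEV_GPU_UTIL", and an illustrative
# "agent-model-server" Deployment; names and thresholds are not part of this design).
from kubernetes import client, config


def create_gpu_hpa(namespace: str = "default") -> None:
    config.load_kube_config()
    hpa = client.V2HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="agent-model-server-gpu"),
        spec=client.V2HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V2CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="agent-model-server"
            ),
            min_replicas=1,
            max_replicas=8,
            metrics=[
                client.V2MetricSpec(
                    type="External",
                    external=client.V2ExternalMetricSource(
                        metric=client.V2MetricIdentifier(name="DCGM_FI_DEV_GPU_UTIL"),
                        # Scale out when average GPU utilization exceeds ~80%.
                        target=client.V2MetricTarget(type="AverageValue", average_value="80"),
                    ),
                )
            ],
        ),
    )
    client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(namespace, hpa)
```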

In order to accommodate load bursts the system will operate largely asynchronously. Boundaries between system components
will be buffered with Kafka to allow the components to consume only what they're able to handle, without data getting
lost or the need for complex retry mechanisms. If the number of models gets large, a proxy could be developed that
dynamically routes requests to the appropriate backend with the appropriate model pre-loaded.
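
Concretely, this buffering amounts to each consumer pulling events at its own pace and committing offsets only after the work completes, so unprocessed events simply stay in Kafka. Below is a minimal sketch assuming the confluent-kafka client, a `tasks.created` topic, an `indexer` consumer group, and a hypothetical `handle_task()`.

```python
# Minimal sketch (assumptions: confluent-kafka client, a "tasks.created" topic,
# an "indexer" consumer group, and a hypothetical handle_task()).
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # assumed broker address
    "group.id": "indexer",
    "enable.auto.commit": False,         # commit only after the work is done
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["tasks.created"])


def handle_task(task: dict) -> None:
    ...  # gather context, write to the vector store, etc. (placeholder)


while True:
    msg = consumer.poll(timeout=1.0)     # pull only when ready for more work
    if msg is None or msg.error():
        continue
    handle_task(json.loads(msg.value()))
    consumer.commit(msg)                 # unprocessed events stay in Kafka on failure
```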

## Testing / Model Migration Strategy

An essential property of any AI-based system is the ability to measure the performance of the system over time. This is
important to ensure that models can be safely migrated as the market evolves and better models are released.

A simple approach to measuring performance over time is to create a representative set of example tasks that is run
whenever changes are introduced. Performance should be measured against the baseline on a number of different metrics
(see the sketch after this list), such as:

* Number of agentic iterations required to solve the task (fewer is better).
* Amount of code / responses generated (less is better).
* Success rate (higher is better).
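
A sketch of how these metrics could be computed and compared against a baseline is below; the `TaskResult` fields and the regression rule are illustrative assumptions, not a defined evaluation harness.

```python
# Minimal sketch (the TaskResult fields and the comparison logic are
# illustrative assumptions, not a defined evaluation harness).
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    task_id: str
    iterations: int        # agentic iterations used to solve the task
    generated_chars: int   # amount of code / responses generated
    succeeded: bool


def summarize(results: list[TaskResult]) -> dict[str, float]:
    return {
        "mean_iterations": mean(r.iterations for r in results),
        "mean_generated_chars": mean(r.generated_chars for r in results),
        "success_rate": sum(r.succeeded for r in results) / len(results),
    }


def regressed(candidate: dict[str, float], baseline: dict[str, float]) -> bool:
    # Fewer iterations, less output, and a higher success rate are better.
    return (
        candidate["success_rate"] < baseline["success_rate"]
        or candidate["mean_iterations"] > baseline["mean_iterations"]
        or candidate["mean_generated_chars"] > baseline["mean_generated_chars"]
    )
```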

Over time, as problematic areas are identified, new tasks should be introduced to the set to improve its coverage.

## Migrating / Managing Embeddings

One particularly sensitive area for model migration is embedding models. Better models are routinely published,
but it is expensive to re-index data, especially if the volumes are large.

The vector database should store the model that produced each embedding. When new embedding models are introduced the
indexer should use the new embedding model to index new content, but should allow old content to be searched using old
models. If models are permanently retired the content should be re-indexed with a supported embedding model. The vector
database should allow the same content to be indexed with multiple models at the same time.
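
A sketch of the retirement path is below, reusing the illustrative `documents` table from the earlier `pgvector` example. The `embed` callable is a hypothetical hook for whatever embedding backend is in use, and a real re-indexing job would be batched and rate-limited.

```python
# Minimal sketch (assumptions: the illustrative "documents" table from the
# earlier pgvector example and a hypothetical embed callable; real re-indexing
# would be batched and rate-limited).
from typing import Callable

import psycopg


def reindex_retired(
    conn: psycopg.Connection,
    retired_model: str,
    replacement_model: str,
    embed: Callable[[str, str], list[float]],  # hypothetical embedding backend hook
) -> None:
    rows = conn.execute(
        "SELECT id, source, content FROM documents WHERE model = %s",
        (retired_model,),
    ).fetchall()
    for _doc_id, source, content in rows:
        vector = embed(content, replacement_model)
        literal = "[" + ",".join(str(x) for x in vector) + "]"
        # The same content may be indexed under both models for a while; the
        # retired model's rows are dropped once nothing queries them.
        conn.execute(
            "INSERT INTO documents (source, content, model, embedding) "
            "VALUES (%s, %s, %s, %s::vector)",
            (source, content, replacement_model, literal),
        )
    conn.execute("DELETE FROM documents WHERE model = %s", (retired_model,))
    conn.commit()
```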

## Data Flow



## Sequence Diagram

