Building askgit: semantic search for GitHub repositories

Fri, April 3, 2026 - 6 min read

Problem

Keyword search is useful when you already know what a symbol, filename, or package is called. It is much less helpful when the question is closer to intent than exact syntax.

That gap shows up quickly when working in unfamiliar repositories. You might want to ask where authentication is enforced, how a job is scheduled, or which part of a system is responsible for chunking and indexing content. Those are natural questions, but they do not always map to a single identifier or exact string.

askgit is an attempt to make that mode of repository exploration practical. The project clones a GitHub repository, chunks the codebase with strategies that depend on file type, generates embeddings, stores the results in PostgreSQL with pgvector, and exposes search through an MCP server so assistant tooling can query it directly.

What The Repository Includes

At a high level, the repository contains four important pieces:

  • ingestion logic for cloning and processing repositories
  • code and document chunking logic for different file types
  • embedding and storage services built around PostgreSQL and pgvector
  • an MCP server that exposes semantic search to assistant workflows

It also includes the surrounding operational pieces needed to run the system locally: environment configuration, Docker Compose for PostgreSQL, Alembic migrations, and a small example script for agent integration.

Architecture Overview

The architecture is intentionally straightforward.

  1. A target GitHub repository is cloned locally.
  2. Files are filtered and split into chunks using a strategy that fits the language or document type.
  3. Embeddings are generated through LiteLLM.
  4. Chunks and vectors are stored in PostgreSQL using pgvector.
  5. Queries are embedded and matched through vector similarity search.
  6. The MCP server exposes those search capabilities to external assistants.
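
Strung together, the steps above can be sketched as one orchestration function. Every helper name below is hypothetical and stands in for the corresponding askgit component; this is a shape sketch, not the project's actual API.

```python
# Illustrative orchestration of the six pipeline steps. All helpers
# (clone_repo, filter_files, chunk_file, embed_chunks, store_chunks,
# nearest_chunks) are hypothetical placeholders for askgit's components.

def index_repository(repo_url: str) -> None:
    path = clone_repo(repo_url)                          # 1. clone locally
    files = filter_files(path)                           # 2. filter files
    chunks = [c for f in files for c in chunk_file(f)]   # 2. split per file type
    vectors = embed_chunks(chunks)                       # 3. embeddings via LiteLLM
    store_chunks(chunks, vectors)                        # 4. persist in pgvector

def search(query: str, limit: int = 5):
    query_vector = embed_chunks([query])[0]              # 5. embed the query
    return nearest_chunks(query_vector, limit)           # 5. similarity search
```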

The value is not in a large number of moving parts. It is in choosing a pipeline that is simple enough to operate but still smart enough to produce useful chunks for retrieval.

Main Components

Repository ingestion

The ingestion path is fairly simple on paper, but it sets the baseline for everything that comes after it. It is responsible for cloning a repository and turning it into a stream of files that the rest of the pipeline can work with. If that step pulls in the wrong files, skips important directories, or loses too much structural context too early, the quality of the search results drops quickly.
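
The filtering half of that step can be sketched with the standard library alone. The skip lists and extension set here are assumptions for illustration, not askgit's actual configuration.

```python
# Illustrative file filter for the ingestion step. SKIP_DIRS and
# TEXT_EXTENSIONS are assumed values, not askgit's real configuration.
from pathlib import Path

SKIP_DIRS = {".git", "node_modules", "__pycache__", "dist", "build"}
TEXT_EXTENSIONS = {".py", ".js", ".ts", ".go", ".md", ".txt", ".rst"}

def iter_source_files(root: Path):
    """Walk a cloned repository and yield only files worth indexing."""
    for path in sorted(root.rglob("*")):
        # Skip anything inside vendored or generated directories.
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.is_file() and path.suffix in TEXT_EXTENSIONS:
            yield path
```

Getting this boundary right early is what keeps binary blobs and vendored dependencies out of the index downstream.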

Chunking strategy

Most of the retrieval quality is decided here.

Instead of treating every file the same way, askgit uses a few chunking strategies depending on the content:

  • AST-aware chunking for languages where syntax structure matters and can be parsed reliably
  • language-specific code splitting for languages supported well by parser tooling
  • semantic chunking for markdown, text, and other files without reliable parser support

Treating every file as if it were just plain text with a different extension turns out to be too crude. Fixed-size chunks are easy to implement, but they cut straight through function boundaries and usually lose too much context. Semantic chunking helps for prose, but it is not enough for code where functions, classes, and module boundaries carry most of the meaning. AST-aware splitting costs a bit more, but it gives back chunks that line up much better with how source code is actually read.
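
The routing between those strategies can be sketched in a few lines. The extension sets below are assumptions for illustration; askgit's actual mapping may differ.

```python
# Dispatch sketch: choose a chunking strategy by file extension.
# The extension sets are illustrative assumptions, not askgit's mapping.

AST_LANGUAGES = {".py"}                         # reliable AST parsing available
PARSER_LANGUAGES = {".js", ".ts", ".go", ".java"}  # good splitter tooling

def pick_strategy(suffix: str) -> str:
    """Return the chunking strategy name for a file extension."""
    if suffix in AST_LANGUAGES:
        return "ast"
    if suffix in PARSER_LANGUAGES:
        return "language"
    return "semantic"  # markdown, plain text, everything else
```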

Embedding service

Embeddings are generated through LiteLLM, which keeps the provider boundary flexible. For a project like this, that matters because it keeps the focus on indexing and retrieval quality instead of tying the whole system too tightly to a single embedding backend.
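
A minimal sketch of that step, assuming LiteLLM's provider-agnostic `embedding()` call; the model name and batch size here are placeholder choices, not values from the repository.

```python
# Embedding sketch through LiteLLM. The model name and batch size are
# assumptions; swapping providers only changes the model string.

def batched(items, size):
    """Yield fixed-size batches so large repositories stay under provider limits."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_chunks(texts, model="text-embedding-3-small", batch_size=64):
    # Requires the `litellm` package; imported here so the sketch stays
    # self-contained when LiteLLM is not installed.
    import litellm

    vectors = []
    for batch in batched(texts, batch_size):
        response = litellm.embedding(model=model, input=batch)
        vectors.extend(item["embedding"] for item in response.data)
    return vectors
```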

Storage and retrieval

pgvector on PostgreSQL felt like the right tradeoff for this stage of the project. It is familiar, easy to run locally, and good enough to support the search workflow without adding another datastore just for vectors. For a developer tool that starts as a local or small shared service, that simplicity matters quite a bit.
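
The storage side reduces to ordinary SQL. The table and column names below are assumptions, and the vector dimension must match whatever embedding model is configured; only the pgvector pieces (`VECTOR` type, `<=>` cosine-distance operator) are fixed.

```python
# SQL sketch for the pgvector storage and retrieval step. Schema names
# are hypothetical; VECTOR(1536) assumes a 1536-dimensional model.

CREATE_TABLE = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        BIGSERIAL PRIMARY KEY,
    repo      TEXT NOT NULL,
    path      TEXT NOT NULL,
    content   TEXT NOT NULL,
    embedding VECTOR(1536) NOT NULL
);
"""

# <=> is pgvector's cosine distance, so 1 - distance gives a similarity score.
SEARCH = """
SELECT path, content, 1 - (embedding <=> %(query)s::vector) AS score
FROM chunks
WHERE repo = %(repo)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s;
"""
```

Keeping relational metadata (repo, path) next to the vectors is what makes per-repository scoping a plain `WHERE` clause instead of a second system.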

MCP server

The MCP layer is what turns askgit from a standalone indexing script into something that can participate in an assistant workflow. Once the search functionality is exposed as tools, it becomes much easier to use the repository index from the same place where the questions are already being asked.
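
A sketch of what that exposure looks like with the MCP Python SDK's FastMCP helper. The server name, tool signature, and `run_similarity_search` helper are all assumptions, not askgit's actual code.

```python
# Hypothetical sketch of exposing semantic search as an MCP tool.
# Names here (server, tool, helper) are illustrative assumptions.

def format_hits(hits: list[dict]) -> str:
    """Render similarity-search hits as a compact text block for an assistant."""
    lines = []
    for hit in hits:
        lines.append(f"{hit['path']}:{hit['start_line']} (score {hit['score']:.2f})")
        lines.append(hit["content"])
    return "\n".join(lines)

def build_server():
    # Requires the `mcp` package; imported lazily so the sketch stays
    # importable without it.
    from mcp.server.fastmcp import FastMCP

    server = FastMCP("askgit")

    @server.tool()
    def search_code(query: str, limit: int = 5) -> str:
        """Embed the query and return the most similar indexed chunks."""
        hits = run_similarity_search(query, limit)  # hypothetical helper
        return format_hits(hits)

    return server
```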

Key Implementation Choices

Why multiple chunking strategies

If there is one design choice that carries most of the weight in askgit, it is this one.

Code is not just text, and the retrieval pipeline works better when it reflects that. Different languages and file types benefit from different splitting strategies. AST-based chunking helps preserve complete functions, methods, and classes. Language-aware splitters still keep more structure than plain fixed windows when full AST support is not available. Semantic chunking then covers markdown and plain text so the non-code parts of the repository remain useful too.
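
For Python source, the core of AST-aware chunking fits in a few lines using the standard library's `ast` module. askgit may rely on richer parser tooling, so treat this as a minimal sketch of the idea: one chunk per top-level function or class, never a cut through the middle of either.

```python
# Minimal AST-aware chunking for Python source using the stdlib `ast`
# module. A sketch of the technique, not askgit's actual implementation.
import ast

def ast_chunks(source: str) -> list[str]:
    """Split a module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks
```

The payoff is that every retrieved chunk is a complete, readable unit, which is exactly how a developer would quote the code back to themselves.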

Why PostgreSQL plus pgvector

For a system like this, PostgreSQL is a very reasonable default:

  • it is familiar
  • it is easy to run locally
  • it keeps relational metadata and vectors together
  • it avoids introducing a separate vector store too early

That does not make PostgreSQL the universal answer, but for a project like this it keeps the whole system approachable and easy to run.

Why expose it through MCP

MCP is a good fit here because a retrieval engine on its own is rarely the end goal. What usually matters is repository search as one capability inside a larger assistant workflow. MCP makes that integration much more natural.

Running The Project

The happy path is intentionally short:

  1. install dependencies with uv
  2. copy .env.template to .env
  3. start PostgreSQL with Docker Compose
  4. run the Alembic migrations
  5. ingest a repository
  6. start the MCP server

The exact commands are already in the repository README, which is where the operational detail belongs. For the purposes of this post, the more important point is that askgit is structured like something you can actually run and inspect, not just something that sounds plausible in a diagram.

Tradeoffs And Limitations

The current version is a strong technical foundation, but it is still early in a few ways.

  • retrieval quality is still bounded by chunk quality and embedding quality
  • a local-first deployment path is practical, but not the same as a production-ready hosted service
  • there is room for stronger evaluation around search quality and ranking behavior
  • hybrid retrieval, metadata filtering, and freshness workflows would make the system more robust

None of those limitations are especially surprising for a project at this stage. The more important point is that the repository already supports a real workflow and provides a solid base for improving the retrieval side over time.

What I Would Improve Next

If I were pushing this further, the next steps would be:

  • add a lightweight evaluation suite for retrieval quality
  • support better metadata filters and repository scoping
  • improve re-indexing and freshness management
  • experiment with hybrid retrieval instead of pure vector similarity
  • add a clearer deployment story for a shared hosted version
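
For the hybrid-retrieval item, one common starting point is reciprocal rank fusion, which merges a keyword ranking and a vector ranking without needing their scores to be comparable. This is a generic sketch of the technique, not something askgit implements today.

```python
# Reciprocal rank fusion: combine multiple rankings (e.g. keyword search
# and vector search) into one. Generic sketch, not askgit code.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; k=60 is the conventional constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each appearance contributes 1 / (k + position), so items
            # ranked highly in any list float toward the top.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```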

Closing

askgit is the kind of project this blog is meant to focus on: a practical tool with a clear job, a public repository, and a few design decisions that are worth unpacking properly.

If you want to try it or build on it, start with the linked repository above.