RAG Agent with LangChain
Build a retrieval-augmented generation (RAG) agent that runs entirely on your laptop — Postgres with pgvector for embeddings, Redis for response caching, and LangChain orchestrating it all. No cloud vector database required.
What you'll build
A FastAPI service that:
- Ingests documents and stores embeddings in Postgres via pgvector
- Retrieves relevant context for user queries using semantic search
- Generates answers with an LLM using the retrieved context
- Caches frequent queries in Redis for sub-second responses
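Semantic search here means nearest-neighbor lookup over embedding vectors. A toy sketch of the idea with hand-made 3-dimensional vectors — the real service uses high-dimensional OpenAI embeddings, and pgvector performs the equivalent comparison inside Postgres:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, ~0 for unrelated ones
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" — illustrative values only
docs = {"cats": [0.9, 0.1, 0.0], "finance": [0.0, 0.2, 0.9]}
query = [0.8, 0.2, 0.1]

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # → cats
```

pgvector exposes the same operation as a SQL distance operator, so retrieval stays inside Postgres rather than in application code.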
┌─────────────┐     ┌──────────────────┐     ┌──────────────┐
│   Browser   │────▶│   FastAPI + LC   │────▶│  OpenAI API  │
│    :8000    │◀────│    RAG Agent     │◀────│  (external)  │
└─────────────┘     └───┬──────────┬───┘     └──────────────┘
                        │          │
                   ┌────▼───┐  ┌───▼──┐
                   │Postgres│  │Redis │
                   │pgvector│  │cache │
                   └────────┘  └──────┘
Project structure
rag-agent/
├── Dockerfile
├── requirements.txt
├── main.py              # FastAPI app + RAG chain
├── ingest.py            # Document ingestion script
└── .github/
    └── workflows/
        └── dev-deploy.yml
requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.30.0
langchain==0.3.0
langchain-openai==0.2.0
langchain-postgres==0.0.12
pgvector==0.3.5
psycopg2-binary==2.9.9
redis==5.0.0
python-multipart==0.0.9
Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
main.py
import os
import hashlib
import json

import redis
from fastapi import FastAPI, UploadFile
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_postgres import PGVector
from langchain.chains import RetrievalQA
from langchain_text_splitters import RecursiveCharacterTextSplitter

app = FastAPI(title="RAG Agent")

# Connection URLs are auto-injected by kindling
DATABASE_URL = os.environ["DATABASE_URL"]
REDIS_URL = os.environ["REDIS_URL"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Initialize components
embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
vectorstore = PGVector(
    connection=DATABASE_URL,
    embeddings=embeddings,
    collection_name="documents",
)
llm = ChatOpenAI(model="gpt-4o-mini", api_key=OPENAI_API_KEY)
cache = redis.from_url(REDIS_URL)

@app.post("/ingest")
async def ingest(file: UploadFile):
    """Split a document into chunks and store embeddings."""
    content = (await file.read()).decode("utf-8")
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(content)
    vectorstore.add_texts(chunks, metadatas=[{"source": file.filename}] * len(chunks))
    return {"chunks": len(chunks), "source": file.filename}

@app.get("/ask")
async def ask(q: str):
    """Answer a question using RAG with Redis caching."""
    cache_key = f"rag:{hashlib.sha256(q.encode()).hexdigest()[:16]}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    )
    result = qa.invoke({"query": q})
    response = {"question": q, "answer": result["result"]}
    cache.setex(cache_key, 3600, json.dumps(response))
    return response

@app.get("/health")
async def health():
    return {"status": "ok"}
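The /ask endpoint keys its Redis cache on a truncated SHA-256 of the question text, so identical questions of any length map to a short, fixed-size key. Isolated from the endpoint, the scheme looks like this:

```python
import hashlib

def cache_key(question: str) -> str:
    # Deterministic key: "rag:" prefix + first 16 hex chars of the SHA-256 digest
    return f"rag:{hashlib.sha256(question.encode()).hexdigest()[:16]}"

k1 = cache_key("what does this project do")
k2 = cache_key("what does this project do")
print(k1 == k2)  # → True: same question always hits the same cache entry
```

Truncating to 16 hex characters (64 bits) keeps keys compact; the collision risk is negligible at any realistic cache size.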
kindling setup
1. Store your API key
kindling secrets set OPENAI_API_KEY sk-your-key-here
2. Deploy with the workflow
The dependencies block is where kindling shines — Postgres and Redis
are auto-provisioned with zero configuration, and connection URLs are
injected into your container automatically.
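Inside the container the app simply reads `DATABASE_URL` and `REDIS_URL` from its environment. The values below are hypothetical — the actual hosts and credentials are whatever kindling provisions — but they show the shape the app expects:

```shell
# Illustrative only; real values are injected by kindling at deploy time
DATABASE_URL=postgresql://app:secret@postgres:5432/app
REDIS_URL=redis://redis:6379/0
```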
# .github/workflows/dev-deploy.yml
name: dev-deploy

on:
  push:
    branches: [main]
  workflow_dispatch:

env:
  REGISTRY: registry:5000
  TAG: ${{ github.actor }}-${{ github.sha }}

jobs:
  deploy:
    runs-on: [self-hosted, "${{ github.actor }}"]
    steps:
      - uses: actions/checkout@v4
      - run: rm -rf /builds/*
      - name: Build RAG agent image
        uses: kindling-sh/kindling/.github/actions/kindling-build@main
        with:
          name: rag-agent
          context: ${{ github.workspace }}
          image: "${{ env.REGISTRY }}/rag-agent:${{ env.TAG }}"
      - name: Deploy RAG agent
        uses: kindling-sh/kindling/.github/actions/kindling-deploy@main
        with:
          name: ${{ github.actor }}-rag-agent
          image: "${{ env.REGISTRY }}/rag-agent:${{ env.TAG }}"
          port: "8000"
          ingress-host: "${{ github.actor }}-rag.localhost"
          health-check-path: "/health"
          dependencies: |
            - type: postgres
            - type: redis
          env: |
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: kindling-secret-openai-api-key
                  key: value
3. Try it
# Ingest a document
curl -F "file=@README.md" http://<you>-rag.localhost/ingest
# Ask a question
curl "http://<you>-rag.localhost/ask?q=what+does+this+project+do"
Iterate with sync
Edit main.py locally — change the chunk size, swap the retriever,
add a reranker — and see it live in seconds:
kindling sync -n <you>-rag-agent -d .
# Edit main.py → changes appear instantly
# Ctrl+C → deployment rolls back
Why local matters for RAG
- Embedding latency — pgvector runs on localhost, so vector search is single-digit milliseconds instead of a cloud round-trip
- Free iteration — tune chunk sizes, overlap, retrieval k values, and prompt templates without burning cloud credits
- Data stays local — ingest proprietary docs without sending them to a third-party vector database
- Full observability — kindling logs shows exactly what's happening in Postgres, Redis, and your app simultaneously
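Chunk size and overlap are the first retrieval knobs worth tuning. A toy fixed-window chunker — deliberately simpler than LangChain's RecursiveCharacterTextSplitter, which also prefers paragraph and sentence boundaries — shows how the two parameters interact:

```python
def chunk(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    # Each window starts (chunk_size - chunk_overlap) characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk("x" * 1200)
print([len(c) for c in chunks])  # → [500, 500, 300]
```

Larger chunks mean fewer, broader retrieval hits; more overlap preserves context across boundaries at the cost of extra embeddings to store and search.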
Next steps
- Add Jaeger for tracing LangChain spans: add - type: jaeger to the dependencies block
- Use kindling expose to test with OAuth-protected endpoints
- Scale up to a multi-service architecture with a separate ingestion worker