pgvector + Django: Semantic Search for ESG Document Analysis
By Wingston Sharon | December 2024
At some point in building Agentosaurus, we had to make a call: add a dedicated vector database, or figure out how to do semantic search inside PostgreSQL. Pinecone, Weaviate, Qdrant: all viable, all well-documented. But we already had PostgreSQL. Our Django ORM talks to it. Our Celery tasks write to it. Adding a second database for vectors meant a second thing to deploy, a second thing to back up, a second thing to debug at 2am.
We chose pgvector. Six months later I still think it was the right call for our scale.
The Problem We Were Solving
We crawl organizational websites and sustainability reports. For each organization, we have: a plain-text description, mission statements, extracted sustainability claims, and, when available, PDF reports chunked into paragraphs.
We need to answer questions like:
- "Which organizations are doing work similar to this one?" (similarity search over descriptions)
- "Which other organizations have made claims like this one?" (claim deduplication and contradiction detection)
- "Given this investor's impact thesis, which organizations align?" (semantic matching without keyword overlap)
None of these are solvable with ILIKE or full-text search alone. You need embeddings.
Installing pgvector
On the PostgreSQL server (we run PostgreSQL 15 on Ubuntu on Oracle Cloud):
sudo apt install postgresql-15-pgvector
Then in a migration or directly in psql:
CREATE EXTENSION IF NOT EXISTS vector;
If you're on a managed PostgreSQL (Supabase, Neon, etc.), pgvector is usually available as an extension you can enable from the dashboard. On RDS you need PostgreSQL 15+ and the pgvector parameter group option.
Verify it's working:
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
-- vector | 0.7.0
The Django Model
We use pgvector-python for the Django integration:
pip install pgvector
The model for organization embeddings:
# agentosaurus/models.py
from django.db import models
from pgvector.django import VectorField
class OrganizationEmbedding(models.Model):
    organization = models.OneToOneField(
        'Organization',
        on_delete=models.CASCADE,
        related_name='embedding',
    )

    # nomic-embed-text outputs 768-dimensional vectors
    description_vector = VectorField(dimensions=768, null=True, blank=True)
    claims_vector = VectorField(dimensions=768, null=True, blank=True)

    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        app_label = 'agentosaurus'
The migration Django generates will create a proper vector(768) column in PostgreSQL. Nothing unusual needed.
For document chunks from PDFs, we use a separate model:
class DocumentChunk(models.Model):
    organization = models.ForeignKey(
        'Organization',
        on_delete=models.CASCADE,
        related_name='chunks',
    )
    source_url = models.URLField()
    chunk_index = models.IntegerField()
    text = models.TextField()
    embedding = VectorField(dimensions=768, null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        app_label = 'agentosaurus'
        indexes = [
            models.Index(fields=['organization', 'chunk_index']),
        ]
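The chunking itself can stay simple. Here's a minimal sketch of a character-window chunker with overlap, so sentences that straddle a boundary appear in both neighboring chunks; the function name and parameter defaults are illustrative, not our exact implementation:

```python
def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    with `overlap` characters repeated at each boundary."""
    text = text.strip()
    if not text:
        return []
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # step back by `overlap` so the next chunk repeats the tail
        start = end - overlap
    return chunks
```

Each chunk then gets its own DocumentChunk row, with chunk_index recording its position in the source document.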
Generating Embeddings
We use nomic-embed-text via Ollama. This was our third embedding model; I'll explain the path below. The generation function:
import httpx
from django.conf import settings

EMBED_URL = f"http://{settings.OLLAMA_HOST}:11434/api/embeddings"

def generate_embedding(text: str) -> list[float] | None:
    """Generate a 768-dim embedding using nomic-embed-text via Ollama.

    Returns None for empty input.
    """
    if not text or not text.strip():
        return None
    # nomic-embed-text wants a task prefix for retrieval tasks.
    # Default to search_document:, but leave the text alone if the
    # caller already supplied a prefix (e.g. search_query:), so we
    # never double-prefix.
    if text.startswith(("search_document:", "search_query:")):
        prefixed = text[:4096]
    else:
        prefixed = f"search_document: {text[:4096]}"
    response = httpx.post(
        EMBED_URL,
        json={"model": "nomic-embed-text", "prompt": prefixed},
        timeout=30.0,
    )
    response.raise_for_status()
    return response.json()["embedding"]
def embed_organization(org_id: int) -> None:
    """Embed an organization's description and store it in the DB."""
    from agentosaurus.models import Organization, OrganizationEmbedding

    org = Organization.objects.get(id=org_id)
    description = f"{org.name}. {org.mission}. {org.description}"
    vector = generate_embedding(description)
    if vector is None:
        return
    OrganizationEmbedding.objects.update_or_create(
        organization=org,
        defaults={"description_vector": vector},
    )
This runs as a Celery task after each organization crawl completes.
Similarity Search in Django ORM
With pgvector-python, you get Django ORM annotations for distance functions:
from pgvector.django import CosineDistance
from agentosaurus.models import Organization, OrganizationEmbedding
def find_similar_organizations(org_id: int, limit: int = 10):
    """Find organizations semantically similar to the given one."""
    source_embedding = OrganizationEmbedding.objects.get(
        organization_id=org_id
    )
    query_vector = source_embedding.description_vector

    results = (
        OrganizationEmbedding.objects
        .exclude(organization_id=org_id)
        .annotate(distance=CosineDistance('description_vector', query_vector))
        .filter(distance__lt=0.3)  # cosine distance < 0.3 = fairly similar
        .order_by('distance')
        .select_related('organization')[:limit]
    )
    return [
        {
            "organization": r.organization,
            "similarity_score": round(1 - r.distance, 4),
        }
        for r in results
    ]
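To make the threshold concrete: pgvector's cosine distance is 1 minus cosine similarity, which is why the code reports 1 - r.distance as the score, and why distance < 0.3 means similarity > 0.7. A quick pure-Python check of the relationship (this mirrors what the <=> operator computes, not code we run in production):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> operator defines it:
    1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# identical direction -> distance 0; orthogonal -> distance 1
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```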
For query-based search (e.g., investor thesis matching):
def semantic_search(query: str, limit: int = 20) -> list:
    """Search organizations by semantic similarity to a free-text query."""
    # use the retrieval query prefix for nomic-embed-text
    query_vector = generate_embedding(f"search_query: {query}")
    if query_vector is None:
        return []

    results = (
        OrganizationEmbedding.objects
        .annotate(distance=CosineDistance('description_vector', query_vector))
        .filter(distance__lt=0.4)
        .order_by('distance')
        .select_related('organization')[:limit]
    )
    return list(results)
Indexing: IVFFlat vs HNSW
By default, pgvector does exact nearest-neighbor search: it scans all vectors. This is fine up to ~50k vectors. Beyond that, you want an approximate nearest-neighbor index.
We started with IVFFlat and moved to HNSW. Here's why.
IVFFlat divides the vector space into clusters (lists) and searches only the nearest clusters. It's faster than exact search, but you have to choose the lists parameter up front, and the clustering happens at index build time:
CREATE INDEX org_embedding_ivfflat_idx
ON agentosaurus_organizationembedding
USING ivfflat (description_vector vector_cosine_ops)
WITH (lists = 100);
The problem: IVFFlat needs to be built on populated data. If you build it on an empty table (common in migrations), recall is poor. You have to rebuild it once data exists.
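For choosing lists, the pgvector README suggests roughly rows / 1000 for up to 1M rows and sqrt(rows) beyond that. A small helper encoding that rule of thumb (the function is ours, not part of pgvector):

```python
import math

def suggested_ivfflat_lists(row_count: int) -> int:
    """Rule of thumb from the pgvector README: rows/1000 up to
    1M rows, sqrt(rows) above that, with a small floor."""
    if row_count <= 1_000_000:
        return max(10, row_count // 1000)
    return int(math.sqrt(row_count))

print(suggested_ivfflat_lists(15_000))     # 15
print(suggested_ivfflat_lists(4_000_000))  # 2000
```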
HNSW (Hierarchical Navigable Small World) is a graph-based index with better recall and no "train on data" requirement:
CREATE INDEX org_embedding_hnsw_idx
ON agentosaurus_organizationembedding
USING hnsw (description_vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
HNSW builds slower and uses more memory, but query recall is significantly better, especially on our dataset size (~15k organizations). We use HNSW now.
In a Django migration:
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [
        ('agentosaurus', '0024_organizationembedding'),
    ]

    operations = [
        migrations.RunSQL(
            sql="""
                CREATE INDEX IF NOT EXISTS org_embedding_hnsw_idx
                ON agentosaurus_organizationembedding
                USING hnsw (description_vector vector_cosine_ops)
                WITH (m = 16, ef_construction = 64);
            """,
            reverse_sql="DROP INDEX IF EXISTS org_embedding_hnsw_idx;",
        ),
    ]
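At query time, HNSW recall can also be tuned per transaction with the hnsw.ef_search setting (pgvector's default is 40): higher values scan more of the graph for better recall at the cost of latency. A hedged sketch that builds the SET LOCAL statement as a string, which you could then execute through Django's connection.cursor() inside a transaction (the helper and its bounds check are ours):

```python
def hnsw_ef_search_sql(ef_search: int) -> str:
    """Build a SET LOCAL statement for pgvector's hnsw.ef_search.
    SET LOCAL scopes the change to the current transaction only."""
    if not (1 <= ef_search <= 1000):
        raise ValueError("ef_search out of range")
    return f"SET LOCAL hnsw.ef_search = {ef_search};"

# Usage inside Django (sketch):
#   from django.db import connection, transaction
#   with transaction.atomic(), connection.cursor() as cur:
#       cur.execute(hnsw_ef_search_sql(100))
#       # run the similarity query within the same transaction

print(hnsw_ef_search_sql(100))  # SET LOCAL hnsw.ef_search = 100;
```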
Embedding Model Journey
We went through three models before landing on nomic-embed-text:

- all-MiniLM-L6-v2 (384 dimensions, sentence-transformers): fast, but semantic precision on ESG/sustainability language was poor. Too many false positives on generic "we care about the planet" content.
- mxbai-embed-large (1024 dimensions): better quality, but we were storing 1024-float vectors for every chunk, the table was ballooning, and query time was slower.
- nomic-embed-text (768 dimensions): better ESG domain understanding than MiniLM, manageable size, and the search_document:/search_query: prompt prefixes actually matter: they shift the embedding space for retrieval versus storage tasks. This is the one we've stuck with.
Switching embedding models means re-embedding everything, which takes time. Plan your model choice early.
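When we do re-embed, we run it in batches from a management command so a crash can resume partway through rather than starting over. The Django and Celery wiring aside, the batching logic is just this (generator name and batch size are illustrative):

```python
from typing import Iterable, Iterator

def batched(ids: Iterable[int], batch_size: int = 100) -> Iterator[list[int]]:
    """Yield lists of at most batch_size ids, preserving order,
    so each batch can be embedded and committed independently."""
    batch: list[int] = []
    for item in ids:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

print(list(batched(range(5), batch_size=2)))  # [[0, 1], [2, 3], [4]]
```

Each batch calls embed_organization per id, then records the last completed id as a checkpoint.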
The ESG Use Case: Finding Contradictions
Beyond similarity search, we use embeddings to flag potential contradictions in sustainability claims. The logic: embed individual claims, then find claims that are semantically close but in opposite directions.
This is an area under active development. We're currently testing a system that retrieves similar claims from other organizations and passes them to an LLM with the instruction "identify if these claims contradict each other." It works surprisingly well for catching greenwashing patterns like "net zero by 2030" paired with "expanding fossil fuel operations."
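The LLM step is the part still in flux, but the retrieval side is plain pgvector: embed the claim, pull the nearest claims, and hand the pairs to the model. A hedged sketch of the prompt assembly only; the wording and function name are illustrative, not our production prompt:

```python
def build_contradiction_prompt(claim: str, similar_claims: list[str]) -> str:
    """Assemble an LLM prompt asking whether any retrieved claim
    contradicts the source claim."""
    numbered = "\n".join(
        f"{i}. {c}" for i, c in enumerate(similar_claims, start=1)
    )
    return (
        "Source claim:\n"
        f"{claim}\n\n"
        "Retrieved claims:\n"
        f"{numbered}\n\n"
        "For each retrieved claim, state whether it contradicts the "
        "source claim, and quote the conflicting phrases."
    )

prompt = build_contradiction_prompt(
    "Net zero by 2030.",
    ["Expanding fossil fuel operations through 2035."],
)
```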
pgvector makes this pipeline possible without a separate infrastructure dependency, and that simplicity is worth a lot when you're a small team.
Questions about our ESG document pipeline? Reach out at hello@agentosaurus.com.
Build This Infrastructure?
We help AI teams build sovereign GPU clouds and autonomous systems. Free 30-minute consultation. Fixed-price projects from €5K.
Schedule Free Consultation