· 9 min read · Wingston Sharon

pgvector + Django: Semantic Search for ESG Document Analysis

---

By Wingston Sharon | December 2024


At some point in building Agentosaurus, we had to make a call: add a dedicated vector database, or figure out how to do semantic search inside PostgreSQL. Pinecone, Weaviate, Qdrant: all viable, all well-documented. But we already had PostgreSQL. Our Django ORM talks to it. Our Celery tasks write to it. Adding a second database for vectors meant a second thing to deploy, a second thing to back up, a second thing to debug at 2am.

We chose pgvector. Six months later I still think it was the right call for our scale.

The Problem We Were Solving

We crawl organizational websites and sustainability reports. For each organization, we have: a plain-text description, mission statements, extracted sustainability claims, and, when available, PDF reports chunked into paragraphs.

We need to answer questions like:
- "Which organizations are doing work similar to this one?" (similarity search over descriptions)
- "Which other organizations have made claims like this one?" (claim deduplication and contradiction detection)
- "Given this investor's impact thesis, which organizations align?" (semantic matching without keyword overlap)

None of these are solvable with ILIKE or full-text search alone. You need embeddings.
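Embeddings turn the problem into vector geometry: two texts about the same topic land near each other even when they share no keywords. Everything below reduces to the cosine similarity between vectors; here is a toy sketch of that computation (the vectors are made-up 3-dim examples, not real 768-dim embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, ~0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors pointing in similar directions score high.
print(round(cosine_similarity([0.9, 0.1, 0.2], [0.8, 0.2, 0.3]), 3))  # → 0.983
```

pgvector's cosine *distance* operator is just 1 minus this value, which is why the thresholds later in this post are "lower is more similar."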

Installing pgvector

On the PostgreSQL server (we run PostgreSQL 15 on Ubuntu on Oracle Cloud):

sudo apt install postgresql-15-pgvector

Then in a migration or directly in psql:

CREATE EXTENSION IF NOT EXISTS vector;

If you're on a managed PostgreSQL (Supabase, Neon, etc.), pgvector is usually available as an extension you can enable from the dashboard. On RDS it's a supported extension on recent PostgreSQL 15 engine versions; CREATE EXTENSION works without any custom parameter group.

Verify it's working:

SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
-- vector | 0.7.0
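While you're in psql, it's worth smoke-testing the distance operators directly before touching Django (`<->` is L2 distance, `<=>` cosine distance, `<#>` negative inner product):

```sql
SELECT '[1,0,0]'::vector <=> '[0,1,0]'::vector AS cosine_distance;
-- cosine_distance = 1 (orthogonal vectors; 0 would mean identical direction)
```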

The Django Model

We use pgvector-python for the Django integration:

pip install pgvector

The model for organization embeddings:

# agentosaurus/models.py
from django.db import models
from pgvector.django import VectorField

class OrganizationEmbedding(models.Model):
    organization = models.OneToOneField(
        'Organization',
        on_delete=models.CASCADE,
        related_name='embedding',
    )
    # nomic-embed-text outputs 768-dimensional vectors
    description_vector = VectorField(dimensions=768, null=True, blank=True)
    claims_vector = VectorField(dimensions=768, null=True, blank=True)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        app_label = 'agentosaurus'

The migration Django generates will create a proper vector(768) column in PostgreSQL. Nothing unusual needed.

For document chunks from PDFs, we use a separate model:

class DocumentChunk(models.Model):
    organization = models.ForeignKey(
        'Organization',
        on_delete=models.CASCADE,
        related_name='chunks',
    )
    source_url = models.URLField()
    chunk_index = models.IntegerField()
    text = models.TextField()
    embedding = VectorField(dimensions=768, null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        app_label = 'agentosaurus'
        indexes = [
            models.Index(fields=['organization', 'chunk_index']),
        ]
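Before rows land in this model, the PDF text has to be chunked. Our real pipeline does more PDF-specific cleanup, but a minimal sketch of the paragraph-packing approach looks like this (`chunk_text` and the `max_chars` budget are illustrative names, not part of pgvector or Django):

```python
def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    """Split text on paragraph boundaries, packing consecutive
    paragraphs together until the character budget is exceeded."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk then becomes one DocumentChunk row, with `enumerate` supplying `chunk_index`.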

Generating Embeddings

We use nomic-embed-text via Ollama. This was our third embedding model; I'll explain the path below. The generation function:

import httpx
from django.conf import settings

EMBED_URL = f"http://{settings.OLLAMA_HOST}:11434/api/embeddings"

def generate_embedding(text: str) -> list[float] | None:
    """Generate a 768-dim embedding using nomic-embed-text via Ollama.

    Returns None for empty input.
    """
    if not text or not text.strip():
        return None

    # nomic-embed-text wants a prefix for retrieval tasks
    prefixed = f"search_document: {text[:4096]}"

    response = httpx.post(
        EMBED_URL,
        json={"model": "nomic-embed-text", "prompt": prefixed},
        timeout=30.0,
    )
    response.raise_for_status()
    return response.json()["embedding"]
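Ollama can be briefly unavailable (model loading, restarts), so in practice we don't call this bare from tasks. A generic retry wrapper along these lines works; `with_retries` is our hypothetical helper name, and exponential backoff is an assumption about what your Ollama host tolerates:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5,
                 retry_on: tuple = (Exception,)):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Re-raises the last exception once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch:
#   vector = with_retries(lambda: generate_embedding(text),
#                         retry_on=(httpx.HTTPError,))
```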


def embed_organization(org_id: int) -> None:
    """Embed an organization's description and store in DB."""
    from agentosaurus.models import Organization, OrganizationEmbedding

    org = Organization.objects.get(id=org_id)
    description = f"{org.name}. {org.mission}. {org.description}"

    vector = generate_embedding(description)
    if vector is None:
        return

    OrganizationEmbedding.objects.update_or_create(
        organization=org,
        defaults={"description_vector": vector},
    )

This runs as a Celery task after each organization crawl completes.

Similarity Search in Django ORM

With pgvector-python, you get Django ORM annotations for distance functions:

from pgvector.django import CosineDistance
from agentosaurus.models import Organization, OrganizationEmbedding

def find_similar_organizations(org_id: int, limit: int = 10):
    """Find organizations semantically similar to the given one."""
    source_embedding = OrganizationEmbedding.objects.get(
        organization_id=org_id
    )
    query_vector = source_embedding.description_vector

    results = (
        OrganizationEmbedding.objects
        .exclude(organization_id=org_id)
        .annotate(distance=CosineDistance('description_vector', query_vector))
        .filter(distance__lt=0.3)  # Cosine distance < 0.3 = fairly similar
        .order_by('distance')
        .select_related('organization')[:limit]
    )

    return [
        {
            "organization": r.organization,
            "similarity_score": round(1 - r.distance, 4),
        }
        for r in results
    ]

For query-based search (e.g., investor thesis matching):

def semantic_search(query: str, limit: int = 20) -> list:
    """Search organizations by semantic similarity to a free-text query."""
    # Use the retrieval query prefix for nomic-embed-text
    query_vector = generate_embedding(f"search_query: {query}")
    if query_vector is None:
        return []

    results = (
        OrganizationEmbedding.objects
        .annotate(distance=CosineDistance('description_vector', query_vector))
        .filter(distance__lt=0.4)
        .order_by('distance')
        .select_related('organization')[:limit]
    )

    return list(results)

Indexing: IVFFlat vs HNSW

By default, pgvector does exact nearest-neighbor search: it scans all vectors. This is fine up to ~50k vectors. Beyond that, you want an approximate nearest-neighbor index.

We started with IVFFlat and moved to HNSW. Here's why.

IVFFlat divides the vector space into clusters (lists) and searches the nearest clusters. It's faster than exact search but requires choosing a lists parameter and running ANALYZE first:

CREATE INDEX org_embedding_ivfflat_idx
ON agentosaurus_organizationembedding
USING ivfflat (description_vector vector_cosine_ops)
WITH (lists = 100);

The problem: IVFFlat needs to be built on populated data. If you build it on an empty table (common in migrations), recall is poor. You have to rebuild it once data exists.

HNSW (Hierarchical Navigable Small World) is a graph-based index with better recall and no "train on data" requirement:

CREATE INDEX org_embedding_hnsw_idx
ON agentosaurus_organizationembedding
USING hnsw (description_vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

HNSW builds slower and uses more memory, but query recall is significantly better, especially on our dataset size (~15k organizations). We use HNSW now.
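HNSW also has a query-time knob, `hnsw.ef_search` (pgvector's default is 40), which trades latency for recall per session. The right value depends on your data, so treat this as a starting point:

```sql
-- Higher = better recall, slower queries; applies to the current session.
SET hnsw.ef_search = 100;
```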

In a Django migration:

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [
        ('agentosaurus', '0024_organizationembedding'),
    ]

    operations = [
        migrations.RunSQL(
            sql="""
            CREATE INDEX IF NOT EXISTS org_embedding_hnsw_idx
            ON agentosaurus_organizationembedding
            USING hnsw (description_vector vector_cosine_ops)
            WITH (m = 16, ef_construction = 64);
            """,
            reverse_sql="DROP INDEX IF EXISTS org_embedding_hnsw_idx;",
        ),
    ]
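One caveat if the table is already serving traffic: a plain CREATE INDEX holds a lock while the HNSW graph builds, which can take a while on a large table. CREATE INDEX CONCURRENTLY avoids the lock but can't run inside a transaction, so the migration has to opt out of atomicity. A hedged variant of the migration above (same index name and parameters):

```python
from django.db import migrations

class Migration(migrations.Migration):
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
    atomic = False

    dependencies = [
        ('agentosaurus', '0024_organizationembedding'),
    ]

    operations = [
        migrations.RunSQL(
            sql="""
            CREATE INDEX CONCURRENTLY IF NOT EXISTS org_embedding_hnsw_idx
            ON agentosaurus_organizationembedding
            USING hnsw (description_vector vector_cosine_ops)
            WITH (m = 16, ef_construction = 64);
            """,
            reverse_sql="DROP INDEX IF EXISTS org_embedding_hnsw_idx;",
        ),
    ]
```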

Embedding Model Journey

We went through three models before landing on nomic-embed-text:

  1. all-MiniLM-L6-v2 (384 dimensions, sentence-transformers): Fast, but semantic precision on ESG/sustainability language was poor. Too many false positives on generic "we care about the planet" content.

  2. mxbai-embed-large (1024 dimensions): Better quality, but we were storing 1024-float vectors for every chunk and the table was ballooning. Query time was also slower.

  3. nomic-embed-text (768 dimensions): Better ESG domain understanding than MiniLM, manageable size, and the search_document: / search_query: prompt prefixes actually matter; they shift the embedding space for retrieval vs. storage tasks. This is the one we've stuck with.

Switching embedding models means re-embedding everything, which takes time. Plan your model choice early.
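The re-embed itself is mostly a batching-and-patience problem: iterate over every organization and chunk, regenerate, overwrite. A generic batching helper for that loop (a sketch; `batched` is our own name, and the real task also rate-limits the Ollama calls between batches):

```python
from itertools import islice

def batched(iterable, size: int):
    """Yield successive lists of up to `size` items from any iterable,
    so the re-embed loop can checkpoint between batches."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# Usage sketch:
#   for ids in batched(Organization.objects.values_list("id", flat=True), 100):
#       ... re-embed each id, then commit/checkpoint ...
```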

The ESG Use Case: Finding Contradictions

Beyond similarity search, we use embeddings to flag potential contradictions in sustainability claims. The logic: embed individual claims, then find claims that are semantically close but in opposite directions.

This is an area under active development: we're currently testing a system that retrieves similar claims from other organizations and passes them to an LLM with the instruction "identify if these claims contradict each other." It works surprisingly well for catching greenwashing patterns like "net zero by 2030" paired with "expanding fossil fuel operations."

pgvector makes this pipeline possible without a separate infrastructure dependency, and that simplicity is worth a lot when you're a small team.


Questions about our ESG document pipeline? Reach out at hello@agentosaurus.com.
