
How to Score Organizations Against UN SDGs Using RAG and LLMs


By Wingston Sharon | March 2025


When I started building the SDG scoring system for Agentosaurus, I thought the hard part would be the AI. It turned out the hard part was figuring out what we were even trying to measure.

The 17 UN Sustainable Development Goals are intentionally broad. SDG 13 is "Climate Action." SDG 8 is "Decent Work and Economic Growth." These are ambitious macro-goals, not precise evaluation criteria. An organization that makes solar panels is obviously relevant to SDG 7 (Affordable and Clean Energy). But what about a company that makes software for renewable energy operators? Or a consulting firm that advises on energy transition? The signal gets fuzzy fast.

Our current approach is RAG-based scoring: retrieve organization content relevant to each SDG, then ask an LLM to evaluate alignment based on what it can actually find in the text. This produces approximate signals, not audit-grade certifications. I'll explain both what we built and where its limits are.

The Architecture

The scoring pipeline runs after an organization has been crawled and its content chunked and embedded (covered in earlier posts). The steps are:

  1. Retrieve: For each of the 17 SDGs, query the organization's embedded document chunks for the most relevant passages.
  2. Score: Pass the retrieved chunks + the SDG description + targets to an LLM and ask for an alignment score (0-100) with reasoning.
  3. Aggregate: Store per-goal scores and compute a weighted overall score.
  4. Flag: Mark any goals where the LLM couldn't find sufficient evidence.
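The four steps above can be sketched as a small orchestrator. Everything here is illustrative: `score_fn` stands in for the per-goal scoring task described later, and the plain mean is a placeholder for the weighted aggregation covered below.

```python
from typing import Callable

def score_all_goals(
    org_id: int,
    score_fn: Callable[[int, int], dict],
    sdg_numbers: range = range(1, 18),
) -> dict:
    """Score one organization against every SDG, then aggregate and flag.

    score_fn(org_id, sdg_number) must return a per-goal dict with at
    least "sdg_number", "score", and "confidence" keys.
    """
    per_goal = [score_fn(org_id, n) for n in sdg_numbers]

    # Step 4: flag goals where the LLM couldn't find sufficient evidence
    flagged = [r["sdg_number"] for r in per_goal if r.get("confidence") == "low"]

    # Unweighted mean as a stand-in for the weighted aggregation below
    overall = round(sum(r.get("score", 0) for r in per_goal) / len(per_goal), 1)

    return {"per_goal": per_goal, "overall": overall, "flagged": flagged}
```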

SDG Reference Data

We store the SDG definitions in a Django fixture:

# agentosaurus/fixtures/sdgs.json (abbreviated)
[
  {
    "model": "agentosaurus.sdg",
    "pk": 7,
    "fields": {
      "number": 7,
      "title": "Affordable and Clean Energy",
      "description": "Ensure access to affordable, reliable, sustainable and modern energy for all.",
      "targets": [
        "7.1 By 2030, ensure universal access to affordable, reliable and modern energy services",
        "7.2 By 2030, increase substantially the share of renewable energy in the global energy mix",
        "7.3 By 2030, double the global rate of improvement in energy efficiency"
      ]
    }
  }
]
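Outside Django, the fixture is plain JSON, so it is easy to sanity-check in isolation. A minimal parsing sketch (the dataclass and function name are illustrative, not part of the codebase):

```python
import json
from dataclasses import dataclass, field

@dataclass
class SDGRecord:
    number: int
    title: str
    description: str
    targets: list[str] = field(default_factory=list)

def load_sdg_fixture(raw_json: str) -> list[SDGRecord]:
    """Parse a Django fixture dump into plain records, ignoring model/pk metadata."""
    return [SDGRecord(**entry["fields"]) for entry in json.loads(raw_json)]
```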

The RAG Retrieval Step

For each SDG, we construct a query from the goal title and targets, embed it, and retrieve the most relevant chunks from the organization's content:

from pgvector.django import CosineDistance
from agentosaurus.models import DocumentChunk, SDG
from agentosaurus.embeddings import generate_embedding

def retrieve_relevant_chunks(org_id: int, sdg: SDG, top_k: int = 5) -> list[str]:
    """
    For a given organization and SDG, retrieve the most relevant
    document chunks using semantic similarity.
    """
    # Build a retrieval query from the SDG definition
    query_text = (
        f"search_query: {sdg.title}. {sdg.description}. "
        f"Targets: {'; '.join(sdg.targets[:3])}"
    )
    query_vector = generate_embedding(query_text)

    if query_vector is None:
        return []

    chunks = (
        DocumentChunk.objects
        .filter(organization_id=org_id)
        .annotate(distance=CosineDistance('embedding', query_vector))
        .filter(distance__lt=0.5)  # Reasonably relevant threshold
        .order_by('distance')[:top_k]
    )

    return [chunk.text for chunk in chunks]

The threshold of 0.5 cosine distance is a judgment call. Lower values mean stricter relevance matching: you get less material, but what you have is more likely to be genuinely about the SDG. Higher values pull in more content but risk bringing in tangentially related text that confuses the scorer. We settled on 0.5 after manual inspection of ~200 examples.
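That kind of manual inspection is easier with a quick count of how many chunks survive each candidate cutoff. A simplified sketch over (distance, text) pairs like those produced by the annotate step above (the function name is ours):

```python
def chunks_passing(
    scored_chunks: list[tuple[float, str]],
    thresholds: tuple[float, ...] = (0.3, 0.5, 0.7),
) -> dict[float, int]:
    """Count how many retrieved chunks fall under each cosine-distance cutoff."""
    return {
        t: sum(1 for distance, _ in scored_chunks if distance < t)
        for t in thresholds
    }
```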

The Scoring Prompt

This is the sanitized version of the prompt we pass to the LLM for each SDG:

SDG_SCORING_PROMPT = """You are evaluating how well an organization aligns with a specific UN Sustainable Development Goal.

## SDG Being Evaluated
Goal {sdg_number}: {sdg_title}
Description: {sdg_description}

Specific Targets:
{sdg_targets}

## Organization Content
The following passages are from this organization's website and publications:

---
{retrieved_chunks}
---

## Task
Based ONLY on the content provided above, evaluate this organization's alignment with Goal {sdg_number}.

Respond with a JSON object in exactly this format:
{{
  "score": <integer 0-100>,
  "confidence": <"high" | "medium" | "low">,
  "alignment_evidence": [<quote or paraphrase from content showing alignment, max 3>],
  "gaps": [<aspects of the goal not addressed in the content, max 3>],
  "reasoning": "<2-3 sentence summary>"
}}

Scoring guide:
- 0-20: No meaningful alignment found in provided content
- 21-40: Tangential or indirect connection only
- 41-60: Some direct alignment but limited in scope or evidence
- 61-80: Clear alignment with substantial evidence
- 81-100: Strong, direct, well-evidenced alignment with multiple targets

If the provided content is insufficient to evaluate alignment, set confidence to "low" and score to 0.
Do NOT infer or assume alignment that is not directly supported by the provided text."""

The key instruction, "based ONLY on the content provided", is the most important constraint. Without it, the LLM will fill gaps with plausible-sounding inferences about what an organization probably does. We've seen it give a solar panel company high scores on SDG 3 (Good Health) because "clean energy indirectly improves health." Technically arguable, practically useless for our purposes.

The Scoring Task

import json
import httpx
from agentosaurus.models import Organization, SDG, SDGScore

def score_organization_against_sdg(org_id: int, sdg_number: int) -> dict:
    # Raises DoesNotExist if the organization is missing
    Organization.objects.get(id=org_id)
    sdg = SDG.objects.get(number=sdg_number)

    chunks = retrieve_relevant_chunks(org_id, sdg, top_k=5)

    if not chunks:
        return {
            "sdg_number": sdg_number,
            "score": 0,
            "confidence": "low",
            "reasoning": "No relevant content found for this goal.",
            "alignment_evidence": [],
            "gaps": [],
        }

    prompt = SDG_SCORING_PROMPT.format(
        sdg_number=sdg.number,
        sdg_title=sdg.title,
        sdg_description=sdg.description,
        sdg_targets="\n".join(f"- {t}" for t in sdg.targets),
        retrieved_chunks="\n\n---\n\n".join(chunks),
    )

    response = httpx.post(
        "http://mac-mini-amsterdam.tail1234.ts.net:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "prompt": prompt,
            "stream": False,
            "format": "json",
        },
        timeout=60.0,
    )
    response.raise_for_status()

    raw_response = response.json()["response"]

    try:
        result = json.loads(raw_response)
        result["sdg_number"] = sdg_number
        return result
    except json.JSONDecodeError:
        return {
            "sdg_number": sdg_number,
            "score": 0,
            "confidence": "low",
            "reasoning": "LLM returned malformed JSON.",
            "alignment_evidence": [],
            "gaps": [],
        }

We pass "format": "json" to Ollama to constrain the output. This helps but doesn't fully prevent malformed responses; the json.JSONDecodeError catch is load-bearing.
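If malformed JSON becomes frequent, one option is a bounded retry around the LLM call before falling back to the zero-score result. A sketch of that pattern, not part of the pipeline as shipped:

```python
import json
from typing import Callable, Optional

def call_with_json_retry(call_llm: Callable[[], str], retries: int = 1) -> Optional[dict]:
    """Invoke the LLM, re-calling up to `retries` extra times on malformed JSON.

    Returns the parsed dict, or None so the caller can fall back to the
    low-confidence zero-score result.
    """
    for _attempt in range(retries + 1):
        try:
            return json.loads(call_llm())
        except json.JSONDecodeError:
            continue
    return None
```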

Aggregation and Storage

After scoring all 17 goals, we compute an overall score. We use a weighted average where SDG weights can be configured per organization category (a cleantech company gets different weights than a social enterprise):

DEFAULT_SDG_WEIGHTS = {
    1: 0.8, 2: 0.8, 3: 0.9, 4: 0.9, 5: 0.9,
    6: 1.0, 7: 1.2, 8: 1.0, 9: 1.0, 10: 1.0,
    11: 1.1, 12: 1.2, 13: 1.5, 14: 1.3, 15: 1.3,
    16: 1.0, 17: 0.7,
}

def compute_overall_score(sdg_scores: list[dict]) -> float:
    total_weight = 0
    weighted_sum = 0

    for score_data in sdg_scores:
        sdg_num = score_data["sdg_number"]
        score = score_data.get("score", 0)
        weight = DEFAULT_SDG_WEIGHTS.get(sdg_num, 1.0)

        # Downweight low-confidence scores
        if score_data.get("confidence") == "low":
            weight *= 0.5

        weighted_sum += score * weight
        total_weight += weight

    return round(weighted_sum / total_weight, 1) if total_weight > 0 else 0.0

We also expose manual overrides in Django admin: if a human reviewer has verified that an organization's SDG 13 score is inaccurate (too high or too low), they can set a manual_override_score and override_reason. The public-facing score shows the manual override when set.
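The override resolution itself is trivial. A sketch of how the public-facing score resolves (the field names mirror the text above; the class itself is illustrative, not our actual model):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoredSDGScore:
    llm_score: int
    manual_override_score: Optional[int] = None
    override_reason: str = ""

    @property
    def public_score(self) -> int:
        """Show the human-verified score when set, otherwise the LLM score."""
        if self.manual_override_score is not None:
            return self.manual_override_score
        return self.llm_score
```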

The Accuracy Problem

I want to be direct about this: our scores are not audits.

Organizations are good at writing sustainability-sounding content. ESG language is cheap to produce. A company with genuinely harmful practices can publish a sustainability report full of aspirational language, and our system will score it based on that language. The text says "committed to net zero by 2035"; the LLM finds that, matches it to SDG 13, and scores alignment.

This is exactly why we show the reasoning and evidence alongside scores. A score of 72 on SDG 13 with evidence that says "committed to net zero by 2035" is a very different signal than a score of 72 backed by evidence of actual emissions reduction data, third-party certifications, or verified project outcomes.

We're working toward OSINT-backed verification: cross-referencing organizational claims against external databases (regulatory filings, news archives, NGO watchdog reports). This is still under development. For now, what we ship is a systematic signal derived from published content, presented transparently with its evidence and limitations.

If you're using Agentosaurus scores for due diligence, treat them as a starting point for investigation, not a conclusion.


Questions about the SDG scoring system or RAG pipeline? Reach out at hello@agentosaurus.com.


