How We Built an OSINT Pipeline That Analyzes 560 Organizations in 24 Hours
---
By Wingston Sharon | March 2026
Open Source Intelligence used to be the exclusive domain of government agencies, investigative journalists, and well-funded security teams. Building a proper OSINT pipeline required expertise in web scraping, data normalization, natural language processing, and database engineering: skills rarely found in a single team, let alone a startup.
That changed when large language models learned to reason about unstructured text. Combined with modern async web scrapers and vector databases, it's now possible to build OSINT pipelines that would once have required a team of analysts and several months of work, in a few thousand lines of Python.
Here's exactly how we built ours, and what we learned analyzing 560 Amsterdam organizations.
What OSINT Actually Means in 2026
The term "open source intelligence" originated in military and intelligence contexts, where "open source" means publicly available information (as opposed to classified sources). Today it refers to any systematic collection and analysis of publicly accessible data to produce actionable intelligence.
For sustainability research, OSINT means:
- Web presence analysis: What do organizations actually say on their websites vs. their marketing claims?
- Document intelligence: Annual reports, sustainability certifications, supplier lists, financial disclosures
- Cross-referencing: Does what they claim match what journalists, regulators, and third parties report?
- Pattern detection: Are sustainability claims consistent across platforms and time periods?
Traditional OSINT for organizational research is slow and expensive. A sustainability consulting firm typically charges EUR 150,000–300,000 to audit 50 organizations, based on industry rate cards. The process involves human analysts reading documents, tracking down certifications, and manually cross-referencing claims; it doesn't scale.
Our goal was to verify sustainability claims for 560 Amsterdam organizations within a 3-week window. Manual auditing was out of the question. We needed automation.
For the full findings from the Amsterdam pilot (greenwashing rates, sector patterns, and methodology), see our Amsterdam pilot writeup.
The Architecture: Five Components That Work Together
Our OSINT pipeline has five main components, each of them open source. The only paid service in the stack is OpenRouter, which we use for LLM inference.
1. Crawl4AI: The Extraction Engine
Crawl4AI is an async Python library built specifically for AI-friendly web scraping. Unlike traditional scrapers that return raw HTML, Crawl4AI is designed to produce clean markdown output suitable for feeding directly into language models.
What makes it different from BeautifulSoup or Playwright alone:
- Async-first: Can crawl dozens of pages simultaneously using asyncio and Playwright under the hood
- AI-optimized output: Strips navigation, ads, and boilerplate; returns structured markdown
- PDF extraction: Handles PDF downloads transparently alongside HTML pages
- Depth crawling: Follows internal links to specified depth (we used depth=3)
- Respectful crawling: Built-in rate limiting and robots.txt compliance
For our Amsterdam pilot, we ran Crawl4AI against 560 organization domains with this configuration:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def crawl_organization(domain: str, depth: int = 3):
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080,
    )
    run_config = CrawlerRunConfig(
        word_count_threshold=50,       # Skip thin pages
        exclude_external_links=True,   # Stay on domain
        remove_overlay_elements=True,  # Remove cookie banners
        process_iframes=False,
        max_depth=depth,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls=get_sitemap_urls(domain),  # our helper: seed URLs from the sitemap
            config=run_config,
        )
        return results
```
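The `get_sitemap_urls` helper above is ours, not part of Crawl4AI. A minimal sketch of what such a helper can do with the standard library, assuming the domain serves a sitemaps.org-style `sitemap.xml` (the HTTP fetch itself is stubbed out here with a sample document):

```python
import xml.etree.ElementTree as ET

# Sitemap entries live in this XML namespace per the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str, max_urls: int = 500) -> list[str]:
    """Extract page URLs from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    urls = [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text
    ]
    return urls[:max_urls]

# In the pipeline, a get_sitemap_urls(domain) wrapper would fetch
# f"https://{domain}/sitemap.xml" and fall back to the homepage when
# no sitemap exists; here we parse a sample document directly.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/sustainability</loc></url>
</urlset>"""

print(parse_sitemap(sample))
# ['https://example.org/', 'https://example.org/sustainability']
```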
The output for each page is clean markdown with metadata (title, URL, crawl timestamp, word count). For a typical organization, we'd get 15–30 pages of content including their sustainability reports, team pages, project listings, and partnerships.
Numbers from our Amsterdam run:
- 560 organizations crawled
- 8,400+ web pages extracted
- 2,100+ PDFs downloaded and parsed
- Average crawl time per organization: ~90 seconds
- Total extraction time: ~24 hours (parallel async execution)
2. Celery: The Task Orchestrator
Crawling 560 organizations in parallel requires task queue management. We use Celery with Redis as the message broker.
Each organization gets its own Celery task. A supervisor task monitors the queue and retries failed crawls (JavaScript-heavy sites sometimes need a second pass). Tasks are prioritized by organization size โ larger organizations with more pages are crawled during off-peak hours.
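The exact scheduling policy isn't shown here, but "larger organizations crawl off-peak" can be sketched as a small pure function. The `estimated_pages` field and the fixed 22:00 window are illustrative assumptions, not our exact configuration:

```python
from datetime import datetime, time

def crawl_schedule(estimated_pages: int, now: datetime) -> datetime:
    """Small sites crawl immediately; large sites wait for the
    off-peak window starting at 22:00 (threshold is illustrative)."""
    if estimated_pages <= 100:
        return now
    off_peak = datetime.combine(now.date(), time(22, 0))
    if now >= off_peak:
        return now  # already inside the off-peak window
    return off_peak

# The task would then be queued with an eta, e.g.:
# crawl_organization_task.apply_async(
#     args=[org.id, org.domain],
#     eta=crawl_schedule(org.estimated_pages, datetime.now()),
# )
```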
```python
import asyncio

from celery import shared_task

from .crawler import crawl_organization
# store_crawl_results and trigger_embedding_task are project helpers
# (persistence and the follow-up embedding task), defined elsewhere.

@shared_task(bind=True, max_retries=3, default_retry_delay=300)
def crawl_organization_task(self, organization_id: int, domain: str):
    try:
        results = asyncio.run(crawl_organization(domain, depth=3))
        store_crawl_results(organization_id, results)
        trigger_embedding_task.delay(organization_id)
    except Exception as exc:
        # Retry up to 3 times, 5 minutes apart; JS-heavy sites often
        # succeed on a second pass.
        raise self.retry(exc=exc)
```
The Celery worker runs inside a Docker container alongside a Playwright browser instance. This setup lets us scale horizontally โ we ran 8 parallel workers during the Amsterdam pilot.
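A compose file for this setup might look roughly like the following. Service names, the image tag, the broker URL, and the `--concurrency` value are illustrative, not our exact configuration:

```yaml
services:
  redis:
    image: redis:7

  worker:
    build: .
    # Playwright browsers are baked into the image; each worker
    # process runs several async crawls concurrently.
    command: celery -A project worker --concurrency=8 --loglevel=info
    depends_on:
      - redis
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
```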
3. pgvector: The Semantic Memory Layer
Raw web content is useless for analysis. We need to query across thousands of documents semantically: "Which organizations mention ISO 14001 certification?", "What's the distribution of UN SDG alignment claims?", "Which organizations have the most detailed supplier chain transparency?"
These queries don't work with traditional SQL full-text search. They require semantic search โ finding documents based on meaning rather than keyword matching.
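The principle is easy to demonstrate with toy vectors. Real embeddings have hundreds of dimensions, but the ranking mechanics are the same: documents whose vectors sit closest to the query vector (here by L2 distance) rank first, regardless of shared keywords. The tiny hand-made "embeddings" below are purely illustrative:

```python
import math

def l2_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional "embeddings" (hand-made for illustration; a real
# sentence-transformers model produces 384- or 768-dim vectors).
docs = {
    "We cut CO2 emissions by 40% since 2020": [0.9, 0.1, 0.0],
    "Our office dog is named Biscuit":        [0.0, 0.1, 0.9],
    "Solar panels now power our warehouse":   [0.7, 0.3, 0.1],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "climate action evidence"

# Rank documents by distance to the query vector; the two
# climate-related documents rank above the unrelated one.
ranked = sorted(docs, key=lambda d: l2_distance(docs[d], query))
print(ranked[0])
```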
We use pgvector, a PostgreSQL extension that adds vector similarity search. Each page of crawled content gets embedded using a sentence-transformers model and stored as a 768-dimensional vector in Postgres.
```python
from django.db import models
from pgvector.django import VectorField, HnswIndex

class OrganizationPage(models.Model):
    organization = models.ForeignKey('Organization', on_delete=models.CASCADE)
    url = models.URLField()
    content = models.TextField()
    embedding = VectorField(dimensions=768)
    crawled_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            models.Index(fields=['organization']),
            # An HNSW index keeps similarity queries fast at this scale
            HnswIndex(
                name='page_embedding_hnsw',
                fields=['embedding'],
                m=16,
                ef_construction=64,
                opclasses=['vector_l2_ops'],
            ),
        ]
```
With pgvector, we can find all pages discussing a specific concept across all 560 organizations in milliseconds:
```python
from sentence_transformers import SentenceTransformer
from pgvector.django import L2Distance

# all-mpnet-base-v2 produces 768-dim embeddings, matching
# VectorField(dimensions=768) above; loaded once at module level.
model = SentenceTransformer('all-mpnet-base-v2')

def find_relevant_pages(query: str, organization_id: int | None = None, limit: int = 20):
    query_embedding = model.encode(query)
    pages = OrganizationPage.objects.all()
    if organization_id is not None:
        pages = pages.filter(organization_id=organization_id)
    return pages.annotate(
        distance=L2Distance('embedding', query_embedding)
    ).order_by('distance')[:limit]
```
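Under the hood, an ORM query like this translates to pgvector's `<->` (L2 distance) operator. Roughly the following SQL, where the table name and the vector literal are illustrative:

```sql
SELECT id, url, embedding <-> '[0.12, -0.03, 0.91]'::vector AS distance
FROM organization_page  -- table name is illustrative
ORDER BY distance
LIMIT 20;
```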
4. RAG: The Intelligence Layer
Raw page content, even when semantically searchable, doesn't directly answer questions like "Is this organization's sustainability claim credible?" That requires reasoning โ connecting claims to evidence, identifying contradictions, flagging missing documentation.
We use Retrieval-Augmented Generation (RAG) to turn our vector database into an answerable knowledge base.
For each organization, the scoring process works like this:
1. Query formulation: Generate 17 queries, one for each UN SDG (e.g., "evidence of climate action, renewable energy, carbon reduction")
2. Retrieval: Use pgvector to find the top 5 most relevant pages for each query
3. Context assembly: Combine retrieved pages into a structured prompt
4. Reasoning: Pass context to an LLM with a structured scoring rubric
5. Evidence extraction: The LLM returns scores + specific text citations as evidence
```python
import json

# SDG_QUERIES (SDG number -> retrieval query), SDG_SCORING_PROMPT, and
# openrouter_client (an OpenAI-compatible client pointed at OpenRouter)
# are defined elsewhere in the project.

def score_organization_sdgs(organization_id: int):
    org = Organization.objects.get(id=organization_id)
    scores = {}
    for sdg_num, sdg_query in SDG_QUERIES.items():
        # Retrieve relevant pages
        relevant_pages = find_relevant_pages(
            query=sdg_query,
            organization_id=organization_id,
            limit=5,
        )
        # Assemble context
        context = "\n\n---\n\n".join(
            f"URL: {page.url}\n{page.content[:2000]}"
            for page in relevant_pages
        )
        # Score with LLM
        response = openrouter_client.chat.completions.create(
            model="anthropic/claude-3-haiku",
            messages=[
                {"role": "system", "content": SDG_SCORING_PROMPT},
                {"role": "user", "content": f"Organization: {org.name}\n\nContext:\n{context}"},
            ],
            response_format={"type": "json_object"},
        )
        scores[sdg_num] = json.loads(response.choices[0].message.content)
    return scores
```
The LLM outputs structured JSON with a score (0โ10), confidence level, and supporting evidence quotes. This evidence is crucial โ it means every score is auditable, not a black box.
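LLMs occasionally return JSON that is well-formed but out of spec (a score of 11, missing evidence), so a validation step belongs between the API response and the database. A minimal sketch, assuming the schema described above (score 0–10, confidence, evidence quotes); key names are our assumption:

```python
import json

REQUIRED_KEYS = {"score", "confidence", "evidence"}

def validate_sdg_score(raw: str) -> dict:
    """Parse and sanity-check one LLM scoring response."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["score"], (int, float)) or not 0 <= data["score"] <= 10:
        raise ValueError(f"score out of range: {data['score']!r}")
    if not isinstance(data["evidence"], list) or not data["evidence"]:
        raise ValueError("evidence must be a non-empty list of quotes")
    return data

ok = validate_sdg_score(
    '{"score": 7, "confidence": "high", '
    '"evidence": ["We installed 400 solar panels in 2024"]}'
)
print(ok["score"])  # 7
```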
5. Django + Public API: The Delivery Layer
All of this analysis ultimately needs to be accessible. We expose results through a public REST API that powers the agentosaurus.com organization search interface.
The public dashboard lets anyone explore:
- Organization sustainability scores across all 17 SDGs
- Specific evidence supporting each score
- Comparison of organizations in the same sector
- Historical tracking (are they improving?)
What We Found in Amsterdam
When you analyze 560 organizations at this depth, patterns emerge that manual auditing would miss entirely.
Greenwashing is more common than expected: 23% of organizations that prominently feature sustainability language on their websites had no verifiable third-party certifications and no evidence of actual implementation. They use the right vocabulary without the underlying action.
The certification gap: Organizations with ISO 14001 or B Corp certifications scored significantly higher on average, but only 8% of Amsterdam organizations held any recognized sustainability certification. There's a large gap between aspiration and verification.
PDF documents tell different stories: An organization's PDF documents (annual reports, sustainability reports) often contain more accurate information than their website. Website copy is marketing; reports have legal obligations. When these contradict each other, the reports are generally more reliable.
Sector patterns are clear: Technology companies consistently underperform on SDG 13 (climate action) and SDG 12 (responsible consumption) compared to manufacturing companies, despite the tech sector's typical narrative of being "sustainable."
The Cost Comparison
Traditional consulting for this scope would have involved:
- 6โ8 senior sustainability analysts
- 3 months of data collection
- Custom audit framework development
- EUR 200,000โ400,000 budget
Our OSINT pipeline:
- 3 engineers, 2 months (including building the infrastructure)
- Cloud compute: EUR 1,800 (560 organization crawls plus GPU inference)
- OpenRouter LLM costs: EUR 1,200
- Total: ~EUR 25,000, roughly an 87–94% cost reduction
More importantly, the pipeline is reusable. A second run, say 6 months later for updated scores, costs less than EUR 3,000 in compute.
Running It Yourself
The tools are open source:
- Crawl4AI: github.com/unclecode/crawl4ai
- pgvector: github.com/pgvector/pgvector
- Our Celery task patterns: Available in the agentosaurus Docker setup (reach out if you want details)
The Amsterdam results are publicly available at agentosaurus.com/cities/amsterdam: 560 organizations, fully scored, with evidence links for every rating.
What's Next
The Amsterdam pilot validated the technical approach. We're running the same pipeline for:
- Rotterdam: 400 organizations (Q2 2026)
- Brussels: 300 organizations (Q2 2026)
- Stockholm: 350 organizations (Q3 2026)
Each city adds to a growing database of verified sustainability intelligence, the foundation for an independent, AI-powered ESG verification platform for Europe.
If you're interested in running this analysis for your city, or want early access to the API: hello@agentosaurus.com
The tools exist. The data is public. All that's needed is the pipeline to make sense of it.
Agentosaurus is building open-source intelligence infrastructure for sustainability verification. Our distributed GPU network (powered by Mac M-series and OCI nodes) handles analysis at scale without sending your data through US-controlled cloud providers.
Questions about the technical architecture? Email build@agentosaurus.com
Want to Build This Infrastructure?
We help AI teams build sovereign GPU clouds and autonomous systems. Free 30-minute consultation. Fixed-price projects from €5K.
Schedule Free Consultation