Developer Philosophy · 11 min read

The Next Dataset for the Web: Why Our Conversations With Agents Must Become a Commons

For the past decade, AI models trained on the open web. That era is ending. The next frontier of training data isn't on websites—it's in our conversations with agents. If we don't build infrastructure to capture and share this data, the future of AI will be fragmented, siloed, and closed. Here's why we need an Agent Data Commons, and what it would look like.

For the past decade, the internet has been the primary training ground for AI systems. Large language models grew up on a diet of books, scraped websites, technical forums, research papers, Reddit threads, Stack Overflow discussions, and any public text they could legally—or questionably—consume.

This was the first era of AI training: “Just scrape what humanity has already written.”

That era is ending.

Not because the web has disappeared, but because:

  • Much of it has already been scraped.
  • A growing portion of it is polluted with AI-generated noise.
  • The high-quality conversational data—the kind that encodes expertise, reasoning, Q&A patterns—is increasingly locked inside proprietary systems.

And most importantly:

The next frontier of high-quality training data isn’t on the web. It’s in our conversations with agents.

ChatGPT, Claude, Gemini, Reka, Llama agents, custom assistants—these are becoming our new interfaces to software, knowledge, and work. We are slowly replacing the old act of “searching the web and browsing pages” with “asking the agent.”

This shift changes everything.


The Web’s Original Superpower: Open, Accessible Knowledge

The open web was never perfect, but it had two incredible properties:

1. Knowledge was publicly accessible

If someone posted a question on Stack Overflow, the entire world could read it—and search engines could index it. Anyone could learn from the same pool of knowledge.

2. Data was shareable

The internet’s content was not built for LLMs, but it was inherently exposed. It could be collected, replicated, analyzed, transformed.

This meant that the early LLM developers—both the ethical ones and the less ethical ones—could build large-scale datasets without permission from centralized authorities.

It was chaotic, messy, and ethically questionable, but it democratized AI research in a strange way. Anyone could train on roughly the same raw material.


The Shift: Agents Are the New Interface

Now imagine the same question someone once asked on Stack Overflow:

“Why is my Kubernetes deployment stuck in CrashLoopBackOff?”

But instead of posting it online, they ask their AI agent.

The agent gives a detailed, personalized answer. The developer moves on. Nothing is posted publicly. No search engine indexes it. No forum preserves it.

That knowledge now lives inside a closed feedback loop, controlled by whichever company provides the agent.

The user gets value. But humanity loses a tiny piece of public knowledge.

Most importantly:

The next generation of models will need this kind of interactive, intention-rich, domain-specific Q&A data to improve—but it won’t exist on the public web anymore.

Unless we build the infrastructure to capture it.


Why Agent Conversations Are the New Gold

Conversations with agents are qualitatively different from web text.

They contain:

  • Intent — the user’s actual goal
  • Iteration — the back-and-forth refinements of a problem
  • Reasoning traces — how a user evaluates answers
  • Corrections — where the agent failed and the user guided it
  • Structured problem solving — not just statements, but process

This is the richest training data we’ve ever produced. And it’s being generated at massive scale.

But it’s fragmented. Locked away. Siloed by vendor. Invisible to the broader ecosystem.

If nothing changes, the next decade of AI research will be built on private, proprietary datasets that only a few companies have access to.

That is the opposite of how the internet grew.


We Need a New Infrastructure Layer: The Agent Data Commons

What’s missing is not technology—the agents already exist.

What’s missing is a platform and a protocol that:

  1. Collects agent interaction data (with full user consent)
  2. Anonymizes and structures it
  3. Gives users ownership and visibility into their own conversation history
  4. Makes the aggregated, cleaned dataset openly accessible to all model developers
  5. Operates under transparent, auditable, open-source governance

Imagine the combination of:

  • Hugging Face Datasets
  • LAION, but for conversational data
  • A personal AI “vault” for your chat history
  • A data union that ensures user control and optional compensation

This doesn’t exist today. But it’s exactly what the ecosystem needs next.
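
To make “anonymizes and structures it” concrete, here is a minimal sketch of what a shared conversation record could look like. Every field name here is an illustrative assumption, not a finalized spec:

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str                  # "user" or "assistant": provenance matters for training
    content: str               # redacted text; PII is removed before donation
    model: str | None = None   # which model produced an assistant turn, if known


@dataclass
class ConversationRecord:
    conversation_id: str                   # random UUID, never derived from user identity
    turns: list[Turn] = field(default_factory=list)
    domain: str = "unknown"                # e.g. "programming", "legal", "cooking"
    language: str = "en"
    license: str = "commons-donation-v1"   # hypothetical donation-license tag
    consent: bool = False                  # must be explicitly True before publication
```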


What Such a Platform Would Look Like

1. A Local-First “AI Conversation Vault” for Users

Every conversation you have with any agent—ChatGPT, Claude, Gemini, local Llama, whatever—is captured locally first, encrypted, and owned by you.

From there you can:

  • Search across all your past conversations
  • Organize them into projects / topics
  • Rehydrate an old thread with a different agent
  • Export, delete, audit your data

Users immediately get value even before contributing anything to the commons.
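
As a rough sketch of the “search across all your past conversations” piece, here is what the core of a local vault could look like, assuming a plain SQLite file with the FTS5 full-text extension (table and column names are hypothetical):

```python
import sqlite3

# Local-first vault: a single SQLite file owned by the user, encrypted at rest.
# FTS5 gives cheap full-text search without any server or cloud dependency.
db = sqlite3.connect("vault.db")
db.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS messages
    USING fts5(conversation_id, agent, role, content)
""")

def remember(conversation_id: str, agent: str, role: str, content: str) -> None:
    """Store one turn of any agent conversation in the local vault."""
    db.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
               (conversation_id, agent, role, content))
    db.commit()

def search(query: str) -> list[tuple]:
    """Full-text search across every past conversation, regardless of vendor."""
    return db.execute(
        "SELECT conversation_id, agent, content FROM messages WHERE messages MATCH ?",
        (query,)
    ).fetchall()
```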

2. An Opt-In Pipeline for Donating Anonymized Conversations

Users can choose to share conversations with:

  • PII removed
  • Sensitive content filtered
  • Metadata added (domain, difficulty, tags)

A transparent UI shows:

  • What was shared
  • What wasn’t
  • Which models or researchers used it
  • What value it created

This is the opposite of the current opaque “trust us, we collect your data to improve the model” approach.
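
A minimal sketch of that transparency in code: every donation emits a human-readable manifest the user can open and audit. All field names are assumptions:

```python
import json
from datetime import datetime, timezone

def build_donation_manifest(conversation_id: str,
                            shared_turns: list[dict],
                            redactions: list[str]) -> str:
    """Produce an auditable record of exactly what left the user's vault."""
    manifest = {
        "conversation_id": conversation_id,
        "donated_at": datetime.now(timezone.utc).isoformat(),
        "turns_shared": len(shared_turns),
        "redactions_applied": redactions,   # e.g. ["email", "phone_number"]
        "consumers": [],                    # appended as models/researchers use the data
    }
    return json.dumps(manifest, indent=2)
```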

3. An Open Dataset for the Community

Aggregated data—structured, anonymized, deduplicated—would be published regularly:

  • on Hugging Face
  • on Kaggle
  • or via a foundation-hosted storage bucket

Every researcher, startup, company, and open-source model gets access to the same base layer.

This would be the moral successor to Stack Overflow, but native to the agent era.
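
If such snapshots existed, consuming them should be a one-liner with the datasets library. The dataset name below is hypothetical; no such snapshot exists today:

```python
from datasets import load_dataset

# Hypothetical dataset name; illustrative only.
commons = load_dataset("agent-data-commons/conversations", split="train")
print(commons[0]["turns"])  # the same base layer for every researcher and startup
```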

4. A Governance Layer that Is Not Owned by Any Vendor

A non-profit foundation or cooperative ensures:

  • Transparent operations
  • Open-source tooling
  • User control
  • Auditability
  • Clear licensing
  • No vendor lock-in
  • No single-company capture

This layer must be vendor-neutral, or it loses its value.


Incentives: Why Anyone Would Participate

For Users

  • You get a searchable memory of your conversations—something none of the current agent products do well
  • You control what is shared or deleted
  • You contribute to a public good
  • Optionally: you participate in a data union that compensates contributors

For Developers (agent builders, OSS models)

  • They can integrate a drop-in SDK for structured logging
  • They gain analytics, feedback, and evaluation tools
  • They get access to the commons dataset to train better models

For Model Developers / Researchers

  • They get ethically sourced, multi-domain, high-quality conversational datasets—something the ecosystem desperately lacks

Everyone wins.


Why This Matters Now

Agents are becoming the new UI for software. In many cases, they already are.

Soon:

  • You won’t browse documentation; your agent will
  • You won’t search forums; your agent will
  • You won’t fill forms; your agent will
  • You won’t install apps; you’ll call capabilities through the agent

In that world, the agent interface becomes the primary site of human–machine communication.

If we don’t build an open, shared, privacy-first dataset from these interactions, then:

  • Knowledge becomes fragmented by vendor
  • Innovation slows outside a handful of companies
  • The next generation of models becomes less accessible
  • We repeat the centralization mistakes of Web 2.0, but worse

The web made knowledge discoverable and accessible. Agents could make it invisible and siloed—unless we intervene.


Technical Shape: What a V1 Could Look Like

If you wanted to actually build this, here’s a concrete architecture:

Core Components

1. Local Collector

  • Browser extension + optional desktop app
  • Captures:
    • Page content (for Web UIs like ChatGPT)
    • Messages via APIs / webhooks for OSS agents
  • Writes into a local encrypted store first
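
Here is a sketch of the “local encrypted store first” rule, using Fernet symmetric encryption from the cryptography package. The vault path and log format are assumptions:

```python
from pathlib import Path

from cryptography.fernet import Fernet

VAULT = Path.home() / ".agent-vault"   # hypothetical location
VAULT.mkdir(exist_ok=True)

# In a real collector the key would live in the OS keychain, not in memory like this.
key = Fernet.generate_key()
fernet = Fernet(key)

def capture(conversation_id: str, raw_message: str) -> None:
    """Encrypt a captured message and append it to the local vault.

    Nothing leaves the machine here; sync and donation are separate,
    explicitly opt-in steps.
    """
    token = fernet.encrypt(raw_message.encode("utf-8"))
    with open(VAULT / f"{conversation_id}.log", "ab") as f:
        f.write(token + b"\n")
```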

2. User Vault Service

  • Syncs from local to remote if the user wants
  • Exposes:
    • /me/conversations API
    • search, tags, embeddings, backups
  • UX: think Notion/Logseq but for AI chats
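
The vault service could be a thin HTTP layer over that store. A FastAPI sketch of the /me/conversations endpoint, with the route shape and filters as assumptions:

```python
from fastapi import FastAPI

app = FastAPI(title="User Vault Service")

# In practice this would read from the synced vault store, not an in-memory list.
_conversations: list[dict] = []

@app.get("/me/conversations")  # hypothetical route shape
def list_conversations(q: str | None = None, tag: str | None = None) -> list[dict]:
    """List, and optionally search or filter, the caller's own conversations."""
    results = _conversations
    if q:
        results = [c for c in results
                   if any(q.lower() in t["content"].lower() for t in c["turns"])]
    if tag:
        results = [c for c in results if tag in c.get("tags", [])]
    return results
```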

3. Data Donation Pipeline

  • Opt-in export from user vault to the Commons
  • Steps:
    1. PII detection & redaction
    2. De-duplication & model-output tagging
    3. Safety filtering & content moderation
    4. Attach metadata (ratings, topics, languages)
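
A sketch of those four steps chained into one function. The redaction here is a deliberately naive regex pass; a real pipeline would use a dedicated PII-detection model:

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def donate(turns: list[dict], seen_hashes: set[str]) -> dict | None:
    """Run one conversation through the donation pipeline, or reject it."""
    # 1. PII detection & redaction (naive sketch; use a real PII model in production)
    for t in turns:
        t["content"] = EMAIL.sub("[EMAIL]", t["content"])
        t["content"] = PHONE.sub("[PHONE]", t["content"])

    # 2. De-duplication via content hash; model outputs stay tagged by role
    digest = hashlib.sha256(
        "".join(t["content"] for t in turns).encode()).hexdigest()
    if digest in seen_hashes:
        return None
    seen_hashes.add(digest)

    # 3. Safety filtering (placeholder check; real moderation goes here)
    if not any(t["content"].strip() for t in turns):
        return None

    # 4. Attach metadata for downstream curation ("und" = undetermined language)
    return {"turns": turns, "sha256": digest, "language": "und", "topics": []}
```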

4. Commons Storage & API

  • Public data lake (e.g., Parquet / JSONL on S3-style storage)
  • Snapshots pushed to Hugging Face Datasets or Kaggle
  • Query APIs for research
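
The storage side is well-trodden territory. A sketch using pyarrow to emit a Parquet snapshot from donated records:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_snapshot(records: list[dict],
                   path: str = "commons-snapshot.parquet") -> None:
    """Serialize a batch of donated records into a columnar snapshot file.

    The same file can be uploaded to S3-style storage and mirrored to
    Hugging Face Datasets or Kaggle.
    """
    table = pa.Table.from_pylist(records)
    pq.write_table(table, path, compression="zstd")
```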

5. Governance / Transparency Tools

  • Data provenance tracking
  • Ability to “tombstone” a user’s contributions retroactively where legally required
  • Public dashboards: who uses the data, for what
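
One way to sketch the tombstone mechanism: keep a provenance record per contribution, and let revocation mark it so every future snapshot excludes it. All names here are assumptions:

```python
from datetime import datetime, timezone

provenance: dict[str, dict] = {}  # contribution_id -> provenance record

def register(contribution_id: str, dataset_snapshot: str) -> None:
    provenance[contribution_id] = {
        "snapshot": dataset_snapshot,
        "tombstoned": False,
        "consumers": [],  # public log of who used the data, and for what
    }

def tombstone(contribution_id: str) -> None:
    """Mark a contribution as revoked: it is dropped from all future snapshots.

    Already-trained models can't unlearn it, which is why the guarantee is
    forward-looking and applies retroactively only where legally required.
    """
    record = provenance[contribution_id]
    record["tombstoned"] = True
    record["revoked_at"] = datetime.now(timezone.utc).isoformat()
```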

Precedents: We’ve Done This Before

There are already precursors to this idea:

  • LAION’s OpenAssistant Conversations (OASST1) – a crowdsourced, human-generated assistant conversation dataset (161k messages, ~10k conversation trees, 13k+ volunteers)
  • LAION’s OIG (Open Instruction Generalist) – ~43M instruction samples, partially synthetic, meant as an open instruction-following dataset

But these are static, one-off datasets. What we need is more like:

“Stripe for conversation logs” + “Kaggle meets LAION for agent data” + “a user-centric data union.”


Key Challenges (and How to Address Them)

1. PII & Safety

  • Hard problem: no perfect anonymization
  • Approach: aggressively default-safe; allow per-conversation inspection; publish risk docs

2. Model Collapse & Source Separation

  • Datasets must distinguish:
    • Human prompts and edits
    • Model-generated suggestions
  • This becomes a feature: researchers can study human–AI collaboration, not just raw text
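
In practice this just means the role tags must survive all the way into the published records, so a researcher can make choices like the following (field names follow the schema sketch earlier in this post):

```python
def human_only(records: list[dict]) -> list[dict]:
    """Keep only human-authored turns, e.g. to avoid training on model output."""
    return [
        {**r, "turns": [t for t in r["turns"] if t["role"] == "user"]}
        for r in records
    ]
```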

3. License Clarity

  • Avoid the current mess where web-scraped data relies on “fair use” or unclear terms
  • Build a new, explicit “conversation donation” license that:
    • Grants rights for training and evaluation
    • Preserves user right to delete / revoke going forward

4. Avoiding Capture by Big Players

  • Make it structurally hard to “embrace, extend, extinguish”:
    • Non-profit charter prohibits sale or IP lock-in
    • Strong community governance around API access and reciprocity requirements

Business / Incentive Model Options

Model A – “Free if you donate data”

  • If you opt in to donate anonymized data, you get:
    • The personal vault for free
    • Extra features (advanced search, tagging, backup)
  • If you don’t want to donate, you pay:
    • A subscription, or self-host the OSS stack

Revenue comes from:

  • Enterprise / research licenses for curated slices of the commons
  • Possibly value-added services (better tooling, hosting, visualizations)

Model B – Data union & revenue sharing

  • When commercial actors buy access to specific slices (e.g., “agent conversations about programming with high-quality human feedback”), a portion of the money:
    • Goes to the foundation (infra)
    • Flows back to data contributors (users + apps), like a cooperative dividend
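
As a toy sketch of how such a dividend could be computed: split a sale between the foundation and contributors, pro-rata by donated conversations. The 30% foundation share is an arbitrary assumption:

```python
def split_revenue(sale_amount: float,
                  contributions: dict[str, int],
                  foundation_share: float = 0.30) -> dict[str, float]:
    """Divide a data-license sale: a fixed cut funds the foundation's infra,
    the rest flows to contributors by number of donated conversations."""
    payouts = {"foundation": sale_amount * foundation_share}
    pool = sale_amount - payouts["foundation"]
    total = sum(contributions.values()) or 1  # guard against empty pool
    for contributor, count in contributions.items():
        payouts[contributor] = pool * count / total
    return payouts
```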

Model C – Pure public-good model

  • Fully grant-funded / philanthropic (similar to some LAION efforts)
  • No paywall for users, no revenue share
  • Focus: maximize openness and scientific progress

In practice, you could start with Model C (bootstrapped via grants and donors) and evolve toward an A/B hybrid to sustain the infrastructure long-term.


This Is a Call to Action

We need:

  • An open-source protocol
  • A neutral foundation
  • A standardized schema
  • Privacy-first anonymization pipelines
  • Incentive mechanisms for users
  • A community of developers willing to integrate the SDK

This won’t be built by OpenAI, Google, or Anthropic—they have no incentive to open their logs.

It must be built by the open-source AI community, the same way the early web was built: messily, collaboratively, transparently, and collectively.

We are entering the era where agents become the primary interface for human knowledge work.

Let’s make sure that era isn’t built on closed data. Let’s ensure the next great dataset of the internet is owned by all of us.


This concept connects to several other ideas I’ve been exploring:

  1. The Web Is Quietly Ending: Agents as the New Front-End
    • Old training data: web, books, Stack Overflow, Reddit
    • Why that well is drying up and being polluted by AI-generated content
  2. The Risk of Knowledge Silos
    • If each vendor hoards their interaction logs, we regress to private knowledge stacks
    • This slows progress and centralizes power
  3. Self-Encapsulated Agents (see my previous post)
    • Building agents with compile-time embedded knowledge
    • Purpose-built distributions for specific domains

The Agent Data Commons is the infrastructure layer that makes all of this possible.


About this post: This essay began as a conversation about the future of AI training data and evolved into a concrete proposal for infrastructure. It explores what it would take to build a transparent, user-controlled, open dataset from our collective interactions with AI agents—before that data becomes permanently siloed.

If you’re interested in building this, thinking about governance models, or have ideas about how to make it happen—I’d love to hear from you. This is a problem the community needs to solve together.

The future of AI shouldn’t be built on closed conversations. Let’s build the commons we need.
