Developer Philosophy · 11 min read

The Next Dataset for the Web: Why Our Conversations With Agents Must Become a Commons

For the past decade, AI models trained on the open web. That era is ending. The next frontier of training data isn't on websites—it's in our conversations with agents. If we don't build infrastructure to capture and share this data, the future of AI will be fragmented, siloed, and closed. Here's why we need an Agent Data Commons, and what it would look like.

For the past decade, the internet has been the primary training ground for AI systems. Large language models grew up on a diet of books, scraped websites, technical forums, research papers, Reddit threads, Stack Overflow discussions, and any public text they could legally—or questionably—consume.

This was the first era of AI training: “Just scrape what humanity has already written.”

That era is ending.

Not because the web has disappeared, but because:

  • Much of it has already been scraped.
  • A growing portion of it is polluted with AI-generated noise.
  • The high-quality conversational data—the kind that encodes expertise, reasoning, Q&A patterns—is increasingly locked inside proprietary systems.

And most importantly:

The next frontier of high-quality training data isn’t on the web. It’s in our conversations with agents.

ChatGPT, Claude, Gemini, Reka, Llama agents, custom assistants—these are becoming our new interfaces to software, knowledge, and work. We are slowly replacing the old act of “searching the web and browsing pages” with “asking the agent.”

This shift changes everything.


The Web’s Original Superpower: Open, Accessible Knowledge

The open web was never perfect, but it had two incredible properties:

1. Knowledge was publicly accessible

If someone posted a question on Stack Overflow, the entire world could read it—and search engines could index it. Anyone could learn from the same pool of knowledge.

2. Data was shareable

The internet’s content was not built for LLMs, but it was inherently exposed. It could be collected, replicated, analyzed, transformed.

This meant that the early LLM developers—both the ethical ones and the less ethical ones—could build large-scale datasets without permission from centralized authorities.

It was chaotic, messy, and ethically questionable, but it democratized AI research in a strange way. Anyone could train on roughly the same raw material.


The Shift: Agents Are the New Interface

Now imagine the same question someone once asked on Stack Overflow:

“Why is my Kubernetes deployment stuck in CrashLoopBackOff?”

But instead of posting it online, they ask their AI agent.

The agent gives a detailed, personalized answer. The developer moves on. Nothing is posted publicly. No search engine indexes it. No forum preserves it.

That knowledge now lives inside a closed feedback loop, controlled by whichever company provides the agent.

The user gets value. But humanity loses a tiny piece of public knowledge.

Most importantly:

The next generation of models will need this kind of interactive, intention-rich, domain-specific Q&A data to improve—but it won’t exist on the public web anymore.

Unless we build the infrastructure to capture it.


Why Agent Conversations Are the New Gold

Conversations with agents are qualitatively different from web text.

They contain:

  • Intent — the user’s actual goal
  • Iteration — the back-and-forth refinements of a problem
  • Reasoning traces — how a user evaluates answers
  • Corrections — where the agent failed and the user guided it
  • Structured problem solving — not just statements, but process

This is the richest training data we’ve ever produced. And it’s being generated at massive scale.

But it’s fragmented. Locked away. Siloed by vendor. Invisible to the broader ecosystem.

If nothing changes, the next decade of AI research will be built on private, proprietary datasets that only a few companies have access to.

That is the opposite of how the internet grew.


We Need a New Infrastructure Layer: The Agent Data Commons

What’s missing is not technology—the agents already exist.

What’s missing is a platform and a protocol that:

  1. Collects agent interaction data (with full user consent)
  2. Anonymizes and structures it
  3. Gives users ownership and visibility into their own conversation history
  4. Makes the aggregated, cleaned dataset openly accessible to all model developers
  5. Operates under transparent, auditable, open-source governance

Imagine the combination of:

  • Hugging Face Datasets
  • LAION, but for conversational data
  • A personal AI “vault” for your chat history
  • A data union that ensures user control and optional compensation

This doesn’t exist today. But it’s exactly what the ecosystem needs next.
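
To make “anonymizes and structures it” concrete, here is a minimal sketch of what a shared conversation record could look like. Every field name here is an illustrative assumption, not a finalized spec:

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str                  # "user" or "assistant": provenance matters for training
    content: str               # redacted text; PII is removed before donation
    model: str | None = None   # which model produced an assistant turn, if known


@dataclass
class ConversationRecord:
    conversation_id: str                   # random UUID, never derived from user identity
    turns: list[Turn] = field(default_factory=list)
    domain: str = "unknown"                # e.g. "programming", "legal", "cooking"
    language: str = "en"
    license: str = "commons-donation-v1"   # hypothetical donation-license tag
    consent: bool = False                  # must be explicitly True before publication
```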


What Such a Platform Would Look Like

1. A Local-First “AI Conversation Vault” for Users

Every conversation you have with any agent—ChatGPT, Claude, Gemini, local Llama, whatever—is captured locally first, encrypted, and owned by you.

From there you can:

  • Search across all your past conversations
  • Organize them into projects / topics
  • Rehydrate an old thread with a different agent
  • Export, delete, audit your data

Users immediately get value even before contributing anything to the commons.
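
As a rough sketch of the “search across all your past conversations” piece, here is what the core of a local vault could look like, assuming a plain SQLite file with the FTS5 full-text extension (table and column names are hypothetical):

```python
import sqlite3

# Local-first vault: a single SQLite file owned by the user, encrypted at rest.
# FTS5 gives cheap full-text search without any server or cloud dependency.
db = sqlite3.connect("vault.db")
db.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS messages
    USING fts5(conversation_id, agent, role, content)
""")

def remember(conversation_id: str, agent: str, role: str, content: str) -> None:
    """Store one turn of any agent conversation in the local vault."""
    db.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
               (conversation_id, agent, role, content))
    db.commit()

def search(query: str) -> list[tuple]:
    """Full-text search across every past conversation, regardless of vendor."""
    return db.execute(
        "SELECT conversation_id, agent, content FROM messages WHERE messages MATCH ?",
        (query,)
    ).fetchall()
```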

2. An Opt-In Pipeline for Donating Anonymized Conversations

Users can choose to share conversations with:

  • PII removed
  • Sensitive content filtered
  • Metadata added (domain, difficulty, tags)

A transparent UI shows:

  • What was shared
  • What wasn’t
  • Which models or researchers used it
  • What value it created

This is the opposite of the current opaque “trust us, we collect your data to improve the model” approach.
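
A minimal sketch of that transparency in code: every donation emits a human-readable manifest the user can open and audit. All field names are assumptions:

```python
import json
from datetime import datetime, timezone

def build_donation_manifest(conversation_id: str,
                            shared_turns: list[dict],
                            redactions: list[str]) -> str:
    """Produce an auditable record of exactly what left the user's vault."""
    manifest = {
        "conversation_id": conversation_id,
        "donated_at": datetime.now(timezone.utc).isoformat(),
        "turns_shared": len(shared_turns),
        "redactions_applied": redactions,   # e.g. ["email", "phone_number"]
        "consumers": [],                    # appended as models/researchers use the data
    }
    return json.dumps(manifest, indent=2)
```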

3. An Open Dataset for the Community

Aggregated data—structured, anonymized, deduplicated—would be published regularly:

  • on Hugging Face
  • on Kaggle
  • or via a foundation-hosted storage bucket

Every researcher, startup, company, and open-source model gets access to the same base layer.

This would be the moral successor to Stack Overflow, but native to the agent era.
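
If such snapshots existed, consuming them should be a one-liner with the datasets library. The dataset name below is hypothetical; no such snapshot exists today:

```python
from datasets import load_dataset

# Hypothetical dataset name; illustrative only.
commons = load_dataset("agent-data-commons/conversations", split="train")
print(commons[0]["turns"])  # the same base layer for every researcher and startup
```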

4. A Governance Layer that Is Not Owned by Any Vendor

A non-profit foundation or cooperative ensures:

  • Transparent operations
  • Open-source tooling
  • User control
  • Auditability
  • Clear licensing
  • No vendor lock-in
  • No single-company capture

This layer must be vendor-neutral, or it loses its value.


Incentives: Why Anyone Would Participate

For Users

  • You get a searchable memory of your conversations—something none of the current agent products do well
  • You control what is shared or deleted
  • You contribute to a public good
  • Optionally: you participate in a data union that compensates contributors

For Developers (agent builders, OSS models)

  • They can integrate a drop-in SDK for structured logging
  • They gain analytics, feedback, and evaluation tools
  • They get access to the commons dataset to train better models

For Model Developers / Researchers

  • They get ethically sourced, multi-domain, high-quality conversational datasets—something the ecosystem desperately lacks

Everyone wins.


Why This Matters Now

Agents are becoming the new UI for software. In many cases, they already are.

Soon:

  • You won’t browse documentation; your agent will
  • You won’t search forums; your agent will
  • You won’t fill forms; your agent will
  • You won’t install apps; you’ll call capabilities through the agent

In that world, the agent interface becomes the primary site of human–machine communication.

If we don’t build an open, shared, privacy-first dataset from these interactions, then:

  • Knowledge becomes fragmented by vendor
  • Innovation slows outside a handful of companies
  • The next generation of models becomes less accessible
  • We repeat the centralization mistakes of Web 2.0, but worse

The web made knowledge discoverable and accessible. Agents could make it invisible and siloed—unless we intervene.


Technical Shape: What a V1 Could Look Like

If you wanted to actually build this, here’s a concrete architecture:

Core Components

1. Local Collector

  • Browser extension + optional desktop app
  • Captures:
    • Page content (for Web UIs like ChatGPT)
    • Messages via APIs / webhooks for OSS agents
  • Writes into a local encrypted store first
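
Here is a sketch of the “local encrypted store first” rule, using Fernet symmetric encryption from the cryptography package. The vault path and log format are assumptions:

```python
from pathlib import Path

from cryptography.fernet import Fernet

VAULT = Path.home() / ".agent-vault"   # hypothetical location
VAULT.mkdir(exist_ok=True)

# In a real collector the key would live in the OS keychain, not in memory like this.
key = Fernet.generate_key()
fernet = Fernet(key)

def capture(conversation_id: str, raw_message: str) -> None:
    """Encrypt a captured message and append it to the local vault.

    Nothing leaves the machine here; sync and donation are separate,
    explicitly opt-in steps.
    """
    token = fernet.encrypt(raw_message.encode("utf-8"))
    with open(VAULT / f"{conversation_id}.log", "ab") as f:
        f.write(token + b"\n")
```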

2. User Vault Service

  • Syncs from local to remote if the user wants
  • Exposes:
    • /me/conversations API
    • search, tags, embeddings, backups
  • UX: think Notion/Logseq but for AI chats
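
The vault service could be a thin HTTP layer over that store. A FastAPI sketch of the /me/conversations endpoint, with the route shape and filters as assumptions:

```python
from fastapi import FastAPI

app = FastAPI(title="User Vault Service")

# In practice this would read from the synced vault store, not an in-memory list.
_conversations: list[dict] = []

@app.get("/me/conversations")  # hypothetical route shape
def list_conversations(q: str | None = None, tag: str | None = None) -> list[dict]:
    """List, and optionally search or filter, the caller's own conversations."""
    results = _conversations
    if q:
        results = [c for c in results
                   if any(q.lower() in t["content"].lower() for t in c["turns"])]
    if tag:
        results = [c for c in results if tag in c.get("tags", [])]
    return results
```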

3. Data Donation Pipeline

  • Opt-in export from user vault to the Commons
  • Steps:
    1. PII detection & redaction
    2. De-duplication & model-output tagging
    3. Safety filtering & content moderation
    4. Attach metadata (ratings, topics, languages)
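
A sketch of those four steps chained into one function. The redaction here is a deliberately naive regex pass; a real pipeline would use a dedicated PII-detection model:

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def donate(turns: list[dict], seen_hashes: set[str]) -> dict | None:
    """Run one conversation through the donation pipeline, or reject it."""
    # 1. PII detection & redaction (naive sketch; use a real PII model in production)
    for t in turns:
        t["content"] = EMAIL.sub("[EMAIL]", t["content"])
        t["content"] = PHONE.sub("[PHONE]", t["content"])

    # 2. De-duplication via content hash; model outputs stay tagged by role
    digest = hashlib.sha256(
        "".join(t["content"] for t in turns).encode()).hexdigest()
    if digest in seen_hashes:
        return None
    seen_hashes.add(digest)

    # 3. Safety filtering (placeholder check; real moderation goes here)
    if not any(t["content"].strip() for t in turns):
        return None

    # 4. Attach metadata for downstream curation ("und" = undetermined language)
    return {"turns": turns, "sha256": digest, "language": "und", "topics": []}
```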

4. Commons Storage & API

  • Public data lake (e.g., Parquet / JSONL on S3-style storage)
  • Snapshots pushed to Hugging Face Datasets or Kaggle
  • Query APIs for research
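
The storage side is well-trodden territory. A sketch using pyarrow to emit a Parquet snapshot from donated records:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_snapshot(records: list[dict],
                   path: str = "commons-snapshot.parquet") -> None:
    """Serialize a batch of donated records into a columnar snapshot file.

    The same file can be uploaded to S3-style storage and mirrored to
    Hugging Face Datasets or Kaggle.
    """
    table = pa.Table.from_pylist(records)
    pq.write_table(table, path, compression="zstd")
```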

5. Governance / Transparency Tools

  • Data provenance tracking
  • Ability to “tombstone” a user’s contributions retroactively where legally required
  • Public dashboards: who uses the data, for what
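
One way to sketch the tombstone mechanism: keep a provenance record per contribution, and let revocation mark it so every future snapshot excludes it. All names here are assumptions:

```python
from datetime import datetime, timezone

provenance: dict[str, dict] = {}  # contribution_id -> provenance record

def register(contribution_id: str, dataset_snapshot: str) -> None:
    provenance[contribution_id] = {
        "snapshot": dataset_snapshot,
        "tombstoned": False,
        "consumers": [],  # public log of who used the data, and for what
    }

def tombstone(contribution_id: str) -> None:
    """Mark a contribution as revoked: it is dropped from all future snapshots.

    Already-trained models can't unlearn it, which is why the guarantee is
    forward-looking and applies retroactively only where legally required.
    """
    record = provenance[contribution_id]
    record["tombstoned"] = True
    record["revoked_at"] = datetime.now(timezone.utc).isoformat()
```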

Precedents: We’ve Done This Before

There are already precursors to this idea:

  • LAION’s OpenAssistant Conversations (OASST1) – a crowdsourced, human-generated assistant conversation dataset (161k messages, ~10k conversation trees, 13k+ volunteers)
  • LAION’s OIG (Open Instruction Generalist) – ~43M instruction samples, partially synthetic, meant as an open instruction-following dataset

But these are static, one-off datasets. What we need is more like:

“Stripe for conversation logs” + “Kaggle meets LAION for agent data” + “a user-centric data union.”


Key Challenges (and How to Address Them)

1. PII & Safety

  • Hard problem: no perfect anonymization
  • Approach: aggressively default-safe; allow per-conversation inspection; publish risk docs

2. Model Collapse & Source Separation

  • Datasets must distinguish:
    • Human prompts and edits
    • Model-generated suggestions
  • This becomes a feature: researchers can study human–AI collaboration, not just raw text
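
In practice this just means the role tags must survive all the way into the published records, so a researcher can make choices like the following (field names follow the schema sketch earlier in this post):

```python
def human_only(records: list[dict]) -> list[dict]:
    """Keep only human-authored turns, e.g. to avoid training on model output."""
    return [
        {**r, "turns": [t for t in r["turns"] if t["role"] == "user"]}
        for r in records
    ]
```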

3. License Clarity

  • Avoid the current mess where web-scraped data relies on “fair use” or unclear terms
  • Build a new, explicit “conversation donation” license that:
    • Grants rights for training and evaluation
    • Preserves user right to delete / revoke going forward

4. Avoiding Capture by Big Players

  • Make it structurally hard to “embrace, extend, extinguish”:
    • Non-profit charter prohibits sale or IP lock-in
    • Strong community governance around API access and reciprocity requirements

Business / Incentive Model Options

Model A – “Free if you donate data”

  • If you opt in to donate anonymized data, you get:
    • The personal vault for free
    • Extra features (advanced search, tagging, backup)
  • If you don’t want to donate, you pay:
    • A subscription, or self-host the OSS stack

Revenue comes from:

  • Enterprise / research licenses for curated slices of the commons
  • Possibly value-added services (better tooling, hosting, visualizations)

Model B – Data union & revenue sharing

  • When commercial actors buy access to specific slices (e.g., “agent conversations about programming with high-quality human feedback”), a portion of the money:
    • Goes to the foundation (infra)
    • Flows back to data contributors (users + apps), like a cooperative dividend
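
As a toy sketch of how such a dividend could be computed: split a sale between the foundation and contributors, pro-rata by donated conversations. The 30% foundation share is an arbitrary assumption:

```python
def split_revenue(sale_amount: float,
                  contributions: dict[str, int],
                  foundation_share: float = 0.30) -> dict[str, float]:
    """Divide a data-license sale: a fixed cut funds the foundation's infra,
    the rest flows to contributors by number of donated conversations."""
    payouts = {"foundation": sale_amount * foundation_share}
    pool = sale_amount - payouts["foundation"]
    total = sum(contributions.values()) or 1  # guard against empty pool
    for contributor, count in contributions.items():
        payouts[contributor] = pool * count / total
    return payouts
```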

Model C – Pure public-good model

  • Fully grant-funded / philanthropic (similar to some LAION efforts)
  • No paywall for users, no revenue share
  • Focus: maximize openness and scientific progress

In practice, you could start with Model C (bootstrapped via grants and donors) and evolve toward an A/B hybrid to sustain the infrastructure long-term.


This Is a Call to Action

We need:

  • An open-source protocol
  • A neutral foundation
  • A standardized schema
  • Privacy-first anonymization pipelines
  • Incentive mechanisms for users
  • A community of developers willing to integrate the SDK

This won’t be built by OpenAI, Google, or Anthropic—they have no incentive to open their logs.

It must be built by the open-source AI community, the same way the early web was built: messily, collaboratively, transparently, and collectively.

We are entering the era where agents become the primary interface for human knowledge work.

Let’s make sure that era isn’t built on closed data. Let’s ensure the next great dataset of the internet is owned by all of us.


This concept connects to several other ideas I’ve been exploring:

  1. The Web Is Quietly Ending: Agents as the New Front-End
    • Old training data: web, books, Stack Overflow, Reddit
    • Why that well is drying up and being polluted by AI-generated content
  2. The Risk of Knowledge Silos
    • If each vendor hoards their interaction logs, we regress to private knowledge stacks
    • This slows progress and centralizes power
  3. Self-Encapsulated Agents (see my previous post)
    • Building agents with compile-time embedded knowledge
    • Purpose-built distributions for specific domains

The Agent Data Commons is the infrastructure layer that makes all of this possible.


About this post: This essay began as a conversation about the future of AI training data and evolved into a concrete proposal for infrastructure. It explores what it would take to build a transparent, user-controlled, open dataset from our collective interactions with AI agents—before that data becomes permanently siloed.

If you’re interested in building this, thinking about governance models, or have ideas about how to make it happen—I’d love to hear from you. This is a problem the community needs to solve together.

The future of AI shouldn’t be built on closed conversations. Let’s build the commons we need.
