a builder's codex

Piracy is the only scalable way to get large digitized book datasets for AI training

By Tejas Shahasane · B2B | Content Marketer · 2026-04-10 · thread

Tier B · TL;DR
Piracy is the only scalable way to get large digitized book datasets for AI training

Claim

The only remaining option for a large, organized, and digitized repository of books is pirated P2P torrents and archives like Z-Library or Anna's Archive.

Mechanism

Legally obtaining books at scale is cumbersome and does not scale: Anthropic spent millions buying and manually digitizing books. Amazon can't provide a digital library because it is a marketplace, not a reseller, so it would need to negotiate opt-ins with every publisher and author. That leaves piracy as the only viable source for massive training datasets.
