Media

Podcast Transcription Corpus — 890K Episodes, Speaker-Diarized, Topic-Classified

Full transcriptions of 890K English-language podcast episodes across 14 genres, including true crime, business, technology, health, comedy, and politics. Each episode is speaker-diarized, topic-modeled, and sentiment-scored at the segment level. Suited to podcast search engines, content recommendation systems, and long-form audio AI.

Formats

JSONL transcripts, VTT timestamps, Parquet metadata, MP3 aligned audio
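
As a rough illustration of how segment-level JSONL transcripts might be consumed, the sketch below sums speaking time per diarized speaker. The listing does not publish a schema, so every field name here (`episode_id`, `speaker`, `start`, `end`, `text`, `topic`, `sentiment`) and the sample records are assumptions made for illustration; request a sample to confirm the real layout.

```python
import io
import json
from collections import defaultdict

# Hypothetical sample: three segments of one episode in JSONL form.
# Field names and values are assumptions, not the listing's documented schema.
SAMPLE_JSONL = """\
{"episode_id": "ep-001", "speaker": "S1", "start": 0.0, "end": 4.5, "text": "Welcome back to the show.", "topic": "intro", "sentiment": 0.6}
{"episode_id": "ep-001", "speaker": "S2", "start": 4.5, "end": 9.0, "text": "Thanks for having me.", "topic": "intro", "sentiment": 0.8}
{"episode_id": "ep-001", "speaker": "S1", "start": 9.0, "end": 15.0, "text": "Let's talk about the case.", "topic": "true crime", "sentiment": -0.1}
"""


def speaking_time_by_speaker(lines):
    """Sum segment durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        segment = json.loads(line)
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


# A real file would be opened with open(path, encoding="utf-8") instead.
totals = speaking_time_by_speaker(io.StringIO(SAMPLE_JSONL))
print(totals)
```

Reading one record per line keeps memory use flat, which matters if a single episode file holds thousands of segments.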

Volume

890K episodes (~1.4M hours)

Time Range

2016-2026, 10 years

Refresh Rate

Weekly

Compliance & Privacy

No PII


Listed April 5, 2026