YouTube Transcript API: 4 Ways to Pull Captions in 2026 (with code)

Four real ways to pull YouTube transcripts in 2026, with working code for each, real per-1k costs, and the failure modes that show up in production but never in the README.
April 30, 2026

You went looking for a YouTube transcript API and found a mess. The official YouTube Data API will list captions but won't let you download them. The most popular open-source library breaks every few months when YouTube ships a tweak. The hosted services hide their pricing behind a signup form. So you're back here, trying to figure out which option won't blow up the week you ship.

This is a developer's guide to pulling YouTube transcripts in 2026. Four real approaches, working code for each, what each one costs at 1,000 transcripts, and the failure modes that show up in production but never in the README. We'll show our SDK in the relevant section. The comparison is honest. If the open-source library fits your case, use it.

Why this is harder than it should be

YouTube's official Data API has a captions.download method, but it only works on videos you own. Captions on other people's videos can be listed, not downloaded. That single restriction is the reason every transcript tool you've ever used is some flavor of scraper.

YouTube serves caption tracks to the player as timedtext XML or VTT. The unauthenticated endpoints that return them are not officially supported, and YouTube has tightened them several times in the last two years. The result is a small ecosystem of libraries and hosted APIs doing roughly the same job: hit those endpoints, parse the response, return JSON.

Your job is picking which one breaks least often for your traffic profile and your budget. The right answer at 100 transcripts a month is the wrong answer at 100,000.

Option 1: The open-source Python library

youtube-transcript-api on PyPI is the default starting point for most developers. Install with pip, pass a video ID, get back a list of {text, start, duration} segments. The maintainer is Jonas Depoix; the GitHub repo is the most-starred transcript library on the platform.

from youtube_transcript_api import YouTubeTranscriptApi

ytt = YouTubeTranscriptApi()
transcript = ytt.fetch("dQw4w9WgXcQ")

for segment in transcript:
    print(f"[{segment.start:.1f}s] {segment.text}")

That's the whole API for a single transcript. It supports auto-generated and manually uploaded captions, multiple languages, and translation between languages YouTube exposes.

The pitch is that it's free. No API key, no signup, no per-call cost. For a side project pulling a few hundred transcripts a week from your laptop, this is the right answer.

The problem is that "free" is a lie when you ship it. The library hits YouTube's unauthenticated timedtext endpoint, which means it shares your IP's rate limit with every other person doing the same thing. Pull from a residential connection and you're fine. Pull from an AWS or GCP egress IP, and you'll start seeing IpBlocked errors within a few hundred requests. The maintainer's recommended fix is configuring proxy backends, which is now your problem to operate.
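
If you do go the proxy route, the library ships pluggable proxy configs. A minimal sketch, assuming a Webshare account (the provider the maintainer's docs use as the example, as of the 1.x releases); GenericProxyConfig works the same way with any HTTP proxy URL:

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.proxies import WebshareProxyConfig

# Route every request through a rotating residential pool.
# Credentials are placeholders -- use your own proxy account.
ytt = YouTubeTranscriptApi(
    proxy_config=WebshareProxyConfig(
        proxy_username="YOUR_USERNAME",
        proxy_password="YOUR_PASSWORD",
    )
)
transcript = ytt.fetch("dQw4w9WgXcQ")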

The other failure mode is upstream changes. YouTube ships small tweaks to the player config every few months. The library is well-maintained, but there's typically a 1-3 day window after each change where transcripts return empty or malformed data and you're staring at a Slack thread asking why production broke. We've watched this play out three times in the last 18 months.

For a notebook, that's fine. For a feature you ship to customers, it's a maintenance tax you'll pay forever.

Option 2: The official YouTube Data API

The official Captions endpoint lives at youtube.captions.list and youtube.captions.download. Listing tracks costs 50 quota units per call. Downloading costs 200, and only works on videos uploaded by your authenticated account.

That second restriction is the catch. Search for "captions.download forbidden 403" and you'll find a decade of developers running into this wall. Google has explicitly chosen not to expose third-party caption downloads through the Data API. They aren't going to change their mind.

The Data API is the right tool when you control the videos: pulling transcripts for a brand's own channel, processing user uploads on your platform, or running internal QA on content you produce. For everything else, it's a non-starter.

The quota math. Default Data API quota is 10,000 units per day. If listing captions costs 50 units, that's 200 list calls per day before you hit the wall. Quota increases require an audit and several weeks of review. Don't plan around getting one.
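
If you're in the own-channel case, listing tracks is a few lines with google-api-python-client. A sketch, assuming a plain API key is enough for listing on your setup (downloads always need OAuth as the uploading account, and other people's videos 403 regardless):

import os
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey=os.environ["YT_API_KEY"])

# 50 quota units: returns track metadata (language, kind, lastUpdated),
# never the caption text itself.
response = youtube.captions().list(part="snippet", videoId="VIDEO_ID").execute()
for item in response["items"]:
    snippet = item["snippet"]
    print(snippet["language"], snippet["trackKind"], snippet["lastUpdated"])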

Option 3: The Influship SDK (or any hosted transcript API)

Hosted APIs do what the Python library does, except on someone else's servers, with someone else's IP rotation, with a paid SLA when something breaks. The current options worth comparing are Supadata, TranscriptAPI, youtube-transcript.io, and our own raw scrapers.

The pitch is the same across all of them: send a video ID, get a transcript back. The differences are in price, batch capability, and what else they bundle. For 10,000 transcripts a month:

Service                  Per-transcript price   Notes
Supadata                 ~$0.005-0.01           Plan-based, transcript credits expire
TranscriptAPI            ~$0.005                One credit = one transcript, search and channels included
youtube-transcript.io    ~$0.003-0.008          Token-based, plan-tied
Influship raw scrapers   $0.005 flat            Same credit applies to channel data, search, profiles

The Influship SDK ships in Python and TypeScript and exposes the raw scrapers as first-class methods. Install, set the env var, call the method:

Python:

# pip install influship
import os
from influship import Influship

client = Influship(api_key=os.environ["INFLUSHIP_API_KEY"])

transcript = client.raw.youtube.get_transcript(video_id="dQw4w9WgXcQ")
for segment in transcript.data.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")

TypeScript:

// npm install influship
import Influship from 'influship';

const client = new Influship({ apiKey: process.env['INFLUSHIP_API_KEY'] });

const transcript = await client.raw.youtube.getTranscript('dQw4w9WgXcQ');
for (const segment of transcript.data.segments) {
  console.log(`[${segment.start.toFixed(1)}s] ${segment.text}`);
}

Prefer hitting the REST endpoint directly? Same call:

curl -H "Authorization: Bearer $INFLUSHIP_API_KEY" \
  https://api.influship.com/v1/raw/youtube/transcript/dQw4w9WgXcQ

When a hosted API is the right call. You're running this in a server-side context where IP blocks would page someone. You're processing more than a few thousand transcripts a month. You need predictable per-call pricing for a customer-facing feature. You don't want to be the person who has to babysit proxy rotation.

When it isn't. You're pulling fewer than 1,000 transcripts a month from a residential IP, and you're comfortable patching the open-source library when YouTube changes things. In that case, the time you'd spend signing up and integrating an API costs more than the API would.

The Influship-specific reason to pick us over Supadata or TranscriptAPI: if you're already running creator search, profile lookups, or channel data through Influship, the transcript endpoint shares the same credit pool. One vendor, one bill, one auth header. If you only need transcripts and nothing else, TranscriptAPI is a fine choice and we'll happily lose that comparison.

Option 4: yt-dlp plus Whisper

If you don't trust the unofficial caption endpoints at all, you can sidestep them: download the audio with yt-dlp and run it through OpenAI Whisper or a hosted transcription service like Deepgram or AssemblyAI.
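
A minimal sketch of the pipeline, using yt-dlp's Python API and OpenAI's hosted Whisper endpoint (ffmpeg on PATH and the whisper-1 model name are assumptions to check against current docs):

import yt_dlp
from openai import OpenAI

# Download audio only and transcode to m4a (needs ffmpeg installed).
opts = {
    "format": "bestaudio/best",
    "outtmpl": "audio.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "m4a"}],
}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=dQw4w9WgXcQ"])

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("audio.m4a", "rb") as f:
    result = client.audio.transcriptions.create(model="whisper-1", file=f)
print(result.text)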

This is the most expensive option per transcript ($0.006-0.024 with Whisper depending on length, more with hosted alternatives) and by far the slowest (2-10 minutes per video versus 2-5 seconds for the others). What you get is a transcript that doesn't depend on YouTube's caption availability. Live videos, age-gated videos, videos in languages without auto-captions, all transcribable.

The other reason teams pick this path is quality. Auto-generated YouTube captions are passable for English on clear audio. They're rough on accents, technical jargon, and music. Whisper-large is materially better, especially on names, brands, and technical terms.

For most projects, that quality bump isn't worth a 100x latency hit. If your application's value depends on transcript quality (legal review, medical content, podcast indexing), it's the right tool. If you're trying to extract product mentions from a creator's videos, the captions are fine.

A worked example: extracting product mentions across a creator's channel

Pretend you're building a creator-research feature. A brand drops in a YouTube channel handle, and you want to surface every product the creator has mentioned in the last 50 videos.

You need three calls in sequence:

  1. List the creator's videos
  2. Fetch transcripts for each video
  3. Run NER (named-entity recognition) on the combined text

Doing this with the open-source library takes a few dozen lines and works fine for one channel from your laptop. Doing it for 200 channels in a customer-facing dashboard is the moment you want batch.

The Influship SDK exposes a single call that pulls up to 20 transcripts per channel:

Python:

result = client.raw.youtube.get_channel_transcripts(
    handle="mkbhd",
    sort="popularity",
    limit=20,
)

for video in result.data.transcripts:
    print(video.title, len(video.segments))

TypeScript:

const result = await client.raw.youtube.getChannelTranscripts('mkbhd', {
  sort: 'popularity',
  limit: 20,
});

for (const video of result.data.transcripts) {
  console.log(video.title, video.segments.length);
}
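
Step 3 is outside any transcript API. A sketch of the NER pass with spaCy, reusing the Python result above (the en_core_web_sm model and the PRODUCT/ORG label filter are our assumptions; tune both to your entity set):

import spacy

# pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for video in result.data.transcripts:
    text = " ".join(segment.text for segment in video.segments)
    doc = nlp(text)
    # PRODUCT and ORG catch most brand and product mentions.
    mentions = {ent.text for ent in doc.ents if ent.label_ in {"PRODUCT", "ORG"}}
    print(video.title, sorted(mentions))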

At $0.005 per transcript, processing 50 videos for a channel costs $0.25 (at 20 transcripts per batch call, that's three calls). Two hundred channels: $50. Cache by videoId after the first pull, and subsequent runs only pay for new uploads. The same flow on the open-source library is two API calls per video (list + fetch), no batching, and IP-block risk on every iteration. That's the difference between "shippable feature" and "demo on my laptop."
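
The cache can be as simple as JSON files keyed by videoId. A sketch, assuming the client from earlier (the helper name and paths are ours, not the SDK's):

import json
from pathlib import Path

CACHE_DIR = Path("transcript_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_transcript(video_id: str):
    """Return a cached transcript, paying for a fetch only on a miss."""
    path = CACHE_DIR / f"{video_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    transcript = client.raw.youtube.get_transcript(video_id=video_id)
    data = [{"start": s.start, "text": s.text} for s in transcript.data.segments]
    path.write_text(json.dumps(data))
    return data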

If you also need to find which channels to pull from, the same SDK has creator search exposed, and the MCP server post walks through wiring it into Claude or ChatGPT for natural-language queries.

Costs per 1,000 transcripts

Round numbers, US pricing:

  • Open-source library on residential IP: $0 in API costs, plus your time when YouTube changes things and you have to update.
  • Open-source library on rotating proxies: $50-150 per 1k transcripts depending on proxy provider.
  • Hosted transcript APIs (Supadata, TranscriptAPI, Influship, etc.): $3-10 per 1k transcripts, fixed.
  • yt-dlp + Whisper-large via OpenAI: $6-24 per 1k transcripts depending on average video length, plus compute.
  • yt-dlp + Deepgram or AssemblyAI: $20-80 per 1k transcripts depending on tier.

The hosted APIs are usually cheaper than DIY-with-proxies once you factor in the maintenance time. Whisper-via-OpenAI is the budget option if quality matters; hosted transcription services are the premium option.

Failure modes that bite you in production

Five things go wrong often enough to plan for.

No captions exist. Roughly 5-10% of videos either have no captions or have auto-generated captions disabled. Your code path needs to handle the empty response. The open-source library raises TranscriptsDisabled or NoTranscriptFound; hosted APIs typically return a 404 or empty segments array.

Captions exist but are useless. Music videos, ASMR, anything with minimal speech. Your transcript will technically come back, but it'll be a list of [Music] markers. If your downstream task assumes meaningful text, gate on segment count or character count.

Language mismatch. A creator's channel may have English captions on some videos and Spanish auto-captions on others. The open-source library lets you specify languages and falls back through them; hosted APIs default to the video's primary language and require a parameter to translate.
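
A defensive fetch that handles the first three failure modes in one place, using the open-source library's exceptions (hosted APIs need the equivalent checks on status codes and empty payloads; the 200-character floor is an arbitrary starting point):

from youtube_transcript_api import (
    NoTranscriptFound,
    TranscriptsDisabled,
    YouTubeTranscriptApi,
)

ytt = YouTubeTranscriptApi()

def safe_fetch(video_id: str, min_chars: int = 200):
    try:
        # Fall back through languages in order of preference.
        transcript = ytt.fetch(video_id, languages=["en", "es"])
    except (TranscriptsDisabled, NoTranscriptFound):
        return None  # no captions: skip, don't crash
    text = " ".join(segment.text for segment in transcript)
    if len(text) < min_chars:
        return None  # [Music]-only transcripts are useless downstream
    return transcript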

Rate limits, even on paid APIs. Hosted services have per-key rate limits, usually 10-50 requests per second. If you're processing a backlog, queue and throttle.
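
For backlogs, a blunt client-side throttle is enough. A sketch at 10 requests per second, again reusing the client from earlier (the limit itself is an assumption; check your plan):

import time

RATE = 10  # requests per second

def throttled_transcripts(video_ids):
    for video_id in video_ids:
        started = time.monotonic()
        yield client.raw.youtube.get_transcript(video_id=video_id)
        # Sleep off whatever remains of this request's time slice.
        time.sleep(max(0.0, 1.0 / RATE - (time.monotonic() - started)))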

Caption updates. Creators occasionally re-upload captions. If your pipeline caches transcripts by videoId and never refreshes, you'll have stale text on a small number of videos. Check the caption track's lastUpdated if your hosted API exposes it; the open-source library doesn't.

What we'd actually pick

A short opinionated decision tree:

  • Notebook, side project, fewer than 1,000 transcripts a month, residential IP: open-source Python library. Free is free. When it breaks, patch it.
  • Your own channel only, fewer than 200 videos a day: official YouTube Data API. The quota is enough and you don't need a third party.
  • Anything you'll ship to customers: the Influship SDK. We're priced at $0.005 per transcript, the same as the dedicated transcript-only competitors, and we throw in batch channel transcripts (up to 20 per call), YouTube search, channel data, and Instagram profile lookups on the same key. Buying a transcripts-only vendor and then bolting on three more vendors when your roadmap grows is the more expensive choice; we've watched teams do that math wrong twice this year.
  • Court-ready accuracy (legal review, medical transcription): nobody fetching YouTube's captions will help you here, including us. Use yt-dlp plus Whisper-large or a paid transcription service. Pay the cost. We'd rather you find out from this post than from a deposition.

If you only ever need transcripts and you're certain you'll never need creator search, channel metadata, or profile data, TranscriptAPI and Supadata are fine focused tools at the same price point. We don't think most production use cases stay that narrow, but if yours does, pick the cheapest option that does the one thing.

If you're building a creator-research feature where the transcript pull is one step in a wider workflow (finding influencers, verifying audience quality, running outreach), consolidate. One API, one credit pool, one thing to monitor.

Where to go next

The transcript endpoint is one piece of a developer-facing surface area that includes channel data, YouTube search, profile lookups, and AI agent integrations. If you're scoping a creator-research product, the Instagram Influencer Search API guide covers the equivalent for Instagram, and the Influencer Marketing MCP Server post shows how to wire all of it into Claude, ChatGPT, or Cursor with two lines of config.

For the broader vendor map, the Influencer Marketing APIs guide covers the eight categories of provider and which ones to evaluate against each other.