[Part 2] The YouTube script: youtube2llm.py

Hey, as promised, I’ll start introducing the different scripts in the Nootroscripts project.



Overview: analyse, embed, and ask anything about the content of a YouTube video.

Feel free to skip to Example usages for a quick start and refer to the documentation-like sections later if you’re not an RTFM kind of person.


  • action: analyse, embed, ask

analyse mode

  • vid: the video ID to retrieve or generate a transcript for

  • tf: the transcript file that the LLM will analyse or create embeddings for. When this is set, the value of vid is ignored

  • lmodel: the LLM to use for analysis (default: gpt-3.5-turbo). You can use your local ollama models. Mistral is great.

    • Note: this is different from the LLM used for embedding, which is currently hardcoded to OpenAI’s latest text embedding model.
  • mode: QnAs, note, summary/kp, tag, topix, thread, tp, cbb, definition, translation. See prompts.py for details.

  • prompt: a custom prompt to use in place of the built-in prompt for the selected mode (see the list above)

  • nc: don’t pass the video’s chapters to the LLM for analysis. Passing chapters works when you’re using models that support a larger context window (the chunking is less granular). Otherwise it tends to encourage the model to confabulate info for chapters that the current segment of the video hasn’t reached yet. Try comparing the results with and without this arg to understand what I mean.

  • dla: download the audio stream as mp4 and transcribe it using Whisper. In other words, don’t use YouTube’s auto caption or official caption for the video

  • lang: language code of the video, which Whisper uses to transcribe it more accurately. Defaults to en

embed mode

  • ef: embedding filename to use (a CSV file containing each piece of text, the vector generated for it by the embedding LLM, and the row’s ID)

ask mode

  • q: your question. This string is turned into a vector by the embedding LLM, and its distance to each of the vectors in the CSV is then calculated.
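To make the argument list above easier to picture, here is a minimal argparse sketch of how such a command line could be declared. It is an illustration only, not the script's actual code: the `build_parser` name, the help strings, and the exact default string for `--mode` ("QnAs") are my assumptions. The flag names and the `gpt-3.5-turbo` / `en` defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the arguments described above."""
    parser = argparse.ArgumentParser(prog="youtube2llm.py")
    parser.add_argument("action", choices=["analyse", "embed", "ask"])

    # analyse mode
    parser.add_argument("--vid", help="YouTube video ID")
    parser.add_argument("--tf", help="transcript file; --vid is ignored when set")
    parser.add_argument("--lmodel", default="gpt-3.5-turbo", help="LLM used for analysis")
    parser.add_argument("--mode", default="QnAs", help="assumed default; see prompts.py")
    parser.add_argument("--prompt", help="custom prompt replacing the built-in one")
    parser.add_argument("--nc", action="store_true", help="don't pass chapters")
    parser.add_argument("--dla", action="store_true", help="download audio, use Whisper")
    parser.add_argument("--lang", default="en", help="language code for Whisper")

    # embed / ask modes
    parser.add_argument("--ef", help="embedding CSV file")
    parser.add_argument("--q", help="question to ask")
    return parser


# Parse an explicit list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["analyse", "--vid", "Lsf166_Rd6M", "--nc", "--mode", "note"])
print(args.action, args.vid, args.mode, args.nc, args.lmodel)
```

Note that a video ID starting with a dash needs the `--vid=...` form, otherwise argparse treats it as another option.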


analyse mode

  • [YOUTUBE_VIDEO_ID]-metadata.json

  • transcript in [YOUTUBE_VIDEO_ID]-transcript.txt

  • LLM result(s) in [YOUTUBE_VIDEO_ID]-[MODE]-[LMODEL].md

  • [YOUTUBE_VIDEO_ID].mp4 if --dla is specified

embed mode

  • embedding file in embeddings/[TRANSCRIPT_FILENAME]-transcript_embedding.csv

ask mode

  • answer file in [YOUTUBE_VIDEO_ID]-answer-[MD5_HASH_OF_QUESTION_STRING].txt
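The naming patterns above can be sketched in a few lines of Python. This is a rough illustration under assumptions: the helper names `analysis_path` and `answer_path` are hypothetical, the `output` directory is taken from the example paths used in this post, and I'm assuming the MD5 hash is computed over the UTF-8 bytes of the question and written as a hex string.

```python
import hashlib
from pathlib import Path


def analysis_path(video_id: str, mode: str, lmodel: str, out_dir: str = "output") -> Path:
    """Build [YOUTUBE_VIDEO_ID]-[MODE]-[LMODEL].md (hypothetical helper)."""
    return Path(out_dir) / f"{video_id}-{mode}-{lmodel}.md"


def answer_path(video_id: str, question: str, out_dir: str = "output") -> Path:
    """Build [YOUTUBE_VIDEO_ID]-answer-[MD5_HASH_OF_QUESTION_STRING].txt (hypothetical helper)."""
    # Assumption: hash the UTF-8 bytes and use the 32-character hex digest.
    digest = hashlib.md5(question.encode("utf-8")).hexdigest()
    return Path(out_dir) / f"{video_id}-answer-{digest}.txt"


print(analysis_path("Lsf166_Rd6M", "note", "mistral"))
print(answer_path("Lsf166_Rd6M", "what is rule and what is norm"))
```

Hashing the question means the same question always maps to the same answer file, while different questions about the same video don't overwrite each other.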


analyse mode

  • attempts to retrieve caption (auto or official) from YouTube unless --tf or --dla is specified

  • runs the transcript through a text-completion LLM with the prompt selected (via --mode) or the custom prompt specified (via --prompt)

embed mode

  • runs the transcript through a text-embedding LLM

  • generates a CSV file with vectors from the transcript (txt) file
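As a rough picture of the embed step, the sketch below splits a transcript into chunks, turns each chunk into a vector, and writes one CSV row per chunk. Everything here is an assumption made for illustration: `toy_embed` is a stand-in that hashes words into buckets so the example runs offline (the real script calls OpenAI's embedding model), and the chunk size, column names and JSON encoding of the vector are hypothetical.

```python
import csv
import hashlib
import json
import os
import tempfile


def toy_embed(text: str, dims: int = 8) -> list:
    """Stand-in for a real embedding model: counts words hashed into `dims` buckets."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec


def chunk_words(text: str, size: int = 50) -> list:
    """Split the transcript into chunks of at most `size` words (assumed strategy)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def write_embeddings(transcript: str, csv_path: str, embed=toy_embed) -> int:
    """Write id, text and vector for every chunk; returns the number of data rows."""
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "text", "embedding"])  # assumed column names
        for i, chunk in enumerate(chunk_words(transcript)):
            writer.writerow([i, chunk, json.dumps(embed(chunk))])
            rows += 1
    return rows


# Usage: write a small demo file into a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-transcript_embedding.csv")
print(write_embeddings("rules are written down but norms are not " * 20, demo_path), "rows")
```

Swapping `toy_embed` for a function that calls a real embedding API is the only change needed to get meaningful vectors.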

ask mode

  • takes your question/query specified as --q, turns the string into a vector using the embedding LLM, and then calculates its distance to each of the vectors in the CSV to return the pieces of the embedded text(s) that it considers closest to your question. The best use case for this is searching for related concepts, related articles, and the like.
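The "closest vectors" idea can be sketched as follows. I'm using cosine similarity here as one common way to compare embedding vectors; I haven't confirmed which distance measure the script itself uses. The function names, the three-dimensional toy vectors and the sample texts are made up for illustration, and in practice the query vector would come from the same embedding model that produced the CSV.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_matches(query_vec, rows, k: int = 3):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-ins for rows loaded from the embedding CSV.
rows = [
    ("rules are written down", [1.0, 0.0, 0.0]),
    ("norms are unwritten expectations", [0.0, 1.0, 0.0]),
    ("the speaker thanks the audience", [0.0, 0.0, 1.0]),
]
query = [0.9, 0.1, 0.0]  # pretend embedding of the question

for score, text in top_matches(query, rows, k=2):
    print(f"{score:.3f}  {text}")
```

The top-ranked texts are what you would then read directly or hand to a completion model as context for answering the question.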

Example usages

A. Analyse (after retrieving YouTube's caption/subtitles or transcribing with Whisper)

# process a YouTube video, passing the video ID. will produce a list of Q&As for the video by default when --mode is not specified
youtube2llm.py analyse --vid 2JmfDKOyQcI

# process a transcript file (e.g. when you already have the caption / transcript saved locally, which avoids extra requests to YouTube)
youtube2llm.py analyse --tf output/FbquCdNZ4LM-transcript.txt

# don't send chapters, when it's making it too verbose or fragmented
youtube2llm.py analyse --vid=FbquCdNZ4LM --nc

# produce a note for this video
youtube2llm.py analyse --vid Lsf166_Rd6M --nc --mode note

# produce a list of definitions made in this video
youtube2llm.py analyse --vid Lsf166_Rd6M --nc --mode definition

# download and transcribe the audio file (don't use YouTube's auto caption)
youtube2llm.py analyse --vid=MNwdq2ofxoA --nc --lmodel mistral --dla

# the video has no subtitle, so it falls back to Whisper that transcribes it
youtube2llm.py analyse --vid=TVbeikZTGKY --nc

# this video in Bahasa Indonesia has no auto caption nor official subtitle, so we specify Indonesian language so Whisper's output is better
youtube2llm.py analyse --vid PDpyUMOOcyw --lang id

# this video has age filter on and can't be accessed without logging in
youtube2llm.py analyse --vid=FbquCdNZ4LM --nc
    pytube.exceptions.AgeRestrictedError: FbquCdNZ4LM is age restricted, and can't be accessed without logging in.
# in this case, I will use `yt-dlp -f` to get the audio file (m4a or mp4) and then run the audio file through audio2llm.py to transcribe it with Whisper
# another example of such video: https://www.youtube.com/watch?v=77ivEdhHKB0

# uses mistral (via ollama) for summarisation. the script will use OpenAI's API for the ask and embed modes (still TODO)
youtube2llm.py analyse --vid=sYmCnngKq00 --nc --lmodel mistral

# run a custom prompt against a video (first it will retrieve the transcript from YouTube or generate the transcript using Whisper)
youtube2llm.py analyse --vid="-3vmxQet5LA" --prompt "all the public speaking techniques and tactics shared"

# run a custom prompt against a transcript file
youtube2llm.py analyse --tf output/-3vmxQet5LA-transcript.txt --prompt "1. what is the Joke Structure and 2. what is the memory palace technique"

B. Embedding. Generates CSV file with vectors from any transcript (txt) file

# this generates output/embeddings/452f186b-54f2-4f66-a635-6e1f56afbdd4_media.mp3-transcript_embedding.csv
youtube2llm.py embed --tf output/452f186b-54f2-4f66-a635-6e1f56afbdd4_media.mp3.txt

C. Asking the embedding file a question / query

# without q(uery) specified, defaults to:
#   "what questions can I ask about what's discussed in the video so I understand the main argument and points that the speaker is making? and for each question please answer each and elaborate them in detail in the same response"
youtube2llm.py ask --ef output/embeddings/452f186b-54f2-4f66-a635-6e1f56afbdd4_media.mp3-transcript_embedding.csv

# or pass on a query
youtube2llm.py ask --ef output/embeddings/452f186b-54f2-4f66-a635-6e1f56afbdd4_media.mp3-transcript_embedding.csv --q "what is rule and what is norm"
youtube2llm.py ask --ef output/embeddings/452f186b-54f2-4f66-a635-6e1f56afbdd4_media-transcript_embedding.csv --q "if we could distill the transcript into 4 arguments or points, what would they be?"

# batching some video IDs in a text file (line-separated, one video ID per line)
while read -r f || [ -n "$f" ]; do youtube2llm.py analyse --vid="$f" --nc --lmodel mistral; done < list_of_youtube_video_ids.txt
