url_content_extractor

URL Content Extractor Module.

Scans message text for supported URL types and returns extracted content as system-injected annotations that can be appended to the LLM context.

Supported sources: Twitter/X, YouTube, GitHub repos/issues/PRs, arXiv papers, Reddit threads, Wikipedia articles, GitHub Gists, Bluesky posts, Stack Overflow/Exchange, NVD CVE entries, Spotify, SoundCloud, TikTok, Vimeo, paste sites (Pastebin/Hastebin/etc.), and cryptocurrency price mentions.

Direct image URLs are downloaded as multimodal content parts, and any yt-dlp-supported video/image URL is fetched (with disk caching, SSRF guarding, and background downloads) and surfaced as both text annotations and multimodal parts.

All extracted user-generated content is wrapped with wrap_untrusted_data() to prevent prompt injection.

url_content_extractor.pre_flight_ssrf_check(url)[source]

Validate that a URL does not resolve to a private or link-local address.

SSRF guard run before any yt-dlp subprocess touches a user-supplied URL. It resolves the hostname via socket.getaddrinfo() and rejects the request if any resolved address falls inside _SSRF_BLOCKLIST (loopback, RFC 1918 private ranges, the 169.254.169.254 cloud-metadata endpoint, and the IPv6 equivalents), so an attacker cannot coax the bot into fetching internal services. Emits dns_resolution_error / ssrf_attempt_blocked counters through observability.

Called by get_ytdlp_video_metadata() and download_ytdlp_video() before spawning yt-dlp, and exercised directly by tests/test_ytdlp_security.py.

Parameters:

url (str) – The target URL (or bare host) to validate.

Raises:
  • ValueError – If the URL is empty/unparseable or DNS resolution fails.

  • PermissionError – If any resolved address is on the SSRF blocklist.

Return type:

None
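
The resolve-then-check approach described above can be sketched as follows. This is a minimal illustration, not the module's code: `BLOCKED_NETWORKS` and `ssrf_check_sketch` are hypothetical names, the network list is an assumption standing in for `_SSRF_BLOCKLIST`, and the observability counters are omitted.

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Hypothetical blocklist for illustration; the module's _SSRF_BLOCKLIST may differ.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "127.0.0.0/8",     # IPv4 loopback
        "10.0.0.0/8",      # RFC 1918
        "172.16.0.0/12",   # RFC 1918
        "192.168.0.0/16",  # RFC 1918
        "169.254.0.0/16",  # link-local, includes 169.254.169.254
        "::1/128",         # IPv6 loopback
        "fc00::/7",        # IPv6 unique local
        "fe80::/10",       # IPv6 link-local
    )
]


def ssrf_check_sketch(url: str) -> None:
    """Raise ValueError or PermissionError; return None when the host looks safe."""
    if not url or not url.strip():
        raise ValueError("empty URL")
    # Accept bare hosts by giving urlparse a network-location prefix.
    target = url if "://" in url else f"//{url}"
    host = urlparse(target).hostname
    if not host:
        raise ValueError(f"unparseable URL: {url!r}")
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror as exc:
        raise ValueError(f"DNS resolution failed for {host!r}") from exc
    for info in infos:
        # sockaddr[0] is the address string; drop any IPv6 zone id ("%eth0").
        address = ipaddress.ip_address(info[4][0].split("%")[0])
        if any(address in network for network in BLOCKED_NETWORKS):
            raise PermissionError(f"{host!r} resolves to blocked address {address}")
```

Checking every resolved address, rather than only the first, matters because a hostname can carry both a public and a private record.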

url_content_extractor.secure_in_memory_cookie_file(cookie_data)[source]

Write yt-dlp cookies to a transient RAM-disk file, deleting it on exit.

Context manager that materializes secret cookie data only as long as a yt-dlp subprocess needs it on disk. It creates a 0600 temp file under /dev/shm when available (so the bytes never hit persistent storage), yields its path, and unconditionally unlinks it in the finally block even if the body raises. Touches the filesystem via tempfile and os.

Called by get_ytdlp_video_metadata() and download_ytdlp_video() to pass user-supplied cookies to yt-dlp via --cookies, and exercised by tests/test_ytdlp_security.py.

Parameters:

cookie_data (str) – The Netscape-format cookie file contents to stage.

Yields:

str – Filesystem path to the temporary cookie file, valid only inside the with block.

Return type:

Generator[str, None, None]
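
A context manager with this create, yield, unlink shape might look like the sketch below. The name `cookie_file_sketch`, the file prefix, and the fallback to the default temp directory when /dev/shm is unavailable are assumptions for illustration.

```python
import contextlib
import os
import tempfile
from typing import Generator


@contextlib.contextmanager
def cookie_file_sketch(cookie_data: str) -> Generator[str, None, None]:
    # Prefer the RAM-backed /dev/shm when it exists and is writable;
    # dir=None makes tempfile fall back to the default temp directory.
    shm = "/dev/shm"
    ram_dir = shm if os.path.isdir(shm) and os.access(shm, os.W_OK) else None
    # mkstemp creates the file readable and writable only by the current user.
    fd, path = tempfile.mkstemp(prefix="cookies_", suffix=".txt", dir=ram_dir)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(cookie_data)
        yield path
    finally:
        # Runs on normal exit and when the with-body raises.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(path)
```

A caller would pass the yielded path to yt-dlp's `--cookies` option inside the `with` block; once the block exits, the path no longer exists.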

url_content_extractor.wrap_untrusted_data(content)[source]

Wrap untrusted content in unique random tags to neutralize prompt injection.

Surrounds externally fetched, user-generated text (tweets, READMEs, paste bodies, etc.) with an unguessable UNTRUSTED_DATA_<uuid>_... open/close tag pair so the LLM treats the enclosed span as inert data rather than instructions. The per-call random UUID (via uuid.uuid4()) prevents an attacker from forging a matching closing tag to break out of the wrapper.

Called by the extract_*_content extractors and the format_*_annotation builders in this module wherever remote text is folded into the model context. No external callers were found.

Parameters:

content (str) – The untrusted text to fence off.

Returns:

The content wrapped in a unique randomized untrusted-data tag pair.

Return type:

str
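
The random-tag idea can be illustrated with a short sketch. Only the `UNTRUSTED_DATA_<uuid>` prefix comes from the description above; the angle-bracket layout and the name `wrap_untrusted_sketch` are assumptions.

```python
import uuid


def wrap_untrusted_sketch(content: str) -> str:
    # A fresh random token per call; an attacker cannot predict it in advance.
    token = uuid.uuid4().hex
    tag = f"UNTRUSTED_DATA_{token}"
    # Hypothetical tag layout; only the UNTRUSTED_DATA_<uuid> prefix is from the docs.
    return f"<{tag}>\n{content}\n</{tag}>"
```

Because the token changes on every call, a payload that embeds its own closing tag cannot match the real one.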

async url_content_extractor.extract_tweet_content(text)[source]

Build system annotations for any Twitter/X URLs in the message text.

Scans whitespace-split words for tweet URLs (via is_tweet_url()) and, for each, fetches the tweet through get_tweet_content() (an HTTP/API call in url_utils). It summarizes the author, body, media count, and thread status into a bracketed [System auto-extracted tweet ...] line, fencing the tweet body through wrap_untrusted_data() so the remote text cannot inject instructions.

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for tweet URLs.

Returns:

One annotation string per resolvable tweet URL (empty if none match or none resolve).

Return type:

List[str]
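
The scan, resolve, and format pattern shared by the text extractors in this module can be sketched with stand-ins. Everything suffixed `_stub` or `_sketch`, the `fence` helper, and the annotation wording are hypothetical; the real helpers live in url_utils and perform network calls.

```python
import asyncio
from typing import List, Optional


def is_tweet_url_stub(word: str) -> bool:
    # Stand-in for is_tweet_url(); the real matcher is likely stricter.
    hosts = ("https://twitter.com/", "https://x.com/")
    return word.startswith(hosts) and "/status/" in word


async def get_tweet_content_stub(url: str) -> Optional[dict]:
    # Stand-in for get_tweet_content(); returns canned data instead of calling an API.
    if url.endswith("/404"):
        return None
    return {"author": "example", "text": "hello world"}


def fence(text: str) -> str:
    # Stand-in for wrap_untrusted_data().
    return f"<untrusted>{text}</untrusted>"


async def extract_tweets_sketch(text: str) -> List[str]:
    annotations: List[str] = []
    for word in text.split():
        if not is_tweet_url_stub(word):
            continue
        tweet = await get_tweet_content_stub(word)
        if tweet is None:  # unresolvable URL: skip it
            continue
        annotations.append(
            f"[System auto-extracted tweet by @{tweet['author']}: {fence(tweet['text'])}]"
        )
    return annotations
```

For example, `asyncio.run(extract_tweets_sketch("see https://x.com/a/status/1"))` yields a single annotation, while text without tweet links yields an empty list.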

async url_content_extractor.extract_youtube_content(text)[source]

Build system annotations for any YouTube URLs in the message text.

Scans whitespace-split words for YouTube links (via is_youtube_url()) and fetches lightweight metadata through get_youtube_content() (an HTTP/oEmbed call in url_utils), emitting a bracketed [System auto-extracted YouTube video ...] line with the title and channel and a Shorts marker when applicable. The title is fenced through wrap_untrusted_data(). This is the cheap text-only path; the heavier transcript/video download is handled separately by extract_ytdlp_video_content().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for YouTube URLs.

Returns:

One annotation string per resolvable YouTube URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_spotify_content(text)[source]

Build system annotations for any Spotify URLs in the message text.

Scans whitespace-split words for Spotify links (via is_spotify_url()) and resolves each through get_spotify_content() (an HTTP/oEmbed call in url_utils), emitting a bracketed [System auto-extracted Spotify <type> ...] line labelled with the capitalized resource type (track, album, playlist, etc.). The title is fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for Spotify URLs.

Returns:

One annotation string per resolvable Spotify URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_soundcloud_content(text)[source]

Build system annotations for any SoundCloud URLs in the message text.

Scans whitespace-split words for SoundCloud links (via is_soundcloud_url()) and resolves each through get_soundcloud_content() (an HTTP/oEmbed call in url_utils), emitting a bracketed [System auto-extracted SoundCloud track ...] line with title, author, and a truncated (150-char) description when present. Title and description are each fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for SoundCloud URLs.

Returns:

One annotation string per resolvable SoundCloud URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_tiktok_content(text)[source]

Build system annotations for any TikTok URLs in the message text.

Scans whitespace-split words for TikTok links (via is_tiktok_url()) and resolves each through get_tiktok_content() (an HTTP/oEmbed call in url_utils), emitting a bracketed [System auto-extracted TikTok video ...] line with the author and a title truncated to 200 chars. The title is fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for TikTok URLs.

Returns:

One annotation string per resolvable TikTok URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_vimeo_content(text)[source]

Build system annotations for any Vimeo URLs in the message text.

Scans whitespace-split words for Vimeo links (via is_vimeo_url()) and resolves each through get_vimeo_content() (an HTTP/oEmbed call in url_utils), emitting a bracketed [System auto-extracted Vimeo video ...] line with title, author, and an M:SS duration when known. The title is fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for Vimeo URLs.

Returns:

One annotation string per resolvable Vimeo URL (empty if none match or none resolve).

Return type:

List[str]
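
As a small illustration of the M:SS duration format mentioned above, a seconds value could be rendered like this. The helper name `format_m_ss` is hypothetical, and leaving hours folded into the minutes field is an assumption.

```python
def format_m_ss(total_seconds: int) -> str:
    # Split into whole minutes and leftover seconds; pad seconds to two digits.
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes}:{seconds:02d}"
```

With this sketch, 75 seconds renders as `1:15` and 5 seconds as `0:05`.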

async url_content_extractor.extract_github_content(text)[source]

Build system annotations for any GitHub repo/issue/PR URLs in the text.

Scans whitespace-split words for GitHub links (via is_github_url()) and resolves each through get_github_content() (a GitHub HTTP API call in url_utils). For repositories it emits owner/repo, description, language, star count, and a README preview; for issues and pull requests it emits the number, state, title, and a body preview. Every remote string (description, README, title, body) is fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for GitHub URLs.

Returns:

One annotation string per resolvable GitHub URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_arxiv_content(text)[source]

Build system annotations for any arXiv URLs in the message text.

Scans whitespace-split words for arXiv links (via is_arxiv_url()) and resolves each through get_arxiv_content() (an arXiv HTTP API call in url_utils), emitting a bracketed [System auto-extracted arXiv ...] line with the paper id, title, author list (collapsed to et al. past three), category, and abstract. Title and abstract are each fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for arXiv URLs.

Returns:

One annotation string per resolvable arXiv URL (empty if none match or none resolve).

Return type:

List[str]
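
The author-list collapsing could work roughly as below. The name `collapse_authors`, the comma separator, and the choice to keep the first three names before "et al." are assumptions; the description above only states that lists longer than three are collapsed.

```python
from typing import List


def collapse_authors(authors: List[str], limit: int = 3) -> str:
    # Short lists are shown in full; longer ones keep the first `limit` names.
    if len(authors) <= limit:
        return ", ".join(authors)
    return ", ".join(authors[:limit]) + " et al."
```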

async url_content_extractor.extract_reddit_content(text)[source]

Build system annotations for any Reddit thread URLs in the text.

Scans whitespace-split words for Reddit links (via is_reddit_url()) and resolves each through get_reddit_content() (a Reddit HTTP/JSON call in url_utils), emitting a bracketed [System auto-extracted Reddit ...] line with the subreddit, post title, a truncated OP selftext, and the top comments. Title, selftext, and every comment body are each fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for Reddit URLs.

Returns:

One annotation string per resolvable Reddit URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_wikipedia_content(text)[source]

Build system annotations for any Wikipedia article URLs in the text.

Scans whitespace-split words for Wikipedia links (via is_wikipedia_url()) and resolves each through get_wikipedia_content() (a Wikipedia REST/summary HTTP call in url_utils), emitting a bracketed [System auto-extracted Wikipedia ...] line carrying the language code, page title, and lead extract. Title and extract are each fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for Wikipedia URLs.

Returns:

One annotation string per resolvable Wikipedia URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_gist_content(text)[source]

Build system annotations for any GitHub Gist URLs in the text.

Scans whitespace-split words for Gist links (via is_gist_url()) and resolves each through get_gist_content() (a GitHub Gist HTTP API call in url_utils), emitting a bracketed [System auto-extracted GitHub Gist ...] line with the owner, description, and the full content of each file (name, language, body). The description and every file body are fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for Gist URLs.

Returns:

One annotation string per resolvable Gist URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_bluesky_content(text)[source]

Build system annotations for any Bluesky post URLs in the text.

Scans whitespace-split words for Bluesky links (via is_bluesky_url()) and resolves each through get_bluesky_content() (an AT Protocol HTTP call in url_utils), emitting a bracketed [System auto-extracted Bluesky post ...] line with the author, post text, engagement counts (likes/reposts/replies), and a media marker. The post text is fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for Bluesky URLs.

Returns:

One annotation string per resolvable Bluesky URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_stackoverflow_content(text)[source]

Build system annotations for any Stack Exchange URLs in the text.

Scans whitespace-split words for Stack Overflow / Stack Exchange links (via is_stackoverflow_url()) and resolves each through get_stackoverflow_content() (a Stack Exchange HTTP API call in url_utils), emitting a bracketed [System auto-extracted Stack Exchange ...] line with the site, question title, score, answer count, tags, the question body, and the accepted/top answer. Title, question body, and answer body are each fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for Stack Exchange URLs.

Returns:

One annotation string per resolvable Stack Exchange URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_nvd_cve_content(text)[source]

Build system annotations for any NVD CVE URLs in the text.

Scans whitespace-split words for NVD CVE links (via is_nvd_cve_url()) and resolves each through get_nvd_cve_content() (an NVD HTTP API call in url_utils), emitting a bracketed [System auto-extracted NVD CVE ...] line with the CVE id, CVSS v3 (or v2 fallback) score and severity, publish date, CWE ids, description, affected products, and the first few references. The description is fenced through wrap_untrusted_data().

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for NVD CVE URLs.

Returns:

One annotation string per resolvable CVE URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_paste_content(text)[source]

Build system annotations for any paste-site URLs in the text.

Scans whitespace-split words for paste links such as Pastebin or Hastebin (via is_paste_url()) and fetches the raw paste through get_paste_content() (an HTTP call in url_utils), emitting a bracketed [System auto-injected paste content ...] line tagged with the site and paste id. The fetched paste body is fenced through wrap_untrusted_data() since it is fully attacker-controlled.

Dispatched as one of the parallel text_extractors gathered by extract_all_url_content() (wrapped in _safe_extract()). No other callers were found.

Parameters:

text (str) – Raw message text to scan for paste-site URLs.

Returns:

One annotation string per resolvable paste URL (empty if none match or none resolve).

Return type:

List[str]

async url_content_extractor.extract_image_urls(text)[source]

Find direct image URLs in text and download them as multimodal parts.

Matches image links against _IMAGE_URL_RE (Discord CDN attachments, Imgur, and bare .png/.jpg/.gif/etc. URLs), de-duplicates them, and fetches each over HTTP via download_image_url(). Each download is normalized through platforms.media_common: GIFs are re-encoded to MP4 via maybe_reencode_gif() so the Gemini API receives a well-supported video format, and still images have their MIME type reconciled against the actual bytes via reconcile_image_mimetype(). The result is base64 data: URLs wrapped as OpenRouter content parts; each download is logged at info level.

Run (inside _images_safe()) as part of the concurrent gather in extract_all_url_content(), and exercised directly by tests/core/test_image_url_extraction.py.

Parameters:

text (str) – Raw message text to scan for image URLs.

Returns:

OpenRouter image_url / video_url content-part dicts, one per successfully downloaded URL (empty if none match or download).

Return type:

List[Dict[str, Any]]
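
The match-and-de-duplicate step can be sketched as follows. `IMAGE_URL_SKETCH_RE` is a hypothetical, simplified pattern that only covers bare image extensions; the module's `_IMAGE_URL_RE` also recognizes Discord CDN and Imgur links. Downloading and re-encoding are left out.

```python
import re
from typing import List

# Hypothetical pattern covering only bare image extensions; the module's
# _IMAGE_URL_RE also handles Discord CDN and Imgur links.
IMAGE_URL_SKETCH_RE = re.compile(
    r"https?://\S+?\.(?:png|jpe?g|gif|webp)(?:\?\S*)?(?=\s|$)",
    re.IGNORECASE,
)


def find_image_urls(text: str) -> List[str]:
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(IMAGE_URL_SKETCH_RE.findall(text)))
```

Preserving first-seen order keeps the resulting content parts aligned with the order in which the user mentioned the images.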

async url_content_extractor.extract_crypto_prices(text)[source]

Build a system annotation of live prices for crypto symbols in the text.

Detects cryptocurrency ticker mentions via detect_crypto_mentions(), and if any are found fetches their live quotes through get_crypto_prices() (a Kraken HTTP API call in url_utils). It formats name, symbol, price, 24h change (with an up/down arrow), and 24h range into a single [System auto-injected cryptocurrency prices ...] block. Returns None when nothing is mentioned or the fetch yields no prices.

Run (inside _crypto_safe()) as part of the concurrent gather in extract_all_url_content(). No other callers were found.

Parameters:

text (str) – Raw message text to scan for crypto symbol mentions.

Returns:

The formatted price annotation block, or None when no symbols are mentioned or no prices resolve.

Return type:

Optional[str]

url_content_extractor.video_cache_lookup(url)[source]

Look up a URL in the disk media cache and return its files and metadata.

Reads the {key}.json sidecar in _VIDEO_CACHE_DIR (key from _video_cache_key()), treating a missing/corrupt sidecar or a TTL expiry past _VIDEO_CACHE_TTL as a miss (evicting expired entries via _evict_cache_entry()). It validates every referenced filename through _safe_video_cache_child() and confirms it exists; on a full hit it refreshes the entry’s cached_at (LRU touch) and rewrites the sidecar. Touches the filesystem only.

Called by extract_ytdlp_video_content() (via asyncio.to_thread()) to serve cached media without re-downloading, and exercised by tests/test_ytdlp_media_parts.py.

Parameters:

url (str) – The source media URL to look up.

Returns:

(paths, metadata) on a hit, or ([], None) on a miss / expiry / missing file.

Return type:

tuple[list[Path], dict | None]
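
A simplified version of this sidecar lookup is sketched below. The SHA-256 key derivation, the `filenames` key name, the one-hour TTL, and all `_sketch` names are assumptions; eviction of expired entries is noted but not implemented.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 3600  # hypothetical; stands in for _VIDEO_CACHE_TTL


def cache_key_sketch(url: str) -> str:
    # Assumed key derivation; _video_cache_key() may hash differently.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:32]


def cache_lookup_sketch(cache_dir: Path, url: str, now: Optional[float] = None):
    now = time.time() if now is None else now
    sidecar = cache_dir / f"{cache_key_sketch(url)}.json"
    try:
        meta = json.loads(sidecar.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return [], None  # missing or corrupt sidecar
    if not isinstance(meta, dict) or now - meta.get("cached_at", 0) > CACHE_TTL_SECONDS:
        return [], None  # expired (the real function also evicts the entry)
    paths = []
    root = cache_dir.resolve()
    for name in meta.get("filenames", []):
        child = (cache_dir / str(name)).resolve()
        # Reject names that escape the cache directory, and missing files.
        if child.parent != root or not child.is_file():
            return [], None
        paths.append(child)
    if not paths:
        return [], None
    meta["cached_at"] = now  # refresh the timestamp on a hit
    sidecar.write_text(json.dumps(meta), encoding="utf-8")
    return paths, meta
```

Resolving each filename and comparing its parent against the cache root is one way to keep a tampered sidecar from pointing outside the cache directory.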

url_content_extractor.video_cache_store(url, video_src, metadata)[source]

Copy freshly downloaded media into the disk cache and write its sidecar.

Persists yt-dlp output for later reuse: it copies each source file into _VIDEO_CACHE_DIR under the URL’s _video_cache_key() (single files keep their extension; multiple files are suffixed _0, _1, …), classifies the entry as image or video via _is_image_path(), and writes a {key}.json sidecar with the supplied metadata plus filenames, media kind, total size, and cached_at. Then it calls _enforce_cache_limits() to apply TTL and the size cap. Touches the filesystem (copies and JSON writes) via shutil.

Called by message_processor.video_history_patch after a background download completes, and exercised by tests/test_ytdlp_media_parts.py.

Parameters:
  • url (str) – The source media URL (used to derive the cache key).

  • video_src (Path | list[Path]) – One path or a list of paths to copy in.

  • metadata (dict) – Base metadata (title/channel/etc.) merged into the sidecar.

Returns:

The cached destination path(s), in source order.

Return type:

list[Path]

Raises:

ValueError – If video_src resolves to an empty list of sources.
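
The file-naming rule (one file keeps its extension, several get `_0`, `_1`, … suffixes) can be sketched like this. `cache_store_sketch` is a hypothetical helper that takes the cache key directly; sidecar writing and cache-limit enforcement are omitted.

```python
import shutil
from pathlib import Path
from typing import List, Union


def cache_store_sketch(
    cache_dir: Path, key: str, sources: Union[Path, List[Path]]
) -> List[Path]:
    files = [sources] if isinstance(sources, Path) else list(sources)
    if not files:
        raise ValueError("no source files to cache")
    cache_dir.mkdir(parents=True, exist_ok=True)
    stored = []
    for index, src in enumerate(files):
        # One file keeps "<key><ext>"; several become "<key>_0<ext>", "<key>_1<ext>", ...
        stem = key if len(files) == 1 else f"{key}_{index}"
        dest = cache_dir / f"{stem}{src.suffix}"
        shutil.copy2(src, dest)
        stored.append(dest)
    return stored
```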

async url_content_extractor.get_ytdlp_video_metadata(url, cookies_text=None)[source]

Fetch lightweight media metadata via yt-dlp --dump-json (no download).

Probes a media URL for title, channel, duration, and extractor without pulling the actual file, so callers can decide whether to download. It first runs pre_flight_ssrf_check() to block private/metadata hosts, then delegates to the nested _run helper which spawns yt-dlp. Cookie auth is resolved here: user-supplied cookies_text is staged on a RAM disk via secure_in_memory_cookie_file(), otherwise the default _DEFAULT_COOKIES_PATH is used when present.

Called by extract_ytdlp_video_content() on a cache miss to gate download decisions, and exercised by tests/test_ytdlp_security.py.

Parameters:
  • url (str) – The media URL to probe (validated for SSRF first).

  • cookies_text (str | None) – Optional Netscape-format cookie contents for authenticated extraction; falls back to the default cookies file.

Returns:

A title/channel/duration/extractor/url dict on success, a {"_cookie_error": True, ...} sentinel when auth is required, or None on any failure.

Return type:

dict | None

Raises:

ValueError | PermissionError – Propagated from pre_flight_ssrf_check() for blocked or unresolvable hosts.
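
The metadata probe boils down to building a yt-dlp command line and parsing its JSON output. The sketch below covers only those two pure steps and does not spawn a process. `build_metadata_argv` and `parse_metadata` are hypothetical names, and the use of `--no-playlist`, the `--` separator, and the field mapping are assumptions about how the nested `_run` helper might behave.

```python
import json
from typing import List, Optional


def build_metadata_argv(url: str, cookie_path: Optional[str] = None) -> List[str]:
    argv = ["yt-dlp", "--dump-json", "--no-playlist"]
    if cookie_path:
        argv += ["--cookies", cookie_path]
    # "--" ends option parsing so a URL starting with "-" is not read as a flag.
    return argv + ["--", url]


def parse_metadata(stdout: str) -> Optional[dict]:
    try:
        info = json.loads(stdout)
    except ValueError:
        return None  # not valid JSON
    if not isinstance(info, dict):
        return None
    return {
        "title": info.get("title"),
        "channel": info.get("channel") or info.get("uploader"),
        "duration": info.get("duration"),
        "extractor": info.get("extractor"),
        "url": info.get("webpage_url"),
    }
```

Passing the arguments as a list rather than a shell string avoids shell interpolation of the user-supplied URL.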

async url_content_extractor.download_ytdlp_video(url, cookies_text=None)[source]

Download a media URL with yt-dlp into a temp dir and return its file(s).

The heavy companion to get_ytdlp_video_metadata(): it actually pulls the media (capped at 720p, _MAX_VIDEO_FILESIZE) for caching and multimodal use. It runs pre_flight_ssrf_check(), verifies yt-dlp is installed via shutil.which(), creates a temp working dir, and delegates to the nested _run helper (which spawns yt-dlp and resolves outputs via _resolve_ytdlp_download_paths()). Cookie auth mirrors the metadata path: user cookies_text is staged on a RAM disk via secure_in_memory_cookie_file(), else the default _DEFAULT_COOKIES_PATH is used when present. Touches the filesystem (temp dir + downloaded files).

Called by message_processor.video_history_patch to perform the background download, and exercised by tests/test_ytdlp_security.py.

Parameters:
  • url (str) – The media URL to download (validated for SSRF first).

  • cookies_text (str | None) – Optional Netscape-format cookie contents for authenticated download; falls back to the default cookies file.

Returns:

(paths, None) on success (paths may be multiple images for image-only extractions), or ([], error) on failure (error may be the literal "cookie_error" sentinel).

Return type:

tuple[list[Path], str | None]

Raises:

ValueError | PermissionError – Propagated from pre_flight_ssrf_check() for blocked or unresolvable hosts.

url_content_extractor.format_ytdlp_downloading_annotation(url, meta, *, kind='media')[source]

Build the context annotation shown while a yt-dlp download is in flight.

Produces a bracketed [System auto-extracted metadata ...] line carrying the title, channel, duration (formatted via _format_duration()), and platform, plus a note telling the model the actual media is still downloading in the background and only metadata is available for now. The kind keyword adjusts the wording for video, image, or unknown media. The title is fenced through wrap_untrusted_data().

Called by extract_ytdlp_video_content() (and by the back-compat shim format_video_downloading_annotation()) when a download is queued. No external callers were found.

Parameters:
  • url (str) – The source media URL the annotation refers to.

  • meta (dict) – yt-dlp metadata (title/channel/duration/extractor).

  • kind (str) – One of video, image, or media (default), tuning the noun used in the message.

Returns:

The formatted “downloading” annotation block.

Return type:

str

url_content_extractor.format_video_downloading_annotation(url, meta)[source]

Backward-compatible alias for format_ytdlp_downloading_annotation().

Thin shim kept for older call sites that predate the kind keyword; it forwards to format_ytdlp_downloading_annotation() with kind="media". No external callers were found.

Parameters:
  • url (str) – The source media URL the annotation refers to.

  • meta (dict) – yt-dlp metadata passed straight through.

Returns:

The formatted “downloading” annotation block.

Return type:

str

url_content_extractor.format_ytdlp_ready_annotation(url, meta, *, kind='video')[source]

Build the context annotation for cached yt-dlp media that is ready to use.

Produces a bracketed [System auto-extracted <video|image> ...] line with the title, channel, duration (via _format_duration()), and platform, used when the media bytes are already in context (cache hit or just-finished download) so no “still downloading” caveat is needed. The kind keyword selects the video vs image label. The title is fenced through wrap_untrusted_data().

Called by extract_ytdlp_video_content() on a cache hit, by message_processor.video_history_patch once a background download lands, and by the back-compat shim format_video_ready_annotation().

Parameters:
  • url (str) – The source media URL the annotation refers to.

  • meta (dict) – yt-dlp metadata (title/channel/duration/extractor).

  • kind (str) – video (default) or image.

Returns:

The formatted “ready” annotation block.

Return type:

str

url_content_extractor.format_video_ready_annotation(url, meta)[source]

Backward-compatible alias for format_ytdlp_ready_annotation().

Thin shim kept for older call sites; forwards to format_ytdlp_ready_annotation() with kind="video". No external callers were found.

Parameters:
  • url (str) – The source media URL the annotation refers to.

  • meta (dict) – yt-dlp metadata passed straight through.

Returns:

The formatted “ready” annotation block for a video.

Return type:

str

url_content_extractor.format_video_failed_annotation(url, meta)[source]

Build the context annotation for a video whose download failed.

Produces a bracketed [System auto-extracted video metadata ...] line with title, channel, duration (via _format_duration()), and platform, followed by a note that the download failed and only metadata is available, so the model can answer about the video without claiming to have watched it. The title is fenced through wrap_untrusted_data().

Called by message_processor.video_history_patch when a background download errors out. No other callers were found.

Parameters:
  • url (str) – The source media URL the annotation refers to.

  • meta (dict) – yt-dlp metadata (title/channel/duration/extractor).

Returns:

The formatted “download failed” annotation block.

Return type:

str

Build the context annotation when a video needs cookie authentication.

Produces a bracketed [System note ...] line telling the model the URL is gated (age-restricted, private, members-only, etc.) and that someone must supply a cookies.txt via the set_user_api_key flow with service yt_dlp_cookies, so the assistant can guide the user instead of silently failing. Unlike the other builders this takes no metadata and does not wrap anything, since the URL is the only interpolated value.

Called by extract_ytdlp_video_content() when metadata extraction returns the cookie-error sentinel. No external callers were found.

Parameters:

url (str) – The authentication-gated media URL.

Returns:

The formatted cookie-required note.

Return type:

str

url_content_extractor.format_video_too_long_annotation(url, meta)[source]

Build the context annotation for a video over the duration limit.

Produces a bracketed [System auto-extracted video metadata ...] line with title, channel, duration (via _format_duration()), and platform, noting that the video exceeds the _MAX_VIDEO_DURATION cap so only metadata (no download) is available. The title is fenced through wrap_untrusted_data().

Called by extract_ytdlp_video_content() when the probed duration exceeds the cap. No external callers were found.

Parameters:
  • url (str) – The source media URL the annotation refers to.

  • meta (dict) – yt-dlp metadata (title/channel/duration/extractor).

Returns:

The formatted “too long” annotation block.

Return type:

str

url_content_extractor.build_media_url_part_from_file(path)[source]

Read a media file and build an OpenRouter content part from its bytes.

Reads the file, guesses its MIME type via mimetypes, and (for images) reconciles that type against the actual bytes with reconcile_image_mimetype_sync() from platforms.media_common. It base64-encodes the bytes into a data: URL and wraps it as an image_url part for images or a video_url part otherwise; a file whose MIME type cannot be guessed but whose suffix is a known media suffix defaults to video/mp4. Reads the filesystem; because it is synchronous, callers typically dispatch it via asyncio.to_thread().

Called by extract_ytdlp_video_content() and message_processor.video_history_patch to turn cached media into model input, by the alias build_video_url_part(), and exercised by tests/test_ytdlp_media_parts.py and tests/test_media_image_mime_reconcile.py.

Parameters:

path (Path) – Filesystem path to the media file to encode.

Returns:

An OpenRouter image_url or video_url content-part dict with an inline base64 data: URL.

Return type:

dict[str, Any]
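
The read, guess, and encode steps might be sketched as below. `media_part_sketch` is a hypothetical name, the content-part dict shape is an assumption about the OpenRouter format, the byte-level MIME reconciliation is omitted, and the fallback is simplified to apply to any unguessable type.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Any, Dict


def media_part_sketch(path: Path) -> Dict[str, Any]:
    mime, _ = mimetypes.guess_type(path.name)
    mime = mime or "video/mp4"  # simplified fallback for unknown types
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    # Images become image_url parts; everything else is treated as video.
    kind = "image_url" if mime.startswith("image/") else "video_url"
    return {"type": kind, kind: {"url": data_url}}
```

From async code, a call such as `await asyncio.to_thread(media_part_sketch, path)` keeps the blocking file read off the event loop.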

url_content_extractor.build_video_url_part(video_path)[source]

Legacy alias for build_media_url_part_from_file().

Retained for older call sites that assumed the input was always a video; it simply forwards to build_media_url_part_from_file(), which also classifies images correctly. Prefer that function directly in new code. No external callers were found.

Parameters:

video_path (Path) – Filesystem path to the media file to encode.

Returns:

An OpenRouter image_url or video_url content-part dict.

Return type:

dict[str, Any]

url_content_extractor.ytdlp_paths_are_image_only(paths)[source]

Report whether a yt-dlp download produced only image files.

Returns True only for a non-empty list whose every entry is an image by suffix (via _is_image_path()), distinguishing an image-only extraction (e.g. a gallery) from a video download so the right ytdlp_media_kind and annotation label can be chosen.

Called by extract_ytdlp_video_content() and message_processor.video_history_patch to derive the media kind. No other callers were found.

Parameters:

paths (list[Path]) – The downloaded file paths to classify.

Returns:

True if the list is non-empty and contains only image files.

Return type:

bool
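
The non-empty-and-all-images rule is easy to get wrong, because `all()` on an empty list returns True. A minimal sketch, with a hypothetical suffix set standing in for whatever `_is_image_path()` accepts:

```python
from pathlib import Path
from typing import List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}  # hypothetical set


def paths_are_image_only_sketch(paths: List[Path]) -> bool:
    # bool(paths) guards the empty list, for which all() alone would return True.
    return bool(paths) and all(p.suffix.lower() in IMAGE_SUFFIXES for p in paths)
```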

async url_content_extractor.extract_ytdlp_video_content(text, user_id='', redis_client=None, config=None)[source]

Extract yt-dlp-supported video URLs and return context parts.

Returns (text_annotations, multimodal_parts, download_requests).

  • text_annotations: text strings to append to context.

  • multimodal_parts: image_url / video_url content-part dicts (cache hits only).

  • download_requests: dicts {"url": str, "metadata": dict, "cookies_text": str|None} for URLs that need background downloading.

Return type:

tuple[list[str], list[dict[str, Any]], list[dict[str, Any]]]

Parameters:
  • text (str)

  • user_id (str)

  • redis_client (Any)

  • config (Any)

async url_content_extractor.extract_all_url_content(message_content, user_id='', redis_client=None, config=None)[source]

Extract content from all supported URL types in message_content.

Returns a (text_annotations, multimodal_parts, download_requests) tuple:

  • text_annotations is a string with all extracted text content concatenated (empty string if nothing was extracted).

  • multimodal_parts is a list of OpenRouter image_url / video_url content-part dicts for any detected media URLs.

  • download_requests is a list of dicts describing videos that need background downloading (consumed by the message processor to spawn asyncio.create_task calls).

Return type:

tuple[str, list[dict[str, Any]], list[dict[str, Any]]]

Parameters:
  • message_content (str)

  • user_id (str)

  • redis_client (Any)

  • config (Any)
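
The concurrent fan-out over text extractors, each shielded so one failure does not discard the others, can be sketched as follows. `safe_extract_sketch` and `gather_annotations_sketch` are hypothetical stand-ins for `_safe_extract()` and the text-annotation half of this function; joining annotations with newlines is an assumption.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List

logger = logging.getLogger(__name__)
Extractor = Callable[[str], Awaitable[List[str]]]


async def safe_extract_sketch(extractor: Extractor, text: str) -> List[str]:
    # One failing extractor should not sink the rest of the batch.
    try:
        return await extractor(text)
    except Exception:
        logger.exception("extractor %s failed", getattr(extractor, "__name__", extractor))
        return []


async def gather_annotations_sketch(text: str, extractors: List[Extractor]) -> str:
    results = await asyncio.gather(*(safe_extract_sketch(e, text) for e in extractors))
    # gather preserves input order, so annotations stay grouped per extractor.
    return "\n".join(line for batch in results for line in batch)
```

Catching exceptions inside the wrapper, rather than passing `return_exceptions=True` to `asyncio.gather`, keeps the result list homogeneous and puts the logging in one place.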