url_content_extractor
URL Content Extractor Module.
Scans message text for supported URL types and returns extracted content as system-injected annotations that can be appended to the LLM context.
Supported sources: Twitter/X, YouTube, GitHub repos/issues/PRs, arXiv papers, Reddit threads, Wikipedia articles, GitHub Gists, Bluesky posts, Stack Overflow/Exchange, NVD CVE entries, Spotify, SoundCloud, TikTok, Vimeo, paste sites (Pastebin/Hastebin/etc.), and cryptocurrency price mentions.
Direct image URLs are downloaded as multimodal content parts, and any yt-dlp-supported video/image URL is fetched (with disk caching, SSRF guarding, and background downloads) and surfaced as both text annotations and multimodal parts.
All extracted user-generated content is wrapped with
wrap_untrusted_data() to prevent prompt injection.
- url_content_extractor.pre_flight_ssrf_check(url)[source]
Validate that a URL does not resolve to a private or link-local address.
SSRF guard run before any yt-dlp subprocess touches a user-supplied URL. It resolves the hostname via
socket.getaddrinfo()and rejects the request if any resolved address falls inside_SSRF_BLOCKLIST(loopback, RFC 1918 private ranges, the169.254.169.254cloud-metadata endpoint, and the IPv6 equivalents), so an attacker cannot coax the bot into fetching internal services. Emitsdns_resolution_error/ssrf_attempt_blockedcounters throughobservability.Called by
get_ytdlp_video_metadata()anddownload_ytdlp_video()before spawning yt-dlp, and exercised directly bytests/test_ytdlp_security.py.- Parameters:
url (
str) – The target URL (or bare host) to validate.- Raises:
ValueError – If the URL is empty/unparseable or DNS resolution fails.
PermissionError – If any resolved address is on the SSRF blocklist.
- Return type:
- url_content_extractor.secure_in_memory_cookie_file(cookie_data)[source]
Write yt-dlp cookies to a transient RAM-disk file, deleting it on exit.
Context manager that materializes secret cookie data only as long as a yt-dlp subprocess needs it on disk. It creates a
0600temp file under/dev/shmwhen available (so the bytes never hit persistent storage), yields its path, and unconditionally unlinks it in thefinallyblock even if the body raises. Touches the filesystem viatempfileandos.Called by
get_ytdlp_video_metadata()anddownload_ytdlp_video()to pass user-supplied cookies to yt-dlp via--cookies, and exercised bytests/test_ytdlp_security.py.
- url_content_extractor.wrap_untrusted_data(content)[source]
Wrap untrusted content in unique random tags to neutralize prompt injection.
Surrounds externally fetched, user-generated text (tweets, READMEs, paste bodies, etc.) with an unguessable
UNTRUSTED_DATA_<uuid>_...open/close tag pair so the LLM treats the enclosed span as inert data rather than instructions. The per-call random UUID (viauuid.uuid4()) prevents an attacker from forging a matching closing tag to break out of the wrapper.Called pervasively by every
extract_*_contentextractor and theformat_*_annotationbuilders in this module before any remote text is folded into the model context. No external callers were found.
- async url_content_extractor.extract_tweet_content(text)[source]
Build system annotations for any Twitter/X URLs in the message text.
Scans whitespace-split words for tweet URLs (via
is_tweet_url()) and, for each, fetches the tweet throughget_tweet_content()(an HTTP/API call inurl_utils). It summarizes the author, body, media count, and thread status into a bracketed[System auto-extracted tweet ...]line, fencing the tweet body throughwrap_untrusted_data()so the remote text cannot inject instructions.Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_youtube_content(text)[source]
Build system annotations for any YouTube URLs in the message text.
Scans whitespace-split words for YouTube links (via
is_youtube_url()) and fetches lightweight metadata throughget_youtube_content()(an HTTP/oEmbed call inurl_utils), emitting a bracketed[System auto-extracted YouTube video ...]line with the title and channel and a Shorts marker when applicable. The title is fenced throughwrap_untrusted_data(). This is the cheap text-only path; the heavier transcript/video download is handled separately byextract_ytdlp_video_content().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_spotify_content(text)[source]
Build system annotations for any Spotify URLs in the message text.
Scans whitespace-split words for Spotify links (via
is_spotify_url()) and resolves each throughget_spotify_content()(an HTTP/oEmbed call inurl_utils), emitting a bracketed[System auto-extracted Spotify <type> ...]line labelled with the capitalized resource type (track, album, playlist, etc.). The title is fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_soundcloud_content(text)[source]
Build system annotations for any SoundCloud URLs in the message text.
Scans whitespace-split words for SoundCloud links (via
is_soundcloud_url()) and resolves each throughget_soundcloud_content()(an HTTP/oEmbed call inurl_utils), emitting a bracketed[System auto-extracted SoundCloud track ...]line with title, author, and a truncated (150-char) description when present. Title and description are each fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_tiktok_content(text)[source]
Build system annotations for any TikTok URLs in the message text.
Scans whitespace-split words for TikTok links (via
is_tiktok_url()) and resolves each throughget_tiktok_content()(an HTTP/oEmbed call inurl_utils), emitting a bracketed[System auto-extracted TikTok video ...]line with the author and a title truncated to 200 chars. The title is fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_vimeo_content(text)[source]
Build system annotations for any Vimeo URLs in the message text.
Scans whitespace-split words for Vimeo links (via
is_vimeo_url()) and resolves each throughget_vimeo_content()(an HTTP/oEmbed call inurl_utils), emitting a bracketed[System auto-extracted Vimeo video ...]line with title, author, and aM:SSduration when known. The title is fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_github_content(text)[source]
Build system annotations for any GitHub repo/issue/PR URLs in the text.
Scans whitespace-split words for GitHub links (via
is_github_url()) and resolves each throughget_github_content()(a GitHub HTTP API call inurl_utils). For repositories it emits owner/repo, description, language, star count, and a README preview; for issues and pull requests it emits the number, state, title, and a body preview. Every remote string (description, README, title, body) is fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_arxiv_content(text)[source]
Build system annotations for any arXiv URLs in the message text.
Scans whitespace-split words for arXiv links (via
is_arxiv_url()) and resolves each throughget_arxiv_content()(an arXiv HTTP API call inurl_utils), emitting a bracketed[System auto-extracted arXiv ...]line with the paper id, title, author list (collapsed toet al.past three), category, and abstract. Title and abstract are each fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_reddit_content(text)[source]
Build system annotations for any Reddit thread URLs in the text.
Scans whitespace-split words for Reddit links (via
is_reddit_url()) and resolves each throughget_reddit_content()(a Reddit HTTP/JSON call inurl_utils), emitting a bracketed[System auto-extracted Reddit ...]line with the subreddit, post title, a truncated OP selftext, and the top comments. Title, selftext, and every comment body are each fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_wikipedia_content(text)[source]
Build system annotations for any Wikipedia article URLs in the text.
Scans whitespace-split words for Wikipedia links (via
is_wikipedia_url()) and resolves each throughget_wikipedia_content()(a Wikipedia REST/summary HTTP call inurl_utils), emitting a bracketed[System auto-extracted Wikipedia ...]line carrying the language code, page title, and lead extract. Title and extract are each fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_gist_content(text)[source]
Build system annotations for any GitHub Gist URLs in the text.
Scans whitespace-split words for Gist links (via
is_gist_url()) and resolves each throughget_gist_content()(a GitHub Gist HTTP API call inurl_utils), emitting a bracketed[System auto-extracted GitHub Gist ...]line with the owner, description, and the full content of each file (name, language, body). The description and every file body are fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_bluesky_content(text)[source]
Build system annotations for any Bluesky post URLs in the text.
Scans whitespace-split words for Bluesky links (via
is_bluesky_url()) and resolves each throughget_bluesky_content()(an AT Protocol HTTP call inurl_utils), emitting a bracketed[System auto-extracted Bluesky post ...]line with the author, post text, engagement counts (likes/reposts/replies), and a media marker. The post text is fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_stackoverflow_content(text)[source]
Build system annotations for any Stack Exchange URLs in the text.
Scans whitespace-split words for Stack Overflow / Stack Exchange links (via
is_stackoverflow_url()) and resolves each throughget_stackoverflow_content()(a Stack Exchange HTTP API call inurl_utils), emitting a bracketed[System auto-extracted Stack Exchange ...]line with the site, question title, score, answer count, tags, the question body, and the accepted/top answer. Title, question body, and answer body are each fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_nvd_cve_content(text)[source]
Build system annotations for any NVD CVE URLs in the text.
Scans whitespace-split words for NVD CVE links (via
is_nvd_cve_url()) and resolves each throughget_nvd_cve_content()(an NVD HTTP API call inurl_utils), emitting a bracketed[System auto-extracted NVD CVE ...]line with the CVE id, CVSS v3 (or v2 fallback) score and severity, publish date, CWE ids, description, affected products, and the first few references. The description is fenced throughwrap_untrusted_data().Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_paste_content(text)[source]
Build system annotations for any paste-site URLs in the text.
Scans whitespace-split words for paste links such as Pastebin or Hastebin (via
is_paste_url()) and fetches the raw paste throughget_paste_content()(an HTTP call inurl_utils), emitting a bracketed[System auto-injected paste content ...]line tagged with the site and paste id. The fetched paste body is fenced throughwrap_untrusted_data()since it is fully attacker-controlled.Dispatched as one of the parallel
text_extractorsgathered byextract_all_url_content()(wrapped in_safe_extract()). No other callers were found.
- async url_content_extractor.extract_image_urls(text)[source]
Find direct image URLs in text and download them as multimodal parts.
Matches image links against
_IMAGE_URL_RE(Discord CDN attachments, Imgur, and bare.png/.jpg/.gif/etc. URLs), de-duplicates them, and fetches each over HTTP viadownload_image_url(). Each download is normalized throughplatforms.media_common: GIFs are re-encoded to MP4 viamaybe_reencode_gif()so the Gemini API receives a well-supported video format, and still images have their MIME type reconciled against the actual bytes viareconcile_image_mimetype(). The result is base64data:URLs wrapped as OpenRouter content parts; each download is logged at info level.Run (inside
_images_safe()) as part of the concurrent gather inextract_all_url_content(), and exercised directly bytests/core/test_image_url_extraction.py.
- async url_content_extractor.extract_crypto_prices(text)[source]
Build a system annotation of live prices for crypto symbols in the text.
Detects cryptocurrency ticker mentions via
detect_crypto_mentions(), and if any are found fetches their live quotes throughget_crypto_prices()(a Kraken HTTP API call inurl_utils). It formats name, symbol, price, 24h change (with an up/down arrow), and 24h range into a single[System auto-injected cryptocurrency prices ...]block. ReturnsNonewhen nothing is mentioned or the fetch yields no prices.Run (inside
_crypto_safe()) as part of the concurrent gather inextract_all_url_content(). No other callers were found.
- url_content_extractor.video_cache_lookup(url)[source]
Look up a URL in the disk media cache and return its files and metadata.
Reads the
{key}.jsonsidecar in_VIDEO_CACHE_DIR(key from_video_cache_key()), treating a missing/corrupt sidecar or a TTL expiry past_VIDEO_CACHE_TTLas a miss (evicting expired entries via_evict_cache_entry()). It validates every referenced filename through_safe_video_cache_child()and confirms it exists; on a full hit it refreshes the entry’scached_at(LRU touch) and rewrites the sidecar. Touches the filesystem only.Called by
extract_ytdlp_video_content()(viaasyncio.to_thread()) to serve cached media without re-downloading, and exercised bytests/test_ytdlp_media_parts.py.
- url_content_extractor.video_cache_store(url, video_src, metadata)[source]
Copy freshly downloaded media into the disk cache and write its sidecar.
Persists yt-dlp output for later reuse: it copies each source file into
_VIDEO_CACHE_DIRunder the URL’s_video_cache_key()(single files keep their extension; multiple files are suffixed_0,_1, …), classifies the entry asimageorvideovia_is_image_path(), and writes a{key}.jsonsidecar with the supplied metadata plus filenames, media kind, total size, andcached_at. Then it calls_enforce_cache_limits()to apply TTL and the size cap. Touches the filesystem (copies and JSON writes) viashutil.Called by
message_processor.video_history_patchafter a background download completes, and exercised bytests/test_ytdlp_media_parts.py.- Parameters:
- Returns:
The cached destination path(s), in source order.
- Return type:
- Raises:
ValueError – If video_src resolves to an empty list of sources.
- async url_content_extractor.get_ytdlp_video_metadata(url, cookies_text=None)[source]
Fetch lightweight media metadata via
yt-dlp --dump-json(no download).Probes a media URL for title, channel, duration, and extractor without pulling the actual file, so callers can decide whether to download. It first runs
pre_flight_ssrf_check()to block private/metadata hosts, then delegates to the nested_runhelper which spawns yt-dlp. Cookie auth is resolved here: user-supplied cookies_text is staged on a RAM disk viasecure_in_memory_cookie_file(), otherwise the default_DEFAULT_COOKIES_PATHis used when present.Called by
extract_ytdlp_video_content()on a cache miss to gate download decisions, and exercised bytests/test_ytdlp_security.py.- Parameters:
- Returns:
A
title/channel/duration/extractor/urldict on success, a{"_cookie_error": True, ...}sentinel when auth is required, orNoneon any failure.- Return type:
- Raises:
ValueError | PermissionError – Propagated from
pre_flight_ssrf_check()for blocked or unresolvable hosts.
- async url_content_extractor.download_ytdlp_video(url, cookies_text=None)[source]
Download a media URL with yt-dlp into a temp dir and return its file(s).
The heavy companion to
get_ytdlp_video_metadata(): it actually pulls the media (capped at 720p,_MAX_VIDEO_FILESIZE) for caching and multimodal use. It runspre_flight_ssrf_check(), verifiesyt-dlpis installed viashutil.which(), creates a temp working dir, and delegates to the nested_runhelper (which spawns yt-dlp and resolves outputs via_resolve_ytdlp_download_paths()). Cookie auth mirrors the metadata path: user cookies_text is staged on a RAM disk viasecure_in_memory_cookie_file(), else the default_DEFAULT_COOKIES_PATHis used when present. Touches the filesystem (temp dir + downloaded files).Called by
message_processor.video_history_patchto perform the background download, and exercised bytests/test_ytdlp_security.py.- Parameters:
- Returns:
(paths, None)on success (paths may be multiple images for image-only extractions), or([], error)on failure (errormay be the literal"cookie_error"sentinel).- Return type:
- Raises:
ValueError | PermissionError – Propagated from
pre_flight_ssrf_check()for blocked or unresolvable hosts.
- url_content_extractor.format_ytdlp_downloading_annotation(url, meta, *, kind='media')[source]
Build the context annotation shown while a yt-dlp download is in flight.
Produces a bracketed
[System auto-extracted metadata ...]line carrying the title, channel, duration (formatted via_format_duration()), and platform, plus a note telling the model the actual media is still downloading in the background and only metadata is available for now. The kind keyword adjusts the wording for video, image, or unknownmedia. The title is fenced throughwrap_untrusted_data().Called by
extract_ytdlp_video_content()(and by the back-compat shimformat_video_downloading_annotation()) when a download is queued. No external callers were found.- Parameters:
- Returns:
The formatted “downloading” annotation block.
- Return type:
- url_content_extractor.format_video_downloading_annotation(url, meta)[source]
Backward-compatible alias for
format_ytdlp_downloading_annotation().Thin shim kept for older call sites that predate the
kindkeyword; it forwards toformat_ytdlp_downloading_annotation()withkind="media". No external callers were found.
- url_content_extractor.format_ytdlp_ready_annotation(url, meta, *, kind='video')[source]
Build the context annotation for cached yt-dlp media that is ready to use.
Produces a bracketed
[System auto-extracted <video|image> ...]line with the title, channel, duration (via_format_duration()), and platform, used when the media bytes are already in context (cache hit or just-finished download) so no “still downloading” caveat is needed. The kind keyword selects the video vs image label. The title is fenced throughwrap_untrusted_data().Called by
extract_ytdlp_video_content()on a cache hit, bymessage_processor.video_history_patchonce a background download lands, and by the back-compat shimformat_video_ready_annotation().
- url_content_extractor.format_video_ready_annotation(url, meta)[source]
Backward-compatible alias for
format_ytdlp_ready_annotation().Thin shim kept for older call sites; forwards to
format_ytdlp_ready_annotation()withkind="video". No external callers were found.
- url_content_extractor.format_video_failed_annotation(url, meta)[source]
Build the context annotation for a video whose download failed.
Produces a bracketed
[System auto-extracted video metadata ...]line with title, channel, duration (via_format_duration()), and platform, followed by a note that the download failed and only metadata is available, so the model can answer about the video without claiming to have watched it. The title is fenced throughwrap_untrusted_data().Called by
message_processor.video_history_patchwhen a background download errors out. No external callers were found.
- url_content_extractor.format_video_cookie_error_annotation(url)[source]
Build the context annotation when a video needs cookie authentication.
Produces a bracketed
[System note ...]line telling the model the URL is gated (age-restricted, private, members-only, etc.) and that someone must supply acookies.txtvia theset_user_api_keyflow with serviceyt_dlp_cookies, so the assistant can guide the user instead of silently failing. Unlike the other builders this takes no metadata and does not wrap anything, since the URL is the only interpolated value.Called by
extract_ytdlp_video_content()when metadata extraction returns the cookie-error sentinel. No external callers were found.
- url_content_extractor.format_video_too_long_annotation(url, meta)[source]
Build the context annotation for a video over the duration limit.
Produces a bracketed
[System auto-extracted video metadata ...]line with title, channel, duration (via_format_duration()), and platform, noting that the video exceeds the_MAX_VIDEO_DURATIONcap so only metadata (no download) is available. The title is fenced throughwrap_untrusted_data().Called by
extract_ytdlp_video_content()when the probed duration exceeds the cap. No external callers were found.
- url_content_extractor.build_media_url_part_from_file(path)[source]
Read a media file and build an OpenRouter content part from its bytes.
Reads the file, guesses its MIME type via
mimetypes, and (for images) reconciles that type against the actual bytes withreconcile_image_mimetype_sync()fromplatforms.media_common. It base64-encodes the bytes into adata:URL and wraps it as animage_urlpart for images or avideo_urlpart otherwise, defaulting unknown-but-known-media-suffix files tovideo/mp4. Reads the filesystem; being synchronous, callers typically dispatch it viaasyncio.to_thread().Called by
extract_ytdlp_video_content()andmessage_processor.video_history_patchto turn cached media into model input, by the aliasbuild_video_url_part(), and exercised bytests/test_ytdlp_media_parts.pyandtests/test_media_image_mime_reconcile.py.
- url_content_extractor.build_video_url_part(video_path)[source]
Legacy alias for
build_media_url_part_from_file().Retained for older call sites that assumed the input was always a video; it simply forwards to
build_media_url_part_from_file(), which also classifies images correctly. Prefer that function directly in new code. No external callers were found.
- url_content_extractor.ytdlp_paths_are_image_only(paths)[source]
Report whether a yt-dlp download produced only image files.
Returns
Trueonly for a non-empty list whose every entry is an image by suffix (via_is_image_path()), distinguishing an image-only extraction (e.g. a gallery) from a video download so the rightytdlp_media_kindand annotation label can be chosen.Called by
extract_ytdlp_video_content()andmessage_processor.video_history_patchto derive the media kind. No external callers were found.
- async url_content_extractor.extract_ytdlp_video_content(text, user_id='', redis_client=None, config=None)[source]
Extract yt-dlp supported video URLs and return context parts.
Returns
(text_annotations, multimodal_parts, download_requests).text_annotations: text strings to append to context.
multimodal_parts:
image_url/video_urlcontent-part dicts (cache hits only).download_requests: dicts
{"url": str, "metadata": dict, "cookies_text": str|None}for URLs that need background downloading.
- async url_content_extractor.extract_all_url_content(message_content, user_id='', redis_client=None, config=None)[source]
Extract content from all supported URL types in message_content.
Returns a
(text_annotations, multimodal_parts, download_requests)tuple:text_annotations is a string with all extracted text content concatenated (empty string if nothing was extracted).
multimodal_parts is a list of OpenRouter
image_url/video_urlcontent-part dicts for any detected media URLs.download_requests is a list of dicts describing videos that need background downloading (consumed by the message processor to spawn
asyncio.create_taskcalls).