url_content_extractor
URL Content Extractor Module.
Scans message text for supported URL types and returns extracted content as system-injected annotations that can be appended to the LLM context.
Supported sources: Twitter/X, YouTube, GitHub repos/issues/PRs, arXiv papers, Reddit threads, Wikipedia articles, GitHub Gists, Bluesky posts, Stack Overflow/Exchange, NVD CVE entries, Spotify, SoundCloud, TikTok, Vimeo, and cryptocurrency price mentions.
All extracted user-generated content is wrapped with
wrap_untrusted_data() to prevent prompt injection.
- url_content_extractor.wrap_untrusted_data(content)[source]
Wrap untrusted content in unique security tags to prevent injection.
- async url_content_extractor.extract_stackoverflow_content(text)[source]
Extract stackoverflow content.
- async url_content_extractor.extract_paste_content(text)[source]
Extract content from paste-site URLs (Pastebin, Hastebin, etc.).
- async url_content_extractor.extract_image_urls(text)[source]
Find image URLs and download them as multimodal content parts.
Returns a list of OpenRouter
image_urlorvideo_urlcontent-part dicts. GIF images are re-encoded as MP4 so the Gemini API receives a well-supported video format.
- url_content_extractor.video_cache_lookup(url)[source]
Check the disk cache for a previously downloaded yt-dlp file(s).
Returns
(paths, metadata_dict)on hit,([], None)on miss.
- url_content_extractor.video_cache_store(url, video_src, metadata)[source]
Copy downloaded file(s) into the cache and write metadata JSON.
Returns the cached path(s).
- async url_content_extractor.get_ytdlp_video_metadata(url, cookies_text=None)[source]
Fetch video metadata via
yt-dlp --dump-json(no download).Returns a dict with title, channel, duration, extractor, etc. Returns
Noneon any failure.
- async url_content_extractor.download_ytdlp_video(url, cookies_text=None)[source]
Download via yt-dlp. Returns
(local_paths, error)on success.local_pathsis non-empty on success; may be multiple images when yt-dlp only produced image files.
- url_content_extractor.format_ytdlp_downloading_annotation(url, meta, *, kind='media')[source]
Build the annotation while a yt-dlp download is in progress.
kind is
"video","image", or"media"when unknown.
- url_content_extractor.format_video_downloading_annotation(url, meta)[source]
Backward compat: same as
format_ytdlp_downloading_annotation.
- url_content_extractor.format_ytdlp_ready_annotation(url, meta, *, kind='video')[source]
Annotation when cached yt-dlp media is ready. kind:
videoorimage.
- url_content_extractor.format_video_ready_annotation(url, meta)[source]
Backward compat: ready annotation for video.
- url_content_extractor.format_video_failed_annotation(url, meta)[source]
Build the text annotation for a video whose download failed.
- url_content_extractor.format_video_cookie_error_annotation(url)[source]
Build the text annotation when cookies are required.
- url_content_extractor.format_video_too_long_annotation(url, meta)[source]
Build the text annotation for a video that exceeds the duration limit.
- url_content_extractor.build_media_url_part_from_file(path)[source]
Build an OpenRouter
image_urlorvideo_urlpart from a file path.
- url_content_extractor.build_video_url_part(video_path)[source]
Prefer
build_media_url_part_from_file()(handles images correctly).
- url_content_extractor.ytdlp_paths_are_image_only(paths)[source]
True if all paths look like image files (yt-dlp image-only download).
- async url_content_extractor.extract_ytdlp_video_content(text, user_id='', redis_client=None, config=None)[source]
Extract yt-dlp supported video URLs and return context parts.
Returns
(text_annotations, multimodal_parts, download_requests).text_annotations: text strings to append to context.
multimodal_parts:
image_url/video_urlcontent-part dicts (cache hits only).download_requests: dicts
{"url": str, "metadata": dict, "cookies_text": str|None}for URLs that need background downloading.
- async url_content_extractor.extract_all_url_content(message_content, user_id='', redis_client=None, config=None)[source]
Extract content from all supported URL types in message_content.
Returns a
(text_annotations, multimodal_parts, download_requests)tuple:text_annotations is a string with all extracted text content concatenated (empty string if nothing was extracted).
multimodal_parts is a list of OpenRouter
image_url/video_urlcontent-part dicts for any detected media URLs.download_requests is a list of dicts describing videos that need background downloading (consumed by the message processor to spawn
asyncio.create_taskcalls).