url_content_extractor

URL Content Extractor Module.

Scans message text for supported URL types and returns extracted content as system-injected annotations that can be appended to the LLM context.

Supported sources: Twitter/X, YouTube, GitHub repos/issues/PRs, arXiv papers, Reddit threads, Wikipedia articles, GitHub Gists, Bluesky posts, Stack Overflow/Exchange, NVD CVE entries, Spotify, SoundCloud, TikTok, Vimeo, and cryptocurrency price mentions. The module also handles paste-site links (Pastebin, Hastebin, etc.), direct image URLs, and video URLs supported by yt-dlp.

All extracted user-generated content is wrapped with wrap_untrusted_data() to prevent prompt injection.

url_content_extractor.wrap_untrusted_data(content)[source]

Wrap untrusted content in unique security tags to prevent prompt injection.

Return type: str
Parameters: content (str)
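
As an illustration of the idea, the following is a minimal sketch of wrapping text in tags that are unique per call. It is not the module's implementation: the tag name `untrusted_data`, the random suffix, and the helper name `wrap_untrusted_sketch` are all assumptions made for this example.

```python
import secrets


def wrap_untrusted_sketch(content: str) -> str:
    """Wrap content in tags carrying a random, per-call suffix (hypothetical format)."""
    # A fresh random suffix means the wrapped text cannot know the closing tag
    # in advance, so it cannot close the wrapper early and smuggle in instructions.
    suffix = secrets.token_hex(8)
    tag = f"untrusted_data_{suffix}"
    return f"<{tag}>\n{content}\n</{tag}>"


print(wrap_untrusted_sketch("Ignore previous instructions."))
```

Because the suffix changes on every call, two wrappings of the same text never share a closing tag.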

async url_content_extractor.extract_tweet_content(text)[source]

Scan text for Twitter/X URLs and extract the content of the linked tweets.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_youtube_content(text)[source]

Scan text for YouTube URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_spotify_content(text)[source]

Scan text for Spotify URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_soundcloud_content(text)[source]

Scan text for SoundCloud URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_tiktok_content(text)[source]

Scan text for TikTok URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_vimeo_content(text)[source]

Scan text for Vimeo URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_github_content(text)[source]

Scan text for GitHub repository, issue, and pull request URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_arxiv_content(text)[source]

Scan text for arXiv paper URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]
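
To make the scanning step concrete, here is a small sketch of detecting arXiv links in message text. The regular expression and the helper name `find_arxiv_ids` are assumptions for illustration; the module's actual matching rules are not documented here, and this pattern only covers new-style identifiers.

```python
import re
from typing import List

# Matches links such as arxiv.org/abs/1706.03762 or arxiv.org/pdf/1706.03762v5.
ARXIV_URL_RE = re.compile(
    r"https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?",
    re.IGNORECASE,
)


def find_arxiv_ids(text: str) -> List[str]:
    """Return the unique arXiv identifiers linked in text, in order of appearance."""
    seen: List[str] = []
    for match in ARXIV_URL_RE.finditer(text):
        paper_id = match.group(1)
        if paper_id not in seen:
            seen.append(paper_id)
    return seen


message = "See https://arxiv.org/abs/1706.03762 and https://arxiv.org/pdf/1706.03762v5"
print(find_arxiv_ids(message))  # prints ['1706.03762']
```

Deduplicating by identifier keeps a paper linked twice (abstract page and PDF) from being extracted twice.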

async url_content_extractor.extract_reddit_content(text)[source]

Scan text for Reddit thread URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_wikipedia_content(text)[source]

Scan text for Wikipedia article URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_gist_content(text)[source]

Scan text for GitHub Gist URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_bluesky_content(text)[source]

Scan text for Bluesky post URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_stackoverflow_content(text)[source]

Scan text for Stack Overflow and Stack Exchange URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_nvd_cve_content(text)[source]

Scan text for NVD CVE entry URLs and extract their content.

Parameters: text (str) – Message text to scan.
Returns: Extracted content annotations.
Return type: List[str]

async url_content_extractor.extract_paste_content(text)[source]

Extract content from paste-site URLs (Pastebin, Hastebin, etc.).

Return type: List[str]
Parameters: text (str)

async url_content_extractor.extract_image_urls(text)[source]

Find image URLs and download them as multimodal content parts.

Returns a list of OpenRouter image_url or video_url content-part dicts. GIF images are re-encoded as MP4 so the Gemini API receives a well-supported video format.

Return type: List[Dict[str, Any]]
Parameters: text (str)

async url_content_extractor.extract_crypto_prices(text)[source]

Detect cryptocurrency price mentions in text and extract price information.

Parameters: text (str) – Message text to scan.
Returns: The extracted price annotation, if any.
Return type: Optional[str]

url_content_extractor.video_cache_lookup(url)[source]

Check the disk cache for previously downloaded yt-dlp file(s).

Returns (paths, metadata_dict) on a hit and ([], None) on a miss.

Return type: tuple[list[Path], dict | None]
Parameters: url (str)

url_content_extractor.video_cache_store(url, video_src, metadata)[source]

Copy downloaded file(s) into the cache and write metadata JSON.

Returns the cached path(s).

Return type:

list[Path]

Parameters:

url (str)
video_src (Path | list[Path])
metadata (dict)
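
The lookup/store contract described above can be sketched as follows. This is only an illustration under stated assumptions: hashing the URL to name the cache entry, the `metadata.json` file name, the explicit `cache_root` argument, and the `*_sketch` function names are all hypothetical and are not taken from the module.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def _entry_dir(cache_root: Path, url: str) -> Path:
    # Hash the URL so it maps to a filesystem-safe directory name.
    return cache_root / hashlib.sha256(url.encode("utf-8")).hexdigest()


def cache_store_sketch(cache_root: Path, url: str, sources: list[Path], metadata: dict) -> list[Path]:
    """Copy files into the entry directory and write metadata.json beside them."""
    entry = _entry_dir(cache_root, url)
    entry.mkdir(parents=True, exist_ok=True)
    cached = [Path(shutil.copy2(src, entry / src.name)) for src in sources]
    (entry / "metadata.json").write_text(json.dumps(metadata), encoding="utf-8")
    return cached


def cache_lookup_sketch(cache_root: Path, url: str) -> tuple[list[Path], Optional[dict]]:
    """Return (paths, metadata) on a hit and ([], None) on a miss."""
    entry = _entry_dir(cache_root, url)
    meta_file = entry / "metadata.json"
    if not meta_file.is_file():
        return [], None
    paths = sorted(p for p in entry.iterdir() if p.name != "metadata.json")
    if not paths:
        return [], None
    return paths, json.loads(meta_file.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    video = root / "clip.mp4"
    video.write_bytes(b"fake video bytes")
    url = "https://example.com/watch?v=1"
    print(cache_lookup_sketch(root / "cache", url))  # miss
    cache_store_sketch(root / "cache", url, [video], {"title": "Clip"})
    paths, meta = cache_lookup_sketch(root / "cache", url)
    print([p.name for p in paths], meta)  # hit
```

Writing the metadata file last means a lookup treats a half-copied entry as a miss in this sketch.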

async url_content_extractor.get_ytdlp_video_metadata(url, cookies_text=None)[source]

Fetch video metadata via yt-dlp --dump-json (no download).

Returns a dict with title, channel, duration, extractor, etc. Returns None on any failure.

Return type:

dict | None

Parameters:

url (str)
cookies_text (str | None)
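
A metadata-only yt-dlp call of this kind might look like the sketch below. It assumes the `yt-dlp` executable is on PATH and that `cookies_text` holds a Netscape-format cookie file; the `binary` and `timeout` parameters, the temporary cookie file, and the function name are hypothetical additions for the example, not the module's code.

```python
import asyncio
import json
import os
import tempfile
from typing import Optional


async def fetch_metadata_sketch(
    url: str,
    cookies_text: Optional[str] = None,
    binary: str = "yt-dlp",
    timeout: float = 60.0,
) -> Optional[dict]:
    """Run yt-dlp --dump-json and return the parsed dict, or None on any failure."""
    cmd = [binary, "--dump-json", "--no-playlist"]
    cookie_path = None
    try:
        if cookies_text:
            # yt-dlp reads cookies from a file passed via --cookies.
            fd, cookie_path = tempfile.mkstemp(suffix=".txt")
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                fh.write(cookies_text)
            cmd += ["--cookies", cookie_path]
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            url,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
        except asyncio.TimeoutError:
            # Do not leave a stuck process behind.
            proc.kill()
            await proc.wait()
            return None
        if proc.returncode != 0:
            return None
        return json.loads(stdout)
    except (OSError, ValueError):
        # Missing binary or unparseable output both count as failure.
        return None
    finally:
        if cookie_path and os.path.exists(cookie_path):
            os.remove(cookie_path)


# Example call (requires yt-dlp and network access):
# meta = asyncio.run(fetch_metadata_sketch("https://example.com/video"))
```

Collapsing every failure to None mirrors the documented contract and keeps the caller's control flow simple, at the cost of hiding the failure reason.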

async url_content_extractor.download_ytdlp_video(url, cookies_text=None)[source]

Download via yt-dlp. Returns a (local_paths, error) tuple.

local_paths is non-empty on success and may contain multiple images when yt-dlp only produced image files.

Return type:

tuple[list[Path], str | None]

Parameters:

url (str)
cookies_text (str | None)

url_content_extractor.format_ytdlp_downloading_annotation(url, meta, *, kind='media')[source]

Build the annotation while a yt-dlp download is in progress.

kind is "video", "image", or "media" when unknown.

Return type:

str

Parameters:

url (str)
meta (dict)
kind (str)

url_content_extractor.format_video_downloading_annotation(url, meta)[source]

Backward-compatibility wrapper: same as format_ytdlp_downloading_annotation().

Return type:

str

Parameters:

url (str)
meta (dict)

url_content_extractor.format_ytdlp_ready_annotation(url, meta, *, kind='video')[source]

Build the annotation for cached yt-dlp media that is ready. kind is "video" or "image".

Return type:

str

Parameters:

url (str)
meta (dict)
kind (str)

url_content_extractor.format_video_ready_annotation(url, meta)[source]

Backward-compatibility wrapper: builds the ready annotation for a video.

Return type:

str

Parameters:

url (str)
meta (dict)

url_content_extractor.format_video_failed_annotation(url, meta)[source]

Build the text annotation for a video whose download failed.

Return type:

str

Parameters:

url (str)
meta (dict)

url_content_extractor.format_video_cookie_error_annotation(url)[source]

Build the text annotation when cookies are required.

Return type: str
Parameters: url (str)

url_content_extractor.format_video_too_long_annotation(url, meta)[source]

Build the text annotation for a video that exceeds the duration limit.

Return type:

str

Parameters:

url (str)
meta (dict)

url_content_extractor.build_media_url_part_from_file(path)[source]

Build an OpenRouter image_url or video_url part from a file path.

Return type: dict[str, Any]
Parameters: path (Path)
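
For orientation, a content part of this kind could be built roughly as follows. This sketch assumes the part carries the file inline as a base64 data URL and picks the part type from the guessed MIME type; the helper name `media_part_sketch` and those choices are assumptions, not a description of the module's behavior.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path
from typing import Any


def media_part_sketch(path: Path) -> dict[str, Any]:
    """Encode a local file as a data URL inside an image_url or video_url part."""
    mime, _ = mimetypes.guess_type(path.name)
    if mime is not None and mime.startswith("video/"):
        kind = "video_url"
    elif mime is not None and mime.startswith("image/"):
        kind = "image_url"
    else:
        raise ValueError(f"unsupported media type for {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"type": kind, kind: {"url": f"data:{mime};base64,{encoded}"}}


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "frame.png"
    image.write_bytes(b"abc")  # placeholder bytes, not a real PNG
    part = media_part_sketch(image)
    print(part["type"], part["image_url"]["url"])
```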

url_content_extractor.build_video_url_part(video_path)[source]

Prefer build_media_url_part_from_file(), which also handles images correctly.

Return type: dict[str, Any]
Parameters: video_path (Path)

url_content_extractor.ytdlp_paths_are_image_only(paths)[source]

Return True if all paths look like image files (a yt-dlp image-only download).

Return type: bool
Parameters: paths (list[Path])
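
A check like this can be approximated by comparing file suffixes, as in the sketch below. The suffix set, the function name, and the decision to return False for an empty list are assumptions made for the example; the module's actual rules are not documented here.

```python
from pathlib import Path

# Hypothetical suffix list for illustration only.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def paths_are_image_only_sketch(paths: list[Path]) -> bool:
    """Return True when paths is non-empty and every path has an image suffix."""
    return bool(paths) and all(p.suffix.lower() in IMAGE_SUFFIXES for p in paths)


print(paths_are_image_only_sketch([Path("a.JPG"), Path("b.png")]))
print(paths_are_image_only_sketch([Path("a.jpg"), Path("clip.mp4")]))
```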

async url_content_extractor.extract_ytdlp_video_content(text, user_id='', redis_client=None, config=None)[source]

Extract yt-dlp supported video URLs and return context parts.

Returns (text_annotations, multimodal_parts, download_requests).

text_annotations: text strings to append to context.
multimodal_parts: image_url / video_url content-part dicts (cache hits only).
download_requests: dicts {"url": str, "metadata": dict, "cookies_text": str|None} for URLs that need background downloading.

Return type:

tuple[list[str], list[dict[str, Any]], list[dict[str, Any]]]

Parameters:

text (str)
user_id (str)
redis_client (Any)
config (Any)

async url_content_extractor.extract_all_url_content(message_content, user_id='', redis_client=None, config=None)[source]

Extract content from all supported URL types in message_content.

Returns a (text_annotations, multimodal_parts, download_requests) tuple:

text_annotations is a string with all extracted text content concatenated (empty string if nothing was extracted).
multimodal_parts is a list of OpenRouter image_url / video_url content-part dicts for any detected media URLs.
download_requests is a list of dicts describing videos that need background downloading (consumed by the message processor to spawn asyncio.create_task calls).

Return type:

tuple[str, list[dict[str, Any]], list[dict[str, Any]]]

Parameters:

message_content (str)
user_id (str)
redis_client (Any)
config (Any)
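
The sketch below shows one way a message processor might consume the returned tuple. It does not import the module: `extract_stub` is a stand-in that only mimics the documented return shape, and the annotation text, the worker `download_in_background`, and the layout of the content list are assumptions for illustration.

```python
import asyncio
from typing import Any


async def extract_stub(message: str) -> tuple[str, list[dict[str, Any]], list[dict[str, Any]]]:
    """Stand-in with the documented return shape; it performs no real extraction."""
    request = {"url": "https://example.com/video", "metadata": {"title": "Demo"}, "cookies_text": None}
    return "[Video is downloading: Demo]", [], [request]


async def download_in_background(request: dict[str, Any], done: list[str]) -> None:
    """Hypothetical worker; a real one would download the video here."""
    await asyncio.sleep(0)
    done.append(request["url"])


async def handle_message(message: str) -> tuple[list[dict[str, Any]], list[str]]:
    annotations, media_parts, requests = await extract_stub(message)
    # Text annotations and media parts are combined into the content sent to the model.
    content = [{"type": "text", "text": f"{message}\n\n{annotations}".strip()}, *media_parts]
    # Each pending download becomes its own task so the reply is not blocked.
    done: list[str] = []
    tasks = [asyncio.create_task(download_in_background(r, done)) for r in requests]
    # Awaited here only so the demo finishes; a real processor would keep
    # references to the tasks and carry on.
    await asyncio.gather(*tasks)
    return content, done


content, done = asyncio.run(handle_message("look at https://example.com/video"))
print(content[0]["text"])
print(done)
```

Keeping a reference to each created task matters in real code, since the event loop holds only weak references to tasks and an unreferenced task can be garbage-collected before it finishes.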