media_cache

Disk-backed LRU media cache.

Caches downloaded media (images, audio, video, files) so that a URL already held in the cache is not fetched again. An in-memory index provides fast lookups, while a configurable disk directory persists data across restarts.

Each cached entry is stored on disk as two files:

{sha256_of_url}.dat   – raw media bytes
{sha256_of_url}.json  – sidecar metadata (mimetype, filename, url, ts, size)
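As a rough illustration of the naming scheme above, the two paths for a URL could be derived as follows. This is only a sketch: the helper name `entry_paths` is hypothetical, and the use of a hex digest over the UTF-8 encoded URL is an assumption, since the source states only that the key is the SHA-256 of the URL.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def entry_paths(cache_dir: str | Path, url: str) -> tuple[Path, Path]:
    """Return the (.dat, .json) paths for a URL (hypothetical helper)."""
    # Assumption: the key is the hex SHA-256 digest of the UTF-8 encoded URL.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    root = Path(cache_dir)
    return root / f"{key}.dat", root / f"{key}.json"
```

Hashing the URL gives a fixed-length, filesystem-safe name regardless of what characters the URL contains.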

On startup the disk directory is scanned to rebuild the in-memory index without loading all bytes into RAM.
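A startup scan of this kind might look like the sketch below, which reads only the small JSON sidecars and never opens the .dat files. The function name `scan_index`, the policy of skipping orphaned or unreadable sidecars, and the ordering by the `ts` field are all assumptions made for illustration.

```python
from __future__ import annotations

import json
from collections import OrderedDict
from pathlib import Path


def scan_index(cache_dir: str | Path) -> OrderedDict[str, dict]:
    """Rebuild a URL -> metadata index from sidecar files (illustrative)."""
    entries = []
    for sidecar in Path(cache_dir).glob("*.json"):
        # Assumed policy: ignore a sidecar whose .dat payload is missing.
        if not sidecar.with_suffix(".dat").exists():
            continue
        try:
            meta = json.loads(sidecar.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable sidecar: skip rather than fail startup
        if "url" not in meta:
            continue
        meta["key"] = sidecar.stem  # remember the on-disk name
        entries.append(meta)
    # Oldest timestamp first, so the newest entries sit at the LRU "hot" end.
    entries.sort(key=lambda m: m.get("ts", 0))
    return OrderedDict((m["url"], m) for m in entries)
```

Because only metadata is read, the cost of the scan grows with the number of entries rather than with the total size of the cached media.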

class media_cache.MediaCache(cache_dir='media_cache', max_size_mb=500, max_memory_items=64)[source]

Bases: object

Two-tier (memory + disk) LRU media cache.

Parameters:
  • cache_dir (str | Path) – Directory for persistent storage. Created automatically.

  • max_size_mb (int) – Approximate cap on total disk usage in megabytes. The least recently used entries are evicted first when the limit is exceeded.

  • max_memory_items (int) – Maximum number of entries whose bytes are kept in RAM. Entries beyond this limit are still indexed (metadata only) and will be read back from disk on the next access.

__init__(cache_dir='media_cache', max_size_mb=500, max_memory_items=64)[source]

Set up an empty cache rooted at cache_dir and create the directory.

Configures the disk-byte budget and in-memory item cap, then allocates the LRU index (an OrderedDict keyed by URL), the hit/miss counters, and the asyncio.Lock that serializes index mutations. The disk scan is deferred to ensure_loaded() so that constructing a MediaCache does not stall the event loop on a directory scan; the only filesystem operation performed here is the mkdir.

Constructed by the gateway and web services from configured values; see gateway_main.py and web_main.py, both of which pass cfg.media_cache_dir and cfg.media_cache_max_mb.

Parameters:
  • cache_dir (str | Path) – Directory used for persistent storage of the .dat/.json files; created automatically.

  • max_size_mb (int) – Approximate cap on total on-disk usage in megabytes; the least recently used entries are evicted first once it is exceeded.

  • max_memory_items (int) – Maximum number of entries whose raw bytes are kept resident in RAM. Excess entries stay indexed by metadata only and reload from disk on next access.

Return type:

None

async ensure_loaded()[source]

Load the in-memory index from disk (non-blocking).

Called during async startup so the sync disk scan does not block the event loop. Idempotent — safe to call multiple times.
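One common way to get this behaviour is to run the synchronous scan in a worker thread and guard it with a flag. The sketch below shows that pattern under stated assumptions: the class `LazyIndex`, its `_scan_disk` placeholder, and the use of `asyncio.to_thread` are illustrative and not confirmed by the source, which does not say how the scan is offloaded.

```python
import asyncio


class LazyIndex:
    """Hypothetical stand-in showing a deferred, idempotent load."""

    def __init__(self) -> None:
        self._loaded = False
        self._lock = asyncio.Lock()
        self.scans = 0  # counts how often the scan actually ran

    def _scan_disk(self) -> None:
        # Placeholder for the blocking directory scan.
        self.scans += 1

    async def ensure_loaded(self) -> None:
        async with self._lock:
            if self._loaded:
                return  # already loaded: later calls are no-ops
            # Assumption: blocking I/O is pushed to a worker thread.
            await asyncio.to_thread(self._scan_disk)
            self._loaded = True
```

Holding the lock across the scan means that concurrent callers wait for the first load to finish instead of starting a second one.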

Return type:

None

async get(url)[source]

Look up url in the cache and return its media triple, or None on a miss.

On a hit, the entry is promoted to most-recently-used and its last_access timestamp is refreshed so the LRU ordering stays accurate. If the entry’s bytes are not resident in RAM (it was loaded by a disk scan or shed by the memory budget), they are re-read from disk via _read_disk(); should that file have vanished, the stale index entry is evicted and a miss is reported. On success the hit counter is incremented and the hit is logged. All work happens under the shared asyncio.Lock.

Reached indirectly through get_or_download(), which is the entry point used by the platform adapters (platforms/discord.py, platforms/matrix.py, platforms/discord_self.py, platforms/emoji_resolver.py).
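The hit path described for this method can be sketched with a plain OrderedDict. Everything here is a simplified assumption: the function `lookup`, the dictionary-shaped entries, and the `read_disk` callback stand in for the real index and _read_disk(), and locking, counters, and logging are omitted.

```python
import time
from collections import OrderedDict


def lookup(index: OrderedDict, url: str, read_disk):
    """Return (data, mimetype, filename) or None (illustrative sketch)."""
    entry = index.get(url)
    if entry is None:
        return None  # never cached: plain miss
    if entry["data"] is None:
        # Bytes are not resident: try to reload them from disk.
        data = read_disk(entry["key"])
        if data is None:
            del index[url]  # file vanished: drop the stale entry
            return None
        entry["data"] = data
    entry["last_access"] = time.time()
    index.move_to_end(url)  # promote to most-recently-used
    return entry["data"], entry["mimetype"], entry["filename"]
```

`OrderedDict.move_to_end` keeps the least recently used entry at the front of the mapping, which is what makes eviction from the front cheap.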

Parameters:

url (str) – The media URL serving as the cache key.

Returns:

(data, mimetype, filename) if the URL is cached and its bytes are recoverable, otherwise None.

Return type:

tuple[bytes, str, str] | None

async put(url, data, mimetype, filename)[source]

Insert media bytes for url, writing through to disk and enforcing limits.

If the URL is already indexed this only bumps its LRU position and access time (the existing bytes are kept). For a new entry it derives the disk key, persists the bytes plus a JSON metadata sidecar via _write_disk(), records the entry in the in-memory index, then evicts the oldest entries while over the disk budget (_enforce_limits()) and sheds resident bytes for entries past the memory cap (_shed_memory()). Touches the filesystem and runs under the shared asyncio.Lock.

Called by get_or_download() after a successful, non-empty download; not invoked directly elsewhere in the repo.
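The two limit-enforcement steps can be illustrated as follows. This is a sketch, not the module's code: `enforce_limits` and `shed_memory` are hypothetical counterparts of _enforce_limits() and _shed_memory(), entries are modelled as plain dicts, and `delete_files` is an assumed callback that removes the .dat/.json pair.

```python
from collections import OrderedDict


def enforce_limits(index: OrderedDict, max_bytes: int, delete_files) -> list:
    """Evict least-recently-used entries until the total fits max_bytes."""
    total = sum(entry["size"] for entry in index.values())
    evicted = []
    while index and total > max_bytes:
        url, entry = index.popitem(last=False)  # front = least recently used
        delete_files(entry["key"])
        total -= entry["size"]
        evicted.append(url)
    return evicted


def shed_memory(index: OrderedDict, max_memory_items: int) -> None:
    """Drop resident bytes for all but the newest max_memory_items entries."""
    resident = [url for url, entry in index.items() if entry["data"] is not None]
    excess = max(len(resident) - max_memory_items, 0)
    for url in resident[:excess]:
        # The entry stays indexed; only its in-RAM bytes are released.
        index[url]["data"] = None
```

Eviction removes an entry entirely, whereas shedding keeps the metadata so the bytes can be read back from disk on the next access.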

Parameters:
  • url (str) – The media URL used as the cache key.

  • data (bytes) – The raw media bytes to persist.

  • mimetype (str) – The media MIME type, stored in the sidecar.

  • filename (str) – A human-readable filename, stored in the sidecar.

Return type:

None

async get_or_download(url, downloader)[source]

Return cached media or call downloader and cache the result.

downloader is an async callable returning (data, mimetype, filename).

GIF images are automatically re-encoded as MP4 before being stored so that the cache always contains the video format.
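The lookup-then-download flow can be sketched with a tiny in-memory stand-in. `TinyCache` is hypothetical, the GIF re-encoding step is left out, and the downloader is assumed to take no arguments, because the source specifies only what it returns.

```python
import asyncio


class TinyCache:
    """Hypothetical in-memory stand-in for the lookup-then-download flow."""

    def __init__(self) -> None:
        self._items = {}

    async def get_or_download(self, url, downloader):
        if url in self._items:
            return self._items[url]  # hit: the downloader is not called
        # Assumption: downloader is a zero-argument async callable.
        result = await downloader()
        data, mimetype, filename = result
        if data:  # only non-empty downloads are stored
            self._items[url] = result
        return result
```

A caller would typically pass a closure that already knows which URL to fetch, so that repeated requests for the same media reach the network only once while the entry remains cached.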

Return type:

tuple[bytes, str, str]

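Parameters:
  • url (str) – The media URL used as the cache key.

  • downloader – Async callable returning (data, mimetype, filename); invoked only on a cache miss.

stats()[source]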

Return a snapshot of cache statistics for monitoring.

Reports the total and in-memory entry counts, on-disk byte and megabyte totals, the configured size cap, the running hit/miss counters, and a derived hit rate. A pure read of in-memory state — it acquires no lock and touches neither disk nor network, so it is cheap and safe to poll from an admin endpoint.

Called by the bot admin status handler in web/bot_admin.py, which exposes the result under the media_cache key of its JSON response.
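A snapshot with the documented keys could be assembled as in the sketch below. The function `build_stats` is hypothetical, and the rounding of total_mb and the value reported for hit_rate before any lookup has happened are assumptions.

```python
def build_stats(entries, entries_in_memory, total_bytes, max_mb, hits, misses):
    """Assemble a stats snapshot from plain counters (illustrative only)."""
    lookups = hits + misses
    return {
        "entries": entries,
        "entries_in_memory": entries_in_memory,
        "total_bytes": total_bytes,
        # Assumption: megabytes are computed as bytes / 1024**2, rounded.
        "total_mb": round(total_bytes / (1024 * 1024), 2),
        "max_mb": max_mb,
        "hits": hits,
        "misses": misses,
        # Assumption: report 0.0 rather than dividing by zero.
        "hit_rate": hits / lookups if lookups else 0.0,
    }
```

Deriving hit_rate at read time keeps the stored state down to two integer counters.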

Returns:

A snapshot with keys entries, entries_in_memory, total_bytes, total_mb, max_mb, hits, misses, and hit_rate.

Return type:

dict[str, Any]