rag_system.pg_source_files module
Postgres whole-file storage for File RAG stores.
Each file-RAG schema may contain:
documents: extracted text (PK filename)
source_files: raw bytes (PK filename)
Chunk vectors live in files_<schema> via vector_store.
- rag_system.pg_source_files.source_tables_ddl(schema)[source]
Build the CREATE statements for a store’s whole-file tables.
Returns the ordered DDL that provisions the per-store schema plus the two whole-file tables (documents for extracted text keyed on filename, source_files for raw bytes keyed on filename) so the layout matches the migrated production schema. Nothing is executed here; it is a pure string builder. The schema portion of every statement is sanitised through _qschema() before interpolation, which is the only injection guard on the returned SQL.
Called by ensure_source_tables(), which iterates the list and runs each statement against the synchronous pool; there are no callers outside this module.
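The pure-string-builder shape described above could look roughly like the following. This is a minimal sketch, not the module's actual code: quote_schema is a hypothetical stand-in for _qschema() (whose real rules are not shown here), the column types and the IF NOT EXISTS clauses are assumptions, and the sha256 and extraction_method column names are inferred from the function parameters.

```python
import re


def quote_schema(schema: str) -> str:
    """Hypothetical stand-in for _qschema(): accept plain identifiers only."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema):
        raise ValueError(f"unsafe schema name: {schema!r}")
    return f'"{schema}"'


def source_tables_ddl_sketch(schema: str) -> list:
    """Return the ordered DDL strings; nothing is executed here."""
    qs = quote_schema(schema)
    return [
        # The schema comes first so the tables have somewhere to live.
        f"CREATE SCHEMA IF NOT EXISTS {qs}",
        # Column types below are assumptions made for illustration.
        f"CREATE TABLE IF NOT EXISTS {qs}.documents ("
        "filename TEXT PRIMARY KEY, sha256 TEXT, content TEXT, "
        "char_count INTEGER, extraction_method TEXT, "
        "extracted_at TIMESTAMPTZ)",
        f"CREATE TABLE IF NOT EXISTS {qs}.source_files ("
        "filename TEXT PRIMARY KEY, content BYTEA, "
        "size_bytes BIGINT, sha256 TEXT)",
    ]
```

Because the schema name is the only interpolated value, rejecting anything that is not a plain identifier keeps the returned SQL safe to execute as-is.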
- rag_system.pg_source_files.ensure_source_tables(schema)[source]
Provision a store’s whole-file schema and tables if they do not exist.
Idempotently creates the per-store schema and its documents/source_files tables so subsequent upserts have somewhere to land. This is the one write-side bootstrap for whole-file storage; every other writer assumes the tables already exist.
Borrows a connection from the shared synchronous pool via vector_store.get_sync_pool() and executes each statement produced by source_tables_ddl() (the only side effect is the Postgres DDL). Called by upsert_whole_file() before it writes, by rag_system.file_rag_manager when a store is initialised, and by scripts/verify_pgvector_stores.py during store verification.
- rag_system.pg_source_files.table_exists(schema, table)[source]
Return True if table exists in schema per information_schema.
Sanitises the schema name through _qschema(), borrows a connection from the shared synchronous pool returned by vector_store.get_sync_pool(), and runs a parameterised SELECT EXISTS against information_schema.tables (both schema and table are passed as bind parameters, not interpolated). It performs a read-only catalog lookup with no side effects on the data tables.
This is the existence guard used before touching the whole-file tables: it is called by get_document_text(), get_source_file_bytes(), list_whole_files(), delete_whole_file() and clear_source_tables() so those operations no-op gracefully when a store has not yet created its documents/source_files tables.
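The catalog lookup described above can be sketched as follows. The function names are hypothetical, and the connection is assumed to be a psycopg2-style DB-API connection whose cursors work as context managers; the real function obtains its connection from vector_store.get_sync_pool() instead.

```python
def table_exists_query(schema: str, table: str) -> tuple:
    """Build the lookup; both values travel as bind parameters."""
    sql = (
        "SELECT EXISTS ("
        "SELECT 1 FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name = %s)"
    )
    return sql, (schema, table)


def table_exists_sketch(conn, schema: str, table: str) -> bool:
    """Run the lookup on an already-open connection (assumed DB-API)."""
    sql, params = table_exists_query(schema, table)
    with conn.cursor() as cur:
        cur.execute(sql, params)
        row = cur.fetchone()
    # EXISTS always yields one row holding a single boolean.
    return bool(row and row[0])
```

Since nothing is interpolated into the statement, the lookup itself cannot be used for injection regardless of what the caller passes as table.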
- rag_system.pg_source_files.upsert_document(schema, filename, sha256, content, extraction_method='text')[source]
Upsert the extracted-text row for a file into the documents table.
Inserts (or, on a primary-key conflict, replaces) the documents row for filename, recording the content hash, the text itself, its character count, how it was extracted, and a fresh UTC extraction timestamp. This is the text half of whole-file storage; the raw-bytes half is handled by upsert_source_file().
Sanitises the schema via _qschema(), stamps extracted_at with the current UTC time, and runs a parameterised INSERT ... ON CONFLICT (filename) DO UPDATE on a connection from vector_store.get_sync_pool(); the sole side effect is the Postgres write. Assumes the table already exists (see ensure_source_tables()). Called by upsert_whole_file() (the only caller).
- Parameters:
schema (str) – The store/schema whose documents table to write.
filename (str) – Primary-key filename for the row.
sha256 (str) – Hex SHA-256 of the source content, stored for change detection.
content (str) – Extracted plain text; its length is stored as char_count.
extraction_method (str) – How content was produced, e.g. "text" or "pdf".
- Return type: None
- Returns: None
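As an illustration of the statement and bind parameters this upsert implies, here is a sketch that only builds them and executes nothing. quote_schema is a hypothetical stand-in for _qschema(), document_upsert is a made-up name, and the sha256 and extraction_method column names are assumed to mirror the parameters.

```python
import re
from datetime import datetime, timezone


def quote_schema(schema: str) -> str:
    """Hypothetical stand-in for _qschema(): accept plain identifiers only."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema):
        raise ValueError(f"unsafe schema name: {schema!r}")
    return f'"{schema}"'


def document_upsert(schema, filename, sha256, content, extraction_method="text"):
    """Return (sql, params) for the documents upsert; nothing is executed."""
    sql = (
        f"INSERT INTO {quote_schema(schema)}.documents "
        "(filename, sha256, content, char_count, extraction_method, extracted_at) "
        "VALUES (%s, %s, %s, %s, %s, %s) "
        "ON CONFLICT (filename) DO UPDATE SET "
        "sha256 = EXCLUDED.sha256, content = EXCLUDED.content, "
        "char_count = EXCLUDED.char_count, "
        "extraction_method = EXCLUDED.extraction_method, "
        "extracted_at = EXCLUDED.extracted_at"
    )
    params = (
        filename,
        sha256,
        content,
        len(content),                # char_count is the length of the text
        extraction_method,
        datetime.now(timezone.utc),  # fresh UTC extraction timestamp
    )
    return sql, params
```

Only the schema is interpolated into the statement; every row value is left as a %s placeholder for the driver to bind.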
- rag_system.pg_source_files.upsert_source_file(schema, filename, raw_bytes, sha256)[source]
Upsert the raw-bytes row for a file into the source_files table.
Inserts (or, on a primary-key conflict, replaces) the source_files row for filename, storing the original bytes as bytea alongside their size and content hash. This is the byte half of whole-file storage that lets a search return the exact original file; the extracted-text half lives in upsert_document().
Sanitises the schema via _qschema() and runs a parameterised INSERT ... ON CONFLICT (filename) DO UPDATE on a connection from vector_store.get_sync_pool(); the only side effect is the Postgres write, and the table is assumed to exist already. Called by upsert_whole_file() (the only caller).
- Parameters:
- Return type: None
- Returns: None
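To see the replace-on-conflict behaviour without a Postgres server, the sketch below swaps in an in-memory SQLite database, which accepts the same ON CONFLICT (filename) DO UPDATE clause. It is a stand-in only: SQLite uses ? placeholders and BLOB where the real code uses %s and bytea, there is no per-store schema, and upsert_source_file_demo is a hypothetical name.

```python
import hashlib
import sqlite3

UPSERT = (
    "INSERT INTO source_files (filename, content, size_bytes, sha256) "
    "VALUES (?, ?, ?, ?) "
    "ON CONFLICT (filename) DO UPDATE SET "
    "content = excluded.content, size_bytes = excluded.size_bytes, "
    "sha256 = excluded.sha256"
)


def upsert_source_file_demo(conn, filename, raw_bytes, sha256):
    """Insert the row, or replace it when the filename already exists."""
    conn.execute(UPSERT, (filename, raw_bytes, len(raw_bytes), sha256))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_files ("
    "filename TEXT PRIMARY KEY, content BLOB, size_bytes INTEGER, sha256 TEXT)"
)

# Writing the same filename twice leaves a single, updated row.
for payload in (b"first", b"second!"):
    digest = hashlib.sha256(payload).hexdigest()
    upsert_source_file_demo(conn, "a.bin", payload, digest)

rows = conn.execute(
    "SELECT filename, content, size_bytes FROM source_files"
).fetchall()
```

After both writes, rows holds one entry for a.bin carrying the second payload and its size.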
- rag_system.pg_source_files.upsert_whole_file(schema, filename, sha256, content, raw_bytes, extraction_method='text')[source]
Persist both the extracted text and raw bytes of one file in one call.
The single public entry point for writing whole-file storage: it guarantees the tables exist, then writes the text and byte halves together so a store holds both a searchable documents row and a downloadable source_files row for filename. There is no transaction spanning the two writes; each upsert runs on its own pooled connection, so a failure between them can leave one half written without the other.
Calls ensure_source_tables() to provision the schema, then upsert_document() (text) and upsert_source_file() (bytes), all of which touch Postgres via vector_store.get_sync_pool(). Called by rag_system.file_rag_manager (through its _upsert_whole_file wrapper, which first hashes the bytes) when whole-file storage is enabled for a store.
- Parameters:
schema (str) – The store/schema to write into.
filename (str) – Primary-key filename shared by both rows.
sha256 (str) – Hex SHA-256 used for both the text and byte rows.
content (str) – Extracted plain text for the documents row.
raw_bytes (bytes) – Original file bytes for the source_files row.
extraction_method (str) – How content was produced, e.g. "text" or "pdf".
- Return type: None
- Returns: None
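A caller-side sketch of preparing the arguments, in the spirit of the wrapper that hashes the bytes first. build_whole_file_args is a hypothetical helper, and decoding the bytes as UTF-8 is only a placeholder for whatever text extraction the real caller performs.

```python
import hashlib


def build_whole_file_args(filename: str, raw_bytes: bytes) -> dict:
    """Hash the bytes once and assemble keyword arguments for the upsert."""
    return {
        "filename": filename,
        # One digest is shared by the text row and the byte row.
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        # Assumption: plain UTF-8 decoding stands in for real extraction.
        "content": raw_bytes.decode("utf-8", errors="replace"),
        "raw_bytes": raw_bytes,
        "extraction_method": "text",
    }


args = build_whole_file_args("notes.txt", b"hello")
# With the module importable, the call would then be:
#   upsert_whole_file("store_a", **args)
```

Hashing once in the caller keeps the two rows consistent with each other even though they are written by separate statements.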
- rag_system.pg_source_files.get_document_text(schema, filename)[source]
Fetch a file’s stored extracted text from the documents table.
Reads back the plain text previously written by upsert_document(), returning None when the store has no documents table yet, the file is absent, or its stored content is NULL, so callers can fall back to re-extraction gracefully. Read-only; no side effects on the data tables.
Guards on table_exists() first, then sanitises the schema via _qschema() and runs a parameterised SELECT content ... WHERE filename = %s on a connection from vector_store.get_sync_pool(). Called by rag_system.file_rag_manager when serving a whole-file result.
- rag_system.pg_source_files.get_source_file_bytes(schema, filename)[source]
Fetch a file’s stored raw bytes from the source_files table.
Reads back the original bytes previously written by upsert_source_file(), returning None when the source_files table is missing, the row is absent, or the stored content is NULL, which lets callers decide whether the exact original file is retrievable. Read-only; no side effects on the data tables.
Guards on table_exists() first, then sanitises the schema via _qschema() and runs a parameterised SELECT content ... WHERE filename = %s on a connection from vector_store.get_sync_pool(), coercing the returned bytea to a bytes object. Called by rag_system.file_rag_manager when serving the original file for a result.
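The row handling and bytea coercion can be sketched as below. Some drivers (psycopg2, for example) hand bytea values back as memoryview objects, which is presumably why the coercion exists; bytes_from_row is a hypothetical helper and the missing-table case is assumed to be handled earlier by the table_exists() guard.

```python
from typing import Optional, Sequence


def bytes_from_row(row: Optional[Sequence]) -> Optional[bytes]:
    """Turn a fetched one-column row into bytes, or None if unavailable."""
    if row is None:        # no row for this filename
        return None
    value = row[0]
    if value is None:      # row exists but the stored content is NULL
        return None
    # bytes() copies a memoryview or bytearray into an immutable object.
    return bytes(value)
```

Callers can then treat None uniformly as "original file not retrievable", whichever of the cases produced it.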
- rag_system.pg_source_files.list_whole_files(schema)[source]
List every whole file in a store with size, hash, and modified time.
Produces one record per filename for the store’s whole-file tables, preferring the richer documents rows (which carry char_count as the size and an extracted_at timestamp) and filling in any filenames that exist only in source_files (using size_bytes as the size and no timestamp). Files present in both tables are deduplicated by filename, with the documents entry winning. Read-only; no side effects.
Sanitises the schema via _qschema(), guards each table with table_exists(), and runs ordered SELECT queries on a connection from vector_store.get_sync_pool(). Called by rag_system.file_rag_manager to enumerate stored whole files.
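The merge rule described above (documents rows win, source_files fills the gaps) might be expressed like this. The row shapes and the record keys (size, sha256, modified) are assumptions made for illustration; the actual record layout is not shown in this reference.

```python
def merge_whole_file_listings(document_rows, source_rows) -> list:
    """Merge the two query results into one record per filename.

    document_rows: iterable of (filename, char_count, sha256, extracted_at)
    source_rows:   iterable of (filename, size_bytes, sha256)
    """
    merged = {}
    for filename, char_count, sha256, extracted_at in document_rows:
        merged[filename] = {
            "filename": filename,
            "size": char_count,        # documents report character count
            "sha256": sha256,
            "modified": extracted_at,
        }
    for filename, size_bytes, sha256 in source_rows:
        # setdefault keeps an existing documents entry untouched.
        merged.setdefault(filename, {
            "filename": filename,
            "size": size_bytes,        # bytes-only files report byte size
            "sha256": sha256,
            "modified": None,          # source_files carries no timestamp
        })
    return list(merged.values())
```

Note that the size field mixes units under this rule: characters for files with a documents row, bytes for files that exist only in source_files.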
- rag_system.pg_source_files.delete_whole_file(schema, filename)[source]
Delete a single file’s rows from both whole-file tables.
Removes filename from documents and source_files so a deleted file leaves no whole-file remnants behind; each table is only touched if it exists, making the call a safe no-op for chunk-only stores or stores that never created the tables. The chunk/vector rows for the file are deleted elsewhere by the caller; this handles only whole-file storage.
Sanitises the schema via _qschema(), guards each table with table_exists(), and issues parameterised DELETE ... WHERE filename = %s statements on a connection from vector_store.get_sync_pool() (the side effect is the Postgres deletes). Called by rag_system.file_rag_manager when removing or replacing an indexed file.
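The guard-then-delete flow can be illustrated by planning the statements separately from running them. plan_whole_file_delete is a hypothetical helper; it takes an already-quoted schema (the output of something like _qschema()) and the set of tables that a table_exists() check reported as present.

```python
def plan_whole_file_delete(quoted_schema: str, filename: str, existing_tables) -> list:
    """Return (sql, params) pairs for the tables that actually exist."""
    statements = []
    for table in ("documents", "source_files"):
        if table not in existing_tables:
            continue  # a missing table is skipped, not treated as an error
        statements.append((
            f"DELETE FROM {quoted_schema}.{table} WHERE filename = %s",
            (filename,),  # the filename is bound, never interpolated
        ))
    return statements
```

With an empty existing_tables set the plan is empty, which matches the no-op behaviour for stores that never created the whole-file tables.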
- rag_system.pg_source_files.clear_source_tables(schema)[source]
Truncate both whole-file tables, wiping all stored files for a store.
Empties documents and source_files in one pass so a store can be fully reset without dropping its schema; each table is only truncated if it exists, so the call is a no-op for stores that never created them. This clears whole-file storage only; chunk/vector data is reset separately by the caller.
Sanitises the schema via _qschema(), guards each table with table_exists(), and runs TRUNCATE TABLE on a connection from vector_store.get_sync_pool() (the side effect is the Postgres truncates). Called by rag_system.file_rag_manager when clearing or rebuilding a store.