RAG pipeline documentation

This document covers the platform's built-in RAG (Retrieval-Augmented Generation) pipeline: architecture, data-processing flow, index update procedures, and best practices. Paired with the RAG admin area (/RagAdmin), it lets you build AI applications grounded in your internal knowledge.


1. Overall architecture

The platform's RAG pipeline is composed of the following components:

  • Data Source: Internal documents, SharePoint, wikis, PDF archives, etc.
  • Crawler: Periodically crawls Data Sources and detects deltas
  • Preprocessor: Text extraction, cleaning, normalization, chunk splitting
  • Embedder: Generates vector embeddings
  • Vector Store: Vector database for search (Azure AI Search, etc.)
  • Indexer: Ingests metadata + embedding + source into the index
  • RAG Runtime: Search + generation pipeline (LLM)

Each of these components can be monitored, run manually, and tested from the RagAdmin screens.
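The component flow above can be sketched end to end as a simple chain of stages. This is a minimal illustration; every function and field name below is hypothetical, not the platform's actual API:

```python
# Sketch of the crawl -> preprocess -> chunk -> embed -> index flow.
# All names here are illustrative, not the platform's real API.

def run_pipeline(source_docs, embed, index):
    """Push each crawled document through the pipeline stages."""
    for raw in source_docs:                # Crawler output
        text = preprocess(raw["body"])     # Preprocessor: clean + normalize
        for chunk in split_chunks(text):   # chunking (see section 2)
            vector = embed(chunk)          # Embedder
            index.append({                 # Indexer: vector + text + metadata
                "text": chunk,
                "vector": vector,
                "sourcePath": raw["path"],
            })

def preprocess(body: str) -> str:
    """Stand-in cleaning step: collapse all whitespace runs."""
    return " ".join(body.split())

def split_chunks(text: str, size: int = 800) -> list[str]:
    """Stand-in chunker: fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

Each stage is described in more detail in section 2.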

2. Data flow (crawl → index → RAG)

(1) Crawler (fetch)
  • Data fetch via authentication (OAuth / API key / cookie, etc.)
  • Delta detection (update time / hash comparison)
  • Supported formats: PDF, Word, Excel, HTML, Markdown, TXT, JSON
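Hash-based delta detection can be sketched as follows. This is a minimal illustration: the real crawler also compares update times, and the hash store here is just an in-memory dict:

```python
import hashlib

def content_hash(body: bytes) -> str:
    """Stable fingerprint of a fetched document's raw bytes."""
    return hashlib.sha256(body).hexdigest()

def detect_deltas(fetched: dict[str, bytes], known_hashes: dict[str, str]) -> list[str]:
    """Return paths whose content is new or changed since the last crawl."""
    changed = []
    for path, body in fetched.items():
        h = content_hash(body)
        if known_hashes.get(path) != h:
            changed.append(path)
            known_hashes[path] = h  # remember for the next run
    return changed
```

Only the paths returned here need to flow into the Preprocessor, which keeps incremental crawls cheap.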
(2) Preprocessor

This step has the biggest impact on knowledge quality.

  • OCR (PDF / images)
  • Noise removal (navigation / TOC / headers / footers)
  • Paragraph- / context-aware chunking (default: 500–1,200 chars)
  • Metadata generation (title / path / tags / last-updated)
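Paragraph-aware chunking might be implemented as greedy packing of whole paragraphs up to the upper size bound. A simplified sketch, not the platform's actual splitter; note that a single paragraph longer than max_len would still need a further split:

```python
def chunk_paragraphs(text: str, max_len: int = 1200) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_len chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_len:
            chunks.append(current)   # current chunk is full; start a new one
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraph boundaries intact is what distinguishes this from a fixed-size character window: each chunk stays self-contained.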
(3) Embedder
  • OpenAI / Azure OpenAI (text-embedding-3-large, etc.)
  • Vector generation after punctuation stripping and normalization
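The punctuation stripping and normalization step can be sketched with the standard library. Whether the platform also lowercases is an assumption; treat this as a generic example of the technique, not the platform's exact rules:

```python
import unicodedata

def normalize_for_embedding(text: str) -> str:
    """Unicode-normalize, strip punctuation characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    cleaned = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")  # drop punctuation (category P*)
    )
    return " ".join(cleaned.split()).lower()
```

The normalized string, not the raw chunk, is what gets sent to the embedding model.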
(4) Indexer
  • Stores vector + text + metadata as a document in the index
  • Supports partial-update and full-rebuild modes
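The two update modes can be sketched with the index modeled as a dict keyed by document id (illustrative only):

```python
def partial_update(index: dict[str, dict], changed_docs: list[dict]) -> None:
    """Upsert only the documents flagged as changed by delta detection."""
    for doc in changed_docs:
        index[doc["id"]] = doc

def full_rebuild(all_docs: list[dict]) -> dict[str, dict]:
    """Drop the old index entirely and re-ingest every document."""
    return {doc["id"]: doc for doc in all_docs}
```

Partial update pairs naturally with the crawler's delta detection; a full rebuild is the fallback when chunking or metadata rules change.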
(5) RAG runtime (search + generation)

The final step: search results are handed to the LLM.

  • Top-k search (default 5–10)
  • Hybrid search (vector + keyword)
  • Context-window optimization (dedup near-duplicate text)
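One common way to merge vector and keyword rankings is reciprocal rank fusion (RRF). The source does not say which fusion method the platform uses, so this is a generic sketch of the technique:

```python
def reciprocal_rank_fusion(vector_hits: list[str], keyword_hits: list[str],
                           k: int = 60, top_k: int = 5) -> list[str]:
    """Merge two ranked result-id lists by summing reciprocal-rank scores."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            # A document ranked highly in either list gets a large contribution.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Documents that appear near the top of both lists float to the top of the merged ranking, which is the effect hybrid search is after.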

3. Index schema

  Field        Description
  id           Unique document ID (UUID)
  text         Chunked body text
  vector       Embedding vector (float[])
  sourceType   SharePoint / Web / PDF / Wiki / Other
  sourcePath   Source URL or path
  updatedAt    Last-updated timestamp
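The schema maps naturally onto a typed record. The Python types below are the obvious equivalents; the ISO 8601 timestamp format is an assumption:

```python
from typing import Literal, TypedDict

class RagDocument(TypedDict):
    """One indexed chunk, with the fields from the schema table above."""
    id: str                    # unique document ID (UUID)
    text: str                  # chunked body text
    vector: list[float]        # embedding vector
    sourceType: Literal["SharePoint", "Web", "PDF", "Wiki", "Other"]
    sourcePath: str            # source URL or path
    updatedAt: str             # last-updated timestamp (ISO 8601 assumed)

doc: RagDocument = {
    "id": "123e4567-e89b-12d3-a456-426614174000",
    "text": "Quarterly security policy, section 3",
    "vector": [0.12, -0.03, 0.87],
    "sourceType": "PDF",
    "sourcePath": "https://example.com/policies/security.pdf",
    "updatedAt": "2024-05-01T09:30:00Z",
}
```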

4. Best practices

■ Chunk-size tuning
  • Chunks that are too short lose context and hurt retrieval accuracy
  • Chunks that are too long mix unrelated content (noise) into each hit
  • 500–800 chars is the most stable range in practice
■ Rich metadata
  • Adding department / document type / tags improves search accuracy
  • Tags can be re-applied later
■ Update-frequency tuning
  • Small corpora: daily full updates are practical
  • Large corpora: prefer weekly runs or change-detected (delta) updates

5. APIs and webhooks

The RAG pipeline integrates with external systems via APIs and webhooks.

  • API docs: index creation, search, and status endpoints
  • Webhooks: crawl-complete and index-updated notifications
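A webhook consumer might look like the following. The event type strings and payload fields (documentCount, indexVersion) are hypothetical, since the payload shape is not specified here:

```python
import json

def handle_webhook(raw_body: str) -> str:
    """Dispatch a pipeline notification (hypothetical payload shape)."""
    event = json.loads(raw_body)
    if event["type"] == "crawl.completed":
        return f"crawled {event['documentCount']} documents"
    if event["type"] == "index.updated":
        return f"index now at version {event['indexVersion']}"
    return "ignored"  # unknown event types are safe to skip
```

In production this would sit behind an HTTP endpoint and should verify the request's authenticity before parsing.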

6. Troubleshooting

  • Crawl stops partway → check for expired credentials or IP restrictions
  • Index keeps growing → revisit the delta-detection rules (unchanged documents may be re-ingested as new)
  • Search quality is unstable → re-run chunking and rebuild metadata

Next: API docs →