RAG pipeline documentation
This document covers the platform's built-in RAG (Retrieval-Augmented Generation) pipeline: architecture, data-processing flow, index update procedures, and best practices. Used together with the RAG admin area (/RagAdmin), it lets you build AI applications grounded in your internal knowledge.
1. Overall architecture
The platform's RAG is composed of the following components.
- Data Source: Internal documents, SharePoint, wikis, PDF archives, etc.
- Crawler: Periodically crawls Data Sources and detects deltas
- Preprocessor: Text extraction, cleaning, normalization, chunk splitting
- Embedder: Generates vector embeddings
- Vector Store: Vector database for search (Azure AI Search, etc.)
- Indexer: Ingests metadata + embedding + source into the index
- RAG Runtime: Search + generation pipeline (LLM)
These components are managed through the RagAdmin screens (status, manual runs, and testing).
2. Data flow (crawl → index → RAG)
(1) Crawler (fetch)
- Authenticated data fetch (OAuth / API key / cookie, etc.)
- Delta detection (update time / hash comparison)
- Supported formats: PDF, Word, Excel, HTML, Markdown, TXT, JSON
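
The delta detection described above (update time plus hash comparison) can be sketched as follows. This is a minimal illustration, not the platform's actual crawler: `SourceDoc`, `detect_delta`, and the `seen` state dictionary are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDoc:
    # Hypothetical shape of a fetched item; not a platform type.
    path: str          # source URL or path
    updated_at: float  # last-modified time reported by the source
    content: bytes     # raw fetched bytes


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of the raw bytes (assumed hash function)."""
    return hashlib.sha256(data).hexdigest()


def detect_delta(docs, seen):
    """Return (changed_docs, removed_paths) and update `seen` in place.

    `seen` maps path -> (updated_at, hash) from the previous crawl.
    """
    changed = []
    current_paths = set()
    for doc in docs:
        current_paths.add(doc.path)
        prev = seen.get(doc.path)
        # Cheap check first: an unchanged timestamp is treated as unchanged.
        if prev is not None and prev[0] == doc.updated_at:
            continue
        digest = content_hash(doc.content)
        is_new_content = prev is None or prev[1] != digest
        seen[doc.path] = (doc.updated_at, digest)
        # A moved timestamp with identical bytes needs no re-indexing.
        if is_new_content:
            changed.append(doc)
    # Anything seen last time but missing now is reported as removed.
    removed = sorted(set(seen) - current_paths)
    for path in removed:
        del seen[path]
    return changed, removed
```

Checking the timestamp before hashing keeps the common "nothing changed" case cheap, while the hash filters out documents that were merely re-saved without content changes.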
(2) Preprocessor
This step has the biggest impact on knowledge quality.
- OCR (PDF / images)
- Noise removal (navigation / TOC / headers / footers)
- Paragraph- / context-aware chunking (default: 500–1,200 chars)
- Metadata generation (title / path / tags / last-updated)
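
As a rough illustration of paragraph-aware chunking, the sketch below packs whole paragraphs into chunks up to a character limit. It assumes paragraphs are separated by blank lines and uses the upper end of the default range (1,200 chars) as the limit; `chunk_text` is a hypothetical helper, and the platform's real splitter may work differently (for example, it may also enforce the lower bound).

```python
import re


def chunk_text(text, max_chars=1200):
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    # Assumption: a blank line marks a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        # A paragraph longer than max_chars is hard-split into slices.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                # Still fits: keep the paragraph with its neighbours.
                current = candidate
            else:
                # Would overflow: close the current chunk and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact where possible preserves local context inside each chunk; only oversized paragraphs are cut at arbitrary character positions.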
(3) Embedder
- OpenAI / Azure OpenAI (text-embedding-3-large, etc.)
- Vector generation after punctuation stripping and normalization
(4) Indexer
- Stores vector + text + metadata as a document in the index
- Supports partial-update and full-rebuild modes
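
The difference between the two modes can be shown with a toy in-memory index. This is only a sketch: `InMemoryIndex` and its methods are hypothetical and stand in for whatever the configured Vector Store provides.

```python
class InMemoryIndex:
    """Toy stand-in for a vector store, keyed by document id."""

    def __init__(self):
        self.docs = {}

    def partial_update(self, upserts, deleted_ids=()):
        """Upsert changed documents and drop deleted ones; leave the rest as-is."""
        for doc in upserts:
            self.docs[doc["id"]] = doc
        for doc_id in deleted_ids:
            self.docs.pop(doc_id, None)

    def full_rebuild(self, all_docs):
        """Discard the current contents and re-ingest everything."""
        self.docs = {doc["id"]: doc for doc in all_docs}


index = InMemoryIndex()
index.full_rebuild([{"id": "1", "text": "old"}, {"id": "2", "text": "keep"}])
index.partial_update([{"id": "1", "text": "new"}], deleted_ids=["2"])
```

A partial update touches only the documents reported by delta detection, whereas a full rebuild also clears out any stale entries that earlier partial updates missed.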
(5) RAG runtime (search + generation)
The final step: search results are handed to the LLM.
- Top-k search (default 5–10)
- Hybrid search (vector + keyword)
- Context-window optimization (dedup near-duplicate text)
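
The three runtime steps above can be sketched together. This document does not say how vector and keyword results are combined, so the example uses reciprocal rank fusion (RRF) as one common option, and `difflib.SequenceMatcher` as an assumed similarity measure for near-duplicate removal. All function names, the `k=60` constant, and the 0.9 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def rrf_fuse(vector_ids, keyword_ids, k=60):
    """Merge two ranked id lists with reciprocal rank fusion."""
    scores = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def dedup_chunks(ranked_ids, texts, threshold=0.9):
    """Drop chunks whose text is nearly identical to a higher-ranked chunk."""
    kept = []
    for doc_id in ranked_ids:
        is_duplicate = any(
            SequenceMatcher(None, texts[doc_id], texts[other], autojunk=False).ratio()
            >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(doc_id)
    return kept


def build_context(vector_ids, keyword_ids, texts, top_k=5):
    """Fuse both result lists, remove near-duplicates, then keep the top_k ids."""
    fused = rrf_fuse(vector_ids, keyword_ids)
    return dedup_chunks(fused, texts)[:top_k]
```

Deduplicating before truncating to `top_k` means a dropped near-duplicate frees its slot for the next distinct chunk, which is the point of the context-window optimization. Pairwise `SequenceMatcher` comparison is slow on long texts, so a production pipeline would likely use a cheaper similarity signal.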
3. Index schema
| Field | Description |
|---|---|
| id | Unique document ID (UUID) |
| text | Chunked body text |
| vector | Embedding vector (float[]) |
| sourceType | SharePoint / Web / PDF / Wiki / Other |
| sourcePath | Source URL or path |
| updatedAt | Last-updated timestamp |
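
For reference, one way to model a document with these fields in application code is a small dataclass. This is a sketch under assumptions: the class name `IndexDocument` is hypothetical, and storing `updatedAt` as an ISO 8601 UTC string is an assumed format that the schema above does not specify.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Allowed values taken from the sourceType row of the schema table.
SOURCE_TYPES = {"SharePoint", "Web", "PDF", "Wiki", "Other"}


def _now_iso() -> str:
    # Assumed timestamp format: ISO 8601 in UTC.
    return datetime.now(timezone.utc).isoformat()


@dataclass
class IndexDocument:
    text: str         # chunked body text
    vector: list      # embedding vector (list of floats)
    sourceType: str   # one of SOURCE_TYPES
    sourcePath: str   # source URL or path
    updatedAt: str = field(default_factory=_now_iso)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self):
        # Reject values outside the documented sourceType set.
        if self.sourceType not in SOURCE_TYPES:
            raise ValueError(f"unknown sourceType: {self.sourceType!r}")


doc = IndexDocument(
    text="How to request VPN access.",
    vector=[0.12, -0.03, 0.48],
    sourceType="Wiki",
    sourcePath="/wiki/it/vpn",
)
payload = asdict(doc)  # plain dict, ready to serialize for an indexer
```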
4. Best practices
■ Chunk-size tuning
- Chunks that are too short hurt search accuracy
- Chunks that are too long mix in noise
- 500–800 chars is the most stable range
■ Rich metadata
- Adding department / document type / tags improves search accuracy
- Tags can be re-applied later
■ Update-frequency tuning
- Daily updates: suited to small document counts
- Weekly or change-triggered updates: suited to large corpora
5. APIs and webhooks
The RAG pipeline integrates with external systems via APIs and webhooks.
6. Troubleshooting
- Crawl stops partway → check for expired auth or IP restrictions
- Index keeps growing → revisit the delta-detection rules
- Search quality is unstable → re-run chunking and rebuild metadata