Building Enterprise-Grade RAG 2.0 Systems: A Complete Practice from Document Parsing to Knowledge Retrieval
Drawing on manufacturing and finance scenarios, this article takes a close look at advanced RAG techniques, including complex document parsing, ontology constraints, and cache optimization.
1. Background: Evolution from RAG to RAG 2.0
Large language models (LLMs) face three core problems in practical deployment: hallucination (producing false information), outdated knowledge (nothing after the training data cutoff date), and data security (the risk of leaking sensitive information). Retrieval-Augmented Generation (RAG) mitigates these problems by grounding the model in an external knowledge base.
Traditional RAG systems (i.e., RAG 1.0) typically adopt a linear "index-retrieve-generate" pipeline: chunk the documents, vectorize the chunks, retrieve the fragments relevant to a user query, and feed them to an LLM for answer generation. This approach works well in simple scenarios, but when faced with enterprise-level complex documents (e.g., PDF scans, engineering drawings, tables with multi-level headers) and multi-turn dialogues, it exposes the following shortcomings:
RAG 2.0 systematically upgrades the architecture by introducing modular design, splitting indexing, pre-retrieval processing, post-retrieval processing, and generation into independent pluggable components. Additionally, it incorporates ontology and caching mechanisms to make the system more controllable and efficient in complex business scenarios.
2. Document Parsing: From "Text Recognition" to "Structure Restoration"
Document parsing is the cornerstone of RAG systems. Enterprise documents come in various types, including PDFs, Word files, scanned documents, and engineering drawings. Improper parsing will undermine the semantic foundation for subsequent retrieval and generation.
2.1 Layout Structure Parsing
For PDFs and scanned documents, it is necessary to restore layout elements (titles, paragraphs, tables, images, headers/footers) and their reading order. Common open-source tools include RAGFlow's DeepDoc, PaddleOCR, etc. The core steps include:
2.2 Engineering Drawing Parsing
In manufacturing, engineering drawings contain critical information such as title blocks, drawing numbers, versions, materials, and technical requirements. Parsing requires:
2.3 Knowledge Chunking Strategy
Chunking is a key step after document parsing. Chunks that are too short lose context, while overly long chunks introduce noise. A two-stage chunking approach is recommended:
After chunking, two types of indexes should be generated:
3. Query Rewriting: Resolving Anaphora and Missing Information in Multi-Turn Dialogues
In multi-turn dialogues, follow-up queries often depend on the preceding context. For example, in "What is its price?", the word "its" refers to a previously mentioned product. Without rewriting, retrieval on the raw query is likely to fail, because the query no longer names the product it is about.
3.1 Anaphora Resolution and Information Completion
Model multi-turn query rewriting as a relation extraction task:
For example:
3.2 Rewriting Model Selection
Lightweight sequence labeling models (e.g., TPLinker) or small-parameter LLMs (e.g., Qwen2.5-7B) can be used for rewriting. For production environments, a hybrid rule+model strategy is recommended: rules handle common patterns, while models handle complex anaphora.
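The hybrid rule+model strategy can be illustrated with a minimal sketch. It assumes the dialogue manager already tracks the most recently mentioned entity; `PRONOUN_RULES`, `rewrite_query`, and the `model_rewrite` callback are hypothetical names introduced here for illustration, not part of any specific library.

```python
import re
from typing import Callable, Optional

# Hypothetical rule table: pronoun patterns that a rule can resolve directly.
# Each entry maps a pattern to a template filled with the tracked entity.
PRONOUN_RULES = [
    (re.compile(r"\bits\b", re.IGNORECASE), "{entity}'s"),
    (re.compile(r"\b(?:it|this one|that one)\b", re.IGNORECASE), "{entity}"),
]


def rewrite_query(
    query: str,
    last_entity: Optional[str],
    model_rewrite: Optional[Callable[[str], str]] = None,
) -> str:
    """Rewrite a follow-up query: try rules first, then fall back to a model."""
    if last_entity:
        for pattern, template in PRONOUN_RULES:
            if pattern.search(query):
                replacement = template.format(entity=last_entity)
                # A function replacement avoids backslash escapes in the entity name.
                return pattern.sub(lambda _m: replacement, query, count=1)
    if model_rewrite is not None:
        # Assumption: the callback wraps a sequence labeling model or a small LLM.
        return model_rewrite(query)
    return query


print(rewrite_query("What is its price?", "Model X200"))
```

Rules of this kind are cheap and predictable, so only the queries they cannot resolve need to pay the latency of a model call.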
4. Hybrid Retrieval: Vector + Full-Text + Knowledge Graph
No single retrieval method covers all scenarios. Hybrid retrieval improves recall and precision by fusing multiple retrieval results.
4.1 Vector Retrieval
4.2 Full-Text Retrieval
4.3 Knowledge Graph Retrieval
For domains with dense entity relationships (e.g., finance, healthcare), a knowledge graph can be introduced. It provides structured knowledge through graph traversal and relation path matching.
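As a rough illustration of graph traversal and relation path matching, the sketch below runs a bounded breadth-first search over a toy triple store. The triples, entity names, and function names are invented for the example; a production system would more likely query a graph database.

```python
from collections import deque

# Toy (subject, relation, object) triples; all names are invented for illustration.
TRIPLES = [
    ("FundA", "managed_by", "ManagerX"),
    ("FundA", "invests_in", "BondB"),
    ("BondB", "issued_by", "CompanyC"),
    ("CompanyC", "located_in", "RegionD"),
]


def build_adjacency(triples):
    """Index outgoing edges by subject."""
    adjacency = {}
    for subj, rel, obj in triples:
        adjacency.setdefault(subj, []).append((rel, obj))
    return adjacency


def relation_paths(adjacency, start, max_hops):
    """Breadth-first traversal: every path of 1..max_hops hops starting at `start`."""
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # the hop limit also guarantees termination on cyclic graphs
        for rel, obj in adjacency.get(node, []):
            new_path = path + [(node, rel, obj)]
            paths.append(new_path)
            queue.append((obj, new_path))
    return paths


def match_relation_path(paths, relations):
    """Keep only the paths whose relation sequence equals `relations`."""
    wanted = list(relations)
    return [p for p in paths if [rel for _, rel, _ in p] == wanted]


adjacency = build_adjacency(TRIPLES)
paths = relation_paths(adjacency, "FundA", max_hops=2)
for path in match_relation_path(paths, ["invests_in", "issued_by"]):
    print(" -> ".join(f"{s} [{r}] {o}" for s, r, o in path))
```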
Hybrid Retrieval Pipeline:
5. Ranking Optimization: From Coarse to Fine Ranking
The candidate set returned by the retrieval stage typically contains dozens to hundreds of fragments, requiring further ranking to select the most relevant Top-K fragments.
5.1 Coarse Ranking: RRF Fusion
RRF (Reciprocal Rank Fusion) fuses multiple result lists based on rank rather than score, using the formula:
$$\text{score}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$$
where $R$ is the set of retrieval paths, $\text{rank}_r(d)$ is the rank of document $d$ in path $r$, and $k$ is a smoothing parameter (typically 60).
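The formula translates almost directly into code. The sketch below assumes each retrieval path returns a list of document IDs ordered from best to worst, with ranks starting at 1; the function name `rrf_fuse` and the sample IDs are illustrative.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Returns (doc_id, score) pairs sorted by descending fused score;
    ties are broken by doc_id so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["d1", "d2", "d3"]
fulltext_hits = ["d2", "d4", "d1"]
for doc_id, score in rrf_fuse([vector_hits, fulltext_hits]):
    print(doc_id, round(score, 5))
```

Because only ranks enter the formula, the vector and full-text scores never need to be normalized against each other, and a document that ranks well in several lists (here `d2`) rises above one that appears in a single list.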
5.2 Fine Ranking: Cross-Encoder
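After coarse ranking, retain roughly the top 20 candidates and use a cross-encoder (e.g., BGE-Reranker) for fine-grained ranking. The cross-encoder concatenates the query and the document, feeds the pair into a Transformer, and computes a relevance score. Because it attends over both texts jointly, it is more precise than a bi-encoder, but it must run once per query-document pair, which is why it is applied only to the shortlist.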
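The shape of this two-stage step can be sketched with a pluggable pairwise scorer. Note that `token_overlap_score` below is only a toy stand-in so the example runs without a model; it is not a cross-encoder. In practice the `score_pair` argument would wrap a reranker model. All names here are illustrative assumptions.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    coarse_n: int = 20,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score the first `coarse_n` candidates pairwise and return the best `top_k`."""
    shortlisted = candidates[:coarse_n]  # candidates are assumed coarse-ranked already
    scored = [(doc, score_pair(query, doc)) for doc in shortlisted]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def token_overlap_score(query: str, document: str) -> float:
    """Toy stand-in scorer: fraction of query tokens that appear in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


candidates = [
    "pump maintenance schedule",
    "bearing torque spec for model a",
    "torque wrench calibration",
]
for doc, score in rerank("bearing torque spec", candidates, token_overlap_score, top_k=2):
    print(round(score, 2), doc)
```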
5.3 Knowledge Filtering
After fine ranking, business rule filtering can be applied, for example:
6. Ontology Constraints: Making RAG Output More Controllable
Generic RAG systems lack domain knowledge constraints and may output answers that violate business logic. Introducing an ontology as a semantic foundation can effectively improve output accuracy and interpretability.
6.1 Ontology Modeling
An ontology is a formal description of concepts, relationships, and rules in a business domain. In enterprise RAG, an ontology typically includes:
6.2 Ways to Integrate Ontology with RAG
6.3 Practical Example
In a financial scenario, the ontology can define matching rules between "customer risk level" and "investment product risk level". When a user asks "Recommend financial products", the system retrieves relevant products and then uses ontology validation to ensure only risk-matched products are recommended, avoiding compliance risks.
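A minimal sketch of such a validation step follows. The risk scales, level names, and product records are hypothetical; a real ontology would define its own levels and matching rules, typically in a dedicated modeling language rather than in Python dictionaries.

```python
# Hypothetical ordinal risk scales, used here only to illustrate the matching rule.
CUSTOMER_RISK_ORDER = {"conservative": 1, "balanced": 2, "aggressive": 3}
PRODUCT_RISK_ORDER = {"R1": 1, "R2": 2, "R3": 3}


def is_risk_matched(customer_level: str, product_level: str) -> bool:
    """Assumed hard constraint: product risk must not exceed customer risk tolerance."""
    return PRODUCT_RISK_ORDER[product_level] <= CUSTOMER_RISK_ORDER[customer_level]


def filter_recommendations(customer_level: str, products: list) -> list:
    """Drop retrieved products that violate the risk-matching rule."""
    return [p for p in products if is_risk_matched(customer_level, p["risk_level"])]


retrieved_products = [
    {"name": "Money Market Fund", "risk_level": "R1"},
    {"name": "Mixed Bond Fund", "risk_level": "R2"},
    {"name": "Equity Growth Fund", "risk_level": "R3"},
]
for product in filter_recommendations("balanced", retrieved_products):
    print(product["name"])
```

Running the check after retrieval and before generation means the LLM never sees candidates that the rule excludes.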
7. Cache Optimization: Reducing Latency and Cost
For high-frequency repeated queries, caching can significantly reduce the number of retrieval and LLM calls.
7.1 Semantic Cache
Traditional caching relies on exact matching, but user queries may have synonymous rewrites. Semantic cache uses vector similarity to determine whether a query matches an existing cache entry; if the similarity exceeds a threshold, the cached result is returned directly.
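The lookup logic can be sketched in a few lines. The example assumes query embeddings are computed elsewhere and passed in as plain lists of floats; `SemanticCache` is an illustrative class, and the linear scan would be replaced by a vector index at realistic cache sizes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Return a cached answer when a new query is similar enough to a stored one."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self._entries = []  # list of (embedding, answer) pairs

    def put(self, query_embedding, answer):
        self._entries.append((list(query_embedding), answer))

    def get(self, query_embedding):
        best_answer, best_similarity = None, -1.0
        for embedding, answer in self._entries:  # linear scan, fine for a sketch
            similarity = cosine_similarity(query_embedding, embedding)
            if similarity > best_similarity:
                best_answer, best_similarity = answer, similarity
        return best_answer if best_similarity >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "cached answer")
print(cache.get([0.99, 0.1]))  # near-duplicate query
print(cache.get([0.0, 1.0]))   # unrelated query
```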
7.2 Cache Strategies
7.3 Cache Invalidation
When the knowledge base content is updated, related caches must be cleared. Ontology relationships can be used to track affected queries for precise invalidation.
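One simple way to make invalidation precise is to record, for each cached answer, which source documents it was built from. The sketch below tracks plain document IDs, which is a simplification of the ontology-based tracking described above; the class and method names are illustrative.

```python
from collections import defaultdict


class DependencyTrackedCache:
    """Cache that remembers which source documents each cached answer depends on."""

    def __init__(self):
        self._answers = {}
        self._keys_by_doc = defaultdict(set)  # doc_id -> cache keys built from it

    def put(self, query_key, answer, source_doc_ids):
        self._answers[query_key] = answer
        for doc_id in source_doc_ids:
            self._keys_by_doc[doc_id].add(query_key)

    def get(self, query_key):
        return self._answers.get(query_key)

    def invalidate_document(self, doc_id):
        """Evict every entry that used `doc_id`; return how many were removed."""
        removed = 0
        for key in self._keys_by_doc.pop(doc_id, set()):
            if key in self._answers:
                del self._answers[key]
                removed += 1
        return removed


cache = DependencyTrackedCache()
cache.put("q1", "answer 1", ["docA", "docB"])
cache.put("q2", "answer 2", ["docB"])
cache.put("q3", "answer 3", ["docC"])
print(cache.invalidate_document("docB"))  # evicts q1 and q2, leaves q3 untouched
```

An ontology-aware variant could extend the same index from document IDs to entities or relations, so that updating one concept evicts the answers that relied on it.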
8. System Architecture and Engineering Practice
8.1 Layered Architecture
A typical enterprise-grade RAG 2.0 system adopts a layered design:
8.2 Offline and Online Separation
8.3 Observability
9. Conclusion
Enterprise-grade RAG 2.0 systems require systematic optimization across multiple dimensions, including document parsing, query rewriting, hybrid retrieval, ranking optimization, ontology constraints, and caching. The practical methods introduced in this article have been deployed in industries such as manufacturing and finance, significantly improving retrieval accuracy and system response efficiency.
For a deeper understanding of these techniques, refer to the related articles "RAG System Evaluation and Optimization" and "AI Agent and Multi-Agent".
FAQ
1. How to choose document chunk size? Chunk size should balance document type and retrieval scenario. Generally, 256-512 tokens with 10%-20% overlap is recommended. For tables or code snippets, smaller chunks may be used to preserve structural integrity.
2. How to set weights for vector retrieval and full-text retrieval in hybrid retrieval? Weights can be determined via grid search or Bayesian optimization. A common practice is to use RRF fusion, which does not require explicit weights; if weighting is needed, vector retrieval weight of 0.6-0.7 and full-text retrieval weight of 0.3-0.4 is suggested.
3. How to handle conflicts between ontology constraints and LLM output? When LLM output conflicts with hard ontology constraints, ontology rules should take precedence, triggering a retry or returning an error message. For soft constraints (e.g., business conventions), the output can be marked as "unverified" with a user prompt.
4. How to evaluate the effectiveness of a RAG system? Common metrics include retrieval recall (Recall@K), ranking precision (MRR, NDCG), and generation accuracy (based on human or LLM evaluation). It is recommended to establish online A/B testing pipelines for continuous iteration.
5. What to do if cache hit rate is low? Analyze query patterns and normalize high-frequency queries into templates, or lower the semantic cache similarity threshold (e.g., from 0.9 to 0.8). Be aware that a lower threshold raises the risk of false positives, where a cached answer is returned for a query that only looks similar.
6. How to ensure data security? Document parsing and retrieval should be performed within the intranet, with RBAC controlling access permissions. Sensitive content should be desensitized, and raw data should not be transmitted during LLM calls.
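The retrieval metrics named in question 4 are straightforward to compute offline. The sketch below assumes each test query comes with a ranked list of retrieved document IDs and a labeled set of relevant IDs; the function names and sample data are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant_set)
    return hits / len(relevant_set)


def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant document per query (0 if none found)."""
    if not all_retrieved:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        relevant_set = set(relevant)
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant_set:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)


print(recall_at_k(["d1", "d2", "d3"], ["d2", "d9"], k=2))
print(mean_reciprocal_rank([["d1", "d2"], ["d3", "d4"]], [["d2"], ["d3"]]))
```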