Building Enterprise-Grade RAG 2.0 Systems: A Complete Practice from Document Parsing to Knowledge Retrieval

Drawing on manufacturing and finance scenarios, this article walks through advanced RAG techniques, including complex document parsing, ontology constraints, and cache optimization.

By AI Skill Navigation Editorial Team

1. Background: Evolution from RAG to RAG 2.0

Large language models (LLMs) face three core problems in practical deployment: hallucination (generating false information), outdated knowledge (nothing after the training data cutoff date), and data security (the risk of leaking sensitive information). Retrieval-Augmented Generation (RAG) alleviates these problems by grounding generation in an external knowledge base.

Traditional RAG systems (i.e., RAG 1.0) typically adopt a linear pipeline of "index-retrieve-generate": chunking documents, vectorizing them, retrieving relevant fragments based on user queries, and then feeding them to a large model for answer generation. This approach works well in simple scenarios, but when faced with enterprise-level complex documents (e.g., PDF scans, engineering drawings, multi-level header tables) and multi-turn dialogues, it exposes the following shortcomings:

Coarse document parsing: Direct text extraction from PDFs loses layout structure, table relationships, and reading order.

Incomplete retrieval recall: Relying solely on vector retrieval fails to support exact matching (e.g., model numbers, contract IDs).

Insufficient ranking precision: A single ranking model struggles to balance semantic relevance and keyword matching.

Lack of domain constraints: The large model may generate answers that violate business rules.

RAG 2.0 systematically upgrades the architecture by introducing modular design, splitting indexing, pre-retrieval processing, post-retrieval processing, and generation into independent pluggable components. Additionally, it incorporates ontology and caching mechanisms to make the system more controllable and efficient in complex business scenarios.

2. Document Parsing: From "Text Recognition" to "Structure Restoration"

Document parsing is the cornerstone of RAG systems. Enterprise documents come in various types, including PDFs, Word files, scanned documents, and engineering drawings. Improper parsing will undermine the semantic foundation for subsequent retrieval and generation.

2.1 Layout Structure Parsing

For PDFs and scanned documents, it is necessary to restore layout elements (titles, paragraphs, tables, images, headers/footers) and their reading order. Common open-source tools include RAGFlow's DeepDoc, PaddleOCR, etc. The core steps include:

Image preprocessing: Deskew, denoise, and enhance contrast for scanned documents.

Layout analysis: Use object detection or segmentation models to identify areas such as titles, body text, tables, and images.

Reading order restoration: Determine the sequence of elements based on their positions and logical relationships.

Table structure recognition: Restore complex structures like multi-level headers, merged cells, and cross-page tables.
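
As a rough illustration of the reading order restoration step, the sketch below sorts detected layout blocks top to bottom and, within a row, left to right. The `Block` type, the `row_tolerance` parameter, and the single-column assumption are illustrative and are not taken from DeepDoc or PaddleOCR; multi-column pages would need column detection first.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str   # e.g. "title", "paragraph", "table"
    x: float    # left edge of the bounding box
    y: float    # top edge of the bounding box (0 = top of page)
    text: str


def reading_order(blocks: list[Block], row_tolerance: float = 10.0) -> list[Block]:
    """Group blocks into rows by vertical position, then read each row left to right."""
    rows: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.y):
        # A block joins the current row if its top edge is close to the row's first block.
        if rows and abs(block.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [b for row in rows for b in sorted(row, key=lambda b: b.x)]


blocks = [
    Block("paragraph", x=300, y=102, text="right cell"),
    Block("title", x=50, y=20, text="Title"),
    Block("paragraph", x=50, y=100, text="left cell"),
]
ordered = [b.text for b in reading_order(blocks)]
print(ordered)
```

Position-only sorting is the simplest baseline; the "logical relationships" mentioned above (captions, footnotes, cross-page tables) need additional rules or a learned model.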

2.2 Engineering Drawing Parsing

In manufacturing, engineering drawings contain critical information such as title blocks, drawing numbers, versions, materials, and technical requirements. Parsing requires:

Locating the title block area and extracting attributes like drawing number, version, and name.

Identifying annotation descriptions and technical requirement text.

Establishing a mapping between drawing elements and their coordinates to support subsequent traceability.

2.3 Knowledge Chunking Strategy

Chunking is a key step after document parsing. Chunks that are too short lose context, while overly long chunks introduce noise. A two-stage chunking approach is recommended:

Structural segmentation: Divide by logical units such as titles, paragraphs, and tables, preserving hierarchical relationships.

Length-based segmentation: Apply sliding window segmentation to long texts, with a window size of 256-512 tokens and 10%-20% overlap.
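
The two stages can be sketched as follows. This is a minimal example under stated assumptions: blank lines stand in for structural boundaries, whitespace-separated words stand in for model tokens, and the function names are hypothetical. A real system would use the parser's structural output and the embedding model's tokenizer.

```python
def sliding_window(tokens: list[str], window: int = 512,
                   overlap_ratio: float = 0.15) -> list[list[str]]:
    """Split a token list into overlapping windows."""
    if len(tokens) <= window:
        return [tokens]
    step = max(1, int(window * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def two_stage_chunk(document: str, window: int = 512,
                    overlap_ratio: float = 0.15) -> list[str]:
    """Stage 1: split on blank lines (structural units). Stage 2: window long units."""
    chunks = []
    for unit in document.split("\n\n"):
        tokens = unit.split()  # whitespace tokens stand in for model tokens
        if not tokens:
            continue
        chunks.extend(" ".join(w) for w in sliding_window(tokens, window, overlap_ratio))
    return chunks


doc = "Short heading\n\n" + " ".join(f"w{i}" for i in range(25))
chunks = two_stage_chunk(doc, window=10, overlap_ratio=0.2)
for chunk in chunks:
    print(chunk)
```

With a window of 10 and 20% overlap, consecutive windows share two tokens, so a sentence cut at a window boundary still appears intact in one of the two neighboring chunks.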

After chunking, two types of indexes should be generated:

Text index: For full-text search (e.g., Elasticsearch BM25).

Vector index: Generate vectors using an embedding model and store them in a vector database (e.g., Milvus, Qdrant).

3. Query Rewriting: Resolving Anaphora and Missing Information in Multi-Turn Dialogues

In multi-turn dialogues, subsequent user queries often depend on the preceding context. For example, in "What is its price?", the word "its" refers to a previously mentioned product. If the query is sent to retrieval without rewriting, the retriever has no entity to match and is likely to return irrelevant results.

3.1 Anaphora Resolution and Information Completion

Model multi-turn query rewriting as a relation extraction task:

Identify anaphoric expressions (e.g., "it", "this").

Find the referred entity from the conversation history.

Replace the anaphoric expression with the entity name and supplement missing context.

For example:

User: "What is the power of Model A?"

System: "The power is 100W."

User: "What about its voltage?" → Rewritten as: "What is the voltage of Model A?"

3.2 Rewriting Model Selection

Lightweight sequence labeling models (e.g., TPLinker) or small-parameter LLMs (e.g., Qwen2.5-7B) can be used for rewriting. For production environments, a hybrid rule+model strategy is recommended: rules handle common patterns, while models handle complex anaphora.
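
The rule side of such a hybrid strategy might look like the sketch below. It is only an illustration: the pronoun list, the `last_entity` argument (assumed to be tracked elsewhere from the conversation history), and the `model_rewrite` callback are all hypothetical, and the model fallback is a plain callable rather than a real TPLinker or Qwen integration.

```python
import re
from typing import Callable, Optional

# Deliberately small pronoun list; real rules would be domain-specific.
PRONOUNS = re.compile(r"\b(its|it|this)\b", re.IGNORECASE)


def rewrite_query(query: str, last_entity: Optional[str],
                  model_rewrite: Optional[Callable[[str], str]] = None) -> str:
    """Replace the first pronoun with the last mentioned entity, else defer to a model."""
    match = PRONOUNS.search(query)
    if match is None:
        return query  # nothing to resolve
    if last_entity is None:
        # The rules have no entity to substitute: hand over to the model rewriter.
        return model_rewrite(query) if model_rewrite else query
    pronoun = match.group(0).lower()
    replacement = f"{last_entity}'s" if pronoun == "its" else last_entity
    return query[:match.start()] + replacement + query[match.end():]


print(rewrite_query("What about its voltage?", last_entity="Model A"))
# prints: What about Model A's voltage?
```

The rule output is less polished than a model rewrite ("What is the voltage of Model A?"), but it already contains the entity name that retrieval needs.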

4. Hybrid Retrieval: Vector + Full-Text + Knowledge Graph

No single retrieval method covers all scenarios. Hybrid retrieval improves recall and precision by fusing multiple retrieval results.

4.1 Vector Retrieval

Advantages: Retrieves by semantic similarity and supports cross-lingual and multimodal retrieval.

Common models: BGE-M3, BCE, GTE, M3E.

Use cases: Open-domain QA, synonym matching.

4.2 Full-Text Retrieval

Advantages: Exact keyword matching, strong interpretability.

Common algorithm: BM25.

Use cases: Queries for model numbers, IDs, proper nouns.

4.3 Knowledge Graph Retrieval

For domains with dense entity relationships (e.g., finance, healthcare), a knowledge graph can be introduced. It provides structured knowledge through graph traversal and relation path matching.

Hybrid Retrieval Pipeline:

The user query undergoes vector retrieval, full-text retrieval, and knowledge graph retrieval, each returning Top-N results.

Use Reciprocal Rank Fusion (RRF) or weighted fusion to combine results.

Feed the merged candidate set into the ranking module.
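
For the weighted fusion option, raw scores from different retrievers are not on the same scale (cosine similarity versus BM25), so they are usually normalized first. The sketch below assumes min-max normalization and illustrative weights of 0.7 and 0.3; the function names and sample scores are hypothetical, and the weights should be tuned on real data.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so different retrievers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fuse(vector_scores: dict[str, float], text_scores: dict[str, float],
                  w_vector: float = 0.7, w_text: float = 0.3) -> list[tuple[str, float]]:
    """Combine normalized scores; a document missing from one list contributes 0 there."""
    v, t = min_max(vector_scores), min_max(text_scores)
    fused = {
        doc_id: w_vector * v.get(doc_id, 0.0) + w_text * t.get(doc_id, 0.0)
        for doc_id in set(v) | set(t)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


vector_scores = {"d1": 0.9, "d2": 0.5, "d3": 0.1}   # made-up cosine similarities
text_scores = {"d3": 12.0, "d1": 6.0, "d4": 0.0}    # made-up BM25 scores
ranking = weighted_fuse(vector_scores, text_scores)
print([doc_id for doc_id, _ in ranking])
```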

5. Ranking Optimization: From Coarse to Fine Ranking

The candidate set returned by the retrieval stage typically contains dozens to hundreds of fragments, requiring further ranking to select the most relevant Top-K fragments.

5.1 Coarse Ranking: RRF Fusion

RRF (Reciprocal Rank Fusion) fuses multiple result lists based on rank rather than score, using the formula:

$$\text{score}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$$

where $R$ is the set of retrieval paths, $\text{rank}_r(d)$ is the rank of document $d$ in path $r$, and $k$ is a smoothing parameter (typically 60).
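
The formula translates almost directly into code. The following is a minimal sketch with hypothetical document IDs; each input list is assumed to be ordered best first, and ranks start at 1.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every list in which a document appears."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]
fused = rrf_fuse([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])
# prints: ['d1', 'd3', 'd2', 'd4']
```

Here d1 scores 1/61 + 1/62 and d3 scores 1/63 + 1/61, so documents found by both retrievers rise above those found by only one, without any score normalization.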

5.2 Fine Ranking: Cross-Encoder

After coarse ranking, retain the Top-20 or so and use a cross-encoder (e.g., BGE-Reranker) for fine-grained ranking. The cross-encoder concatenates the query and document, feeds them into a Transformer, and computes a relevance score, achieving higher precision than bi-encoders.
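
The coarse-to-fine handoff can be sketched as below. The `score_pair` callable is where a cross-encoder such as BGE-Reranker would be plugged in; the `token_overlap` function is only a toy stand-in so the example runs without a model, and the sample texts are invented.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float],
           keep: int = 20, top_k: int = 5) -> list[str]:
    """Keep the first `keep` coarse-ranked candidates, then reorder them by pair score."""
    shortlist = candidates[:keep]
    scored = [(score_pair(query, doc), doc) for doc in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer: fraction of query tokens found in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


candidates = [
    "rated power of model a is 100w",
    "model b user manual",
    "voltage of model a is listed in the datasheet",
]
top = rerank("voltage of model a", candidates, token_overlap, top_k=2)
print(top)
```

Because the scorer sees the query and document together, it is called once per candidate at query time, which is why the shortlist is kept small.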

5.3 Knowledge Filtering

After fine ranking, business rule filtering can be applied, for example:

Exclude outdated document versions.

Filter out fragments that the current user does not have permission to access.

Ensure output complies with industry standards (e.g., financial compliance requirements).
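
The first two rules can be sketched as a simple post-ranking filter. The `Fragment` fields and role names are hypothetical. The sketch also determines the "latest version" only from the candidates at hand, whereas a production system would more likely look it up in the metadata store.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    doc_id: str
    version: int
    allowed_roles: frozenset[str]
    text: str


def filter_fragments(fragments: list[Fragment], user_roles: set[str]) -> list[Fragment]:
    """Keep fragments from the newest seen version of each document that the user may read."""
    latest: dict[str, int] = {}
    for f in fragments:
        latest[f.doc_id] = max(latest.get(f.doc_id, f.version), f.version)
    return [
        f for f in fragments
        if f.version == latest[f.doc_id] and (f.allowed_roles & user_roles)
    ]


fragments = [
    Fragment("spec", 1, frozenset({"engineer"}), "old tolerance"),
    Fragment("spec", 2, frozenset({"engineer"}), "new tolerance"),
    Fragment("contract", 1, frozenset({"legal"}), "confidential clause"),
]
kept = filter_fragments(fragments, user_roles={"engineer"})
print([f.text for f in kept])
# prints: ['new tolerance']
```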

6. Ontology Constraints: Making RAG Output More Controllable

Generic RAG systems lack domain knowledge constraints and may output answers that violate business logic. Introducing an ontology as a semantic foundation can effectively improve output accuracy and interpretability.

6.1 Ontology Modeling

An ontology is a formal description of concepts, relationships, and rules in a business domain. In enterprise RAG, an ontology typically includes:

Entities: e.g., Product, Order, Customer.

Relations: e.g., "Product belongs to Category", "Order contains Product".

Events: e.g., "Order status change".

Actions: e.g., "Create work order", "Modify alert".

Logic: Business rules, e.g., "Orders over 100,000 require approval".

6.2 Ways to Integrate Ontology with RAG

Pre-retrieval: Use the ontology to semantically expand the user query, e.g., expand "laptop" to "laptop (electronic device)".

Post-retrieval: Apply constraint checks on retrieval results, e.g., ensure returned contract clauses are from a valid version.

Post-generation: Validate the LLM output against rules; if it violates ontology constraints, trigger a retry or flag it.

6.3 Practical Example

In a financial scenario, the ontology can define matching rules between "customer risk level" and "investment product risk level". When a user asks "Recommend financial products", the system retrieves relevant products and then uses ontology validation to ensure only risk-matched products are recommended, avoiding compliance risks.
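
A minimal sketch of such a validation step is shown below. The three risk levels, the rule that a product's risk must not exceed the customer's level, and the product records are all assumptions made for illustration; real suitability rules come from the compliance team and the ontology itself.

```python
RISK_ORDER = ["low", "medium", "high"]  # hypothetical levels, lowest risk first


def is_risk_matched(customer_level: str, product_level: str) -> bool:
    """Assumed rule: a product's risk must not exceed the customer's risk level."""
    return RISK_ORDER.index(product_level) <= RISK_ORDER.index(customer_level)


def validate_recommendations(customer_level: str,
                             products: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split retrieved products into those that pass the rule and those that do not."""
    allowed = [p for p in products if is_risk_matched(customer_level, p["risk"])]
    rejected = [p for p in products if not is_risk_matched(customer_level, p["risk"])]
    return allowed, rejected


retrieved = [
    {"name": "Money market fund", "risk": "low"},
    {"name": "Balanced fund", "risk": "medium"},
    {"name": "Equity derivatives", "risk": "high"},
]
allowed, rejected = validate_recommendations("medium", retrieved)
print([p["name"] for p in allowed])
# prints: ['Money market fund', 'Balanced fund']
```

Keeping the rejected list makes it possible to log why a product was withheld, which supports the interpretability goal mentioned earlier.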

7. Cache Optimization: Reducing Latency and Cost

For high-frequency repeated queries, caching can significantly reduce the number of retrieval and LLM calls.

7.1 Semantic Cache

Traditional caching relies on exact matching, but user queries may have synonymous rewrites. Semantic cache uses vector similarity to determine whether a query matches an existing cache entry; if the similarity exceeds a threshold, the cached result is returned directly.
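
The lookup logic can be sketched with plain lists of floats. The embeddings below are made-up two-dimensional vectors, the class name is hypothetical, and the linear scan stands in for the vector index that a real cache would use.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (query embedding, cached answer)

    def put(self, query_vec: list[float], answer: str) -> None:
        self.entries.append((query_vec, answer))

    def get(self, query_vec: list[float]) -> Optional[str]:
        """Return the answer of the most similar cached query, if it clears the threshold."""
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "The power is 100W.")
print(cache.get([0.98, 0.2]))   # similar query: cache hit
print(cache.get([0.0, 1.0]))    # unrelated query: None
```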

7.2 Cache Strategies

LRU eviction: Remove the least recently used cache entries.

TTL expiration: Set a time-to-live for cache entries to ensure knowledge freshness.

Hierarchical caching: Store hot data in memory (e.g., Redis) and cold data on disk.
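
LRU eviction and TTL expiration combine naturally in one structure. The sketch below is an in-process illustration built on `collections.OrderedDict`; the class name and default limits are hypothetical, and in practice a store such as Redis would provide both behaviors.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    def __init__(self, max_entries: int = 1000, ttl_seconds: float = 3600.0,
                 clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.clock = clock              # injectable so expiry can be tested
        self._data = OrderedDict()      # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]         # TTL expired: drop the stale entry
            return None
        self._data.move_to_end(key)     # mark as most recently used
        return value

    def put(self, key, value) -> None:
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = LruTtlCache(max_entries=2, ttl_seconds=60)
cache.put("q1", "answer 1")
cache.put("q2", "answer 2")
cache.get("q1")              # q1 becomes the most recently used entry
cache.put("q3", "answer 3")  # evicts q2
print(cache.get("q2"))
# prints: None
```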

7.3 Cache Invalidation

When the knowledge base content is updated, related caches must be cleared. Ontology relationships can be used to track affected queries for precise invalidation.
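
One simple way to make invalidation precise is to remember which source documents each cached answer was built from. The sketch below uses a plain document-to-query mapping, which is a simplification of the ontology-based tracking described above; the class and key names are hypothetical.

```python
from collections import defaultdict


class InvalidationIndex:
    """Tracks which cached queries were answered from which documents."""

    def __init__(self):
        self._queries_by_doc: dict[str, set[str]] = defaultdict(set)

    def record(self, query_key: str, source_doc_ids: list[str]) -> None:
        for doc_id in source_doc_ids:
            self._queries_by_doc[doc_id].add(query_key)

    def affected_queries(self, updated_doc_id: str) -> set[str]:
        """Return (and forget) the cache keys to clear when a document changes."""
        return self._queries_by_doc.pop(updated_doc_id, set())


index = InvalidationIndex()
index.record("power of model a", ["spec-a"])
index.record("voltage of model a", ["spec-a", "datasheet-a"])
index.record("warranty terms", ["contract-1"])
to_clear = index.affected_queries("spec-a")
print(sorted(to_clear))
# prints: ['power of model a', 'voltage of model a']
```

An ontology extends this idea: when an entity changes, queries tied to related entities can be invalidated as well, not only those tied to the edited document.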

8. System Architecture and Engineering Practice

8.1 Layered Architecture

A typical enterprise-grade RAG 2.0 system adopts a layered design:

| Layer | Components | Responsibilities |
| --- | --- | --- |
| Algorithm Layer | OCR, tokenizer, vector models, ranking models | Provide basic algorithmic capabilities |
| Pipeline Layer | Offline ingestion, online Q&A | Orchestrate parsing, retrieval, ranking, generation pipelines |
| Data Layer | Vector DB, ES, MySQL | Store indexes and metadata |
| Management Layer | Knowledge base management, model management, rule configuration | Provide user configuration interfaces |

8.2 Offline and Online Separation

Offline pipeline: Document parsing → Chunking → Generate vector and text indexes → Write to storage.

Online pipeline: User query → Query rewriting → Hybrid retrieval → Ranking → Ontology validation → LLM generation → Return result.

8.3 Observability

Record recall rate, ranking precision, and response time for each retrieval.

Manually annotate LLM outputs to continuously optimize ranking models and prompts.

Use distributed tracing (e.g., OpenTelemetry) to identify performance bottlenecks.
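
The retrieval-quality numbers recorded here can be computed per query from a labeled set of relevant documents. The sketch below assumes binary relevance labels and hypothetical document IDs; MRR is the mean of `reciprocal_rank` over all evaluated queries.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary gains: discounted hits divided by the ideal ordering's score."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))   # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5
print(round(ndcg_at_k(retrieved, relevant, k=4), 3))
```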

9. Conclusion

Enterprise-grade RAG 2.0 systems require systematic optimization across multiple dimensions, including document parsing, query rewriting, hybrid retrieval, ranking optimization, ontology constraints, and caching. The practical methods introduced in this article have been deployed in industries such as manufacturing and finance, significantly improving retrieval accuracy and system response efficiency.

For a deeper understanding of these techniques, refer to "RAG System Evaluation and Optimization" and "AI Agent and Multi-Agent".

FAQ

1. How to choose document chunk size? Chunk size should balance document type and retrieval scenario. Generally, 256-512 tokens with 10%-20% overlap is recommended. For tables or code snippets, smaller chunks may be used to preserve structural integrity.

2. How to set weights for vector retrieval and full-text retrieval in hybrid retrieval? Weights can be determined via grid search or Bayesian optimization on a labeled evaluation set. A common practice is to use RRF fusion, which does not require explicit weights; if weighting is needed, a vector retrieval weight of 0.6-0.7 and a full-text retrieval weight of 0.3-0.4 are a reasonable starting point.

3. How to handle conflicts between ontology constraints and LLM output? When LLM output conflicts with hard ontology constraints, ontology rules should take precedence, triggering a retry or returning an error message. For soft constraints (e.g., business conventions), the output can be marked as "unverified" with a user prompt.

4. How to evaluate the effectiveness of a RAG system? Common metrics include retrieval recall (Recall@K), ranking precision (MRR, NDCG), and generation accuracy (based on human or LLM evaluation). It is recommended to establish online A/B testing pipelines for continuous iteration.

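5. What to do if cache hit rate is low? Analyze query patterns and normalize high-frequency queries into templates, or lower the semantic cache similarity threshold (e.g., from 0.9 to 0.8). A lower threshold raises the hit rate but also the risk of false positives, where a cached answer is returned for a query that only looks similar.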

6. How to ensure data security? Document parsing and retrieval should be performed within the intranet, with role-based access control (RBAC) governing access permissions. Sensitive content should be masked (desensitized) before it leaves the intranet, and raw data should not be transmitted during LLM calls.
