
Dify Enterprise Private Knowledge Base Complete Setup Guide: RAG Configuration & Best Practices (2026)

From Deployment to Optimization: Build an Enterprise-Grade RAG Knowledge Base Q&A System Step by Step

Direct Answer

Dify Knowledge Base Best Configuration (Quick Reference):

  • Chunk Size: 500–800 Tokens (Chinese docs: 400–600, English: 600–800)
  • Overlap Ratio: 10–15% (to avoid cutting critical information)
  • Embedding Model: BGE-M3 for Chinese (open-source, free); OpenAI text-embedding-3-large for English/multilingual
  • Retrieval Strategy: Hybrid search (vector semantic + BM25 keyword) works best
  • Top K: 3–5 results (too many dilute the context; too few miss information)
  • Similarity Threshold: 0.5–0.6 (adjust based on business scenario)

    Why Do Enterprises Need a Private Knowledge Base?

    To turn a generic AI into one that "understands your company's business," you need:

  • Domain Knowledge: Product manuals, FAQs, internal policies, historical cases
  • Data Security: Customer info and financial data cannot be sent to third-party AI services
  • Real-Time Updates: When company documents change, the AI's answers must reflect the change

    RAG (Retrieval-Augmented Generation) is currently the most mature solution: first retrieve relevant content from the knowledge base, then let the LLM generate answers based on the retrieved results.
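The retrieve-then-generate flow can be illustrated with a toy sketch. This is not how Dify implements retrieval: the word-overlap scoring is only a stand-in for embedding search, and `tokenize`, `retrieve`, and `build_prompt` are hypothetical names chosen for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by shared words with the query (stand-in for vector search)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(chunk)), chunk) for chunk in chunks]
    # Keep only chunks that share at least one word, best match first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the prompt that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"


knowledge_base = [
    "Refunds are processed within 7 business days.",
    "The warranty period for all devices is 24 months.",
    "Support is available on weekdays from 9:00 to 18:00.",
]
question = "How long is the warranty period?"
prompt = build_prompt(question, retrieve(question, knowledge_base))
print(prompt)
```

The LLM only sees the retrieved chunks, which is why the chunking and retrieval settings discussed in this guide matter so much.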


    Dify Private Deployment (Docker, 30 Minutes)

    bash
    git clone https://github.com/langgenius/dify.git
    cd dify/docker
    cp .env.example .env

    # Modify SECRET_KEY and INIT_PASSWORD in .env

    docker compose up -d

    # Visit http://localhost
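One way to produce a strong random value for SECRET_KEY is Python's standard `secrets` module. This is a minimal sketch: the 42-byte length is an assumption rather than a documented Dify requirement, and `generate_secret_key` is a hypothetical helper name.

```python
import secrets


def generate_secret_key(num_bytes: int = 42) -> str:
    """Return a random URL-safe string that can be pasted into .env."""
    return secrets.token_urlsafe(num_bytes)


# Print a line in .env format; copy it over the placeholder value.
print(f"SECRET_KEY={generate_secret_key()}")
```

Each run prints a different key, so generate it once and keep it stable across restarts.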

    Configure Embedding Model

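    Option A: OpenAI (Simplest). Admin Panel → Settings → Model Providers → OpenAI → Enter API Key. Recommended: text-embedding-3-large. Note that document text is sent to OpenAI for embedding, so this option does not suit data that must stay on your own servers.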

    Option B: BGE-M3 Local (Free, Best for Chinese)

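    bash
    ollama pull bge-m3

    # In Dify, set the Ollama embedding endpoint to http://localhost:11434.
    # If Dify runs in Docker, localhost points at the container itself; use
    # http://host.docker.internal:11434 or the host machine's LAN IP instead.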
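Before wiring the endpoint into Dify, it can help to confirm that Ollama actually returns embeddings. The sketch below assumes Ollama's `/api/embed` route with `model` and `input` request fields and a server on the default port; `build_embed_request` and `embed` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address


def build_embed_request(text: str, model: str = "bge-m3") -> urllib.request.Request:
    """Build a POST request for Ollama's embedding endpoint."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str) -> list[float]:
    """Send the request and return the first embedding vector."""
    with urllib.request.urlopen(build_embed_request(text), timeout=30) as response:
        return json.loads(response.read())["embeddings"][0]
```

With Ollama running, calling `embed("How do I reset my password?")` should return a list of floats. A connection error here usually means the endpoint URL is wrong, which is worth ruling out before debugging Dify itself.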


    Document Preprocessing Best Practices

    Format Priority (Best to Worst): Markdown > PDF (selectable text) > Word (.docx) > Web URL > Scanned PDF

    bash
    # Convert a PDF to Markdown using markitdown
    # (recent releases may need the PDF extra: pip install 'markitdown[pdf]')
    pip install markitdown
    markitdown company_handbook.pdf > handbook.md

    Must Clean: Headers/footers, page numbers, repeated disclaimers, sentences broken by line breaks (common in PDF extraction)
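Part of this cleanup can be automated. The following is a heuristic sketch, not a complete cleaner: it assumes page numbers appear on their own line, treats a single line break after a non-terminal character as a soft wrap, and uses the hypothetical name `clean_extracted_text`. Headings without end punctuation that are not followed by a blank line would be merged into the next line, so review the output.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Remove page-number lines and rejoin sentences split by line breaks."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Drop lines that are only a page number, e.g. "12" or "Page 12 of 40".
        if re.fullmatch(r"(?:page\s+)?\d+(?:\s+of\s+\d+)?", stripped, flags=re.IGNORECASE):
            continue
        kept.append(stripped)
    text = "\n".join(kept)
    # A single newline that does not follow sentence-ending punctuation
    # is treated as a soft wrap and replaced with a space.
    text = re.sub(r"(?<![.!?:\n])\n(?!\n)", " ", text)
    # Collapse the extra blank lines left behind by removed lines.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


raw = "Employees may request\nleave at any time.\n\n12\n\nRequests are reviewed\nwithin 3 days."
print(clean_extracted_text(raw))
```

Repeated headers, footers, and disclaimers vary too much between documents for a generic rule; a per-document list of exact strings to delete is usually the safer approach.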


    Chunk Parameter Tuning (Most Critical)

    | Scenario | Recommended Chunk Size |
    |---|---|
    | FAQ / Q&A pairs | 200–300 Tokens |
    | Product manuals / Technical docs | 500–700 Tokens |
    | Legal contracts / Process specs | 600–900 Tokens |
    | Long reports / Research papers | 700–1000 Tokens |

    Overlap Setting: about 10–15% of the chunk size (for example, 100–150 Tokens on a 1000-Token chunk), so that key information is not cut at chunk boundaries.
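To make the interaction between chunk size and overlap concrete, here is a minimal sliding-window sketch. It is an illustration rather than Dify's chunker: `chunk_tokens` is a hypothetical name, and the placeholder token list stands in for the output of a real tokenizer.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 600, overlap: int = 90) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Placeholder tokens; a real pipeline would tokenize the document text.
tokens = [f"t{i}" for i in range(1500)]
chunks = chunk_tokens(tokens, chunk_size=600, overlap=90)
print([len(c) for c in chunks])
```

Here a 90-token overlap on 600-token chunks is 15%: the last 90 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still complete in at least one chunk.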


    Retrieval Strategy: Hybrid Search (Recommended)

    | Retrieval Method | Advantage | Suitable Scenario |
    |---|---|---|
    | Pure vector search | Good semantic understanding, synonym matching | Fuzzy questions, concept queries |
    | Pure keyword (BM25) | Fast exact match | Proper nouns, numeric queries |
    | Hybrid search | Balances both | Most scenarios (recommended) |

    Recommended configuration: Vector weight 0.6, BM25 weight 0.4, enable BGE Reranker v2-m3 for re-ranking.

    Reranking is often the single most impactful optimization for accuracy (typically +15–25%): first recall 20 candidates, then let the reranker keep the top 4.
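The 0.6 / 0.4 weighting can be sketched as a weighted sum of normalized scores. This is an illustration under stated assumptions, not Dify's internal scoring: min-max normalization is one common choice, the function names are hypothetical, and the sample scores are made up. A real reranker such as BGE Reranker v2-m3 is a separate model that rescores the recalled candidates; it is not shown here.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1] so vector and BM25 values are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def hybrid_rank(vector_scores, bm25_scores, vector_weight=0.6, bm25_weight=0.4, top_n=4):
    """Combine normalized scores with fixed weights and return the best top_n IDs."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    combined = {
        doc: vector_weight * v.get(doc, 0.0) + bm25_weight * b.get(doc, 0.0)
        for doc in set(v) | set(b)
    }
    # Sort by descending score; break ties by document ID for stable output.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))[:top_n]


vector_scores = {"doc_a": 0.82, "doc_b": 0.75, "doc_c": 0.40}  # made-up cosine scores
bm25_scores = {"doc_b": 12.0, "doc_c": 9.0, "doc_d": 3.0}      # made-up BM25 scores
print(hybrid_rank(vector_scores, bm25_scores, top_n=3))
```

In this sample, doc_b ranks first because it scores well on both signals, even though doc_a has the highest vector score alone.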


    Common Issue Diagnosis

    | Problem | Cause | Solution |
    |---|---|---|
    | AI says "I don't know" but the answer exists in the knowledge base | Similarity threshold too high / chunk cut key info | Lower threshold to 0.45, check chunk boundaries |
    | Numbers/proper nouns are inaccurate | Pure vector search is weak on exact character matching | Enable hybrid search, increase BM25 weight |
    | Answers unchanged after updating documents | Old vector cache not invalidated | Re-index the document in Dify admin |
    | Answer is correct but cites wrong document | Top K too large, LLM confuses sources | Reduce Top K from 5 to 3 |
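The first row of this diagnosis is easy to reproduce in isolation. The sketch below uses made-up similarity scores and a hypothetical `select_chunks` helper to show how a threshold of 0.5 can discard the chunk that holds the answer, while 0.45 keeps it.

```python
def select_chunks(scored_chunks, threshold=0.5, top_k=3):
    """Keep chunks scoring at or above threshold, then return the best top_k."""
    passing = [(chunk, score) for chunk, score in scored_chunks if score >= threshold]
    passing.sort(key=lambda item: item[1], reverse=True)
    return passing[:top_k]


results = [
    ("Refund policy overview", 0.62),
    ("Refund deadline: 7 business days", 0.48),  # the chunk that holds the answer
    ("Shipping policy", 0.31),
]
print(select_chunks(results, threshold=0.5))   # the answer chunk is filtered out
print(select_chunks(results, threshold=0.45))  # the answer chunk is retained
```

Inspecting the scores of chunks that should have matched is usually more informative than lowering the threshold blindly.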


    Production-Grade Optimization: Knowledge Base Layered Architecture

    
    Knowledge Base A: High-frequency FAQ (500 Q&A pairs, fast exact match)
    Knowledge Base B: Product manuals (fine-grained chunks, hybrid search)
    Knowledge Base C: Historical cases (time-based partitioning, periodic archiving)

    Query routing rules:

  • Contains "how to" or "how do I" → Search FAQ first
  • Contains product name → Search product manuals
  • Anything else → Search all knowledge bases

    Continuous improvement: each week, review "unanswered queries" (questions users asked that the AI could not answer) and add the missing content to the knowledge base.
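The query routing rules above can be sketched as a small function. The product names and the returned knowledge base labels are hypothetical placeholders; in Dify this logic would typically live in a workflow branch rather than in standalone code.

```python
PRODUCT_NAMES = {"acmecloud", "acmepay"}  # hypothetical product names


def route_query(query: str) -> str:
    """Return which knowledge base to search first for a query."""
    text = query.lower()
    # Rule 1: how-to questions go to the FAQ knowledge base.
    if "how to" in text or "how do i" in text:
        return "faq"
    # Rule 2: queries that mention a product go to the product manuals.
    if any(name in text for name in PRODUCT_NAMES):
        return "product_manuals"
    # Rule 3: everything else searches all knowledge bases.
    return "all"


print(route_query("How do I reset my password?"))
print(route_query("AcmePay fee schedule"))
print(route_query("Past outage cases"))
```

Rules are checked in order, so a query such as "How to configure AcmeCloud" goes to the FAQ first. Swap the order if product manuals should take priority.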


    FAQ

    Q: What's the difference between a Dify knowledge base and ChatGPT's custom GPTs?

    A: Dify can be deployed privately, so data stays on your own servers (provided the models you connect are also hosted locally). It supports batch document management and fine-grained retrieval configuration, and it can run open-source models (no OpenAI cost). Custom GPTs are a cloud service: simple to set up, but with limited data privacy and no bulk document management.

    Q: What scale can the knowledge base handle?

    A: Dify uses Weaviate by default, which can store millions of vectors. For enterprise use, Qdrant (Rust-based, better performance) is recommended.

    Q: Does it support image/table understanding?

    A: Dify v0.7+ supports image OCR extraction. For tables, convert them to Markdown before uploading; this gives much better results than uploading Excel files directly.


    Related Resources

  • RAG Pitfall Guide: aiskillnav.com/tutorials/rag-knowledge-base-best-practices
  • Vector Database Selection: aiskillnav.com/tutorials/vector-database-comparison-pinecone-weaviate-chroma-2026
  • n8n Workflow Automation: aiskillnav.com/tutorials/n8n-mcp-server-integration-guide-2026