DeepSeek-R1 Local Deployment Complete Guide: Run a Top-Tier Reasoning Model at Zero Cost
Run DeepSeek-R1 on Mac/Linux/Windows with Ollama, fully offline, data stays local
DeepSeek-R1 Local Deployment Complete Guide
Supported Systems: macOS 12+, Ubuntu 20.04+, Windows 11 (WSL2)
Minimum Requirements: 8GB RAM (for 7B version); 32GB RAM recommended (for 70B version)
Why Choose DeepSeek-R1?
In January 2025, the release of DeepSeek-R1 shook the entire AI industry:
Local Deployment vs Cloud API
Step 1: Install Ollama
Ollama is the simplest tool for running local large models, supporting 100+ open-source models.
macOS
bash
Method 1: Download installer from website (recommended for beginners)
Visit https://ollama.ai to download the .dmg file, double-click to install
Method 2: Install via command line
brew install ollama
Linux
bash
curl -fsSL https://ollama.ai/install.sh | sh
Windows (WSL2)
bash
First install WSL2, then run in the WSL terminal:
curl -fsSL https://ollama.ai/install.sh | sh
Verify installation:
bash
ollama --version
Output similar to: ollama version 0.5.x
Step 2: Choose the Right DeepSeek-R1 Version
Choose a version based on your RAM/VRAM:
Recommendations:
deepseek-r1:14bdeepseek-r1:32bdeepseek-r1:70bStep 3: Download and Run the Model
bash
Download and run directly (first time requires download, takes a few minutes to tens of minutes)
ollama run deepseek-r1:14bOr download first, then run
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b
Once successful, you'll see a command-line interactive interface:
>>> Write a quick sort algorithm for me
The user wants a quick sort algorithm...
Here is a Python implementation of quick sort:
def quicksort(arr):
if len(arr) <= 1:
return arr
pivot = arr[len(arr) // 2]
...
💡 R1's signature: You'll see ... tags, which are the model's "chain of thought" during reasoning. DeepSeek-R1 displays its reasoning process.Exit the conversation: type /bye or press Ctrl+D
Step 4: Common Management Commands
bash
List downloaded models
ollama listDelete a model (free up disk space)
ollama rm deepseek-r1:1.5bRun Ollama service in background (API mode)
ollama serveView running models
ollama psUpdate a model to the latest version
ollama pull deepseek-r1:14b
Step 5: Integrate with Cursor (AI Coding Assistant)
Ollama provides an OpenAI-compatible API interface that seamlessly integrates with Cursor:
http://localhost:11434/v1
- Model name: deepseek-r1:14b
- API Key: ollama (any value, no local verification)deepseek-r1:14b as the current modelStep 6: Integrate with VS Code (Continue Plugin)
Continue is the best AI coding plugin for VS Code, with native support for Ollama:
~/.continue/config.json:json
{
"models": [
{
"title": "DeepSeek-R1 14B (Local)",
"provider": "ollama",
"model": "deepseek-r1:14b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "DeepSeek-R1 1.5B (Fast)",
"provider": "ollama",
"model": "deepseek-r1:1.5b"
}
}
💡 Tip: Use the large model for conversations (14B) and the small model for Tab completion (1.5B) for faster speed and better experience.
Step 7: Use via API (Advanced)
Once Ollama is running, it provides an OpenAI-compatible API at http://localhost:11434:
python
from openai import OpenAIPoint to local Ollama
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # any string
)response = client.chat.completions.create(
model='deepseek-r1:14b',
messages=[
{'role': 'user', 'content': 'Write a binary search function in Python with complete tests'}
]
)
print(response.choices[0].message.content)
Performance Optimization Tips
Enable GPU Acceleration (if you have a dedicated GPU)
bash
Ollama automatically detects NVIDIA/AMD GPUs, no extra configuration needed
Verify GPU usage:
ollama run deepseek-r1:14b
After running, execute:
ollama ps
If you see GPU: NVIDIA GeForce..., GPU is enabled
Apple Silicon Optimization
The unified memory architecture of M-series chips is particularly friendly to LLM inference. The 14B model can reach 30-50 tokens/s, close to cloud API experience.
bash
View real-time speed
ollama run deepseek-r1:14b --verbose
Concurrent Requests
Ollama supports concurrent requests by default, suitable for building multi-user applications:
bash
Set maximum concurrency (default is 1)
OLLAMA_NUM_PARALLEL=4 ollama serve
Other Recommended Local Models
qwen2.5-coder:7bqwen2.5:14bllama3.3:70bmistral:7bnomic-embed-textFAQ
Q: What if the model download is very slow? A: You can use a mirror for acceleration (for users in China):
bash
OLLAMA_REGISTRY_URL=https://registry.ollama.ai ollama pull deepseek-r1:14b
Or manually download the GGUF file from HuggingFace mirror and import it.Q: What if I get a "context length exceeded" error? A: Reduce the context window:
bash
ollama run deepseek-r1:14b --context-length 4096
Q: How do I make Ollama start on boot?
macOS:
bash
brew services start ollama
Linux (systemd):
bash
sudo systemctl enable ollama
sudo systemctl start ollama
Q: What's the difference between DeepSeek-R1 and DeepSeek-V3? A:
reasoning process, suitable for math/code/logic problemsFor coding scenarios, R1 is recommended; for daily conversation, V3 is recommended (ollama run deepseek-v3:8b).
Also available in 中文.