MLX + Embeddings: Building Local Semantic Search on Apple Silicon

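1. Introduction: Search Is More Than Matching Keywords

Have you ever been sure you wrote a concept down in your notes, yet no keyword search could find it? Maybe you remember it as "organizing documents with a local model", while the note actually says "the RAG pipeline first retrieves context". Traditional search is good at matching strings but has little sense of meaning. Semantic search works the other way around: it first converts text into vectors, then looks for the content that is conceptually closest. Those vectors are embeddings.

Today 拍拍君 will use MLX to build a small local semantic search tool on Apple Silicon. It reads Markdown notes, splits them into chunks, builds a local vector index, and then uses a natural-language query to retrieve the relevant passages. If you have read the earlier Python MLX introduction, this article takes MLX from "it can multiply matrices" to "it actually powers a small AI feature". Everything can stay on your machine, so you never have to send private notes to an external API.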

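2. Today's Target Architecture

The flow we are building is simple:

Markdown files
  ↓
Read files and split into chunks
  ↓
Embedding model converts text to vectors
  ↓
Save as vectors.npy + metadata.json
  ↓
Convert the query to a vector as well
  ↓
Rank by cosine similarity
  ↓
Return the most relevant passages

This is the core skeleton of many RAG systems. There is no need to rush into a vector database or to start with a complex framework. Write the flow once with numpy and you will understand much better what semantic search is actually doing. The main package today is mlx-embedding-models. It provides an encode() interface similar to SentenceTransformers, but the underlying computation can run on Apple Silicon through MLX. For a personal knowledge base, document search, or a small RAG prototype, this combination is comfortable to work with.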

3. Setting Up the Project

Start with a clean project:

mkdir mlx-semantic-search
cd mlx-semantic-search
uv init
uv add mlx mlx-embedding-models numpy rich typer

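If you are not yet familiar with uv, these two articles are a good starting point:

Getting Started with Python uv
Advanced Python uv: workspace, lockfile, script, and project management

Next, prepare a small folder: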

mlx-semantic-search/
├── notes/
│   ├── python.md
│   ├── mlx.md
│   └── rag.md
└── search_notes.py

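notes/python.md:

# Python toolchain
uv manages projects, virtual environments, and lockfiles.
ruff is well suited for linting and formatting.
pytest is commonly used to write unit tests.

notes/mlx.md:

# MLX
MLX is a machine learning array framework released by Apple.
It is designed for Apple Silicon and suits local inference and prototyping on a Mac.

notes/rag.md:

# RAG
RAG first finds relevant content in a document store, then hands that content to a language model to answer the question.
Embeddings and vector search are usually the first step of a RAG pipeline.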

4. First, Confirm the Embedding Model Runs

Before writing the real index, run a minimal example. Create check_embedding.py:

from mlx_embedding_models import EmbeddingModel
model = EmbeddingModel.from_registry("bge-small")
sentences = [
    "MLX runs machine learning models on Apple Silicon.",
    "uv is a fast Python package manager.",
    "RAG retrieves relevant documents before generation.",
]
vectors = model.encode(sentences, batch_size=8, show_progress=False)
print(type(vectors))
print(vectors.shape)
print(vectors[0][:8])

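Run it:

uv run python check_embedding.py

You should see something like:

<class 'numpy.ndarray'>
(3, 384)
[ 0.0123 -0.0345  0.0088 ... ]

This means the three sentences were converted into three 384-dimensional vectors. bge-small is a good model to start with: it is small and fast, and its vectors do not take much space. If your documents are mostly Chinese or mixed Chinese and English, you can try multilingual-e5-small later.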

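5. The Intuition Behind Cosine Similarity

An embedding only turns text into a vector; you still need a way to judge how alike two vectors are. The most common choice is cosine similarity. If the vectors are already normalized to unit length, the denominator of the cosine formula is 1, so cosine similarity reduces to a dot product: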

scores = document_vectors @ query_vector

Where:

document_vectors has shape (num_chunks, dim)
query_vector has shape (dim,)
scores has shape (num_chunks,)

The higher the score, the closer the meaning. Sorting is just as direct:

top_ids = np.argsort(scores)[::-1][:5]

This is the core of today's search tool.
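To make the shortcut above concrete, here is a small sketch with hand-made 3-dimensional vectors. The numbers and the helper names `normalize_rows` and `cosine_scores` are made up for illustration and do not come from any model or library. It checks that the full cosine formula and the dot product of normalized vectors give the same scores, which is worth verifying if you are unsure whether your model's output is already normalized.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 length (hypothetical helper)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return matrix / np.clip(norms, 1e-12, None)


def cosine_scores(document_vectors: np.ndarray, query_vector: np.ndarray) -> np.ndarray:
    """Full cosine formula: dot product divided by both vector lengths."""
    dots = document_vectors @ query_vector
    doc_norms = np.linalg.norm(document_vectors, axis=1)
    query_norm = np.linalg.norm(query_vector)
    return dots / (doc_norms * query_norm)


# Toy data: three "documents" and one "query", deliberately not normalized.
document_vectors = np.array(
    [[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]], dtype="float32"
)
query_vector = np.array([2.0, 0.0, 0.0], dtype="float32")

full = cosine_scores(document_vectors, query_vector)

# Shortcut: normalize once, then a plain dot product is enough.
unit_docs = normalize_rows(document_vectors)
unit_query = normalize_rows(query_vector[None, :])[0]
shortcut = unit_docs @ unit_query

top_ids = np.argsort(shortcut)[::-1]
print(np.round(full, 3))
print(top_ids)
```

If the two score arrays ever disagree on real embeddings, the vectors were not unit length, and normalizing them once at index time restores the shortcut.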

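6. Reading Files and Splitting Chunks

Do not embed a whole document in one go, for two reasons: first, the model has a length limit; second, the search results would be too coarse. We will write a simple chunker, starting with the imports and a Chunk data class: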

from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from pathlib import Path

import numpy as np
import typer
from mlx_embedding_models import EmbeddingModel
from rich.console import Console
from rich.panel import Panel

app = typer.Typer(help="Local semantic search with MLX embeddings")
console = Console()


@dataclass
class Chunk:
    id: int
    path: str
    title: str
    text: str

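Reading files and splitting text:

def read_markdown_files(notes_dir: Path) -> list[Path]:
    # Recursively collect Markdown files in a stable, sorted order.
    return sorted(notes_dir.glob("**/*.md"))


def split_text(text: str, max_chars: int = 700, overlap: int = 120) -> list[str]:
    # Treat blank-line-separated paragraphs as the smallest unit.
    # Note: a paragraph longer than max_chars is kept whole, not split further.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        # Keep packing paragraphs while the chunk stays within max_chars.
        if len(current) + len(paragraph) + 2 <= max_chars:
            current = f"{current}\n\n{paragraph}".strip()
            continue
        if current:
            chunks.append(current)
        # Start the next chunk with the tail of the previous one as overlap.
        tail = chunks[-1][-overlap:] if overlap and chunks else ""
        current = f"{tail}\n\n{paragraph}".strip() if tail else paragraph
    if current:
        chunks.append(current)
    return chunks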

This is not the most refined chunking, but it is good enough for a small tool. A real project could add heading-aware splitting that preserves the Markdown heading hierarchy.
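As a rough idea of what heading-aware splitting could look like, the sketch below walks a Markdown string line by line and keeps a stack of the active headings. The function name `split_by_headings` and the " > " path format are assumptions made for this example. It only recognizes ATX headings (`#` to `######`) and does not special-case fenced code blocks, so a `# comment` line inside a fence would be read as a heading.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading_path, body) pairs, one per Markdown section.

    heading_path joins the active headings, e.g. "RAG > Retrieval".
    Hypothetical helper for illustration only.
    """
    sections: list[tuple[str, str]] = []
    stack: list[tuple[int, str]] = []  # (level, title) of the active headings
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(title for _, title in stack)
            sections.append((path, text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()  # close the section that ends at this heading
        level = len(match.group(1))
        # Drop headings at the same or a deeper level before pushing the new one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2).strip()))
    flush()
    return sections


sample = "# RAG\nOverview text.\n\n## Retrieval\nFind chunks.\n\n## Generation\nAnswer."
for path, text in split_by_headings(sample):
    print(f"[{path}] {text}")
```

Each section body could then go through split_text, with the heading path used in place of the file name as the chunk title.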

7. Building the Local Index

Next, wrap each chunk in a data object.

def build_chunks(notes_dir: Path) -> list[Chunk]:
    chunks: list[Chunk] = []
    for path in read_markdown_files(notes_dir):
        text = path.read_text(encoding="utf-8")
        title = path.stem
        for piece in split_text(text):
            chunks.append(
                Chunk(
                    id=len(chunks),
                    path=str(path),
                    title=title,
                    text=piece,
                )
            )
    return chunks

Then write the index command.

@app.command()
def index(
    notes_dir: Path = typer.Argument(Path("notes")),
    out_dir: Path = typer.Option(Path(".semantic-index")),
    model_name: str = typer.Option("bge-small"),
):
    """Build a local vector index."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = build_chunks(notes_dir)
    if not chunks:
        raise typer.BadParameter(f"No markdown files found in {notes_dir}")

    model = EmbeddingModel.from_registry(model_name)
    # Prepend the file title so each vector also carries document-level context.
    texts = [f"{chunk.title}\n\n{chunk.text}" for chunk in chunks]
    vectors = model.encode(texts, batch_size=32, show_progress=True).astype("float32")

    # Row i of vectors.npy corresponds to chunks[i] in metadata.json.
    np.save(out_dir / "vectors.npy", vectors)
    metadata = {"model_name": model_name, "chunks": [asdict(c) for c in chunks]}
    (out_dir / "metadata.json").write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    console.print(f"[green]Indexed {len(chunks)} chunks[/]")

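Run it:

uv run python search_notes.py index notes

When it finishes you will have:

.semantic-index/
├── metadata.json
└── vectors.npy

vectors.npy holds the vector matrix, and metadata.json holds the file name and text for each vector. The design is plain, but it holds up for a few thousand to a few tens of thousands of chunks.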
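A quick back-of-the-envelope calculation shows why a flat file copes at this scale. The sketch assumes 384-dimensional float32 vectors, matching the bge-small output shown earlier; `index_size_bytes` is a hypothetical helper, and the small .npy header is ignored.

```python
def index_size_bytes(num_chunks: int, dim: int = 384, bytes_per_value: int = 4) -> int:
    """Raw size of a dense float32 matrix; ignores the small .npy header."""
    return num_chunks * dim * bytes_per_value


for num_chunks in (1_000, 10_000, 50_000):
    size_mb = index_size_bytes(num_chunks) / 1_000_000
    print(f"{num_chunks:>6} chunks -> {size_mb:.1f} MB")
```

Under these assumptions even 50,000 chunks come to roughly 77 MB, which fits comfortably in memory on a typical Mac.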

8. Searching with a Query

Load the index:

def load_index(index_dir: Path):
    vectors = np.load(index_dir / "vectors.npy")
    metadata = json.loads((index_dir / "metadata.json").read_text(encoding="utf-8"))
    chunks = [Chunk(**item) for item in metadata["chunks"]]
    return metadata["model_name"], vectors, chunks
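Because vectors.npy and metadata.json are separate files, they can drift apart, for example when an indexing run is interrupted after only one of them was written. The following is a minimal sketch of a consistency check; `check_index` is a hypothetical helper that is not part of the original script, and it assumes the metadata layout written by the index command.

```python
from typing import Optional


def check_index(num_vectors: int, metadata: dict, expected_model: Optional[str] = None) -> None:
    """Raise ValueError if vectors and metadata do not describe the same index."""
    num_chunks = len(metadata["chunks"])
    if num_vectors != num_chunks:
        raise ValueError(
            f"vectors.npy has {num_vectors} rows but metadata.json lists {num_chunks} chunks"
        )
    if expected_model is not None and metadata["model_name"] != expected_model:
        raise ValueError(
            f"index was built with {metadata['model_name']!r}, not {expected_model!r}"
        )


# Stand-in metadata with the same keys that the index command writes.
metadata = {"model_name": "bge-small", "chunks": [{"id": 0}, {"id": 1}]}
check_index(2, metadata)  # passes silently

try:
    check_index(3, metadata)
except ValueError as error:
    print(f"rejected: {error}")
```

Inside load_index, one option would be to call it with vectors.shape[0] right after both files are read.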

The search command:

@app.command()
def search(
    query: str,
    index_dir: Path = typer.Option(Path(".semantic-index")),
    top_k: int = typer.Option(5),
):
    """Search notes by meaning, not just keywords."""
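    model_name, vectors, chunks = load_index(index_dir)
    # Reuse the model recorded in the index so query and documents share one vector space.
    model = EmbeddingModel.from_registry(model_name)
    query_vector = model.encode([query], batch_size=1, show_progress=False)[0]
    query_vector = query_vector.astype("float32")
    # The dot product equals cosine similarity only when both sides are unit length.
    scores = vectors @ query_vector
    top_ids = np.argsort(scores)[::-1][:top_k]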
    for rank, idx in enumerate(top_ids, start=1):
        chunk = chunks[int(idx)]
        score = float(scores[int(idx)])
        console.print(
            Panel(
                chunk.text,
                title=f"#{rank} score={score:.3f} — {chunk.path}",
                subtitle=chunk.title,
            )
        )
if __name__ == "__main__":
    app()

Test it:

uv run python search_notes.py search "How do I retrieve documents before asking an LLM?"

Ideally the top result is rag.md. Try another:

uv run python search_notes.py search "fast Python project manager with lock files"

This time it should find the uv content in python.md. The point is that the query does not need to use the same words as the document.
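The search command sorts every score with np.argsort, which is perfectly fine at this scale. If the index grows, one possible refinement is to select the top k first and sort only those. The sketch below uses np.argpartition for that; `top_k_indices` is a hypothetical helper and the scores are made up.

```python
import numpy as np


def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, best first (hypothetical helper)."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.empty(0, dtype=np.int64)
    # argpartition moves the k largest scores to the end without a full sort.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score.
    return candidates[np.argsort(scores[candidates])[::-1]]


scores = np.array([0.12, 0.87, 0.45, 0.91, 0.30], dtype="float32")
print(top_k_indices(scores, 3))
```

For a few thousand chunks the difference is unlikely to be noticeable, so the plain argsort version stays the simpler default.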

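9. Notes on Chinese Documents

If you mainly search Chinese notes, the choice of model matters a lot. You can try:

uv run python search_notes.py index notes --model-name multilingual-e5-small

A common practice with E5-family models is to add a prefix to queries and passages.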

def format_for_e5(texts: list[str], *, is_query: bool) -> list[str]:
    prefix = "query: " if is_query else "passage: "
    return [prefix + text for text in texts]

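When building the index:

texts = format_for_e5([chunk.text for chunk in chunks], is_query=False)

When searching:

query_text = format_for_e5([query], is_query=True)
query_vector = model.encode(query_text, batch_size=1, show_progress=False)[0]

This looks like a minor detail, but embedding models often depend on exactly these training conventions for their accuracy. If results look odd, first check that you are feeding data in the format the model recommends.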
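Since the index already records model_name, one way to keep indexing and searching consistent is to decide on prefixes from that name in a single place. This is only a sketch: `format_texts` is a hypothetical helper, and the "e5" substring check is a heuristic assumed for this example, not a feature of any library. It is also worth checking whether the embedding package you use already adds prefixes itself, so they are not applied twice.

```python
def format_texts(texts: list[str], model_name: str, *, is_query: bool) -> list[str]:
    """Add E5-style prefixes only when the model name looks like an E5 model.

    The name check is a heuristic assumed for this sketch, not a library feature.
    """
    if "e5" not in model_name.lower():
        return list(texts)
    prefix = "query: " if is_query else "passage: "
    return [prefix + text for text in texts]


print(format_texts(["what is RAG?"], "multilingual-e5-small", is_query=True))
print(format_texts(["what is RAG?"], "bge-small", is_query=True))
```

Calling the same helper from both index and search makes it harder to prefix one side and forget the other.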

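10. When Should You Move to a Vector Database?

Today's tool uses vectors.npy to keep the flow transparent. With a small amount of data it is genuinely sufficient. But if your needs match one of the situations below, consider upgrading:

Tool	Best suited for
FAISS	Efficient local approximate search
LanceDB	Embedded vector database with a good developer experience
Qdrant	Server-style vector database with strong metadata filtering
sqlite-vss	Staying within the SQLite ecosystem

拍拍君's advice: write it once with numpy, and switch only when it hurts.
Do not stack the architecture into a tower before you have even started searching.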

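11. Practical Tuning Checklist

When search results are off, check in this order:

Whether chunks are too long or too short
Whether the model suits Chinese or technical documents
Whether query / passage prefixes match the model's conventions
Whether top_k is too small
Whether the document title is included in the embedding text
Whether you need keyword search combined with vector search
Whether you need a reranker for second-stage ranking

Of these, putting the title into the embedding text often helps: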

texts = [f"{chunk.title}\n\n{chunk.text}" for chunk in chunks]

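Engineering documents also tend to contain exact terms such as pyproject.toml, pytest.fixture, and mlx.core.array. For these, keyword search is still very strong. Mature systems are often hybrid search: keyword and vector used together.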
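One simple way to picture hybrid search is a weighted blend of a keyword score and a vector score. The sketch below is an illustration under stated assumptions, not how any particular system does it: `keyword_score`, `hybrid_score`, the weight `alpha=0.7`, and the similarity numbers are all made up. The keyword part is a plain substring match on Latin-style tokens, so Chinese text would need a proper tokenizer.

```python
import re


def keyword_score(query: str, text: str) -> float:
    """Fraction of query tokens that appear verbatim in the text."""
    tokens = set(re.findall(r"[\w.]+", query.lower()))
    if not tokens:
        return 0.0
    haystack = text.lower()
    return sum(token in haystack for token in tokens) / len(tokens)


def hybrid_score(vector_score: float, kw_score: float, alpha: float = 0.7) -> float:
    """Weighted blend; alpha=0.7 is an arbitrary starting point, not a tuned value."""
    return alpha * vector_score + (1 - alpha) * kw_score


chunks = [
    "Configure the build backend in pyproject.toml.",
    "Use uv to manage virtual environments.",
]
vector_scores = [0.52, 0.58]  # made-up similarity scores for illustration
query = "pyproject.toml settings"

for text, v in zip(chunks, vector_scores):
    k = keyword_score(query, text)
    print(f"{hybrid_score(v, k):.3f}  {text}")
```

In this toy case the first chunk wins because of the exact pyproject.toml match, even though its vector score is slightly lower. Real systems often use BM25 for the keyword side and rescale both scores before blending.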

12. Connecting to RAG

Semantic search can be used on its own, or placed in front of an LLM.

def build_context(results: list[Chunk]) -> str:
    return "\n\n---\n\n".join(chunk.text for chunk in results)

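Then put the context into the prompt:

Answer the question using only the material below.
If the material is insufficient, say you do not know.
Material:
{context}
Question:
{question}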

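This is a minimal RAG. You can pair it with a local Ollama or with an API model. For further reading:

Local LLM in Practice: Building Your Own Assistant with Ollama + Python
Streamlit + Ollama: Building a Local LLM Chatbot App

One reminder from 拍拍君, though: RAG quality does not depend on the LLM alone. Chunking, the embedding model, top-k, and prompt constraints all affect the final answer. The LLM is the last runner in the relay, not the only one.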
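The context-and-prompt step of this section can be sketched as one function. `build_prompt`, `PROMPT_TEMPLATE`, and the 2000-character budget are assumptions made for this example; a real budget depends on the LLM's context window. The function keeps whole chunks in ranked order and stops adding once the budget would be exceeded, always keeping at least the top chunk.

```python
PROMPT_TEMPLATE = """Answer the question using only the material below.
If the material is insufficient, say you do not know.

Material:
{context}

Question:
{question}
"""


def build_prompt(chunk_texts: list[str], question: str, max_context_chars: int = 2000) -> str:
    """Join the best-ranked chunks within a character budget and fill the template."""
    kept: list[str] = []
    used = 0
    for text in chunk_texts:
        # Stop once the budget is exceeded, but always keep the top chunk.
        if used + len(text) > max_context_chars and kept:
            break
        kept.append(text)
        used += len(text)
    context = "\n\n---\n\n".join(kept)
    return PROMPT_TEMPLATE.format(context=context, question=question)


retrieved = [
    "RAG retrieves relevant content from a document store first.",
    "Embeddings and vector search are usually the first step.",
]
print(build_prompt(retrieved, "What does RAG do before generation?"))
```

The resulting string can be sent to whichever model you use, local or API-based.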

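13. Trade-offs on Apple Silicon

The appeal of MLX is that it is not a Linux GPU workflow awkwardly ported to the Mac. It is a framework designed with Apple Silicon as the starting point. For local embeddings this gives you several benefits:

Relatively lightweight installation
Runs naturally on a Mac
Well suited to personal notes and small AI apps
Private documents can stay on your machine

Of course, if you need to serve many users, you still have to evaluate cloud GPUs, vector databases, and dedicated inference services. 拍拍君 draws the line like this: MLX is great for personal prototypes, while production systems depend on scale and operational needs. Tools are not a matter of faith, only of fit.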

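Conclusion

Today we built a local semantic search tool with MLX embeddings. Its core is just six steps: read the files, split into chunks, compute embeddings, save the index, embed the query, and rank and return. Yet this flow is already the core of many local AI apps and RAG systems. If you are learning AI app development, 拍拍君 strongly recommends writing a mini semantic search by hand. Do not stop at the sentence "an embedding is a vector". Search your own notes with a few real questions and you will quickly understand why the system is or is not accurate. That understanding is worth far more than memorizing ten framework names.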