Python fsspec in Practice: Unified Reads and Writes Across Local Files, S3, HTTP, and Data Pipeline Paths

Python Learning - This article is part of a series.


1. Introduction: Paths Don't Always Live on Your Hard Drive

Many small Python tools start out looking like this:

from pathlib import Path

path = Path("data/orders.csv")
text = path.read_text()

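That is perfectly fine. Local files, small scripts, and test data are all a good fit for pathlib.

The trouble is that data pipelines rarely stay there forever.

Today the data sits in a local folder. Tomorrow it is in an S3 bucket. The day after, it becomes an HTTP URL. A week later, the tests want an in-memory fake filesystem and would rather not touch the disk at all.

If you rewrite an if s3 then ... else if local then ... branch for every storage backend, the code soon turns into a clump of soggy noodles. When 拍拍君 sees code like this, the first move is a three-second deep breath, followed by quietly making coffee.

fsspec exists for exactly this situation.

Its full name is filesystem specification. Think of it as the filesystem abstraction layer widely used across Python data tools: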

Local files: file:// or a plain path.
S3: s3://bucket/path.csv.
GCS: gs://bucket/path.parquet.
HTTP: https://example.com/data.csv.
Zip: zip://inner.csv::archive.zip.
Memory: an in-memory filesystem for tests.

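It is not meant to replace pathlib. pathlib is great for describing local paths. fsspec is for describing where the data lives, when that data may not all sit on the same kind of filesystem.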
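To make the prefix-to-protocol idea concrete, here is a minimal sketch that uses fsspec.core.split_protocol to show how a URL string is split into a protocol and a path. The describe() helper is a hypothetical name introduced only for this illustration, and treating a missing protocol as "file" mirrors how fsspec handles bare local paths.

```python
from fsspec.core import split_protocol


def describe(url: str) -> str:
    """Hypothetical helper: show which protocol fsspec would see in a URL."""
    protocol, path = split_protocol(url)
    # A bare path carries no protocol; fsspec treats it as a local file.
    return f"{protocol or 'file'} -> {path}"


for url in [
    "data/orders.csv",
    "s3://bucket/path.csv",
    "https://example.com/data.csv",
]:
    print(describe(url))
```

Nothing here touches the network or needs a backend package, because only the string is inspected.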

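If you have read Python pathlib in Practice, that article focused on local path objects, path composition, and file operations.

If you have read Python PyArrow in Practice, that article focused on Arrow, Parquet schemas, and exchanging data across tools.

Today's article sits in the layer between them: how data tools open files on different storage backends.

拍拍君's conclusion up front: as soon as your data paths start to involve S3, GCS, HTTP, zip, caches, or test doubles, fsspec deserves a place in your toolbox.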

2. Installation: A Small Core Package, Backends Installed Separately

First, create a practice project:

mkdir fsspec-lab
cd fsspec-lab
uv init
uv add fsspec

If you are not using uv:

python -m venv .venv
source .venv/bin/activate
pip install fsspec

Check the version:

import fsspec

print(fsspec.__version__)

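The core fsspec package contains only the generic abstractions and a few basic filesystems. To talk to a specific cloud store, you usually need to install the matching implementation as well: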

uv add s3fs gcsfs adlfs

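The usual pairings are roughly: fsspec alone for local files, s3fs for S3, gcsfs for GCS, adlfs for Azure Blob, and the built-in memory:// for test doubles.

One thing to keep in mind: fsspec is an abstraction layer, not authentication magic.

S3 permissions, GCS service accounts, and Azure tokens still have to be handled the way each platform requires. fsspec's job is to let the code above it open files, list directories, and read or write bytes in a consistent way.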
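One quick way to see which backends are actually usable in the current environment is to ask fsspec for the filesystem class of each protocol. The sketch below assumes that fsspec.get_filesystem_class raises ImportError when a known backend package is missing and ValueError when the protocol name is unknown; backend_available() is a hypothetical helper name.

```python
import fsspec


def backend_available(protocol: str) -> bool:
    """Hypothetical helper: True if fsspec can load this protocol's backend."""
    try:
        fsspec.get_filesystem_class(protocol)
    except (ImportError, ValueError):
        # ImportError: known protocol, but its package (e.g. s3fs) is not installed.
        # ValueError: a protocol name fsspec does not recognize at all.
        return False
    return True


for protocol in ["file", "memory", "s3", "gcs", "abfs"]:
    print(protocol, backend_available(protocol))
```

The result depends on what is installed, so the cloud entries will differ from one environment to another. This only checks that the backend can be imported; it says nothing about credentials or permissions.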

3. First Example: Reading a Local File with the Same open

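Start with the most boring case, a local file. Boring is good: if the abstraction is awkward even for local files, there is no point reading further.

Create a file: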

from pathlib import Path

Path("data").mkdir(exist_ok=True)
Path("data/orders.csv").write_text(
    "order_id,total\n"
    "A001,120\n"
    "A002,85\n",
    encoding="utf-8",
)

Read it with fsspec.open():

import fsspec

with fsspec.open("data/orders.csv", mode="rt", encoding="utf-8") as f:
    text = f.read()

print(text)

You will get:

order_id,total
A001,120
A002,85

This looks much like the built-in open(). The difference is that fsspec.open() also accepts a protocol:

with fsspec.open("file://data/orders.csv", mode="rt", encoding="utf-8") as f:
    print(f.readline())

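For now, remember one practical rule:

If your function might someday read from S3, GCS, HTTP, or a memory filesystem in tests, do not hard-code Path into the core logic.

Treat the path string as a description of the data source, and let fsspec work out the protocol.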
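To see what "let fsspec work out the protocol" looks like, the following sketch uses fsspec.core.url_to_fs, which turns a URL string into a filesystem object plus a path on that filesystem. The URLs are illustrative only; memory:// is used so the example runs without any cloud backend.

```python
from fsspec.core import url_to_fs

# A URL with an explicit protocol resolves to the matching filesystem class.
fs, path = url_to_fs("memory://raw/orders.csv")
print(type(fs).__name__, path)

# A bare path falls back to the local filesystem and becomes an absolute path.
fs_local, local_path = url_to_fs("data/orders.csv")
print(type(fs_local).__name__, local_path)
```

The calling code never names a filesystem class; the string alone decides which implementation is used.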

4. Wrapping File Reading in a Swappable Function

Suppose you are writing a very small data-loading function.

At first it might look like this:

from pathlib import Path


def load_text(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")

That works well locally, but it is game over as soon as it meets s3://....

Switch to fsspec:

import fsspec


def load_text(url: str) -> str:
    with fsspec.open(url, mode="rt", encoding="utf-8") as f:
        return f.read()

The call site stays just as simple:

print(load_text("data/orders.csv"))

When the data later moves to remote storage, the core logic does not change:

print(load_text("s3://pypy-demo/orders.csv"))

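Of course, the S3 example only runs if s3fs is installed and permissions are configured. From a design point of view, though, the reading flow has already been abstracted away.

拍拍君 likes this change a lot, because it keeps the function's responsibilities clean:

load_text() is responsible for reading text.
The URL / protocol describes where the data is.
Credentials and storage options are injected from outside.

Do not let every file-reading function grow its own pile of cloud-platform branches. That quickly turns into a ball of mud nobody dares to touch.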
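One possible way to inject storage options from outside is an optional dict parameter that is forwarded to fsspec.open() as keyword arguments. This is a sketch rather than the only reasonable design; the storage_options parameter name and the memory://load-text-demo path are assumptions made for the example.

```python
from typing import Optional

import fsspec


def load_text(url: str, storage_options: Optional[dict] = None) -> str:
    """Read text from any fsspec URL; backend settings come from the caller."""
    options = storage_options or {}
    # Extra keyword arguments to fsspec.open() are passed on to the filesystem.
    with fsspec.open(url, mode="rt", encoding="utf-8", **options) as f:
        return f.read()


# Demo against the in-memory filesystem, so no credentials are needed.
demo_url = "memory://load-text-demo/orders.csv"
with fsspec.open(demo_url, mode="wt", encoding="utf-8") as f:
    f.write("order_id,total\nA001,120\n")

print(load_text(demo_url))
```

For a private bucket the caller would pass something like storage_options={"profile": "analytics-dev"}, and load_text() itself would stay unchanged.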

5. storage_options: Don't Cram Credentials into the Path

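Many backends need extra settings. S3, for example, can take a profile, an endpoint, anonymous access, or a region. GCS can take a token. HTTP may need headers.

fsspec passes these settings as storage options: extra keyword arguments to fsspec.open() or fsspec.filesystem(), or a storage_options dict in tools such as pandas.

For example, reading public S3 data anonymously: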

import fsspec

url = "s3://some-public-bucket/example.csv"

with fsspec.open(url, mode="rt", anon=True) as f:
    print(f.readline())

Or create the filesystem explicitly:

s3 = fsspec.filesystem(
    "s3",
    profile="analytics-dev",
    client_kwargs={"region_name": "us-west-2"},
)

with s3.open("my-bucket/raw/orders.csv", mode="rt") as f:
    print(f.readline())

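In a project, 拍拍君 usually keeps storage options in the configuration layer rather than scattering them across every file-reading function:

Paths only describe where the data is.
Credentials and connection settings are managed in one place.
Tests can swap in a memory filesystem.
Bucket names and profiles are not sprinkled all over the code.

If you have read Python dotenv in Practice, you can treat .env as one source of configuration. But do not put secrets into example code in articles, and do not commit tokens to the repo. 拍拍君 will frown.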
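A configuration layer for storage options can be as small as one function that maps a URL's protocol to the settings for that backend. The sketch below is one hypothetical shape for it: storage_options_for() and the DATA_AWS_PROFILE environment variable are names invented for this example, not fsspec or AWS conventions.

```python
import os

from fsspec.core import split_protocol


def storage_options_for(url: str) -> dict:
    """Hypothetical settings lookup: pick storage options by URL protocol."""
    protocol, _ = split_protocol(url)
    if protocol in ("s3", "s3a"):
        # DATA_AWS_PROFILE is an assumed variable name for this sketch.
        profile = os.environ.get("DATA_AWS_PROFILE")
        return {"profile": profile} if profile else {}
    # Local paths, memory://, and anything unconfigured need no options.
    return {}


os.environ["DATA_AWS_PROFILE"] = "analytics-dev"
print(storage_options_for("s3://my-bucket/raw/orders.csv"))
print(storage_options_for("data/orders.csv"))
```

File-reading functions then ask this one place for their options, and no secret value ever appears in the code itself.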

6. open_files: Processing a Batch of Files at Once

Data pipelines often read a batch of files rather than a single one. fsspec.open_files() takes a glob pattern and produces a list of files that are ready to be opened:

import fsspec

files = fsspec.open_files("data/*.csv", mode="rt", encoding="utf-8")

for open_file in files:
    with open_file as f:
        print(open_file.path, f.readline().strip())

The key point is that the pattern can just as well be remote:

files = fsspec.open_files(
    "s3://pypy-demo/raw/orders/*.csv",
    mode="rt",
    encoding="utf-8",
    anon=False,
)

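This style suits small batch jobs. If you already use Dask, PyArrow datasets, or Polars scans, those tools may support fsspec URLs themselves, so you do not always have to write the loop by hand. The underlying idea is the same: do not read every file into memory too early.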
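To illustrate reading lazily, the loop can be wrapped in a generator so that only one file is open at a time. This is a minimal sketch: iter_first_lines() is a hypothetical helper, and the files are written to memory://batch-demo purely so the example is self-contained.

```python
from typing import Iterator, Tuple

import fsspec


def iter_first_lines(pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (path, first line) for each file matching the glob pattern."""
    for open_file in fsspec.open_files(pattern, mode="rt", encoding="utf-8"):
        # Each file is opened only when the consumer reaches it.
        with open_file as f:
            yield open_file.path, f.readline().strip()


# Set up two small files in the in-memory filesystem for the demo.
fs = fsspec.filesystem("memory")
for name, body in [("a.csv", "id\n1\n"), ("b.csv", "id\n2\n")]:
    with fs.open(f"/batch-demo/{name}", mode="wt", encoding="utf-8") as f:
        f.write(body)

for path, header in iter_first_lines("memory://batch-demo/*.csv"):
    print(path, header)
```

Swapping the pattern for an s3:// glob should leave the generator untouched, given the right backend and credentials.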

7. Memory Filesystem: Tests That Never Touch the Disk

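memory:// is a backend 拍拍君 strongly recommends learning early. It lets you build a fake filesystem inside a test, without touching the real disk and without connecting to the cloud.

Example: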

import fsspec

mem = fsspec.filesystem("memory")

with mem.open("/raw/orders.csv", mode="wt", encoding="utf-8") as f:
    f.write("order_id,total\nA001,120\n")

with mem.open("/raw/orders.csv", mode="rt", encoding="utf-8") as f:
    print(f.read())

Change the core function to accept a filesystem, and the test becomes very clean:

def read_first_line(fs, path: str) -> str:
    with fs.open(path, mode="rt", encoding="utf-8") as f:
        return f.readline().strip()

The test can be written like this:

def test_read_first_line_from_memory_fs():
    fs = fsspec.filesystem("memory")
    fs.makedirs("/tmp", exist_ok=True)

    with fs.open("/tmp/example.txt", mode="wt", encoding="utf-8") as f:
        f.write("hello\nworld\n")

    assert read_first_line(fs, "/tmp/example.txt") == "hello"

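This test has no temporary directory and no S3 mock server, yet it still checks that your core logic is correct.

If production really uses S3, you still need a small number of integration tests to confirm credentials and permissions. But most pure data logic does not need to hit the cloud on every run. People only run tests willingly when they are fast. That is pragmatic, and it matters.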
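One caveat worth knowing: the memory filesystem keeps its contents for the whole process, so files written in one test are still visible in the next. A possible way to keep tests isolated is to give each one its own root directory and remove it afterwards. The temp_memory_dir() context manager below is a hypothetical helper sketched for this purpose, not part of fsspec.

```python
import uuid
from contextlib import contextmanager

import fsspec


@contextmanager
def temp_memory_dir():
    """Hypothetical helper: a unique in-memory directory, removed on exit."""
    fs = fsspec.filesystem("memory")
    root = f"/test-{uuid.uuid4().hex}"  # unique root avoids cross-test clashes
    fs.makedirs(root, exist_ok=True)
    try:
        yield fs, root
    finally:
        # Delete everything under the root so later tests start clean.
        fs.rm(root, recursive=True)


with temp_memory_dir() as (fs, root):
    path = f"{root}/example.txt"
    with fs.open(path, mode="wt", encoding="utf-8") as f:
        f.write("hello\nworld\n")
    existed_inside = fs.exists(path)

existed_after = fs.exists(path)
print(existed_inside, existed_after)
```

In a pytest project the same idea fits naturally into a fixture that yields the filesystem and the root.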

8. Caching and Integration with Data Tools

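Some data sources are slow to read, or you may not want one job to download the same HTTP file over and over. fsspec's simplecache can first store the remote data as a local copy: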

url = "simplecache::https://example.com/data/orders.csv"

with fsspec.open(
    url,
    mode="rt",
    encoding="utf-8",
    simplecache={"cache_storage": ".cache/fsspec"},
) as f:
    print(f.readline())

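Caching is convenient, but do not forget to design the invalidation rules. 拍拍君 usually asks three questions:

Will this data ever be updated?
How long may the cache live?
If stale data is read, is the consequence a bit of slowness, a bit of wrongness, or a production blow-up?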
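The stale-data question is easy to reproduce. The sketch below caches a file whose source lives on the in-memory filesystem (standing in for a slow remote), changes the source, and reads through the cache again. The memory://cache-demo path and the temporary cache directory are assumptions for the demo; the behavior relies on simplecache keeping no metadata about when a copy was made.

```python
import tempfile

import fsspec

source = "memory://cache-demo/orders.csv"  # stand-in for a remote URL
cache_dir = tempfile.mkdtemp()  # throwaway cache location for this demo


def write_source(text: str) -> None:
    """Overwrite the 'remote' file directly, bypassing the cache."""
    with fsspec.open(source, mode="wt", encoding="utf-8") as f:
        f.write(text)


def read_cached() -> str:
    """Read the file through simplecache, reusing the same cache directory."""
    with fsspec.open(
        f"simplecache::{source}",
        mode="rt",
        encoding="utf-8",
        simplecache={"cache_storage": cache_dir},
    ) as f:
        return f.read()


write_source("v1\n")
first = read_cached()   # downloads and caches the file
write_source("v2\n")
second = read_cached()  # served from the local copy, not the updated source
print(repr(first), repr(second))
```

Both reads return the first version, because the cached copy is never revalidated. When freshness matters, fsspec also provides filecache, which records metadata and accepts options such as expiry_time and check_files; clearing the cache directory is the blunt alternative.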

The place fsspec shows up most often is not your own hand-written fsspec.open() calls, but quietly behind data tools. For example, pandas can read fsspec URLs:

import pandas as pd

df = pd.read_csv(
    "s3://pypy-demo/raw/orders.csv",
    storage_options={"profile": "analytics-dev"},
)

PyArrow, Dask, and Polars can often accept fsspec-style URLs or filesystems as well. A simplified way to picture it:

import fsspec
import pyarrow.parquet as pq

fs = fsspec.filesystem("file")

with fs.open("data/orders.parquet", "rb") as f:
    table = pq.read_table(f)

print(table.schema)
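The same handoff works for anything that accepts a file object, including the standard library. As a dependency-free sketch, the example below feeds an fsspec file into csv.DictReader; the memory://csv-demo URL is an assumption so that it runs without pandas, PyArrow, or a Parquet file on disk.

```python
import csv

import fsspec

url = "memory://csv-demo/orders.csv"

# Write the sample data; newline="" leaves line endings untouched, as csv expects.
with fsspec.open(url, mode="wt", encoding="utf-8", newline="") as f:
    f.write("order_id,total\nA001,120\nA002,85\n")

# Any consumer of text file objects can read from the fsspec handle.
with fsspec.open(url, mode="rt", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

total = sum(int(row["total"]) for row in rows)
print(len(rows), total)
```

Changing url to a local path or a remote URL leaves the parsing code as it is, which is the point of routing file access through fsspec.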