The Complete Guide to Python Regular Expressions: The re Module from Basics to Practice

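1. Introduction

Hi everyone, 拍拍君 here! 🎉

Have you ever run into situations like these?

🔍 Finding every email address in a large block of text
🔍 Validating the format of a phone number a user typed in
🔍 Extracting all the timestamps from a log file
🔍 Replacing the links in a markdown document with plain text

All of these problems share one solution: regular expressions (regex)!

Python's built-in re module is the go-to tool for working with regular expressions. Today 拍拍君 will take you from zero, step by step, through the regex techniques you need to handle everyday text processing with confidence.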

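2. What Is a Regular Expression?

A regular expression is a mini-language for **describing string patterns**. You can think of it as a "search template":

Plain string search: looking for "hello" only finds the exact text hello
Regex search: looking for r"h.llo" matches hello, hallo, hxllo, and so on

In short, regex lets you search and manipulate strings by pattern rather than by fixed text.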
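
To make the difference concrete, here is a minimal sketch that compares a plain substring test with a regex search. The word list is made up for illustration.

```python
import re

# Hypothetical sample words, chosen only to illustrate the idea.
words = ["hello", "hallo", "hxllo", "hllo", "help"]

# Plain substring search: only the exact text "hello" qualifies.
plain_hits = [w for w in words if "hello" in w]

# Regex search: "." stands for exactly one character (anything but a newline).
regex_hits = [w for w in words if re.search(r"h.llo", w)]

print(plain_hits)  # ['hello']
print(regex_hits)  # ['hello', 'hallo', 'hxllo']
```

Note that "hllo" is not matched: the dot must consume one character, so the pattern needs at least five characters.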

3. re Module Basics

Importing the module

import re

That's all. There is nothing to install, because re is part of the Python standard library!

3.1 re.search(): find the first match

import re

text = "My email is pypy@example.com, feel free to get in touch!"
match = re.search(r"[\w.]+@[\w.]+", text)

if match:
    print(match.group())  # pypy@example.com

re.search() scans the whole string and returns a Match object for the first place the pattern matches, or None if nothing matches.

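3.2 re.match(): match at the start of the string

import re

# match() only checks the beginning of the string
print(re.match(r"\d+", "123abc"))   # <re.Match object; span=(0, 3), match='123'>
print(re.match(r"\d+", "abc123"))   # None (the string does not start with a digit)

💡 拍拍君's tip: match() only looks at the start of the string, while search() scans all of it. In most cases search() is the safer choice!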
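
As a small sketch of how these anchoring behaviours differ, the snippet below runs the same pattern through match(), search(), and also re.fullmatch(), a standard-library function not covered above that succeeds only when the entire string matches.

```python
import re

pattern = r"\d+"

at_start = re.match(pattern, "123abc")       # matches '123' at the start
anywhere = re.search(pattern, "abc123")      # matches '123' wherever it occurs
partial = re.fullmatch(pattern, "123abc")    # None: 'abc' is left over
whole = re.fullmatch(pattern, "123")         # the whole string is digits

print(at_start.group())  # 123
print(anywhere.group())  # 123
print(partial)           # None
print(whole.group())     # 123
```

fullmatch() is often the most direct choice for validation, where leftover characters should count as a failure.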

3.3 re.findall(): find all matches

import re

text = "2026-03-06 sunny, 2026-03-07 cloudy"
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)
# ['2026-03-06', '2026-03-07']

findall() returns a list containing every matching string. It is one of the functions you will use most often day to day!

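3.4 re.finditer(): iterate over all matches

import re

text = "Prices: $100, $250, $3999"
for match in re.finditer(r"\$(\d+)", text):
    # span() covers the whole match, including the leading "$"
    print(f"amount: {match.group(1)}, span: {match.span()}")

# amount: 100, span: (8, 12)
# amount: 250, span: (14, 18)
# amount: 3999, span: (20, 25)

finditer() is similar to findall(), but it returns an iterator of Match objects, which give you more information (positions, groups, and so on).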

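4. Regex Syntax Cheat Sheet

This is the core of regular expressions. Learn these tables and you will have most of the regex you need day to day!

4.1 Basic character matching

Syntax	Meaning	Example
`.`	Any character except a newline	`a.c` → `abc`, `a1c`
`\d`	A digit (for str patterns any Unicode decimal digit; only `[0-9]` with `re.ASCII`)	`\d\d` → `42`
`\D`	A non-digit	`\D+` → `abc`
`\w`	A word character: letter, digit, or underscore (Unicode-aware for str patterns)	`\w+` → `hello_123`
`\W`	A non-word character	`\W` → `@`, `!`
`\s`	A whitespace character (space, tab, newline, ...)	`\s+` → a run of spaces or tabs
`\S`	A non-whitespace character	`\S+` → `hello`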
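
One detail worth seeing in action: for str patterns, `\d` and `\w` are Unicode-aware by default, and the `re.ASCII` flag restricts them to ASCII. The sample strings below are made up for illustration.

```python
import re

fullwidth = "\uff14\uff12"   # the full-width digits "４２"
word = "caf\u00e9_1"         # "café_1"

unicode_digits = bool(re.fullmatch(r"\d+", fullwidth))
ascii_digits = bool(re.fullmatch(r"\d+", fullwidth, re.ASCII))
unicode_words = re.findall(r"\w+", word)
ascii_words = re.findall(r"\w+", word, re.ASCII)

print(unicode_digits)  # True
print(ascii_digits)    # False
print(unicode_words)   # ['café_1']
print(ascii_words)     # ['caf', '_1']
```

If you only want to accept `0` through `9`, writing `[0-9]` explicitly or passing `re.ASCII` avoids surprises.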

4.2 Quantifiers

Syntax	Meaning	Example
`*`	0 or more times	`ab*c` → `ac`, `abc`, `abbc`
`+`	1 or more times	`ab+c` → `abc`, `abbc` (does not match `ac`)
`?`	0 or 1 time	`colou?r` → `color`, `colour`
`{n}`	Exactly n times	`\d{4}` → `2026`
`{n,m}`	Between n and m times	`\d{2,4}` → `03`, `306`, `2026`
`{n,}`	At least n times	`\d{2,}` → `42`, `123`, `9999`
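
The table rows can be checked with a few lines of code. This is a small sketch; the helper `matches` is a hypothetical name introduced here, and it uses re.fullmatch() so that each candidate must match in full.

```python
import re


def matches(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


star = matches(r"ab*c", ["ac", "abc", "abbc"])
plus = matches(r"ab+c", ["ac", "abc", "abbc"])
optional = matches(r"colou?r", ["color", "colour", "colouur"])
ranged = matches(r"\d{2,4}", ["3", "03", "306", "2026", "20260"])

print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbc']
print(optional)  # ['color', 'colour']
print(ranged)    # ['03', '306', '2026']
```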

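4.3 Position anchors

Syntax	Meaning
`^`	Start of the string (or of each line with `re.MULTILINE`)
`$`	End of the string, or just before a trailing newline
`\b`	Word boundary

import re

# The power of word boundaries
text = "cat concatenate category"
print(re.findall(r"\bcat\b", text))
# ['cat'] (the "cat" inside concatenate and category is not matched)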

5. Groups: Extracting the Parts You Want

Wrap part of a pattern in () to create a capturing group:

import re

text = "Birthday: 1990-05-15"
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

if match:
    print(f"Full match: {match.group(0)}")  # 1990-05-15
    print(f"Year: {match.group(1)}")        # 1990
    print(f"Month: {match.group(2)}")       # 05
    print(f"Day: {match.group(3)}")         # 15

Named groups

Use (?P<name>...) to give a group a name, which makes the code clearer:

import re

log = "2026-03-06 10:30:45 ERROR database connection failed"
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<msg>.+)"
match = re.search(pattern, log)

if match:
    print(match.group("date"))   # 2026-03-06
    print(match.group("level"))  # ERROR
    print(match.group("msg"))    # database connection failed
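
When a pattern has several named groups, Match.groupdict() returns them all at once as a dictionary. A minimal sketch, using a made-up log line in the same format:

```python
import re

log = "2026-03-06 10:30:45 ERROR database connection failed"
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<msg>.+)"
)

match = pattern.search(log)
# groupdict() maps each group name to the text it captured.
fields = match.groupdict() if match else {}
print(fields)
# {'date': '2026-03-06', 'time': '10:30:45', 'level': 'ERROR', 'msg': 'database connection failed'}
```

This is handy when the parsed fields are going straight into a dict-based structure such as a JSON record.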

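How findall() interacts with groups

Careful! When the pattern contains groups, findall() returns only the group contents: a list of strings for a single group, or a list of tuples for several groups.

import re

text = "Contacts: Alice (alice@mail.com), Bob (bob@mail.com)"

# No groups → full matches are returned
print(re.findall(r"[\w.]+@[\w.]+", text))
# ['alice@mail.com', 'bob@mail.com']

# With groups → only the group contents are returned
print(re.findall(r"(\w+) \(([\w.]+@[\w.]+)\)", text))
# [('Alice', 'alice@mail.com'), ('Bob', 'bob@mail.com')]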
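
If you need parentheses only for grouping (for example, to repeat a sub-pattern) and still want findall() to return full matches, a non-capturing group (?:...) is one option. A small sketch with made-up dates:

```python
import re

text = "2026-03-06 and 2027-11-30"

# Capturing group: findall() returns only the group, and a repeated
# group keeps just its last repetition.
captured = re.findall(r"\d{4}(-\d{2}){2}", text)

# Non-capturing group: same grouping, but full matches are returned.
full = re.findall(r"\d{4}(?:-\d{2}){2}", text)

print(captured)  # ['-06', '-30']
print(full)      # ['2026-03-06', '2027-11-30']
```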

6. Replacing: re.sub()

re.sub() is the regex version of string replacement, and it is very powerful:

import re

# Basic replacement
text = "My phone number is 0912-345-678, backup 0987-654-321"
clean = re.sub(r"\d{4}-\d{3}-\d{3}", "***-***-***", text)
print(clean)
# My phone number is ***-***-***, backup ***-***-***

Advanced replacement with groups

import re

# Convert dates from YYYY-MM-DD to DD/MM/YYYY
text = "Date: 2026-03-06"
result = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(result)
# Date: 06/03/2026

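Replacement with a function

import re

def censor_email(match):
    # Keep the first letter of the name and the full domain
    name, domain = match.group(1), match.group(2)
    return f"{name[0]}***@{domain}"

text = "Contact: alice@gmail.com or bob@yahoo.com"
result = re.sub(r"(\w+)@([\w.]+)", censor_email, text)
print(result)
# Contact: a***@gmail.com or b***@yahoo.com

💡 拍拍君's tip: the replacement function passed to re.sub() receives a Match object, so it can apply whatever transformation logic you like!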
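
A few more re.sub() features are worth a quick sketch: `\g<name>` references a named group in the replacement string, the `count` argument limits how many replacements are made, and re.subn() also reports the number of replacements. The sample text is made up.

```python
import re

text = "Dates: 2026-03-06, 2026-03-07"
pattern = r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})"

# \g<name> refers to a named group inside the replacement string.
reordered = re.sub(pattern, r"\g<d>/\g<m>/\g<y>", text)

# count=1 replaces only the first match.
first_only = re.sub(pattern, "<date>", text, count=1)

# subn() returns (new_string, number_of_replacements).
replaced, n = re.subn(pattern, "<date>", text)

print(reordered)    # Dates: 06/03/2026, 07/03/2026
print(first_only)   # Dates: <date>, 2026-03-07
print(replaced, n)  # Dates: <date>, <date> 2
```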

7. Splitting: re.split()

More flexible than str.split():

import re

# Split on several different delimiters
text = "apple, banana; cherry|grape"
result = re.split(r"[,;|]\s*", text)
print(result)
# ['apple', 'banana', 'cherry', 'grape']

# Split on whitespace (handles runs of spaces)
text = "hello   world    python"
result = re.split(r"\s+", text)
print(result)
# ['hello', 'world', 'python']
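
Three re.split() behaviours tend to surprise people, so here is a short sketch of each: a capturing group keeps the delimiters, `maxsplit` caps the number of splits, and leading or trailing separators produce empty strings.

```python
import re

text = "apple, banana; cherry|grape"

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r"\s*([,;|])\s*", text)

# maxsplit limits how many splits are made.
first_split = re.split(r"[,;|]\s*", text, maxsplit=1)

# Separators at the edges yield empty strings at the ends.
padded = re.split(r"\s+", "  hello world ")

print(with_delims)  # ['apple', ',', 'banana', ';', 'cherry', '|', 'grape']
print(first_split)  # ['apple', 'banana; cherry|grape']
print(padded)       # ['', 'hello', 'world', '']
```

For plain whitespace splitting, str.split() with no arguments already discards the empty edge strings, so it may be the simpler tool there.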

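8. Compiling Patterns: re.compile()

If you use the same pattern many times, precompiling it with re.compile() can improve performance and gives the pattern a reusable name:

import re

# Compile once, reuse many times
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

texts = [
    "Contact alice@example.com",
    "Send it to bob@company.co.uk",
    "A sentence with no email",
]

for text in texts:
    match = email_pattern.search(text)
    if match:
        print(f"Found: {match.group()}")
    else:
        print("No email")

# Found: alice@example.com
# Found: bob@company.co.uk
# No email

💡 拍拍君's tip: the re module keeps an internal cache of recently compiled patterns (512 entries in current CPython, which is an implementation detail), so for light use the speed difference between re.search() and a compiled pattern is small. Building the compile() habit is still a good idea!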

9. Flags

Flags change how a regex behaves:

re.IGNORECASE (re.I): case-insensitive matching

import re

text = "Python PYTHON python PyThOn"
result = re.findall(r"python", text, re.IGNORECASE)
print(result)
# ['Python', 'PYTHON', 'python', 'PyThOn']

re.MULTILINE (re.M): multi-line mode

import re

text = """Line 1: Hello
Line 2: World
Line 3: Python"""

# Without MULTILINE, ^ only matches at the start of the string
print(re.findall(r"^Line .", text))
# ['Line 1']

# With MULTILINE, ^ matches at the start of every line
print(re.findall(r"^Line .", text, re.MULTILINE))
# ['Line 1', 'Line 2', 'Line 3']

re.DOTALL (re.S): let `.` match newlines

import re

html = "<div>\nhello\n</div>"

# By default, . does not match a newline
print(re.search(r"<div>(.+)</div>", html))
# None

# DOTALL makes . match any character, including newlines
match = re.search(r"<div>(.+)</div>", html, re.DOTALL)
print(repr(match.group(1)))
# '\nhello\n'

re.VERBOSE (re.X): write readable regexes

This is the flag 拍拍君 recommends most! It lets you write regexes with comments:

import re

email_pattern = re.compile(r"""
    [\w.+-]+        # user name (letters, digits, ., +, -)
    @               # the @ sign
    [\w-]+          # domain name
    \.              # a literal dot
    [\w.]+          # top-level domain (may have several parts, e.g. .co.uk)
""", re.VERBOSE)

print(email_pattern.search("email: pypy@daily.com").group())
# pypy@daily.com

Combining multiple flags

import re

# Combine flags with the bitwise OR operator |
pattern = re.compile(r"hello", re.IGNORECASE | re.MULTILINE)

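10. Greedy vs Lazy Matching

This is one of the easiest regex traps to fall into!

Greedy matching (the default)

import re

html = "<b>bold</b> and <b>more bold</b>"

# Greedy: match the longest string possible
match = re.search(r"<b>(.+)</b>", html)
print(match.group(1))
# bold</b> and <b>more bold    ← too much!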

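Lazy matching (add `?`)

import re

html = "<b>bold</b> and <b>more bold</b>"

# Lazy: match the shortest string possible
match = re.search(r"<b>(.+?)</b>", html)
print(match.group(1))
# bold    ← perfect!

💡 拍拍君's tip: add ? after a quantifier to make it lazy: *?, +?, ??, {n,m}?. Lazy matching is usually what you want when pulling a short delimited snippet, such as a single tag pair, out of a larger string.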
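
Another common way to stop a match from running too far is a negated character class, which simply refuses to cross the delimiter. A minimal sketch comparing it with the lazy version on the same made-up snippet:

```python
import re

html = "<b>bold</b> and <b>more bold</b>"

# Lazy quantifier: expand one character at a time until </b> follows.
lazy = re.findall(r"<b>(.+?)</b>", html)

# Negated class: match anything that is not "<", so the match cannot cross a tag.
negated = re.findall(r"<b>([^<]+)</b>", html)

print(lazy)     # ['bold', 'more bold']
print(negated)  # ['bold', 'more bold']
```

The two agree here; they differ when the content itself contains a "<", in which case the negated class fails to match rather than spanning tags.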

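11. Lookahead and Lookbehind

This is an advanced technique for matching text that is preceded or followed by specific content, without consuming those characters:

Lookahead

import re

# Positive lookahead: numbers followed by "yuan"
text = "Price 100 yuan, weight 50 kg"
print(re.findall(r"\d+(?=\s*yuan)", text))
# ['100']

# Negative lookahead: numbers NOT followed by "yuan".
# The extra (?!\d) stops the engine from backtracking to a shorter number;
# without it the result would be ['10', '50'].
print(re.findall(r"\d+(?!\d)(?!\s*yuan)", text))
# ['50']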

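Lookbehind

import re

# Positive lookbehind: numbers preceded by "$"
text = "Price $100 and 200 yuan"
print(re.findall(r"(?<=\$)\d+", text))
# ['100']

⚠️ Note: in Python's re module, a lookbehind pattern must have a fixed width, so quantifiers such as * and + are not allowed inside it.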
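
The fixed-width rule can be seen directly: the re module raises re.error when a lookbehind could match strings of different lengths. The sketch below uses made-up currency text and shows one possible workaround, matching the prefix normally and capturing what follows.

```python
import re

# Fixed-width lookbehind: accepted.
fixed = re.compile(r"(?<=USD )\d+")
fixed_hits = fixed.findall("USD 100, EUR 200")

# Variable-width lookbehind: rejected when the pattern is compiled.
rejected = False
try:
    re.compile(r"(?<=USD\s*)\d+")
except re.error as exc:
    rejected = True
    print(f"re.error: {exc}")

# Workaround: consume the prefix and capture only the number.
captured = re.findall(r"USD\s*(\d+)", "USD100, USD   250")

print(fixed_hits)  # ['100']
print(rejected)    # True
print(captured)    # ['100', '250']
```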

12. Practical Examples

12.1 Validating Taiwanese mobile numbers

import re

def is_valid_tw_phone(phone: str) -> bool:
    """Validate the format of a Taiwanese mobile number."""
    pattern = r"^09\d{2}-?\d{3}-?\d{3}$"
    return bool(re.match(pattern, phone))

print(is_valid_tw_phone("0912-345-678"))  # True
print(is_valid_tw_phone("0912345678"))    # True
print(is_valid_tw_phone("0812345678"))    # False
print(is_valid_tw_phone("091234567"))     # False
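
One subtlety with `^...$` validation: `$` also matches just before a trailing newline, so input such as "0912345678\n" still passes. If that matters for your data, re.fullmatch() is one way to require that the whole string match. A minimal sketch:

```python
import re

pattern = r"09\d{2}-?\d{3}-?\d{3}"
with_newline = "0912345678\n"

# "$" matches just before the final newline, so this input is accepted.
anchored = bool(re.match(rf"^{pattern}$", with_newline))

# fullmatch() requires every character, newline included, to match.
strict = bool(re.fullmatch(pattern, with_newline))
clean = bool(re.fullmatch(pattern, "0912-345-678"))

print(anchored)  # True
print(strict)    # False
print(clean)     # True
```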

12.2 Extracting URLs

import re

text = """
See also:
- Official site https://www.python.org/docs/
- GitHub https://github.com/python/cpython
- There is also http://example.com/path?q=hello&lang=zh
"""

urls = re.findall(r"https?://[\w./-]+(?:\?[\w=&]+)?", text)
for url in urls:
    print(url)

# https://www.python.org/docs/
# https://github.com/python/cpython
# http://example.com/path?q=hello&lang=zh

12.3 Filtering a log file

import re

logs = """
2026-03-06 10:30:45 INFO server started
2026-03-06 10:31:02 WARNING memory usage at 85%
2026-03-06 10:31:15 ERROR database connection timed out
2026-03-06 10:31:30 INFO reconnected successfully
2026-03-06 10:32:00 ERROR disk space low
"""

# Extract every ERROR-level log entry
error_pattern = re.compile(
    r"(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"ERROR "
    r"(?P<message>.+)"
)

for match in error_pattern.finditer(logs):
    print(f"[{match.group('datetime')}] {match.group('message')}")

# [2026-03-06 10:31:15] database connection timed out
# [2026-03-06 10:32:00] disk space low

12.4 Checking password strength

import re

def check_password(password: str) -> list[str]:
    """Check password strength and return the rules that are not met."""
    issues = []

    if len(password) < 8:
        issues.append("must be at least 8 characters long")
    if not re.search(r"[A-Z]", password):
        issues.append("needs at least one uppercase letter")
    if not re.search(r"[a-z]", password):
        issues.append("needs at least one lowercase letter")
    if not re.search(r"\d", password):
        issues.append("needs at least one digit")
    if not re.search(r"[!@#$%^&*(),.?\":{}|<>]", password):
        issues.append("needs at least one special character")

    return issues

# Tests
print(check_password("abc"))
# ['must be at least 8 characters long', 'needs at least one uppercase letter', 'needs at least one digit', 'needs at least one special character']

print(check_password("MyP@ssw0rd"))
# [] (passes every check!)
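
For comparison, the same rules can be folded into a single pattern by stacking lookaheads, each of which checks one condition without consuming characters. This is only a sketch: `STRONG_PASSWORD` is a hypothetical name, and the trade-off is that one regex yields a yes/no answer rather than a list of the rules that failed.

```python
import re

# Hypothetical single-pattern version of the checks above.
STRONG_PASSWORD = re.compile(
    r"""
    (?=.*[A-Z])                    # at least one uppercase letter
    (?=.*[a-z])                    # at least one lowercase letter
    (?=.*\d)                       # at least one digit
    (?=.*[!@#$%^&*(),.?":{}|<>])   # at least one special character
    .{8,}                          # at least 8 characters in total
    """,
    re.VERBOSE,
)

strong = bool(STRONG_PASSWORD.fullmatch("MyP@ssw0rd"))
too_short = bool(STRONG_PASSWORD.fullmatch("abc"))
no_upper = bool(STRONG_PASSWORD.fullmatch("password1!"))

print(strong)     # True
print(too_short)  # False
print(no_upper)   # False
```

When users need to be told what to fix, the step-by-step function is usually the friendlier design.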

12.5 Converting Markdown links to plain text

import re

markdown = """
See the [Python website](https://python.org) and
[拍拍君's blog](https://dailypypy.org),
or have a look at [GitHub](https://github.com).
"""

# Turn [text](url) into text (url)
plain = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r"\1 (\2)", markdown)
print(plain)
# See the Python website (https://python.org) and
# 拍拍君's blog (https://dailypypy.org),
# or have a look at GitHub (https://github.com).

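13. Common Pitfalls and Best Practices

Pitfall 1: forgetting the raw string

import re

# ❌ Wrong: Python turns \b into a backspace character before re ever sees it
re.search("\bcat\b", "the cat sat")  # None: the pattern looks for backspaces

# ✅ Right: r"" tells Python to leave the backslashes alone
re.search(r"\bcat\b", "the cat sat")  # matches 'cat'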

Pitfall 2: greedy matching eats too much

As covered earlier, remember to add ? after the quantifier to switch to lazy matching.

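Pitfall 3: don't parse HTML with regex

import re

complex_html = "<div>outer <div>inner</div> tail</div>"

# ❌ Don't do this: the match stops at the first </div> and breaks on nesting
print(re.findall(r"<div>(.+?)</div>", complex_html))
# ['outer <div>inner']

# ✅ Use a dedicated HTML parser
from html.parser import HTMLParser
# or use BeautifulSoup

💡 拍拍君's advice: regular expressions cannot reliably handle nested structures. To parse HTML/XML, use BeautifulSoup or lxml!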
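
As a rough illustration of the parser-based approach, here is a sketch built on the standard library's html.parser. The class name `DivTextCollector` and the sample markup are made up for this example; BeautifulSoup or lxml would be the more convenient choice in real projects.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text found inside <div> elements, at any nesting depth."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0                 # number of <div> elements currently open
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while we are inside at least one <div>.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


parser = DivTextCollector()
parser.feed("<div>outer <div>inner</div> tail</div><p>ignored</p>")
parser.close()
print(parser.chunks)  # ['outer', 'inner', 'tail']
```

Because the parser tracks nesting depth itself, the nested <div> that trips up a regex is handled without any special casing.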

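Best practices

Always use raw strings (r"...")
Write complex patterns with re.VERBOSE and add comments
Use re.compile() for patterns you reuse
Use named groups (?P<name>) to improve readability
Try the simple approach first: if str.startswith(), str.endswith(), or in does the job, you don't need regex
Write test cases: regexes are easy to get wrong, so tests help confirm they behave as intended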
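
As a sketch of the last point, a small table of inputs and expected results is often enough to catch regex mistakes early. The pattern and function below are a hypothetical variant of the phone validator that uses fullmatch(); adapt the cases to your own pattern.

```python
import re

# Hypothetical pattern under test.
TW_MOBILE = re.compile(r"09\d{2}-?\d{3}-?\d{3}")


def is_valid_tw_phone(phone: str) -> bool:
    """Return True if the whole string looks like a Taiwanese mobile number."""
    return TW_MOBILE.fullmatch(phone) is not None


# Each entry maps an input to the result we expect.
cases = {
    "0912-345-678": True,
    "0912345678": True,
    "0812345678": False,    # wrong prefix
    "091234567": False,     # too short
    "0912345678\n": False,  # trailing newline
    "": False,
}

for phone, expected in cases.items():
    assert is_valid_tw_phone(phone) == expected, repr(phone)

print("all cases passed")
```

The same table drops neatly into a pytest or unittest test once the project has a test suite.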

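14. Summary

Today we covered:

Task	Function	Purpose
Search	`re.search()`	Find the first match
Match at start	`re.match()`	Check the beginning of the string
Find all	`re.findall()`	Find all matches, returned as a list
Iterate matches	`re.finditer()`	Find all matches, returned as an iterator
Replace	`re.sub()`	Replace the matched text
Split	`re.split()`	Split a string by a pattern
Compile	`re.compile()`	Precompile a pattern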