An Introductory Guide to Python Regular Expressions

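A regular expression (regex or regexp for short) is a powerful text-processing tool: a pattern, written in a special string syntax that is largely independent of any one programming language, used to match, find, replace, and manipulate text. Python provides full support for regular expressions through the built-in `re` module.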

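1. What Is a Regular Expression?

Suppose you want to find every phone number in a given format within a large block of text, or check whether a string is a valid email address. Hand-written code for such patterns quickly becomes long and error-prone. Regular expressions exist to solve this kind of problem: they let you describe complex string patterns in a concise syntax.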

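2. Overview of the `re` Module

Python's `re` module provides a set of functions for working with regular expressions:

- `re.match()`: tries to match a pattern at the start of the string only.
- `re.search()`: scans the whole string and returns the first location where the pattern matches.
- `re.findall()`: returns all non-overlapping matches of the pattern in the string as a list.
- `re.finditer()`: finds all non-overlapping matches and returns an iterator that yields a match object for each one.
- `re.sub()`: replaces the substrings that match the pattern.
- `re.compile()`: compiles a pattern into a regular expression object, which makes repeated use more efficient.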
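
The differences between these functions are easiest to see side by side. The following is a minimal sketch; the sample string and the variable name `spans` are made up for illustration.

```python
import re

text = "order 42 shipped, order 7 pending"

# match() only succeeds if the pattern matches at position 0.
print(re.match(r"\d+", text))            # None: the text starts with "order"
print(re.match(r"order", text).group())  # order

# search() returns the first match anywhere in the string.
print(re.search(r"\d+", text).group())   # 42

# findall() returns every non-overlapping match as a list of strings.
print(re.findall(r"\d+", text))          # ['42', '7']

# finditer() yields match objects, which also carry positions.
spans = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(spans)                             # [('42', 6), ('7', 24)]
```

Note that `re.match()` and `re.search()` return `None` when nothing matches, so real code should check the result before calling `.group()`.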

3. Basic Matching

3.1 Matching Characters

The simplest regular expression matches literal characters directly.

```python
import re

text = "Hello world"
pattern = "Hello"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: Hello
```

3.2 Metacharacters (Special Characters)

Metacharacters are characters that have a special meaning in a regular expression.

- `.`: matches any single character except a newline.
  ```python
  text = "cat, cot, cut"
  matches = re.findall("c.t", text)  # ['cat', 'cot', 'cut']
  ```
- `*`: matches the preceding character zero or more times.
  ```python
  text = "caaat, ct, cat"
  matches = re.findall("ca*t", text)  # ['caaat', 'ct', 'cat']
  ```
- `+`: matches the preceding character one or more times.
  ```python
  text = "caaat, ct, cat"
  matches = re.findall("ca+t", text)  # ['caaat', 'cat']
  ```
- `?`: matches the preceding character zero or one time (making it optional).
  ```python
  text = "color, colour"
  matches = re.findall("colou?r", text)  # ['color', 'colour']
  ```
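- `{n}`: matches the preceding character exactly n times.
- `{n,}`: matches the preceding character at least n times.
- `{n,m}`: matches the preceding character between n and m times.
  ```python
  text = "a, aa, aaa, aaaa"
  print(re.findall("a{2}", text))    # ['aa', 'aa', 'aa', 'aa'] ("aaaa" yields two non-overlapping matches)
  print(re.findall("a{2,}", text))   # ['aa', 'aaa', 'aaaa']
  print(re.findall("a{1,3}", text))  # ['a', 'aa', 'aaa', 'aaa', 'a'] ("aaaa" splits into 'aaa' and 'a')
  ```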
- `^`: matches the start of the string.
  ```python
  text = "Hello world\nworld Hello"
  print(re.findall("^Hello", text))                # ['Hello']
  print(re.findall("^world", text, re.MULTILINE))  # ['world'] (re.MULTILINE makes ^ match at the start of each line)
  ```
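- `$`: matches the end of the string (or the position just before a newline that ends the string).
  ```python
  text = "Hello world\nworld Hello"
  print(re.findall("world$", text))                # [] (the string ends with "Hello", not "world")
  print(re.findall("world$", text, re.MULTILINE))  # ['world'] (end of the first line)
  print(re.findall("Hello$", text, re.MULTILINE))  # ['Hello']
  ```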
- `[]`: matches any one of the characters inside the square brackets.
  ```python
  text = "cat, cot, cut"
  matches = re.findall("c[ao]t", text)   # ['cat', 'cot'] (either a or o)
  matches = re.findall("c[a-z]t", text)  # ['cat', 'cot', 'cut'] (any one letter from a to z)
  matches = re.findall("c[^u]t", text)   # ['cat', 'cot'] (any one character other than u)
  ```
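- `|`: alternation; matches either the pattern on its left or the pattern on its right.
  ```python
  text = "cat or dog"
  matches = re.findall("cat|dog", text)  # ['cat', 'dog']
  ```
- `()`: grouping; combines several characters into a single unit and can capture the matched content.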
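
Grouping is the one metacharacter above without an example, so here is a small sketch of what "a single unit" means in practice. The sample strings and the variable names `repeated`, `single`, and `grey` are made up for illustration.

```python
import re

# A group lets a quantifier apply to several characters at once.
repeated = re.search(r"(ha)+", "hahaha!")
print(repeated.group())  # hahaha

# Without the group, + applies only to the preceding "a".
single = re.search(r"ha+", "hahaha!")
print(single.group())    # ha

# A group also limits how far | reaches: here it chooses between "a" and "e" only.
grey = re.search(r"gr(a|e)y", "The grey cat")
print(grey.group())      # grey
print(grey.group(1))     # e (the text captured by the group)
```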

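4. Character Class Shorthands (Special Sequences)

For convenience, regular expressions provide several predefined character classes. For `str` patterns in Python 3 these are Unicode-aware by default; the bracket equivalents listed below hold exactly when the `re.ASCII` flag is set.

- `\d`: matches any digit. Equivalent to `[0-9]` in ASCII mode.
- `\D`: matches any non-digit character. Equivalent to `[^0-9]` in ASCII mode.
- `\w`: matches any letter, digit, or underscore. Equivalent to `[a-zA-Z0-9_]` in ASCII mode.
- `\W`: matches any character that is not a letter, digit, or underscore.
- `\s`: matches any whitespace character (space, tab, newline, and so on). Equivalent to `[ \t\n\r\f\v]` in ASCII mode.
- `\S`: matches any non-whitespace character.
- `\b`: matches a word boundary.
  ```python
  text = "cat, scatter, the cat"
  print(re.findall(r"\bcat\b", text))  # ['cat', 'cat'] (use a raw string r"..." to avoid backslash-escape problems)
  ```
- `\B`: matches a position that is not a word boundary.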
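
The shorthands above can be tried out on one sample string. This is a minimal sketch; the input text and the variable names `digits`, `words`, `chunks`, and `inner` are invented for illustration.

```python
import re

text = "Order #A_17 shipped on 2024-05-01\tto Bob"

digits = re.findall(r"\d+", text)
print(digits)  # ['17', '2024', '05', '01']

words = re.findall(r"\w+", text)
print(words)   # ['Order', 'A_17', 'shipped', 'on', '2024', '05', '01', 'to', 'Bob']

chunks = re.findall(r"\S+", text)
print(chunks)  # ['Order', '#A_17', 'shipped', 'on', '2024-05-01', 'to', 'Bob']

# \B matches only inside a word, so the standalone "cat" at index 0 is skipped.
inner = [m.start() for m in re.finditer(r"\Bcat", "cat scatter")]
print(inner)   # [5]

# str patterns are Unicode-aware by default; re.ASCII narrows \w to [a-zA-Z0-9_].
print(re.findall(r"\w+", "naïve café"))            # ['naïve', 'café']
print(re.findall(r"\w+", "naïve café", re.ASCII))  # ['na', 've', 'caf']
```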

5. Capturing and Non-Capturing Groups

Parentheses `()` create a capturing group, which lets you extract a specific part of the match.

```python
text = "Name: John Doe, Age: 30"
pattern = r"Name: (.*), Age: (\d+)"
match = re.search(pattern, text)
if match:
    print(f"Name: {match.group(1)}")        # Name: John Doe
    print(f"Age: {match.group(2)}")         # Age: 30
    print(f"Full match: {match.group(0)}")  # Full match: Name: John Doe, Age: 30
    print(f"All groups: {match.groups()}")  # All groups: ('John Doe', '30')
```

Use `(?:...)` to create a non-capturing group: it matches its contents but does not capture them as a separate group.

```python
text = "catdogbird"
pattern = r"(?:cat)(dog)(bird)"  # cat is a non-capturing group
match = re.search(pattern, text)
if match:
    print(match.groups())  # ('dog', 'bird')
```
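
One place where the capturing/non-capturing distinction matters is `re.findall()`, whose return value depends on how many capturing groups the pattern has. The sketch below uses a made-up date string; the variable names are illustrative only.

```python
import re

text = "2024-05-01 and 2023-12-31"

# No groups: findall returns the full matches.
full = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(full)       # ['2024-05-01', '2023-12-31']

# One capturing group: findall returns only that group's text.
years = re.findall(r"(\d{4})-\d{2}-\d{2}", text)
print(years)      # ['2024', '2023']

# Several capturing groups: findall returns one tuple per match.
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(parts)      # [('2024', '05', '01'), ('2023', '12', '31')]

# Non-capturing group: groups the pattern without changing what findall returns.
unchanged = re.findall(r"(?:\d{4})-\d{2}-\d{2}", text)
print(unchanged)  # ['2024-05-01', '2023-12-31']
```

If you need parentheses only for structure (for example around an alternation) and still want `findall()` to return whole matches, a non-capturing group is usually the simpler choice.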

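6. Greedy vs. Non-Greedy Matching

By default, quantifiers are "greedy": they match as many characters as possible. Adding `?` after a quantifier makes it "non-greedy" (or "lazy"), so it matches as few characters as possible.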

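```python
# Sample tagged input (tag names chosen for illustration)
text = "<b>Hello</b><i>World</i>"

# Greedy match
print(re.findall(r"<.*>", text))   # ['<b>Hello</b><i>World</i>']

# Non-greedy match
print(re.findall(r"<.*?>", text))  # ['<b>', '</b>', '<i>', '</i>']
```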

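7. Replacing with `re.sub()`

`re.sub(pattern, repl, string, count=0, flags=0)` replaces the substrings of `string` that match `pattern` with `repl`. With the default `count=0`, every match is replaced.

```python
text = "The price is $100."
new_text = re.sub(r"\$(\d+)", r"€\1", text)  # \1 refers to the first capturing group
print(new_text)  # The price is €100.
```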
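
Two further aspects of `re.sub()` are worth a quick sketch: the `count` parameter from the signature above, and the fact that `repl` may be a function instead of a string. The helper name `double` and the sample strings are hypothetical, chosen only for this example.

```python
import re

text = "a-b-c-d"

# count limits how many matches are replaced (0, the default, means all).
all_replaced = re.sub(r"-", "+", text)
print(all_replaced)    # a+b+c+d

first_two = re.sub(r"-", "+", text, count=2)
print(first_two)       # a+b+c-d

# repl may also be a function; it receives each match object
# and returns the replacement string. (double is a hypothetical helper.)
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)         # 6 apples and 20 pears
```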

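8. Compiling Patterns with `re.compile()`

If you need to use the same pattern many times, you can precompile it with `re.compile()` to improve efficiency. The resulting object offers the same operations as methods, such as `findall()`, `search()`, and `sub()`.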

```python
import re

compiled_pattern = re.compile(r"\b\w+\b")
text1 = "Hello world"
text2 = "Python programming"

matches1 = compiled_pattern.findall(text1)  # ['Hello', 'world']
matches2 = compiled_pattern.findall(text2)  # ['Python', 'programming']
```

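9. Common Flags

The `re` module provides several flags that modify matching behavior:

- `re.IGNORECASE` or `re.I`: ignore case.
- `re.MULTILINE` or `re.M`: `^` and `$` match at the start and end of each line, not only at the start and end of the whole string.
- `re.DOTALL` or `re.S`: make `.` match every character, including a newline.
- `re.ASCII` or `re.A`: make `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, and `\S` consider ASCII characters only.
- `re.VERBOSE` or `re.X`: allow comments and whitespace inside the pattern to make it more readable.

```python
text = "hello\nWorld"
print(re.findall(r"world", text, re.IGNORECASE))  # ['World']
print(re.findall(r".+", text, re.DOTALL))         # ['hello\nWorld']
```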
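
`re.VERBOSE` and the combination of several flags are not covered by the example above, so here is a minimal sketch of both. The names `date_pattern`, `log`, and `errors` and the sample data are assumptions made for illustration.

```python
import re

# re.VERBOSE lets the pattern span several lines with comments;
# whitespace inside the pattern is ignored.
date_pattern = re.compile(
    r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
    """,
    re.VERBOSE,
)
print(date_pattern.search("Released on 2024-05-01.").groups())  # ('2024', '05', '01')

# Flags can be combined with the | operator.
log = "error: disk full\nok: saved\nERROR: timeout"
errors = re.findall(r"^error", log, re.IGNORECASE | re.MULTILINE)
print(errors)  # ['error', 'ERROR']
```

Because whitespace is ignored in verbose mode, a literal space in such a pattern has to be written as `\s`, `[ ]`, or an escaped space.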

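10. Raw Strings

In Python string literals the backslash `\` is an escape character. In regular expressions the backslash also has a special meaning (for example `\d`). To keep Python's escaping from interfering with the regular expression's escaping, it is generally recommended to write patterns as raw strings, that is, to prefix the string with `r`.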

```python
# Wrong: Python interprets \b as a backspace character
pattern = "\bcat\b"
print(repr(pattern))  # '\x08cat\x08'

# Right: use a raw string
pattern = r"\bcat\b"
print(pattern)  # \bcat\b
```

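Summary

Regular expressions are a powerful tool for working with strings, and mastering them can greatly speed up text processing. From simple character matching to pattern grouping and substitution, the `re` module offers a rich set of features. Fluency with metacharacters, character class shorthands, quantifiers, and greedy versus non-greedy matching is the key to using regular expressions well. In practice, get plenty of hands-on experience and use an online tool such as regex101.com to test and debug your patterns.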