```python
article_content = """# An Introduction to Python Regular Expressions: From Beginner to Mastery

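A regular expression (regex or regexp for short) is a powerful text-processing tool that uses a special sequence of characters to match and manipulate strings. In Python, the built-in `re` module lets us use regular expressions to perform complex string searches, substitutions, and splits.

This article starts from the basic concepts of regular expressions and works up to advanced usage, taking you from beginner to proficient.

1. What Is a Regular Expression, and Why Use It in Python?

1.1 What Is a Regular Expression?

A regular expression describes a string pattern. You can think of regular expressions as a mini programming language dedicated to processing text. By defining a pattern, you can:
* Search: find substrings that match a specific pattern within a longer string.
* Replace: substitute matched substrings with other content.
* Split: break a string into parts wherever the pattern matches.
* Validate: check whether a string conforms to an expected format (such as an email address or a phone number).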
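
As a rough sketch of these four operations side by side (the sample sentence and variable names are made up for illustration, and the date pattern only checks the shape of a date, not whether it is a real calendar date):

```python
import re

text = "Order 66 shipped on 2024-01-15; order 67 shipped on 2024-02-03."
date_pattern = r"\d{4}-\d{2}-\d{2}"

# Search: locate the first date-shaped substring.
first_date = re.search(date_pattern, text).group()

# Replace: mask every date with a placeholder.
masked = re.sub(date_pattern, "<date>", text)

# Split: break the text into clauses on a semicolon plus optional spaces.
clauses = re.split(r";\s*", text)

# Validate: check that an entire string has the expected shape.
is_date = re.fullmatch(date_pattern, "2024-01-15") is not None

print(first_date)
print(masked)
print(clauses)
print(is_date)
```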

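1.2 Why Use Regular Expressions in Python?

Python's `re` module provides regular expression operations similar to those found in Perl. The advantages of using regular expressions are:
* Flexibility and power: they handle complex matching requirements that go well beyond simple string methods such as `str.find()` and `str.replace()`.
* Efficiency: for pattern-heavy text processing, a regular expression is often faster than hand-written string-handling code, although plain string methods are usually quicker for simple fixed-substring operations.
* Conciseness: complex matching logic can often be expressed in a single line.

2. Basic Concepts and Syntax

A regular expression is made up of ordinary characters (such as letters and digits) and special characters (called "metacharacters").

2.1 Literal Matching

The simplest match is a literal match, which matches exactly the same characters in the string.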

```python
import re

text = "Hello, Python regex!"
match = re.search("Python", text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: Python
```

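2.2 Metacharacters

Metacharacters are characters with a special meaning; they give regular expressions their pattern-matching power.

| Metacharacter | Description | Example | Matches |
| --- | --- | --- | --- |
| `.` | Matches any single character except the newline `\\n`. | `a.b` | `acb`, `a#b`, `a b` |
| `^` | Matches the start of the string. | `^Hello` | `Hello world` |
| `$` | Matches the end of the string (or just before a trailing newline). | `world$` | `Hello world` |
| `*` | Matches the preceding element zero or more times. | `a*b` | `b`, `ab`, `aaab` |
| `+` | Matches the preceding element one or more times. | `a+b` | `ab`, `aaab` |
| `?` | Matches the preceding element zero or one time. | `a?b` | `b`, `ab` |
| `{m}` | Matches the preceding element exactly `m` times. | `a{3}b` | `aaab` |
| `{m,n}` | Matches the preceding element between `m` and `n` times. | `a{1,3}b` | `ab`, `aab`, `aaab` |
| `[]` | Character set: matches any one character inside the brackets. | `[aeiou]` | any vowel |
| `[^]` | Negated character set: matches any one character not inside the brackets. | `[^0-9]` | any non-digit character |
| `\|` | Alternation: matches the expression before or after the `\|`. | `cat\|dog` | `cat` or `dog` |
| `()` | Grouping: groups an expression and captures the text it matches. | `(ab)+` | `ab`, `abab` |
| `\` | Escape character: turns a special character into a literal, or starts a special sequence. | `\.` | a literal `.` |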

Example:

```python
text = "The price is $12.99 and $5.00."

# Match a string that starts with 'The'
match_start = re.search(r"^The", text)
if match_start:
    print(f"Start match: {match_start.group()}")

# Match a string that ends with '00.'
match_end = re.search(r"00\.$", text)  # Note: the . must be escaped
if match_end:
    print(f"End match: {match_end.group()}")

# Match one or more digits
matches_numbers = re.findall(r"\d+", text)
print(f"Numbers: {matches_numbers}")  # Output: Numbers: ['12', '99', '5', '00']

# Match any single vowel
vowels = re.findall(r"[aeiou]", "hello world")
print(f"Vowels: {vowels}")  # Output: Vowels: ['e', 'o', 'o']
```

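2.3 Special Sequences (Shorthands)

Python regular expressions provide convenient special sequences for matching common character types. For `str` patterns these sequences also match the corresponding Unicode characters by default, so the bracketed equivalents below hold exactly only when `re.ASCII` is set.

| Sequence | Description | Equivalent to (with `re.ASCII`) |
| --- | --- | --- |
| `\d` | Matches any digit (0-9). | `[0-9]` |
| `\D` | Matches any non-digit character. | `[^0-9]` |
| `\w` | Matches any word character (letter, digit, or underscore). | `[a-zA-Z0-9_]` |
| `\W` | Matches any non-word character. | `[^a-zA-Z0-9_]` |
| `\s` | Matches any whitespace character (space, tab, newline, and so on). | `[ \t\n\r\f\v]` |
| `\S` | Matches any non-whitespace character. | `[^ \t\n\r\f\v]` |
| `\b` | Matches a word boundary. | (for example, `\bcat\b` matches "cat" as a standalone word) |
| `\B` | Matches a position that is not a word boundary. | |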

Example:

```python
text = "The quick brown fox jumps over the lazy dog. My email is [email protected]."

# Match all runs of word characters
words = re.findall(r"\w+", text)
print(f"Word characters: {words}")

# Match all digits
digits = re.findall(r"\d", text)
print(f"Digits: {digits}")

# Match 'fox' as a whole word using word boundaries
fox_match = re.search(r"\bfox\b", text)
if fox_match:
    print(f"Found the word 'fox': {fox_match.group()}")
```
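
The table lists `\B` without an example, so here is a small sketch contrasting it with `\b`. The sample sentence is invented for illustration.

```python
import re

text = "cat concatenate bobcat cat."

# \bcat\b: "cat" with a word boundary on both sides (a standalone word).
whole_words = re.findall(r"\bcat\b", text)

# \Bcat\B: "cat" with word characters on both sides (buried inside a word).
inner_spans = [m.span() for m in re.finditer(r"\Bcat\B", text)]

print(whole_words)
print(inner_spans)
```

Only the "cat" inside "concatenate" satisfies `\Bcat\B`; the one in "bobcat" is followed by a space, which is a word boundary.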

3. Common Functions in the `re` Module

Python's `re` module provides several functions for performing regular expression operations.

3.1 `re.search(pattern, string, flags=0)`

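Scans the whole string for the first location where the pattern matches and returns a Match object. Returns `None` if no match is found.

```python
text = "The rain in Spain."
match = re.search(r"Spain", text)
if match:
    print(f"search found: {match.group()}")
else:
    print("search found nothing")
```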

3.2 `re.match(pattern, string, flags=0)`

Tries to match the pattern at the beginning of the string. On success it returns a Match object; otherwise it returns `None`. Unlike `search()`, `match()` only looks for a match at the very start of the string.

```python
text = "The rain in Spain."
match_start = re.match(r"The", text)
if match_start:
    print(f"match found: {match_start.group()}")  # Output: match found: The

match_anywhere = re.match(r"rain", text)
if match_anywhere:
    print(f"match found: {match_anywhere.group()}")
else:
    print("match found nothing (because 'rain' is not at the start)")  # This branch runs
```

3.3 `re.findall(pattern, string, flags=0)`

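Finds all non-overlapping matches in the string and returns them as a list of strings. If the pattern contains capturing groups, the list holds the group contents instead of the full matches.

```python
text = "Cats are cute, dogs are friendly. Cats and dogs."
matches = re.findall(r"cat|dog", text, re.IGNORECASE)  # re.IGNORECASE ignores case
print(f"findall found: {matches}")  # Output: findall found: ['Cat', 'dog', 'Cat', 'dog']
```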
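
The shape of the list returned by `re.findall()` depends on how many capturing groups the pattern has, which is a common source of surprises. A minimal sketch, using an invented sample string:

```python
import re

text = "width=640, height=480"

# No groups: each item is the full match.
full = re.findall(r"\w+=\d+", text)

# One group: each item is only that group's text.
values = re.findall(r"\w+=(\d+)", text)

# Several groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(\w+)=(\d+)", text)

# A non-capturing group (?:...) groups without changing the result shape.
grouped = re.findall(r"(?:width|height)=\d+", text)

print(full)
print(values)
print(pairs)
print(grouped)
```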

3.4 `re.finditer(pattern, string, flags=0)`

Like `findall()`, but returns an iterator whose items are Match objects. This is useful when you need more information about each match, such as its position.

```python
text = "Colors: red, green, blue, yellow."
for match in re.finditer(r"\b\w{4}\b", text):  # Match every word of exactly 4 characters
    print(f"finditer found: {match.group()} at position {match.span()}")

# Output:
# finditer found: blue at position (20, 24)
```

3.5 `re.sub(pattern, repl, string, count=0, flags=0)`

Replaces every part of the string that matches `pattern` with `repl`. The `count` argument limits the number of replacements; the default of 0 replaces all matches.

```python
text = "I like apples and oranges. Apples are healthy."
new_text = re.sub(r"Apples", "Bananas", text, count=1)
print(f"sub, one replacement: {new_text}")  # Output: sub, one replacement: I like apples and oranges. Bananas are healthy.

new_text_all = re.sub(r"apples|oranges", "berries", text, flags=re.IGNORECASE)
print(f"sub, replace all: {new_text_all}")  # Output: sub, replace all: I like berries and berries. berries are healthy.
```
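
The `repl` argument can also refer back to captured groups or be a function. The sketch below shows both, along with `re.subn()`, which additionally reports the number of replacements. The helper name `double` and the sample strings are illustrative only.

```python
import re

# Backreferences: \1, \2, \3 in repl refer to the captured groups.
iso_date = "2023-10-26"
us_date = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", iso_date)

# Callable repl: called once per match with the Match object,
# and its return value is used as the replacement text.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 12 pears")

# re.subn returns the new string and the number of replacements made.
collapsed, replacements = re.subn(r"\s+", " ", "too   many    spaces")

print(us_date)
print(doubled)
print(collapsed, replacements)
```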

3.6 `re.split(pattern, string, maxsplit=0, flags=0)`

Splits the string into a list at each separator matched by `pattern`. The `maxsplit` argument limits the number of splits.

```python
text = "one,two;three-four"
parts = re.split(r"[,;-]", text)
print(f"split result: {parts}")  # Output: split result: ['one', 'two', 'three', 'four']
```
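
Two variations are worth a quick sketch: limiting the number of splits with `maxsplit`, and keeping the separators by wrapping the pattern in a capturing group.

```python
import re

text = "one,two;three-four"

# maxsplit=2: split at most twice; the rest of the string stays together.
first_two = re.split(r"[,;-]", text, maxsplit=2)

# A capturing group around the separator keeps the separators in the result.
with_separators = re.split(r"([,;-])", text)

print(first_two)
print(with_separators)
```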

3.7 `re.compile(pattern, flags=0)`

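Compiles a regular expression pattern into a pattern object (`re.Pattern`). Compiling once and reusing the object can be more efficient when the same expression is used many times in a program, although the module also caches recently used patterns.

```python
compiled_pattern = re.compile(r"\bword\b", re.IGNORECASE)
text1 = "This is a word."
text2 = "Another Word here."

match1 = compiled_pattern.search(text1)
match2 = compiled_pattern.search(text2)

if match1:
    print(f"Compiled pattern match 1: {match1.group()}")
if match2:
    print(f"Compiled pattern match 2: {match2.group()}")
```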

4. Match Objects

When `re.search()` or `re.match()` finds a match, it returns a Match object containing details about the match.

```python
text = "My name is Alice and my age is 30."
match = re.search(r"My name is (\w+) and my age is (\d+)\.", text)

if match:
    print(f"Full match: {match.group(0)}")  # The entire matched string
    print(f"First capture group (name): {match.group(1)}")  # Alice
    print(f"Second capture group (age): {match.group(2)}")  # 30
    print(f"All capture groups: {match.groups()}")  # ('Alice', '30')
    print(f"Match start: {match.start()}")  # 0
    print(f"Match end: {match.end()}")  # 34
    print(f"Match span: {match.span()}")  # (0, 34)
```

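5. Advanced Topics

5.1 Match Flags

Flags change how a regular expression matches. They are passed as an optional argument to the `re` module functions.

| Flag | Short form | Description |
| --- | --- | --- |
| `re.IGNORECASE` | `re.I` | Performs case-insensitive matching. |
| `re.MULTILINE` | `re.M` | Makes `^` match at the start of each line and `$` at the end of each line (not only at the start and end of the string). |
| `re.DOTALL` | `re.S` | Makes `.` match any character, including a newline. |
| `re.VERBOSE` | `re.X` | Lets you write more readable regular expressions by ignoring whitespace and allowing comments in the pattern. |
| `re.ASCII` | `re.A` | Makes `\w`, `\b`, `\s`, and `\d` match ASCII characters only. |
| `re.UNICODE` | `re.U` | Makes `\w`, `\b`, `\s`, and `\d` match Unicode characters (the default for `str` patterns). |

Example: `re.IGNORECASE` and `re.MULTILINE`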

```python
text = "hello\nWorld"

# By default, ^ matches only at the start of the string
match1 = re.search(r"^world", text)
print(f"Default match: {match1}")  # None

# With re.MULTILINE, ^ matches at the start of each line
match2 = re.search(r"^World", text, re.MULTILINE)
print(f"Multiline match: {match2.group()}")  # World

# Combine several flags
match3 = re.search(r"^world", text, re.MULTILINE | re.IGNORECASE)
print(f"Multiline, case-insensitive match: {match3.group()}")  # World
```
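
Two of the other flags from the table, `re.DOTALL` and `re.ASCII`, can be sketched in the same way. The sample strings are invented, and the second one uses Arabic-Indic digits (written as Unicode escapes) to show the default Unicode behavior of `\d`.

```python
import re

text = "first line\nsecond line"

# By default . stops at a newline; re.DOTALL lets it cross lines.
default_dot = re.search(r"first.*second", text)
dotall_dot = re.search(r"first.*second", text, re.DOTALL)

# For str patterns, \d matches any Unicode decimal digit by default;
# re.ASCII restricts it to 0-9.
mixed = "42 and \u0664\u0662"  # the second number is written in Arabic-Indic digits
unicode_digits = re.findall(r"\d+", mixed)
ascii_digits = re.findall(r"\d+", mixed, re.ASCII)

print(default_dot)
print(repr(dotall_dot.group()))
print(len(unicode_digits), len(ascii_digits))
```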

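5.2 Raw Strings (`r''`)

In Python, it is best to define regular expression patterns as raw strings (prefixed with `r`, as in `r"your\regex"`). This avoids having to escape every backslash twice. For example, `\n` in a normal string is a newline character, but in a raw string it is the two literal characters `\` and `n`.

```python
# Normal string: the backslash must be escaped
pattern_normal = "\\d+"  # Matches one or more digits
print(re.search(pattern_normal, "abc123def").group())  # 123

# Raw string: shorter and clearer
pattern_raw = r"\d+"
print(re.search(pattern_raw, "abc456ghi").group())  # 456
```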
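
A forgotten `r` prefix can fail silently rather than raise an error. The sketch below shows this with `\b`, which Python itself interprets as a backspace character in a normal string. The sample text is made up.

```python
import re

text = "a cat sat"

# In a normal string, "\b" is the backspace character (\x08), so this
# pattern looks for backspace + "cat" + backspace and finds nothing.
wrong = re.search("\bcat\b", text)

# In a raw string, the backslash reaches the regex engine, where \b
# means a word boundary.
right = re.search(r"\bcat\b", text)

# The same difference, measured in characters.
lengths = (len("\n"), len(r"\n"))

print(wrong)
print(right.group())
print(lengths)
```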

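5.3 Greedy vs. Non-Greedy Quantifiers

Quantifiers (`*`, `+`, `?`, `{m,n}`) are greedy by default: they match as many characters as possible. Adding `?` after a quantifier makes it non-greedy (or lazy), so it matches as few characters as possible.

| Non-greedy quantifier | Description |
| --- | --- |
| `*?` | Zero or more times, non-greedy |
| `+?` | One or more times, non-greedy |
| `??` | Zero or one time, non-greedy |
| `{m,n}?` | Between m and n times, non-greedy |

Example: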

```python
html = "<p>This is a paragraph.</p><p>Another one.</p>"

# Greedy: matches everything from the first <p> to the last </p>
greedy_match = re.search(r"<p>.*</p>", html)
print(f"Greedy match: {greedy_match.group()}")
# Output: Greedy match: <p>This is a paragraph.</p><p>Another one.</p>

# Non-greedy: matches from <p> to the first </p>
nongreedy_match = re.search(r"<p>.*?</p>", html)
print(f"Non-greedy match: {nongreedy_match.group()}")
# Output: Non-greedy match: <p>This is a paragraph.</p>
```

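5.4 Lookarounds

Lookarounds are zero-width assertions: they match a position rather than actual characters, and they do not consume any characters from the string.

Lookahead:
- `(?=...)` positive lookahead: matches a position that is immediately followed by `...`.
- `(?!...)` negative lookahead: matches a position that is not immediately followed by `...`.
Lookbehind (in Python's `re` module the pattern inside must have a fixed width):
- `(?<=...)` positive lookbehind: matches a position that is immediately preceded by `...`.
- `(?<!...)` negative lookbehind: matches a position that is not immediately preceded by `...`.

Example: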

```python
text = "I have 10 apples and 5 oranges."

# Match the number that is followed by 'apples'
match_apples = re.search(r"\d+(?=\s*apples)", text)
if match_apples:
    print(f"Number before 'apples': {match_apples.group()}")  # 10

# Match the number that is preceded by 'have'
# (a lookbehind must be fixed-width, so \s is used rather than \s*)
match_after_have = re.search(r"(?<=have\s)\d+", text)
if match_after_have:
    print(f"Number after 'have': {match_after_have.group()}")  # 10

# Match numbers that are not followed by 'oranges'
match_not_oranges = re.findall(r"\b\d+\b(?!\s*oranges)", text)
print(f"Numbers not followed by 'oranges': {match_not_oranges}")  # ['10']
```
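
Because lookarounds match positions, they can be combined with `re.sub()` to insert text without removing anything. One possible sketch, assuming the input is a plain string of digits (the function name `add_thousands_separators` is hypothetical):

```python
import re

def add_thousands_separators(digits):
    # Insert a comma at every position that has a digit before it and
    # a whole number of three-digit groups between it and the end.
    return re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", digits)

print(add_thousands_separators("1234567"))
print(add_thousands_separators("1000"))
print(add_thousands_separators("999"))
```

The pattern itself matches an empty string, so the substitution only adds commas; no digits are consumed.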

5.5 Named Groups

The `(?P<name>...)` syntax gives a capturing group a name, so you can access its contents by name rather than by index, which makes code more readable.

```python
text = "Name: Alice, Age: 30"
match = re.search(r"Name: (?P<name>\w+), Age: (?P<age>\d+)", text)

if match:
    print(f"Access by name - name: {match.group('name')}")  # Alice
    print(f"Access by name - age: {match.group('age')}")  # 30
    print(f"All named groups: {match.groupdict()}")  # {'name': 'Alice', 'age': '30'}
```
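
Named groups can also be referenced inside the pattern with `(?P=name)` and inside a replacement string with `\g<name>`. A short sketch with invented sample strings and group names:

```python
import re

# (?P=word) matches the same text that the group named "word" captured,
# which makes it easy to spot an accidentally repeated word.
repeated = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "this is is a test")

# \g<name> refers to a named group inside the replacement string.
swapped = re.sub(
    r"(?P<first>\w+) (?P<last>\w+)",
    r"\g<last>, \g<first>",
    "Alice Smith",
)

print(repeated.group("word"))
print(swapped)
```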

6. Practical Examples

6.1 Validating Email Addresses

```python
def validate_email(email):
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    if re.fullmatch(pattern, email):  # fullmatch ensures the whole string matches
        return True
    return False

print(f"'[email protected]' is a valid email: {validate_email('[email protected]')}")  # True
print(f"'invalid-email' is a valid email: {validate_email('invalid-email')}")  # False
```

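6.2 Extracting Phone Numbers

Assume phone numbers are formatted as (XXX) XXX-XXXX or XXX-XXX-XXXX.

```python
phone_text = "Call me at (123) 456-7890 or 987-654-3210."
# \b sits inside the second alternative only: there is no word boundary before "("
pattern = r"(?:\(\d{3}\)\s*|\b\d{3}-)\d{3}-\d{4}\b"
phone_numbers = re.findall(pattern, phone_text)
print(f"Extracted phone numbers: {phone_numbers}")  # ['(123) 456-7890', '987-654-3210']
```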

6.3 Parsing Log Files

Assume each log line contains a timestamp, a level, and a message.

```python
log_line = "2023-10-26 14:30:05 INFO User 'admin' logged in."
pattern = r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s(?P<level>\w+)\s(?P<message>.*)$"
match = re.search(pattern, log_line)

if match:
    print(f"Timestamp: {match.group('timestamp')}")
    print(f"Level: {match.group('level')}")
    print(f"Message: {match.group('message')}")
```
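
For more than one line, the same pattern can be compiled once and applied in a loop. This is only a sketch: the sample lines and the names `LOG_PATTERN` and `parse_log_lines` are hypothetical, and lines that do not fit the format are simply skipped.

```python
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s(?P<level>\w+)\s(?P<message>.*)$"
)

# Hypothetical sample lines in the same format as the single-line example.
log_lines = [
    "2023-10-26 14:30:05 INFO User 'admin' logged in.",
    "2023-10-26 14:31:12 ERROR Disk quota exceeded.",
    "not a log line",
    "2023-10-26 14:32:40 INFO User 'admin' logged out.",
]

def parse_log_lines(lines):
    """Return one dict per line that matches the log format."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:  # skip lines that do not follow the format
            records.append(match.groupdict())
    return records

records = parse_log_lines(log_lines)
level_counts = Counter(record["level"] for record in records)

print(len(records))
print(level_counts)
```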

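7. Best Practices and Tips

* **Start simple:** If a pattern is complex, begin by matching a simple part of the string, then add metacharacters and quantifiers step by step.
* **Test your regular expressions:** Use an online regex tester (such as regex101.com or regextester.com) to test and debug your patterns.
* **Use raw strings `r''`:** This avoids backslash-escaping problems and keeps your regular expressions clear.
* **Use `re.VERBOSE` for complex patterns:** With the `re.VERBOSE` flag you can add whitespace and comments inside a regular expression to make it easier to read.
```python
# Without VERBOSE
pattern1 = r"^(\w+\s\w+),\s(\d{4}-\d{2}-\d{2})$"

# With VERBOSE, easier to read
pattern2 = re.compile(r'''
    ^                      # Match the start of the string
    (\w+\s\w+)             # Capture the name, made up of two words
    ,\s                    # A comma and a whitespace character
    (\d{4}-\d{2}-\d{2})    # Capture the date
    $                      # Match the end of the string
''', re.VERBOSE)
```
* **Balance power with readability:** Regular expressions are very powerful, but overly complex, hard-to-read patterns can make code difficult to maintain. In some cases, simple string methods or a few lines of Python are clearer.
* **`re.fullmatch()` vs `re.match()` vs `re.search()`:**
    * `re.fullmatch()`: succeeds only if the entire string matches the pattern. Use it for strict string validation.
    * `re.match()`: succeeds only if the beginning of the string matches the pattern.
    * `re.search()`: scans the whole string and finds the first match.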
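
The difference between the three functions is easiest to see by running one pattern against a few inputs. A minimal sketch, with an arbitrary three-digit pattern and invented candidate strings:

```python
import re

pattern = r"\d{3}"
candidates = ["123", "123abc", "abc123"]

# For each candidate, record whether fullmatch, match, and search succeed.
results = {
    s: (
        re.fullmatch(pattern, s) is not None,
        re.match(pattern, s) is not None,
        re.search(pattern, s) is not None,
    )
    for s in candidates
}

for s, (full, start, anywhere) in results.items():
    print(f"{s!r}: fullmatch={full}, match={start}, search={anywhere}")
```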

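8. Summary

Python's `re` module offers extensive capabilities for string processing. By mastering metacharacters, quantifiers, special sequences, and the functions of the `re` module, you will be able to handle a wide range of text matching and manipulation tasks efficiently. From simple searches to complex pattern extraction and validation, regular expressions are an essential part of your toolbox.

Keep practicing and try solving different text-processing problems, and you will soon be comfortable with regular expressions!"""

write_file(
    file_path="python_regex_tutorial.md",
    content=article_content
)
```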
