Introduction
Regular expressions are a compact way to describe text patterns. Python’s re module provides Perl-style regular expressions (similar to, but not fully compatible with, PCRE) and is useful for log parsing, data cleaning, validation, and more.
Basic Patterns
Common Metacharacters
| Pattern | Meaning | Example |
|---|---|---|
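| . | Any character except newline (unless re.DOTALL) | a.c → “abc”, “a1c” |
| \d | Digit ([0-9] in ASCII mode; any Unicode digit by default) | \d{3} → “123” |
| \w | Word character ([a-zA-Z0-9_] in ASCII mode) | \w+ → “hello_42” |
| \s | Whitespace | \s+ → “ ”, “\t” |
| ^ / $ | Start / end of string (of each line with re.MULTILINE) | ^Hello$ |
| * / + / ? | 0 or more, 1 or more, 0 or 1 repetitions | ab*c → “ac”, “abc” |
| {n,m} | n to m repetitions | \d{2,4} → “12”, “1234” |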
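The table entries can be checked directly. The following is a minimal sketch that runs a few of them through re.fullmatch; the sample strings and the `checks` list are illustrative choices, not part of the table.

```python
import re

# Each entry: (pattern, sample text, whether the whole sample should match)
checks = [
    (r'a.c', 'a1c', True),
    (r'a.c', 'a\nc', False),       # "." does not match a newline by default
    (r'\d{3}', '123', True),
    (r'\w+', 'hello_42', True),
    (r'\s+', ' \t', True),
    (r'ab*c', 'ac', True),         # "*" allows zero repetitions of "b"
    (r'\d{2,4}', '12345', False),  # five digits exceed the {2,4} upper bound
]

results = []
for pattern, sample, expected in checks:
    matched = re.fullmatch(pattern, sample) is not None
    results.append(matched == expected)
    print(f'{pattern!r:12} {sample!r:12} {matched}')
```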
Basic Operations
import re
text = "2026-02-26 Error: Connection timeout (retry: 3)"
# match: succeeds only if the pattern matches at the start of the string
m = re.match(r'\d{4}-\d{2}-\d{2}', text)
print(m.group()) # "2026-02-26"
# search: first match anywhere in the string
m = re.search(r'retry: (\d+)', text)
print(m.group(1)) # "3"
# findall: all matches as list
numbers = re.findall(r'\d+', text)
print(numbers) # ['2026', '02', '26', '3']
# sub: replacement
cleaned = re.sub(r'\d{4}-\d{2}-\d{2}', '[DATE]', text)
print(cleaned) # "[DATE] Error: Connection timeout (retry: 3)"
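The difference between these functions is easiest to see side by side. The sketch below reuses the same sample text and also shows re.fullmatch and re.finditer, two related functions from the re module.

```python
import re

text = "2026-02-26 Error: Connection timeout (retry: 3)"

# match only tries position 0, so a pattern that occurs later is not found
at_start = re.match(r'Error', text)
anywhere = re.search(r'Error', text)
print(at_start)           # None
print(anywhere.start())   # 11

# fullmatch requires the pattern to cover the entire string
whole = re.fullmatch(r'\d{4}-\d{2}-\d{2}', text)
date_only = re.fullmatch(r'\d{4}-\d{2}-\d{2}', text[:10])
print(whole)                  # None
print(date_only.group())      # 2026-02-26

# finditer yields match objects, so positions are available as well as text
spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(spans)
```

finditer is usually the better choice than findall when the position of each match matters or when the input is large, since it produces matches lazily.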
Groups and Capturing
Named Groups
log = "2026-02-26 14:30:45 [ERROR] Database connection failed"
pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.+)'
m = re.match(pattern, log)
if m:
    print(m.group('date'))     # "2026-02-26"
    print(m.group('level'))    # "ERROR"
    print(m.group('message'))  # "Database connection failed"
    print(m.groupdict())       # {'date': '2026-02-26', 'time': '14:30:45', ...}
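Named groups are also handy across several lines and inside replacement strings. The sketch below assumes a small multi-line log (the second and third lines are made-up sample data) and uses the \g<name> syntax to reorder fields.

```python
import re

log_pattern = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] (?P<message>.+)'
)

# Hypothetical sample input; only the first line comes from the example above
logs = """2026-02-26 14:30:45 [ERROR] Database connection failed
2026-02-26 14:31:02 [INFO] Retrying connection
not a log line"""

# Collect one dict per matching line; lines that do not match are skipped
records = [m.groupdict() for line in logs.splitlines()
           if (m := log_pattern.match(line))]
levels = [r['level'] for r in records]
print(levels)   # ['ERROR', 'INFO']

# \g<name> refers to a named group inside a replacement string
first_line = logs.splitlines()[0]
reordered = log_pattern.sub(r'[\g<level>] \g<date>T\g<time> \g<message>', first_line)
print(reordered)   # [ERROR] 2026-02-26T14:30:45 Database connection failed
```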
Non-Capturing Groups
# (?:...) groups without capturing
pattern = r'(?:https?|ftp)://[\w./\-]+'
urls = re.findall(pattern, "Visit https://example.com or ftp://files.example.com")
print(urls) # ['https://example.com', 'ftp://files.example.com']
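The reason non-capturing groups matter here is that re.findall returns the captured group instead of the whole match whenever the pattern contains a capturing group. A short comparison on the same input:

```python
import re

text = "Visit https://example.com or ftp://files.example.com"

# With a capturing group, findall returns the group's text, not the whole match
schemes = re.findall(r'(https?|ftp)://[\w./\-]+', text)
print(schemes)   # ['https', 'ftp']

# With a non-capturing group, findall returns the whole match
urls = re.findall(r'(?:https?|ftp)://[\w./\-]+', text)
print(urls)      # ['https://example.com', 'ftp://files.example.com']
```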
Lookahead and Lookbehind
Positive / Negative Lookahead
# Positive lookahead: (?=...)
passwords = ["abc123", "password", "Str0ng!Pass", "12345"]
strong = [p for p in passwords
          if re.match(r'(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}', p)]
print(strong) # ['Str0ng!Pass']
# Negative lookahead: (?!...)
lines = ["test_func", "main_func", "test_class", "helper"]
non_test = [l for l in lines if re.match(r'(?!test)\w+', l)]
print(non_test) # ['main_func', 'helper']
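Because lookaheads consume no characters, they can also mark positions rather than text. As an illustration (the helper name `add_thousands_separators` is made up for this sketch), the following inserts commas into a run of digits:

```python
import re

def add_thousands_separators(digits: str) -> str:
    # \B            : a position inside the number (not at its start or end)
    # (?=(?:\d{3})+ : followed by one or more complete groups of three digits
    # (?!\d))       : and then no further digit
    return re.sub(r'\B(?=(?:\d{3})+(?!\d))', ',', digits)

print(add_thousands_separators("1234567"))   # 1,234,567
print(add_thousands_separators("999"))       # 999
```

The pattern matches an empty string at each insertion point, so re.sub only adds characters and never removes any.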
Positive / Negative Lookbehind
# Positive lookbehind: (?<=...)
text = "Price: $100, Tax: $8, Total: $108"
amounts = re.findall(r'(?<=\$)\d+', text)
print(amounts) # ['100', '8', '108']
# Negative lookbehind: (?<!...)
# Match versions that are not immediately preceded by "-" (a pre-release prefix)
text = "v1.0 beta-v2.0 v3.0 rc-v4.0"
stable = re.findall(r'(?<!-)v[\d.]+', text)
print(stable) # ['v1.0', 'v3.0']
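When a pre-release tag follows the version (as in v2.0-beta), a negative lookahead is the natural tool, but a bare (?!-) after a greedy quantifier has a pitfall: the engine can backtrack to a shorter match that satisfies the lookahead. The sketch below shows the problem and one possible stricter pattern.

```python
import re

text = "v1.0 v2.0-beta v3.0 v4.0-rc1"

# Naive: the engine backtracks from "v2.0" to "v2.", which is not followed by "-"
naive = re.findall(r'v[\d.]+(?!-)', text)
print(naive)    # ['v1.0', 'v2.', 'v3.0', 'v4.']

# Stricter: also forbid a digit or dot after the match, so it cannot be shortened
strict = re.findall(r'v\d+(?:\.\d+)*(?![\d.-])', text)
print(strict)   # ['v1.0', 'v3.0']
```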
Practical Patterns
Email Extraction
text = "Contact us at info@example.com or support@test.co.jp"
pattern = r'[\w.+-]+@[\w-]+\.[\w.]+'
emails = re.findall(pattern, text)
print(emails) # ['info@example.com', 'support@test.co.jp']
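One caveat with this pattern: the trailing [\w.]+ accepts dots, so an address at the end of a sentence picks up the period. The sketch below uses a made-up sentence to show the effect and one way to tighten the domain part. It is still an extraction heuristic, not a full address validator.

```python
import re

text = "Write to info@example.com. Or try support@test.co.jp!"

# Original-style pattern: the final [\w.]+ also swallows the sentence's period
loose = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', text)
print(loose)   # ['info@example.com.', 'support@test.co.jp']

# Require every dot in the domain to be followed by at least one label character
tight = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
print(tight)   # ['info@example.com', 'support@test.co.jp']
```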
Safe CSV Splitting
# Split on commas but ignore commas inside quotes
line = 'John,"Doe, Jr.",30,"New York, NY"'
pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'
fields = re.split(pattern, line)
print(fields) # ['John', '"Doe, Jr."', '30', '"New York, NY"']
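The lookahead works by checking that an even number of quotes remains after each comma, so it assumes balanced quotes and leaves the surrounding quotes in the fields. For real CSV data, the standard library's csv module is usually the safer choice; this is a different technique from the regex above, shown for comparison.

```python
import csv

line = 'John,"Doe, Jr.",30,"New York, NY"'

# csv.reader accepts any iterable of lines; here a one-element list
fields = next(csv.reader([line]))
print(fields)   # ['John', 'Doe, Jr.', '30', 'New York, NY']

# Doubled quotes inside a quoted field are unescaped as well
escaped = next(csv.reader(['"say ""hi""",x']))
print(escaped)  # ['say "hi"', 'x']
```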
IPv4 Validation
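def is_valid_ipv4(ip):
    # Each octet must be 0-255; re.ASCII keeps \d from matching non-ASCII digits
    pattern = r'(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)'
    # fullmatch rejects a trailing newline, which '$' alone would tolerate
    return bool(re.fullmatch(pattern, ip, re.ASCII))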
print(is_valid_ipv4("192.168.1.1")) # True
print(is_valid_ipv4("256.1.1.1")) # False
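A subtlety when validating with ^...$: without re.MULTILINE, $ matches at the end of the string and also just before a final newline. The sketch below (the names `octet` and `ipv4` are just helpers for this example) compares $, \Z, and re.fullmatch on input with a trailing newline.

```python
import re

octet = r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)'
ipv4 = rf'(?:{octet}\.){{3}}{octet}'

candidate = "192.168.1.1\n"   # e.g. a line read from a file without stripping

with_dollar = bool(re.match(rf'^{ipv4}$', candidate))
with_z = bool(re.match(rf'^{ipv4}\Z', candidate))
with_fullmatch = bool(re.fullmatch(ipv4, candidate))

print(with_dollar)     # True: "$" matches before the final newline
print(with_z)          # False: \Z matches only at the very end
print(with_fullmatch)  # False
```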
Performance Optimization
Compiled Patterns
Pre-compile patterns used repeatedly:
# Less efficient: re.search looks the pattern up in the module's cache on every call
for line in lines:
    re.search(r'\d{4}-\d{2}-\d{2}', line)
# More efficient: compile once and reuse the pattern object
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
for line in lines:
    date_pattern.search(line)
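How much this saves depends on the workload, so it is worth measuring. Here is a rough timeit sketch, assuming a synthetic list of 3,000 lines; the absolute numbers will vary by machine and Python version, so none are shown.

```python
import re
import timeit

# Synthetic input for the measurement
lines = ["2026-02-26 started", "no date here", "finished 2026-02-27"] * 1000
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

def with_module_function():
    return [bool(re.search(r'\d{4}-\d{2}-\d{2}', line)) for line in lines]

def with_compiled_pattern():
    return [bool(date_pattern.search(line)) for line in lines]

# Both approaches give the same answers; only the per-call overhead differs
same_results = with_module_function() == with_compiled_pattern()

print("module function: ", timeit.timeit(with_module_function, number=20))
print("compiled pattern:", timeit.timeit(with_compiled_pattern, number=20))
```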
Greedy vs Non-Greedy
html = '<div>Hello</div><div>World</div>'
# Greedy (default): longest match
print(re.findall(r'<div>.*</div>', html))
# ['<div>Hello</div><div>World</div>']
# Non-greedy: shortest match (add ?)
print(re.findall(r'<div>.*?</div>', html))
# ['<div>Hello</div>', '<div>World</div>']
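When the content between the tags cannot itself contain a tag, a negated character class is an alternative to the non-greedy quantifier. A minimal sketch on the same input, with a capturing group added to pull out the inner text:

```python
import re

html = '<div>Hello</div><div>World</div>'

# [^<]* cannot run past the next "<", so the match stops at the first closing tag
inner = re.findall(r'<div>([^<]*)</div>', html)
print(inner)   # ['Hello', 'World']
```

Neither form copes with nested tags. For real HTML, a parser is the more robust tool.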
Avoiding Catastrophic Backtracking
# Dangerous: catastrophic backtracking (ReDoS)
# re.match(r'(a+)+b', 'a' * 30) # extremely slow
# Safe: simplified pattern
# re.match(r'a+b', 'a' * 30) # instant
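The blow-up can be observed safely with short inputs. The sketch below (the helper `time_match` is made up for this example) times both patterns on strings that cannot match. The nested pattern's time should grow roughly exponentially with n while the simple one stays flat; exact timings depend on the machine, so none are shown.

```python
import re
import time

def time_match(pattern, text):
    # Return the match result together with the elapsed wall-clock time
    start = time.perf_counter()
    result = re.match(pattern, text)
    return result, time.perf_counter() - start

timings = []
for n in (10, 14, 18):
    text = 'a' * n   # no trailing "b", so both patterns must fail
    nested, slow = time_match(r'(a+)+b', text)
    simple, fast = time_match(r'a+b', text)
    timings.append((n, nested, simple))
    print(f'n={n}: nested {slow:.5f}s, simple {fast:.5f}s')
```

On Python 3.11 and later, possessive quantifiers (such as a++) and atomic groups ((?>...)) are another way to prevent this kind of backtracking.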
Key re Module Flags
| Flag | Description |
|---|---|
| re.IGNORECASE (re.I) | Case-insensitive matching |
| re.MULTILINE (re.M) | ^ / $ match at the start / end of each line |
| re.DOTALL (re.S) | . also matches newlines |
| re.VERBOSE (re.X) | Allow comments and whitespace in the pattern |
pattern = re.compile(r'''
(?P<year>\d{4}) # year
-(?P<month>\d{2}) # month
-(?P<day>\d{2}) # day
''', re.VERBOSE)
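A short usage sketch for the verbose pattern, plus an example of combining flags with the | operator. The sample strings are made up for illustration.

```python
import re

date_pattern = re.compile(r'''
    (?P<year>\d{4})    # year
    -(?P<month>\d{2})  # month
    -(?P<day>\d{2})    # day
''', re.VERBOSE)

m = date_pattern.search("Released on 2026-02-26")
parts = m.groupdict()
print(parts)   # {'year': '2026', 'month': '02', 'day': '26'}

# Flags combine with "|": case-insensitive matching plus per-line anchors
log = "info: started\nERROR: disk full\nerror: retrying"
errors = re.findall(r'^error: (.+)$', log, re.IGNORECASE | re.MULTILINE)
print(errors)  # ['disk full', 'retrying']
```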