Python Regular Expressions: Practical Guide from Basics to Performance Optimization

Introduction

Regular expressions are a powerful tool for text processing. Python’s re module implements Perl-style regular expression syntax (similar to PCRE, though not fully compatible with it), which is useful for log parsing, data cleaning, validation, and more.

Basic Patterns

Common Metacharacters

Pattern	Meaning	Example
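`.`	Any character except newline (unless `re.DOTALL`)	`a.c` → “abc”, “a1c”
`\d`	Digit (any Unicode digit by default; `[0-9]` with `re.ASCII`)	`\d{3}` → “123”
`\w`	Word char (Unicode letters, digits, `_` by default; `[a-zA-Z0-9_]` with `re.ASCII`)	`\w+` → “hello_42”
`\s`	Whitespace	`\s+` → “ ”, “\t”
`^` / `$`	Start / end of string (of each line with `re.MULTILINE`)	`^Hello$`
`*` / `+` / `?`	0 or more, 1 or more, 0 or 1	`ab*c` → “ac”, “abc”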
`{n,m}`	n to m times	`\d{2,4}` → “12”, “1234”
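
Several entries in the table depend on defaults that are easy to overlook. The sketch below checks three of them with the standard re module; the sample strings are made up for illustration.

```python
import re

# "." matches any character except a newline unless re.DOTALL is set
dot_default = re.search(r'a.c', 'a\nc')
dot_dotall = re.search(r'a.c', 'a\nc', re.DOTALL)

# For str patterns, \d and \w are Unicode-aware by default;
# re.ASCII restricts them to [0-9] and [a-zA-Z0-9_]
arabic_indic = '\u0663\u0664'  # Arabic-Indic digits three and four
unicode_digits = re.fullmatch(r'\d+', arabic_indic)
ascii_digits = re.fullmatch(r'\d+', arabic_indic, re.ASCII)

# Without re.MULTILINE, ^ matches only at the start of the string
first_only = re.findall(r'^\w+', 'one\ntwo')
each_line = re.findall(r'^\w+', 'one\ntwo', re.MULTILINE)

print(dot_default, bool(dot_dotall))       # None True
print(bool(unicode_digits), ascii_digits)  # True None
print(first_only, each_line)               # ['one'] ['one', 'two']
```

When input must be plain ASCII (IDs, port numbers, IP octets), passing re.ASCII is a cheap way to keep `\d` and `\w` from accepting unexpected characters.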

Basic Operations

import re

text = "2026-02-26 Error: Connection timeout (retry: 3)"

# match: from start
m = re.match(r'\d{4}-\d{2}-\d{2}', text)
print(m.group())  # "2026-02-26"

# search: first match
m = re.search(r'retry: (\d+)', text)
print(m.group(1))  # "3"

# findall: all matches as list
numbers = re.findall(r'\d+', text)
print(numbers)  # ['2026', '02', '26', '3']

# sub: replacement
cleaned = re.sub(r'\d{4}-\d{2}-\d{2}', '[DATE]', text)
print(cleaned)  # "[DATE] Error: Connection timeout (retry: 3)"
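
Two more functions from the same module round out the basic set: re.finditer, which yields match objects instead of strings, and re.fullmatch, which requires the entire string to match. The snippet below is a small illustration that reuses the sample text from above.

```python
import re

text = "2026-02-26 Error: Connection timeout (retry: 3)"

# finditer: iterate over match objects, which carry positions as well as text
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]
print(spans[0])  # ('2026', 0, 4)

# match accepts a matching prefix; fullmatch requires the whole string to match
prefix_ok = re.match(r'\d{4}', "2026-02-26") is not None
whole_ok = re.fullmatch(r'\d{4}', "2026-02-26") is not None
print(prefix_ok, whole_ok)  # True False
```

For validation, fullmatch is usually the safer choice, because match succeeds whenever a prefix of the input fits the pattern.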

Groups and Capturing

Named Groups

log = "2026-02-26 14:30:45 [ERROR] Database connection failed"

pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.+)'
m = re.match(pattern, log)

if m:
    print(m.group('date'))     # "2026-02-26"
    print(m.group('level'))    # "ERROR"
    print(m.group('message'))  # "Database connection failed"
    print(m.groupdict())       # {'date': '2026-02-26', 'time': '14:30:45', ...}
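
Named groups also work well for processing many lines at once and for replacements. The following sketch assumes Python 3.8+ (for the `:=` operator); the log lines and the LOG_LINE name are hypothetical, not from a real system.

```python
import re

LOG_LINE = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] (?P<message>.+)'
)

# Hypothetical sample data
logs = [
    "2026-02-26 14:30:45 [ERROR] Database connection failed",
    "2026-02-26 14:30:46 [INFO] Retrying",
    "not a log line",
]

# Keep only lines that match, as dictionaries keyed by group name
records = [m.groupdict() for line in logs if (m := LOG_LINE.match(line))]
errors = [r['message'] for r in records if r['level'] == 'ERROR']
print(errors)  # ['Database connection failed']

# \g<name> refers to a named group inside the replacement string
swapped = re.sub(r'(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})',
                 r'\g<d>/\g<m>/\g<y>', "2026-02-26")
print(swapped)  # 26/02/2026
```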

Non-Capturing Groups

# (?:...) groups without capturing
pattern = r'(?:https?|ftp)://[\w./\-]+'
urls = re.findall(pattern, "Visit https://example.com or ftp://files.example.com")
print(urls)  # ['https://example.com', 'ftp://files.example.com']

Lookahead and Lookbehind

Positive / Negative Lookahead

# Positive lookahead: (?=...)
passwords = ["abc123", "password", "Str0ng!Pass", "12345"]
strong = [p for p in passwords
          if re.match(r'(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}', p)]
print(strong)  # ['Str0ng!Pass']

# Negative lookahead: (?!...)
lines = ["test_func", "main_func", "test_class", "helper"]
non_test = [l for l in lines if re.match(r'(?!test)\w+', l)]
print(non_test)  # ['main_func', 'helper']

Positive / Negative Lookbehind

# Positive lookbehind: (?<=...)
text = "Price: $100, Tax: $8, Total: $108"
amounts = re.findall(r'(?<=\$)\d+', text)
print(amounts)  # ['100', '8', '108']

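# Negative lookbehind: (?<!...) matches numbers not preceded by "$" or a digit
text = "Price: $100, Qty: 5, Total: $500"
print(re.findall(r'(?<![$\d])\d+', text))  # ['5']

# Stable versions need a negative lookahead that also blocks partial matches
text = "v1.0 v2.0-beta v3.0 v4.0-rc1"
stable = re.findall(r'v\d+(?:\.\d+)+(?![\d.-])', text)
print(stable)  # ['v1.0', 'v3.0']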

Practical Patterns

Email Extraction

text = "Contact us at info@example.com or support@test.co.jp"
pattern = r'[\w.+-]+@[\w-]+\.[\w.]+'
emails = re.findall(pattern, text)
print(emails)  # ['info@example.com', 'support@test.co.jp']

Safe CSV Splitting

# Split on commas but ignore commas inside quotes
line = 'John,"Doe, Jr.",30,"New York, NY"'
pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'
fields = re.split(pattern, line)
print(fields)  # ['John', '"Doe, Jr."', '30', '"New York, NY"']
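
The lookahead split keeps the surrounding quotes and does not unescape doubled quotes. For real CSV data the standard csv module is usually the more robust choice. The comparison below is a sketch on a made-up line that contains an escaped quote.

```python
import csv
import io
import re

line = 'John,"Doe, Jr.",30,"He said ""hi"""'

# Regex split: keeps the surrounding quotes and the doubled "" escapes
regex_fields = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', line)

# csv.reader: strips the quotes and unescapes "" to "
csv_fields = next(csv.reader(io.StringIO(line)))

print(regex_fields)  # ['John', '"Doe, Jr."', '30', '"He said ""hi"""']
print(csv_fields)    # ['John', 'Doe, Jr.', '30', 'He said "hi"']
```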

IPv4 Validation

def is_valid_ipv4(ip):
    # fullmatch: "$" with re.match would also accept a trailing newline
    pattern = r'(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)'
    return bool(re.fullmatch(pattern, ip, re.ASCII))

print(is_valid_ipv4("192.168.1.1"))   # True
print(is_valid_ipv4("256.1.1.1"))     # False
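
When the goal is validation rather than practicing regex, the standard ipaddress module can do the same job without a hand-written pattern. The helper name below is hypothetical, and the sketch assumes the input is a string.

```python
import ipaddress

def is_valid_ipv4_stdlib(ip):
    """Hypothetical helper: validate by delegating to ipaddress."""
    try:
        ipaddress.IPv4Address(ip)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True

print(is_valid_ipv4_stdlib("192.168.1.1"))    # True
print(is_valid_ipv4_stdlib("256.1.1.1"))      # False
print(is_valid_ipv4_stdlib("192.168.1.1\n"))  # False
```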

Performance Optimization

Compiled Patterns

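Pre-compile patterns that are used repeatedly. The re module caches recently compiled patterns, so module-level functions do not recompile on every call, but each call still pays for a cache lookup.

# Slower: every call looks the pattern up in re's internal cache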
for line in lines:
    re.search(r'\d{4}-\d{2}-\d{2}', line)

# Efficient: compile once
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
for line in lines:
    date_pattern.search(line)
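
The difference can be measured with timeit. The sketch below uses made-up input lines and hypothetical function names; absolute timings depend on the machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

lines = ["2026-02-26 ok", "no date here"] * 1000
DATE = re.compile(r'\d{4}-\d{2}-\d{2}')

def module_level():
    # Each re.search call goes through the module's pattern cache
    return sum(1 for line in lines if re.search(r'\d{4}-\d{2}-\d{2}', line))

def precompiled():
    search = DATE.search  # bind the method once, outside the loop
    return sum(1 for line in lines if search(line))

# Both variants must find the same number of matching lines
assert module_level() == precompiled() == 1000

print("module-level:", timeit.timeit(module_level, number=20))
print("precompiled: ", timeit.timeit(precompiled, number=20))
```

The gain is usually modest per call, so it matters most in tight loops over many short strings.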

Greedy vs Non-Greedy

html = '<div>Hello</div><div>World</div>'

# Greedy (default): longest match
print(re.findall(r'<div>.*</div>', html))
# ['<div>Hello</div><div>World</div>']

# Non-greedy: shortest match (add ?)
print(re.findall(r'<div>.*?</div>', html))
# ['<div>Hello</div>', '<div>World</div>']
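
A negated character class is often a clearer alternative to a non-greedy quantifier, because it states exactly where the match must stop. The sketch below also shows that `.` needs re.DOTALL once the content spans lines. The sample strings are illustrative, and regex remains a poor fit for arbitrary nested HTML.

```python
import re

html = '<div>Hello</div><div>World</div>'

# [^<]* cannot run past the next "<", so each match ends at its own closing tag
inner = re.findall(r'<div>([^<]*)</div>', html)
print(inner)  # ['Hello', 'World']

# By default "." stops at newlines, so multi-line content needs re.DOTALL
multiline = '<div>Hello\nWorld</div>'
without_flag = re.findall(r'<div>(.*?)</div>', multiline)
with_flag = re.findall(r'<div>(.*?)</div>', multiline, re.DOTALL)
print(without_flag)  # []
print(with_flag)     # ['Hello\nWorld']
```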

Avoiding Catastrophic Backtracking

# Dangerous: catastrophic backtracking (ReDoS)
# re.match(r'(a+)+b', 'a' * 30)  # extremely slow

# Safe: simplified pattern
# re.match(r'a+b', 'a' * 30)  # instant
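
To see the effect without hanging the interpreter, the nested pattern can be timed on short inputs. This is a rough sketch with a hypothetical time_match helper. On CPython's backtracking engine the nested version typically slows down sharply as n grows, but exact timings vary by machine.

```python
import re
import time

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one re.match call."""
    start = time.perf_counter()
    result = re.match(pattern, text)
    return result, time.perf_counter() - start

for n in (12, 14, 16, 18):
    text = 'a' * n  # no trailing "b", so every attempt ends in failure
    _, nested = time_match(r'(a+)+b', text)
    _, flat = time_match(r'a+b', text)
    print(f"n={n}: nested {nested:.4f}s, flat {flat:.6f}s")
```

Both patterns accept the same strings, so removing the nested quantifier changes only the cost of failing. Python 3.11 and later also offer atomic groups and possessive quantifiers for limiting backtracking.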

Key re Module Flags

Flag	Description
`re.IGNORECASE` (`re.I`)	Case-insensitive matching
`re.MULTILINE` (`re.M`)	`^`/`$` also match at the start/end of each line
`re.DOTALL` (`re.S`)	`.` also matches newlines
`re.VERBOSE` (`re.X`)	Ignore whitespace in the pattern and allow `#` comments

pattern = re.compile(r'''
    (?P<year>\d{4})   # year
    -(?P<month>\d{2}) # month
    -(?P<day>\d{2})   # day
''', re.VERBOSE)
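
Flags can be combined with the `|` operator. The sketch below extends the verbose date pattern with re.MULTILINE and re.IGNORECASE to pick out error lines from a made-up multi-line string; the DATE name and the sample text are illustrative.

```python
import re

DATE = re.compile(r'''
    ^(?P<year>\d{4})    # year at the start of a line
    -(?P<month>\d{2})   # month
    -(?P<day>\d{2})     # day
    \s+ error \b        # literal word; spaces in the pattern are ignored
''', re.VERBOSE | re.MULTILINE | re.IGNORECASE)

text = "2026-02-26 ERROR disk full\n2026-02-27 info ok\n2026-02-28 Error again"
days = [m.group('day') for m in DATE.finditer(text)]
print(days)  # ['26', '28']
```

In verbose mode a literal space or `#` in the pattern must be escaped or placed inside a character class, which is why the whitespace before "error" is written as `\s+`.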

