Python Regular Expressions: Practical Guide from Basics to Performance Optimization

This guide covers basic re module patterns, named groups, lookahead and lookbehind, and performance optimization, with practical examples throughout.

Introduction

Regular expressions are a powerful tool for text processing. Python’s re module provides Perl-style regular expression matching, which is useful for log parsing, data cleaning, validation, and more.

Basic Patterns

Common Metacharacters

Pattern     Meaning                                    Example
.           Any character except newline               a.c → "abc", "a1c"
\d          Digit ([0-9] under re.ASCII)               \d{3} → "123"
\w          Word char ([a-zA-Z0-9_] under re.ASCII)    \w+ → "hello_42"
\s          Whitespace                                 \s+ → " ", "\t"
^ / $       Start / end of string (line with re.M)     ^Hello$
* / + / ?   0 or more / 1 or more / 0 or 1             ab*c → "ac", "abc"
{n,m}       n to m repetitions                         \d{2,4} → "12", "1234"
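
As a quick check of the table, the sketch below runs each example through re.fullmatch. It also illustrates that, for str patterns, \d accepts non-ASCII digits unless re.ASCII is set. The variable names are illustrative only.

```python
import re

# Each pair is (pattern, sample text) taken from the table above.
examples = [
    (r'a.c', 'a1c'),
    (r'\d{3}', '123'),
    (r'\w+', 'hello_42'),
    (r'\s+', ' \t'),
    (r'ab*c', 'ac'),
    (r'\d{2,4}', '1234'),
]
for pattern, sample in examples:
    # fullmatch succeeds only if the whole sample matches the pattern
    assert re.fullmatch(pattern, sample), (pattern, sample)

# For str patterns, \d also accepts non-ASCII digits unless re.ASCII is set.
arabic_indic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE
print(bool(re.fullmatch(r'\d', arabic_indic_three)))            # True
print(bool(re.fullmatch(r'\d', arabic_indic_three, re.ASCII)))  # False
```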

Basic Operations

import re

text = "2026-02-26 Error: Connection timeout (retry: 3)"

# match: from start
m = re.match(r'\d{4}-\d{2}-\d{2}', text)
print(m.group())  # "2026-02-26"

# search: first match
m = re.search(r'retry: (\d+)', text)
print(m.group(1))  # "3"

# findall: all matches as list
numbers = re.findall(r'\d+', text)
print(numbers)  # ['2026', '02', '26', '3']

# sub: replacement
cleaned = re.sub(r'\d{4}-\d{2}-\d{2}', '[DATE]', text)
print(cleaned)  # "[DATE] Error: Connection timeout (retry: 3)"
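
The difference between match, search, and fullmatch is easy to overlook, so here is a small sketch on the same sample line. It also uses re.finditer, which yields match objects with positions rather than plain strings. The variable names are illustrative.

```python
import re

text = "2026-02-26 Error: Connection timeout (retry: 3)"

# match only tries at position 0, so a pattern that occurs later is not found
at_start = re.match(r'retry', text)
print(at_start)  # None

# fullmatch requires the pattern to cover the entire string
whole = re.fullmatch(r'\d{4}-\d{2}-\d{2}', text)
print(whole)  # None

# finditer yields match objects, which carry positions as well as text
spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(spans)
```

When you need offsets, for example to highlight matches, finditer is usually a better fit than findall.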

Groups and Capturing

Named Groups

log = "2026-02-26 14:30:45 [ERROR] Database connection failed"

pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.+)'
m = re.match(pattern, log)

if m:
    print(m.group('date'))     # "2026-02-26"
    print(m.group('level'))    # "ERROR"
    print(m.group('message'))  # "Database connection failed"
    print(m.groupdict())       # {'date': '2026-02-26', 'time': '14:30:45', ...}
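
Named groups can also be referenced in a replacement string with \g<name>. The sketch below reorders the date in the log line. The group names year, month, and day are chosen for this illustration.

```python
import re

log = "2026-02-26 14:30:45 [ERROR] Database connection failed"

# Group names here are illustrative, not taken from the pattern above
date_re = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# \g<name> in the replacement refers to a named group from the pattern
reordered = date_re.sub(r'\g<day>/\g<month>/\g<year>', log)
print(reordered)  # 26/02/2026 14:30:45 [ERROR] Database connection failed
```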

Non-Capturing Groups

# (?:...) groups without capturing
pattern = r'(?:https?|ftp)://[\w./\-]+'
urls = re.findall(pattern, "Visit https://example.com or ftp://files.example.com")
print(urls)  # ['https://example.com', 'ftp://files.example.com']
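
Why this matters for findall: when a pattern contains a capturing group, findall returns the group contents instead of the whole match. A short sketch comparing the two forms on the same input:

```python
import re

text = "Visit https://example.com or ftp://files.example.com"

# With a capturing group, findall returns only what the group captured
captured = re.findall(r'(https?|ftp)://[\w./\-]+', text)
print(captured)  # ['https', 'ftp']

# With a non-capturing group, findall returns the whole match
whole = re.findall(r'(?:https?|ftp)://[\w./\-]+', text)
print(whole)     # ['https://example.com', 'ftp://files.example.com']
```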

Lookahead and Lookbehind

Positive / Negative Lookahead

# Positive lookahead: (?=...)
passwords = ["abc123", "password", "Str0ng!Pass", "12345"]
strong = [p for p in passwords
          if re.match(r'(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}', p)]
print(strong)  # ['Str0ng!Pass']

# Negative lookahead: (?!...)
lines = ["test_func", "main_func", "test_class", "helper"]
non_test = [l for l in lines if re.match(r'(?!test)\w+', l)]
print(non_test)  # ['main_func', 'helper']
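
Lookaheads are zero-width: they test what follows without consuming it. The sketch below, on a made-up key/value string, shows that the colon is checked but not included in the match.

```python
import re

text = "host: example.com, port: 8080"  # hypothetical sample input

# The lookahead requires ":" after the word but leaves it out of the match
keys = re.findall(r'\w+(?=:)', text)
print(keys)  # ['host', 'port']

# The match ends just before the colon, which is still the next character
m = re.search(r'\w+(?=:)', text)
print(m.group(), text[m.end()])  # host :
```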

Positive / Negative Lookbehind

# Positive lookbehind: (?<=...)
text = "Price: $100, Tax: $8, Total: $108"
amounts = re.findall(r'(?<=\$)\d+', text)
print(amounts)  # ['100', '8', '108']

# Negative lookbehind: (?<!...)
# Numbers not preceded by "$" (the \d in the class stops matches from starting mid-number)
text = "Price: $100, Qty: 5, Total: $500"
counts = re.findall(r'(?<![$\d])\d+', text)
print(counts)  # ['5']
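
A pitfall worth sketching for any lookaround placed after a greedy quantifier: the quantifier can give back characters until the lookaround succeeds, which produces truncated matches. The version strings and the stricter pattern below are illustrative, one possible fix rather than the only one.

```python
import re

text = "v1.0 v2.0-beta v3.0 v4.0-rc1"

# [\d.]+ backs off one character so that (?!-) succeeds on the shorter match
naive = re.findall(r'v[\d.]+(?!-)', text)
print(naive)   # ['v1.0', 'v2.', 'v3.0', 'v4.']

# Requiring whitespace or end of string after the version leaves no shorter
# match to fall back to, so tagged versions are skipped entirely
strict = re.findall(r'v\d+(?:\.\d+)*(?!\S)', text)
print(strict)  # ['v1.0', 'v3.0']
```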

Practical Patterns

Email Extraction

text = "Contact us at info@example.com or support@test.co.jp"
pattern = r'[\w.+-]+@[\w-]+\.[\w.]+'
emails = re.findall(pattern, text)
print(emails)  # ['info@example.com', 'support@test.co.jp']
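
One limitation of a pattern like this is that a domain part ending in [\w.]+ also swallows sentence punctuation. The sketch below shows the effect on a made-up sentence and one possible tightening. Neither pattern attempts full RFC-compliant validation.

```python
import re

text = "Write to bob@example.com. Or try support@test.co.jp, thanks."

# The trailing [\w.]+ also consumes the period that ends the sentence
loose = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', text)
print(loose)  # ['bob@example.com.', 'support@test.co.jp']

# Requiring each dot to be followed by a label drops the trailing period
tight = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
print(tight)  # ['bob@example.com', 'support@test.co.jp']
```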

Safe CSV Splitting

# Split on commas but ignore commas inside quotes
line = 'John,"Doe, Jr.",30,"New York, NY"'
pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'
fields = re.split(pattern, line)
print(fields)  # ['John', '"Doe, Jr."', '30', '"New York, NY"']
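
For comparison, the standard library's csv module parses the same line and strips the quotes. The sketch below runs both approaches side by side. For real CSV files with escaped quotes or embedded newlines, csv is likely the safer choice.

```python
import csv
import io
import re

line = 'John,"Doe, Jr.",30,"New York, NY"'

# The regex split keeps the surrounding quotes in each field
regex_fields = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', line)
print(regex_fields)  # ['John', '"Doe, Jr."', '30', '"New York, NY"']

# csv.reader parses the same line and removes the quotes
csv_fields = next(csv.reader(io.StringIO(line)))
print(csv_fields)    # ['John', 'Doe, Jr.', '30', 'New York, NY']
```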

IPv4 Validation

def is_valid_ipv4(ip):
    # Each octet is 0-255; fullmatch rejects a trailing newline that "$" would allow
    pattern = r'(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)'
    return bool(re.fullmatch(pattern, ip))

print(is_valid_ipv4("192.168.1.1"))   # True
print(is_valid_ipv4("256.1.1.1"))     # False
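
Two points about validation are worth a sketch. First, "$" also matches just before a trailing newline, so re.fullmatch is a stricter choice for validators. Second, the standard library's ipaddress module can do the same check without a regex. The helper names and the OCTET constant below are hypothetical.

```python
import ipaddress
import re

# "$" matches before a trailing newline; fullmatch does not allow it
print(bool(re.match(r'^\d+$', '123\n')))    # True
print(bool(re.fullmatch(r'\d+', '123\n')))  # False

OCTET = r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)'
IPV4_RE = re.compile(rf'(?:{OCTET}\.){{3}}{OCTET}')

def is_valid_ipv4_regex(ip):
    # Regex-based check using fullmatch instead of ^...$
    return IPV4_RE.fullmatch(ip) is not None

def is_valid_ipv4_stdlib(ip):
    # ipaddress raises a ValueError subclass for malformed addresses
    try:
        ipaddress.IPv4Address(ip)
    except ValueError:
        return False
    return True

for candidate in ["192.168.1.1", "256.1.1.1", "192.168.1.1\n"]:
    print(repr(candidate), is_valid_ipv4_regex(candidate), is_valid_ipv4_stdlib(candidate))
```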

Performance Optimization

Compiled Patterns

Pre-compile patterns used repeatedly:

# Less efficient: every call looks the pattern up in re's internal cache
for line in lines:
    re.search(r'\d{4}-\d{2}-\d{2}', line)

# Efficient: compile once
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
for line in lines:
    date_pattern.search(line)
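
To see how much pre-compiling helps on your own data, a rough timeit sketch like the one below can be used. The sample lines and function names are made up, and the timings depend on the machine and Python version, so no numbers are claimed here.

```python
import re
import timeit

# Hypothetical sample input: half the lines contain a date
lines = ["2026-02-26 ok", "no date here"] * 1000

date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

def with_module_function():
    # re.search looks the pattern up in the module's cache on every call
    return sum(1 for line in lines if re.search(r'\d{4}-\d{2}-\d{2}', line))

def with_compiled_pattern():
    search = date_pattern.search  # bind the method once outside the loop
    return sum(1 for line in lines if search(line))

# Both approaches find the same matches; only the per-call overhead differs
assert with_module_function() == with_compiled_pattern() == 1000

print(timeit.timeit(with_module_function, number=20))
print(timeit.timeit(with_compiled_pattern, number=20))
```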

Greedy vs Non-Greedy

html = '<div>Hello</div><div>World</div>'

# Greedy (default): longest match
print(re.findall(r'<div>.*</div>', html))
# ['<div>Hello</div><div>World</div>']

# Non-greedy: shortest match (add ?)
print(re.findall(r'<div>.*?</div>', html))
# ['<div>Hello</div>', '<div>World</div>']
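
A negated character class is a common alternative to a non-greedy quantifier: it cannot run past the next "<", so the engine has less to backtrack over. The sketch assumes the div content contains no nested tags. For real HTML a parser is the more robust tool.

```python
import re

html = '<div>Hello</div><div>World</div>'

# [^<]* stops at the next "<", so each match stays inside one element
inner = re.findall(r'<div>([^<]*)</div>', html)
print(inner)  # ['Hello', 'World']
```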

Avoiding Catastrophic Backtracking

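# Dangerous: nested quantifiers cause catastrophic backtracking (ReDoS)
# when the input cannot match, as here where no "b" follows the run of "a"
# re.match(r'(a+)+b', 'a' * 30)  # extremely slow

# Safe: equivalent pattern without the nested quantifier
# re.match(r'a+b', 'a' * 30)  # returns None immediately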
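
The growth can be observed safely on short inputs. The sketch below first checks that the two patterns agree on a few samples, then times both on failing inputs of increasing length. The lengths are kept small on purpose, and the absolute timings vary by machine.

```python
import re
import time

nested = re.compile(r'(a+)+b')  # nested quantifiers: many ways to split a run of "a"
flat = re.compile(r'a+b')       # accepts the same strings with far less backtracking

# The two patterns agree on whether each sample matches
for sample in ['ab', 'aaab', 'b', 'aaa', 'xaab']:
    assert bool(nested.match(sample)) == bool(flat.match(sample))

for n in (14, 16, 18):
    subject = 'a' * n  # no trailing "b", so every split is tried before failing
    start = time.perf_counter()
    nested.match(subject)
    slow = time.perf_counter() - start
    start = time.perf_counter()
    flat.match(subject)
    fast = time.perf_counter() - start
    print(n, f'{slow:.4f}s', f'{fast:.6f}s')
```

The time for the nested pattern should grow sharply with each step in n, while the flat pattern stays close to constant.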

Key re Module Flags

Flag                    Description
re.IGNORECASE (re.I)    Case-insensitive matching
re.MULTILINE (re.M)     ^/$ match each line
re.DOTALL (re.S)        . matches newlines
re.VERBOSE (re.X)       Allow comments and whitespace

pattern = re.compile(r'''
    (?P<year>\d{4})   # year
    -(?P<month>\d{2}) # month
    -(?P<day>\d{2})   # day
''', re.VERBOSE)
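
A short usage sketch for a verbose pattern like this one follows. It also shows that flags combine with | and that, in verbose mode, a literal space has to be escaped or placed in a character class. The names date_re and error_re and the sample strings are illustrative.

```python
import re

date_re = re.compile(r'''
    (?P<year>\d{4})   # year
    -(?P<month>\d{2}) # month
    -(?P<day>\d{2})   # day
''', re.VERBOSE)

m = date_re.search("Released on 2026-02-26")
print(m.groupdict())  # {'year': '2026', 'month': '02', 'day': '26'}

# Flags combine with |; [ ] keeps a literal space that VERBOSE would otherwise ignore
error_re = re.compile(r'''
    error [ ] (?P<code>\d+)   # the word, one space, then a numeric code
''', re.VERBOSE | re.IGNORECASE)
print(error_re.search("ERROR 42").group('code'))  # 42
```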
