- Master capturing and non-capturing groups
- Use backreferences for pattern repetition
- Apply alternation and lookaround assertions
- Extract structured data from text using groups
Pattern Matching and Groups
Introduction
Imagine you're a chef trying to extract specific ingredients from a complex recipe. You don't just want to know that the recipe contains "flour" - you want to know exactly how much flour, what type, and where it's used. This is where regex groups come in.
Groups allow you to capture specific parts of a match, reference them later, and create much more sophisticated patterns. In this lesson, we'll explore how to use groups to extract structured data from text, create complex patterns, and build powerful text processing tools.
Capturing Groups
Parentheses () in regex create capturing groups. These groups "remember" the matched text and allow you to extract specific parts of the match.
Basic Group Capture
import re
# Extract area code and number from phone number
pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = "Call me at 555-123-4567"
match = re.search(pattern, text)
if match:
    print("Full match:", match.group(0))  # "555-123-4567"
    print("Area code:", match.group(1))   # "555"
    print("Exchange:", match.group(2))    # "123"
    print("Number:", match.group(3))      # "4567"
    # Or get all groups at once
    print("All groups:", match.groups())  # ('555', '123', '4567')
Named Groups
For more readable code, you can name your groups:
pattern = r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})'
text = "Call me at 555-123-4567"
match = re.search(pattern, text)
if match:
    print("Area code:", match.group('area'))      # "555"
    print("Exchange:", match.group('exchange'))   # "123"
    print("Number:", match.group('number'))       # "4567"
    # Get all named groups as a dictionary
    print("Named groups:", match.groupdict())
    # {'area': '555', 'exchange': '123', 'number': '4567'}
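When a text contains several matches, the same named groups can be collected into a list of records. The following is a minimal sketch; the sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing two phone numbers
text = "Office: 555-123-4567, mobile: 555-987-6543"
phone = re.compile(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})')

# finditer yields one match object per phone number found
records = [m.groupdict() for m in phone.finditer(text)]
for record in records:
    print(record)
# {'area': '555', 'exchange': '123', 'number': '4567'}
# {'area': '555', 'exchange': '987', 'number': '6543'}
```

Each dictionary can then be passed on to other code without anyone having to remember which group number meant what.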
Non-Capturing Groups
Sometimes you need parentheses for grouping but don't want to capture the content. Use (?:...) for non-capturing groups:
# Capturing groups: findall returns a tuple of the captured groups for each match
pattern1 = r'(red|blue) (car|bike)'
text = "red car and blue bike"
matches = re.findall(pattern1, text)
print(matches) # [('red', 'car'), ('blue', 'bike')]
# Non-capturing groups: no groups are created, so findall returns each full match
pattern2 = r'(?:red|blue) (?:car|bike)'
matches = re.findall(pattern2, text)
print(matches) # ['red car', 'blue bike']
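Non-capturing groups also keep group numbers stable, because only the parts you care about get a number. Here is a small sketch with a made-up sentence:

```python
import re

text = "Dr. Smith met Ms. Jones"

# Capturing the title makes the surname group 2
m1 = re.search(r'(Dr|Ms|Mr)\. (\w+)', text)
print(m1.groups())  # ('Dr', 'Smith')

# With a non-capturing title, the surname is the only group,
# so findall returns plain strings instead of tuples
surnames = re.findall(r'(?:Dr|Ms|Mr)\. (\w+)', text)
print(surnames)  # ['Smith', 'Jones']
```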
Backreferences
Backreferences allow you to refer back to previously captured groups. Use \1, \2, etc. to reference the first, second, etc. captured group.
Matching Repeated Words
# Find repeated words (like "the the")
pattern = r'\b(\w+)\s+\1\b'
text = "The the quick brown fox jumps over the the lazy dog"
matches = re.findall(pattern, text, re.IGNORECASE)
print(matches) # ['The', 'the'] (findall returns group 1 as written in the text; the first match is "The the")
HTML Tag Matching
# Match opening and closing HTML tags
pattern = r'<(\w+)>.*?</\1>'
text = "<div>Hello <span>world</span></div>"
matches = re.findall(pattern, text)
print(matches) # ['div']
Password Validation
# Toy rule: password must start and end with the same character
def validate_password(password):
    pattern = r'^(.).*\1$'
    return bool(re.match(pattern, password))
print(validate_password("abcda"))  # True (starts and ends with 'a')
print(validate_password("hello"))  # False (starts with 'h', ends with 'o')
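Backreferences also work with named groups and inside replacement strings. The sketch below assumes two illustrative inputs: it uses (?P=word) to refer back to a named group within the pattern, and \g<word> or \1, \2, \3 to reuse captured text in re.sub.

```python
import re

# (?P=word) must repeat whatever (?P<word>...) captured
doubled = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b', re.IGNORECASE)
text = "This is is a test test sentence"

# \g<word> in the replacement inserts the captured word once
fixed = doubled.sub(r'\g<word>', text)
print(fixed)  # "This is a test sentence"

# Numbered groups in the replacement reorder MM/DD/YYYY into YYYY-MM-DD
iso = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', "Due 12/25/2023")
print(iso)  # "Due 2023-12-25"
```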
Alternation
The pipe symbol | creates alternation, allowing you to match one pattern OR another.
Simple Alternation
# Match different fruits
pattern = r'apple|banana|orange'
text = "I like apple and banana but not orange"
matches = re.findall(pattern, text)
print(matches) # ['apple', 'banana', 'orange']
Complex Alternation with Groups
# Match different date formats
pattern = r'\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b|\b(\d{4})-(\d{1,2})-(\d{1,2})\b'
text = "Dates: 12/25/2023 and 2023-12-25"
matches = re.findall(pattern, text)
print(matches) # [('12', '25', '2023', '', '', ''), ('', '', '', '2023', '12', '25')]
Alternation Inside a Group
# Match words containing 'cat' or 'dog'
pattern = r'\b\w*(?:cat|dog)\w*\b'
text = "I have a cat, a dog, and a bird"
matches = re.findall(pattern, text)
print(matches) # ['cat', 'dog']
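Alternation has the lowest precedence of any regex operator, and alternatives are tried from left to right. Both points are easy to trip over, so here is a short sketch using a made-up word list:

```python
import re

words = ["cat", "dog", "catalog", "bulldog"]

# ^cat|dog$ is read as (^cat) OR (dog$), not ^(cat|dog)$
loose = [w for w in words if re.search(r'^cat|dog$', w)]
print(loose)   # ['cat', 'dog', 'catalog', 'bulldog']

# Grouping the alternatives applies both anchors to each of them
strict = [w for w in words if re.search(r'^(?:cat|dog)$', w)]
print(strict)  # ['cat', 'dog']

# The first alternative that matches wins, even if a later one is longer
first_wins = re.findall(r'cat|catalog', "catalog")
print(first_wins)  # ['cat']
```

Putting the longer alternative first (catalog|cat) or adding word boundaries avoids the last surprise.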
Lookaround Assertions
Lookaround assertions allow you to match patterns based on what comes before or after, without including those characters in the match.
Positive Lookahead (?=...)
Asserts that the pattern inside matches ahead:
# Match numbers followed by 'kg'
pattern = r'\d+(?=kg)'
text = "Weight: 50kg, Height: 170cm, Volume: 2l"
matches = re.findall(pattern, text)
print(matches) # ['50']
Negative Lookahead (?!...)
Asserts that the pattern inside does NOT match ahead:
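# Match numbers NOT followed by 'kg'
# \d+(?!kg) alone would backtrack and match the '5' in '50kg',
# so the lookahead also rejects a following digit
pattern = r'\d+(?!\d|kg)'
text = "Weight: 50kg, Height: 170cm, Volume: 2l"
matches = re.findall(pattern, text)
print(matches) # ['170', '2']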
Positive Lookbehind (?<=...)
Asserts that the pattern inside matches behind:
# Match numbers preceded by '$'
pattern = r'(?<=\$)\d+'
text = "Price: $50, Discount: 10%, Tax: $5"
matches = re.findall(pattern, text)
print(matches) # ['50', '5']
Negative Lookbehind (?<!...)
Asserts that the pattern inside does NOT match behind:
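# Match numbers NOT preceded by '$'
# (?<!\$)\d+ alone would also match the '0' in '$50' (it is preceded by '5', not '$'),
# so the lookbehind also rejects a preceding digit
pattern = r'(?<![$\d])\d+'
text = "Price: $50, Discount: 10%, Tax: $5"
matches = re.findall(pattern, text)
print(matches) # ['10']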
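Because lookarounds consume no characters, a pattern made only of lookarounds matches a position rather than text. One way to see this is the sketch below, which inserts thousands separators; the helper name is hypothetical and it assumes the input is a string of digits only.

```python
import re

def add_thousands_separators(digits):
    """Insert commas into a string of digits (hypothetical helper)."""
    # (?<=\d)          a digit sits just before this position
    # (?=(?:\d{3})+$)  and the digits remaining after it form groups of three
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', digits)

print(add_thousands_separators("1234567"))  # 1,234,567
print(add_thousands_separators("1000"))     # 1,000
print(add_thousands_separators("999"))      # 999
```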
Advanced Group Techniques
Conditional Groups
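# (?(1)yes|no) tries the "yes" branch if group 1 took part in the match, otherwise "no"
pattern = r'(\w+)(?(1)\s+\1|end)'
text = "Start Start end"
# Group 1 is mandatory here, so the "end" branch can never be chosen.
# Plain alternation expresses the intent more clearly:
pattern = r'\b(\w+)\s+\1\b|\bend\b'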
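A conditional is only useful when the group it tests is optional. The following sketch accepts an address with or without angle brackets but rejects a lone bracket; the sample strings are invented for illustration.

```python
import re

# Group 1 optionally captures an opening '<'.
# (?(1)>|$) then demands '>' if group 1 matched, otherwise the end of the string.
pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

samples = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = {s: bool(pattern.match(s)) for s in samples}
for sample, ok in results.items():
    print(f"{sample}: {ok}")
# <user@host.com>: True
# user@host.com: True
# <user@host.com: False
# user@host.com>: False
```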
Atomic Groups (?>...)
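Atomic groups prevent backtracking into the group once it has matched. Python's re module supports them from version 3.11 onward.
# Without atomic group: a+ may give characters back if the rest of the pattern fails
pattern1 = r'a+b'
text = "aaaaab"
# With atomic group: whatever a+ matched is kept, no backtracking into it (Python 3.11+)
pattern2 = r'(?>a+)b'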
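The difference only shows up when the rest of the pattern needs characters that the group has already taken. Here is a small sketch; it assumes that atomic groups require Python 3.11 or newer and guards the call accordingly.

```python
import re
import sys

text = "aaab"

# Ordinary quantifier: a+ first takes "aaa", then gives one 'a' back
# so that the trailing "ab" can match
plain = re.search(r'a+ab', text)
print(plain.group())  # "aaab"

if sys.version_info >= (3, 11):
    # Atomic group: a+ keeps every 'a' it matched, so "ab" can never follow
    atomic = re.search(r'(?>a+)ab', text)
    print(atomic)  # None
else:
    atomic = None
    print("Atomic groups need Python 3.11 or newer")
```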
Recursive Patterns
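For complex nested structures, such as balanced parentheses, you might need recursive patterns, but these are advanced and not supported by every regex engine. Python's built-in re module does not support recursion; the third-party regex module does.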
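When recursion is not available, one common fallback is to let the regex find the tokens and track nesting depth in ordinary code. The function below is a hypothetical sketch of that idea for parentheses, not a replacement for a real parser.

```python
import re

def is_balanced(text):
    """Return True if every '(' in text has a matching ')' (hypothetical helper)."""
    depth = 0
    # The regex only extracts the parentheses; the loop tracks nesting
    for token in re.findall(r'[()]', text):
        depth += 1 if token == '(' else -1
        if depth < 0:  # a ')' appeared before its '('
            return False
    return depth == 0

print(is_balanced("(a(b)c)"))  # True
print(is_balanced("(()"))      # False
print(is_balanced(")("))       # False
```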
Real-World Applications
Email Header Parsing
def parse_email_header(header):
    # Match: From: Name <email@domain.com>
    pattern = r'From:\s*([^<]+)\s*<([^>]+)>'
    match = re.search(pattern, header)
    if match:
        name = match.group(1).strip()
        email = match.group(2)
        return {"name": name, "email": email}
    return None
header = "From: John Doe <john@example.com>"
result = parse_email_header(header)
print(result)  # {'name': 'John Doe', 'email': 'john@example.com'}
Log File Parsing
def parse_log_entry(log_line):
    # Match: [2023-12-01 10:30:45] INFO: User login successful
    pattern = r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+(\w+):\s+(.+)'
    match = re.search(pattern, log_line)
    if match:
        timestamp = match.group(1)
        level = match.group(2)
        message = match.group(3)
        return {
            "timestamp": timestamp,
            "level": level,
            "message": message
        }
    return None
log = "[2023-12-01 10:30:45] INFO: User login successful"
parsed = parse_log_entry(log)
print(parsed)
# {'timestamp': '2023-12-01 10:30:45', 'level': 'INFO', 'message': 'User login successful'}
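The same pattern can be applied to a whole log at once. The sketch below is one possible variation: it switches to named groups, anchors each entry to a line with re.MULTILINE, and tallies the levels. The log excerpt and the names LOG_LINE and level_counts are invented for illustration.

```python
import re
from collections import Counter

# ^ and $ match at line boundaries because of re.MULTILINE
LOG_LINE = re.compile(
    r'^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+'
    r'(?P<level>\w+):\s+(?P<message>.+)$',
    re.MULTILINE,
)

# Hypothetical log excerpt; the third line is deliberately malformed
log_text = """[2023-12-01 10:30:45] INFO: User login successful
[2023-12-01 10:31:02] ERROR: Database timeout
not a log line
[2023-12-01 10:31:10] INFO: Retry succeeded"""

# Lines that do not fit the format are skipped
entries = [m.groupdict() for m in LOG_LINE.finditer(log_text)]
level_counts = Counter(entry["level"] for entry in entries)
print(len(entries))   # 3
print(level_counts)   # Counter({'INFO': 2, 'ERROR': 1})
```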
URL Component Extraction
def parse_url(url):
    pattern = r'(https?)://([^:/]+)(?::(\d+))?(?:/([^?]*))?(?:\?([^#]*))?(?:#(.*))?'
    match = re.match(pattern, url)
    if match:
        return {
            "protocol": match.group(1),
            "domain": match.group(2),
            "port": match.group(3),
            "path": match.group(4) or "/",
            "query": match.group(5),
            "fragment": match.group(6)
        }
    return None
url = "https://www.example.com:8080/path/to/resource?param=value#section"
parsed = parse_url(url)
print(parsed)
# {'protocol': 'https', 'domain': 'www.example.com', 'port': '8080',
#  'path': 'path/to/resource', 'query': 'param=value', 'fragment': 'section'}
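The regex above is a good exercise in optional groups, but for production code the standard library's urllib.parse.urlsplit is usually a safer choice. A brief comparison on the same URL:

```python
from urllib.parse import urlsplit

url = "https://www.example.com:8080/path/to/resource?param=value#section"
parts = urlsplit(url)

print(parts.scheme)    # 'https'
print(parts.hostname)  # 'www.example.com'
print(parts.port)      # 8080 (an int, not a string)
print(parts.path)      # '/path/to/resource' (keeps the leading slash)
print(parts.query)     # 'param=value'
print(parts.fragment)  # 'section'
```

Note the two differences from the regex version: the port comes back as an integer and the path keeps its leading slash.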
Data Validation with Lookaround
def validate_password_complexity(password):
    patterns = [
        r'(?=.*[a-z])',        # At least one lowercase
        r'(?=.*[A-Z])',        # At least one uppercase
        r'(?=.*\d)',           # At least one digit
        r'(?=.*[@$!%*?&])',    # At least one special character
        r'.{8,}'               # At least 8 characters
    ]
    return all(re.search(pattern, password) for pattern in patterns)
passwords = ["weak", "Strong123!", "weakpassword", "Str0ng!Pass"]
for pwd in passwords:
    print(f"{pwd}: {'Valid' if validate_password_complexity(pwd) else 'Invalid'}")
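Because lookaheads do not consume characters, the separate checks can also be stacked at the start of a single pattern. This is a sketch of that variation; the names STRONG_PASSWORD and is_strong are made up, and it assumes passwords contain no newline characters.

```python
import re

# All four lookaheads are tested at position 0; .{8,} then consumes the password
STRONG_PASSWORD = re.compile(
    r'(?=.*[a-z])'        # at least one lowercase letter
    r'(?=.*[A-Z])'        # at least one uppercase letter
    r'(?=.*\d)'           # at least one digit
    r'(?=.*[@$!%*?&])'    # at least one special character
    r'.{8,}'              # at least 8 characters in total
)

def is_strong(password):
    return STRONG_PASSWORD.fullmatch(password) is not None

checks = {pwd: is_strong(pwd) for pwd in ["weak", "Strong123!", "weakpassword", "Str0ng!Pass"]}
print(checks)
# {'weak': False, 'Strong123!': True, 'weakpassword': False, 'Str0ng!Pass': True}
```

The list of separate patterns is easier to extend with per-rule error messages; the single pattern is more compact. Which one fits better depends on how the result is reported to the user.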
Performance Considerations
Greedy vs Lazy Matching
text = "<div>content</div><div>more</div>"
# Greedy: matches too much
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group()) # "<div>content</div><div>more</div>"
# Lazy: matches minimally
lazy = re.search(r'<div>.*?</div>', text)
print(lazy.group()) # "<div>content</div>"
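A lazy quantifier is not the only way to stop at the first closing tag. When the content cannot contain a '<', a negated character class says so directly and leaves the engine nothing to backtrack over. A minimal sketch on the same text:

```python
import re

text = "<div>content</div><div>more</div>"

# [^<]* can never run past the next tag, so each div is matched separately
blocks = re.findall(r'<div>[^<]*</div>', text)
print(blocks)  # ['<div>content</div>', '<div>more</div>']
```

This only works for flat content; a div that contains other tags would need the lazy version or a real HTML parser.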
Avoiding Catastrophic Backtracking
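# Problematic pattern (nested quantifiers)
bad_pattern = r'(a+)+b'
text = "aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
# The text has no 'b', so the match must fail, but first the engine tries
# every way of splitting the a's between the inner and outer +.
# This can take exponential time.
# Better: remove the nesting (a+b accepts the same strings), or use
# atomic groups or possessive quantifiers if available (Python 3.11+)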
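Often the simplest fix is to remove the nested quantifier altogether. The sketch below checks, on a handful of short made-up inputs, that the flat pattern a+b accepts the same strings as (a+)+b, and then runs only the flat pattern on a long input.

```python
import re

nested = re.compile(r'(a+)+b')  # prone to catastrophic backtracking
flat = re.compile(r'a+b')       # same language, no nested quantifier

# Short inputs only, so the nested pattern stays fast
samples = ["ab", "aaab", "b", "aaa", "aab!"]
agree = all(bool(nested.fullmatch(s)) == bool(flat.fullmatch(s)) for s in samples)
print(agree)  # True

# The flat pattern rejects a long run of a's without any blow-up
long_input = "a" * 10_000
print(flat.fullmatch(long_input))  # None
```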
Pre-compiling Patterns
# Compile complex patterns once
email_pattern = re.compile(r'''
(?P<local>[a-zA-Z0-9._%+-]+)
@
(?P<domain>[a-zA-Z0-9.-]+)
\.
(?P<tld>[a-zA-Z]{2,})
''', re.VERBOSE)
match = email_pattern.match("user@example.com")
if match:
    print(match.groupdict())
    # {'local': 'user', 'domain': 'example', 'tld': 'com'}
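The benefit of compiling shows when one pattern is applied to many inputs. The sketch below reuses a single compiled object in a loop; the candidate strings are made up, and the pattern is the same as above written on one line.

```python
import re

# Compiled once at module level, reused for every candidate
EMAIL = re.compile(
    r'(?P<local>[a-zA-Z0-9._%+-]+)@(?P<domain>[a-zA-Z0-9.-]+)\.(?P<tld>[a-zA-Z]{2,})'
)

candidates = ["user@example.com", "not-an-email", "admin@mail.example.org"]
parsed = {}
for candidate in candidates:
    m = EMAIL.fullmatch(candidate)  # the whole string must be an address
    if m:
        parsed[candidate] = m.groupdict()

print(parsed["admin@mail.example.org"])
# {'local': 'admin', 'domain': 'mail.example', 'tld': 'org'}
print("not-an-email" in parsed)  # False
```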
Key Points to Remember
- Groups () capture matched text for later retrieval
- Named groups (?P<name>...) make code more readable
- Non-capturing groups (?:...) group without capturing
- Backreferences \1, \2 refer to previously captured groups
- Alternation | matches one pattern OR another
- Lookaround assertions check context without consuming characters
- Complex patterns enable sophisticated text parsing and validation
Groups give you powerful text extraction capabilities, but regex also supports flags that change matching behavior and advanced patterns for even more complex scenarios. In the next lesson, we'll explore regex flags, advanced character classes, and techniques for handling complex text processing challenges.
