Pattern Matching and Groups

Introduction

Imagine you're a chef trying to extract specific ingredients from a complex recipe. You don't just want to know that the recipe contains "flour"; you want to know exactly how much flour, what type, and where it's used. This is where regex groups come in.

Groups allow you to capture specific parts of a match, reference them later, and create much more sophisticated patterns. In this lesson, we'll explore how to use groups to extract structured data from text, create complex patterns, and build powerful text processing tools.

Capturing Groups

Parentheses () in regex create capturing groups. These groups "remember" the matched text and allow you to extract specific parts of the match.

Basic Group Capture

import re

# Extract area code and number from phone number
pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = "Call me at 555-123-4567"

match = re.search(pattern, text)
if match:
    print("Full match:", match.group(0))    # "555-123-4567"
    print("Area code:", match.group(1))     # "555"
    print("Exchange:", match.group(2))      # "123"
    print("Number:", match.group(3))        # "4567"
    
    # Or get all groups at once
    print("All groups:", match.groups())    # ('555', '123', '4567')
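
When a text contains several phone numbers, the same pattern can be applied with re.finditer, which yields one match object per hit. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = "Office: 555-123-4567, mobile: 555-987-6543"

# finditer yields a match object for every non-overlapping match
numbers = [m.groups() for m in re.finditer(pattern, text)]
print(numbers)  # [('555', '123', '4567'), ('555', '987', '6543')]

# A match object also reports where each group was found
first = re.search(pattern, text)
print(first.span(1))  # (8, 11)
```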

Named Groups

For more readable code, you can name your groups:

pattern = r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})'
text = "Call me at 555-123-4567"

match = re.search(pattern, text)
if match:
    print("Area code:", match.group('area'))     # "555"
    print("Exchange:", match.group('exchange'))  # "123"
    print("Number:", match.group('number'))      # "4567"
    
    # Get all named groups as a dictionary
    print("Named groups:", match.groupdict())
    # {'area': '555', 'exchange': '123', 'number': '4567'}
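
Named groups can also be referenced later, for example in the replacement string of re.sub, where \g<name> stands for the text of that group. A small sketch that reformats the phone number; the output format is only an illustrative choice.

```python
import re

pattern = r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})'
text = "Call me at 555-123-4567"

# \g<name> in the replacement inserts the text captured by that named group
formatted = re.sub(pattern, r'(\g<area>) \g<exchange>-\g<number>', text)
print(formatted)  # Call me at (555) 123-4567
```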

Non-Capturing Groups

Sometimes you need parentheses for grouping but don't want to capture the content. Use (?:...) for non-capturing groups:

# Capturing groups: findall returns a tuple of groups for each match
pattern1 = r'(red|blue) (car|bike)'
text = "red car and blue bike"

matches = re.findall(pattern1, text)
print(matches)  # [('red', 'car'), ('blue', 'bike')]

# Non-capturing groups: findall returns the whole match as a string
pattern2 = r'(?:red|blue) (?:car|bike)'
matches = re.findall(pattern2, text)
print(matches)  # ['red car', 'blue bike']
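
A common use is to group alternatives that you need for matching but not in the result. The sketch below (sample text invented for illustration) shows how a non-capturing group keeps the unwanted title out of the captured data.

```python
import re

text = "Dr. Smith met Ms. Jones"

# The title alternatives need grouping, but only the surname is wanted
with_capture = re.findall(r'(Mr|Ms|Dr)\. (\w+)', text)
print(with_capture)  # [('Dr', 'Smith'), ('Ms', 'Jones')]

# (?:...) groups the alternatives without creating a group for them
without_capture = re.findall(r'(?:Mr|Ms|Dr)\. (\w+)', text)
print(without_capture)  # ['Smith', 'Jones']
```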

Backreferences

Backreferences allow you to refer back to previously captured groups within the same pattern. Use \1, \2, and so on to refer to the first, second, and later groups; a named group can be referenced with (?P=name).

Matching Repeated Words

# Find repeated words (like "the the")
pattern = r'\b(\w+)\s+\1\b'
text = "The the quick brown fox jumps over the the lazy dog"

matches = re.findall(pattern, text, re.IGNORECASE)
print(matches)  # ['The', 'the'] (findall returns group 1, the first word of each pair)
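
Backreferences are also available in the replacement string of re.sub. As a rough sketch, the same idea can be used to collapse repeated words into a single copy:

```python
import re

text = "The the quick brown fox jumps over the the lazy dog"

# \1 in the pattern finds the repeats; \1 in the replacement keeps one copy
cleaned = re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text, flags=re.IGNORECASE)
print(cleaned)  # The quick brown fox jumps over the lazy dog
```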

HTML Tag Matching

# Match opening and closing HTML tags
pattern = r'<(\w+)>.*?</\1>'
text = "<div>Hello <span>world</span></div>"

matches = re.findall(pattern, text)
print(matches)  # ['div']

Password Validation

# Password must start and end with the same character
def validate_password(password):
    pattern = r'^(.).*\1$'
    return bool(re.match(pattern, password))

print(validate_password("abcda"))  # True (starts and ends with 'a')
print(validate_password("hello"))  # False (starts with 'h', ends with 'o')

Alternation

The pipe symbol | creates alternation, allowing you to match one pattern OR another.

Simple Alternation

# Match different fruits
pattern = r'apple|banana|orange'
text = "I like apple and banana but not orange"

matches = re.findall(pattern, text)
print(matches)  # ['apple', 'banana', 'orange']
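
One detail worth knowing is that alternatives are tried from left to right, and the first one that matches wins, even if a later alternative would match more text. A short sketch with an invented example:

```python
import re

text = "category"

# 'cat' is listed first, so it wins and 'category' is never tried
short_first = re.findall(r'cat|category', text)
print(short_first)  # ['cat']

# Listing the longer alternative first gives the longer match
long_first = re.findall(r'category|cat', text)
print(long_first)  # ['category']
```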

Complex Alternation with Groups

# Match different date formats
pattern = r'\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b|\b(\d{4})-(\d{1,2})-(\d{1,2})\b'
text = "Dates: 12/25/2023 and 2023-12-25"

matches = re.findall(pattern, text)
print(matches)  # [('12', '25', '2023', '', '', ''), ('', '', '', '2023', '12', '25')]
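
The empty strings in the result mark the groups of the alternative that did not match. With re.finditer those groups are None, which makes it easy to tell the branches apart. The sketch below normalizes both formats to year-month-day; it assumes that the slash format means month/day/year and, for brevity, accepts only four-digit years.

```python
import re

pattern = r'\b(\d{1,2})/(\d{1,2})/(\d{4})\b|\b(\d{4})-(\d{1,2})-(\d{1,2})\b'
text = "Dates: 12/25/2023 and 2023-12-25"

iso_dates = []
for m in re.finditer(pattern, text):
    if m.group(1) is not None:
        # First alternative matched (assumed month/day/year)
        month, day, year = m.group(1, 2, 3)
    else:
        # Second alternative matched (year-month-day)
        year, month, day = m.group(4, 5, 6)
    iso_dates.append(f"{year}-{int(month):02d}-{int(day):02d}")

print(iso_dates)  # ['2023-12-25', '2023-12-25']
```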

Alternation Inside Groups

# Match words containing 'cat' or 'dog'
pattern = r'\b\w*(?:cat|dog)\w*\b'
text = "I have a cat, a dog, and a bird"

matches = re.findall(pattern, text)
print(matches)  # ['cat', 'dog']

Lookaround Assertions

Lookaround assertions allow you to match patterns based on what comes before or after, without including those characters in the match.

Positive Lookahead (?=...)

Asserts that the pattern inside matches ahead:

# Match numbers followed by 'kg'
pattern = r'\d+(?=kg)'
text = "Weight: 50kg, Height: 170cm, Volume: 2l"

matches = re.findall(pattern, text)
print(matches)  # ['50']

Negative Lookahead (?!...)

Asserts that the pattern inside does NOT match ahead:

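# Match numbers NOT followed by 'kg'
# The \d alternative stops the engine from backtracking to a shorter
# number: plain \d+(?!kg) would still match the '5' of '50kg'.
pattern = r'\d+(?!\d|kg)'
text = "Weight: 50kg, Height: 170cm, Volume: 2l"

matches = re.findall(pattern, text)
print(matches)  # ['170', '2']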

Positive Lookbehind (?<=...)

Asserts that the pattern inside matches behind:

# Match numbers preceded by '$'
pattern = r'(?<=\$)\d+'
text = "Price: $50, Discount: 10%, Tax: $5"

matches = re.findall(pattern, text)
print(matches)  # ['50', '5']

Negative Lookbehind (?<!...)

Asserts that the pattern inside does NOT match behind:

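# Match numbers NOT preceded by '$'
# Excluding a preceding digit as well keeps the engine from starting
# inside a number: plain (?<!\$)\d+ would also match the '0' of '$50'.
pattern = r'(?<![\$\d])\d+'
text = "Price: $50, Discount: 10%, Tax: $5"

matches = re.findall(pattern, text)
print(matches)  # ['10']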
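
Because lookarounds consume no characters, a pattern made only of assertions matches a position rather than text. As an illustrative sketch (the function name is hypothetical), a lookbehind and a lookahead together can locate the places where thousands separators belong:

```python
import re

def add_thousands_separators(digits):
    # Insert a comma at every position that has a digit before it and
    # a multiple of three digits after it, up to the end of the string.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', digits)

print(add_thousands_separators("1234567"))  # 1,234,567
print(add_thousands_separators("999"))      # 999
```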

Advanced Group Techniques

Conditional Groups

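# (?(1)yes|no) tries "yes" if group 1 took part in the match, otherwise "no".
# Group 1 is mandatory here, so the "end" branch can never be taken and the
# pattern behaves like (\w+)\s+\1.
pattern = r'(\w+)(?(1)\s+\1|end)'
text = "Start Start end"

# Conditionals are hard to read; plain alternation is usually clearer
pattern = r'\b(\w+)\s+\1\b|\bend\b'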
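
A conditional is more useful when the group it tests is optional. The sketch below is adapted from the example in the Python re documentation: an address may be wrapped in angle brackets, but only if both brackets are present.

```python
import re

# (<)? optionally captures an opening bracket as group 1.
# (?(1)>|$) then requires '>' if group 1 matched, otherwise end of string.
pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = {c: bool(pattern.match(c)) for c in candidates}
print(results)
# {'<user@host.com>': True, 'user@host.com': True,
#  '<user@host.com': False, 'user@host.com>': False}
```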

Atomic Groups (?>...)

Atomic groups prevent backtracking into the group once it has matched. Python's re module supports them from version 3.11 onward:

# Without atomic group (backtracks)
pattern1 = r'a+b'
text = "aaaaab"

# With atomic group (no backtracking)
pattern2 = r'(?>a+)b'
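
The two patterns above match the same strings, so the difference is easier to see when backtracking is actually needed for a match. The sketch below uses a slightly different pattern for that reason and guards the atomic version with a version check, since older Python versions reject the syntax.

```python
import re
import sys

text = "aaab"

# a+ first grabs "aaa", then gives one "a" back so that "ab" can match
backtracking = re.search(r'a+ab', text)
print(backtracking.group())  # aaab

# Atomic groups and possessive quantifiers need Python 3.11 or newer
ATOMIC_SUPPORTED = sys.version_info >= (3, 11)
if ATOMIC_SUPPORTED:
    # (?>a+) never gives back an "a", so the literal "a" after it cannot match
    atomic = re.search(r'(?>a+)ab', text)
    possessive = re.search(r'a++ab', text)  # possessive form of the same idea
    print(atomic, possessive)  # None None
```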

Recursive Patterns

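For complex nested structures such as balanced parentheses, you might need recursive patterns. These are an advanced feature that many engines lack: Python's built-in re module does not support recursion, while PCRE and the third-party regex module do.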
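
With the built-in re module, nested structures are usually handled with a little code around a simple pattern. The following is a workaround sketch, not a recursive regex: it checks balanced parentheses by repeatedly deleting innermost pairs. The function name is hypothetical.

```python
import re

def is_balanced(text):
    # Drop everything except parentheses, then repeatedly delete
    # innermost "()" pairs until nothing more can be removed.
    remaining = re.sub(r'[^()]', '', text)
    while '()' in remaining:
        remaining = remaining.replace('()', '')
    return remaining == ''

print(is_balanced("(a(b)c)"))  # True
print(is_balanced("(a))("))    # False
```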

Real-World Applications

Email Header Parsing

def parse_email_header(header):
    # Match: From: Name <email@domain.com>
    pattern = r'From:\s*([^<]+)\s*<([^>]+)>'
    match = re.search(pattern, header)
    
    if match:
        name = match.group(1).strip()
        email = match.group(2)
        return {"name": name, "email": email}
    return None

header = "From: John Doe <john@example.com>"
result = parse_email_header(header)
print(result)  # {'name': 'John Doe', 'email': 'john@example.com'}

Log File Parsing

def parse_log_entry(log_line):
    # Match: [2023-12-01 10:30:45] INFO: User login successful
    pattern = r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+(\w+):\s+(.+)'
    match = re.search(pattern, log_line)
    
    if match:
        timestamp = match.group(1)
        level = match.group(2)
        message = match.group(3)
        return {
            "timestamp": timestamp,
            "level": level,
            "message": message
        }
    return None

log = "[2023-12-01 10:30:45] INFO: User login successful"
parsed = parse_log_entry(log)
print(parsed)
# {'timestamp': '2023-12-01 10:30:45', 'level': 'INFO', 'message': 'User login successful'}
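
For a whole log file, the same pattern can be compiled once with named groups and applied line by line. This is a rough sketch: the sample lines and the names LOG_PATTERN, entries and levels are invented for illustration.

```python
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+'
    r'(?P<level>\w+):\s+(?P<message>.+)'
)

lines = [
    "[2023-12-01 10:30:45] INFO: User login successful",
    "[2023-12-01 10:31:02] ERROR: Database timeout",
    "not a log line",
]

# Lines that do not match yield None and are skipped
entries = [m.groupdict() for m in map(LOG_PATTERN.search, lines) if m]
levels = Counter(entry["level"] for entry in entries)
print(levels)  # Counter({'INFO': 1, 'ERROR': 1})
```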

URL Component Extraction

def parse_url(url):
    # The host stops at ':', '/', '?' or '#'; the path keeps its leading '/'
    pattern = r'(https?)://([^:/?#]+)(?::(\d+))?(/[^?#]*)?(?:\?([^#]*))?(?:#(.*))?'
    match = re.match(pattern, url)
    
    if match:
        return {
            "protocol": match.group(1),
            "domain": match.group(2),
            "port": match.group(3),
            "path": match.group(4) or "/",
            "query": match.group(5),
            "fragment": match.group(6)
        }
    return None

url = "https://www.example.com:8080/path/to/resource?param=value#section"
parsed = parse_url(url)
print(parsed)
# {'protocol': 'https', 'domain': 'www.example.com', 'port': '8080', 
#  'path': '/path/to/resource', 'query': 'param=value', 'fragment': 'section'}

Data Validation with Lookaround

def validate_password_complexity(password):
    patterns = [
        r'(?=.*[a-z])',  # At least one lowercase
        r'(?=.*[A-Z])',  # At least one uppercase
        r'(?=.*\d)',     # At least one digit
        r'(?=.*[@$!%*?&])',  # At least one special character
        r'.{8,}'         # At least 8 characters
    ]
    
    return all(re.search(pattern, password) for pattern in patterns)

passwords = ["weak", "Strong123!", "weakpassword", "Str0ng!Pass"]
for pwd in passwords:
    print(f"{pwd}: {'Valid' if validate_password_complexity(pwd) else 'Invalid'}")
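
Since lookaheads consume nothing, several of them can be stacked at the same position, so the five checks can also be written as one pattern. A sketch under the same rules follows; the names PASSWORD_PATTERN and is_complex are hypothetical, and unlike the version above it requires the whole password to be on one line.

```python
import re

PASSWORD_PATTERN = re.compile(
    r'(?=.*[a-z])'       # at least one lowercase letter
    r'(?=.*[A-Z])'       # at least one uppercase letter
    r'(?=.*\d)'          # at least one digit
    r'(?=.*[@$!%*?&])'   # at least one special character
    r'.{8,}'             # at least 8 characters in total
)

def is_complex(password):
    # fullmatch applies all lookaheads at position 0, then checks the length
    return PASSWORD_PATTERN.fullmatch(password) is not None

for pwd in ["weak", "Strong123!", "weakpassword", "Str0ng!Pass"]:
    print(pwd, is_complex(pwd))
```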

Performance Considerations

Greedy vs Lazy Matching

text = "<div>content</div><div>more</div>"

# Greedy: matches too much
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group())  # "<div>content</div><div>more</div>"

# Lazy: matches minimally
lazy = re.search(r'<div>.*?</div>', text)
print(lazy.group())    # "<div>content</div>"
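
A lazy quantifier still advances one character at a time and checks after each step whether the rest of the pattern matches. When the content cannot contain the delimiter, a negated character class is often a simpler alternative. A small sketch, assuming the div elements hold no nested tags:

```python
import re

text = "<div>content</div><div>more</div>"

# [^<]* cannot run past the next "<", so each match stops at its own closing tag
blocks = re.findall(r'<div>[^<]*</div>', text)
print(blocks)  # ['<div>content</div>', '<div>more</div>']
```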

Avoiding Catastrophic Backtracking

# Problematic pattern (nested quantifiers)
bad_pattern = r'(a+)+b'
text = "aaaaaaaaaaaaaaaaaaaaaaaaaaaa"

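# With no 'b' to find, the engine tries every way of splitting the a's
# between the inner and outer +, which can take exponential time.
# Better: flatten the pattern to r'a+b', or use atomic groups or
# possessive quantifiers where available (Python 3.11+)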
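
In this particular case the nested quantifier adds nothing, so the simplest fix is to remove it. The sketch below checks on a few short, harmless inputs that both patterns agree, then runs only the flattened pattern on the long input.

```python
import re

bad_pattern = r'(a+)+b'   # nested quantifiers
good_pattern = r'a+b'     # accepts the same strings with a single quantifier

# On short inputs both patterns agree on whether there is a match
samples = ["ab", "aaab", "b", "aaa", "xaab"]
agree = all(
    bool(re.search(bad_pattern, s)) == bool(re.search(good_pattern, s))
    for s in samples
)
print(agree)  # True

# On the long input without a "b", the flat pattern fails quickly
text = "a" * 28
result = re.search(good_pattern, text)
print(result)  # None
```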

Pre-compiling Patterns

# Compile complex patterns once
email_pattern = re.compile(r'''
    (?P<local>[a-zA-Z0-9._%+-]+)
    @
    (?P<domain>[a-zA-Z0-9.-]+)
    \.
    (?P<tld>[a-zA-Z]{2,})
''', re.VERBOSE)

match = email_pattern.match("user@example.com")
if match:
    print(match.groupdict())
    # {'local': 'user', 'domain': 'example', 'tld': 'com'}
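
The benefit of compiling shows when the same pattern is applied to many strings. The following sketch reuses the pattern on a list of candidates (the list and the variable names are invented); it uses fullmatch so that trailing text after the address is rejected.

```python
import re

email_pattern = re.compile(r'''
    (?P<local>[a-zA-Z0-9._%+-]+)   # part before the @
    @
    (?P<domain>[a-zA-Z0-9.-]+)     # domain name without the final label
    \.
    (?P<tld>[a-zA-Z]{2,})          # top-level domain
''', re.VERBOSE)

candidates = ["user@example.com", "admin@mail.example.org", "not-an-email"]

# The compiled pattern object is reused for every candidate
domains = [m.group('domain') for m in map(email_pattern.fullmatch, candidates) if m]
print(domains)  # ['example', 'mail.example']
```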

Key Points to Remember

Groups () capture matched text for later retrieval
Named groups (?P<name>...) make code more readable
Non-capturing groups (?:...) group without capturing
Backreferences \1, \2 refer to previously captured groups
Alternation | matches one pattern OR another
Lookaround assertions check context without consuming characters
Complex patterns enable sophisticated text parsing and validation

Groups give you powerful text extraction capabilities, but regex also supports flags that change matching behavior and advanced patterns for even more complex scenarios. In the next lesson, we'll explore regex flags, advanced character classes, and techniques for handling complex text processing challenges.