Introduction to Regular Expressions

Introduction

Imagine you're a librarian in a vast library with millions of books. A patron asks for all books about "cats" - but they want books that mention cats as pets, wild cats, even the musical "Cats." How do you find exactly what they need?

This is where regular expressions shine. They work like a super-powered search tool that finds patterns in text, not just exact words. A regular expression (regex) is a sequence of characters that defines a search pattern, letting you match, search, and manipulate text flexibly.

In Python, we use the re module to work with regular expressions. Let's explore how this powerful tool can transform the way you work with text.

What Are Regular Expressions?

A regular expression is a sequence of characters, written in a specialized syntax, that describes a set of strings. You use it to test whether a piece of text matches the pattern or to find the parts of the text that do.

Think of regex as a "mini programming language" specifically designed for text matching. Instead of writing complex loops to search through text, you write a pattern that describes what you're looking for.
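
To make the contrast concrete, here is a minimal sketch that pulls runs of digits out of a string twice: once with a hand-written loop and once with a pattern. The helper name digits_with_loop and the sample text are illustrative assumptions, not part of the lesson's own examples.

```python
import re

def digits_with_loop(text):
    """Collect runs of consecutive ASCII digits by scanning one character at a time."""
    runs, current = [], ""
    for ch in text:
        if ch in "0123456789":
            current += ch          # still inside a run of digits
        elif current:
            runs.append(current)   # a run just ended
            current = ""
    if current:                    # the text ended while inside a run
        runs.append(current)
    return runs

sample = "Order 66 shipped in 3 boxes on day 120"

loop_result = digits_with_loop(sample)
regex_result = re.findall(r"[0-9]+", sample)  # one line replaces the loop

print(loop_result)
print(regex_result)
```

Both approaches should produce the same list here; the pattern version simply states what to look for and leaves the scanning to the re module.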

Why Use Regular Expressions?

Powerful text processing: Find, replace, and extract text patterns
Data validation: Check if input matches expected formats (emails, phone numbers, etc.)
Text parsing: Extract structured data from unstructured text
Search and replace: Complex find-and-replace operations
Efficiency: Process large amounts of text quickly

Basic Regex Syntax

Let's start with the fundamental building blocks of regular expressions.

Literal Characters

The simplest regex patterns are just literal characters:

import re

# Match the word "cat"
pattern = r"cat"
text = "The cat sat on the mat."

match = re.search(pattern, text)
if match:
    print("Found:", match.group())  # "cat"

Metacharacters

Metacharacters are special characters with specific meanings:

Metacharacter	Description	Example	Matches
`.`	Any character except newline	`c.t`	"cat", "cut", "c9t"
`^`	Start of string (start of each line with re.MULTILINE)	`^Hello`	"Hello world" (but not "world Hello")
`$`	End of string (end of each line with re.MULTILINE)	`world$`	"Hello world" (but not "world Hello")
`*`	Zero or more	`ca*t`	"ct", "cat", "caat", "caaat"
`+`	One or more	`ca+t`	"cat", "caat", "caaat" (but not "ct")
`?`	Zero or one	`ca?t`	"ct", "cat" (but not "caat")
`{n}`	Exactly n times	`ca{2}t`	"caat"
`{n,}`	n or more times	`ca{2,}t`	"caat", "caaaat"
`{n,m}`	Between n and m times	`ca{1,3}t`	"cat", "caat", "caaat"
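
The quantifier rows in the table can be checked with a short sketch. It assumes a hypothetical helper, full_matches, that uses re.fullmatch() so a candidate counts only when the pattern covers the whole string.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["ct", "cat", "caat", "caaat", "caaaat"]

# Each quantifier applies to the "a" immediately before it.
for pattern in [r"ca*t", r"ca+t", r"ca?t", r"ca{2}t", r"ca{2,}t", r"ca{1,3}t"]:
    print(f"{pattern:10} {full_matches(pattern, candidates)}")
```

Using re.fullmatch() rather than re.search() matters here: with re.search(), a pattern such as `ca?t` would still find nothing in "caat", but looser patterns could match a substring and hide the difference between quantifiers.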

Character Classes

Character classes let you match any one of a set of characters:

# Match any vowel
pattern = r"[aeiou]"
text = "The quick brown fox"
matches = re.findall(pattern, text)
print(matches)  # ['e', 'u', 'i', 'o', 'o']

# Match any digit
pattern = r"[0-9]"
text = "Order #123 and #456"
matches = re.findall(pattern, text)
print(matches)  # ['1', '2', '3', '4', '5', '6']

# Match word characters (letters, digits, underscore)
pattern = r"\w+"
text = "Hello, world! 123"
matches = re.findall(pattern, text)
print(matches)  # ['Hello', 'world', '123']

Predefined Character Classes

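Class	Description	ASCII equivalent
`\d`	Digit	`[0-9]`
`\D`	Non-digit	`[^0-9]`
`\w`	Word character	`[a-zA-Z0-9_]`
`\W`	Non-word character	`[^a-zA-Z0-9_]`
`\s`	Whitespace	`[ \t\n\r\f\v]`
`\S`	Non-whitespace	`[^ \t\n\r\f\v]`

Note: for str patterns in Python 3, `\d`, `\w`, and `\s` also match Unicode digits, letters, and whitespace. The bracket forms in the last column are exact equivalents only when the re.ASCII flag is set.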
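
The following sketch exercises the predefined classes on a small sample and then shows how the re.ASCII flag narrows `\d` and `\w` to the bracket forms listed in the table. The sample strings are assumptions chosen for illustration.

```python
import re

text = "Room 42\tis on floor 7\n"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters
spaces = re.findall(r"\s", text)    # individual whitespace characters

print(digits)
print(words)
print(spaces)

# "\u0663" is ARABIC-INDIC DIGIT THREE: a digit, but not one of 0-9.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_digit = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None

# "\u00e9" is the letter e with an acute accent.
unicode_word = re.fullmatch(r"\w+", "caf\u00e9") is not None
ascii_word = re.fullmatch(r"\w+", "caf\u00e9", re.ASCII) is not None

print(unicode_digit, ascii_digit)
print(unicode_word, ascii_word)
```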

Python's re Module

Python provides the re module for working with regular expressions. Here are the most important functions:

re.search() - Find First Match

import re

pattern = r"Python"
text = "I love Python programming!"

match = re.search(pattern, text)
if match:
    print("Found at position:", match.start())  # 7
    print("Match:", match.group())              # "Python"

re.findall() - Find All Matches

pattern = r"\d+"
text = "I have 3 cats and 2 dogs."

numbers = re.findall(pattern, text)
print(numbers)  # ['3', '2']
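
re.findall() returns plain strings, so numeric work needs a conversion step. When the position of each match matters as well, re.finditer() yields match objects instead. The sketch below shows both on an assumed sample sentence.

```python
import re

text = "I have 3 cats and 12 dogs."

# findall() gives strings; convert them before doing arithmetic.
total = sum(int(n) for n in re.findall(r"\d+", text))
print(total)

# finditer() gives match objects, which carry the matched text and its span.
positions = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(positions)
```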

re.match() - Match from Start

pattern = r"Hello"
text1 = "Hello world"
text2 = "world Hello"

match1 = re.match(pattern, text1)  # Matches
match2 = re.match(pattern, text2)  # No match (doesn't start with "Hello")
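
The difference between re.match(), re.search(), and re.fullmatch() is easy to mix up, so here is a small side-by-side sketch. The results dictionary and the third sample string are additions for illustration.

```python
import re

pattern = r"Hello"
results = {}

for text in ["Hello world", "world Hello", "Hello"]:
    results[text] = (
        re.match(pattern, text) is not None,      # must match at the start
        re.search(pattern, text) is not None,     # may match anywhere
        re.fullmatch(pattern, text) is not None,  # must match the whole string
    )

for text, (at_start, anywhere, whole) in results.items():
    print(f"{text!r:15} match={at_start} search={anywhere} fullmatch={whole}")
```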

re.sub() - Replace Matches

pattern = r"old"
replacement = "new"
text = "The old man and the old dog"

new_text = re.sub(pattern, replacement, text)
print(new_text)  # "The new man and the new dog"
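
A few variations on re.sub() are worth knowing: the count argument limits the number of replacements, the replacement can be a function that receives each match object, and re.subn() also reports how many substitutions were made. The sketch below uses an assumed helper named shout and an extra sample sentence.

```python
import re

text = "The old man and the old dog"

# Replace only the first occurrence.
first_only = re.sub(r"old", "new", text, count=1)

def shout(match):
    """Replacement function: return the matched text in upper case."""
    return match.group().upper()

shouted = re.sub(r"old", shout, text)

# subn() returns the new string together with the number of replacements.
replaced, how_many = re.subn(r"old", "new", text)

# Without word boundaries the pattern also hits "old" inside "bold".
loose = re.sub(r"old", "new", "a bold old move")
strict = re.sub(r"\bold\b", "new", "a bold old move")

print(first_only)
print(shouted)
print(replaced, how_many)
print(loose)
print(strict)
```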

re.split() - Split by Pattern

pattern = r"\s+"
text = "Split   this    text"

parts = re.split(pattern, text)
print(parts)  # ['Split', 'this', 'text']
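
re.split() becomes more useful than str.split() when the separator varies. The sketch below splits on commas or semicolons with optional surrounding whitespace, limits the number of splits with maxsplit, and shows one behavior that may surprise: a match at the very start produces an empty first element. The sample strings are assumptions.

```python
import re

record = "apple, banana;cherry ,  date"

# Separator: a comma or semicolon, with any whitespace around it.
fields = re.split(r"\s*[,;]\s*", record)
print(fields)

# maxsplit stops after the given number of splits.
head_and_rest = re.split(r"\s+", "Split   this    text", maxsplit=1)
print(head_and_rest)

# Leading whitespace matches the separator, leaving an empty first item,
# whereas str.split() with no arguments discards it.
with_re = re.split(r"\s+", "  lead")
with_str = "  lead".split()
print(with_re, with_str)
```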

Compiling Regular Expressions

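If you use the same pattern many times, compile it once with re.compile(). The resulting pattern object offers the same operations as methods (search(), findall(), sub(), and so on) and gives the pattern a readable name. The re module also caches recently used patterns, so the speed gain is usually modest: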

import re

# Compile once, use many times
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

emails = [
    "user@example.com",
    "invalid-email",
    "another@test.org"
]

for email in emails:
    if email_pattern.fullmatch(email):  # the whole string must match, not just a prefix
        print(f"Valid: {email}")
    else:
        print(f"Invalid: {email}")
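
A compiled pattern is an object with its own methods, and flags such as re.IGNORECASE are fixed at compile time. The sketch below reuses one pattern for finding, locating, and replacing; the name word_pattern and the sample sentences are assumptions for illustration.

```python
import re

# Flags are passed once, when the pattern is compiled.
word_pattern = re.compile(r"\bcat\b", re.IGNORECASE)

text = "Cat people say a cat is not a CAT scan; concatenate is unrelated."

found = word_pattern.findall(text)          # every whole-word "cat", any case
first = word_pattern.search(text)           # first match object
replaced = word_pattern.sub("dog", "My cat and your Cat")

print(found)
print(first.start(), first.group())
print(replaced)
print(word_pattern.pattern)                 # the original pattern string
```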

Practical Examples

Email Validation

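def is_valid_email(email):
    # Simplified check; real-world addresses allow more forms than this.
    # fullmatch() requires the entire string to match, so no ^ or $ is needed.
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    return bool(re.fullmatch(pattern, email))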

emails = ["user@example.com", "invalid@", "test.email@domain.co.uk"]
for email in emails:
    print(f"{email}: {'Valid' if is_valid_email(email) else 'Invalid'}")

Phone Number Extraction

def extract_phone_numbers(text):
    # Match 10-digit numbers with optional "-" or "." separators, e.g. 555-123-4567
    pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    return re.findall(pattern, text)

text = "Call me at 555-123-4567 or 555.987.6543"
phones = extract_phone_numbers(text)
print("Found phones:", phones)

URL Extraction

def extract_urls(text):
    # Scheme, host, optional :port, then optional path, ?query, and #fragment
    pattern = r'https?://[-\w.]+(?::\d+)?(?:/[\w/_.]*(?:\?[\w&=%.]*)?(?:#\w*)?)?'
    return re.findall(pattern, text)

text = "Visit https://www.example.com and http://test.org/page?q=search"
urls = extract_urls(text)
print("Found URLs:", urls)

Common Patterns and Best Practices

Escaping Special Characters

When you want to match literal special characters, escape them with backslash:

# Match literal dot
pattern = r"\."
text = "file.txt and file.py"
matches = re.findall(pattern, text)
print(matches)  # ['.', '.']
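
When the text to match comes from a variable, escaping each special character by hand is error-prone. The standard library's re.escape() does it for you. The sketch below, with an assumed price string, shows what goes wrong without escaping and how re.escape() fixes it.

```python
import re

price = "$9.99 (sale)"
sentence = "Now $9.99 (sale) today"

# Used as-is, "$", "." and "(" ")" keep their special meanings.
unescaped_hit = re.search(price, sentence)

# re.escape() backslash-escapes the special characters so they match literally.
escaped_hit = re.search(re.escape(price), sentence)

print(unescaped_hit)
print(escaped_hit.group())

# An unescaped dot matches any character; an escaped one matches only ".".
dot_is_wildcard = re.fullmatch("1.5", "125") is not None
dot_is_literal = re.fullmatch(re.escape("1.5"), "125") is not None
print(dot_is_wildcard, dot_is_literal)
```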

Raw Strings

Always use raw strings (r"pattern") for regex patterns, so that backslashes reach the regex engine unchanged and you don't have to double them:

# Good
pattern = r"\d+\.\d+"

# Avoid (hard to read)
pattern = "\\d+\\.\\d+"
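
Readability is not the only reason to use raw strings. Some escapes mean one thing to Python and another to the regex engine. The sketch below uses `\b`, which is a backspace character in an ordinary string but a word boundary in a regex; the sample text is an assumption.

```python
import re

text = "a cat sat"

# Raw string: the regex engine receives backslash + "b" (word boundary).
raw_matches = re.findall(r"\bcat\b", text)

# Ordinary string: Python turns "\b" into a backspace character first,
# so the pattern looks for backspace + "cat" + backspace.
plain_matches = re.findall("\bcat\b", text)

print(raw_matches)
print(plain_matches)
print(len(r"\b"), len("\b"))  # two characters versus one
```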

Greedy vs Non-Greedy Matching

By default, quantifiers are greedy (match as much as possible):

text = "<div>content</div><div>more content</div>"

# Greedy (matches everything between first <div> and last </div>)
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group())  # "<div>content</div><div>more content</div>"

# Non-greedy (matches as little as possible)
nongreedy = re.search(r'<div>.*?</div>', text)
print(nongreedy.group())  # "<div>content</div>"
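
The same contrast shows up with re.findall(), and a negated character class is often a useful alternative to a non-greedy quantifier. The sketch below reuses the sample text from the example above; the variable names are illustrative.

```python
import re

text = "<div>content</div><div>more content</div>"

# Greedy: one match that swallows both blocks.
greedy_all = re.findall(r"<div>.*</div>", text)

# Non-greedy: each block is matched separately.
lazy_all = re.findall(r"<div>.*?</div>", text)

# Negated class: "anything except >" cannot run past the end of a tag.
tags = re.findall(r"<[^>]+>", text)

print(len(greedy_all), len(lazy_all))
print(lazy_all)
print(tags)
```

The negated class states the stopping condition directly, which tends to leave the engine less backtracking to do than `.*?`.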

Performance Considerations

Compile patterns when used multiple times
Prefer specific patterns (for example, \d{4} or [^>]+) over broad ones such as .*
Avoid nested quantifiers such as (a+)+, which can trigger catastrophic backtracking
Consider alternatives like string methods for simple searches
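
Two of these points can be sketched briefly. The first part shows string methods doing the work of a simple pattern; the second shows a nested quantifier next to a flat pattern that accepts the same strings. The log line and variable names are assumptions, and the inputs are kept short on purpose so the nested pattern stays fast.

```python
import re

log_line = "ERROR: disk full on /dev/sda1"

# For a fixed substring, string methods are simpler than a regex.
found_with_in = "ERROR" in log_line
found_with_re = re.search(r"ERROR", log_line) is not None
starts_with_error = log_line.startswith("ERROR")
print(found_with_in, found_with_re, starts_with_error)

# A nested quantifier and a flat pattern that match the same strings.
nested = re.compile(r"(?:a+)+b")  # can backtrack heavily on long runs of "a" with no "b"
flat = re.compile(r"a+b")         # same matches, without the nesting

for sample in ["aaab", "aaac"]:
    print(sample, bool(nested.fullmatch(sample)), bool(flat.fullmatch(sample)))
```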

Key Points to Remember

Regular expressions are powerful patterns for text matching and manipulation
Use metacharacters like ., *, +, ? for flexible matching
Character classes [abc] and predefined classes \d, \w help match specific types of characters
Python's re module provides search(), findall(), match(), sub(), and split()
Always use raw strings for regex patterns
Consider performance and use compiled patterns for repeated use

Now that you understand the basics of regular expressions, we'll dive deeper into pattern matching with groups, which allow you to extract specific parts of matched text and create more complex patterns. This will give you the power to parse structured data from unstructured text.