- Understand what regular expressions are and why they're useful
- Learn basic regex syntax and metacharacters
- Use Python's re module for pattern matching
- Apply simple patterns for text searching
Introduction to Regular Expressions
Introduction
Imagine you're a librarian in a vast library with millions of books. A patron asks for all books about "cats" - but they want books that mention cats as pets, wild cats, even the musical "Cats." How do you find exactly what they need?
This is where regular expressions shine. They work like a search tool that finds patterns in text, not just exact words. A regular expression (regex) is a sequence of characters that defines a search pattern, letting you match, search, and manipulate text flexibly.
In Python, we use the re module to work with regular expressions. Let's explore how this powerful tool can transform the way you work with text.
What Are Regular Expressions?
A regular expression is a sequence of characters, written in a specialized syntax, that describes a set of strings to match or find.
Think of regex as a "mini programming language" specifically designed for text matching. Instead of writing complex loops to search through text, you write a pattern that describes what you're looking for.
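As a rough illustration of that point, the sketch below collects every run of digits in a string twice: once with a hand-written loop and once with a pattern. The function names and the sample text are made up for this example.

```python
import re

def find_numbers_loop(text):
    """Collect runs of ASCII digits by scanning character by character."""
    numbers, current = [], ""
    for ch in text:
        if ch in "0123456789":
            current += ch
        elif current:
            numbers.append(current)
            current = ""
    if current:  # flush a run that reaches the end of the text
        numbers.append(current)
    return numbers

def find_numbers_regex(text):
    """Same result with a pattern: one or more digits in a row."""
    return re.findall(r"[0-9]+", text)

sample = "Order 66 shipped 3 items on day 120"
print(find_numbers_loop(sample))   # ['66', '3', '120']
print(find_numbers_regex(sample))  # ['66', '3', '120']
```

The pattern version states what to find rather than how to scan for it, which is the main appeal for anything beyond a fixed substring.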
Why Use Regular Expressions?
- Powerful text processing: Find, replace, and extract text patterns
- Data validation: Check if input matches expected formats (emails, phone numbers, etc.)
- Text parsing: Extract structured data from unstructured text
- Search and replace: Complex find-and-replace operations
- Efficiency: Process large amounts of text quickly
Basic Regex Syntax
Let's start with the fundamental building blocks of regular expressions.
Literal Characters
The simplest regex patterns are just literal characters:
import re
# Match the word "cat"
pattern = r"cat"
text = "The cat sat on the mat."
match = re.search(pattern, text)
if match:
    print("Found:", match.group())  # "cat"
Metacharacters
Metacharacters are special characters with specific meanings:
| Metacharacter | Description | Example | Matches |
|---|---|---|---|
| . | Any character except newline | c.t | "cat", "cut", "c9t" |
| ^ | Start of string/line | ^Hello | "Hello world" (but not "world Hello") |
| $ | End of string/line | world$ | "Hello world" (but not "world Hello") |
| * | Zero or more | ca*t | "ct", "cat", "caat", "caaat" |
| + | One or more | ca+t | "cat", "caat", "caaat" (but not "ct") |
| ? | Zero or one | ca?t | "ct", "cat" (but not "caat") |
| {n} | Exactly n times | ca{2}t | "caat" |
| {n,} | n or more times | ca{2,}t | "caat", "caaaat" |
| {n,m} | Between n and m times | ca{1,3}t | "cat", "caat", "caaat" |
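One way to check the quantifier rows above is to run each example pattern against the same small word list. The sketch below uses re.fullmatch() so that a word counts only if the whole word fits the pattern; the variable names are illustrative. It also shows that ^ matches at the start of each line only when the re.MULTILINE flag is passed.

```python
import re

candidates = ["ct", "cat", "caat", "caaat"]
patterns = [r"ca*t", r"ca+t", r"ca?t", r"ca{2}t", r"ca{2,}t", r"ca{1,3}t"]

# For each pattern, keep the candidate words that it matches in full.
results = {}
for p in patterns:
    results[p] = [w for w in candidates if re.fullmatch(p, w)]
    print(p, "->", results[p])

# ^ anchors at the start of the string by default...
two_lines = "Hello world\nHello again"
start_only = re.findall(r"^Hello", two_lines)
# ...and at the start of every line with re.MULTILINE.
every_line = re.findall(r"^Hello", two_lines, flags=re.MULTILINE)
print(start_only, every_line)
```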
Character Classes
Character classes let you match any one of a set of characters:
# Match any vowel
pattern = r"[aeiou]"
text = "The quick brown fox"
matches = re.findall(pattern, text)
print(matches) # ['e', 'u', 'i', 'o', 'o']
# Match any digit
pattern = r"[0-9]"
text = "Order #123 and #456"
matches = re.findall(pattern, text)
print(matches) # ['1', '2', '3', '4', '5', '6']
# Match word characters (letters, digits, underscore)
pattern = r"\w+"
text = "Hello, world! 123"
matches = re.findall(pattern, text)
print(matches) # ['Hello', 'world', '123']
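Character classes also support ranges such as A-Z and negation with a leading ^ inside the brackets. A minimal sketch, using a made-up sample string:

```python
import re

text = "Room 42B, floor 3"

# A range inside brackets: any ASCII letter, one or more in a row.
letters = re.findall(r"[A-Za-z]+", text)
print(letters)      # ['Room', 'B', 'floor']

# A leading ^ inside brackets negates the class: anything that is NOT a digit.
non_digits = re.findall(r"[^0-9]+", text)
print(non_digits)   # ['Room ', 'B, floor ']
```

Note that ^ means "start of string" outside brackets and "not" only as the first character inside them.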
Predefined Character Classes
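| Class | Description | ASCII equivalent |
|---|---|---|
| \d | Digit | [0-9] |
| \D | Non-digit | [^0-9] |
| \w | Word character (letter, digit, underscore) | [a-zA-Z0-9_] |
| \W | Non-word character | [^a-zA-Z0-9_] |
| \s | Whitespace | [ \t\n\r\f\v] |
| \S | Non-whitespace | [^ \t\n\r\f\v] |

For str patterns, Python matches these classes against Unicode by default, so \d, \w, and \s also accept non-ASCII digits, letters, and whitespace. The bracket forms in the last column are exact only when the re.ASCII flag is used.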
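The short sketch below applies the predefined classes to a made-up string, then compares \d with and without the re.ASCII flag. The second sample contains U+0663 (ARABIC-INDIC DIGIT THREE), chosen here only to show that \d is wider than [0-9] for str patterns by default.

```python
import re

text = "Tel: 555 0199\tok"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of word characters
spaces = re.findall(r"\s", text)       # individual whitespace characters
non_space = re.findall(r"\S+", text)   # runs of non-whitespace
print(digits, words, spaces, non_space)

# By default \d matches any Unicode decimal digit; re.ASCII narrows it to 0-9.
mixed = "3\u0663"
unicode_digits = re.findall(r"\d", mixed)
ascii_digits = re.findall(r"\d", mixed, flags=re.ASCII)
print(unicode_digits, ascii_digits)
```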
Python's re Module
Python provides the re module for working with regular expressions. Here are the most important functions:
re.search() - Find First Match
import re
pattern = r"Python"
text = "I love Python programming!"
match = re.search(pattern, text)
if match:
    print("Found at position:", match.start())  # 7
    print("Match:", match.group())  # "Python"
re.findall() - Find All Matches
pattern = r"\d+"
text = "I have 3 cats and 2 dogs."
numbers = re.findall(pattern, text)
print(numbers) # ['3', '2']
re.match() - Match from Start
pattern = r"Hello"
text1 = "Hello world"
text2 = "world Hello"
match1 = re.match(pattern, text1) # Matches
match2 = re.match(pattern, text2) # No match (doesn't start with "Hello")
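To make the difference concrete, the sketch below runs the same pattern through re.match(), re.search(), and re.fullmatch(). The last function is not covered above; it succeeds only when the entire string fits the pattern.

```python
import re

pattern = r"Hello"

# match() only looks at the start of the string.
at_start = re.match(pattern, "world Hello")
print(at_start)                      # None

# search() scans forward and reports the first occurrence anywhere.
anywhere = re.search(pattern, "world Hello")
print(anywhere.start())              # 6

# fullmatch() requires the whole string to match, start to end.
whole = re.fullmatch(pattern, "Hello")
partial = re.fullmatch(pattern, "Hello world")
print(whole is not None, partial)    # True None
```

Each function returns either a match object or None, so test the result before calling methods such as group() or start().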
re.sub() - Replace Matches
pattern = r"old"
replacement = "new"
text = "The old man and the old dog"
new_text = re.sub(pattern, replacement, text)
print(new_text) # "The new man and the new dog"
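A plain-substring pattern like the one above also rewrites "old" inside longer words. The sketch below, with made-up sample strings, shows that effect, a word-boundary fix using \b, the count argument for limiting replacements, and re.subn(), which also reports how many replacements were made.

```python
import re

text = "The old man was bold"

# Without boundaries, "old" inside "bold" is replaced too.
naive = re.sub(r"old", "new", text)
print(naive)     # "The new man was bnew"

# \b restricts the match to the whole word "old".
whole_word = re.sub(r"\bold\b", "new", text)
print(whole_word)  # "The new man was bold"

# count limits the number of replacements (passed by keyword here).
first_only = re.sub(r"old", "new", "old, old, old", count=1)
print(first_only)  # "new, old, old"

# subn() returns the new string together with the replacement count.
replaced, how_many = re.subn(r"old", "new", "old, old, old")
print(replaced, how_many)  # "new, new, new" 3
```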
re.split() - Split by Pattern
pattern = r"\s+"
text = "Split   this   text"
parts = re.split(pattern, text)
print(parts) # ['Split', 'this', 'text']
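re.split() is most useful when the separator varies. The sketch below, using made-up input, splits on either a comma or a semicolon followed by optional spaces, limits the number of splits with maxsplit, and shows that a separator at the very start of the string produces an empty first item.

```python
import re

text = "a, b;c,  d"

# Split on "," or ";" plus any spaces that follow.
parts = re.split(r"[,;]\s*", text)
print(parts)        # ['a', 'b', 'c', 'd']

# maxsplit stops after the given number of splits.
head_and_rest = re.split(r"[,;]\s*", text, maxsplit=1)
print(head_and_rest)  # ['a', 'b;c,  d']

# A leading separator yields an empty string as the first item.
leading = re.split(r"\s+", " a b")
print(leading)      # ['', 'a', 'b']
```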
Compiling Regular Expressions
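If you use the same pattern many times, compile it once with re.compile(). The compiled object offers the same operations as methods (search, match, findall, sub, split) and keeps the pattern in one named place. The speed gain is usually modest, because the re module already caches recently used patterns.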
import re
# Compile once, use many times
email_pattern = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
emails = [
    "user@example.com",
    "invalid-email",
    "another@test.org",
]
for email in emails:
    # fullmatch() requires the whole string to fit the pattern
    if email_pattern.fullmatch(email):
        print(f"Valid: {email}")
    else:
        print(f"Invalid: {email}")
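A compiled pattern object has methods that mirror the module-level functions, so everything shown earlier with re.search() or re.findall() works the same way on the object. A small sketch, with illustrative names and sample text:

```python
import re

number = re.compile(r"\d+")
text = "3 cats, 12 dogs"

found = number.findall(text)        # same as re.findall(r"\d+", text)
masked = number.sub("#", text)      # same as re.sub(r"\d+", "#", text)
missing = number.search("no digits here")
print(found, masked, missing)

# Flags are given once, at compile time.
python_word = re.compile(r"python", re.IGNORECASE)
hit = python_word.search("I love PYTHON")
print(hit.group())

# The original pattern text stays available on the object.
print(number.pattern)
```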
Practical Examples
Email Validation
def is_valid_email(email):
    # Simplified check: local part, "@", domain, then a TLD of 2+ letters
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    # fullmatch() anchors at both ends, so ^ and $ are not needed
    return bool(re.fullmatch(pattern, email))
emails = ["user@example.com", "invalid@", "test.email@domain.co.uk"]
for email in emails:
    print(f"{email}: {'Valid' if is_valid_email(email) else 'Invalid'}")
Phone Number Extraction
def extract_phone_numbers(text):
    # Match 10-digit numbers with optional "-" or "." separators
    pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    return re.findall(pattern, text)
text = "Call me at 555-123-4567 or 555.987.6543"
phones = extract_phone_numbers(text)
print("Found phones:", phones)
URL Extraction
def extract_urls(text):
    # Scheme, host, optional :port, then optional path, ?query, and #fragment
    pattern = r'https?://[-\w.]+(?::\d+)?(?:/[\w/_.]*(?:\?[\w&=%.]*)?(?:#\w*)?)?'
    return re.findall(pattern, text)
text = "Visit https://www.example.com and http://test.org/page?q=search"
urls = extract_urls(text)
print("Found URLs:", urls)
Common Patterns and Best Practices
Escaping Special Characters
To match a special character literally, escape it with a backslash:
# Match literal dot
pattern = r"\."
text = "file.txt and file.py"
matches = re.findall(pattern, text)
print(matches) # ['.', '.']
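When the literal text comes from a variable, escaping by hand is error-prone. The standard library's re.escape() adds the backslashes for you. A minimal sketch with a made-up filename:

```python
import re

filename = "report.v1+final.txt"

# Used directly as a pattern, "." and "+" keep their special meaning,
# so the string does not even match itself ("1+" means "one or more 1s").
unescaped = re.fullmatch(filename, filename)
print(unescaped)            # None

# re.escape() backslash-escapes the special characters.
safe_pattern = re.escape(filename)
exact = re.fullmatch(safe_pattern, filename)
lookalike = re.fullmatch(safe_pattern, "reportXv1+final.txt")
print(exact is not None, lookalike)  # True None
```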
Raw Strings
Always use raw strings (r"pattern") for regex patterns to avoid double-escaping:
# Good
pattern = r"\d+\.\d+"
# Avoid (hard to read)
pattern = "\\d+\\.\\d+"
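Readability is not the only reason. In a normal string, Python interprets some backslash sequences before the regex engine ever sees them. The sketch below uses \b, which is a backspace character in a normal string but a word boundary in a raw one:

```python
import re

text = "a cat sat"

# "\b" in a normal string is a single backspace character (ASCII 8).
print(len("\b"), len(r"\b"))   # 1 2

# The pattern below looks for backspace + "cat" + backspace: no match.
normal = re.findall("\bcat\b", text)

# The raw string passes backslash-b through to the regex engine intact.
raw = re.findall(r"\bcat\b", text)
print(normal, raw)             # [] ['cat']
```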
Greedy vs Non-Greedy Matching
By default, quantifiers are greedy (match as much as possible):
text = "<div>content</div><div>more content</div>"
# Greedy (matches everything between first <div> and last </div>)
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group()) # "<div>content</div><div>more content</div>"
# Non-greedy (matches as little as possible)
nongreedy = re.search(r'<div>.*?</div>', text)
print(nongreedy.group()) # "<div>content</div>"
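The same contrast shows up with re.findall(). The sketch below also includes a common alternative to the non-greedy quantifier: a negated character class that cannot run past the next "<". It assumes the content between the tags contains no "<", which holds for this sample but not for nested HTML.

```python
import re

text = "<div>content</div><div>more content</div>"

greedy_all = re.findall(r"<div>.*</div>", text)
lazy_all = re.findall(r"<div>.*?</div>", text)
# [^<]* matches any run of characters other than "<".
negated_all = re.findall(r"<div>[^<]*</div>", text)

print(len(greedy_all))  # 1 (one match spanning both elements)
print(lazy_all)         # ['<div>content</div>', '<div>more content</div>']
print(negated_all == lazy_all)  # True
```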
Performance Considerations
- Compile patterns when used multiple times
- Use specific patterns rather than broad ones
- Avoid catastrophic backtracking with nested quantifiers
- Consider alternatives like string methods for simple searches
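Two of the points above can be sketched briefly. Simple checks often need no regex at all, and a nested quantifier such as (?:a+)+ can usually be flattened to a single quantifier that accepts the same strings. The sample strings and names are illustrative, and the inputs are kept short so both patterns run instantly.

```python
import re

log_line = "ERROR: disk full"

# String methods cover fixed-text checks without any pattern.
is_error = log_line.startswith("ERROR")
mentions_disk = "disk" in log_line
print(is_error, mentions_disk)

# Nested quantifier: many ways to divide a run of "a" between the inner
# and outer "+", which can cause heavy backtracking on long non-matching input.
risky = re.compile(r"(?:a+)+b")
# Flattened form: accepts the same strings with a single quantifier.
flat = re.compile(r"a+b")

samples = ["ab", "aaab", "b", "aaa"]
agreement = [bool(risky.fullmatch(s)) == bool(flat.fullmatch(s)) for s in samples]
print(agreement)
```

On a long run of "a" with no "b" at the end, the nested form may take exponentially longer to fail in Python's backtracking engine, while the flat form fails quickly.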
Key Points to Remember
- Regular expressions are powerful patterns for text matching and manipulation
- Use metacharacters like ., *, +, and ? for flexible matching
- Character classes such as [abc] and predefined classes such as \d and \w help match specific types of characters
- Python's re module provides search(), findall(), match(), sub(), and split()
- Always use raw strings for regex patterns
- Consider performance and use compiled patterns for repeated use
Now that you understand the basics of regular expressions, we'll dive deeper into pattern matching with groups, which allow you to extract specific parts of matched text and create more complex patterns. This will give you the power to parse structured data from unstructured text.
