- Understand regex syntax and common metacharacters
- Use Pattern and Matcher classes for searching and extraction
- Apply regex for validation and text processing tasks
Regular Expressions in Java
You need to validate email addresses, extract phone numbers from documents, or find URLs in web pages. Writing character-by-character checks is tedious and error-prone. Enter regular expressions.
Regex is a compact language for describing text patterns, so you can express complex searches concisely. Some dialect of it is available in Java, Python, JavaScript, most text editors, and command-line tools such as grep.
What is a Regular Expression?
A regex is a sequence of characters defining a search pattern. Like "find" on steroids. Simple search finds exact text. Regex finds patterns like "any word starting with 'J' and ending with 'a'" or "exactly 10 digits."
Example: \d{3}-\d{4} matches phone numbers like "555-1234": three digits, a hyphen, then four digits.
The Two Main Classes: Pattern and Matcher
Java provides regex support through two main classes in the java.util.regex package:
Pattern - Represents a compiled regular expression. You compile your regex string into a Pattern object once, then use it multiple times.
Matcher - The engine that performs matching operations against a string using a Pattern.
// Step 1: Compile the pattern
Pattern pattern = Pattern.compile("hello");
// Step 2: Create a matcher for your input
Matcher matcher = pattern.matcher("hello world");
// Step 3: Perform operations
boolean found = matcher.find(); // true
For simple one-time checks, you can use shorthand methods:
// Quick check if entire string matches
boolean matches = Pattern.matches("hello", "hello"); // true
// Even shorter using String's matches method
boolean quick = "hello world".matches(".*world"); // true
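The three steps above can be put together in a complete program. The following is a minimal sketch, not a definitive implementation: the class name FindAllDemo and the method describeMatches are illustrative names that do not appear in the text. It calls find() in a loop to visit every match and reports where each one starts and ends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// FindAllDemo and describeMatches are illustrative names, not from the text above.
class FindAllDemo {
    // Compile once; a Pattern is immutable and can be reused.
    private static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{4}");

    // Lists every match as text@start-end, separated by single spaces.
    static String describeMatches(String input) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = PHONE.matcher(input);
        while (matcher.find()) {                      // each call advances to the next match
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(matcher.group())
               .append('@').append(matcher.start())  // index of the first matched character
               .append('-').append(matcher.end());   // index just past the match
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("Call 555-1234 or 555-9876"));
        // prints: 555-1234@5-13 555-9876@17-25
    }
}
```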
Building Patterns: Character Classes
The real power of regex comes from special pattern syntax. Let's start with character classes - ways to match specific sets of characters.
Literal Characters
The simplest pattern matches literal characters. The pattern cat matches the text "cat".
The Dot (.) - Any Character
The dot matches any single character except a line terminator such as \n (unless the pattern is compiled with the Pattern.DOTALL flag):
"cat".matches(".at") // true - 'c' matches '.'
"bat".matches(".at") // true - 'b' matches '.'
"hat".matches(".at") // true - 'h' matches '.'
"at".matches(".at") // false - nothing to match '.'
Character Sets with Square Brackets
Square brackets define a set of characters to match:
| Pattern | Meaning | Examples |
|---|---|---|
| [abc] | Match a, b, OR c | "a" ✓, "d" ✗ |
| [a-z] | Match any lowercase letter | "m" ✓, "M" ✗ |
| [A-Z] | Match any uppercase letter | "M" ✓, "m" ✗ |
| [0-9] | Match any digit | "5" ✓, "x" ✗ |
| [a-zA-Z] | Match any letter | "a" ✓, "Z" ✓ |
| [^abc] | Match anything EXCEPT a, b, c | "d" ✓, "a" ✗ |
"cat".matches("[abc]at") // true - 'c' is in [abc]
"hat".matches("[abc]at") // false - 'h' is not in [abc]
"hat".matches("[^abc]at") // true - 'h' is NOT a, b, or c
Predefined Character Classes
Writing [0-9] every time you want a digit is tedious. Java provides shorthand:
| Shorthand | Meaning | Equivalent |
|---|---|---|
| \d | Any digit | [0-9] |
| \D | Any non-digit | [^0-9] |
| \w | Any word character | [a-zA-Z0-9_] |
| \W | Any non-word character | [^a-zA-Z0-9_] |
| \s | Any whitespace | [ \t\n\x0B\f\r] |
| \S | Any non-whitespace | [^ \t\n\x0B\f\r] |
Important: In Java strings, backslash is an escape character, so you must double it: \\d instead of \d.
"123".matches("\\d+") // true - one or more digits
"hello".matches("\\w+") // true - one or more word characters
" ".matches("\\s+") // true - one or more whitespace
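The doubling rule is easier to trust once you see what the regex engine actually receives. This is a small sketch under assumed names (EscapeDemo and isDecimal are illustrative): the Java literal "\\d+" is a three-character string, and an escaped dot matches only a literal dot.

```java
// EscapeDemo is a hypothetical class illustrating backslash doubling in Java string literals.
class EscapeDemo {
    // Source text: "\\d+\\.\\d+"   Regex the engine sees: \d+\.\d+
    static boolean isDecimal(String s) {
        return s.matches("\\d+\\.\\d+");
    }

    public static void main(String[] args) {
        String regex = "\\d+";
        System.out.println(regex);              // prints \d+
        System.out.println(regex.length());     // prints 3
        System.out.println(isDecimal("3.14"));  // true
        System.out.println(isDecimal("3x14"));  // false - the escaped dot needs a real '.'
    }
}
```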
Quantifiers: How Many?
Quantifiers specify how many times a pattern element should appear.
Basic Quantifiers
| Quantifier | Meaning | Example Pattern | Matches |
|---|---|---|---|
| * | Zero or more | ab*c | "ac", "abc", "abbc" |
| + | One or more | ab+c | "abc", "abbc" (not "ac") |
| ? | Zero or one | colou?r | "color", "colour" |
| {n} | Exactly n | a{3} | "aaa" only |
| {n,} | n or more | a{2,} | "aa", "aaa", "aaaa"... |
| {n,m} | Between n and m | a{2,4} | "aa", "aaa", "aaaa" |
Understanding the Difference
Let's see how these differ in practice:
// * means "zero or more" - the 'b' is optional and can repeat
"ac".matches("ab*c") // true (zero b's)
"abc".matches("ab*c") // true (one b)
"abbbc".matches("ab*c") // true (three b's)
// + means "one or more" - at least one 'b' required
"ac".matches("ab+c") // false (no b - doesn't match)
"abc".matches("ab+c") // true (one b)
// ? means "zero or one" - the 'u' is optional
"color".matches("colou?r") // true (American spelling)
"colour".matches("colou?r") // true (British spelling)
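The counted forms {n}, {n,}, and {n,m} follow the same idea. Here is a hedged sketch with made-up rules: the five-digit ZIP format and the 3 to 16 character username limit are assumptions chosen for illustration, as are the names QuantifierDemo, isZip, isUsername, and hasAtLeastTwoDigits.

```java
// QuantifierDemo is a hypothetical class; the validation rules are illustrative assumptions.
class QuantifierDemo {
    // {5}: exactly five digits.
    static boolean isZip(String s) {
        return s.matches("\\d{5}");
    }

    // {3,16}: between 3 and 16 word characters, inclusive.
    static boolean isUsername(String s) {
        return s.matches("\\w{3,16}");
    }

    // {2,}: two or more digits, no upper limit.
    static boolean hasAtLeastTwoDigits(String s) {
        return s.matches("\\d{2,}");
    }

    public static void main(String[] args) {
        System.out.println(isZip("12345"));            // true
        System.out.println(isZip("1234"));             // false
        System.out.println(isUsername("ab"));          // false
        System.out.println(isUsername("java_dev"));    // true
        System.out.println(hasAtLeastTwoDigits("7"));  // false
    }
}
```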
Anchors: Position Matters
Sometimes you need to match at specific positions in the string.
| Anchor | Meaning |
|---|---|
| ^ | Start of string (or line in multiline mode) |
| $ | End of string (or line in multiline mode) |
| \b | Word boundary |
| \B | Not a word boundary |
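// ^ ensures the pattern is at the start.
// find() is used here because matches() already requires the whole string to match.
Pattern.compile("^hello").matcher("hello world").find() // true
Pattern.compile("^hello").matcher("say hello").find() // false - "hello" not at start
// $ ensures the pattern is at the end
Pattern.compile("world$").matcher("hello world").find() // true
Pattern.compile("world$").matcher("world peace").find() // false - "world" not at end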
// \b matches word boundaries
Pattern word = Pattern.compile("\\bcat\\b");
word.matcher("cat").find() // true - "cat" is a whole word
word.matcher("catch").find() // false - "cat" is part of "catch"
word.matcher("the cat sat").find() // true - "cat" is a word here
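The anchor table mentions multiline mode, which is switched on with the Pattern.MULTILINE flag. The sketch below assumes a made-up three-line log string and illustrative names (MultilineDemo and its helpers) to show how the same ^ anchor behaves with and without the flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// MultilineDemo is a hypothetical class; the log text in main is invented sample data.
class MultilineDemo {
    private static final Pattern AT_INPUT_START = Pattern.compile("^ERROR");
    private static final Pattern AT_LINE_START = Pattern.compile("^ERROR", Pattern.MULTILINE);

    private static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Without MULTILINE, ^ matches only at the very start of the input.
    static int countAtInputStart(String text) {
        return count(AT_INPUT_START, text);
    }

    // With MULTILINE, ^ also matches right after each line terminator.
    static int countAtLineStarts(String text) {
        return count(AT_LINE_START, text);
    }

    public static void main(String[] args) {
        String log = "INFO ok\nERROR disk\nERROR net";
        System.out.println(countAtInputStart(log));  // 0
        System.out.println(countAtLineStarts(log));  // 2
    }
}
```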
Groups: Capturing Parts of the Match
Parentheses create groups that allow you to:
- Extract specific parts of a match
- Apply quantifiers to multiple characters
- Use back-references
Basic Capturing Groups
Pattern pattern = Pattern.compile("(\\d{3})-(\\d{4})");
Matcher matcher = pattern.matcher("My number is 555-1234");
if (matcher.find()) {
    System.out.println(matcher.group(0)); // "555-1234" (entire match)
    System.out.println(matcher.group(1)); // "555" (first group)
    System.out.println(matcher.group(2)); // "1234" (second group)
}
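The other two uses of parentheses listed earlier, quantifying several characters at once and back-references, can be sketched as follows. GroupDemo and its method names are illustrative only. Inside a pattern, \1 (written "\\1" in Java source) matches the same text that group 1 captured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// GroupDemo is a hypothetical class showing group quantifiers and back-references.
class GroupDemo {
    // The + applies to the whole group "ab", not just to 'b'.
    static boolean isRepeatedAb(String s) {
        return s.matches("(ab)+");
    }

    // \1 must repeat exactly what group 1 captured; \b keeps the match on whole words.
    private static final Pattern DOUBLED_WORD = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first word that appears twice in a row, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED_WORD.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedAb("ababab"));                 // true
        System.out.println(isRepeatedAb("aba"));                    // false
        System.out.println(firstDoubledWord("This is is a test"));  // is
        System.out.println(firstDoubledWord("the cat sat"));        // null
    }
}
```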
Named Groups (Java 7+)
For better readability, you can name your groups:
Pattern pattern = Pattern.compile("(?<area>\\d{3})-(?<number>\\d{4})");
Matcher matcher = pattern.matcher("555-1234");
if (matcher.find()) {
    System.out.println(matcher.group("area")); // "555"
    System.out.println(matcher.group("number")); // "1234"
}
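Named groups pay off most when a pattern has several parts. As a sketch only, here is a parser for dates in an assumed yyyy-MM-dd layout; DateParts, parse, and the group names year, month, and day are all illustrative choices, and the code checks the shape of the text rather than whether the date really exists.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DateParts is a hypothetical helper; it validates format only, not calendar correctness.
class DateParts {
    private static final Pattern ISO_DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Returns {year, month, day}, or null when the text is not in yyyy-MM-dd form.
    static int[] parse(String text) {
        Matcher m = ISO_DATE.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group("year")),
            Integer.parseInt(m.group("month")),
            Integer.parseInt(m.group("day"))
        };
    }

    public static void main(String[] args) {
        int[] parts = parse("2024-03-15");
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]); // 2024 / 3 / 15
        System.out.println(parse("15/03/2024") == null);                    // true
    }
}
```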
Non-Capturing Groups
Sometimes you need grouping for structure but don't need to capture:
// (?:...) groups without capturing
Pattern pattern = Pattern.compile("(?:Mr|Mrs|Ms)\\.?\\s+(\\w+)");
Matcher matcher = pattern.matcher("Mr. Smith");
if (matcher.find()) {
    // Group 1 is the name, not the title
    System.out.println(matcher.group(1)); // "Smith"
}
Common Practical Patterns
Email Validation
String emailPattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$";
"user@example.com".matches(emailPattern) // true
"user.name@domain.co.uk".matches(emailPattern) // true
"invalid@".matches(emailPattern) // false
Breaking down the pattern:
- ^ - Start of string
- [a-zA-Z0-9._%+-]+ - One or more valid characters for the local part
- @ - Literal @ sign
- [a-zA-Z0-9.-]+ - One or more valid domain characters
- \\. - Literal dot (escaped)
- [a-zA-Z]{2,} - Two or more letters for TLD
- $ - End of string
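One way to package this pattern is a small validator that compiles it once. EmailValidator and isValid are hypothetical names, and the null check is an added assumption. Treat it as a pragmatic filter rather than full RFC 5322 validation: as the last line of main shows, it accepts some addresses that a stricter check would reject.

```java
import java.util.regex.Pattern;

// EmailValidator is a hypothetical wrapper around the email pattern shown above.
class EmailValidator {
    private static final Pattern EMAIL =
            Pattern.compile("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$");

    // Returns true when the whole input has the local@domain.tld shape.
    static boolean isValid(String candidate) {
        return candidate != null && EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("user@example.com"));   // true
        System.out.println(isValid("invalid@"));           // false
        System.out.println(isValid(null));                 // false
        System.out.println(isValid("a..b@example.com"));   // true - consecutive dots slip through
    }
}
```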
Password Strength
// At least 8 chars, one uppercase, one lowercase, one digit, one special
String strongPassword = "^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@$!%*?&]).{8,}$";
The (?=...) is a lookahead - it checks if a pattern exists without consuming characters. This pattern uses multiple lookaheads to ensure all requirements are met.
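To see each lookahead doing its job, the sketch below runs the same pattern against passwords that each miss exactly one requirement. PasswordCheck and isStrong are illustrative names, and the sample passwords are invented.

```java
import java.util.regex.Pattern;

// PasswordCheck is a hypothetical wrapper around the lookahead pattern shown above.
class PasswordCheck {
    private static final Pattern STRONG = Pattern.compile(
            "^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@$!%*?&]).{8,}$");

    // Every lookahead is tested from position 0; only .{8,} consumes characters.
    static boolean isStrong(String password) {
        return STRONG.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStrong("Passw0rd!"));   // true
        System.out.println(isStrong("password1!"));  // false - no uppercase letter
        System.out.println(isStrong("Password1"));   // false - no special character
        System.out.println(isStrong("Pw0!"));        // false - shorter than 8
    }
}
```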
Extracting Data
String text = "Contact: john@email.com or jane@company.org";
Pattern emailFinder = Pattern.compile("[\\w.+-]+@[\\w.-]+\\.[a-zA-Z]{2,}");
Matcher matcher = emailFinder.matcher(text);
while (matcher.find()) {
    System.out.println("Found: " + matcher.group());
}
// Output:
// Found: john@email.com
// Found: jane@company.org
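In practice you usually want the matches in a collection rather than printed. A minimal sketch, assuming the same finder pattern and the illustrative names EmailExtractor and extract:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// EmailExtractor is a hypothetical helper that collects every match into a list.
class EmailExtractor {
    private static final Pattern EMAIL_FINDER =
            Pattern.compile("[\\w.+-]+@[\\w.-]+\\.[a-zA-Z]{2,}");

    // Returns the matches in the order they appear; an empty list if there are none.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL_FINDER.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Contact: john@email.com or jane@company.org"));
        // prints: [john@email.com, jane@company.org]
    }
}
```

On Java 9 and later, Matcher.results() offers a stream-based alternative to the explicit loop.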
String Methods Using Regex
Java's String class has several methods that use regex:
split() - Divide a String
// Split by comma
String csv = "apple,banana,cherry";
String[] fruits = csv.split(",");
// Result: ["apple", "banana", "cherry"]
// Split by any whitespace (one or more)
String text = "Hello   World \t Java";
String[] words = text.split("\\s+");
// Result: ["Hello", "World", "Java"]
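Because the argument to split() is a regex, two details tend to surprise people: metacharacters such as the dot must be escaped, and trailing empty strings are dropped unless you pass a negative limit. The sketch below uses illustrative names (SplitDemo, splitOnDot, splitKeepingEmpties) to show both.

```java
// SplitDemo is a hypothetical class illustrating two common split() surprises.
class SplitDemo {
    // The dot is escaped so it is treated as a literal '.', not "any character".
    static String[] splitOnDot(String s) {
        return s.split("\\.");
    }

    // A negative limit keeps trailing empty strings in the result.
    static String[] splitKeepingEmpties(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println("a.b.c".split(".").length);             // 0 - every character was a delimiter
        System.out.println(splitOnDot("a.b.c").length);            // 3
        System.out.println("a,b,,".split(",").length);             // 2 - trailing empties dropped
        System.out.println(splitKeepingEmpties("a,b,,").length);   // 4
    }
}
```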
replaceAll() - Replace Patterns
// Mask all digits
String phone = "Call: 123-456-7890";
String masked = phone.replaceAll("\\d", "*");
// Result: "Call: ***-***-****"
// Normalize whitespace
String messy = "Hello World\n\tJava";
String clean = messy.replaceAll("\\s+", " ");
// Result: "Hello World Java"
// Reorder with back-references
String name = "Smith, John";
String reordered = name.replaceAll("(\\w+), (\\w+)", "$2 $1");
// Result: "John Smith"
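Since $ introduces a group reference in the replacement string, a replacement that should contain a literal dollar sign needs quoting. This sketch assumes a made-up "{price}" placeholder and illustrative names (ReplaceDemo, setPrice, unquotedFails); Matcher.quoteReplacement is the standard helper for this.

```java
import java.util.regex.Matcher;

// ReplaceDemo is a hypothetical class; the {price} placeholder is an invented convention.
class ReplaceDemo {
    // quoteReplacement makes '$' and '\' in the replacement literal.
    static String setPrice(String template, String price) {
        return template.replaceAll("\\{price\\}", Matcher.quoteReplacement(price));
    }

    // Without quoting, "$5" is read as "group 5", which this pattern does not have.
    static boolean unquotedFails() {
        try {
            "Total: {price}".replaceAll("\\{price\\}", "$5");
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(setPrice("Total: {price}", "$5"));  // Total: $5
        System.out.println(unquotedFails());                   // true
    }
}
```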
find() vs matches()
Understanding the difference is crucial:
- matches() - The ENTIRE string must match the pattern
- find() - Searches for the pattern ANYWHERE in the string
Pattern pattern = Pattern.compile("\\d+");
// matches() - entire string must be digits
pattern.matcher("123").matches() // true
pattern.matcher("abc123").matches() // false - has letters
// find() - looks for digits anywhere
pattern.matcher("abc123def").find() // true - found "123"
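The distinction can be wrapped in intention-revealing helpers. The names below (FindVsMatches, isAllDigits, containsDigits, startsWithDigits) are illustrative. The third helper uses Matcher.lookingAt(), which is not covered in the text above: it anchors the match at the start of the input but does not require it to reach the end.

```java
import java.util.regex.Pattern;

// FindVsMatches is a hypothetical class contrasting three Matcher operations.
class FindVsMatches {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches(): the entire input must be digits.
    static boolean isAllDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    // find(): digits may appear anywhere in the input.
    static boolean containsDigits(String s) {
        return DIGITS.matcher(s).find();
    }

    // lookingAt(): the input must begin with digits, but may continue with anything.
    static boolean startsWithDigits(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("abc123"));       // false
        System.out.println(containsDigits("abc123"));    // true
        System.out.println(startsWithDigits("abc123"));  // false
        System.out.println(startsWithDigits("123abc"));  // true
    }
}
```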
Performance Tips
- Compile patterns once: If you use the same pattern multiple times, compile it once and reuse it.
// Good - compile once
private static final Pattern EMAIL = Pattern.compile("...");
// Bad - recompiles every call
public boolean isValid(String s) {
    return s.matches("..."); // Creates new Pattern each time
}
- Be specific: Greedy quantifiers (*, +) on broad sub-patterns such as .* can cause excessive backtracking. Prefer the narrowest character class that fits, such as [^,]* to read up to the next comma.
- Use non-capturing groups when you don't need the captured value: (?:...) instead of (...)
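The compile-once tip can be sketched as a scanner that reuses one Pattern, and one Matcher, across many inputs. Everything specific here is an assumption for illustration: the LogScanner name, the "E" plus four digits error-code format, and the sample lines. Matcher.reset(CharSequence) points an existing matcher at new input without recompiling anything.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// LogScanner is a hypothetical class; the error-code format E#### is an invented example.
class LogScanner {
    // Compiled once when the class loads, then shared by every call.
    private static final Pattern ERROR_CODE = Pattern.compile("\\bE\\d{4}\\b");

    // Counts the lines that contain at least one error code.
    static int countLinesWithErrorCode(List<String> lines) {
        Matcher matcher = ERROR_CODE.matcher("");  // created once per call, reused per line
        int count = 0;
        for (String line : lines) {
            if (matcher.reset(line).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ok", "failed E1234", "E12", "xE1234", "E4040 timeout");
        System.out.println(countLinesWithErrorCode(lines)); // 2
    }
}
```

A Pattern is immutable and safe to share between threads, while a Matcher is not, which is why the sketch keeps the Matcher local to the method.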
Regex lets you match character classes ([abc], \d, \w), specify repetition (*, +, ?), match positions (^, $, \b), and capture groups. Use Pattern.compile() to create reusable patterns. matcher.find() looks for the pattern anywhere in the input, while matcher.matches() checks whether the entire string matches. It takes practice, but once it clicks, you'll use it everywhere.
