In this lesson, you will:
  • See how code editors use KMP/Boyer-Moore for instant find and replace
  • Understand tries powering autocomplete in IDEs, search engines, and keyboards
  • Learn how plagiarism detection, DNA analysis, and spam filtering apply string algorithms
  • Recognize string algorithms as infrastructure behind everyday tools

Applications of String Algorithms

You've learned KMP, Rabin-Karp, tries, and hashing. Now let's see where they actually get used.

Code Editors: Find and Replace

Hit Ctrl+F in your editor, search for "variable", and it highlights every occurrence instantly. That's string matching. For a 10,000-line file, naive search would check thousands of positions. KMP or Boyer-Moore make it near-instant.

Find and replace is the same, but with modifications. Search for "oldName", replace with "newName" throughout the codebase. The algorithm:

def find_and_replace(text, pattern, replacement):
    positions = kmp_search(text, pattern)  # Find all occurrences
    
    # Replace from right to left to avoid index shifting
    for pos in reversed(positions):
        text = text[:pos] + replacement + text[pos + len(pattern):]
    
    return text

VS Code, IntelliJ, Sublime—they all use optimized string matching for this. The difference between instant and laggy search is the algorithm.

Autocomplete: Tries in Action

Type "alg" in your IDE and it suggests "algorithm", "algorithmically", "algebra". How?

Store all known words in a trie. When you type "alg":

  1. Navigate to the 'a' → 'l' → 'g' node
  2. Do a DFS from there to collect all complete words
  3. Return the top N by frequency or relevance

class Autocomplete:
    def __init__(self):
        self.trie = Trie()
        self.word_freq = {}  # Track word usage frequency
    
    def add_word(self, word):
        self.trie.insert(word)
        self.word_freq[word] = self.word_freq.get(word, 0) + 1
    
    def suggest(self, prefix):
        words = self.trie.find_words_with_prefix(prefix)
        # Sort by frequency, return top 5
        return sorted(words, key=lambda w: self.word_freq[w], reverse=True)[:5]

Google search, smartphone keyboards, IDEs—all use this pattern. The trie makes prefix search fast; frequency data ranks results.
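The Trie class used by Autocomplete is assumed above. A minimal dict-based sketch of the two operations it needs (insert and prefix collection) could look like this:

```python
class Trie:
    """Minimal trie sketch: insert words, collect all words under a prefix."""
    END = '$'  # marker key for end-of-word

    def __init__(self):
        self.root = {}  # each node is a dict mapping character -> child node

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie.END] = True

    def find_words_with_prefix(self, prefix):
        # Walk down to the prefix's node, then DFS to collect complete words
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        words, stack = [], [(node, prefix)]
        while stack:
            cur, path = stack.pop()
            for key, child in cur.items():
                if key == Trie.END:
                    words.append(path)
                else:
                    stack.append((child, path + key))
        return words
```

Each node is just a dictionary, so inserting a word costs one dict lookup per character, and prefix search is a walk plus a DFS over the matching subtree.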

Plagiarism Detection

Compare two documents to find copied sections. Naive approach: check every substring pair. For documents with n words each, that's O(n²) substring comparisons, each O(k) for k-word substrings.

Better: hash all k-length substrings of both documents, find matches.

def find_plagiarism(doc1, doc2, k=5):
    # Hash all k-word chunks from doc1
    hashes1 = set()
    words1 = doc1.split()
    for i in range(len(words1) - k + 1):
        chunk = ' '.join(words1[i:i+k])
        hashes1.add(hash_string(chunk))
    
    # Check doc2's k-word chunks
    matches = []
    words2 = doc2.split()
    for i in range(len(words2) - k + 1):
        chunk = ' '.join(words2[i:i+k])
        if hash_string(chunk) in hashes1:
            matches.append((i, chunk))
    
    return matches

Turnitin, Copyscape, and university plagiarism checkers use variants of this. They hash n-grams (sequences of n words) and look for matches across millions of documents.
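The hash_string helper used above isn't pinned down here; one common choice is a polynomial hash over the characters, the same family Rabin-Karp uses. A sketch:

```python
def hash_string(s, base=257, mod=(1 << 61) - 1):
    """Polynomial hash: treat s as digits in `base`, reduced mod a large prime."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h
```

Distinct strings can collide, so production checkers verify candidate matches by comparing the actual text before reporting plagiarism.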

DNA Sequence Analysis

DNA is a string over the alphabet {A, T, C, G}. Finding a gene in a genome is pattern matching. The human genome has 3 billion base pairs.

Find where gene "ATCGATCG" appears in a chromosome:

genome = load_chromosome()  # Billions of characters
gene = "ATCGATCG"

positions = kmp_search(genome, gene)  # O(n+m) guaranteed

KMP's O(n+m) guarantee matters here. With billions of characters, you can't afford O(nm) worst case.
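The kmp_search used in these snippets can be sketched with the standard prefix-function formulation, returning every match's start index:

```python
def kmp_search(text, pattern):
    """Return start indices of every occurrence of pattern in text, in O(n+m)."""
    if not pattern:
        return []
    # Failure table: length of the longest proper prefix that is also a suffix
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, reusing the table to skip redundant comparisons
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep scanning for overlapping matches
    return matches
```

Note that after a match it falls back via the failure table rather than restarting, which is how overlapping occurrences (common in DNA) are found without rescanning.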

Bioinformatics also uses suffix trees for finding repeated sequences, common substrings between genomes, and more complex queries.

Spam Filtering

Email spam filters check for keyword patterns. "Click here to win $1,000,000" triggers red flags.

But spammers adapt: "Cl1ck h3re", "C.l.i.c.k". Simple string matching fails.

Solution: fuzzy matching and pattern sets. Store spam patterns in a trie or use regular expressions (which compile to automata—related to tries):

class SpamFilter:
    def __init__(self):
        self.spam_patterns = Trie()
        # Add common spam phrases
        self.spam_patterns.insert("click here")
        self.spam_patterns.insert("you won")
        self.spam_patterns.insert("free money")
    
    def is_spam(self, email_text):
        # Normalize: lowercase, remove numbers/punctuation
        normalized = normalize(email_text)
        words = normalized.split()
        
        # Check every two-word window (the stored phrases are bigrams)
        for i in range(len(words) - 1):
            phrase = ' '.join(words[i:i+2])
            if self.spam_patterns.search(phrase):
                return True
        
        return False

Real spam filters use machine learning now, but string matching is still part of the pipeline.

IP Routing: Longest Prefix Match

Internet routers decide where to send packets based on IP addresses. A routing table maps IP prefixes to next hops:

192.168.1.0/24    → Router A
192.168.0.0/16    → Router B
192.0.0.0/8       → Router C

For IP 192.168.1.5, which route? All three match, but we want the longest (most specific) prefix: 192.168.1.0/24.

This is prefix matching in a trie. Build a binary trie where each bit of the IP address chooses left (0) or right (1):

        root
       /    \
      0      1
     /\     /\
    0  1   0  1
   ...

Traverse the trie following the IP's bits. The deepest node with routing info is the longest prefix match.

Routers handle millions of packets per second. Trie lookup is O(k) for k-bit addresses (32 for IPv4). A hash table can't do "longest prefix" in one lookup; it would need a probe for every candidate prefix length, while the trie finds the longest match in a single traversal.
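Assuming routes arrive as (prefix_address, prefix_length, next_hop) tuples, the bit-trie lookup can be sketched like this:

```python
def ip_bits(addr):
    """Convert a dotted-quad IPv4 address to a 32-character bit string."""
    return ''.join(format(int(octet), '08b') for octet in addr.split('.'))

def build_route_trie(routes):
    """routes: list of (prefix_address, prefix_length, next_hop) tuples."""
    root = {}
    for prefix, length, hop in routes:
        node = root
        for bit in ip_bits(prefix)[:length]:
            node = node.setdefault(bit, {})
        node['hop'] = hop  # routing info stored at this depth
    return root

def longest_prefix_match(trie, addr):
    node, best = trie, None
    for bit in ip_bits(addr):
        if 'hop' in node:
            best = node['hop']  # remember the deepest route seen so far
        if bit not in node:
            break
        node = node[bit]
    else:
        if 'hop' in node:
            best = node['hop']
    return best
```

Because the walk remembers the deepest node carrying routing info, the most specific matching prefix always wins, exactly the behavior the routing table above requires.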

Compression: Finding Repeated Patterns

LZ77 compression (used in gzip) finds repeated substrings and replaces them with references.

For "abcabcabc", encode as "abc" followed by "repeat previous 3 chars twice".

Finding these repeats efficiently uses string hashing or suffix arrays:

def find_repeats(text, min_length=3):
    # Illustration only: real compressors use rolling hashes or suffix
    # structures instead of re-hashing every substring from scratch
    repeats = {}
    n = len(text)
    
    # Hash all substrings of each length
    for length in range(min_length, n // 2 + 1):
        seen = {}
        for i in range(n - length + 1):
            substring = text[i:i+length]
            h = hash_string(substring)
            if h in seen:
                # Hash match; production code verifies the substrings are
                # actually equal, since distinct strings can collide
                repeats[substring] = (seen[h], i)  # First and second occurrence
            else:
                seen[h] = i
    
    return repeats

Compression algorithms use this to build dictionaries of repeated strings, achieving better compression ratios.

Web Scraping: Extracting Data

Scrape product prices from web pages. HTML is messy, but patterns emerge:

<span class="price">$19.99</span>
<div class="price-box">$29.99</div>

Use string matching to find price patterns:

import re

def extract_prices(html):
    # Pattern: dollar sign, digits, a dot, exactly two decimal digits
    pattern = r'\$\d+\.\d{2}'
    matches = re.findall(pattern, html)
    return [float(m[1:]) for m in matches]  # Strip the $, convert to float

Regular expressions (regex) compile to finite automata, which are close relatives of tries and of KMP's failure table: KMP effectively builds a pattern-specific automaton, and regex engines generalize that idea to whole pattern languages.

Version Control: Diff Algorithms

Git's diff command shows changes between file versions. It finds the longest common subsequence (LCS) of lines.

For files A and B, LCS tells you which lines are unchanged. Everything else is added or deleted.

File A: Line1, Line2, Line3, Line4
File B: Line1, Line3, Line5, Line4

LCS: Line1, Line3, Line4
Diff:
  Line1 (unchanged)
- Line2 (deleted)
  Line3 (unchanged)
+ Line5 (added)
  Line4 (unchanged)

LCS is computed with dynamic programming, and string hashing speeds it up: hash each line once, and the DP's many line-equality checks become cheap integer comparisons.
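A minimal LCS over lists of lines, using the standard DP table (the per-line hashing mentioned above is an optimization layered on top of this):

```python
def lcs_lines(a, b):
    """Longest common subsequence of two lists of lines, via dynamic programming."""
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the table to recover the common lines
    out, i, j = [], n, m
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]
```

Lines in the LCS are the "unchanged" lines of the diff; everything in A but not the LCS is a deletion, everything in B but not the LCS is an addition.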

Search Engines: Inverted Indexes

Google doesn't search the entire web when you query "string algorithms". It uses an inverted index:

"string"    → [page1, page5, page12, ...]
"algorithms"→ [page1, page3, page8, ...]

For query "string algorithms", find intersection of [pages with "string"] and [pages with "algorithms"].

Building the index requires tokenizing billions of web pages and storing word→pages mappings. Tries help with autocomplete on queries. String matching identifies keywords.
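A toy version of that index-and-intersect step, assuming the index maps each token to the set of pages containing it:

```python
from collections import defaultdict

def build_index(pages):
    """pages: dict of page_id -> text. Returns token -> set of page ids."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            index[token].add(page_id)
    return index

def search(index, query):
    # Intersect the posting sets for every query token
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = index[tokens[0]].copy()
    for token in tokens[1:]:
        result &= index[token]
    return result
```

Real engines store postings as sorted, compressed lists and intersect them with galloping merges, but the shape of the query is the same: look up each term, intersect the results.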

The ranking algorithm (PageRank, etc.) is more complex, but it all starts with efficient string processing.

The Common Thread

All these applications share a pattern:

  • They process large amounts of text
  • They need substring search, prefix matching, or pattern recognition
  • Naive algorithms are too slow
  • Tries, KMP, hashing, or suffix structures make them practical

String algorithms aren't just theoretical exercises. They're the infrastructure behind tools you use every day. Without them, your editor would lag, autocomplete wouldn't work, and spam would overwhelm your inbox.

Understanding the algorithms means understanding how the digital world processes text—and that's everywhere.

© 2026 forEach. All rights reserved.
