Regular Expressions for Beginners: A Practical Starter Guide

What regex is, and what it's good for

A regular expression, usually abbreviated regex, is a compact notation for describing patterns in text. The notation was introduced by the mathematician Stephen Cole Kleene in 1956 as part of his work on regular languages, the class of languages that finite automata can recognize. Ken Thompson, one of the creators of Unix, implemented the first practical regex engine in the QED editor in 1968, and regex has been a feature of programming tools ever since: grep, sed, awk, Perl, JavaScript, Python, Java, and nearly every modern text editor.

Regex is good for three things: searching for text that matches a pattern, validating that a string conforms to a format, and extracting parts of a string that match sub-patterns. Searching for all email addresses in a document, validating that a user-entered phone number has the right shape, extracting the domain from a URL — these are the bread and butter of regex. The notation is dense and looks cryptic to beginners, but each symbol has a specific meaning, and once you learn the vocabulary, regex becomes an indispensable tool.

Regex is bad for things that require context or nesting. Parsing HTML with regex is a famous anti-pattern, because HTML is a nested structure and regex is a flat pattern matcher. Parsing arithmetic expressions with regex is also bad, because parentheses can nest arbitrarily and regex cannot count. For these problems, use a proper parser — a recursive descent parser, a parser generator like ANTLR, or a library specific to the format. Regex is a tool for patterns, not for grammars.
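To make the "regex cannot count" point concrete, here is a minimal sketch of the kind of check that needs a counter rather than a pattern. The function name is_balanced is made up for illustration, and the sketch handles only round parentheses.

```python
def is_balanced(text: str) -> bool:
    """Return True if every ( in text has a matching ) in the right order."""
    depth = 0  # number of currently open parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ) appeared before its matching (
                return False
    return depth == 0  # anything still open means unbalanced


print(is_balanced("(a + (b * c))"))  # True
print(is_balanced("(a + (b * c)"))   # False
```

The depth variable can grow without limit, which is the piece of state a classical regular expression has no way to track.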

Literals, anchors, and the simplest patterns

The simplest regex is just a literal string. The pattern hello matches the string hello wherever it appears in the input. Most characters in regex are literals that match themselves: letters, digits, and most punctuation. The exceptions are the metacharacters, which have special meanings: the backslash, the caret, the dollar sign, the period, the pipe, the question mark, the asterisk, the plus, the parentheses, the square brackets, and the curly braces. To match a metacharacter literally, precede it with a backslash: to match a literal period, write \. in your regex.
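The difference between a literal and a metacharacter is easy to see in a few lines. This sketch uses Python's re module; the sample strings are invented for illustration.

```python
import re

# A literal pattern matches itself wherever it appears in the input.
found_hello = re.search(r"hello", "say hello there") is not None

# An unescaped dot is a metacharacter, so 3.14 also matches "3x14".
loose = re.search(r"3.14", "3x14") is not None
# A backslash makes the dot literal, so only a real period matches.
strict = re.search(r"3\.14", "3x14") is not None

# re.escape backslash-escapes the metacharacters in arbitrary text, which
# helps when a literal string (here a made-up label) goes into a pattern.
label = "$9.99 (sale)"
found_label = re.search(re.escape(label), "now $9.99 (sale) today") is not None

print(found_hello, loose, strict, found_label)  # True True False True
```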

Anchors match positions, not characters. The caret ^ matches the start of the string (or the start of a line in multiline mode). The dollar sign $ matches the end of the string (or the end of a line in multiline mode). The pattern ^hello matches hello only at the start of the string. The pattern world$ matches world only at the end. The pattern ^hello world$ matches the entire string hello world and nothing else. One caveat: in some flavors, including Perl and Python, $ also matches just before a newline at the very end of the string, so a trailing newline can slip through. Anchors are zero-width assertions: they consume no characters, they just check a position.
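A short sketch of the anchor behavior described above, again with Python's re module. The re.MULTILINE flag is Python's name for multiline mode; other languages spell it differently.

```python
import re

starts = re.search(r"^hello", "hello world") is not None          # True
not_at_start = re.search(r"^hello", "say hello") is not None      # False
ends = re.search(r"world$", "hello world") is not None            # True
exact = re.search(r"^hello world$", "hello world!") is not None   # False

# Without multiline mode, ^ matches only at the very start of the string.
log = "first line\nhello again\nlast line"
default_hits = re.findall(r"^hello", log)
# With re.MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall(r"^hello", log, flags=re.MULTILINE)

print(default_hits, multiline_hits)  # [] ['hello']
```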

The period, or dot, matches any single character except a newline (by default). The pattern h.llo matches hello, hallo, hxllo, h5llo, and any other five-character sequence that starts with h, ends with llo, and has exactly one character in between. The dot is the most commonly used metacharacter and the most commonly misused, because "any character" is rarely what you actually want. When you find yourself writing a dot, ask whether you mean a specific character class instead.
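The following sketch shows what the dot accepts and rejects. It uses re.fullmatch so that each candidate must match in its entirety; the candidate list is invented, and re.DOTALL is Python's flag for letting the dot match a newline.

```python
import re

candidates = ["hello", "hallo", "h5llo", "hllo", "heello"]
# Exactly one character must sit between h and llo.
dot_matches = [w for w in candidates if re.fullmatch(r"h.llo", w)]
print(dot_matches)  # ['hello', 'hallo', 'h5llo']

# By default the dot does not match a newline; re.DOTALL changes that.
newline_default = re.fullmatch(r"h.llo", "h\nllo") is not None
newline_dotall = re.fullmatch(r"h.llo", "h\nllo", flags=re.DOTALL) is not None
print(newline_default, newline_dotall)  # False True

# A character class is usually closer to the intent than a bare dot.
class_matches = [w for w in candidates if re.fullmatch(r"h[ae]llo", w)]
print(class_matches)  # ['hello', 'hallo']
```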

Character classes and shorthand

A character class matches any one of a set of characters, written inside square brackets. The pattern [aeiou] matches any vowel. The pattern [a-z] matches any lowercase letter. The pattern [a-zA-Z0-9] matches any alphanumeric character. Ranges in a character class are based on character codes, so [a-z] works as expected but [A-z] is wrong, because the range from Z (code 90) to a (code 97) includes six punctuation characters. Always specify the case ranges separately.
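The [A-z] pitfall is easiest to believe once you see it. This sketch feeds the six characters that sit between Z and a in ASCII to both versions of the class.

```python
import re

vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']

# The six ASCII characters with codes 91 to 96: [ \ ] ^ _ `
between = "[\\]^_`"

# [A-z] spans codes 65 to 122, so it accepts all six of them.
sloppy = re.findall(r"[A-z]", between)
# Listing the two case ranges separately accepts none of them.
careful = re.findall(r"[A-Za-z]", between)

print(len(sloppy), len(careful))  # 6 0
```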

A caret at the start of a character class negates it: [^aeiou] matches any character that is not a lowercase vowel, including consonants, uppercase letters (even A, E, I, O, and U), digits, and punctuation. The negation means "any character not in this set", which is different from "any consonant". A common bug is to forget that negated classes also match punctuation, whitespace, and newlines, leading to patterns that match more than intended.
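A quick sketch of how much a negated class lets through, compared with a class that spells out the lowercase consonants explicitly. The sample string is invented.

```python
import re

sample = "hi, you!\n"

# Everything that is not a lowercase vowel, including punctuation,
# the space, and the trailing newline.
not_vowels = re.findall(r"[^aeiou]", sample)
print(not_vowels)  # ['h', ',', ' ', 'y', '!', '\n']

# Listing the lowercase consonants explicitly keeps only letters.
consonants = re.findall(r"[b-df-hj-np-tv-z]", sample)
print(consonants)  # ['h', 'y']
```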

Several character classes have shorthand forms that are more concise and more readable than the bracket version. \d matches any digit (equivalent to [0-9] in ASCII mode; Unicode-aware engines such as Python 3 also accept digits from other scripts). \w matches any word character (equivalent to [a-zA-Z0-9_] in ASCII mode). \s matches any whitespace character (space, tab, newline, and a few others). The uppercase versions negate: \D matches any non-digit, \W matches any non-word character, \S matches any non-whitespace. These shorthands work in Perl-compatible flavors (Perl, PCRE, Python, Java, JavaScript, .NET), though traditional POSIX tools such as grep and sed may not support all of them. Where they are available, they are usually preferred over the bracket forms for readability.
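Here is a small sketch of the shorthand classes in Python. One detail worth knowing is specific to Python 3: on ordinary strings \d is Unicode-aware unless the re.ASCII flag is set. The sample strings are invented.

```python
import re

digits = re.findall(r"\d+", "Order 66 shipped 2024-01-15")
print(digits)  # ['66', '2024', '01', '15']

# \w includes the underscore but not the hyphen.
words = re.findall(r"\w+", "snake_case, kebab-case")
print(words)  # ['snake_case', 'kebab', 'case']

# \s+ treats any run of spaces, tabs, and newlines as one separator.
fields = re.split(r"\s+", "a  b\tc\nd")
print(fields)  # ['a', 'b', 'c', 'd']

# Two Arabic-Indic digits, written as escapes to keep the source ASCII.
arabic_indic = "\u0663\u0664"
unicode_digits = re.fullmatch(r"\d+", arabic_indic) is not None
ascii_digits = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII) is not None
print(unicode_digits, ascii_digits)  # True False
```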

Quantifiers, greediness, and laziness

Quantifiers specify how many times the preceding element may repeat. The asterisk * means zero or more. The plus + means one or more. The question mark ? means zero or one (optional). The curly brace notation {n}, {n,}, and {n,m} means exactly n, at least n, and between n and m (inclusive), respectively. The pattern \d{3} matches exactly three digits. The pattern \d{3,} matches three or more digits. The pattern \d{2,4} matches two, three, or four digits.
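The quantifiers above can be tried out directly. This is a sketch with invented inputs; note how \d{2,4} stops after four digits and leaves the fifth behind.

```python
import re

exactly_three = re.fullmatch(r"\d{3}", "123") is not None       # True
too_many = re.fullmatch(r"\d{3}", "1234") is not None           # False

three_or_more = re.findall(r"\d{3,}", "12 123 12345")
print(three_or_more)  # ['123', '12345']

# A lone digit is too short; a run of five is cut off after four.
two_to_four = re.findall(r"\d{2,4}", "1 12 12345")
print(two_to_four)  # ['12', '1234']

# ? makes the u optional; * allows zero b's; + requires at least one.
optional_u = [w for w in ("color", "colour") if re.fullmatch(r"colou?r", w)]
star_zero = re.fullmatch(r"ab*", "a") is not None               # True
plus_zero = re.fullmatch(r"ab+", "a") is not None               # False
print(optional_u, star_zero, plus_zero)  # ['color', 'colour'] True False
```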

By default, quantifiers are greedy: they match as much as possible while still allowing the overall pattern to succeed. The pattern <.*> applied to the string <a><b> matches the entire string <a><b>, not just <a>, because the greedy asterisk extends as far as it can. This is a common source of bugs in regex that match HTML or other delimited text. The fix is to make the quantifier lazy by appending a question mark: <.*?> matches <a> (the shortest possible match), and applying it repeatedly matches each tag separately.
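The greedy and lazy behavior described above, reproduced with the same input string:

```python
import re

text = "<a><b>"

# Greedy: .* runs to the last > it can reach, so there is one long match.
greedy = re.findall(r"<.*>", text)
print(greedy)  # ['<a><b>']

# Lazy: .*? stops at the first > that lets the pattern succeed.
lazy = re.findall(r"<.*?>", text)
print(lazy)  # ['<a>', '<b>']
```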

Lazy quantifiers (*?, +?, ??, {n,m}?) match as little as possible while still allowing the overall pattern to succeed. They are the right choice when you want the shortest match, as in extracting individual HTML tags. They are not free: lazy quantifiers can be slower than greedy ones, because the engine tries the shortest match first and extends it one step at a time whenever the rest of the pattern fails. For high-performance regex, a possessive quantifier (*+, ++, ?+, {n,m}+) or an atomic group (?>...) prevents the engine from backtracking into the quantified part, at the cost of giving up the ability to retry. Support varies by engine: Java and PCRE have both, Python added them in version 3.11, and JavaScript has neither.

Groups, alternation, and capture

Parentheses create groups. A group serves two purposes: it applies a quantifier to a sub-pattern, and it captures the matched text for later use. The pattern (ab)+ matches ab, abab, ababab, and so on, because the plus quantifier applies to the group ab as a whole. The matched text inside the group is captured, and in most regex flavors you can refer to it later in the same pattern with a backreference (\1 for the first group, \2 for the second, and so on) or extract it from the match result in your code.
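A sketch of the two roles of a group, plus a backreference, using Python's re module. The input strings are invented for illustration.

```python
import re

# The + applies to the whole group, so the pattern matches repeated "ab".
repeated = re.fullmatch(r"(ab)+", "ababab")
print(repeated.group(0))  # ababab
# A repeated group keeps only the text from its last iteration.
print(repeated.group(1))  # ab

# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"(\w+) \1", "it is is fine")
print(doubled.group(1))  # is

# Captured groups can also be pulled out of the match object in code.
version = re.search(r"(\d+)\.(\d+)", "release 3.11 is out")
major, minor = version.groups()
print(major, minor)  # 3 11
```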

Alternation, written with the pipe |, matches any of several alternatives. The pattern cat|dog|bird matches cat or dog or bird. Alternation has the lowest precedence of any regex operator, so cat|dogs means cat or dogs, not cat or dog followed by s. To match cats or dogs, group the alternatives: (cat|dog)s. In backtracking engines the alternatives are tried left to right, and the first one that allows the overall pattern to succeed wins, which can produce surprising results if a shorter alternative appears before a longer one that you intended. Order alternatives from most specific to least specific, and put a longer alternative before any shorter one that is a prefix of it.
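Both alternation pitfalls, precedence and ordering, show up in a few lines. This sketch uses re.finditer and group(0) so the full matched text is visible even when the pattern contains a capturing group.

```python
import re

text = "cats and dogs"

# Without a group, the s belongs only to the second alternative.
ungrouped = [m.group(0) for m in re.finditer(r"cat|dogs", text)]
print(ungrouped)  # ['cat', 'dogs']

# With a group, the s follows whichever alternative matched.
grouped = [m.group(0) for m in re.finditer(r"(cat|dog)s", text)]
print(grouped)  # ['cats', 'dogs']

# Alternatives are tried left to right, so a short prefix listed first wins.
short_first = re.search(r"Jan|January", "January").group(0)
long_first = re.search(r"January|Jan", "January").group(0)
print(short_first, long_first)  # Jan January
```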

Non-capturing groups, written (?:...), group without capturing. They spare the engine the work of storing a capture and make it clear to the reader that the matched text is not used later. Named groups, written (?<name>...) in most flavors and (?P<name>...) in Python, capture the text under a name you choose, which makes the regex self-documenting and the code that extracts the matches more readable. Use named groups whenever the regex has more than one or two capturing groups; the readability gain is substantial. Lookahead and lookbehind assertions, written (?=...), (?!...), (?<=...), (?<!...), check for a pattern without consuming characters, which is useful for context-sensitive matching.
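A sketch of these group variants in Python, which spells named groups as (?P<name>...). The group names and sample strings are invented for illustration.

```python
import re

# Named groups: the fields are read back by name instead of by position.
date_match = re.search(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "due 2024-03-15"
)
print(date_match.group("year"))  # 2024
print(date_match.groupdict())    # {'year': '2024', 'month': '03', 'day': '15'}

# Non-capturing group: findall returns whole matches, not group contents.
runs = re.findall(r"(?:ab)+", "ababab ab")
print(runs)  # ['ababab', 'ab']

# Lookahead: digits only when followed by px; the px is not consumed.
pixels = re.findall(r"\d+(?=px)", "10px 20em 30px")
print(pixels)  # ['10', '30']

# Lookbehind: digits only when preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $40, qty 3")
print(prices)  # ['40']
```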

Common patterns and the limits of regex

A practical email validation pattern is something like ^[\w.+-]+@([\w-]+\.)+[a-zA-Z]{2,}$. This matches most common email addresses, including those with subdomains, but it is not strictly RFC 5322 compliant, because the full email specification allows quoted local parts, comments, and other constructs that a simple regex cannot handle. The right approach for email validation is to check that the string has the basic shape of an email (something@something.something) and then send a confirmation message to that address, which is the only way to know the address actually works and belongs to the user.
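The "basic shape" check can be even looser than the pattern above. The sketch below is one possible version, not a standard: the name looks_like_email and the pattern itself are assumptions made for illustration, and passing the check says nothing about whether the address exists.

```python
import re

# Hypothetical shape check: something@something.something, with no
# whitespace and exactly one @ sign. Deliberately permissive.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(candidate: str) -> bool:
    """Return True if candidate has the rough shape of an email address."""
    # fullmatch requires the whole string to fit, so no anchors are needed.
    return EMAIL_SHAPE.fullmatch(candidate) is not None


print(looks_like_email("user@example.com"))         # True
print(looks_like_email("user@mail.example.co.uk"))  # True
print(looks_like_email("user@localhost"))           # False
print(looks_like_email("two@@example.com"))         # False
```

After a string passes this check, the confirmation message is what actually validates the address.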

A URL pattern is ^(https?:\/\/)?([\w-]+\.)+[a-zA-Z]{2,}(:\d+)?(\/[^\s]*)?$. This matches http and https URLs, as well as bare hostnames with no scheme, with an optional port and path. The \/ escapes are needed only where a slash delimits the pattern, as in JavaScript regex literals; elsewhere a plain / works. Like the email pattern, it is a pragmatic approximation, not a full implementation of the URL specification (RFC 3986). For real URL parsing, use the URL class in JavaScript or the urllib.parse module in Python, which handle edge cases that a regex will miss.
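For comparison, this is roughly what the library route looks like in Python. urlsplit is part of the standard urllib.parse module; the URL itself is invented for illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://example.com:8443/docs/page?q=regex#top")

print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8443
print(parts.path)      # /docs/page
print(parts.query)     # q=regex
print(parts.fragment)  # top
```

Note that urlsplit separates a URL into its components but does not check that the hostname is valid or reachable, so it complements a shape check rather than replacing every form of validation.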

A US phone number pattern is ^(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$, which matches 555-123-4567, (555) 123-4567, +1 555 123 4567, and several other common formats. For international numbers, the E.164 standard format (+CC followed by the national number, no spaces) is simpler to validate: ^\+\d{6,15}$.
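The two phone patterns can be exercised as follows. This sketch assumes the optional parentheses around the area code are written as \(? and \)?, which is what the listed example formats require; the constant names are made up for illustration.

```python
import re

# Assumed form of the US pattern, with escaped optional parentheses.
US_PHONE = re.compile(r"^(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$")
E164 = re.compile(r"^\+\d{6,15}$")

samples = ["555-123-4567", "(555) 123-4567", "+1 555 123 4567", "555-1234"]
us_results = [US_PHONE.search(s) is not None for s in samples]
print(us_results)  # [True, True, True, False]

# E.164 allows no separators, so the spaced form is rejected.
e164_compact = E164.search("+15551234567") is not None
e164_spaced = E164.search("+1 555 123 4567") is not None
print(e164_compact, e164_spaced)  # True False

# The two parentheses are independently optional, so an unbalanced
# pair still passes.
unbalanced = US_PHONE.search("(555 123-4567") is not None
print(unbalanced)  # True
```

The last case is a reminder that the pattern checks shape only; stricter handling of the parentheses needs either two alternatives or a follow-up check in code.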

The limits of regex appear when the pattern you want to match depends on context that regex cannot express, or can express only at great cost. You cannot write a classical regex that matches only balanced parentheses, because a finite automaton cannot count to an arbitrary depth. Matching a date only if the day is valid for the month is possible in principle, but once month lengths and leap years are spelled out the pattern becomes unreadable; it is far better to capture the fields with a simple pattern and validate them in code. You cannot reliably match an HTML element with its correct closing tag, because HTML is a nested grammar. These problems require parsers, not patterns. The boundary is not always clear, but the rule of thumb is: if the thing you are matching can nest, you need a parser. If it cannot nest, regex is usually the right tool.
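For the date case, a common division of labor is to let the regex check the shape and let ordinary code check the calendar. The sketch below assumes ISO-style YYYY-MM-DD input; the function name parse_iso_date is made up for illustration.

```python
import re
from datetime import date
from typing import Optional

# The regex checks only the shape: four digits, two digits, two digits.
DATE_SHAPE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_iso_date(text: str) -> Optional[date]:
    """Return a date if text is a real calendar date, otherwise None."""
    match = DATE_SHAPE.fullmatch(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # datetime.date knows month lengths and leap years.
        return date(year, month, day)
    except ValueError:
        return None


print(parse_iso_date("2024-02-29"))  # 2024-02-29
print(parse_iso_date("2023-02-29"))  # None
print(parse_iso_date("2024-04-31"))  # None
```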