Regex Code Snippets for Common Validation and Parsing

Regular expressions are one of those tools where the same handful of patterns get reached for in almost every project: validating emails, parsing URLs, extracting numbers from text, normalizing whitespace, splitting CSV-like data. The patterns are familiar, but flawed versions get copied around the web constantly, and a regex that worked fine on test data starts failing the moment real input arrives. This is a working collection of regex snippets that handle the common cases correctly, along with explanations of why each pattern works and where it tends to break.

The snippets target ECMAScript-style regex (JavaScript, TypeScript) and Perl-style regex (PCRE in PHP, Python's re module, and most other modern languages), which cover the vast majority of practical use cases. Differences between dialects are called out where they matter.

1. Email Validation (Practical Rather Than Perfect)

The single most over-engineered regex on the internet is the "complete RFC 5322 email validator." Real email addresses can contain quoted local parts, IP-literal domains, and other rarities that 99.9 percent of applications do not need to support. A practical email validator should reject obviously broken inputs while accepting everything users actually type.

^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$

What this pattern accepts:

Letters, digits, dots, underscores, percent signs, plus signs, and hyphens in the local part
Letters, digits, dots, and hyphens in the domain
A top-level domain of at least two letters

What it rejects:

Spaces, quotes, and most special characters
Missing @ or missing TLD
Single-character TLDs

What it deliberately does not try to enforce:

RFC 5322 quoting rules (rarely needed)
Internationalized email addresses (IDN domains need separate handling)
Distinguishing valid TLDs from invalid ones (use a TLD list if needed)
Dot placement (leading, trailing, or consecutive dots, as in a..b@example..com, all pass)

The right approach for production: validate format with regex, then verify reachability with an email confirmation flow. No regex by itself can confirm that an address actually receives mail. The HTML specification takes a similarly pragmatic approach for input type="email": it defines a deliberately simplified pattern rather than the full RFC 5322 grammar.
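
As a minimal sketch of how the pattern above might be wrapped in code: the helper name isPlausibleEmail and the 254-character cap are assumptions made for illustration, not part of the pattern itself.

```javascript
// Pragmatic format check using the pattern from this section.
const EMAIL_PATTERN = /^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/;

// Hypothetical helper name; the 254-character cap is an assumed,
// commonly cited upper bound, not something the regex enforces.
function isPlausibleEmail(input) {
  if (typeof input !== "string" || input.length > 254) return false;
  return EMAIL_PATTERN.test(input);
}

console.log(isPlausibleEmail("ada.lovelace+news@example.co.uk")); // true
console.log(isPlausibleEmail("no-at-sign.example.com"));          // false (missing @)
console.log(isPlausibleEmail("user@example.c"));                  // false (one-letter TLD)
console.log(isPlausibleEmail("user name@example.com"));           // false (space)
console.log(isPlausibleEmail("a..b@example..com"));               // true (dot placement is not checked)
```

A passing result only means the string looks like an address; the confirmation email is still what proves it works.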

2. URL Validation

URL validation regex is harder than it looks because real URLs have query strings, fragments, ports, paths with encoded characters, and protocols beyond http and https. The following pattern accepts the practical majority of HTTP and HTTPS URLs:

^https?:\/\/([\w-]+\.)+[\w-]+(:\d+)?(\/[\w\-._~:\/?#[\]@!$&'()*+,;=%]*)?$

What this matches:

http:// or https:// (case-sensitive; add the i flag for case-insensitive)
A hostname with at least one dot (rejects bare hostnames like localhost)
An optional port (:8080)
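An optional path starting with /, which may carry a query string or fragment made of URL-safe characters (a query attached directly to the host, as in https://example.com?x=1, is rejected)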

For production parsing rather than just validation, language-native URL parsers are more reliable than regex: Python's urllib.parse, or JavaScript's URL constructor (the MDN documentation is the canonical reference). The regex is appropriate for format checks and pattern matching, not for canonicalizing or normalizing URLs.
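
One way to combine the two, sketched under the assumption that the regex acts as a cheap format gate and the built-in URL constructor does the actual parsing. The function name parseHttpUrl and the shape of the returned object are illustrative choices.

```javascript
// Format gate: the HTTP/HTTPS pattern from this section.
const HTTP_URL_PATTERN =
  /^https?:\/\/([\w-]+\.)+[\w-]+(:\d+)?(\/[\w\-._~:\/?#[\]@!$&'()*+,;=%]*)?$/;

// Hypothetical helper: returns the parsed pieces, or null if the input
// fails the format check or the URL constructor rejects it.
function parseHttpUrl(input) {
  if (typeof input !== "string" || !HTTP_URL_PATTERN.test(input)) return null;
  try {
    const url = new URL(input);
    return {
      host: url.hostname,
      port: url.port, // empty string when no explicit port is given
      path: url.pathname,
      query: url.search,
      fragment: url.hash,
    };
  } catch {
    return null;
  }
}

console.log(parseHttpUrl("https://example.com:8080/docs/page?id=42#top"));
// { host: 'example.com', port: '8080', path: '/docs/page', query: '?id=42', fragment: '#top' }
console.log(parseHttpUrl("http://localhost:3000/"));  // null (hostname has no dot)
console.log(parseHttpUrl("https://example.com?x=1")); // null (query without a leading /)
console.log(parseHttpUrl("ftp://example.com/file"));  // null (not http or https)
```

The null results for localhost and for the slash-less query are properties of the regex, not of the URL constructor, which accepts both.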

3. Extracting Numbers From Text

A surprisingly common task: pull all numbers out of a string. Common variations include positive integers, decimals, signed numbers, and numbers with thousand separators.

Plain positive integers:

\d+

Decimals (with or without leading zero):

\d+\.\d+|\.\d+

Signed integers and decimals:

-?\d+(\.\d+)?

Numbers with thousand separators (US format: comma as separator, period as decimal):

-?\d{1,3}(,\d{3})*(\.\d+)?

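The pattern with thousand separators is the one that gets misused most often. Note that it requires exactly three digits in each group after a comma, so 1,2,3 is never matched as a single number; used unanchored, it finds 1, 2, and 3 as three separate matches. For the same reason, an unanchored search splits malformed input such as 1234,567 into 123 and 4,567, so wrap the pattern in ^ and $ when validating a whole field. For European number formats (period as separator, comma as decimal), swap the comma and period and watch for ambiguity with date strings.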
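
A short sketch of extraction with the thousand-separator pattern, assuming US-formatted input. The helper name extractNumbers is illustrative; it strips the commas before converting each match to a number.

```javascript
// US-format numbers: comma as thousands separator, period as decimal point.
const US_NUMBER = /-?\d{1,3}(,\d{3})*(\.\d+)?/g;

// Hypothetical helper: find every match, drop the commas, convert to Number.
function extractNumbers(text) {
  return Array.from(text.matchAll(US_NUMBER), (match) =>
    Number(match[0].replace(/,/g, ""))
  );
}

console.log(extractNumbers("Revenue rose from 1,250,000.50 to 2,000,000, a change of -3.5 points."));
// [ 1250000.5, 2000000, -3.5 ]
console.log(extractNumbers("1,2,3"));    // [ 1, 2, 3 ] (three separate matches)
console.log(extractNumbers("1234,567")); // [ 123, 4567 ] (malformed input is split, not rejected)
```

The last call shows why an unanchored search is an extraction tool rather than a validator.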

4. Whitespace Normalization

A practical text-processing pattern: collapse runs of whitespace (including tabs and newlines) to single spaces, and trim leading and trailing whitespace.

Collapse whitespace runs:

\s+

Replace with a single space, then trim:

text.replace(/\s+/g, ' ').trim()

The \s character class includes spaces, tabs, newlines, and carriage returns in all major regex dialects. The g flag is essential; without it, only the first run is collapsed.

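A common gotcha: some Unicode whitespace characters (non-breaking spaces, em spaces, ideographic spaces) are not matched by \s in every regex engine or mode. If your input may contain pasted text from word processors or PDF extraction, consider:

[\s\u00A0\u2000-\u200A\u3000]+

This expanded character class names the common Unicode space characters explicitly. JavaScript's \s and Python 3's \s on str patterns already match them, so the extra entries are harmless there; they matter in engines or modes where \s is ASCII-only, such as Python with the re.ASCII flag or PCRE without its Unicode property option. PCRE writes the escapes as \x{00A0} rather than \u00A0.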
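
Put together as a function, the normalization might look like the sketch below. The name normalizeWhitespace is illustrative, and the character class is written with \u escapes, which is an assumption about which space characters the expanded class is meant to list.

```javascript
// Collapse any run of whitespace, including common Unicode spaces,
// to a single ASCII space, then trim both ends.
const WHITESPACE_RUN = /[\s\u00A0\u2000-\u200A\u3000]+/g;

// Hypothetical helper name.
function normalizeWhitespace(text) {
  return text.replace(WHITESPACE_RUN, " ").trim();
}

// Input mixes spaces, tabs, a non-breaking space, an ideographic space,
// and a trailing newline.
console.log(JSON.stringify(normalizeWhitespace("  Hello,\t\tworld\u00A0\u3000again \n")));
// "Hello, world again"
```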

5. Splitting CSV Lines (Without a Library)

Full CSV parsing requires handling quoted fields, escaped quotes, and embedded newlines, all of which exceed what regex alone can do well. But splitting a simple, well-behaved CSV line (no embedded commas, no quotes) is a one-liner: split the line on the comma character.

For a line that may have quoted fields containing commas but no escaped quotes:

(?:"([^"]*)"|([^,]+))(?:,|$)

This pattern captures either a double-quoted field (group 1, without the surrounding quotes) or an unquoted field (group 2), followed by a comma or the end of the line. It does not handle escaped quotes inside quoted fields, and because the unquoted branch requires at least one character, empty fields (as in a,,b) are silently skipped. For production CSV parsing, a real CSV library is the right answer. Regex is appropriate for known-format CSV-like data where the trade-offs are understood.
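
Both cases can be sketched in a few lines. The helper name splitCsvLine is illustrative, and the last call shows a limitation of this pattern rather than desired behavior.

```javascript
// Simple case: no quotes, no embedded commas.
console.log("id,name,team".split(",")); // [ 'id', 'name', 'team' ]

// Quoted fields may contain commas; escaped quotes are NOT supported.
const CSV_FIELD = /(?:"([^"]*)"|([^,]+))(?:,|$)/g;

// Hypothetical helper: group 1 holds a quoted field's content,
// group 2 holds an unquoted field.
function splitCsvLine(line) {
  const fields = [];
  for (const match of line.matchAll(CSV_FIELD)) {
    fields.push(match[1] !== undefined ? match[1] : match[2]);
  }
  return fields;
}

console.log(splitCsvLine('42,"Smith, Jane",Engineering'));
// [ '42', 'Smith, Jane', 'Engineering' ]
console.log(splitCsvLine("a,,b"));
// [ 'a', 'b' ] (the empty middle field is dropped)
```

If column positions matter, the dropped empty field is reason enough to switch to a CSV library.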

"Most regex bugs in production code are not in the regex itself. They are in someone assuming that input always matches the format they expected. Validate first, parse second, and never trust that the data is clean." - Dennis Traina, founder of 137Foundry

6. Extracting Tagged Data (HTML, Markdown, Custom Tags)

Pulling structured data out of tagged text is one of the most common regex use cases. The classic warning is that regex cannot parse arbitrary HTML, and the warning is correct. But for known-structure tagged content, it works well.

Extract content from a specific HTML tag (single line, no nested tags):

<h1>(.*?)<\/h1>

The ? after * makes the match non-greedy, so it stops at the first closing tag rather than consuming everything to the last one.

Extract URL and link text from a markdown link:

\[([^\]]+)\]\(([^)]+)\)

Group 1 captures the link text, group 2 captures the URL. This handles standard markdown links but breaks on links with literal parentheses in the URL or brackets in the text.

For HTML extraction beyond toy cases, a real HTML parser (DOMParser in browsers, BeautifulSoup in Python) is more reliable than regex, and OWASP recommends parser-based approaches for any security-sensitive extraction. The regex approach is fine for known-structure data; it fails badly on real-world HTML.
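
A brief sketch of both extraction patterns from this section in use. The helper names extractH1 and extractLinks are illustrative, and the example URLs are made up.

```javascript
const H1_TAG = /<h1>(.*?)<\/h1>/;
const MARKDOWN_LINK = /\[([^\]]+)\]\(([^)]+)\)/g;

// Hypothetical helper: content of the first <h1> on a single line, or null.
function extractH1(html) {
  const match = H1_TAG.exec(html);
  return match ? match[1] : null;
}

// Hypothetical helper: every markdown link as { text, url }.
function extractLinks(markdown) {
  return Array.from(markdown.matchAll(MARKDOWN_LINK), (m) => ({
    text: m[1],
    url: m[2],
  }));
}

console.log(extractH1("<h1>First</h1><h1>Second</h1>")); // First (non-greedy stops early)

console.log(extractLinks("See [the docs](https://example.com/docs) and [the FAQ](https://example.com/faq)."));
// [ { text: 'the docs', url: 'https://example.com/docs' },
//   { text: 'the FAQ', url: 'https://example.com/faq' } ]

// Known failure mode: a literal ")" inside the URL truncates the capture.
console.log(extractLinks("[draft](https://example.com/page_(draft))")[0].url);
// https://example.com/page_(draft
```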

7. Validating Common ID Formats

Many applications need to validate IDs in specific formats: UUIDs, ZIP codes, phone numbers, ISBNs. A few common ones:

UUID v4:

^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$

US ZIP code (5-digit or ZIP+4):

^\d{5}(-\d{4})?$

US phone number (flexible format):

^(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$

ISBN-13:

^(?:97[89])\d{10}$

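Each of these patterns strikes a practical balance between strict validation and accepting real-world variations. A few limits are worth knowing: the UUID pattern accepts lowercase hex only (add the i flag to allow uppercase), the phone pattern does not check that parentheses are balanced, and the ISBN-13 pattern expects hyphens and spaces to be stripped first and does not verify the check digit. None of them verifies that the ID actually exists in its real-world system; they only verify that the format is plausible.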
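
The four patterns could be collected behind a single lookup, as in this sketch. The Map, its keys, and the function name matchesFormat are illustrative assumptions; the patterns are copied unchanged from this section, and the sample IDs are made up.

```javascript
// Patterns copied as-is from this section (note: the UUID one is case-sensitive).
const ID_PATTERNS = new Map([
  ["uuidV4", /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/],
  ["usZip", /^\d{5}(-\d{4})?$/],
  ["usPhone", /^(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$/],
  ["isbn13", /^(?:97[89])\d{10}$/],
]);

// Hypothetical helper: format check only, no real-world verification.
function matchesFormat(kind, value) {
  const pattern = ID_PATTERNS.get(kind);
  if (!pattern) throw new Error(`Unknown ID format: ${kind}`);
  return typeof value === "string" && pattern.test(value);
}

console.log(matchesFormat("uuidV4", "123e4567-e89b-42d3-a456-426614174000")); // true
console.log(matchesFormat("uuidV4", "123E4567-E89B-42D3-A456-426614174000")); // false (uppercase)
console.log(matchesFormat("usZip", "90210-1234"));                            // true
console.log(matchesFormat("usZip", "9021"));                                  // false
console.log(matchesFormat("usPhone", "+1 (555) 123-4567"));                   // true
console.log(matchesFormat("usPhone", "(555 123-4567"));                       // true (unbalanced parenthesis passes)
console.log(matchesFormat("isbn13", "9780306406157"));                        // true
console.log(matchesFormat("isbn13", "978-0-306-40615-7"));                    // false (hyphens not stripped)
```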

8. Performance Considerations

Some regex patterns become catastrophically slow on certain inputs, a problem known as catastrophic backtracking. The classic cases involve nested quantifiers like (a+)+ against a long string of a characters followed by a non-matching character. The regex engine tries every way of splitting the run of a characters between the inner and outer quantifier, and the runtime grows exponentially with the input length.

Practical defenses:

Avoid nested quantifiers when possible. Replace (a+)+ with a+ whenever the outer quantifier is redundant.
Use possessive quantifiers (*+, ++) or atomic groups in dialects that support them. JavaScript does not; PCRE does, and Python's re module does from version 3.11 onward.
Cap input length before applying regex to user-controlled input.
Test patterns against pathological inputs. regex101.com shows step counts and can reveal backtracking before it hits production.

In security-sensitive contexts, a slow regex applied to attacker-controlled input is a denial-of-service vector (often called ReDoS). The OWASP guidance on this is straightforward: assume any regex on untrusted input can be made slow if the pattern allows backtracking, and benchmark accordingly.
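
Two of the defenses above, sketched together: a length cap applied before the regex runs, and a nested quantifier replaced by its flat equivalent. The helper name safeTest and the 1000-character default are assumptions to tune per application, not recommended values.

```javascript
// Hypothetical cap; choose a limit that fits the field being validated.
const MAX_INPUT_LENGTH = 1000;

// Hypothetical helper: refuse oversized input before the regex ever runs.
function safeTest(pattern, input, maxLength = MAX_INPUT_LENGTH) {
  if (typeof input !== "string" || input.length > maxLength) return false;
  return pattern.test(input);
}

// The nested form backtracks exponentially on "aaaa...!" in a backtracking
// engine; the flat form accepts exactly the same strings without the blowup.
const NESTED = /^(a+)+$/;
const FLAT = /^a+$/;

console.log(safeTest(NESTED, "aaaa"), safeTest(FLAT, "aaaa")); // true true
console.log(safeTest(NESTED, "aaa!"), safeTest(FLAT, "aaa!")); // false false
console.log(safeTest(FLAT, "a".repeat(5000)));                 // false (over the cap)

// The flat pattern stays fast even on a long non-matching input.
console.log(FLAT.test("a".repeat(50000) + "!")); // false
```

The cap does not make a bad pattern safe; it only bounds how bad the worst case can get.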

9. Common Mistakes to Avoid

A handful of recurring regex mistakes show up in code reviews often enough to be worth flagging:

Forgetting to escape dots in domains. example.com in a regex matches exampleXcom because . is the any-character wildcard. Use \. for literal dots.
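Anchoring inconsistently. A regex without ^ and $ anchors matches substrings, which is rarely what validators want. ^[A-Z]+$ validates that the entire string consists of uppercase ASCII letters; [A-Z]+ only checks that at least one uppercase letter appears somewhere. In Python, $ also matches just before a trailing newline, so prefer re.fullmatch() or \Z for strict validation.
Using greedy matching when non-greedy is needed. <.*> matches everything from the first < to the last > on the line. <.*?> matches each tag individually.
Putting the regex in a hot loop without compiling it. In Python, call re.compile() once and reuse the compiled object. In JavaScript, build the regex once outside the loop, especially when it comes from new RegExp(...) with a dynamic string, rather than constructing it on every iteration. If the reused regex has the g or y flag, remember that test() and exec() advance lastIndex between calls.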
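
Each of these mistakes is easy to reproduce in a few lines. The sketch below uses made-up sample strings and variable names.

```javascript
// 1. Unescaped dot: "." matches any character.
const looseDot = /example.com/.test("exampleXcom");   // true
const strictDot = /example\.com/.test("exampleXcom"); // false

// 2. Missing anchors: a substring match passes the "validator".
const unanchored = /[A-Z]+/.test("abcDEF");  // true
const anchored = /^[A-Z]+$/.test("abcDEF");  // false

// 3. Greedy vs non-greedy.
const greedy = "<b>bold</b>".match(/<.*>/)[0]; // "<b>bold</b>"
const lazy = "<b>bold</b>".match(/<.*?>/g);    // ["<b>", "</b>"]

// 4. Build the regex once, outside the loop or callback.
const UPPERCASE_ONLY = /^[A-Z]+$/;
const validCodes = ["ABC", "abC", "XYZ"].filter((code) => UPPERCASE_ONLY.test(code));

console.log(looseDot, strictDot);   // true false
console.log(unanchored, anchored);  // true false
console.log(greedy, lazy);          // <b>bold</b> [ '<b>', '</b>' ]
console.log(validCodes);            // [ 'ABC', 'XYZ' ]
```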

Pairing Regex With the Right Tooling

Regex is a precision tool. It excels at format validation, simple extraction, and find-and-replace operations on known-structure text. It fails badly at parsing nested structures, arbitrary HTML, and any context-sensitive grammar.

For applications that combine regex validation with structured parsing, automation tooling, or integration with downstream systems, the production setup matters more than any single pattern. At https://137foundry.com we build the data pipelines and validation layers that sit around regex use in production systems, particularly the data integration and AI automation work where input formats are messy and the cost of bad parsing compounds downstream.

For more on the related topic of validating data before processing, the 137Foundry articles cover patterns for data validation layers, input sanitization, and the production engineering practices that make regex-based validation actually reliable in real applications.

Regex is not magic. The right patterns, applied carefully to inputs you understand, are useful tools. The wrong patterns, applied without thinking about what input will actually arrive, are the source of half the bugs in any production data pipeline. The snippets above are a starting point, not a finish line. Test them on your actual data before shipping.