What are Regex?

To most people they’re long, complicated strings of text that look too scary to be anything they’d want to know about. That’s a shame, because they can actually be very useful and quite straightforward.

So, to start again with an open mind: regex are a way of describing a pattern of text that should be matched. Still too scary? OK, imagine a friend asked you what an email address looks like. After asking what rock he’d been living under, you might answer:

It’s got a name part, then an ‘@’, then a domain part (e.g. the textpression part in davidhowes@textpression.com), then a ‘.’, then a top-level domain (e.g. com, net, etc.).

Let’s break it down a little more:

The ‘name’ part of the email address can be one or more letters, numbers or the underscore

Then we need an ‘@’

Then we need the ‘domain’, which can be one or more letters, numbers or the underscore

Then we need a ‘.’

Then we need the top-level domain, which we’ll assume is 2 to 4 letters long.

Other email addresses will follow this same basic pattern as well. So if we can express this pattern in a way a computer parser will understand, we’ve got a way of matching email addresses in text. That way is a regular expression (commonly known as a regex).

Here’s our regex:

\w+@\w+\.[A-Z]{2,4}

Here’s how it breaks down:

\w match a ‘word’ type char (letter, number or underscore; the backslash here says ‘don’t match the character ‘w’, I want a ‘word’ type char’)
+ between 1 and unlimited times
@ then match an ‘@’
\w and then match another ‘word’ type char
+ between 1 and unlimited times
\. then match a ‘.’ (regex syntax usually uses a ‘.’ to mean match (almost) anything; the backslash here says ‘I actually want a dot, not (almost) anything’)
[A-Z] then match a character between A and Z (the square brackets denote a class. You type chars, or ranges of chars using a ‘-’ symbol, inside a class, and the parser will match one char in the searched text so long as it appears in the class. As written, A-Z covers uppercase letters only, so turn on your regex flavor’s case-insensitive option, or use [A-Za-z], to match a lowercase address like davidhowes@textpression.com)
{2,4} between 2 and 4 times.
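
To see the breakdown in action, here is a minimal sketch in Python using the standard re module. The function name find_emails and the sample text are made up for illustration, and the case-insensitive flag is an assumption added so that [A-Z] also matches lowercase letters.

```python
import re

# The pattern from the text. The raw string (r"...") stops Python from
# interpreting the backslashes before the regex engine sees them.
# re.IGNORECASE is an assumption: without it, [A-Z] matches uppercase only.
EMAIL_PATTERN = re.compile(r"\w+@\w+\.[A-Z]{2,4}", re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that matches the simple email pattern."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical sample text, used only to exercise the pattern.
sample = "Contact davidhowes@textpression.com or SALES@EXAMPLE.NET for details."
print(find_emails(sample))
# prints ['davidhowes@textpression.com', 'SALES@EXAMPLE.NET']
```

This pattern is deliberately simple. Because \w does not include ‘.’ or ‘-’, addresses with dots or hyphens in the name or domain will not be matched in full, so treat it as a teaching example rather than a real email validator.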

Regex for the most part aren’t hard to understand. What puts most people off is the syntax.

Some people see past the syntax. Remember this?

Neo: Is that…

Cypher: The Matrix? Yeah.

Neo: Do you always look at it encoded?

Cypher: Well you have to. The image translators work for the construct program. But there’s way too much information to decode the Matrix. You get used to it. I…I don’t even see the code. All I see is blonde, brunette, red-head. Hey, you uh… want a drink?

Neo: Sure

Most people don’t see past regex syntax though, partly because they simply don’t use regex that often. And this is often because every time they come to use one, they get bogged down by syntax and spend hours trying to debug something that doesn’t work because of a misplaced or missing backslash or something equally trivial but frustrating.

So that’s regex in a nutshell. A really small nutshell, true enough; I could bore you half to death with the history of regex and lots of other stuff you’d skip over in an attempt to find a solution to the problem you’re trying to solve right now.

In practice though, to create a regex, do this:

1. Think about how you’d describe what you’re looking for to a friend; not a smart friend, not a friend with common sense, but the kind of friend you literally have to spell everything out to in minute detail

2. Write out what you’re looking for in plain English and keep breaking it down while still in English; similar to the way I did with the email example, but make sure you’ve got all the possibilities covered that you need to cover

3. Figure out what regex functionality you need to describe your pattern; the landing page of this guide has got you covered for the most part

4. Either go research the syntax you need to implement each part of your regex and then carefully apply it, bearing in mind the differences between regex flavors, the fact that some flavors don’t support all regex functionality, and the need to escape your pattern properly so it survives the string handling of certain languages when you use their regex libraries; OR (you must have seen the plug coming [embarrassed smile]) download Textpression (free at present), which handles all this for you!
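
As an illustration of the escaping issue mentioned in step 4, here is a small Python sketch. Python is just one example of a language where string handling and regex syntax both use the backslash; the variable names and sample strings are made up for illustration.

```python
import re

# Two ways to write the same pattern in Python source code.
raw_pattern = r"\w+\.\w+"        # raw string: backslashes reach the regex engine untouched
escaped_pattern = "\\w+\\.\\w+"  # ordinary string: each backslash has to be doubled
print(raw_pattern == escaped_pattern)  # prints True

# When the thing you are searching for is literal text, re.escape adds the
# backslashes for you, so the '.' means a real dot and not '(almost) anything'.
literal = re.escape("textpression.com")
print(literal)  # prints textpression\.com

# Unescaped, the '.' also matches the 'X' here; escaped, it does not.
print(re.search("textpression.com", "textpressionXcom") is not None)  # prints True
print(re.search(literal, "textpressionXcom") is not None)             # prints False
```

Other languages and regex flavors have their own escaping rules, so check the documentation for the one you are using before carrying this over.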