Official website for Linux User & Developer
Apr 12

Regular expressions guide

by Richard Smedley

Learn regular expressions to more effectively search through code and the shell

We’re always searching for something – the file where we wrote that recipe (Python or baking); the comment in 100,000 lines of code that points to an unfinished module; the log entry about an iffy connection. Regular expressions (abbreviated as regexps hereafter, but you’ll also see regex and re) are a codified method of searching which, to the unenlightened, resembles line noise. Yet, despite a history that stretches back to Ken Thompson’s 1968 QED editor, they’re still a powerful tool today, thanks to grep – ‘global regular expression print’. Plain grep exposes only the limited Basic Regular Expressions (BRE); grep -E (or egrep) gives Extended Regular Expressions (ERE). Elsewhere the common dialect is PCRE (Perl Compatible Regular Expressions), developed in 1997 by Philip Hazel and understood by many languages, though not always implemented in exactly the same way. We’ll use grep -P when we need to access these. Emacs has its own regexp style.

This introduction is mostly aimed at searching from the shell, but you should easily be able to adapt it to standalone Perl scripts, and other languages which use PCRE.

Even the simplest regexp can make you more productive at the command line

Resources

Your favourite editor
Perl 5.10 (or later)

Step-by-step

Step 01 Word up!

You’re probably used to searching a text file for occurrences of a word with grep – in that case, the word is the regular expression. More complicated regexps are simply concise ways of searching for parts of words, or character strings, in particular positions.
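As a minimal sketch of that idea, the snippet below searches a few made-up lines (standing in for a real file) for a plain word. The sample text and variable names are assumptions for illustration only.

```shell
# Hypothetical sample text standing in for a real file.
notes='flour, eggs and milk
import re  # python recipe
bake the recipe for 20 minutes'

# The plain word "recipe" is itself the regular expression.
printf '%s\n' "$notes" | grep 'recipe'

# grep -c counts the matching lines instead of printing them.
count=$(printf '%s\n' "$notes" | grep -c 'recipe')
echo "$count"
```

With a real file you would simply run grep 'recipe' filename.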

Step 02 Reserved character

Some characters mean special things in regexp pattern matching: . * [ ] ^ $ \ in Basic Regular Expressions. The ‘.’ matches any character, so a pattern containing it doesn’t just find the full stop, unless it is escaped as \. or grep’s -F option is used to make the string entirely literal.
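The difference is easy to see with a sketch like the following, where the two sample lines are invented for the purpose: the unescaped dot matches both, while -F or a backslash pins it to a literal full stop.

```shell
# Hypothetical sample lines.
lines='version 1.5 released
version 135 released'

# "." matches any character, so both lines match.
regex_count=$(printf '%s\n' "$lines" | grep -c '1.5')

# -F treats the whole pattern as a fixed string.
fixed_count=$(printf '%s\n' "$lines" | grep -c -F '1.5')

# Escaping the dot makes just that character literal.
escaped_count=$(printf '%s\n' "$lines" | grep -c '1\.5')

echo "$regex_count $fixed_count $escaped_count"   # prints: 2 1 1
```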

Step 03 Atlantic crossing

Extended Regular Expressions add ? + | { } ( ) to the metacharacters. grep -E or egrep lets you use them, so that ‘standardise|standardize’ can match British or American (and ‘Oxford’) spellings of ‘standardise’.
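A small sketch of the alternation, run against made-up sample lines:

```shell
# Hypothetical sample text.
text='We must standardise the API
They standardize on UTF-8
Nothing relevant here'

# With -E the "|" means "either the pattern on the left or the one on the right".
printf '%s\n' "$text" | grep -E 'standardise|standardize'

count=$(printf '%s\n' "$text" | grep -E -c 'standardise|standardize')
echo "$count"   # prints: 2
```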

Step 04 Colourful?

Parentheses limit the scope of ‘|’ to the alternatives inside them – standardi(s|z)e – saving unnecessary typing. Another way to find both British and American spellings is ‘?’, which indicates one or zero of the preceding element, such as the u in colou?r.
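Both forms can be tried on an invented word list, as in this sketch. The -x option (match the whole line) is added here only to keep the counts unambiguous.

```shell
# Hypothetical word list, one word per line.
words='standardise
standardize
colour
color
colouur'

# Grouped alternation: s or z in the middle of the word.
grouped=$(printf '%s\n' "$words" | grep -E -x -c 'standardi(s|z)e')

# "?" makes the preceding u optional: zero or one of them.
optional=$(printf '%s\n' "$words" | grep -E -x -c 'colou?r')

echo "$grouped $optional"   # prints: 2 2
```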

Step 05 Mmmmm, cooooool

The other quantifiers are + for at least one of the preceding element (‘_+’ finds lines with at least one underscore) and * for zero or more (coo*l matches col, cool and coooooooool, but not cl; useful for different spellings of mmmmmmmmm or zzzzzzzzzz).
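The two quantifiers can be checked against a made-up word list, as in this sketch:

```shell
# Hypothetical word list.
words='cl
col
cool
coooooooool
my_var
plain'

# coo*l: "co", then zero or more further o's, then "l" (-x matches whole lines).
star=$(printf '%s\n' "$words" | grep -E -x 'coo*l')
printf '%s\n' "$star"

# _+: at least one underscore somewhere in the line.
plus=$(printf '%s\n' "$words" | grep -E -c '_+')
echo "$plus"   # prints: 1
```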

Step 06 No number

Feeling confident? Good, time for more goodies. [0-9] is short for [0123456789] and matches any one of the characters in the square brackets. A ^ at the start of the brackets is a negation, so [^0-9] matches any non-digit. But the other ^? …
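A brief sketch with invented sample lines shows the bracket and its negation side by side. Note that [^0-9] matches a line containing at least one non-digit, not a line without digits.

```shell
# Hypothetical sample lines.
lines='room 101
no digits here
2024'

# Lines containing at least one digit.
with_digit=$(printf '%s\n' "$lines" | grep -c '[0-9]')

# Lines containing at least one character that is not a digit.
with_nondigit=$(printf '%s\n' "$lines" | grep -c '[^0-9]')

echo "$with_digit $with_nondigit"   # prints: 2 2
```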

Step 07 Start to finish

Outside brackets, ^ anchors the expression to the beginning of the line; $ anchors it to the end. Now you can sort your document.text from text.doc and find lines beginning with # or ending in a punctuation mark other than a period.
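The anchors might be exercised like this. The sample lines and the particular set of punctuation marks in the brackets are assumptions chosen for illustration.

```shell
# Hypothetical sample lines.
lines='# a comment
code # trailing comment
Is this done?
All finished.
text.doc
document.text'

# ^ anchors the match to the start of the line.
starts_hash=$(printf '%s\n' "$lines" | grep -c '^#')

# $ anchors to the end; the bracket lists punctuation other than a period.
ends_punct=$(printf '%s\n' "$lines" | grep -c '[!?;:,]$')

# An escaped dot plus $ picks out the .doc ending only.
ends_doc=$(printf '%s\n' "$lines" | grep -c '\.doc$')

echo "$starts_hash $ends_punct $ends_doc"   # prints: 1 1 1
```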

Step 08 A to Z Guide

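The range in [] can be anything from the ASCII character set, so [ \t\r\n\v\f] indicates the whitespace characters (tab, newline et al) in Perl-style regexps; POSIX bracket expressions treat the backslash as a literal, so use [[:space:]] with plain grep. [^bd]oom$ matches ‘oom’ at the end of a line when the character before it is anything other than b or d – so room and bloom, but not boom or doom.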
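Here is a quick sketch of the negated bracket combined with the end-of-line anchor, using an invented word list:

```shell
# Hypothetical sample lines.
lines='boom
doom
a quiet room
bloom
rooms'

# "oom" at the end of the line, preceded by any character except b or d.
matches=$(printf '%s\n' "$lines" | grep '[^bd]oom$')
printf '%s\n' "$matches"
```

Only ‘a quiet room’ and ‘bloom’ are printed: ‘rooms’ fails the $ anchor, and boom and doom fail the negated bracket.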

Step 09 POSIX classes

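The POSIX classes for character ranges, such as [:alnum:] and [:space:], save a lot of typing of the [A-Za-z0-9] kind; they are used inside a bracket expression, as in [[:alnum:]]. Perhaps most useful, though, is the non-POSIX addition of [:word:] (a Perl extension), which matches [A-Za-z0-9_], the underscore helping to match identifiers in many programming languages.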
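Since [:word:] may not be available in grep’s POSIX modes, this sketch spells the same set out as [[:alnum:]_] and uses it to pull identifiers from a made-up line of code. The -o option (print only the matched text) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Hypothetical line of source code.
code='total_count = max(a1, b2) + 42'

# An identifier: a letter or underscore, then any run of letters, digits or underscores.
identifiers=$(printf '%s\n' "$code" | grep -o -E '[[:alpha:]_][[:alnum:]_]*')
printf '%s\n' "$identifiers"
```

The bare number 42 is skipped because the pattern insists on a letter or underscore first.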

Step 10 ASCII style

Where character classes aren’t implemented, knowledge of ASCII’s underpinnings can save you time: [ -~] is all printable ASCII characters (character codes 32-126, space to tilde) and its inverse [^ -~] matches everything else, such as control characters and anything outside ASCII.
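A sketch of the ASCII range trick follows. Setting LC_ALL=C is an assumption made here so that the range is interpreted by byte value rather than by the locale’s collation order; the sample lines are invented.

```shell
# Build a sample containing a tab, which is a non-printable control character.
tab=$(printf '\t')
lines="plain text
tab${tab}separated"

# Lines containing any character outside the space-to-tilde range.
nonprint=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '[^ -~]')

# Lines made up entirely of printable ASCII (-x matches whole lines).
printable=$(printf '%s\n' "$lines" | LC_ALL=C grep -c -x '[ -~]*')

echo "$nonprint $printable"   # prints: 1 1
```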

Step 11 Beyond grep

Both find and locate work well with regexps. In The Linux Command Line (reviewed in LUD 111), William Shotts gave the great example of find . -regex '.*[^-_./0-9a-zA-Z].*' to find filenames with embedded spaces and other nasties.
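To try the find example safely, this sketch builds a throwaway directory with one well-behaved filename and one containing a space. The directory and filenames are hypothetical, and mktemp -d is assumed to be available.

```shell
# Create a scratch directory with one tidy and one awkward filename.
tmpdir=$(mktemp -d)
touch "$tmpdir/good_name-1.txt" "$tmpdir/bad name.txt"

# -regex must match the whole path, hence the leading and trailing ".*".
# The bracket lists every character we consider acceptable, negated with ^.
found=$(cd "$tmpdir" && find . -regex '.*[^-_./0-9a-zA-Z].*')
printf '%s\n' "$found"   # prints: ./bad name.txt

rm -r "$tmpdir"
```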

Step 12 Nice one Cyril

Speaking of non-standard characters, while [:alpha:] depends on your locale settings, and may only find ASCII text, you can still search for characters of other alphabets – from accented French and Welsh letters to the Greek or Russian alphabet.

Step 13 Ranging repeat

While {4} would match the preceding element if it occurred four times, putting in two numbers gives a range. So [0-9]{1,3} finds one-, two- or three-digit numbers – a quick find for dotted quads, although it won’t filter out 256-999.
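As a rough sketch, the range quantifier can be combined with a group and anchors to pick out dotted quads. The full pattern and the sample lines are assumptions; note that 999.1.1.1 still slips through, as warned.

```shell
# Hypothetical sample lines.
lines='192.168.1.10
999.1.1.1
1.2.3
host.example.com'

# Three groups of "1-3 digits plus a dot", then a final group of 1-3 digits.
quad='^([0-9]{1,3}\.){3}[0-9]{1,3}$'

matches=$(printf '%s\n' "$lines" | grep -E "$quad")
printf '%s\n' "$matches"
```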

Step 14 Bye bye, IPv4

FOSDEM was all IPv6 this year, so let’s not waste any more time on IPv4 validation, as the future may actually be here. IPv6 validators may look like Perl ‘line noise’, but they boil down to checking for appropriate amounts of hex.

Step 15 Validation

By now regexps should be looking a lot less like line noise, so it’s time to put together a longer one, just building from some of the simpler parts. A common programming task, particularly with web forms, is validating that input is in the correct format – such as dates.

In this case we’re looking at validating dates, e.g. for date-of-birth (future dates could then be filtered using the current date). Note that (0[1-9]|[12][0-9]|3[01]) checks numbers 01-31, but won’t prevent 31st February.
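One way such a pattern might be assembled is sketched below. The DD/MM/YYYY layout, the month and year sub-patterns and the sample dates are all assumptions; only the day part comes from the text above.

```shell
# Hypothetical input, one candidate date per line.
dates='17/03/1984
31/02/2001
32/01/1999
7/3/84'

day='(0[1-9]|[12][0-9]|3[01])'   # 01-31, as described above
month='(0[1-9]|1[0-2])'          # 01-12 (assumed)
year='(19|20)[0-9]{2}'           # 1900-2099 (assumed)

valid=$(printf '%s\n' "$dates" | grep -E "^$day/$month/$year\$")
printf '%s\n' "$valid"
```

The impossible 31/02/2001 is accepted, which shows why a format check like this still needs a real date check behind it.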

Step 16 Back to basics

Now we have the basics, and can string them together, don’t neglect the grep basics – for example, counting how many attempts at unauthorised access were made over SSH in a given period. An unnecessary pipe (to wc -l, say) can be replaced with grep -c.
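The sketch below compares the two approaches. The log lines are invented, loosely modelled on typical sshd entries, and on a real system you would point grep at your own auth log instead.

```shell
# Hypothetical log excerpt.
log='Apr 12 10:01:02 host sshd[101]: Failed password for root from 203.0.113.5 port 4022 ssh2
Apr 12 10:01:09 host sshd[101]: Accepted password for alice from 198.51.100.7 port 5022 ssh2
Apr 12 10:02:13 host sshd[102]: Failed password for invalid user admin from 203.0.113.5 port 4023 ssh2'

# The long way round: print matches, then count lines (tr strips any padding).
piped=$(printf '%s\n' "$log" | grep 'Failed password' | wc -l | tr -d ' ')

# The short way: let grep count the matching lines itself.
counted=$(printf '%s\n' "$log" | grep -c 'Failed password')

echo "$piped $counted"   # prints: 2 2
```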

Step 17 Why vi?

Whatever your position in the venerable and affectionate vi/Emacs war, there will be times and servers where vi is your only tool, so grab yourself a cheat-sheet. Vi and Vim mostly follow BRE, and add extras such as the \< and \> word boundaries, which match the start and end of a word.

Step 18 Boundary guard

As well as ^ and $ for line ends, word boundaries can be matched in regexps with \b – enabling matches on, say, ‘hat’ without matching ‘chatter’. The escape character, \, is used to add a number of extra elements, such as \d for a numerical digit in Perl-style regexps.
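A short sketch of the boundary at work, assuming GNU grep (which accepts \b as an extension) and some made-up sample lines:

```shell
# Hypothetical sample lines.
lines='put on a hat
idle chatter
that is all'

# Without boundaries, "hat" is found inside chatter and that as well.
unbounded=$(printf '%s\n' "$lines" | grep -c 'hat')

# \b on each side insists on a word edge before and after.
bounded=$(printf '%s\n' "$lines" | grep -c '\bhat\b')

echo "$unbounded $bounded"   # prints: 3 1
```

grep’s -w option gives a similar whole-word match without any escapes in the pattern.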

Step 19 Literally meta

Speaking of boundaries, placing \Q and \E around part of a Perl-style regexp will treat everything between them as literals rather than metacharacters – meaning you can quote just a part of the regexp, unlike grep -F where everything becomes a literal.

Step 20 Lazy = good

Time to think about good practice. * is a greedy operator, so something like <.*> expands to grab everything up to the last closing angle bracket on the line, including any further tags in between. <.*?> is non-greedy (lazy), stopping at the first closing angle bracket.

Step 21 Perl -pie

Aside from grep, Perl remains the most comfortable fit with regexps, as it is far more powerful than the former. With perl -pi -e on the command line (the ‘pie’ of the title; the switches are written separately because -i takes an optional backup suffix), you can perform anything from simple substitutions on one or more files, to…

Step 22 Perl one-liner

…counting the empty lines in a text file (this from Peteris Krumins’ Perl One-Liners, see next month’s book reviews). /^$/ matches an empty line; note Perl’s use of // to delimit a regexp. With the m prefix a different delimiter can be chosen, as in m,/usr/bin, which helps if / is one of the literals in the pattern.

Step 23 A regexp too far

Now you know the basics, you can build slightly more complicated regexps – but, as Jeff Atwood said: “Regular expressions are like a particularly spicy hot sauce – to be used in moderation and with restraint, only when appropriate.”

Step 24 Tagged offender

Finally, know the limitations of regexps. Don’t use them on HTML, as they don’t parse complex languages well. The legendary Stack Overflow reply by bobince to a query on their use with HTML expresses the passion this question engenders.

    • Jeffrey Ruby

      Am I missing something above? I tried different browsers, but it seems the “code” is missing. I get steps and a little explanation, but no commands.

      Thanks.

    • Robin Lovelace

      Yeah where’s the code? Would be useful

    • WmFS

      Although the code seems to be missing from these examples, the test ipv6 code can be found at http://download.dartware.com/thirdparty/test-ipv6-regex.pl