Random techie babbling...

The joys of regex

So a couple of days ago someone (I’ve forgotten who, sorry!) put up an interesting post about using regex (regular expressions) in Eclipse to save you loads of repetitive typing. As a follow-up I thought I’d show a real-life example of using regex that just saved me a whole heap of laborious grunt work.

The problem

A client asked me to update the images on a site I’m building for them. The image details are all in a simple xml file with some details and a link to the image file, e.g…

<image file="img1.jpg"><client>Foobar</client><by>Someone</by></image>
<image file="img2.jpg"><client>Barfoo</client><by>Somebody Else</by></image>

The client, however, supplied me with a text document that looked something like…

1. Foobar/Someone
2. Barfoo/Somebody Else

…and a bunch of jpeg files that had been named to match the document, so they were actually called ‘1. Foobar_Someone.jpg’ etc. and needed to be renamed for safe use on the web (I never like having mixed case and spaces in web filenames).

Now as there were around 80 of these files it could have been a long and boring ‘rename & save’ job, then a whole bunch of cutting and pasting, so instead I used Eclipse and Perl’s regex powers.

Eclipse solution

The first thing I did was load the text file in Eclipse, then hit CTRL + F for the find & replace dialogue and checked the ‘Regular Expressions’ box. In the ‘find’ box I put

^(\d*)\. (.*)/(.*)$

This is a fairly simple pattern that uses parentheses to capture matching ‘groups’ that we can use later. Taking it from the beginning…

  • the ^ character matches the start of a line
  • the (\d*) captures the leading number
  • the \. matches a literal dot (the backslash is an escape character, as a dot normally means ‘match any character’)
  • it may be hard to see here, but there’s then a space, which we match but don’t capture
  • the (.*)/(.*) captures the two groups of words on either side of the slash
  • and the $ matches the end of the line

Then in the replace box I put

<image file="img$1.jpg"><client>$2</client><by>$3</by></image>

The dollar+number means use the contents of capture group n, so you can see I’m simply ‘pasting’ the captured bits in the correct places.

Then just hit ‘replace all’ and job done!
(Hint: You can also use CTRL+SPACE in the find & replace input boxes to remind you of the regex syntax)
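
If you’d rather script this step than click through the Eclipse dialogue, here’s a rough sketch of the same find & replace in Python. Python isn’t part of what I actually did, and the names (PATTERN, REPLACEMENT, to_xml) are just made up for illustration. The one real difference to watch for is that Python’s re module writes group references as \1, \2, \3 where Eclipse uses $1, $2, $3.

```python
import re

# Same pattern as in the Eclipse 'find' box. MULTILINE makes ^ and $
# match at the start and end of every line, not just of the whole text.
PATTERN = re.compile(r"^(\d*)\. (.*)/(.*)$", re.MULTILINE)

# Same replacement as in the Eclipse 'replace' box, with \1..\3 instead of $1..$3.
REPLACEMENT = r'<image file="img\1.jpg"><client>\2</client><by>\3</by></image>'


def to_xml(text):
    """Rewrite every 'N. Client/Author' line as an <image> element."""
    return PATTERN.sub(REPLACEMENT, text)


sample = "1. Foobar/Someone\n2. Barfoo/Somebody Else"
print(to_xml(sample))
```

Run against the two sample lines from the client’s document, this should print the two <image> lines shown at the top of the post.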

Perl solution

Of course I still needed to rename all the files, so next I used the ‘rename’ command. I’m working on Linux, where (on Debian-based systems at least) ‘rename’ is a Perl script that ships alongside Perl. Be aware that some other distributions install a different ‘rename’ with a different syntax, so check ‘man rename’ on your system before trying this.

The syntax for the rename command is…

rename perlexpr [ files ]

…and it basically runs the Perl expression ‘perlexpr’ on the filename of every file matching [files]. The expression I used here is…

rename -v 's/(\d*)\..*/img$1.jpg/' *.jpg

The regular expression is the messy looking bit inside the inverted commas, the -v just prints each rename as it happens, and the *.jpg at the end picks out all the .jpg’s in the folder. Again taking the regex from the top…

  • The “s” means substitute. The syntax is s/old/new/, i.e. substitute the old with the new
  • The (\d*) captures the initial number in the filename
  • The \. matches a literal dot
  • The .* matches anything else after it
  • We then discard all the other crap apart from the captured number, and use it (as $1) in the substituted filename.

(Hint: more info : http://tips.webdesign10.com/how-to-bulk-rename-files-in-linux-in-the-terminal)
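
If the Perl ‘rename’ isn’t available on your machine, the same idea is easy enough to sketch in Python. Again, this isn’t what I actually ran, and web_name and rename_all are hypothetical names of my own. I’ve also made two small changes on purpose: the pattern is anchored with ^ and uses \d+ rather than \d*, so a file that doesn’t start with a number is left alone, and the function defaults to a dry run that only reports what it would do.

```python
import re
from pathlib import Path

# Keep the leading number, throw away everything after the first dot.
PATTERN = re.compile(r"^(\d+)\..*$")


def web_name(filename):
    """Return 'imgN.jpg' for names like 'N. Client_Author.jpg'.

    Names that don't start with a number are returned unchanged.
    """
    return PATTERN.sub(r"img\1.jpg", filename)


def rename_all(folder, dry_run=True):
    """List the (old, new) names for every .jpg in folder, renaming if dry_run is False."""
    plan = []
    for path in sorted(Path(folder).glob("*.jpg")):
        new_name = web_name(path.name)
        if new_name != path.name:
            plan.append((path.name, new_name))
            if not dry_run:
                path.rename(path.with_name(new_name))
    return plan


print(web_name("1. Foobar_Someone.jpg"))
```

Calling rename_all(".") lists the planned renames for the current folder without touching anything; pass dry_run=False once the list looks right.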


Well, it’s taken me a hell of a lot longer to write this post than it did to rename all those files. I wouldn’t like to guess how long it would have taken by hand, but I’m sure it was much easier this way.

Try a bit of regex yourself, you just might like it! :)


  1. Gareth says:

    I’ve finally had to use this find and replace within Eclipse. However, when I try to replace the text, it doesn’t seem to work… I get a “No group 1” error?

    Find: \{\W+
    Replace With: $1this._

    which should find everything from { to the first letter it finds and replace that with the “found” string + this._ shouldn’t it?

    When I just do a “Find” it matches what I’m looking for perfectly, but it doesn’t seem to do the replace.

  2. Gareth says:

    Never mind. I changed it to $0 instead of $1 in the replace statement and it worked correctly.
