The joys of regex
December 11th, 2007
So a couple of days ago someone (I’ve forgotten who, sorry!) put up an interesting post about using regex (regular expressions) in Eclipse to save you loads of repetitive typing. As a follow-up I thought I’d show a real-life example of using regex that just saved me a whole heap of laborious grunt work.
The problem
A client asked me to update the images on a site I’m building for them. The image details are all in a simple xml file with some details and a link to the image file, e.g…
<images>
  <image file="img1.jpg"><client>Foobar</client><by>Someone</by></image>
  <image file="img2.jpg"><client>Barfoo</client><by>Somebody Else</by></image>
  ...
</images>
The client, however, supplied me with a text document that looked something like…
1. Foobar/Someone
2. Barfoo/Somebody Else
....
…and a bunch of jpeg files that had been named to match the document, so they were actually called ‘1. Foobar_Someone.jpg’ etc and needed to be renamed for safe use on the web (I never like having mixed case and spaces in web filenames).
Now as there were around 80 of these files it could have been a long and boring ‘rename & save’ job, then a whole bunch of cutting and pasting, so instead I used Eclipse and Perl’s regex powers.
Eclipse solution
The first thing I did was load the text file in Eclipse, then hit CTRL + F for the find & replace dialogue and checked the ‘Regular Expressions’ box. In the ‘find’ box I put
^(\d*)\. (.*)/(.*)$
This is a fairly simple pattern match using parentheses to capture matching ‘groups’ that we can use later. Taking it from the beginning…
- the ^ character matches the start of a line, and the (\d*) captures the digits at the start (the item number)
- the \. matches a literal dot (the backslash is an escape character, as the dot normally means match any character)
- It may be hard to see here, but there’s then a space which we ignore
- the (.*)/(.*) matches the two groups of words around the slash
- and the $ matches the end of the line
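If you want to see exactly what each group picks up, here is a minimal sketch in Python. It uses Python’s re module rather than Eclipse, so treat it as an illustration only; the function name capture_groups is my own invention, and the sample lines are the two from the client’s document above.

```python
import re

# Same pattern as in the Eclipse 'find' box.
PATTERN = re.compile(r"^(\d*)\. (.*)/(.*)$")


def capture_groups(line):
    """Return the three captured groups for one line, or None if it doesn't match."""
    match = PATTERN.match(line)
    return match.groups() if match else None


print(capture_groups("1. Foobar/Someone"))        # ('1', 'Foobar', 'Someone')
print(capture_groups("2. Barfoo/Somebody Else"))  # ('2', 'Barfoo', 'Somebody Else')
```

One thing to watch: the first (.*) is greedy, so if a line contained more than one slash, the split would happen at the last one.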
Then in the replace box I put
<image file="img$1.jpg"><client>$2</client><by>$3</by></image>
The dollar+number means use the contents of capture group n, so you can see I’m simply ‘pasting’ the captured bits in the correct places.
Then just hit ‘replace all’ and job done!
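The same find & replace can be sketched outside Eclipse too. This is a rough Python equivalent, assuming the whole text file has been read into one string; the function name to_image_tags is hypothetical, and note that Python writes group references as \1, \2, \3 where Eclipse uses $1, $2, $3.

```python
import re

# re.MULTILINE makes ^ and $ match at every line, like Eclipse does per line.
PATTERN = re.compile(r"^(\d*)\. (.*)/(.*)$", re.MULTILINE)

# Python's replacement syntax uses \1, \2, \3 instead of $1, $2, $3.
REPLACEMENT = r'<image file="img\1.jpg"><client>\2</client><by>\3</by></image>'


def to_image_tags(text):
    """Turn every 'N. Client/Author' line in text into an <image> element."""
    return PATTERN.sub(REPLACEMENT, text)


sample = "1. Foobar/Someone\n2. Barfoo/Somebody Else"
print(to_image_tags(sample))
```

Like the Eclipse version, this pastes the captured text in as-is, so a name containing & or < would still need escaping by hand before the XML is valid.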
(Hint: You can also use CTRL+SPACE in the find & replace input boxes to remind you of the regex syntax)
Perl solution
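Of course I still needed to rename all the files, so next I used the ‘rename’ command. I’m working on Linux, where ‘rename’ is a small Perl script on Debian-style systems (sometimes installed as ‘prename’ or ‘file-rename’). Be aware that some other distributions ship a different, non-Perl ‘rename’ with its own syntax, so check ‘man rename’ before relying on the command below.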
The syntax for the rename command is…
rename perlexpr [ files ]
…and basically runs the Perl expression ‘perlexpr’ (usually a regex substitution) against the filename of every file matching [files]. The expression I used here is…
rename -v 's/(\d*)\..*/img$1.jpg/' *.jpg
The regular expression is the messy-looking bit inside the inverted commas, the *.jpg at the end applies it to every .jpg in the folder, and -v just prints each rename as it happens. Again taking the regex from the top…
- The “s” means substitute. The syntax is s/old/new/: substitute the old with the new
- The (\d*) captures the initial number in the filename
- The \. matches a literal dot
- The .* matches anything else after it
- We then discard all the other crap apart from the captured number, and use it in the substituted filename.
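If you don’t have the Perl ‘rename’ to hand, the same idea can be sketched in Python. This is only an illustration under a few assumptions: the names new_name and rename_images are made up, I’ve anchored the pattern and used \d+ (at least one digit) so files without a leading number are skipped, and it defaults to a dry run that only reports what it would do.

```python
import re
from pathlib import Path

# Leading digits, a literal dot, then everything else in the name.
PATTERN = re.compile(r"^(\d+)\..*$")


def new_name(filename):
    """Return 'img<number>.jpg' for names like '1. Foobar_Someone.jpg', else None."""
    match = PATTERN.match(filename)
    return f"img{match.group(1)}.jpg" if match else None


def rename_images(folder, dry_run=True):
    """Rename matching .jpg files in folder; with dry_run=True only report the plan."""
    planned = []
    for path in sorted(Path(folder).glob("*.jpg")):
        target_name = new_name(path.name)
        if target_name is None:
            continue  # no leading number, leave the file alone
        planned.append((path.name, target_name))
        if not dry_run:
            path.rename(path.with_name(target_name))
    return planned


print(new_name("1. Foobar_Someone.jpg"))  # img1.jpg
```

Calling rename_images(".") first and reading the returned list is a cheap way to check the plan before running it again with dry_run=False.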
(Hint: more info : http://tips.webdesign10.com/how-to-bulk-rename-files-in-linux-in-the-terminal)
Conclusion
Well, it’s taken me a hell of a lot longer to write this post than it did to rename all those files. I wouldn’t like to guess how long it would have taken by hand, but I’m sure it was much easier this way.
Try a bit of regex yourself, you just might like it!



