
How to return tag contents with regular expressions
(Photo: Jeff Kubina)

As most of you already know, I LOVE regular expressions, and I think they are great for solving both simple and complex tasks involving strings.
One thing I often need to do is extract the contents of HTML or XML tags.
Take the following XML document, for example:

<catalog>
  <cd>
    <title>empire burlesque</title>
    <artist>bob dylan</artist>
    <country>usa</country>
    <company>columbia</company>
    <price>10.90</price>
    <year>1985</year>
  </cd>
  <cd>
    <title>hide your heart</title>
    <artist>bonnie tyler</artist>
    <country>uk</country>
    <company>cbs records</company>
    <price>9.90</price>
    <year>1988</year>
  </cd>
  <cd>
    <title>greatest hits</title>
    <artist>dolly parton</artist>
    <country>usa</country>
    <company>rca</company>
    <price>9.90</price>
    <year>1982</year>
  </cd>
</catalog>

Now, let’s say we want to return all the titles contained within the <title> tags, but not return the tags themselves.
A pattern that does this could be written like:

/<title>(.+?)<\/title>/m

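The concept is very simple: we’re saying we want to match one or more characters after the opening tag. We then add a “?” to make the quantifier non-greedy (quantifiers are greedy by default and will match as much as they can, so the match would run through to the last closing tag in the string instead of stopping at the first). The trailing “m” is Ruby’s multiline flag, which lets “.” match newlines too.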
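
To see the greedy versus non-greedy difference for yourself, here is a small sketch. The string `line` and the variable names are made up for this illustration; it just puts two titles on one line so both behaviours are visible.

```ruby
# Two <title> elements on one line (sample data invented for this sketch).
line = "<title>empire burlesque</title><title>hide your heart</title>"

# Greedy: .+ takes as much as it can, so the match runs to the LAST </title>.
greedy = line[/<title>(.+)<\/title>/, 1]

# Non-greedy: .+? stops at the FIRST </title> it can reach.
lazy = line[/<title>(.+?)<\/title>/, 1]

puts greedy  # empire burlesque</title><title>hide your heart
puts lazy    # empire burlesque
```

The greedy version swallows the markup between the two titles, which is why the “?” matters as soon as a tag appears more than once.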
Now, if you run this pattern against the XML string in your favourite language (I chose Ruby here as it has the scan method, which does exactly what I need), you should get an array with all the titles in the XML.

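# strXML holds the XML document shown above
xml_re = /<title>(.+?)<\/title>/m
# scan returns one sub-array per match when the pattern has a group, so flatten it
m = strXML.scan(xml_re).flatten
print m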
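
If you want something you can paste straight into irb, here is a self-contained sketch. It assumes a shortened copy of the catalog, and `str_xml`, `nested` and `titles` are names picked for this example only. It also shows the shape scan gives back when the pattern contains a capture group.

```ruby
# Shortened copy of the catalog above; str_xml is a name chosen for this sketch.
str_xml = <<~XML
  <catalog>
    <cd>
      <title>empire burlesque</title>
      <artist>bob dylan</artist>
    </cd>
    <cd>
      <title>hide your heart</title>
      <artist>bonnie tyler</artist>
    </cd>
    <cd>
      <title>greatest hits</title>
      <artist>dolly parton</artist>
    </cd>
  </catalog>
XML

# The "/" in the closing tag is escaped because "/" also delimits the literal.
xml_re = /<title>(.+?)<\/title>/m

# With a capture group, scan yields one sub-array per match: [["empire burlesque"], ...]
nested = str_xml.scan(xml_re)

# flatten turns that into a plain array of strings.
titles = nested.flatten

p titles  # ["empire burlesque", "hide your heart", "greatest hits"]
```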

Nifty eh?

2 Responses to “How to return tag contents with regular expressions”

  1. Totally right Matthew, but I just used XML to illustrate how easy it would be. In fact, I’d rather use XPath (which could also be using regular expressions) for it as you mentioned.

    I would normally use the regular expression approach when dealing with HTML.

    I used this a lot when developing some of my projects here as well.

    See http://cfaday.placona.co.uk/ for example. It uses this kind of regular expression to get the various bits and pieces from Adobe’s documentation.

  2. I think you’d be better served solving this example with an XML Parser and XPath. While I agree regular expressions are a very powerful way to solve a number of different problems, they’re also a very easy way to introduce bugs.
