Mailing List Archive

Re: stripping HTML tags with Perl

>>>>> "Drew" == Drew C Poulin <> writes:

    Drew> It's true. Gulp.  Since I get paid by the word, that could
    Drew> get expensive.


    Drew> As Darren Cook mentions, the ? makes it stingy, so that it
    Drew> matches the next `>' as it works forward through the string.
    Drew> Without the ?, it jumps to the end of the string, works
    Drew> backward, and matches the first `>' that it finds as it moves
    Drew> in that direction. Or so I understand.

Yeah, I think XEmacs understands that syntax now, too.  I just don't
use it because I've been using re's that are greedy for 15 years, so I
automatically use negated character classes.
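
To make the difference concrete, here is a minimal sketch in Python,
whose `re' module uses the same syntax as Perl for these three
constructs.  The sample string and the variable names are made up for
illustration; they do not come from the thread.

```python
import re

# Hypothetical sample input, chosen so the string starts with `<'
# and ends with `>'.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the end of the string and backs up to the last `>',
# so the single match covers the whole string.
greedy = re.sub(r"<.*>", "", html)

# Stingy (non-greedy): .*? stops at the first `>' after each `<'.
stingy = re.sub(r"<.*?>", "", html)

# Negated character class: [^>]* cannot cross a `>', so the match stops
# at the first one even though the quantifier itself is greedy.
negated = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))
print(repr(stingy))
print(repr(negated))
```

On input like this the stingy and negated-class forms give the same
result; the negated class also works in regex engines that have no
stingy quantifier.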

    Drew> does wipe out everything between and including the two
    Drew> outermost < >s, but that's probably not what you'd consider
    Drew> success.

Nope.  Probably the best bet (for a one-liner) is


This (1) is stingy (because of the `>' in the inverse character
class), (2) insists that end tags be non-empty, too (Viktor's
suggestion), and (3) simply fails to match on nested < ... < ... >>
constructs and <> email addresses.
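
As a rough illustration only: the pattern below is one guess that is
consistent with points (1) and (2), written in Python rather than Perl.
It is not necessarily the one-liner from the original message, and the
names `TAG' and `strip_tags' are invented for this sketch.

```python
import re

# Assumed pattern: `<', then one or more characters that are neither
# `<' nor `>', then `>'.  The `+' is what rejects an empty `<>'.
TAG = re.compile(r"<[^<>]+>")

def strip_tags(text):
    """Remove every non-overlapping match of TAG in a single pass."""
    return TAG.sub("", text)

# Ordinary start and end tags are removed.
print(strip_tags("<p>Hello, <em>world</em>!</p>"))

# An empty pair of angle brackets is left alone.
print(strip_tags("mail from <> bounced"))

# With nesting, only the innermost bracket pair matches in one pass;
# the outer pair stays in the text.
print(strip_tags("a <b <c>> d"))
```

A single substitution pass is all a one-liner does, which is why the
outer brackets in the last example survive.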

You're actually probably OK for email addresses and any nested
constructs in valid HTML, because there the brackets around that
address would actually have to be written as &lt; and &gt;.  Silly me.

I'm not sure that everything in HTML is #PCDATA, though; any place
that CDATA (unparsed character data) is allowed, you would have to
use the usual ASCII `<' and `>'.  Save those words....

University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."