TLUG Mailing List

Mailing List Archive
Re: stripping HTML tags with Perl

To: tlug@example.com

Subject: Re: stripping HTML tags with Perl

From: "Stephen J. Turnbull" <turnbull@example.com>

Date: Tue, 5 Dec 2000 12:55:19 +0900

Content-Transfer-Encoding: 7bit

Content-Type: text/plain; charset=us-ascii

In-Reply-To: <20001204190339Z.poulin@example.com>

References: <20001204133053G.poulin@example.com><14892.18933.990298.358983@example.com><20001204190339Z.poulin@example.com>

Reply-To: tlug@example.com

Resent-From: tlug@example.com

Resent-Message-ID: <jXn5h.A.ehC.JgGL6@example.com>

Resent-Sender: tlug-request@example.com
>>>>> "Drew" == Drew C Poulin <poulin@example.com> writes:

    Drew> Honto da. Gulp.  Since I get paid by the word, that could
    Drew> get expensive.

Yep.

    Drew> As Darren Cook mentions, the ? makes it stingy, so that it
    Drew> matches the next > as it works forward through the string.
    Drew> Without the ?, it jumps to the end of the string, works
    Drew> backward, and matches the first > that it finds as it moves in
    Drew> that direction. Or so I understand.

Yeah, I think XEmacs understands that syntax now, too.  I just don't
use it because I've been using re's that are greedy for 15 years, so I
automatically use negated character classes.

    Drew> does wipe out everything between and including the two
    Drew> outermost < >s, but that's probably not what you'd consider
    Drew> success.

Nope.  Probably the best bet (for a one-liner) is

s/<\/?[^@<>]+>//g

This (1) is stingy (because of the `>' in the negated character
class), (2) insists that end tags be non-empty, too (Viktor's
suggestion), and (3) simply fails to match on nested < ... < ... >>
constructs and <phb@example.com> email addresses.

You're actually probably OK for email addresses and any nested
constructs in valid HTML because what you would actually have to do
for that address is &lt;phb@example.com&gt;.  Silly me.

I'm not sure that everything in HTML is #PCDATA, though; any place
that CDATA (unparsed character data) is allowed, you would have to
use the usual ASCII characters.  Save those words....


-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."