TLUG Mailing List

Re: stripping HTML tags with Perl

To: tlug@example.com, turnbull@example.com
Subject: Re: stripping HTML tags with Perl
From: "Drew C. Poulin" <poulin@example.com>
Date: Mon, 04 Dec 2000 19:03:39 -0800
Content-Transfer-Encoding: 7bit
Content-Type: Text/Plain; charset=us-ascii
In-Reply-To: <14892.18933.990298.358983@example.com>
References: <20001204133053G.poulin@example.com> <14892.18933.990298.358983@example.com>
Reply-To: tlug@example.com
Resent-From: tlug@example.com
Resent-Message-ID: <N2-dX.A.OcC.7qFL6@example.com>
Resent-Sender: tlug-request@example.com

From: "Stephen J. Turnbull" <turnbull@example.com>

> The actual effect of this is to delete from the string "diff" (or
> "DIFF" or "dIfF") to EOL, so that

That's true. Gulp.  Since I get paid by the word, that could get expensive.


> "\r" is the Perl idiom for ASCII CR (0x0D).  You can use the literal
> escape with arbitrary characters, but it doesn't transport well (the

I see.

> 	s/<.*?>//ig;
> 
> This is an oops, I think.  AFAIK Perl regexps are _greedy_, matching
> the longest possible string.  Thus

As Darren Cook mentions, the ? makes the match non-greedy (stingy), so it
stops at the first > it reaches as it works forward through the string.
Without the ?, .* runs to the end of the line, then backs up until it
finds a >, so the match ends at the last > on the line. Or so I understand.
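
A minimal sketch of the greedy versus non-greedy difference described above. The sample line and the variable names are made up for illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample line; not taken from the thread.
my $line = 'a <b>bold</b> word';

# Non-greedy: .*? stops at the nearest '>', so each tag is removed separately.
(my $lazy = $line) =~ s/<.*?>//g;

# Greedy: .* runs to the end of the line, then backtracks to the last '>'.
(my $greedy = $line) =~ s/<.*>//g;

print "lazy:   [$lazy]\n";    # lazy:   [a bold word]
print "greedy: [$greedy]\n";  # greedy: [a  word]
```

The greedy form swallows the text between the two tags as well, which is why its output keeps only the two surrounding spaces.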


> This avoids trashing
> the Pascal inequality test "<>" which is not a legal tag, but will
> fail miserably on stuff like
> 
> <address default="<phb@example.com>">

And it still fails miserably.  Taking the ? out of 

s/<.*?>//ig; 

does wipe out everything between and including the two outermost < >s, but 
that's probably not what you'd consider success.
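
A small sketch of how both forms behave on the quoted address tag. The trailing text, the second sample, and the helper names (strip_lazy, strip_greedy, strip_quoted) are assumptions added for illustration. The third pattern is not from the thread: it is one possible quote-aware variant, and it is still not a real HTML parser.

```perl
use strict;
use warnings;

# Non-greedy: stops at the first '>' it meets, even inside a quoted value.
sub strip_lazy   { my ($s) = @_; $s =~ s/<.*?>//g; return $s; }

# Greedy: removes everything from the first '<' to the last '>' on the line.
sub strip_greedy { my ($s) = @_; $s =~ s/<.*>//g;  return $s; }

# Assumed alternative (not from the thread): treat "..." and '...' attribute
# values as single units so a '>' inside them does not end the tag.
sub strip_quoted {
    my ($s) = @_;
    $s =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;
    return $s;
}

# First sample: the tag quoted above plus hypothetical trailing text.
# Second sample: entirely made up.
my @samples = (
    '<address default="<phb@example.com>">text',
    '<b>keep me</b>',
);

for my $html (@samples) {
    printf "lazy=[%s] greedy=[%s] quoted=[%s]\n",
        strip_lazy($html), strip_greedy($html), strip_quoted($html);
}
# lazy=[">text] greedy=[text] quoted=[text]
# lazy=[keep me] greedy=[] quoted=[keep me]
```

The non-greedy form leaves `">` behind on the first sample, and the greedy form deletes the wanted text on the second, so neither is safe on its own. For anything beyond simple input, a proper HTML parsing module is probably the safer route.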

Thanks for all the pointers.

Drew Poulin
