Mailing List Archive


Re: stripping HTML tags with Perl



From: "Stephen J. Turnbull" <turnbull@example.com>

> The actual effect of this is to delete from the string "diff" (or
> "DIFF" or "dIfF") to EOL, so that

Honto da ("that's true"). Gulp.  Since I get paid by the word, that could get expensive.


> "\r" is the Perl idiom for ASCII CR (0x0D).  You can use the literal
> escape with arbitrary characters, but it doesn't transport well (the

I see.

> 	s/<.*?>//ig;
> 
> This is an oops, I think.  AFAIK Perl regexps are _greedy_, matching
> the longest possible string.  Thus

As Darren Cook mentions, the ? makes it stingy (non-greedy), so that it
matches the next > as it works forward through the string.  Without the
?, it jumps to the end of the string (or of the line, since . stops at
a newline), works backward, and matches the first > that it finds as it
moves in that direction.  Or so I understand.
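
To check that understanding, here is a minimal sketch that runs both forms
of the substitution on the same line. The sample string and the variable
names are made up for illustration; only the two patterns come from this
thread.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original message.
my $html = 'a <b>bold</b> move';

# Stingy (non-greedy): .*? stops at the nearest following '>'.
(my $lazy = $html) =~ s/<.*?>//g;

# Greedy: .* runs to the end of the line, then backs off to the last '>'.
(my $greedy = $html) =~ s/<.*>//g;

print "lazy:   [$lazy]\n";    # lazy:   [a bold move]
print "greedy: [$greedy]\n";  # greedy: [a  move]
```

The /i modifier is left out here because the pattern contains no letters,
so case-insensitivity has nothing to act on.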


> This avoids trashing
> the Pascal inequality test "<>" which is not a legal tag, but will
> fail miserably on stuff like
> 
> <address default="<phb@example.com>">

And it still fails miserably.  Taking the ? out of 

s/<.*?>//ig; 

does wipe out everything between and including the two outermost < >s, but 
that's probably not what you'd consider success.
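
For the record, here is a small sketch of what each variant does to that
tag. The surrounding "x" and "y" text is invented, and the third pattern is
my own assumption about one way to skip over quoted attribute values; it is
not something proposed earlier in the thread, and it is still a heuristic,
not an HTML parser.

```perl
use strict;
use warnings;

# The tag from the quoted message, padded with hypothetical text.
my $line = 'x <address default="<phb@example.com>"> y';

# Stingy: stops at the '>' inside the attribute value, leaving debris.
(my $lazy = $line) =~ s/<.*?>//g;

# Greedy: removes everything from the first '<' to the last '>'.
(my $greedy = $line) =~ s/<.*>//g;

# Assumed alternative: treat "..." and '...' as single units inside a tag,
# so a '>' inside a quoted value does not end the match.  The + (rather
# than *) leaves a bare "<>" alone.
(my $quoted = $line) =~ s/<(?:[^>"']|"[^"]*"|'[^']*')+>//g;

print "lazy:   [$lazy]\n";    # lazy:   [x "> y]
print "greedy: [$greedy]\n";  # greedy: [x  y]
print "quoted: [$quoted]\n";  # quoted: [x  y]
```

The greedy result only looks right because there is a single tag on the
line; with two tags it would also eat the text between them. For real HTML,
a parser such as the CPAN module HTML::Parser is likely the safer route.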

Thanks for all the pointers.

Drew Poulin

