Mailing List Archive


stripping HTML tags with Perl



>>>>> "Drew" == Drew C Poulin <poulin@example.com> writes:

        s/diff .*?\n//ig;     #delete lines beginning with diff(sp)

The actual effect of this is to delete from the string "diff " (or
"DIFF " or "dIfF ") to EOL, wherever in the line it turns up, so that

And now for something with a diff in it!
A whole nother smoke.

==> And now for something with a A whole nother smoke.

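To get diff at BOL, you want "^diff .*\n", with the `m' flag if the
string holds more than one line.  The `?' is redundant here, since "."
doesn't match newline anyway.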
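
A minimal sketch of the difference, assuming the whole message has
been slurped into one string.  The variable names and the sample text
are invented for illustration.  Without the `m' flag, "^" would only
match at the very start of the string:

```perl
use strict;
use warnings;

# Hypothetical sample: one line that begins with "diff ", one that merely contains it.
my $text = "diff -u old.c new.c\n"
         . "Run diff on both files.\n"
         . "A whole nother smoke.\n";

# Unanchored: deletes from any "diff " through EOL, gluing two lines together.
(my $unanchored = $text) =~ s/diff .*\n//ig;

# Anchored with /m: only lines that begin with "diff " are deleted.
(my $anchored = $text) =~ s/^diff .*\n//img;

print "unanchored:\n$unanchored\n";
print "anchored:\n$anchored";
```

The unanchored version leaves "Run A whole nother smoke."; the
anchored one keeps both of the last two lines intact.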

        s/[0-9].*?\n//ig;
	s/\^M//ig;

"\r" is the Perl idiom for ASCII CR (0x0D).  You can use the literal
control character in the pattern too, but it doesn't transport well (the
"^M" in your mail is two printing characters, not a single control
character, and "\^M" matches exactly those two).  With "\r" the `i' flag
is irrelevant, since the pattern has no alphabetic characters.
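
A small sketch of the two cases, with invented sample strings: one
line ending in a real CR, and one where the two printing characters
"^" and "M" were pasted in:

```perl
use strict;
use warnings;

my $dos_line   = "some text\r\n";   # real carriage return (0x0D) before the newline
my $mail_paste = "some text^M\n";   # caret followed by capital M: two characters

(my $fixed     = $dos_line)   =~ s/\r//g;    # strips the control character
(my $untouched = $dos_line)   =~ s/\^M//g;   # literal caret + M: nothing to match here
(my $pasted    = $mail_paste) =~ s/\^M//g;   # matches only the two-character form

# Lengths show which substitutions actually removed something.
printf "%d %d %d\n", length($fixed), length($untouched), length($pasted);
```

This prints "10 11 10": the CR survives the "\^M" pattern untouched.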

	s/<.*?>//ig;

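This works, but only because of the `?'.  Perl quantifiers are _greedy_
by default, matching the longest possible string, and "*?" (Perl 5) is
what makes this one minimal.  Drop the `?' and you get

What I <em>really</em> want to say.

==> What I  want to say.

losing not only the HTML emphasis but the verbal emphasis as well.  Even
with the `?', "." won't cross a newline, so tags that wrap are missed.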

You really want "<[^>]+>" (delete anything bracketed by "<>"
containing some text which doesn't contain ">").  This avoids trashing
the Pascal inequality test "<>" which is not a legal tag, but will
fail miserably on stuff like

<address default="<phb@example.com>">

which may or may not be legal HTML.
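
The variants can be compared side by side.  This is only a sketch
with invented variable names, run against the sample strings from
this message plus a made-up Pascal line; it is not a recipe for
arbitrary HTML:

```perl
use strict;
use warnings;

my $plain  = "What I <em>really</em> want to say.";
my $pascal = "if a <> b then";
my $nested = '<address default="<phb@example.com>">';

(my $greedy  = $plain) =~ s/<.*>//g;      # greedy: first "<" through last ">"
(my $minimal = $plain) =~ s/<.*?>//g;     # minimal: each tag matched on its own
(my $negated = $plain) =~ s/<[^>]+>//g;   # negated class: same result on this input

(my $pascal_minimal = $pascal) =~ s/<.*?>//g;     # ".*?" may match nothing, so "<>" is eaten
(my $pascal_negated = $pascal) =~ s/<[^>]+>//g;   # "+" needs one character, so "<>" survives

(my $nested_negated = $nested) =~ s/<[^>]+>//g;   # stops at the first ">", leaving debris

print "$_\n" for $greedy, $minimal, $negated,
                 $pascal_minimal, $pascal_negated, $nested_negated;
```

Only the greedy form loses "really".  The last line printed is the
two leftover characters `">', which is the miserable failure
mentioned above.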


-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."

