TLUG Mailing List

stripping HTML tags with Perl

To: "Drew C. Poulin" <poulin@example.com>

Subject: stripping HTML tags with Perl

From: "Stephen J. Turnbull" <turnbull@example.com>

Date: Tue, 5 Dec 2000 10:50:45 +0900

Cc: tlug@example.com

>>>>> "Drew" == Drew C Poulin <poulin@example.com> writes:

        s/diff .*?\n//ig;     #delete lines beginning with diff(sp)

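The actual effect of this is to delete from the string "diff " (or
"DIFF " or "dIfF ") to EOL, wherever on the line it appears, so that

Here is the diff I promised you.
A whole nother smoke.

==> Here is the A whole nother smoke.

To get diff at BOL, you want "^diff .*\n", plus the `m' flag if the
string holds more than one line, so that "^" matches at the start of
each line.  The `?' is redundant, since "." never matches a newline
without the `s' flag.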
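
A minimal sketch of the difference, assuming the whole file has been
read into a single string.  The sub names and the sample text are
invented for illustration.

```perl
use strict;
use warnings;

# Unanchored: deletes from the first "diff " on a line through EOL,
# gluing whatever preceded it onto the following line.
sub strip_unanchored {
    my ($text) = @_;
    $text =~ s/diff .*\n//ig;
    return $text;
}

# Anchored: deletes only lines that begin with "diff ".
# /m lets ^ match after every embedded newline, not just at string start.
sub strip_anchored {
    my ($text) = @_;
    $text =~ s/^diff .*\n//mig;
    return $text;
}

my $text = "diff -u old new\nSee the diff below.\nrun diff a b\nlast line\n";
print strip_unanchored($text);   # only "See the run last line" survives
print "--\n";
print strip_anchored($text);     # only the first line is removed
```

If the input is processed one line at a time instead, "^" already
means BOL and the `m' flag makes no difference.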

        s/[0-9].*?\n//ig;
	s/\^M//ig;

"\r" is the Perl idiom for ASCII CR (0x0D).  You can put the literal
control character in the pattern, but it doesn't transport well (the
"^M" in your mail is two printing characters, not a single control
character, and "\^M" matches exactly those two characters).  The `i'
flag is irrelevant in both substitutions, since neither a digit nor a
CR has case.
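
A small sketch of stripping CRs with "\r", assuming an ASCII platform
and DOS-style input.  The sub names are made up for illustration, and
the tr/// variant is just an alternative, not something from your
script.

```perl
use strict;
use warnings;

# Remove every ASCII CR (0x0D) from the string.
sub strip_cr {
    my ($text) = @_;
    $text =~ s/\r//g;      # \x0D or \cM name the same character
    return $text;
}

# Alternative for single characters: tr/// with the d (delete) flag.
sub strip_cr_tr {
    my ($text) = @_;
    $text =~ tr/\r//d;
    return $text;
}

my $dos = "line one\r\nline two\r\n";
print strip_cr($dos);
print strip_cr_tr($dos);
```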

	s/<.*?>//ig;

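This would be an oops without the `?'.  Perl quantifiers are _greedy_
by default, matching the longest possible string, so the plain "<.*>"
gives

What I <em>really</em> want to say.

==> What I  want to say.

losing not only the HTML emphasis but the verbal emphasis as well.  The
`?' makes ".*" non-greedy, so "<.*?>" stops at the first ">" and
"really" survives.  But "." does not match a newline unless you add
the `s' flag, so a tag broken across two lines is missed.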
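
A quick sketch for checking how the greedy and non-greedy forms behave
on that input.  The sub names are invented for illustration.

```perl
use strict;
use warnings;

# Greedy: .* runs to the LAST ">" on the line.
sub strip_greedy {
    my ($text) = @_;
    $text =~ s/<.*>//g;
    return $text;
}

# Non-greedy: .*? stops at the FIRST ">" after each "<".
sub strip_nongreedy {
    my ($text) = @_;
    $text =~ s/<.*?>//g;
    return $text;
}

my $html = "What I <em>really</em> want to say.";
print strip_greedy($html), "\n";      # What I  want to say.
print strip_nongreedy($html), "\n";   # What I really want to say.
```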

You really want "<[^>]+>" (delete anything bracketed by "<>"
containing some text which doesn't contain ">").  This avoids trashing
the Pascal inequality test "<>" which is not a legal tag, but will
fail miserably on stuff like

<address default="<phb@example.com>">

which may or may not be legal HTML.
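
A minimal sketch of the "<[^>]+>" version, showing both the "<>" case
it spares and the quoted-attribute case it mangles.  The sub name and
the sample strings (other than the address tag above) are made up for
illustration.

```perl
use strict;
use warnings;

# Delete "<", one or more characters that are not ">", then ">".
# [^>] also matches a newline, so tags split across lines are caught.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;
    return $text;
}

# "<>" has nothing between the brackets, so it is left alone.
print strip_tags("if a <> b then <b>bold</b>"), "\n";   # if a <> b then bold

# A tag broken across two lines is still removed.
print strip_tags("a <em\nclass=\"x\">b</em>"), "\n";    # a b

# The match ends at the first ">", inside the attribute value,
# so the tail of the tag is left behind.
print strip_tags('<address default="<phb@example.com>">'), "\n";   # ">
```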


-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."