stripping HTML tags with Perl
- To: "Drew C. Poulin" <poulin@example.com>
- Subject: stripping HTML tags with Perl
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Tue, 5 Dec 2000 10:50:45 +0900
- Cc: tlug@example.com
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <20001204133053G.poulin@example.com>
- References: <20001204133053G.poulin@example.com>
- Reply-To: tlug@example.com
- Resent-From: tlug@example.com
- Resent-Message-ID: <F-ptUB.A.bXC.UrEL6@example.com>
- Resent-Sender: tlug-request@example.com
>>>>> "Drew" == Drew C Poulin <poulin@example.com> writes:

    s/diff .*?\n//ig;    #delete lines beginning with diff(sp)

The actual effect of this is to delete from the string "diff" (or
"DIFF" or "dIfF") to EOL, wherever in the line it occurs, so that

    And now for something really different!
    A whole nother smoke.
==>
    And now for something really A whole nother smoke.

To get diff at BOL, you want "^diff .*\n" (plus the `m' flag if the
string holds more than one line, so that `^' matches after every
newline).  The `?' is redundant: `.' never matches a newline unless
you give the `s' flag, so the greedy and minimal matches stop at the
same place.

    s/[0-9].*?\n//ig;
    s/\^M//ig;

"\r" is the Perl idiom for ASCII CR (0x0D).  You can use the literal
escape with arbitrary characters, but it doesn't transport well (the
"^M" in your mail is two printing characters, not a single control
character).  The `i' flag is irrelevant since there are no alphabetic
characters here.

    s/<.*?>//ig;

This one works, but only because of the `?'.  Perl quantifiers are
_greedy_ by default, matching the longest possible string; the `?'
makes `.*' minimal.  Drop it, and

    What I <em>really</em> want to say.
==>
    What I  want to say.

not only losing the HTML emphasis but verbal emphasis as well.  I
would still write "<[^>]+>" (delete anything bracketed by "<>"
containing some text which doesn't contain ">").  It says what it
means without relying on the minimal-match modifier, it copes with a
tag broken across lines, and it avoids trashing the Pascal inequality
test "<>", which is not a legal tag.  It will still fail miserably on
stuff like

    <address default="<phb@example.com>">

which may or may not be legal HTML.

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."
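
For anyone who wants to check the patterns discussed in this message, here is a minimal sketch. The helper names (strip_diff_lines, strip_cr, strip_tags) and the sample strings are made up for illustration and are not taken from Drew's program. It assumes the whole input sits in one string, which is why the `m' flag is there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers; only the patterns come from the discussion.

# Delete whole lines beginning with "diff ".  /m lets ^ match after
# every newline, not just at the start of the string.
sub strip_diff_lines {
    my ($text) = @_;
    $text =~ s/^diff .*\n//mg;
    return $text;
}

# Delete ASCII carriage returns (0x0D), written portably as \r.
sub strip_cr {
    my ($text) = @_;
    $text =~ s/\r//g;
    return $text;
}

# Delete anything bracketed by < and > that contains at least one
# character other than ">".  A bare "<>" is left alone.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;
    return $text;
}

my $sample = "diff -u a b\r\n"
           . "What I <em>really</em> want to say.\r\n"
           . "if a <> b then\r\n";

# Strip CRs first so that each line really ends in a bare \n.
print strip_tags(strip_diff_lines(strip_cr($sample)));

# Greedy versus minimal matching on the same line:
my $line = 'What I <em>really</em> want to say.';
(my $greedy  = $line) =~ s/<.*>//g;     # eats from the first < to the last >
(my $minimal = $line) =~ s/<.*?>//g;    # eats one tag at a time
print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```

The nested case still goes wrong: strip_tags('<address default="<phb@example.com>">') leaves a stray `">' behind, so anything beyond simple markup wants a real HTML parser rather than a regexp.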