Re: stripping HTML tags with Perl
- To: tlug@example.com
- Subject: Re: stripping HTML tags with Perl
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Tue, 5 Dec 2000 12:55:19 +0900
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <20001204190339Z.poulin@example.com>
- References: <20001204133053G.poulin@example.com> <14892.18933.990298.358983@example.com> <20001204190339Z.poulin@example.com>
- Reply-To: tlug@example.com
- Resent-From: tlug@example.com
- Resent-Message-ID: <jXn5h.A.ehC.JgGL6@example.com>
- Resent-Sender: tlug-request@example.com
>>>>> "Drew" == Drew C Poulin <poulin@example.com> writes:

    Drew> Honto da.  Gulp.  Since I get paid by the word, that could
    Drew> get expensive.

Yep.

    Drew> As Darren Cook mentions, the ? makes it stingy, so that it
    Drew> matches the next > as it works forward through the string.
    Drew> Without the ?, it jumps to the end of the string, works
    Drew> backward, and matches the first > that it finds as it moves in
    Drew> that direction.  Or so I understand.

Yeah, I think XEmacs understands that syntax now, too.  I just don't
use it because I've been using re's that are greedy for 15 years, so I
automatically use negated character classes.

    Drew> [...] does wipe out everything between and including the two
    Drew> outermost < >s, but that's probably not what you'd consider
    Drew> success.

Nope.  Probably the best bet (for a one-liner) is

    s/<\/?[^@<>]+>//g

This (1) is stingy (because of the `>' in the inverse character
class), (2) insists that end tags be non-empty, too (Viktor's
suggestion), and (3) simply fails to match on nested < ... < ... >>
constructs and <phb@example.com> email addresses.

You're actually probably OK for email addresses and any nested
constructs in valid HTML because what you would actually have to do
for that address is &lt;phb@example.com&gt;.  Silly me.  I'm not sure
that everything in HTML is #PCDATA, though; any place that #CDATA
(unparsed character data) is allowed, you would have to use the usual
ASCII.

Save those words....

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."
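
For anyone who wants to try the patterns discussed in the message above, here is a minimal sketch. It is written in Python rather than Perl, purely as an assumption of convenience; the thread itself only gives the Perl one-liner. The names TAG, GREEDY, STINGY and strip_tags are made up for illustration, and the regex syntax used here should behave the same way in both languages.

```python
import re

# Python rendering of the Perl one-liner  s/<\/?[^@<>]+>//g
# (the name TAG is hypothetical, chosen for this sketch).
TAG = re.compile(r"</?[^@<>]+>")

# The two alternatives mentioned in the quoted text, for comparison only.
GREEDY = re.compile(r"<.*>")    # runs to the last > in the string
STINGY = re.compile(r"<.*?>")   # stops at the next >


def strip_tags(text: str) -> str:
    """Delete every substring TAG matches, like Perl's s///g."""
    return TAG.sub("", text)


if __name__ == "__main__":
    sample = "<p>Hello, <b>world</b>!</p>"

    print(repr(GREEDY.sub("", sample)))   # '' (everything between the outermost < >)
    print(repr(STINGY.sub("", sample)))   # 'Hello, world!'
    print(repr(strip_tags(sample)))       # 'Hello, world!'

    # The @ in the negated class leaves a bare address alone.
    print(repr(strip_tags("mail <phb@example.com> now")))

    # Nested brackets: the outer < ... > never matches, the inner one does.
    print(repr(strip_tags("<a <b>>")))    # '<a >'

    # Empty brackets are left alone, but </> is still removed, because
    # the optional / can match nothing and the class then consumes the /.
    print(repr(strip_tags("a <> b")))     # 'a <> b'
    print(repr(strip_tags("a </> b")))    # 'a  b'
```

Two things seem worth noting from running it. On nested input the pattern does strip the innermost bracket pair even though the outer one is left alone, and `</>` is not protected by the `+`, since `/` is itself a member of the negated class. Excluding `/` from the class would change that, at the cost of also skipping tags such as `<br />`.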
- References:
  - stripping HTML tags with Perl
    - From: "Drew C. Poulin" <poulin@example.com>
  - stripping HTML tags with Perl
    - From: "Stephen J. Turnbull" <turnbull@example.com>
  - Re: stripping HTML tags with Perl
    - From: "Drew C. Poulin" <poulin@example.com>