[TCLUG] Re: [TCLUG:20540] [OT] Regexp and HTML sanitization

Fri Aug 18 17:00:00 CDT 2000

Robert P. Goldman said:
> Kevin, I think there is an *in principle* reason why this should not
> be possible.
> 
> Parsing HTML is a context-free parsing problem (since the tags can
> embed and you have to have a stack to track the things you want to
> match), not a regular expression parsing problem (there's no fixed
> bound of memory you need to do this job).

I disagree.  Unless there's more going on here than the original question
stated, Kevin doesn't sound like he's interested in the structure of the HTML
tags or whether they match up.  He just wants to create a list of 'approved'
tags and make everything else go away.
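
To make that concrete, here is a rough sketch in Python of the kind of
whitelist filter I have in mind. The APPROVED set and the function name are
made up for illustration, since the original question did not list the tags.
It only looks at each tag's name, so it needs no stack; it also leaves
attributes and comments alone, which a real filter would have to deal with.

```python
import re

# Hypothetical whitelist; the original question does not say which tags are approved.
APPROVED = {"b", "i", "p", "br", "a"}

# Matches anything shaped like an opening or closing tag and captures its name.
TAG_RE = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def strip_unapproved(html):
    """Delete every tag whose name is not in APPROVED; text is left untouched."""
    def keep_or_drop(match):
        # Keep the tag verbatim if its name is approved, otherwise remove it.
        return match.group(0) if match.group(1).lower() in APPROVED else ""
    return TAG_RE.sub(keep_or_drop, html)
```

Note that only the tags go away; whatever text sat between a dropped pair of
tags stays in the document.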

At worst, he might need to walk through the (surviving) tags with a set of
flags for whether, e.g., <I> is turned on and append a </I> to the document
if the submitter forgot to close it.
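
Something like the following Python sketch would do for the flag pass. The
PAIRED tuple and the function name are assumptions of mine, not anything from
the original question. It keeps one on/off flag per tag, so memory stays fixed
no matter how long the document is; the price is that it does not track
nesting order, so the closers it appends may not nest properly.

```python
import re

# Hypothetical list of tags that should always be closed.
PAIRED = ("b", "i")

# Captures an optional leading slash (group 1) and the tag name (group 2).
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def close_open_tags(html):
    """Append a closing tag for each PAIRED tag that is still 'on' at the end."""
    is_open = {name: False for name in PAIRED}
    for match in TAG_RE.finditer(html):
        name = match.group(2).lower()
        if name in is_open:
            # An opening tag turns the flag on, a closing tag turns it off.
            is_open[name] = (match.group(1) != "/")
    return html + "".join("</%s>" % name for name in PAIRED if is_open[name])
```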

-- 
"Two words: Windows survives." - Craig Mundie, Microsoft senior strategist
"So does syphillis. Good thing we have penicillin." - Matthew Alton
Geek Code 3.1:  GCS d- s+: a- C++ UL++$ P+>+++ L+++>++++ E- W--(++) N+ o+
!K w---$ O M- V? PS+ PE Y+ PGP t 5++ X+ R++ tv b+ DI++++ D G e* h+ r++ y+
