
Mailing List Archive
Re: [tlug] Do you whitelist or blacklist utf-8?
Shmuel, Josh, Peter,
Thank you for responding.
> I think that every character that is above the ascii range can be safely
> passed. So you don't need a huge array. just small one.
This sounds promising.
> but first you need to tell us something about your data. is the user
> allowed to enter HTML tags?
Nope. I want to be real strict. They get:
No punctuation at all.
Only spaces, no other white space (tabs, line feed characters, or
anything else).
They can have 0-9a-zA-Z, and anything above the ASCII range (taking into
account what you wrote above).
> or are you using different mark-down scheme?
I don't know what "mark-down scheme" means... so, uh... no? Maybe?
I looked at the pages Peter suggested (I had seen some of them before),
and according to them, these might be the character classes I'm
looking for:
\p{L} (any kind of letter from any language)
\p{N} (any number from any language)
There is also \p{Z} for "any kind of white space", but I'm not sure how
to handle this. I don't want line feeds or tabs or anything like that,
but since Japanese, as one example, has its own space character, I
should allow that kind of space character from different languages.
So, I suck at regex, but maybe I want to do something like this:
^\p{L}\p{N}\p{Z}$
... and then blacklist the space characters I don't like:
^\n\r\t$
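For reference, a minimal sketch of how those classes might be combined into a single whitelist in JavaScript. It assumes an engine that supports Unicode property escapes (the u flag, ES2018 or later), and the names ALLOWED and isAllowed are made up for illustration. As written above, ^\p{L}\p{N}\p{Z}$ would match exactly three characters in sequence; putting the classes inside [...] and adding * matches any run of them. Using \p{Zs} (space separators) instead of \p{Z} may make the separate blacklist unnecessary, because tab, line feed, and carriage return are control characters rather than separators.

```javascript
// Hypothetical whitelist: letters, numbers, and space separators only.
// \p{Zs} includes U+0020 and U+3000 (the ideographic space used in Japanese).
// It does not include tab, LF, or CR (control characters, \p{Cc}),
// nor U+2028 / U+2029 (line and paragraph separators, \p{Zl} / \p{Zp}).
const ALLOWED = /^[\p{L}\p{N}\p{Zs}]*$/u;

// Returns true when every character of the input is in the whitelist.
// An empty string passes; change * to + if empty input should be rejected.
function isAllowed(input) {
  return ALLOWED.test(input);
}
```

One caveat with this sketch: \p{L} does not cover combining marks (\p{M}), so scripts that rely on them may need that class added to the brackets.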
The only other thing that I'm not confident about is if this regular
expression notation is compatible in PHP and Javascript. On the page
Peter linked to, it mentions a ton of different languages, like Perl,
Java, and PCRE and gives different notes on all of them, which gives me
the impression that different languages have different particulars.
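One concrete difference worth checking is whether Unicode mode has to be switched on. The sketch below shows the JavaScript side only, assuming an ES2018 or later engine; the variable names are illustrative. PHP's preg functions have an analogous u modifier, so the same pattern would need it there as well, though that side is not shown here.

```javascript
// Without the u flag, \p{L} is not a property escape in JavaScript;
// it only matches the literal text "p{L}".
const legacy = /\p{L}/;

// With the u flag, \p{L} matches any Unicode letter.
const unicode = /\p{L}/u;

console.log(legacy.test("é"));   // false
console.log(unicode.test("é"));  // true
console.log(unicode.test("日")); // true
```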
Am I on the right track here?
--
Dave M G