Mailing List Archive



Re: [tlug] Do you whitelist or blacklist utf-8?



Shmuel, Josh, Peter,

Thank you for responding.


> I think that every character that is above the ascii range can be safely 
> passed. So you don't need a huge array. just small one.

This sounds promising.

> but first you need to tell us something about your data. is the user 
> allowed to enter HTML tags?

Nope. I want to be real strict. They get:
No punctuation at all.
Only spaces, no other white space (tabs, line feed characters, or
anything else).
They can have 0-9a-zA-Z, and anything above the ASCII range (taking into
account what you wrote above).
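
As a rough sketch of the rules just listed, the check could be a single whitelist character class. This assumes JavaScript with the `u` flag (ES2015 or later). The names `LOOSE_WHITELIST` and `isAllowedLoose` are made up for illustration, and rejecting the empty string is also an assumption.

```javascript
// Hypothetical helper: allow ASCII letters and digits, the plain space
// (U+0020), and every code point above the ASCII range (U+0080 and up).
// Tabs, line feeds, and ASCII punctuation are simply not in the class.
const LOOSE_WHITELIST = /^[0-9A-Za-z \u{80}-\u{10FFFF}]+$/u;

function isAllowedLoose(input) {
  // "+" means at least one character; an empty string is rejected (assumption).
  return LOOSE_WHITELIST.test(input);
}

console.log(isAllowedLoose("Dave 123"));    // true
console.log(isAllowedLoose("東京 タワー")); // true
console.log(isAllowedLoose("a\tb"));        // false (tab)
console.log(isAllowedLoose("a,b"));         // false (ASCII punctuation)
```

One caveat: "anything above the ASCII range" also lets through non-ASCII punctuation such as 。 (U+3002), so this may be looser than "no punctuation at all" suggests.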

> or are you using different mark-down scheme?

I don't know what "mark-down scheme" means... so, uh... no? Maybe?

I looked at the pages Peter suggested (I had seen some of them before),
and according to one of them, these might be the regular expressions I'm
looking for:

\p{L} (any kind of letter from any language)
\p{N} (any number from any language)

There is also \p{Z} for "any kind of white space", but I'm not sure how
to handle this. I don't want line feeds or tabs or anything like that,
but since Japanese, as one example, has its own space character, I
should allow that kind of space character from different languages.

So, I suck at regex, but maybe I want to do something like this:

^\p{L}\p{N}\p{Z}$

... and then black list the space characters I don't like:

^\n\r\t$
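
The idea above can be sketched as one character class instead of a whitelist plus a blacklist. Tabs, line feeds, and carriage returns are control characters, so they are not part of \p{Z} in the first place. \p{Z} does include the line and paragraph separators (U+2028, U+2029), so the sketch uses the narrower \p{Zs} ("space separator"), which covers the ASCII space and the Japanese ideographic space (U+3000). It assumes a JavaScript engine with Unicode property escapes (ES2018 or later). The names `STRICT_WHITELIST` and `isAllowedStrict` are hypothetical.

```javascript
// Hypothetical helper: any letter, any number, or any space separator,
// one or more times, and nothing else between start and end.
const STRICT_WHITELIST = /^[\p{L}\p{N}\p{Zs}]+$/u;

function isAllowedStrict(input) {
  return STRICT_WHITELIST.test(input);
}

console.log(isAllowedStrict("Tokyo 東京\u3000123")); // true (ideographic space allowed)
console.log(isAllowedStrict("tab\there"));           // false
console.log(isAllowedStrict("line\nfeed"));          // false
console.log(isAllowedStrict("hello!"));              // false (punctuation)
```

Two hedges: \p{L} does not match combining marks, so scripts that rely on them may also need \p{M} in the class. On the PHP side, PCRE should accept the same three escapes when the pattern carries the /u modifier, but that is worth testing rather than assuming.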

The only other thing that I'm not confident about is whether this regular
expression notation is compatible with both PHP and JavaScript. The page
Peter linked to mentions a ton of different languages, like Perl,
Java, and PCRE, and gives different notes on all of them, which gives me
the impression that different languages have different particulars.

Am I on the right track here?

-- 
Dave M G


