Mailing List Archive



Re: [tlug] Do you whitelist or blacklist utf-8?



Hello

Shmuel Fomberg wrote:
> After the text is in the target encoding, you probably want to examine only
> the characters that are in the ASCII range.
> If your encoding is UTF-8, you can write a tight loop that examines the MSB
> of each byte and passes the byte through if it is set; otherwise, whitelist or blacklist the byte.
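
The byte loop described above might look roughly like the following in Python. This is only a sketch: the names ALLOWED_ASCII and filter_utf8 and the choice of allowed characters (letters, digits, space) are assumptions for illustration, and it presumes the input has already been validated as UTF-8.

```python
import string

# Hypothetical whitelist: ASCII letters, digits and the space character.
ALLOWED_ASCII = frozenset((string.ascii_letters + string.digits + " ").encode("ascii"))

def filter_utf8(data: bytes) -> bytes:
    """Pass non-ASCII bytes through; keep ASCII bytes only if whitelisted."""
    out = bytearray()
    for b in data:
        if b & 0x80:
            # MSB set: this byte belongs to a multi-byte UTF-8 sequence.
            out.append(b)
        elif b in ALLOWED_ASCII:
            # ASCII range: keep only whitelisted bytes.
            out.append(b)
    return bytes(out)

# Validate first: decoding raises UnicodeDecodeError on malformed input.
raw = "日本語 <b>".encode("utf-8")
raw.decode("utf-8")
cleaned = filter_utf8(raw).decode("utf-8")
```

Note that this lets every non-ASCII character through unchecked, so it only restricts the ASCII part of the input.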

> On 2011/02/22 12:57, Dave M G wrote:
>> The thing is that I also want to be able to allow CJK characters, and
>> any other language with non-Latin characters. This is a snap to do if
>> you just want to allow 0-9a-zA-Z. But once you get into Unicode land, it
>> seems to be a whole other ballgame.

To allow any letter-like character in any language (or only in certain languages), you could use Unicode regular expressions together with suitable character properties such as "Letter" or script names such as "Hiragana" or "Han".

See here for more information:
http://www.regular-expressions.info/unicode.html
http://unicode.org/reports/tr18/
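
As a rough illustration of whitelisting by character property, here is a Python sketch. Python's built-in re module does not support \p{...} properties, so this uses the standard unicodedata module instead of a regular expression. The function names and the decision to allow letters, combining marks and decimal digits are assumptions, not a recommendation for every application.

```python
import unicodedata

# Assumption: "letter-like" means general categories L* (letters),
# M* (combining marks) and Nd (decimal digits).
ALLOWED_PREFIXES = ("L", "M")
ALLOWED_EXACT = {"Nd"}

def is_allowed(ch: str) -> bool:
    """Return True if the character's general category is whitelisted."""
    cat = unicodedata.category(ch)
    return cat.startswith(ALLOWED_PREFIXES) or cat in ALLOWED_EXACT

def is_whitelisted(text: str) -> bool:
    """Return True only if every character in text is whitelisted."""
    return all(is_allowed(ch) for ch in text)
```

With these categories, strings such as "日本語" or "ひらがな" pass, while anything containing spaces, quotes or semicolons is rejected. Restricting to specific scripts such as Hiragana or Han needs script data, which unicodedata does not expose directly.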

> IMHO, only whitelist.

Only whitelist, from my point of view too. When blacklisting, you will always miss something, and it will backfire at some point.

> Of course, none of this is an excuse for not using pre-compiled SQL queries
> with placeholders, or whatever they are called in PHP.

Fully agreed. Never build your SQL queries by string concatenation. Instead, use the mechanisms that your programming environment provides.
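
To show the placeholder idea in one concrete environment, here is a small sketch using Python's standard sqlite3 module (PHP has comparable prepared-statement facilities). The users table and the find_user helper are made up for this example.

```python
import sqlite3

def find_user(conn: sqlite3.Connection, name: str) -> list:
    """Look up users by name using a placeholder, never string concatenation."""
    # The "?" is filled in by the driver, so the value is never parsed as SQL.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()

# Hypothetical in-memory database for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Dave",), ("太郎",)])

rows = find_user(conn, "太郎")
injected = find_user(conn, "x' OR '1'='1")
```

The injection attempt in the last line is treated as an ordinary (non-matching) name rather than as SQL, regardless of which characters the whitelist let through.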

Cheers,
Peter



