Mailing List Archive



RE: [tlug] Why is Shift_JIS bad?



Two other (in)famous Shift_JIS problems are:

1. C compiler: char as signed byte.

The char type is often a single-byte signed integer.  The sign often gets
extended into the most significant byte when converting to a word integer on
Intel machines.  That is, byte 0x82 becomes word 0xFF82.  When doing
bitwise logical operations (and, or), you must be careful to type cast or
mask the char to get rid of the sign extension.

<untested code fragment>
const char *string = "\x81\xa6";     /* "※" in Shift_JIS: 0x81 0xa6 */
unsigned short int	sjis_char;
sjis_char = (string[0] << 8) | (string[1] & 0xff);
printf( "%02x, %02x, %04x, %04x, %04x\n",
	string[0], string[1],
	((string[0] << 8) | string[1]),
	(string[0] << 8) | (string[1] & 0xff),
	sjis_char
	);
/* Assuming plain char is signed: */
/* 2-byte int outputs: ff81, ffa6, ffa6, 81a6, 81a6 */
/* 4-byte int outputs: ffffff81, ffffffa6, ffffffa6, ffff81a6, 81a6 */
</untested code fragment>

Programs that have dealt only with 7-bit ASCII sometimes get caught by this
sign extension; I have seen this in DOS programs in the past.
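
For what it's worth, the type-cast fix mentioned above could look roughly
like the sketch below.  The helper name sjis_pair is made up for
illustration (it is not from any library), and it assumes the caller passes
a pointer to at least two bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: combine a Shift_JIS lead byte and trail byte into
   one 16-bit value without picking up sign extension. */
unsigned int sjis_pair(const char *s)
{
    assert(s != NULL);

    /* Convert each char to unsigned char BEFORE widening, so that 0x81
       stays 0x81 whether or not plain char is signed on this compiler. */
    unsigned int lead  = (unsigned char)s[0];
    unsigned int trail = (unsigned char)s[1];

    return (lead << 8) | trail;
}
```

Because both bytes are already unsigned when they are shifted and or'ed, no
masking is needed afterwards, and the shift never operates on a negative
value.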


2. 0x5C problem in file names

Some operating systems use the backslash as a delimiter in path names.  The
backslash is encoded as 0x5C.  But 0x5C can also appear as the second byte
of a two-byte Shift_JIS character.  Software that does a simple strtok
looking for 0x5C characters when parsing file names will incorrectly hit the
0x5C second byte in, for example, zenkaku katakana "so" (0x83 0x5C) or the
kanji "hyou" (table) (0x95 0x5C).

This happens to me when I try to use English-language software to process
Japanese filenames in FAT file systems.
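
A rough sketch of a scanner that avoids this, by skipping the second byte of
every two-byte character instead of testing it.  The function names are
hypothetical, and two things here are my assumptions rather than anything
stated above: the lead-byte ranges 0x81-0x9F and 0xE0-0xFC, and that the
input is well-formed Shift_JIS.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if c is the first byte of a two-byte Shift_JIS character.
   Assumed lead-byte ranges: 0x81-0x9F and 0xE0-0xFC. */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: return a pointer to the last backslash in path that
   is a real delimiter (not a trail byte), or NULL if there is none. */
const char *sjis_last_backslash(const char *path)
{
    const char *last = NULL;
    const unsigned char *p = (const unsigned char *)path;

    assert(path != NULL);

    while (*p != '\0') {
        if (sjis_is_lead(*p) && p[1] != '\0') {
            p += 2;                     /* skip lead and trail byte together */
        } else {
            if (*p == 0x5C)
                last = (const char *)p; /* a genuine single-byte backslash */
            p++;
        }
    }
    return last;
}
```

For a path whose bytes are C : 0x5C 0x83 0x5C 0x5C a . t x t, this returns
the backslash just before "a.txt", and for the lone character 0x95 0x5C it
returns NULL, where a plain byte search for 0x5C would stop inside the
character.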

Best regards,
jimb.
