tlug: Character Encodings Again
- To: tlug@example.com
- Subject: tlug: Character Encodings Again
- From: Matt Gushee <matt@example.com>
- Date: Sun, 1 Nov 1998 16:47:16 +0900
- Content-Type: text/plain; charset=US-ASCII
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
Hello--

Speaking with some trepidation since this may be considered an inappropriate question under the new regime (if so, feel free to respond off-list):

I have a character mapping table which is supposed to represent the JIS-X0208 character set (for more explanation, see below). It looks like this:

    41344  33  UNUSED
    41377  94  8481
    41471   1  UNUSED
    41600  33  UNUSED
    41633  94  8737
    41727   1  UNUSED
    41856  33  UNUSED
    41889  94  8993
    41983   1  UNUSED
    ......  .  ......
    65185  94  32289
    65279   1  UNUSED

The center number indicates a range of characters, so that '41377 94 8481' means that the 94 characters beginning with 41377 correspond to 94 characters beginning with 8481 ... or something like that. There are, in fact, exactly 94 such ranges of 94 characters each, which (as you may have noticed) sounds an awful lot like the kuten tables for the JIS character set.

So, while the pattern of numbers makes sense and is more or less what I would expect, the specific numbers look completely different from anything else I've seen. Do these values ring a bell with anyone? (I've been told that one side or the other is Unicode numbers, but that doesn't jibe with the neat 94x94 grouping, nor do they have any apparent relation to the 4e00 -> index numbers.)

In case you're wondering what all this is about: I'm trying to write an SGML declaration that will allow the use of kanji in markup (e.g., instead of <par></par>, you could have <段落></段落> ... and so on). The (amended) SGML standard definitely allows this, and according to my limited understanding of the docs, nsgmls should support such documents, but I haven't been able to make it work.

I *have* gotten nsgmls to parse documents with Japanese content, as long as only Roman characters are used in the markup. The above table, in fact, is what makes that work. It comes from the file 'japan.sgmldecl', which is provided with the SP package.
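
For anyone who wants to poke at the numbers, here is a minimal Python sketch of the range notation as I described it above. The reading of each row as (start, count, target), and the names `SAMPLE_ROWS`, `map_code` and `split_bytes`, are assumptions made for illustration only; none of this comes from the SP package. It also prints each mapped row in hexadecimal and splits the values into bytes, which may make them easier to compare against published code tables: 41377 comes out as 0xA1A1 and 8481 as 0x2121. Those happen to look like the first EUC-JP and 7-bit JIS byte pairs for JIS X 0208, but whether that is what 'japan.sgmldecl' intends is only a guess.

```python
# Sample rows copied from the table above: (start, count, target).
# A target of None stands for UNUSED. The rows elided with "......"
# in the table are not filled in here.
SAMPLE_ROWS = [
    (41344, 33, None),
    (41377, 94, 8481),
    (41471, 1, None),
    (41600, 33, None),
    (41633, 94, 8737),
    (41727, 1, None),
    (41856, 33, None),
    (41889, 94, 8993),
    (41983, 1, None),
    (65185, 94, 32289),
    (65279, 1, None),
]


def map_code(code, rows=SAMPLE_ROWS):
    """Return the number `code` maps to, or None if unused or not covered."""
    for start, count, target in rows:
        if start <= code < start + count:
            if target is None:
                return None
            # Characters in a range map one-to-one, offset from the start.
            return target + (code - start)
    return None


def split_bytes(value):
    """Split a 16-bit number into (high byte, low byte)."""
    return value >> 8, value & 0xFF


if __name__ == "__main__":
    for start, count, target in SAMPLE_ROWS:
        if target is not None:
            print(f"{start:5d} -> {target:5d}   "
                  f"0x{start:04X} -> 0x{target:04X}   "
                  f"bytes {split_bytes(start)} -> {split_bytes(target)}")
```

Under this reading, `map_code(41378)` gives 8482, and any number falling in an UNUSED range (or outside the sample rows) gives `None`.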

An SGML declaration, for those who don't know, has one mapping table (actually a set of tables) specifying the characters that are allowed in content, and another table for the characters that are allowed in markup. 'japan.sgmldecl', in its original form, provides for Japanese characters only in content. So I'm trying to modify it, proceeding on the obvious assumption that I should just use the same numbers for the markup characters as for the content characters ... but it's not working.

Actually, I suspect that something other than character indexes is causing my troubles, but if I could understand these numbers, then I'd at least have something solid to go on.

By the way, I've asked about this on comp.text.sgml and have gotten some help, but nobody can give me a good explanation of the character indexes. In theory, fj.comp.sgml would be ideal, but it seems to be completely dead.

So if anybody knows anything ... pleasepleaseplease ...

Matt Gushee
Oshamanbe, Hokkaido

---------------------------------------------------------------
Next Nomikai: 20 November, 19:30 Tengu TokyoEkiMae 03-3275-3691
Next Technical Meeting: January, 1999 (details TBA)
---------------------------------------------------------------
Sponsor: PHT, makers of TurboLinux http://www.pht.co.jp

- Follow-Ups:
- Re: tlug: Character Encodings Again
- From: "J. David Beutel" <jdb@example.com>
- tlug: Character Encodings Again
- From: "Stephen J. Turnbull" <turnbull@example.com>