tlug: Character Encodings Again

To: tlug@example.com
Subject: tlug: Character Encodings Again
From: Matt Gushee <matt@example.com>
Date: Sun, 1 Nov 1998 16:47:16 +0900
Content-Type: text/plain; charset=US-ASCII
Reply-To: tlug@example.com
Sender: owner-tlug@example.com


Hello--

Speaking with some trepidation since this may be considered an
inappropriate question under the new regime (if so, feel free to
respond off-list):

I have a character mapping table which is supposed to represent the
JIS-X0208 character set (for more explanation, see below). It looks
like this:

41344 33 UNUSED
41377 94  8481
41471  1 UNUSED
41600 33 UNUSED
41633 94  8737
41727  1 UNUSED
41856 33 UNUSED
41889 94  8993
41983  1 UNUSED
...... . ......

65185 94 32289
65279  1 UNUSED

The center number indicates the size of a range of characters, so that
'41377 94 8481' means that the 94 characters beginning with 41377
correspond to 94 characters beginning with 8481 ... or something like
that. There
are, in fact, exactly 94 such ranges of 94 characters each, which (as
you may have noticed) sounds an awful lot like the kuten tables for
the JIS character set. So, while the pattern of numbers makes sense
and is more or less what I would expect, the specific numbers look
completely different from anything else I've seen.
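
To make that reading of the table concrete, here is a minimal Python
sketch. It assumes the three columns really are (first source number,
count, first target number or UNUSED); the function names are made up
for illustration and are not part of SP or nsgmls. It expands one range
and also shows each number in hex and as a pair of bytes, which may be
easier to compare against published code tables.

```python
def parse_line(line):
    """Split 'start count target' into ints; target is None for UNUSED."""
    start, count, target = line.split()
    return int(start), int(count), (None if target == "UNUSED" else int(target))


def expand(start, count, target):
    """Yield (source, target) pairs for one range under the assumed reading."""
    for i in range(count):
        yield start + i, (None if target is None else target + i)


def split_bytes(n):
    """Return a 16-bit number as (high byte, low byte)."""
    return divmod(n, 256)


# The first three rows of the table quoted above.
rows = ["41344 33 UNUSED", "41377 94  8481", "41471  1 UNUSED"]
parsed = [parse_line(r) for r in rows]

# Expand the one row that maps to real characters.
pairs = list(expand(*parsed[1]))
first_src, first_dst = pairs[0]
last_src, last_dst = pairs[-1]

print(len(pairs))
print(hex(first_src), hex(first_dst), split_bytes(first_src), split_bytes(first_dst))
print(hex(last_src), hex(last_dst), split_bytes(last_src), split_bytes(last_dst))
```

Looking at the high and low bytes separately is only a suggestion for
spotting a pattern; it does not by itself say which encoding either
column uses.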

Do these values ring a bell with anyone? (I've been told that one side
or the other is Unicode numbers, but that doesn't jibe with the neat
94x94 grouping, nor do they have any apparent relation to the Unicode
kanji code points, which start at 4e00.)


In case you're wondering what all this is about: I'm trying to write
an SGML declaration that will allow the use of kanji in markup (e.g.,
instead of <par></par>, you could have <段落></段落>, and so
on). The (amended) SGML standard definitely allows this, and according
to my limited understanding of the docs, nsgmls should support such
documents, but I haven't been able to make it work.

I *have* gotten nsgmls to parse documents with Japanese content, as
long as only Roman characters are used in the markup. The above table, 
in fact, is what makes that work. It comes from the file
'japan.sgmldecl', which is provided with the SP package. An SGML
declaration, for those who don't know, has one mapping table (actually 
a set of tables) specifying the characters that are allowed in
content, and another table for the characters that are allowed in
markup.

'japan.sgmldecl', in its original form, provides for Japanese
characters only in content. So I'm trying to modify it, proceeding on
the obvious assumption that I should just use the same numbers for
the markup characters as for the content characters ... but it's not
working. Actually, I suspect that something other than character
indexes is causing my troubles, but if I could understand these
numbers, then I'd at least have something solid to go on.

By the way, I've asked about this on comp.text.sgml and have gotten
some help, but nobody can give me a good explanation of the character
indexes. In theory, fj.comp.sgml would be ideal, but it seems to be
completely dead.

So if anybody knows anything ... pleasepleaseplease ...

Matt Gushee
Oshamanbe, Hokkaido

Follow-Ups:
- Re: tlug: Character Encodings Again
  - From: "J. David Beutel" <jdb@example.com>
- tlug: Character Encodings Again
  - From: "Stephen J. Turnbull" <turnbull@example.com>
