Mailing List Archive



Re: [tlug] Free program translates Euro languages to/from English



Josh Glover <jmglov@example.com> writes:

> Machine translation of Japanese (and, I would assume, Korean) is
> considered by some linguists to be an impossible problem. I
> disagree, but at the same time admit that the problem is
> decidedly non-trivial.  It will take some serious AI-style shit
> to solve.
> 
> I am of the opinion that anything can be modelled, provided you
> have a dense enough set of rules. These rules might be
> grammar-based, or they might be heuristics. In the case of
> Japanese (and probably Korean as well, though maybe to a
> slightly lesser degree), you also have to model a shit-tonne of
> context. And this is what current machine translation programs
> *do not* do well.

Actually, I think one problem with most machine translation tools
is that they try to be _too_ smart. Sometimes (maybe usually) I
don't want or need grammar translation. I just want the source
translated word-by-word. Or even morpheme-by-morpheme. For
example, I might like to see 会いたかったの? (aitakatta no?)
translated like this:

  "aitakatta no" =  "meet/see [want] [past tense] [question]"

I think that even someone not familiar with Japanese at all would
say, OK, looks as if that means something like "Wanted to meet?"
or "Wanted to see?" in English. (Yeah, depending on the context,
it might really mean something more like "Did you miss me?")

But if I put 会いたかったの? into Babelfish or Google
translation, I get:

  When you want to meet?

Which is just plain wrong. Where the hell does it get "when" from?
So try 会いたかったんですか? (aitakattan desu ka?), and get:

  When we would like to meet, it is?

Huh?

So, try to keep it as simple as possible. Type in 会った。(atta)
and 会いました。 (aimashita).

  atta      = It met.
  aimashita = It met.

Now 友達と会った。(tomodachi to atta) goes in. And out comes:

  tomodachi to atta = "It met with the friend."

So now it's doing the "No idea what the subject should be so I'll
just use 'It'" and the "OK, we need an indefinite or definite
article here, so I'll just choose 'the'" things. At the very
least, for these cases, no tool should be inventing an arbitrary
subject or arbitrarily choosing an article. Better:

  tomodachi to atta = {subject elided} met with a/the friend(s).

But what would be much more helpful instead is:

  atta      = meet/see [plain/informal past tense]
  aimashita = meet/see [polite past tense]

  tomodachi to atta = friend(s) with meet/see [plain past tense]

Of course the person reading that would need to understand that
Japanese uses "subject object verb" order. But if I understand the
word order, having it translated as "friend(s) with meet/see
[plain past tense]" is much more clear to me than "It met with the
friend."
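
The word-by-word glosses above can be sketched in a few lines of
Python. This is only a toy: the LEXICON table, the segment() and
gloss() names, and the greedy longest-match splitting are all
assumptions made up for illustration, and it works on romaji rather
than real Japanese text. A real tool would need a proper
morphological analyzer and dictionary.

```python
# Toy lexicon: romaji morpheme -> gloss. Entries are illustrative only;
# e.g. "no" is glossed as [question] here, though it has other uses.
LEXICON = {
    "ai": "meet/see",
    "ta": "[want]",
    "katta": "[past tense]",
    "no": "[question]",
    "atta": "meet/see [plain past tense]",
    "aimashita": "meet/see [polite past tense]",
    "tomodachi": "friend(s)",
    "to": "with",
}


def segment(word):
    """Greedy longest-match split of one romaji word into known morphemes."""
    pieces = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate first, then shorter ones.
        for end in range(len(word), pos, -1):
            if word[pos:end] in LEXICON:
                pieces.append(word[pos:end])
                pos = end
                break
        else:
            # Nothing matched: keep the remainder verbatim rather than guess.
            pieces.append(word[pos:])
            break
    return pieces


def gloss(text):
    """Gloss each morpheme in source order; no reordering, no invented subject."""
    out = []
    for word in text.split():
        for piece in segment(word):
            out.append(LEXICON.get(piece, "{" + piece + "?}"))
    return " ".join(out)


print(gloss("aitakatta no"))       # meet/see [want] [past tense] [question]
print(gloss("tomodachi to atta"))  # friend(s) with meet/see [plain past tense]
```

Note that the output keeps the source word order and marks anything
it does not know (as {word?}) instead of filling in a guess.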

I seem to remember once seeing a tool that did Japanese to
English translation in a word-by-word sort of "aitakatta no" =
"meet/see [want] [past tense] [question]" way.  (Or maybe it only
did English to Japanese.)  What it actually did was: Given some
source text (a web page, maybe?), it would re-render the entire
text, but with word-by-word rubi translations added above each
line. I think it also created hyperlinks for each word --
dictionary links. So if a word had multiple meanings, you could
see what those multiple meanings were, and figure out from the
context it was in which meaning was the intended one.
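
I don't know how that tool was actually implemented, but the general
idea (a gloss above each word, plus a dictionary link) could be
sketched with HTML ruby markup generated from Python. Everything
here is an assumption for illustration: the ruby_word() and
annotate() names, the pre-segmented input pairs, and the
dictionary.example URL, which is a placeholder and not a real
service.

```python
from html import escape
from urllib.parse import quote

# Hypothetical dictionary lookup URL; substitute a real one.
DICT_URL = "https://dictionary.example/lookup?q="


def ruby_word(surface, gloss_text):
    """Wrap one word in <ruby>, with its gloss as <rt> and a dictionary link."""
    link = DICT_URL + quote(surface)
    return '<ruby><a href="{}">{}</a><rt>{}</rt></ruby>'.format(
        escape(link), escape(surface), escape(gloss_text)
    )


def annotate(pairs):
    """Render a list of (surface, gloss) pairs as one annotated line."""
    return "".join(ruby_word(surface, g) for surface, g in pairs)


line = annotate([
    ("友達", "friend(s)"),
    ("と", "with"),
    ("会った", "meet/see [plain past tense]"),
])
print(line)
```

A browser renders each <rt> gloss above its word, so the original
text stays readable and each word stays clickable.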

That's another problem with most other machine translation tools:
They don't preserve any of the ambiguity of the original text. For
example, 会う (au) could be translated as both "meet" and "see".
If most tools find a word with multiple possible translations,
they just choose one and put that into the translated output. I
would guess that in most cases, they are just choosing the most
common translation of the word. I would much rather they just
showed me all the possible translations.
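
The difference between the two behaviours is small in code. Here is
a minimal Python sketch, assuming a made-up SENSES table in which
the first sense listed is taken to be the most common one; the
function names are hypothetical too, and this is a guess at what
such tools do, not a description of any particular one.

```python
# Toy sense table; the ordering (most common first) is an assumption.
SENSES = {
    "au": ["meet", "see"],
}


def most_common_only(word):
    """Pick a single translation and discard the rest."""
    senses = SENSES.get(word)
    return senses[0] if senses else word


def all_senses(word):
    """Keep every candidate translation so the reader can decide."""
    senses = SENSES.get(word)
    return "/".join(senses) if senses else word


print(most_common_only("au"))  # meet
print(all_senses("au"))        # meet/see
```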

That said, I guess there is not nearly as much of an issue with
ambiguity in translating most Japanese and Chinese text -- where
most of the text is ideograms -- as there is in translating text
that is in a language written in a phonetic alphabet.

  --Mike

-- 
Michael Smith
http://sideshowbarker.net/


