
Re: tlug: msword files



>>>>> "Hirotaka" == Hirotaka Yoshioka <hyoshiok@example.com> writes:

    Hirotaka> Now can we hack the code fragment which accepts one SJIS
    Hirotaka> character? Of course we need to look ahead one byte to
    Hirotaka> test if the byte sequence is a valid SJIS character.

Sure.  The problem is that almost everything is a valid SJIS
character, so most of the contents of most binary files would pass
straight through an SJIS filter.
For example, here are a few lines from "strings `which strings`":

/lib/ld-linux.so.2
__gmon_start__
libbfd-2.9.1.0.25.so
_DYNAMIC
_GLOBAL_OFFSET_TABLE_
_init
_fini

[snipped symbol table here]

_end
GLIBC_2.1
GLIBC_2.0
PTRh
QVhp
(<xt
G<Pj
ueh\
80t-@example.com(@example.com#@
WVSj/
80t-@example.com(@example.com#@
@@+E
80t-@example.com(@example.com#@
@@example.com$
8#t+C8#t&C8#t!C
80t-@example.com(@example.com#@
@@example.com$
@example.com$
8#t+C8#t&C8#t!C
80t-@example.com(@example.com#@
cHJy
version
help
target

[snipped]

I don't know what that mojibake means, but a moderately large
executable will give you hundreds or thousands of lines of it.  And
that is with plain old ASCII as the filter; the effect would be much
worse for Shift JIS.
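
To make the one-byte look-ahead concrete, here is a minimal sketch in
Python of what an SJIS-aware `strings' might look like.  It is not the
binutils code; the function names and the byte ranges are assumptions
on my part (printable ASCII, half-width katakana 0xA1-0xDF, lead bytes
0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC), and it
only tests structural validity, not whether the pair maps to an
assigned character.

```python
def sjis_char_len(data: bytes, i: int) -> int:
    """Return 1 or 2 if a plausible Shift JIS character starts at data[i], else 0.

    The byte ranges are an assumption for illustration; they check structure
    only, not whether the code point is actually assigned.
    """
    b = data[i]
    # Single byte: printable ASCII range or half-width katakana.
    if 0x20 <= b <= 0x7E or 0xA1 <= b <= 0xDF:
        return 1
    # Double byte: a lead byte needs a one-byte look-ahead for a valid trail byte.
    if (0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC) and i + 1 < len(data):
        t = data[i + 1]
        if 0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC:
            return 2
    return 0


def sjis_strings(data: bytes, min_chars: int = 4):
    """Yield byte runs holding at least min_chars consecutive plausible characters."""
    i = start = count = 0
    while i < len(data):
        n = sjis_char_len(data, i)
        if n:
            i += n
            count += 1
        else:
            if count >= min_chars:
                yield data[start:i]
            i += 1
            start, count = i, 0
    if count >= min_chars:
        yield data[start:i]


if __name__ == "__main__":
    # Real text is found, but so is an arbitrary run of lead/trail byte pairs.
    sample = b"\x00\x01help\x00\x8d\x45\x90\x40\xe8\x41\x89\x7e\x00"
    for run in sjis_strings(sample):
        print(run)
```

Note that the second run in the sample is just eight arbitrary bytes
that happen to pair up as lead and trail bytes, which is exactly the
kind of false positive at issue.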

    Hirotaka> Does anybody send me the source code of 'strings'? I
    Hirotaka> suppose it is not a large program.

I don't happen to have a copy at the moment, but it's in GNU binutils.

    Hirotaka> Can we write a SJIS version of 'strings'?
    >> No.

    Hirotaka> I think 'no' is too strong word but not impossible. You
    Hirotaka> need to make some dirty hack :-)

Well, no.  It would be like trying to write `strings' for ISO-8859-1:
you just get so many false positives that you end up with 90% of the 
file.

It might be good enough, but the odds aren't good enough for me to
spend any more time on it ;-)
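
For a rough feel for the false-positive argument, the sketch below
counts how much of the byte space each filter accepts.  It assumes
uniformly random bytes, which real binaries are not (they contain
long runs of zero bytes, for one thing), and it ignores the minimum
run length that `strings' applies, so treat the result as intuition
rather than a measurement.  The function names and byte ranges are my
own assumptions.

```python
def accepted_fraction_latin1() -> float:
    """Fraction of byte values that are printable in ISO-8859-1."""
    ok = sum(1 for b in range(256) if 0x20 <= b <= 0x7E or 0xA0 <= b <= 0xFF)
    return ok / 256


def accepted_fraction_sjis() -> float:
    """Chance that a uniformly random position starts a plausible SJIS character.

    Assumed ranges: single bytes 0x20-0x7E and 0xA1-0xDF, lead bytes
    0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC.
    """
    single = sum(1 for b in range(256) if 0x20 <= b <= 0x7E or 0xA1 <= b <= 0xDF)
    lead = sum(1 for b in range(256) if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC)
    trail = sum(1 for b in range(256) if 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC)
    # Either a single-byte character, or a lead byte followed by a valid trail byte.
    return single / 256 + (lead / 256) * (trail / 256)


if __name__ == "__main__":
    print(f"ISO-8859-1: {accepted_fraction_latin1():.3f}")
    print(f"Shift JIS:  {accepted_fraction_sjis():.3f}")
```

Under those assumptions roughly three quarters of random bytes
survive either filter, before any heuristics are added.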

You could add lots more heuristics, but that wouldn't really be
`strings' any more, since you'd have to be really careful to avoid
stripping out stuff surrounded by MS Word formatting characters, which
is my point.

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
__________________________________________________________________________
__________________________________________________________________________
What are those two straight lines for?  "Free software rules."
-------------------------------------------------------------------
Next Technical Meeting: October 9 (Sat), 13:30   place: Temple Univ.
* Linux Internationalisation Initiative (Li18nux) speaker: Akio Kido
* Japanese TrueType Fonts                     speaker: Adrian Havill
Next Technical Meeting: November 13 (Sat), 13:30 place: Temple Univ.
* Network Security                               speaker: Steve Baur
Next Nomikai:  December 17 (Fri), 19:00 Tengu TokyoEkiMae 03-3275-3691
-------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan

