Re: [tlug] Re: Security question with grep/e...

Date: Wed, 24 Mar 2004 22:17:50 +1100 (EST)
From: Jim Breen <Jim.Breen@example.com>
Subject: Re: [tlug] Re: Security question with grep/e...
I'm going to respond to everything in one hit (I read the digest, so
it's easier that way).

>> From: Brett Robson <b-robson@example.com>
>> 
>> I never used reg exps in C, how hard is it to write them in C? I'm
>> thinking in term of escaping characters etc.

Looking at the regcomp/regexec man page, I'd say the interface was
designed to be fast and efficient rather than easy-to-use. You feed it
the regex, it compiles it into a buffer, then you use the compiled form.
There is a thicket of compile-time flags and error codes.
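
A rough sketch of that compile-then-match flow, using the POSIX
regcomp/regexec/regerror/regfree calls. This is an illustration only, not
the server's code; the function name line_matches is made up here. On the
escaping question: the regex library itself needs no extra escaping, but a
backslash inside a C string literal has to be doubled ("\\." to match a
literal dot).

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if `line` matches the extended regex
 * `pattern` (ignoring case), 0 if it does not, and -1 if the pattern
 * fails to compile. */
int line_matches(const char *pattern, const char *line)
{
    regex_t re;
    char errbuf[128];
    int rc;

    /* Compile once; REG_NOSUB because we only want a yes/no answer. */
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB);
    if (rc != 0) {
        regerror(rc, &re, errbuf, sizeof errbuf);
        fprintf(stderr, "bad pattern: %s\n", errbuf);
        return -1;
    }

    /* Use the compiled form; regexec returns 0 on a match. */
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

In a real search loop the pattern would be compiled once and regexec
called per line, with regfree at the end.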

>> Sounds like you are running on a reasonably well administered server. One
>> small suggestion I'd make is to make your files read only (to all). If
>> you are not making a lot of updates it doesn't complicate things much

Nothing's writeable-to-all. I actually do a number of updates, in fact
there are cron jobs scrubbing and polishing in the wee hours.

>> From: "Stephen J. Turnbull" <stephen@example.com>
>> >>>>> "Jim" == Jim Breen <Jim.Breen@example.com> writes:
>> 
>>     Jim> Can you be more specific about the risks? As I understand it,
>>     Jim> doing a system("foobar par1 par2"); just stokes up /bin/sh
>>     Jim> under my account (it's usually cgiwrap or equivalent) and
>>     Jim> runs foobar.
>> 
>> ISTR that the main direct system risks come from the environment
>> (especially things like LD_PRELOAD).  The indirect risks come from the
>> fact that most command line apps are written assuming that the user is
>> right there and authorized.  They are large with many complex options
>> (especially the GNU versions), with lots of legacy code.  You really
>> can't depend on all buffer and array accesses being bounds-checked,
>> etc.

I've checked the environment that a system()-invoked program gets, and
there's no LD_PRELOAD. In fact it's pretty much the environment the
httpd instance gets. As for bounds-checking, well, I'm far from being
sin-free on that.
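
For anyone who would rather not rely on what the web server hands over, a
minimal sketch of scrubbing the environment before a system() call. The
function name, the list of variables and the PATH value are all
assumptions for illustration, not anything taken from the server.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/* Hypothetical helper: drop variables that influence the dynamic loader
 * or the shell, and pin PATH to a fixed value (assumed here).
 * Returns 0 on success, -1 on failure. */
int scrub_environment(void)
{
    static const char *const risky[] = {
        "LD_PRELOAD", "LD_LIBRARY_PATH", "IFS", "ENV", "BASH_ENV", "CDPATH"
    };
    size_t i;

    for (i = 0; i < sizeof risky / sizeof risky[0]; i++) {
        if (unsetenv(risky[i]) != 0)
            return -1;
    }
    /* Overwrite any inherited PATH so "egrep" resolves predictably. */
    return setenv("PATH", "/usr/bin:/bin", 1);
}
```

It would be called once, before the first system() or exec call.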

>> You also have to be sure that you get the protection against quote
>> characters right.  Note that if you're using "" for the regexp
>> argument in your system() call, you probably need to strip out $ and
>> `, since they're interpreted and the results interpolated.  Of course,
>> ` can do anything that you can, and at least some shells accept $
>> extensions that call programs (although I don't know if they do if
>> called as /bin/sh, but see next para).

I think that goes to the crux of it. I've backed away from any thought
of letting user-generated strings near a command-line, regardless of
vetting. The trial version I'm running does a 
system("egrep -i -f paramfile-nnnn text-file > resultsfile-nnnn"); with
the user string in paramfile-nnnn (nnnn is the PID). Since the user
string never gets into the command-line, and is only pulled in by egrep
once it starts executing, I don't think there's much risk of a hack that
way. Yes, there's always a chance of an overflow once egrep starts, but
I think it's probably no different to the case where I did my own 
regcomp/regexec calls.
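
A sketch of that arrangement, under assumptions: the file names follow the
paramfile-nnnn/resultsfile-nnnn pattern above, the function name
prepare_search is invented, and text_file is taken to be a fixed name
supplied by the program, never by the user. The newline check is an added
precaution, since egrep -f treats each line of the file as a separate
pattern.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the user's pattern to paramfile-<pid> and
 * format the egrep command into cmd. Only program-chosen names go into
 * the command line. Returns 0 on success, -1 on error. */
int prepare_search(const char *user_pattern, const char *text_file,
                   char *cmd, size_t cmdlen)
{
    char paramfile[64], resultsfile[64];
    long pid = (long)getpid();
    FILE *fp;
    int n;

    /* One pattern per request: refuse embedded newlines. */
    if (strchr(user_pattern, '\n') != NULL)
        return -1;

    snprintf(paramfile, sizeof paramfile, "paramfile-%ld", pid);
    snprintf(resultsfile, sizeof resultsfile, "resultsfile-%ld", pid);

    fp = fopen(paramfile, "w");
    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%s\n", user_pattern) < 0) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    n = snprintf(cmd, cmdlen, "egrep -i -f %s %s > %s",
                 paramfile, text_file, resultsfile);
    return (n < 0 || (size_t)n >= cmdlen) ? -1 : 0;
}
```

The resulting cmd string is what would be handed to system(); whatever
the user typed stays inside the pattern file.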

>> Also, /bin/sh is not terribly portable; ash, bash, and zsh all flunk
>> as POSIX shells in different ways.  (Maybe recent versions of ash are
>> OK.)  The problems with bash and zsh are not of interest to POSIX-
>> conforming scripts, they're impermissible extensions, but they are
>> extensions that conceivably could be used by hackers.  (No, I don't
>> know specifically; the point is that since I don't know, I have to
>> assume they are a risk.)

I thought /bin/sh was a sort-of lowest common denominator when it came
to shells. Certainly I'm not asking for anything but a pipe and a simple
redirection of STDOUT. Even DOS could do that.
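
If the behaviour of /bin/sh is the worry, the redirection can also be done
without any shell at all. A minimal sketch using fork, dup2 and execvp
follows; the function name run_redirected is made up, and the pipe through
head is left out to keep it short.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with its stdout redirected to
 * outfile, with no shell involved. Returns the child's exit status,
 * or -1 if it could not be run or did not exit normally. */
int run_redirected(char *const argv[], const char *outfile)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: point stdout at the output file, then exec. */
        int fd = open(outfile, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0 || dup2(fd, STDOUT_FILENO) < 0)
            _exit(127);
        if (fd != STDOUT_FILENO)
            close(fd);
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent: wait for the child and report how it ended. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with an argv such as {"egrep", "-i", "-f", paramfile, textfile,
NULL}, nothing is ever parsed by a shell, so quoting, $ and ` stop being
an issue. Note that egrep exits with status 1 when nothing matches, which
is not an error.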

>>     Jim> I'm not doing it with anything suid, etc. I don't have su
>>     Jim> rights on the host.
>> 
>> Do you have a shell account?  Does the host have a working C compiler
>> on it?  If the answer to both questions is "yes", then the possibility
>> of a hostile agent using a web exploit to achieve shell access via
>> your account, and from there trampolining to root cannot be
>> discounted.  Again, I don't know the details, but AFAICT at the time
>> it's been done to me, so I know it's possible.  :-/

It's yes to both, on at least one site. But I don't really think what
I'm suggesting is raising the chance.

>> From: Alain Hoang <hoanga@example.com>
>> 
>> 	Wow, I learned a lot myself.

I always do from Stephen.

>> 	I believe Dr. Turnbull covered everything I could possibly think of
>> and more in terms of what to worry about from a security aspect of
>> running a CGI script that pipes the output from an egrep (with
>> proper escapes).

Erm. The output from the egrep is piped through "head -nnn" simply to
limit the results, and then goes to workfile. Then the CGI program
opens the workfile and reads it in copying the contents into the
generated HTML response. I don't think the risk is high.
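
One thing worth doing at the copy step is escaping characters that are
special in HTML, in case a matched line contains any. A sketch, assuming
the workfile is read through stdio; the function name is invented and
this is not the server's actual code. Byte-wise escaping is safe for
EUC-JP because every byte of a multi-byte character has the high bit set,
so it can never be mistaken for an ASCII <, >, & or ".

```c
#include <stdio.h>

/* Hypothetical helper: copy `in` to `out`, replacing HTML-special
 * characters with entities. Returns the number of input bytes read. */
long copy_html_escaped(FILE *in, FILE *out)
{
    long count = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        switch (c) {
        case '&': fputs("&amp;", out);  break;
        case '<': fputs("&lt;", out);   break;
        case '>': fputs("&gt;", out);   break;
        case '"': fputs("&quot;", out); break;
        default:  putc(c, out);         break;
        }
        count++;
    }
    return count;
}
```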

>> 	I would just like to add that on the surface the egrep idea seems
>> portable but there seems to be those small niggling unknowns that
>> bother me if I knew this was going to be mirrored across many different
>> types of architectures.  Even though egrep is 'available' on all machines
>> as mentioned earlier, the implementation of them all slightly differs
>> so one regexp that seems reasonable on your test machine behaves
>> oddly on another because the egrep doesn't support one set.  Or
>> perhaps another system DOES have egrep but it's located somewhere
>> else and it's not the first one that is called on the PATH in the CGI.
>> At this point you might decide just go with GNU egrep but then you
>> now have the issue of calling GNU egrep reliably on a large
>> set of machines that might have stuck GNU egrep in lots of different
>> places.   You also get the problem of does GNU egrep have any
>> security exploits?   Which version of GNU egrep is on that machine?
>> Or you try to support a subset that all these versions of egrep support.
>> That's a bit of reading on different versions of egrep.  At this point
>> you're probably better off writing your own program rather than trying
>> to patch up all these systems.  Or wondering if Perl is starting to
>> become a more viable option :-)

The only real alternative is to use regcomp/regexec, and from what I
read on man pages it is as variable as egrep. Not surprising really.
As for Perl, well apart from being a non-issue for me, my limited Perl
experience has exposed me to masses of version differences. With a
master site and 6-7 mirrors, I don't want to add another language to the
equation.

>> 	I think this brings up another good point.  The more visible
>> you are the more the arrows get pointed at you.  I've found
>> Monash a really useful resource for years as a student of Japanese.
>> I think that visibility brings with it nuisances that think it would be
>> great to take down a useful site.

Our sysadmins look after it well. It's a bit painful - I don't have a
usable shell account on the server itself. I can mount the server files
elsewhere, but the server can't mount other files, so it's an island in
that sense. Also being a Solaris box it's not as much in the sights of
hackers as some Linices.

>> From: Tim Hurman <kano-tlug@example.com>
>> 
>> Would it not be easier just to do this in PERL anyway, here is my
>> reasoning,

Well, you have to factor in the rise time for me to learn enough Perl to
be confident of doing it right. Then since the server itself is in C,
I'd have to ..... Anyway, you see how I feel.

>> 1) before doing the system(), you have to do a whole lot of messing to get
>> the output of the egrep back (not to mention parsing it), this basically
>> involves a fork(), but it is an expensive call and a lot of usage may
>> affect the machine.

Well, calling Perl from the server (in C) takes a fork. And frankly
using efficiency of execution as a reason for moving from C to Perl is
just a little bizarre. 

>> 2) charsets. Even though you are passing stuff to egrep, I would presume
>> you have to have it in a common charset, and the likelihood is that you
>> will get it in utf-8, which may or may not be a good thing depending on
>> the charset you are comparing it to. Also you may have multiple encodings
>> for a double quote.

Forget charsets (and character sets) in this case. The file is in EUC-JP
and the user string is too. Works fine right now in C and egrep.

>> 3) egrep is going to involve a lot of file IO, are your disks up to it?

How is this any different from doing it myself in C or Perl?

>> however a few ideas about putting it in PERL:
>> 
>> 1) charsets are sorted, you just let PERL handle the conversion (from 5.6
>> onwards), no matter what the OS. PERL knows about broken iconvs and
>> oddities on different platforms.

Non-issue.

>> 2) you can lose even the initial fork from apache(?) by using modperl.

???

>> 3) you can easily put your entire sentence list into a hash/DBM which
>> could be easier to search, and depending on the size, completely memory
>> resident.

Effectively the whole sentence file is in RAM anyway. It's getting hit
so often that it's almost always in cache. 

>> 6) you get rid of SEGVs when mis-calculating the buffer size needed for a
>> multibyte character strings and all the other C nasties.

I think the only SEGVs I get these days are from Apache itself.

>> From: "Stephen J. Turnbull" <stephen@example.com>
>> 
>> No.  For most sane people (and Jim is one such), working with Perl
>> involves bowing to the Porcelain God.

Thank you.

>> Not to mention that if I grok his post correctly, Jim is working with
>> multilingual files, and the definition of "file character set" and the
>> various character classes is up for grabs.  In Jim's case, it's no big
>> deal.  But relying on Perl to get this stuff right is going to cost
>> you some day.  Multilingual text is hard, and POSIX didn't even try to
>> deal with it (Perl has extensions, but it's based on the POSIX model).

Naruhodo. As I warn in the instruction page of the server function, the
code currently only knows about bytes, not double-bytes, so putting
kanji or kana into [ ] won't get what you expect. At some time in the
distant future I may get the whole shebang migrated to UTF8 and I'll
see if I can get wide-char grepping set up then. Maybe POSIX will be
doing multilingual. Right now the server's working and it's fast.
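
To make the bytes-versus-characters point concrete: in EUC-JP the kana
"a" is the two bytes 0xA4 0xA2, so a byte-oriented matcher given that
kana inside [ ] sees a set of two separate bytes rather than one
character. The sketch below counts characters instead of bytes; the
function names are made up, and it assumes well-formed EUC-JP input.

```c
#include <stddef.h>

/* Length in bytes of the EUC-JP character that starts with `lead`. */
static size_t euc_jp_char_len(unsigned char lead)
{
    if (lead == 0x8F)
        return 3;                   /* SS3 + two bytes (JIS X 0212) */
    if (lead == 0x8E)
        return 2;                   /* SS2 + one byte (half-width kana) */
    if (lead >= 0xA1 && lead <= 0xFE)
        return 2;                   /* JIS X 0208 kanji and kana */
    return 1;                       /* ASCII and anything else */
}

/* Hypothetical helper: count characters, not bytes, in a NUL-terminated
 * EUC-JP string. A truncated final character is counted as one. */
size_t euc_jp_count_chars(const char *s)
{
    size_t chars = 0;

    while (*s != '\0') {
        size_t len = euc_jp_char_len((unsigned char)*s);
        while (len > 0 && *s != '\0') {
            s++;
            len--;
        }
        chars++;
    }
    return chars;
}
```

For the four-byte string "[", 0xA4, 0xA2, "]" this reports three
characters, which is how a user reads the expression; a byte-wise regex
engine sees four bytes.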

Whew. Far too long.

Many thanks for the comments and advice.

If you want to play with the server function, the testbed is at:

http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic2?10

Cheers

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,                Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学