
Re: [tlug] presentation wish list



On 14 October 2014 14:17, Travis Cardwell
<travis.cardwell@example.com> wrote:
>
> Tokyo Parsing Study Group? ;)

Well, if there is sufficient interest, then why not?!

As a cautionary note, though: interest groups with a very specialised
and narrow topic tend to be short-lived, even in a place like Tokyo.
We once had an informal group of LLVM compiler hackers, some four or
five people, who met for beer and chat once a month, but it didn't
last more than a few months.


> Whether there is enough interest to warrant a presentation or not, I look
> forward to discussing the topic with you, perhaps at a nijikai.

Sure, that sounds like a fun topic.

> I have a
> project (that is unfortunately currently on hold) that requires a bit of
> parsing, and I would love to get your thoughts on my strategy.

Perhaps as a general postscript on this: building a proper parser is
a widely applicable and useful skill. Contrary to common perception,
it is not limited to programming language design. In fact, Terence
Parr, the author of ANTLR, told me that he changed the usage model
and design of ANTLR in version 4 because demand for the tool was
coming from all kinds of areas, and least of all from language design.

Whenever you build a piece of software, there are various sources of
input data that should always be verified, because bad input is a
very common cause of security vulnerabilities. Any kind of input,
whether interactive input from a user terminal/browser session or
data from a file, should ALWAYS be verified for 100% correctness in
order to close this exploit route. Unfortunately, verifying input
data is very often neglected.
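To illustrate the difference, here is a small sketch in Python (the
scenario and function names are my own invention, purely for
illustration): an ad hoc check that merely looks for plausible
content, next to a strict check that accepts only input matching the
expected form in its entirety.

```python
# Contrast: ad hoc checking vs. full verification of a numeric field.
# (Hypothetical example -- not from any particular project.)

def adhoc_ok(s):
    # Ad hoc check: "does it contain a digit somewhere?"
    # Junk around the digits slips straight through.
    return any(c.isdigit() for c in s)

def strict_port(s):
    # Full verification: the ENTIRE input must be 1-5 digits,
    # and the value must be a valid TCP/UDP port number.
    if not (s.isdigit() and 1 <= len(s) <= 5):
        return None
    n = int(s)
    return n if 0 < n < 65536 else None

print(adhoc_ok("80; rm -rf /"))     # -> True  (junk accepted)
print(strict_port("80; rm -rf /"))  # -> None  (rejected)
print(strict_port("8080"))          # -> 8080
```

The strict version embodies the principle above: nothing short of
100% well-formed input gets through.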

People often use ad hoc verification that does not catch all possible
malformed input. I vividly remember the sorry excuse for a parser
that read configuration files in Asterisk. It searched a line from
left to right for the first opening square bracket, then from right
to left for the first closing square bracket, and accepted any input
between them without further verification.

Something like ...

[[foobar]]

would be accepted as

[foobar]

and I demonstrated how this could be used in hosted multi-tenant
PBXes to hijack another tenant's account and make phone calls on
their bill.
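To make the flaw concrete, here is a hypothetical reconstruction in
Python (not the actual Asterisk C source, and the strict variant's
identifier syntax is my own simplification):

```python
import re

def naive_section_name(line):
    # Flawed approach (reconstructed for illustration): take whatever
    # lies between the first '[' from the left and the first ']'
    # from the right, with no further verification.
    start = line.find("[")
    end = line.rfind("]")
    if start == -1 or end <= start:
        return None
    return line[start + 1:end]

_SECTION = re.compile(r"\[([A-Za-z0-9_-]+)\]")

def strict_section_name(line):
    # Proper verification: the whole line must be exactly one
    # bracketed identifier, nothing more.
    m = _SECTION.fullmatch(line.strip())
    return m.group(1) if m else None

print(naive_section_name("[[foobar]]"))   # -> '[foobar]' (malformed input accepted)
print(strict_section_name("[[foobar]]"))  # -> None       (rejected)
print(strict_section_name("[foobar]"))    # -> 'foobar'
```

The naive version happily turns the malformed `[[foobar]]` into a
section named `[foobar]`, which is exactly the behaviour described
above.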

Yet nobody cared, because the task of writing a proper parser for
such a simple configuration file was considered overkill. I have
since seen similar situations in both open-source and commercial
projects.

There are three reasons why people tend to neglect building proper
parsers to verify all their input:

(1) when trying to use an automated parser generation tool, they find
out that the parser generator has a very steep learning curve and
only does half of the work, while the other half still has to be
coded manually.

(2) perhaps never having written a recursive descent parser from
scratch, many people believe it must be an immense effort requiring a
rocket scientist of sorts, when it is in fact rather simple, even
more so for simple input data formats.

(3) many software folks tend to be like children in a candy store:
they don't like to spend a lot of time on planning and design but
want to start hacking code right away. Building a parser, whether
manually or with a generator tool, always requires a fair amount of
preparation and design. You need to write a grammar, verify the
grammar, then build your parser strictly following the grammar. This
runs counter to many people's habits.
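As an illustration of point (2), here is a minimal sketch of a
hand-written recursive descent parser for a toy arithmetic grammar
(my own example, not tied to any particular project). Each parsing
function mirrors exactly one grammar rule, which is what makes the
technique so approachable:

```python
# Minimal recursive-descent parser/evaluator for a toy grammar.
# Grammar (one parsing function per rule):
#   expr   : term (('+' | '-') term)* ;
#   term   : factor (('*' | '/') factor)* ;
#   factor : NUMBER | '(' expr ')' ;
import re

_TOKEN = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(src):
    tokens = []
    for num, op in _TOKEN.findall(src):
        tokens.append(int(num) if num else op)
    tokens.append(None)  # end-of-input marker
    return tokens

class Parser:
    def __init__(self, src):
        self.tokens = tokenize(src)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expect(self, tok):
        if self.next() != tok:
            raise SyntaxError("expected %r" % (tok,))

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.factor()
            else:
                value //= self.factor()
        return value

    def factor(self):
        tok = self.next()
        if isinstance(tok, int):
            return tok
        if tok == "(":
            value = self.expr()
            self.expect(")")
            return value
        raise SyntaxError("unexpected token %r" % (tok,))

def parse(src):
    p = Parser(src)
    value = p.expr()
    p.expect(None)  # reject trailing garbage
    return value

print(parse("2 + 3 * (4 - 1)"))  # -> 11
```

Note that the design work happened up front, in the grammar; the code
then follows it mechanically, and malformed input (an unbalanced
parenthesis, trailing garbage) is rejected with an error rather than
silently accepted.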

The reason I proposed a presentation that starts off with a bit of
theory on recursive descent and then shows how to code a parser both
by hand and with a tool that automates the tedious verification and
conflict catching is precisely to counter these prevailing
perceptions and habits:

(1) parsing is not rocket science; it can be done fairly simply.
(2) parsing is universally useful; it can significantly contribute to
keeping software secure and reliable.
(3) the effort to plan and design, tedious as it may seem, really
pays off when implementing.
(4) the most tedious activities, such as grammar verification, can be
automated using tools such as ANTLR.

In other words, the idea is to take the pain out of parsing, both
perceived and actual pain.

A presentation of flex/bison may however have the opposite effect.

