There was an interesting problem at work recently, where I basically wanted to generate a whole lot of C language expressions for testing purposes. Rather than generating them by hand, I wondered if I could somehow feed the C language grammar into a program and get most, if not all, possible valid expressions out of it. Of course, I knew that a syntactically valid program might not necessarily by semantically valid, but I figured that I could eliminate or transform those cases manually.
So off I went, using this problem as an opportunity to write some Ruby code. The end result is Ramble, a Ruby program that given a grammar like
start : paragraphs; paragraphs : paragraph | paragraph LINEBREAK LINEBREAK paragraphs ; paragraph : sentence | sentence paragraph ; sentence : subject verb object PERIOD ; subject : CAT | DOG | PONY ; verb : EATS | GULPS | SWALLOWS ; object : HAY | FOOD | MILK ;
and some helper code to “stringify” terminal symbols like CAT, DOG and PONY, will generate text like
Cat swallows hay. Pony swallows milk. Dog eats hay. Dog swallows hay. Pony gulps milk. Pony gulps hay. Pony gulps food. Dog eats food. Dog gulps milk. Pony eats food. Cat swallows hay.
Cool, don’t you think?
You can browse the source code online at github here, but the general idea is to parse the grammar file (I’m assuming a simplified yacc format), generate an abstract tree of productions, and then traverse the tree to generate text (with a callback to get text for terminal symbols). Pretty similar to what a compiler does, except that instead of accepting source code as input, it takes a grammar file, and instead of emitting code, it spits out sentences in the language.
I used Treetop to parse the grammar, and wrote some hairy recursive code to filter and operate on the generated tree. Now the grammar could be left recursive i.e., it could have a rule like
A : A | B
and a generator that naively tries to generate every possible sentence recursively would quickly overflow the stack and die. I chose to randomly pick one of the two (or more) options, hoping that running the program repeatedly will give me a variety of sentences.
Now for the bad part – I tried running this on a subset of the C grammar, and it crashed with a stack overflow error. With rules that were transitively recursive, the random choice option wasn’t enough to control the recursion before it blew the stack.
And boy, was I wrong about filtering out syntactically correct but semantically valid expressions! The few times the program completed without crashing, the emitted expressions were just total rubbish semantically – there was no way I could have manually massaged those into shape.
It was an interesting exercise though, and it helped scratch two itches at the same time – dabbling with language stuff, and writing serious Ruby code (I’m a Ruby newbie).