Embedding one JavaCC grammar in another
25th September, 2006
JavaCC is very handy tool for Java software developers. This isn't really a JavaCC tutorial, so I won't be explaining much about what a compiler compiler does. The capsule summary, however, would be that you write grammar rules in a syntax that's a mixture of EBNF and Java. Then you run JavaCC's compiler to generate from those grammar rules a set of Java classes that will consume strings, check that conform to the grammar rules, and return some programmatic objects representing the processed form of the input text - a parser, in standard terminology.
Many different Java applications use JavaCC, and this presents, in principle, quite an opportunity for software re-use. For example, suppose we wanted to define a simple language, as follows:
<let> <var> = <sparql> <semicolon>
This language would assign a variable, say, ?a, with the result
of an arbitrary SPARQL query. For example:
let ?results = select ?a {?a dc:creator "Ian Dickinson" . ?a dc:type "article"} ;
Incidentally, I picked SPARQL because this is an actual example I've been working on recently, but in principle this could be any language. Now SPARQL is a reasonably complex language, and while the grammar is published in the W3C specification document, it would be quite a lot of work to write the parser and get it working. Fortunately, Andy Seaborne has written a SPARQL parser, using JavaCC, as part of Jena. So all have we to do to implement the example language above is write the productions for the rest of the grammar, then work out how to invoke the existing SPARQL parser at the right point. That's the focus of this article.
Before we go any further, let's look briefly at what JavaCC produces. Ignoring the tree
pre-processor JJTree for now, JavaCC takes an input file, say, javacc-test.jj, and
generates a number of Java source files:
JavaCharStream.javaParseException.javaToken.javaTokenMgError.javaJavaCCTest.javaJavaCCTestConstants.javaJavaCCTestTokenManager.java
The first four of these files are only generated once, the latter three are regenerated
each time the grammar changes. In this example, JavaCCTest.java is the
actual parser itself, with public methods that correspond to the public entry points
into the language defined by the grammar. The TokenManager class
handles the job of converting characters from the input source into lexical tokens,
while the parser ensures that the token sequence corresponds to the grammar. The
outputs from JavaCC go to an output directory, and Java package, of the user's choice.
Personally, I use an ant script to perform
the generation step, but there are command line tools and
plugins for Eclipse. Crucially
for this discussion, the generated files don't implement any common interface or
extend any known JavaCC class. This means that JavaCC-generated
parsers don't need any runtime libraries -
they are completely standalone at deployment time (which is a good thing).
Let's put some scaffolding in place for this example. Here's a first attempt at the JavaCC file for our mini-language:
PARSER_BEGIN(EmbeddedLangTest)
package org.nuin.script.parser;
public class EmbeddedLangTest {
}
PARSER_END(EmbeddedLangTest)
SKIP :
{ " " | "\r" | "\t" | "\n" }
TOKEN :
{
< ASSIGN: "<-" >
| < SEMICOLON: ";" >
| < LET: "let" >
}
TOKEN : /* Identifiers */
{
< IND_VARIABLE: "?" <WORD> >
| < WORD: <LEADING> (<NORMAL>)* >
| < #LEADING: <UPPER> | <LOWER>
| < #NORMAL: <UPPER> | <LOWER> | <DIGIT>
| < #UPPER: ["A"-"Z"] >
| < #LOWER: ["a"-"z"] >
| < #DIGIT: ["0"-"9"] >
}
/* This is the production we will use to test language embedding */
void testEmbedded() : {}
{
<LET> <IND_VARIABLE> <ASSIGN> sparql() <SEMICOLON>
}
void sparql() :
{
Query query = null;
}
{
{query = delegate();}
}
There are two noteworthy points here: the production testEmbedded is the
entry point we'll use from a JUnit test case, and the production sparql is
where we'd like to delegate processing to an embedded SPARQL grammar.
So far so good, now on to the mechanism of the delegation itself. One choice, of
course, would be to create some delimiter syntax that would allow us to recognise
a query string in its entirety, then extract it and fire it off to a completely
separate parser. For example, using the delimiters {{ ... }}:
let ?results = {{ select ?a {?a dc:creator "Ian Dickinson" . ?a dc:type "article"} }} ;
Besides being ugly and inelegant, this method is brittle: you have to be sure that the closing delimiter cannot occur in the sub-language. Moreover, the embedded string image will come out as one single token in the outer language, which will be inefficient for many situations (e.g. consider embedding an N3 parser to define an entire knowledge base).
A second alternative would be to try source-level integration between the sub-language's
.jj file, and the .jj file we're working on. There isn't, as far as I know, a
mechanism in JavaCC for importing other grammars, and in any case this approach would have its own
difficulties. JavaCC parsers tend to rely on having methods or state in the parser
class, put there by the language developer. Integrating the two parser classes (ours
and the sub-language class) would be very difficult. Worse still, the two languages
might have different, incompatible, tokenisation rules.
A third solution is to have the sub-language parser take over the input stream
at the right point, and then have our parser take up the reins again when the sub-language
is finished. JavaCC parser classes are created with a number of constructors, one
of which simply takes an input java.io.Reader. OK!
So: we get the input reader from the current
parser, construct a new sub-language parser with that reader, and we're done, yes?
No. Unsurprisingly, it's not so simple as that. The culprit in this case is
JavaCharStream.java, which JavaCC creates to help it manage the
input character source. As well as handling unicode escaping conventions,
JavaCharStream caches the input, typically in buffers of 4096 characters.
If we simply gave the input reader from the outer grammar, to the parser for the inner grammar,
an unknowable amount of cached input will be skipped. Here lie dragons.
OK, so how about giving the inner parser the JavaCharStream object
itself? Then the inner parser would use up exactly the same source of input, and
leave the JavaCharStream in a good state for the main language
parser to resume parsing. Sounds promising. Let's try it:
PARSER_BEGIN(EmbeddedLangTest)
package org.nuin.script.parser;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParser;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParserTokenManager;
public class EmbeddedLangTest {
protected Query delegate()
throws ParseException
{
// create a SPARQL parser that we will delegate to
SPARQLParserTokenManager tm =
new SPARQLParserTokenManager( jj_input_stream );
SPARQLParser sp = new SPARQLParser( tm );
// the parser fills in an empty query template
Query q = new Query();
sp.setQuery( q );
try {
// parse to the end of the query
sp.Query();
q.setSyntax(Syntax.syntaxSPARQL);
}
catch (ParseException e0) {
// handle a parse exception here
}
return q;
}
}
PARSER_END(EmbeddedLangTest)
The resulting code, while it looks right, won't compile. Why? Because our parser, and the
sub-language parser, have different, unrelated, definitions of JavaCharStream.
Specifically, org.nuin.script.parser.JavaCharStream cannot be used where
com.hp.hpl.jena.query.lang.sparql.JavaCharStream is expected. jj_input_stream,
highlighted, is a member variable in our parser class and references our JavaCharStream
object.
Adapter pattern to the rescue
Actually, it's not entirely true that the two JavaCharStream classes are
unrelated. They share no common interface, nor a common super-class other than java.lang.Object.
However, they were both generated by the same code template, which tips us to the solution.
Adapter pattern is a design pattern
for just this situation. We want to adapt a org.nuin.script.parser.JavaCharStream to
look like a SPARQL JavaCharStream, so we need a simple adapter class like this:
/**
* Adapter class that allows one grammar's JavaCharStream to be used by another
* parser, despite being in different packages. This adapter allows
* {@link JavaCharStream org.nuin.script.parser.JavaCharStream} to be used
* in place of {@link com.hp.hpl.jena.query.lang.sparql.JavaCharStream}.
*/
public class JavaCharStreamAdapter
extends com.hp.hpl.jena.query.lang.sparql.JavaCharStream
{
// Constants
//////////////////////////////////
// Static variables
//////////////////////////////////
// Instance variables
//////////////////////////////////
protected JavaCharStream m_source;
// Constructors
//////////////////////////////////
public JavaCharStreamAdapter( JavaCharStream source ) {
super( (Reader) null );
m_source = source;
}
// External signature methods
//////////////////////////////////
protected void ExpandBuff( boolean wrapAround ) {
m_source.ExpandBuff( wrapAround );
}
protected void FillBuff() throws java.io.IOException {
m_source.FillBuff();
}
protected char ReadByte() throws java.io.IOException {
return m_source.ReadByte();
}
public char BeginToken() throws java.io.IOException {
return m_source.BeginToken();
}
... many more methods like this ...
Every method invoked on the faux-SPARQL JavaCharStream
actually invokes the appropriate behaviour on the wrapped local
JavaCharStream instance.
It's slightly tedious but straightforward to write, and once the pattern has
been created it would be easy to copy and re-use for another adapter with
just a few changed lines. So now we can show the functioning version of
the delegate method:
PARSER_BEGIN(EmbeddedLangTest)
package org.nuin.script.parser;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParser;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParserTokenManager;
public class EmbeddedLangTest {
/**
* Delegate parsing to a nested SPARQL parser, but one that will
* use the JavaCharStream that the current parser is using
* @return A Jena query object parsed according to the SPARQL grammar.
* @exception org.nuin.script.parser.ParseException if parsing fails
* on the given input.
*/
protected Query delegate()
throws ParseException
{
// create a SPARQL parser that we will delegate to
JavaCharStreamAdapter a = new JavaCharStreamAdapter( jj_input_stream );
SPARQLParserTokenManager tm = new SPARQLParserTokenManager( a );
SPARQLParser sp = new SPARQLParser( tm );
// the parser fills in an empty query template
Query q = new Query();
sp.setQuery( q );
try {
// parse to the end of the query
sp.Query();
q.setSyntax(Syntax.syntaxSPARQL);
}
catch (com.hp.hpl.jena.query.lang.sparql.ParseException e0) {
ParseException e1 =
new ParseException( "Error reported by SPARQL parser: " +
e0.getMessage() );
throw (ParseException) e1.initCause( e0 );
}
return q;
}
}
PARSER_END(EmbeddedLangTest)
Note that a similar problem held with the two different definitions
of ParseException, so we have to catch and re-throw the
sub-language parser exception.
Testing
Two final wrinkles show up in unit testing. Here's the first version of a unit test:
public void testParser0()
throws ParseException
{
String test =
"let ?a <- select ?a where {?a dc:creator \"Ian Dickinson\"} ;"
StringReader r = new StringReader( test );
EmbeddedLangTest t = new EmbeddedLangTest( r );
t.testEmbedded();
}
This test fails because the dc:creator predicate uses an
undefined prefix. We can fix the query source in the test:
public void testParser0()
throws ParseException
{
String test =
"let ?a <- PREFIX dc: <http://purl.org/dc/elements/1.1/>" +
" select ?a where {?a dc:creator \"Ian Dickinson\"} ;"
StringReader r = new StringReader( test );
EmbeddedLangTest t = new EmbeddedLangTest( r );
t.testEmbedded();
}
However, while this works it looks a bit clunky. It would be nice if the language
we're defining, which contains SPARQL as a sub-language, could have its own prefix
defining mechanism.
Fortunately, we can side
effect the Query object to tell it about known prefixes before we start
parsing. A greatly simplified version is shown in the example code below, and
tested in the second unit test.
The second wrinkle is that, when the test is run, it fails! The failure is not on the
SPARQL parsing, but on the trailing semi-colon after the query. What's happening
is that the sub-language parser has to run one character past the end of its
language to know that it has completed its job. That means that when the outer parser
resumes processing, the first character after the embedded expression is missing.
Fortunately, JavaCC makes it
easy to put back the one-character overrun with jj_input_stream.backup(1).
So the finished grammar looks like this:
PARSER_BEGIN(EmbeddedLangTest)
package org.nuin.script.parser;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParser;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParserTokenManager;
public class EmbeddedLangTest {
/**
* Delegate parsing to a nested SPARQL parser, but one that will
* use the JavaCharStream that the current parser is using
* @return A Jena query object parsed according to the SPARQL grammar.
* @exception org.nuin.script.parser.ParseException if parsing fails
* on the given input.
*/
protected Query delegate()
throws ParseException
{
// create a SPARQL parser that we will delegate to
JavaCharStreamAdapter a = new JavaCharStreamAdapter( jj_input_stream );
SPARQLParserTokenManager tm = new SPARQLParserTokenManager( a );
SPARQLParser sp = new SPARQLParser( tm );
// the parser fills in an empty query template
Query q = new Query();
sp.setQuery( q );
setPrefixes( q );
try {
// parse to the end of the query
sp.Query();
q.setSyntax(Syntax.syntaxSPARQL);
// push back the final char, which the sparql parser will have
// consumed to know that the query was finished
jj_input_stream.backup(1);
}
catch (com.hp.hpl.jena.query.lang.sparql.ParseException e0) {
ParseException e1 =
new ParseException( "Error reported by SPARQL parser: " +
e0.getMessage() );
throw (ParseException) e1.initCause( e0 );
}
return q;
}
/**
* Set up some well-known prefixes for the query, to avoid having to
* define them in the query itself. Obviously, this would ideally be
* sensitive to the execution context.
*/
protected void setPrefixes( Query q ) {
q.setPrefix( "foaf", "http://xmlns.com/foaf/0.1/" );
}
}
PARSER_END(EmbeddedLangTest)
SKIP :
{
" "
| "\r"
| "\t"
| "\n"
}
TOKEN :
{
< ASSIGN: "<-" >
| < SEMICOLON: ";" >
| < LET: "let" >
}
TOKEN : /* Identifiers */
{
< IND_VARIABLE: "?" <WORD> >
| < WORD: <LEADING> (<NORMAL>)* >
| < #LEADING: <UPPER> | <LOWER>
| < #NORMAL: <UPPER> | <LOWER> | <DIGIT>
| < #UPPER: ["A"-"Z"] >
| < #LOWER: ["a"-"z"] >
| < #DIGIT: ["0"-"9"] >
}
/* This is the production we will use to test language embedding */
void testEmbedded() : {}
{
<LET> <IND_VARIABLE> <ASSIGN> sparql() <SEMICOLON>
}
void sparql() :
{
Query query = null;
}
{
{query = delegate();}
}
And the final test cases are:
public void testParser0()
throws ParseException
{
String test =
"let ?a <- PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
"select ?a where {?a dc:creator \"Ian Dickinson\"} ;";
StringReader r = new StringReader( test );
EmbeddedLangTest t = new EmbeddedLangTest( r );
t.testEmbedded();
}
public void testParser1()
throws ParseException
{
// built-in prefix mapping
String test =
"let ?a <- select ?a where {?a foaf:name \"Ian Dickinson\"} ;";
StringReader r = new StringReader( test );
EmbeddedLangTest t = new EmbeddedLangTest( r );
t.testEmbedded();
}
Conclusion
This method of integrating two JavaCC-based grammars relies on delegating
the parsing of the input stream from the main parser to the sub-language
parser. The delegation is achieved by sharing a single input
source, via the JavaCharStream
input manager. Since the manager classes are created in different packages,
we need to use adapter pattern to allow the outer parser's JavaCharStream
object to be used by the sub-language parser. Of course, I haven't shown
what to do with the newly parsed query, but that's outside the scope of this brief article.
