» home
» articles index

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 2.0 License.

Valid XHTML 1.0 Strict
Valid CSS

Embedding one JavaCC grammar in another

25th September, 2006

JavaCC is very handy tool for Java software developers. This isn't really a JavaCC tutorial, so I won't be explaining much about what a compiler compiler does. The capsule summary, however, would be that you write grammar rules in a syntax that's a mixture of EBNF and Java. Then you run JavaCC's compiler to generate from those grammar rules a set of Java classes that will consume strings, check that conform to the grammar rules, and return some programmatic objects representing the processed form of the input text - a parser, in standard terminology.

Many different Java applications use JavaCC, and this presents, in principle, quite an opportunity for software re-use. For example, suppose we wanted to define a simple language, as follows:

<let> <var> = <sparql> <semicolon>

This language would assign a variable, say, ?a, with the result of an arbitrary SPARQL query. For example:

let ?results = select ?a {?a dc:creator "Ian Dickinson" . ?a dc:type "article"} ;

Incidentally, I picked SPARQL because this is an actual example I've been working on recently, but in principle this could be any language. Now SPARQL is a reasonably complex language, and while the grammar is published in the W3C specification document, it would be quite a lot of work to write the parser and get it working. Fortunately, Andy Seaborne has written a SPARQL parser, using JavaCC, as part of Jena. So all have we to do to implement the example language above is write the productions for the rest of the grammar, then work out how to invoke the existing SPARQL parser at the right point. That's the focus of this article.

Before we go any further, let's look briefly at what JavaCC produces. Ignoring the tree pre-processor JJTree for now, JavaCC takes an input file, say, javacc-test.jj, and generates a number of Java source files:

The first four of these files are only generated once, the latter three are regenerated each time the grammar changes. In this example, JavaCCTest.java is the actual parser itself, with public methods that correspond to the public entry points into the language defined by the grammar. The TokenManager class handles the job of converting characters from the input source into lexical tokens, while the parser ensures that the token sequence corresponds to the grammar. The outputs from JavaCC go to an output directory, and Java package, of the user's choice. Personally, I use an ant script to perform the generation step, but there are command line tools and plugins for Eclipse. Crucially for this discussion, the generated files don't implement any common interface or extend any known JavaCC class. This means that JavaCC-generated parsers don't need any runtime libraries - they are completely standalone at deployment time (which is a good thing).

Let's put some scaffolding in place for this example. Here's a first attempt at the JavaCC file for our mini-language:

PARSER_BEGIN(EmbeddedLangTest)
package org.nuin.script.parser;

public class EmbeddedLangTest {
}

PARSER_END(EmbeddedLangTest)

SKIP :
{  " " |  "\r" |  "\t" |  "\n" }

TOKEN :
{
    < ASSIGN: "<-" >
  | < SEMICOLON: ";" >
  | < LET: "let" >
}

TOKEN : /* Identifiers */
{
      < IND_VARIABLE: "?" <WORD> >
    | < WORD: <LEADING> (<NORMAL>)* >
    | < #LEADING: <UPPER> | <LOWER>
    | < #NORMAL: <UPPER> | <LOWER> | <DIGIT>
    | < #UPPER: ["A"-"Z"] >
    | < #LOWER: ["a"-"z"] >
    | < #DIGIT: ["0"-"9"] >
}

/* This is the production we will use to test language embedding */
void testEmbedded() : {}
{
  <LET> <IND_VARIABLE> <ASSIGN> sparql() <SEMICOLON>
}

void sparql() :
{
  Query query = null;
}
{
  {query = delegate();}
}

There are two noteworthy points here: the production testEmbedded is the entry point we'll use from a JUnit test case, and the production sparql is where we'd like to delegate processing to an embedded SPARQL grammar.

So far so good, now on to the mechanism of the delegation itself. One choice, of course, would be to create some delimiter syntax that would allow us to recognise a query string in its entirety, then extract it and fire it off to a completely separate parser. For example, using the delimiters {{ ... }}:

let ?results = {{ select ?a {?a dc:creator "Ian Dickinson" . ?a dc:type "article"} }} ;

Besides being ugly and inelegant, this method is brittle: you have to be sure that the closing delimiter cannot occur in the sub-language. Moreover, the embedded string image will come out as one single token in the outer language, which will be inefficient for many situations (e.g. consider embedding an N3 parser to define an entire knowledge base).

A second alternative would be to try source-level integration between the sub-language's .jj file, and the .jj file we're working on. There isn't, as far as I know, a mechanism in JavaCC for importing other grammars, and in any case this approach would have its own difficulties. JavaCC parsers tend to rely on having methods or state in the parser class, put there by the language developer. Integrating the two parser classes (ours and the sub-language class) would be very difficult. Worse still, the two languages might have different, incompatible, tokenisation rules.

A third solution is to have the sub-language parser take over the input stream at the right point, and then have our parser take up the reins again when the sub-language is finished. JavaCC parser classes are created with a number of constructors, one of which simply takes an input java.io.Reader. OK! So: we get the input reader from the current parser, construct a new sub-language parser with that reader, and we're done, yes? No. Unsurprisingly, it's not so simple as that. The culprit in this case is JavaCharStream.java, which JavaCC creates to help it manage the input character source. As well as handling unicode escaping conventions, JavaCharStream caches the input, typically in buffers of 4096 characters. If we simply gave the input reader from the outer grammar, to the parser for the inner grammar, an unknowable amount of cached input will be skipped. Here lie dragons.

OK, so how about giving the inner parser the JavaCharStream object itself? Then the inner parser would use up exactly the same source of input, and leave the JavaCharStream in a good state for the main language parser to resume parsing. Sounds promising. Let's try it:

PARSER_BEGIN(EmbeddedLangTest)
package org.nuin.script.parser;

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParser;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParserTokenManager;

public class EmbeddedLangTest {
    protected Query delegate()
        throws ParseException
    {
        // create a SPARQL parser that we will delegate to
        SPARQLParserTokenManager tm =
              new SPARQLParserTokenManager( jj_input_stream );
        SPARQLParser sp = new SPARQLParser( tm );

        // the parser fills in an empty query template
        Query q = new Query();
        sp.setQuery( q );

        try {
            // parse to the end of the query
            sp.Query();
            q.setSyntax(Syntax.syntaxSPARQL);
        }
        catch (ParseException e0) {
            // handle a parse exception here
        }

        return q;
    }
}

PARSER_END(EmbeddedLangTest)

The resulting code, while it looks right, won't compile. Why? Because our parser, and the sub-language parser, have different, unrelated, definitions of JavaCharStream. Specifically, org.nuin.script.parser.JavaCharStream cannot be used where com.hp.hpl.jena.query.lang.sparql.JavaCharStream is expected. jj_input_stream, highlighted, is a member variable in our parser class and references our JavaCharStream object.

Adapter pattern to the rescue

Actually, it's not entirely true that the two JavaCharStream classes are unrelated. They share no common interface, nor a common super-class other than java.lang.Object. However, they were both generated by the same code template, which tips us to the solution. Adapter pattern is a design pattern for just this situation. We want to adapt a org.nuin.script.parser.JavaCharStream to look like a SPARQL JavaCharStream, so we need a simple adapter class like this:

/**
 * Adapter class that allows one grammar's JavaCharStream to be used by another
 * parser, despite being in different packages. This adapter allows
 * {@link JavaCharStream org.nuin.script.parser.JavaCharStream} to be used
 * in place of {@link com.hp.hpl.jena.query.lang.sparql.JavaCharStream}.
 */
public class JavaCharStreamAdapter
    extends com.hp.hpl.jena.query.lang.sparql.JavaCharStream
{
    // Constants
    //////////////////////////////////

    // Static variables
    //////////////////////////////////

    // Instance variables
    //////////////////////////////////

    protected JavaCharStream m_source;

    // Constructors
    //////////////////////////////////

    public JavaCharStreamAdapter( JavaCharStream source ) {
        super( (Reader) null );
        m_source = source;
    }

    // External signature methods
    //////////////////////////////////

    protected void ExpandBuff( boolean wrapAround ) {
        m_source.ExpandBuff( wrapAround );
    }

    protected void FillBuff() throws java.io.IOException {
        m_source.FillBuff();
    }

    protected char ReadByte() throws java.io.IOException {
        return m_source.ReadByte();
    }

    public char BeginToken() throws java.io.IOException {
        return m_source.BeginToken();
    }

  ... many more methods like this ...

Every method invoked on the faux-SPARQL JavaCharStream actually invokes the appropriate behaviour on the wrapped local JavaCharStream instance. It's slightly tedious but straightforward to write, and once the pattern has been created it would be easy to copy and re-use for another adapter with just a few changed lines. So now we can show the functioning version of the delegate method:

PARSER_BEGIN(EmbeddedLangTest)
package org.nuin.script.parser;

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParser;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParserTokenManager;

public class EmbeddedLangTest {
    /**
     * Delegate parsing to a nested SPARQL parser, but one that will
     * use the JavaCharStream that the current parser is using
     * @return A Jena query object parsed according to the SPARQL grammar.
     * @exception org.nuin.script.parser.ParseException if parsing fails
     * on the given input.
     */
    protected Query delegate()
        throws ParseException
    {
        // create a SPARQL parser that we will delegate to
        JavaCharStreamAdapter a = new JavaCharStreamAdapter( jj_input_stream );
        SPARQLParserTokenManager tm = new SPARQLParserTokenManager( a );
        SPARQLParser sp = new SPARQLParser( tm );

        // the parser fills in an empty query template
        Query q = new Query();
        sp.setQuery( q );

        try {
            // parse to the end of the query
            sp.Query();
            q.setSyntax(Syntax.syntaxSPARQL);
        }
        catch (com.hp.hpl.jena.query.lang.sparql.ParseException e0) {
            ParseException e1 =
                new ParseException( "Error reported by SPARQL parser: " +
                                    e0.getMessage() );
            throw (ParseException) e1.initCause( e0 );
        }

        return q;
    }
}

PARSER_END(EmbeddedLangTest)

Note that a similar problem held with the two different definitions of ParseException, so we have to catch and re-throw the sub-language parser exception.

Testing

Two final wrinkles show up in unit testing. Here's the first version of a unit test:

public void testParser0()
    throws ParseException
{
    String test =
        "let ?a <- select ?a where {?a dc:creator \"Ian Dickinson\"} ;"
    StringReader r = new StringReader( test );
    EmbeddedLangTest t = new EmbeddedLangTest( r );
    t.testEmbedded();
}

This test fails because the dc:creator predicate uses an undefined prefix. We can fix the query source in the test:

public void testParser0()
    throws ParseException
{
    String test =
        "let ?a <- PREFIX dc: <http://purl.org/dc/elements/1.1/>" +
        " select ?a where {?a dc:creator \"Ian Dickinson\"} ;"
    StringReader r = new StringReader( test );
    EmbeddedLangTest t = new EmbeddedLangTest( r );
    t.testEmbedded();
}

However, while this works it looks a bit clunky. It would be nice if the language we're defining, which contains SPARQL as a sub-language, could have its own prefix defining mechanism. Fortunately, we can side effect the Query object to tell it about known prefixes before we start parsing. A greatly simplified version is shown in the example code below, and tested in the second unit test.

The second wrinkle is that, when the test is run, it fails! The failure is not on the SPARQL parsing, but on the trailing semi-colon after the query. What's happening is that the sub-language parser has to run one character past the end of its language to know that it has completed its job. That means that when the outer parser resumes processing, the first character after the embedded expression is missing. Fortunately, JavaCC makes it easy to put back the one-character overrun with jj_input_stream.backup(1). So the finished grammar looks like this:

PARSER_BEGIN(EmbeddedLangTest)
package org.nuin.script.parser;

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParser;
import com.hp.hpl.jena.query.lang.sparql.SPARQLParserTokenManager;

public class EmbeddedLangTest {
    /**
     * Delegate parsing to a nested SPARQL parser, but one that will
     * use the JavaCharStream that the current parser is using
     * @return A Jena query object parsed according to the SPARQL grammar.
     * @exception org.nuin.script.parser.ParseException if parsing fails
     * on the given input.
     */
    protected Query delegate()
        throws ParseException
    {
        // create a SPARQL parser that we will delegate to
        JavaCharStreamAdapter a = new JavaCharStreamAdapter( jj_input_stream );
        SPARQLParserTokenManager tm = new SPARQLParserTokenManager( a );
        SPARQLParser sp = new SPARQLParser( tm );

        // the parser fills in an empty query template
        Query q = new Query();
        sp.setQuery( q );
        setPrefixes( q );

        try {
            // parse to the end of the query
            sp.Query();
            q.setSyntax(Syntax.syntaxSPARQL);

            // push back the final char, which the sparql parser will have
            // consumed to know that the query was finished
            jj_input_stream.backup(1);
        }
        catch (com.hp.hpl.jena.query.lang.sparql.ParseException e0) {
            ParseException e1 =
              new ParseException( "Error reported by SPARQL parser: " +
                                  e0.getMessage() );
            throw (ParseException) e1.initCause( e0 );
        }

        return q;
    }

    /**
     * Set up some well-known prefixes for the query, to avoid having to
     * define them in the query itself. Obviously, this would ideally be
     * sensitive to the execution context.
     */
    protected void setPrefixes( Query q ) {
        q.setPrefix( "foaf", "http://xmlns.com/foaf/0.1/" );
    }
}

PARSER_END(EmbeddedLangTest)

SKIP :
{
   " "
|  "\r"
|  "\t"
|  "\n"
}

TOKEN :
{
    < ASSIGN: "<-" >
  | < SEMICOLON: ";" >
  | < LET: "let" >
}

TOKEN : /* Identifiers */
{
      < IND_VARIABLE: "?" <WORD> >
    | < WORD: <LEADING> (<NORMAL>)* >
    | < #LEADING: <UPPER> | <LOWER>
    | < #NORMAL: <UPPER> | <LOWER> | <DIGIT>
    | < #UPPER: ["A"-"Z"] >
    | < #LOWER: ["a"-"z"] >
    | < #DIGIT: ["0"-"9"] >
}

/* This is the production we will use to test language embedding */
void testEmbedded() : {}
{
  <LET> <IND_VARIABLE> <ASSIGN> sparql() <SEMICOLON>
}

void sparql() :
{
  Query query = null;
}
{
  {query = delegate();}
}

And the final test cases are:

    public void testParser0()
        throws ParseException
    {
        String test =
            "let ?a <- PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
            "select ?a where {?a dc:creator \"Ian Dickinson\"} ;";
        StringReader r = new StringReader( test );
        EmbeddedLangTest t = new EmbeddedLangTest( r );
        t.testEmbedded();
    }

    public void testParser1()
        throws ParseException
    {
        // built-in prefix mapping
        String test =
            "let ?a <- select ?a where {?a foaf:name \"Ian Dickinson\"} ;";
        StringReader r = new StringReader( test );
        EmbeddedLangTest t = new EmbeddedLangTest( r );
        t.testEmbedded();
    }

Conclusion

This method of integrating two JavaCC-based grammars relies on delegating the parsing of the input stream from the main parser to the sub-language parser. The delegation is achieved by sharing a single input source, via the JavaCharStream input manager. Since the manager classes are created in different packages, we need to use adapter pattern to allow the outer parser's JavaCharStream object to be used by the sub-language parser. Of course, I haven't shown what to do with the newly parsed query, but that's outside the scope of this brief article.