... is a software toolkit for semi-automatic
grammar recovery. The toolkit goes beyond the basic idea
of obtaining a relatively correct and complete grammar from a language
reference: GRK also addresses reverse engineering and re-engineering of the
language reference itself. This is demonstrated for IBM's VS Cobol II
by deriving an
authoritative definition of VS Cobol II from IBM's application
programming reference semi-automatically. This authoritative
definition comprises all the text from IBM's reference as well as
comments on the performed adaptations. Furthermore, GRK provides some
simple means for grammar deployment, that is, it derives actual
parsers. In the case of VS Cobol II, two parsers are derived: a slow,
Prolog-based parser, and a fast, BTYACC-based parser (the latter with
the help of GDK).
Because of the potentially different kinds of documents that one can
use for grammar recovery, there is no final answer to the question of
tool support. The present GRK includes a number of tools that are of
general use, e.g., a transformation tool for grammars, but several
other tools are specific to IBM standards such as the VS Cobol II
application programming reference, e.g., the tool for diagram
extraction.
GRK is being developed by Ralf
Lämmel at the Free University, Amsterdam and CWI. It is free
software. Version 1.0 was released on June 4, 2003. GRK is implemented
in SWI-Prolog and gmake is used to glue together all components. The
GRK functionality can be seen as a careful implementation of the
functionality that Ralf Lämmel and Chris Verhoef describe in
their paper on "Semi-Automatic
Grammar Recovery" which appeared in SP&E in 2001. GRK goes
beyond that approach in that it also supports document re-engineering.
GRK is part of a larger effort at VU & CWI in Amsterdam on what we
call engineering of grammarware.
GRK tools
dia2ast --- a parser for syntax diagrams as they appear in
IBM documents.
ast2dia --- a pretty printer for syntax diagrams.
fst --- a framework for syntax transformations.
ast2ebnf --- a pretty printer to represent GRK's grammars as EBNF.
ast2lll --- an export tool to represent GRK's grammars in GDK's LLL format.
ast2dcg --- generation of (Prolog-like) DCG from GRK's grammars.
reduced --- a checking tool for reduced/terminated grammars.
prepare --- preparation of IBM's VS Cobol II document.
extract --- extraction of diagrams from IBM's VS Cobol II document.
inline --- reinsertion of revised diagrams and comments into IBM's VS Cobol II document.
parser --- a stub for a DCG-based (say, Prolog-based) validation parser for Cobol.
lexical --- a Prolog implementation of a scannerless lexer for Cobol.
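To make the idea behind a check such as "reduced" more concrete, here is a minimal sketch in Python of what checking a grammar for being reduced and terminated can involve. It is not GRK's implementation (GRK is written in SWI-Prolog). The grammar representation, the quoting convention for terminals, and all function names are assumptions made for this illustration, and the toy grammar is not taken from IBM's reference.

```python
from typing import Dict, List, Set

# Hypothetical representation: each nonterminal maps to a list of
# alternatives, and each alternative is a list of symbols.
Grammar = Dict[str, List[List[str]]]


def is_terminal(symbol: str) -> bool:
    # Assumed convention for this sketch: terminals carry double quotes.
    return symbol.startswith('"')


def undefined(grammar: Grammar) -> Set[str]:
    """Nonterminals that are used somewhere but never defined."""
    used = {s for alts in grammar.values() for alt in alts
            for s in alt if not is_terminal(s)}
    return used - set(grammar)


def reachable(grammar: Grammar, start: str) -> Set[str]:
    """Nonterminals that can be reached from the start symbol."""
    seen: Set[str] = set()
    todo = [start]
    while todo:
        name = todo.pop()
        if name in seen:
            continue
        seen.add(name)
        for alt in grammar.get(name, []):
            todo.extend(s for s in alt if not is_terminal(s))
    return seen


def terminating(grammar: Grammar) -> Set[str]:
    """Nonterminals that can derive at least one string of terminals."""
    done: Set[str] = set()
    changed = True
    while changed:  # fixpoint iteration
        changed = False
        for name, alts in grammar.items():
            if name not in done and any(
                all(is_terminal(s) or s in done for s in alt) for alt in alts
            ):
                done.add(name)
                changed = True
    return done


def check_reduced(grammar: Grammar, start: str) -> Dict[str, Set[str]]:
    """Report the three kinds of problems; all sets empty means 'reduced'."""
    return {
        "undefined": undefined(grammar),
        "unreachable": set(grammar) - reachable(grammar, start),
        "nonterminating": set(grammar) - terminating(grammar),
    }


# Toy grammar for illustration only.
example: Grammar = {
    "add-statement": [['"ADD"', "operand", '"TO"', "operand"]],
    "operand": [["identifier"], ["literal"]],
    "literal": [['"ZERO"']],
    "loop": [["loop", '"X"']],
    "orphan": [['"Y"']],
}
report = check_reduced(example, "add-statement")
```

For the toy grammar, the report lists "identifier" as undefined, "loop" and "orphan" as unreachable, and "loop" as nonterminating. During recovery, such a report indicates which definitions still have to be completed or connected.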
The VS Cobol II case
The distribution includes a version of IBM's application
programming reference for VS Cobol II as downloaded from the freely
accessible IBM
BookManager® BookServer Library. Several other documents and files
are generated from this document. There are the following steps:
Preparation: IBM's document is prepared to establish a number of
notational and markup assumptions.
Extraction: the syntax diagrams are extracted from IBM's document.
Parsing: the syntax diagrams are parsed to an enriched EBNF format.
Recovery transformations: the EBNF is refactored, corrected and completed.
Inlining: a revision of IBM's document is generated.
Deployment transformations: a Prolog-like DCG is generated as a kind of
validation parser.
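The recovery transformation step can be pictured as running a script of small grammar operators, each guarded by a precondition. The following Python sketch illustrates that style under stated assumptions: the operator names (define, replace, rename), the grammar representation, and the toy grammar are hypothetical and do not reproduce the operators or scripts shipped with GRK.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical representation: nonterminal -> list of alternatives,
# where each alternative is a list of symbols.
Grammar = Dict[str, List[List[str]]]


class TransformationError(Exception):
    """Raised when an operator's precondition does not hold."""


def copy(grammar: Grammar) -> Grammar:
    return {n: [list(a) for a in alts] for n, alts in grammar.items()}


def define(grammar: Grammar, name: str, alts: List[List[str]]) -> Grammar:
    """Completion: add a definition for a nonterminal that has none yet."""
    if name in grammar:
        raise TransformationError(f"{name} is already defined")
    result = copy(grammar)
    result[name] = [list(a) for a in alts]
    return result


def replace(grammar: Grammar, name: str,
            old: List[str], new: List[str]) -> Grammar:
    """Correction: replace one alternative of `name` by another one."""
    if old not in grammar.get(name, []):
        raise TransformationError(f"{name} has no alternative {old}")
    result = copy(grammar)
    result[name] = [list(new) if a == old else a for a in result[name]]
    return result


def rename(grammar: Grammar, old: str, new: str) -> Grammar:
    """Refactoring: rename a nonterminal consistently."""
    symbols = set(grammar) | {s for alts in grammar.values()
                              for a in alts for s in a}
    if old not in grammar:
        raise TransformationError(f"{old} is not defined")
    if new in symbols:
        raise TransformationError(f"{new} is already in use")
    return {
        (new if n == old else n): [[new if s == old else s for s in a]
                                   for a in alts]
        for n, alts in grammar.items()
    }


Step = Tuple[Callable[..., Grammar], Sequence]


def run_script(grammar: Grammar, script: List[Step]) -> Grammar:
    """Apply the operators in order; the first failing precondition aborts."""
    for operator, args in script:
        grammar = operator(grammar, *args)
    return grammar


# Toy input and script for illustration only.
raw: Grammar = {"add": [["ADD", "operand", "TO", "operand"]]}
script: List[Step] = [
    (define, ("operand", [["identifier"], ["literal"]])),
    (replace, ("add", ["ADD", "operand", "TO", "operand"],
               ["ADD", "operand", "TO", "identifier"])),
    (rename, ("add", "add-statement")),
]
recovered = run_script(raw, script)
```

Because every operator checks its precondition, a script that no longer matches the extracted grammar fails loudly instead of silently producing a wrong result. This matters when the preparation or extraction steps are re-run on a changed document.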
The result of preparation is called the reverse-engineered
IBM document; it is useful on its own as a self-contained
version of IBM's application programming reference with a
normalised notation. For example, all diagrams are labelled with
appropriate names, and notational anomalies have been eliminated. The
result of all steps up to inlining is called the re-engineered IBM
document because several transformations were applied to the
syntax diagrams. All the transformation scripts are included in the
distribution.
The installation of GRK and SWI-Prolog is really trivial.
The generated Prolog-based Cobol parser can be readily used.
Using the fast, BTYACC-based parser relies on the following:
Going for plain YACC (say, LALR(1)) would be a real hassle.
BTYACC is a good reference technology for the plain parsing part.
Generalised parsing: no conflicts, but maybe too many ambiguities.
You might want to use different languages for frontends.
You might want to use open source, GPL, etc.
Parser generators lack good means of customisation.
Generate parser generator inputs; don't write them manually.
Think of the abstract syntax as well, including automation.
Typing pays off for larger syntaxes.
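The advice to generate parser generator inputs rather than write them manually can be illustrated with a small Python sketch that prints a yacc-style grammar file from a grammar held as data. This is not how GRK or GDK produce BTYACC input. The representation, the function names, and the toy grammar are assumptions for illustration, and only plain yacc notation (%token, %start, %%) is emitted.

```python
from typing import Dict, List

# Hypothetical representation: nonterminal -> list of alternatives,
# where each alternative is a list of symbols. Symbols without a
# definition are treated as tokens in this sketch.
Grammar = Dict[str, List[List[str]]]


def yacc_name(symbol: str) -> str:
    # yacc identifiers cannot contain hyphens, so map them to underscores.
    return symbol.replace("-", "_")


def to_yacc(grammar: Grammar, start: str) -> str:
    """Render the grammar as the text of a yacc-style grammar file."""
    tokens = sorted({yacc_name(s) for alts in grammar.values()
                     for alt in alts for s in alt if s not in grammar})
    lines: List[str] = []
    if tokens:
        lines.append("%token " + " ".join(tokens))
    lines.append("%start " + yacc_name(start))
    lines.append("%%")
    for name, alts in grammar.items():
        # An empty alternative is rendered as a comment, as is common in yacc.
        bodies = [" ".join(yacc_name(s) for s in alt) or "/* empty */"
                  for alt in alts]
        lines.append(yacc_name(name))
        lines.append("    : " + "\n    | ".join(bodies))
        lines.append("    ;")
    return "\n".join(lines) + "\n"


# Toy grammar for illustration only.
toy: Grammar = {
    "add-statement": [["ADD", "operand", "TO", "operand"]],
    "operand": [["IDENTIFIER"], ["LITERAL"]],
}
yacc_text = to_yacc(toy, "add-statement")
```

Keeping the grammar as data and regenerating the parser generator input after every transformation avoids a hand-edited grammar file drifting away from the recovered grammar.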
Acknowledgements
I am grateful for the collaboration with Jan Kort on the subject of
providing tooling for treating grammars as engineering artifacts.
This activity contributes to an overall effort on engineering of
grammarware. In this context, I am grateful for collaboration with
Paul Klint, Steven Klusener, and Chris Verhoef. I am also very
grateful for discussions and comments from colleagues, and I apologise
for any omission in the following list: Mark van den Brand, Jim Cordy,
Jan Heering, Niels Veerman, Ernst-Jan Verhoeven, Joost Visser.
Page last updated November 26, 2003.
Send your email remarks to Ralf Lämmel.