Alex Sellink and Chris Verhoef
University of Amsterdam,
Programming Research Group
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
alex@wins.uva.nl, x@wins.uva.nl
When a compiler is designed carefully, it is possible to extract its grammar. We reengineer the extracted grammar into one that is geared towards reengineering. From this reengineering grammar we generate an architecture called a software renovation factory. This includes generic analysis and transformation functionality and a native pattern language using the concrete syntax of the language for which renovation is necessary. Moreover, we generate the grammar in HTML format so that reengineers can quickly understand the language. We applied our approach successfully to an exceptionally complex and large proprietary language. Our approach enables rapid development of software renovation factories, and we believe it can partly remedy the lack of Year 2000 tool support for many languages.
Categories and Subject Description:
D.2.6 [Software Engineering]: Programming Environments--Interactive;
D.2.7 [Software Engineering]: Distribution and
Maintenance--Restructuring;
D.3.4 [Processors]: Parsing.
Additional Key Words and Phrases: Reengineering, System renovation, Software renovation factories, Language description development, Grammar reengineering, Document generation, Computer aided language engineering (CALE).
In various papers on the Year 2000 problem we can read that outside the mainstream languages Year 2000 tool support is virtually nonexistent. In [40] Newcomb and Scott argue that the Year 2000 problem is an extraordinarily complex and pervasive task that becomes unmanageable without automated tool support. Moreover, they claim that language-independent tool support cannot assess complex situations, and that even the best tools break down on multi-language systems, embedded sublanguages, and embedded data-manipulation languages. Capers Jones, in his recent book on the Year 2000 problem, confirms the findings of Newcomb and Scott. In his book we can find that 30 percent of the software is written in hundreds of proprietary or obscure languages for which there are only a few trained programmers. Even worse, for some of these languages there is no course or tutorial material [27, loc. cit. p. 86]. Jones mentions the example of CORAL66 and CHILL, used in telecommunications, which will be very expensive to repair due to the shortage of tools and the lack of available programming talent. Indeed, this is a serious problem. In [44] Rekdal reports that there are 12 to 15 thousand CHILL programmers worldwide. Combine this with the fact that telecommunication software is complex, large in size (1 to 10 MLOC of CHILL, or some other switching language), in continuous development, and that no single person has full understanding of the whole system [44]. Rekdal estimates that 45 percent of the digital local lines installed have been implemented using CHILL. Jones estimates that 20 percent of the telephone switching systems are implemented using CHILL and that there are a few thousand CHILL programmers, so this is the same order of magnitude.
The mixed-language issue that Newcomb and Scott [40] mention is also confirmed by Jones: in [27] it is stated that many of the Year 2000 search engines come to a halt when multiple programming languages are used for an application. About 30 percent of the US software applications contain at least two languages; twelve languages inside one application is the maximum that Jones has found among his clients. Preliminary data indicate an increase of 5 percent in the costs of Year 2000 repair efforts for each extra language [27, loc. cit. p. 50].
Needless to say, we need rapid development of tools that help in analysis and transformation. The bottleneck for rapid development of such tools is constructing the parser and its connection to existing back-ends. In [40] an automated correction assistant is discussed, for which identity transformations are generated automatically. The user can modify them in order to define corrections, changes, or translations. In [6] the generation of generic transformations and analyzers from arbitrary context-free grammars is discussed. So, if we have a grammar at our disposal, we can generate full tool support that enables automated software renovation. Many papers report that the construction of grammars is a huge problem. During the sixth workshop on program comprehension it was stated that parsers and their connectivity are a key issue for the Year 2000 problem. In [40] it is stated that organizations should be aware that grammar development is potentially time-consuming. If we combine this with Jones's remark that there is a lack of course material, it is imaginable that it will be difficult to develop tool support for solving Y2K problems in the obscure 30 percent. Furthermore, Jones reports Year 2000 search engine support for fewer than 50 languages, and Year 2000 repair engines are available for about 10 languages [27, loc. cit. p. 325].
To give the reader an idea of the usefulness of our approach: we were able to parse a 10 KLOC example program sector in our reengineering environment one hour after we had migrated the original compiler sources from our customer. Upon first try the 10 KLOC program sector parsed without any errors. In half a day we generated about 3000 context-free production rules for eleven sublanguages and three accompanying lexical modules. As a comparison, in [8] we measured a productivity of 300 production rules per month for the construction of COBOL grammars by hand. The code we had at our disposal (+8 MLOC) parsed upon first try, except for one file. We inspected the generated HTML version of the grammar and concluded that the file could not be correct. Ericsson later reported to us that the file was a compiler test file that was not supposed to parse.
In [21] it is stated that it is difficult, if not impossible, for code analyzers to employ the front-end of a compiler. They discuss their experience with building a front-end for C++ from scratch. In [46] it is also noted that reusing existing parsers is very hard. They report on the lessons learned when they tried to adapt a reverse engineering tool to a dialect of the language and an embedded sublanguage. Their recommendation is to separate parsing and analysis. In [18,17] this problem was also recognized: a special-purpose language (called GENII), designed to convert typical parser output, is used to make converters to a format that is known by an analyzer generator tool called GENOA. It takes 1,600 LOC of GENII code to connect an existing front-end of C++ to GENOA. A serious limitation of this tool is that it cannot be used to change code. Another limitation is that the grammar is not geared towards reengineering; the simplest aspect of this to explain is that a parser for a compiler removes comments. This is not too much of a problem for analysis, for which GENOA is designed, but it is for software renovation. In [21] this is the reason that the C++ grammar is reimplemented from scratch. So the results of our case study are completely orthogonal to the results that are reported on in the literature.
In the realm of natural language processing, the problems with large and complex grammars are also well-known. In [39] we can read that formal grammars for natural languages can become unmanageable when they evolve. In [10] maintenance problems with large and evolving grammars for reengineering purposes are addressed. In [39] it is noted that grammars are usually written by a small group of people, who are the only ones able to understand their structure. If they leave, the grammar cannot be maintained anymore, since its size and complexity prevent anyone else from gaining insight into it [39]. Therefore, in [39] tool support is provided for affix grammars [33] over a finite lattice for natural language. It consists of modularity support, analysis support for grammars, random test sentence generation, and a parser generator with trace facilities. The tool support was developed in the natural language front-end project, which aims at the construction of comprehensive and well-validated grammars and efficient parsers for a number of European languages. So their focus is on the development of grammars for natural languages; our focus is on the reengineering of existing grammars for computer languages. Noteworthy, perhaps, is that we use parser generator technology (GLR parsing) for our parsers that also stems from natural language processing [35,53,45,54]; see [10] for an elaborate discussion.
In order to get an idea of CALE tools we briefly mention a few that we have developed in our CALE factory. All language descriptions are written in a programming language themselves. Most of the time this is a dialect of BNF [2]. Since there are many dialects of BNF, it is not a good idea to develop tool support for every single dialect. Even within one compiler it is possible to have more than one BNF dialect (in our case study we encountered about six different dialects). Therefore, one set of useful CALE tools consists of translators that translate from some BNF dialect to one specific, very extended BNF. For our case study we needed about six such translators plus accompanying parsers for the BNF dialects. Once the dialects are translated one-to-one to our dedicated BNF, we have other classes of CALE tools at our disposal. One important class is tools to assess the quality of grammars. We developed simple but very useful quality assessment tools: tools that list the top sorts (defined but not used sorts), the bottom sorts (used but not defined sorts), all the sorts, all the keywords, the double rules, and so on. When a grammar has one top sort and a few bottom sorts, the quality of the grammar is excellent. Specifically, if the bottom sorts are lexical sorts, the chance that this grammar can be reused without human intervention is about 100 percent. In our case study most of the grammars had one top sort and only 3 to 5 bottom sorts, all of which were lexical sorts. In all cases the bottom sorts of the context-free grammars happened to be top sorts in the lexical syntax modules. Therefore, we could construct working parsers without human intervention using our tools. Of course, the parsers are then not yet geared towards reengineering purposes. We use a variety of other tools as well, e.g., several tools that recognize simulated list constructs; we transform those into more natural grammar constructs suited for reengineering.
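The core of such a quality assessment tool is small. The following Python sketch is our own illustration, not the authors' implementation; the function name and the rule encoding (SDF-style pairs of body and defined sort) are assumptions made for the example:

```python
# Sketch of a grammar quality check: rules are (rhs, lhs) pairs in the
# SDF style used in this paper (the body on the left, the defined sort
# on the right). Top sorts are defined but never used; bottom sorts are
# used but never defined and must be supplied by, e.g., lexical modules.

def top_and_bottom_sorts(rules):
    defined = {lhs for _, lhs in rules}
    used = {sort for rhs, _ in rules for sort in rhs}
    return defined - used, used - defined

# Toy grammar: Program is the single top sort, Id a lexical bottom sort.
rules = [
    (["Stat"], "StatList"),
    (["Stat", "StatList"], "StatList"),
    (["Id", "Expr"], "Stat"),
    (["Id"], "Expr"),
    (["StatList"], "Program"),
]
top, bottom = top_and_bottom_sorts(rules)
print(top)     # {'Program'}
print(bottom)  # {'Id'}
```

A grammar that scores one top sort and only lexical bottom sorts passes exactly the reusability check described above.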
There is also an assembly line that migrates BNF to SDF (short for Syntax Definition Formalism), for which sophisticated parser generation support is available [24]. We also developed tools for generating HTML from SDF. Some of the tools mentioned above have been defined and used in [49]; some others are new.
In the sequel we describe a computer aided language engineering process in which we use the abovementioned tools, so that their usage becomes clearer.
The first step in the process is to reverse engineer the compiler source code. If the compiler is designed carefully, it is a simple task to locate the source code of the parser. Parser generator technology is well-known in compiler design (see, e.g., [1]). One of the most well-known parser generators is Yacc [25], in combination with a lexical analyzer called Lex [36]. So tracing the source files that are responsible for the parser is simpler when the reverse engineer knows the general design of compilers. A simple lexical search for keywords of the language reveals those files immediately, as does looking for files that end in .y. In our case a simple search for a few keywords of the language revealed the entire parser part. Since a proprietary parser generator was used, looking for .y files would not have worked. If another technology is used, like recursive descent parsing, then another tool is needed to reverse engineer the grammar. The basic principle stays the same: extract the grammar from the source code of the compiler; only, the knowledge is not as explicit in the code.
When the portion of the compiler that contains the grammar is detected, we have to extract what we call the business rules. In this study the business rules are those parts of the compiler that we need to construct a parser geared towards reengineering targets. Usually, such files contain a dialect of BNF [2] for the recognition of certain language constructs, plus code that deals with what to do with the code after recognition. Such additional code is called a semantic action. In a compiler such semantic actions are used for building the parse tree. They can also be used for other calculations, such as building certain reference tables, type checking, partial control-flow analysis, partial data-flow analysis, and such. We do not consider the semantic actions business rules: we want to know at an abstract level what the syntax of the language is, not what the intermediate tree format used in the compiler for code generation looks like. Note that using the approach taken in [18,17] the semantic actions cannot be ignored: the format of the parse tree needs to be known so that a conversion to GENOA can be specified in the GENII formalism. After all, the output of the parser is reused there, not its grammar. In other approaches, too, the semantic actions cannot be ignored. Since they are often idiosyncratic to the compiler, reuse then becomes more difficult. This clarifies the problems with parser reuse that are reported on in the literature.
To make the idea of business rules a bit more concrete we provide some source code from a textbook on Lex and Yacc [37]:
expression: expression '+' NUMBER { $$ = $1 + $3; }
| expression '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;
This code implements a simple calculator. As can be seen, the grammar and what to do with it later on are totally intertwined. This prohibits reuse of its output for other applications. To migrate this parser it is not necessary to know the semantic actions--in this example, the calculations. So the business rule (for our purpose) is:
expression: expression '+' NUMBER
| expression '-' NUMBER
| NUMBER
;
We can extract these rules automatically. This is done as follows: we specify the grammar of the parser generator code and generate a parser to parse the parser generator code that defines the parser. Note that this is an example of self-application. Let us explain this idea in more detail, since it is the key to our approach. The code that contains the business rules has a grammar itself. We can express this grammar, too, in a parser generator language. This yields a parser that parses the code containing the business rules. We can specify the semantic actions to be LAYOUT (see below for a definition); then, after parsing the parser generator code, the semantic actions are ignored. We could call this nonstandard parsing. The quality of the code that is available in the compiler is a measure of the time it takes to write such a parser. For high-quality code it takes about 15 minutes to write a parser that extracts the business rules. Let us give an example of a simple Yacc grammar in SDF [24] that parses the above example. The specification below is executable, and there is no need to specify semantic actions since the support platform we use, the ASF+SDF Meta-Environment [31], automatically generates the abstract syntax. Hence, the bare BNF we extract is enough to generate a parser.
exports
sorts
Yacc-Terminal Yacc-NonTerminal Yacc-Element
Yacc-Rule Yacc-Program Yacc-Elements
lexical syntax
[ \t\n] -> LAYOUT
"/*" Inside-comments* "*/" -> LAYOUT
"{" Inside-action* "}" -> LAYOUT
~[*] -> Inside-comments
[*] ~[/] -> Inside-comments
~[}] -> Inside-action
"'" ~[']* "'" -> Yacc-Terminal
[A-Z]+ -> Yacc-Terminal
[a-z]+ [a-zA-Z0-9]* -> Yacc-NonTerminal
context-free syntax
Yacc-Rule+ -> Yacc-Program
Yacc-NonTerminal ":" { Yacc-Elements "|" }+ ";"
-> Yacc-Rule
Yacc-Element* -> Yacc-Elements
Yacc-Terminal -> Yacc-Element
Yacc-NonTerminal -> Yacc-Element
We did not depict the above code in order to explain the Yacc parser in detail, but some specific aspects deserve attention. The LAYOUT consists of spaces, tabs, and newline characters, of comments in C [30], and of everything inside actions (which is between curly braces). As soon as we parse a file containing Yacc rules, semantic actions, and C comments, all that remains are the business rules. We can also generate a pretty printer automatically from the Yacc grammar, using technology described in [11]. So after pretty printing we have the business rules. Of course, this can be done using many other technologies and tricks as well. The important point is that reusing the parser code instead of its output makes the difference between easy reuse and no reuse at all.
We note that the definition of Inside-action is rather simple: everything but a closing brace. When the quality of the parser generator code is high, all semantic actions are abbreviated as function calls, so that the grammar structure is clearly visible. This implies that the 15-minute implementation suffices. When the quality of the parser generator code is low, the semantic actions often contain a lot of target source code (C in the case of Yacc). If that is the case, our parser should be more sophisticated: for instance, the braces must match properly. If the code contains unbalanced braces, it might even be necessary to incorporate large parts of the C grammar. Fortunately, SDF is modular, so it is possible to import those parts and use a combined Yacc plus C grammar as the basis for tools to extract the grammar. When you do not have access to modular parsing algorithms, an occasional hack will also help. Jeromy Carrière [13] suggested preprocessing the parser generator code by making the braces context-free distinguishable using a simple lexical Perl script. Again, as can be seen, many ways lead to the goal: extracting the bare BNF. In our case the approach is systematic, repeatable, and reliable. Summarizing, high-quality compiler source code leads to instantaneous BNF extraction; if the quality is lower, we have to put more effort into the reverse engineering task. Either way, the BNF can be extracted relatively easily if we compare it to making a grammar by trial and error using source code and reference manuals only.
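To illustrate the brace-matching variant, here is a small Python sketch of our own (not the Perl script mentioned above) that removes balanced {...} action groups from a Yacc rules section. It deliberately ignores braces inside string literals and comments, which a real tool would have to handle:

```python
# Strip semantic actions from Yacc rules by skipping balanced {...}
# groups. Simplification: braces inside 'literals' and /*comments*/
# are not handled in this sketch.

def strip_actions(src):
    out, depth = [], 0
    for ch in src:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth = max(depth - 1, 0)
        elif depth == 0:
            out.append(ch)
    return ''.join(out)

rule = ("expression: expression '+' NUMBER { $$ = $1 + $3; }\n"
        "          | NUMBER { $$ = $1; }\n"
        "          ;")
print(strip_actions(rule))   # the bare business rule, actions removed
```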
Of course, one could argue: write a robust Yacc parser so that every Yacc file can be parsed, and thus the business rules are immediately at your disposal. True, but there are many variations on Lex and Yacc; the number of BNF dialects is impressive. It is our experience that for every parser in a compiler it is necessary to write at least one BNF parser. In our case study we had more than six BNF dialects in one compiler. In the next section we explain what we do with the obtained business rules.
In order to migrate the parsers we translate the obtained business rules into SDF. We do this so that we can use the grammar in our implementation platform, which will form the software renovation factory in the end. As said earlier, there is a myriad of BNF dialects, so it is not wise to specify these migrations over and over again for each new compiler. Therefore, we defined a domain-specific language called EBNF (for Extended BNF), for which we made the translation and a number of assessment tools. When we migrate a new compiler we simply write a one-to-one translation of the dialect BNF to our EBNF. So for Yacc there is a tool called Yacc2EBNF. The construction of those tools is simple. We discuss the top and bottom sort detectors in more detail, since they help in further reverse engineering of the compiler: they make sure that we do not miss grammars.
At this point we are able to test the results and parse code in the support platform we use (the ASF+SDF Meta-Environment [31]). In our case study we used +8 MLOC of SSL code to test the parser. We first tested all sector grammars in isolation. We extracted from the SSL code the sectors and tested whether those sectors parsed. To give an indication, we could parse about one MLOC of the most important and large sector in an hour without a single error upon first try. We used this test suite in order to perform regression tests on consecutive modifications to the grammar. We discuss those changes in the next section.
At this point in the process we have a parser in a new environment. However, it is geared towards compilation, whereas we need a parser that is convenient for reengineering. In this section we discuss the issues that play a role in retargeting the parser source code.
First of all, we should provide some intuition as to what a reengineering grammar is. In our situation we focus on a grammar that preserves comments, supports scaffolding, allows code patterns with native variables, is modular, and is extended with syntax that eases the construction of reengineering tools.
We briefly discuss these properties. For a start, a compiler does not use comments, since they are no-ops. For a reengineer, however, it is not possible to just throw comments away. This means that we need to deal with them in a way similar to the executable code: we treat them as part of the language. Furthermore, we also wish to insert certain code ourselves during sophisticated reengineering tasks. Therefore, we adapt the grammar so that this extra code can be inserted and later on parsed, so that other tools can use the information themselves. In fact, we scaffold the code as is common during development [51]. Furthermore, we need to recognize certain code patterns. Therefore, it is convenient when we can speak of, e.g., zero or more statements. We also need variables in order to denote certain pieces of the code; this we do as natively as possible [50]. An example is the variable Statement-1*, which matches zero or more arbitrary statements. Then we wish to modularize the grammars, in order to deal with dialects, extensions, and other modifications. If the grammar is not modular, this invites a lot of maintenance problems [10]. Finally, we extend the grammar with syntax (and semantics) so that the construction of reengineering tools becomes easy. As can be seen, none of the above features is available in the source code of the parser of a compiler.
Next we explain the steps necessary to turn a grammar into a reengineering grammar. We use between 20 and 30 separate tools to arrive at this situation; we only discuss some main steps and tools. The first step is to restructure the BNF that was extracted from the compiler source code. Such code is not always optimal. In fact, in the case of compilers, the grammar was mostly first available in a natural form, like a standard or a manual, and then squeezed into Yacc-like formalisms, something we informally call yaccified. What we basically do is deyaccify the code. A compiler grammar often contains so-called dummy sorts. A standard way to force an action in Yacc is to use such dummy sorts. They are in the grammar at certain places only to force special actions; the dummy sorts only ring a bell that something special should be done next. Dummy sorts do not produce any syntax. We do not rely on LALR(1) parsing, so we do not need such dummy sorts. Moreover, the semantic actions are generated automatically by the ASF+SDF Meta-Environment, so we do not have to deal with this level at all. Since we are not interested in the semantic actions, we can safely remove the dummy sorts. Another phenomenon that occurs frequently is that certain rules are chains: A -> B is a chain if there are no other productions for B. This implies that we can eliminate A in favor of B (or vice versa). The removal of dummy sorts can also create chains. If we remove dummies and chains, we can create double production rules: rules that become the same after the first steps. So we remove those as well. These--and more--steps are performed automatically, using our CALE factory. Now we have massaged the BNF code in such a way that we can swap syntax in a one-to-one fashion: we turn the BNF code into SDF code. However, this SDF code is far from optimal for reengineering purposes, so we perform another step: we restructure the SDF code. We mention the introduction of lists in the grammar.
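Two of these deyaccification steps, chain elimination and the subsequent removal of doubles, can be sketched in a few lines of Python. This is our own simplified rendering, not the CALE-Factory implementation: it performs a single pass, so transitive chains would need iteration, and rule bodies are encoded as plain tuples of sort names:

```python
# Deyaccification sketch: rules are (rhs, lhs) pairs, rhs a tuple of
# sorts, written SDF-style (body -> defined sort). A chain here is a
# sort defined by exactly one production whose body is a single sort;
# we eliminate it and then remove the duplicate rules this creates.

def remove_chains_and_doubles(rules):
    defs = {}
    for rhs, lhs in rules:
        defs.setdefault(lhs, []).append(rhs)
    alias = {lhs: bodies[0][0] for lhs, bodies in defs.items()
             if len(bodies) == 1 and len(bodies[0]) == 1}
    out, seen = [], set()
    for rhs, lhs in rules:
        if lhs in alias:                            # drop the chain rule
            continue
        rhs = tuple(alias.get(s, s) for s in rhs)   # substitute the alias
        if (rhs, lhs) not in seen:                  # drop doubles
            seen.add((rhs, lhs))
            out.append((rhs, lhs))
    return out

rules = [(("Expr",), "Term"),              # chain: Term is just Expr
         (("Term", "+", "Term"), "Sum"),
         (("Expr", "+", "Expr"), "Sum")]   # a double after rewriting
print(remove_chains_and_doubles(rules))    # [(('Expr', '+', 'Expr'), 'Sum')]
```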
We give an example that we took from our case study. We abstracted from the real sort names. In the original code we found:
A B -> A
A C -> A
B -> A
As can be seen this is a list that starts with a B and then is followed by zero or more Bs or Cs. We turn this into:
B -> B-C
C -> B-C
B B-C* -> A
Here we first define the sort B-C, meaning a B or a C. Then we have the production rule stating that first we want a B and then zero or more B-Cs, expressed by the asterisk. As can be seen, we recognized a certain pattern that simulates a list construct in Yacc, and we turned it into a real list construct. Other list constructs are recognized as well, including lists with and without separators, and lists with zero or more or with one or more occurrences.
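The recognition of this particular pattern can be sketched as follows. The Python below is our own illustration under a simplified rule encoding; the real tools handle many more list shapes, whereas this sketch matches only the exact three-rule shape of the example above:

```python
from collections import defaultdict

# Detect the simulated list {A B -> A, A C -> A, B -> A} and replace it
# by {B -> B-C, C -> B-C, B B-C* -> A}, as in the example above.

def introduce_list(rules):
    defs = defaultdict(list)
    for rhs, lhs in rules:
        defs[lhs].append(rhs)
    out = []
    for lhs, bodies in defs.items():
        seeds = [r[0] for r in bodies if len(r) == 1]
        steps = [r[1] for r in bodies if len(r) == 2 and r[0] == lhs]
        if len(bodies) == 3 and len(seeds) == 1 and len(steps) == 2 \
                and seeds[0] in steps:
            b = seeds[0]
            c = next(s for s in steps if s != b)
            elem = b + "-" + c                      # the new sort B-C
            out += [((b,), elem), ((c,), elem), ((b, elem + "*"), lhs)]
        else:
            out += [(r, lhs) for r in bodies]       # leave others alone
    return out

rules = [(("A", "B"), "A"), (("A", "C"), "A"), (("B",), "A")]
for rhs, lhs in introduce_list(rules):
    print(" ".join(rhs), "->", lhs)
```

Running this prints exactly the three rewritten productions shown above.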
The next step is to adapt the grammar so that comments are included in the language as well. Since the same holds for scaffolding, we treat both in one step. We explain this by giving an example. Suppose we have a grammar fragment:
lexical syntax
[a-z]+ -> C
context-free syntax
"b" -> B
B "a" C -> D
Since at every location both comments and scaffolding can occur, we could translate this into:
lexical syntax
[a-z]+ -> C
context-free syntax
X "b" -> B
X B X "a" X C -> D
In this way we have incorporated the sort X, short for extension of the grammar. However, this cannot be done, since the resulting grammar is ambiguous.
Suppose that the comments are as in C. The term /*foo*/ b a c has two valid parse trees. One parse tree is made by using the last context-free rule: there we choose the first X to be /*foo*/, and then the b is recognized by the first context-free rule, by choosing its X empty, and so forth. In the other valid parse tree we take the last context-free rule and choose its first X empty; then, to recognize /*foo*/ b, we use the first context-free rule, and so forth.
As can be seen, it is very dangerous to modify grammars in a naive way. We solve this by connecting each X to the next concrete token, as is done below:
lexical syntax
X [a-z]+ -> C
context-free syntax
X "b" -> B
B X "a" C -> D
This introduces another problem. Every token is either a context-free keyword, e.g., b, or a lexically defined regular expression, e.g., [a-z]+. Since the lexer chops the strings into tokens, the X in the lexical definition becomes part of a single token. This is an unwanted situation, for we wish to detect X context-free in order to recognize it in tools. Therefore, the correct solution is:
lexical syntax
[a-z]+ -> C-lex
context-free syntax
X C-lex -> C
X "b" -> B
B X "a" C -> D
Now we are almost done. We have to take into account that comments are also allowed at the end of a program. As can be seen from the above grammar, this is not yet allowed. Therefore, we transform the rule that produces the top sort in a special way. This is the resulting grammar:
lexical syntax
[a-z]+ -> C-lex
context-free syntax
X C-lex -> C
X "b" -> B
B X "a" C X -> D
The resulting grammar is not ambiguous, and we can manipulate comments and scaffolding context-free at every location. Moreover, this step is fully automated: the addition of scaffolding is a matter of pressing a button (for implementation details we refer to [51]). As can be seen, it is far from trivial to incorporate comments and scaffolding into a grammar. In the case study we of course performed extensive regression testing.
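Our reading of the core of this transformation, a sketch rather than the implementation of [51], is: prefix every concrete token in a context-free rule with X, leave nonterminals alone, and append a trailing X to the production of the top sort. In Python, under a toy encoding of our own (quoted strings are keywords, sorts ending in -lex are lexical):

```python
# Scaffolding sketch: insert the extension sort X before every concrete
# token (a quoted keyword or a lexical sort such as C-lex), and append
# a trailing X to the rule producing the top sort. The token test is
# our own convention for this toy encoding.

def scaffold(rhs, lhs, top_sort):
    is_token = lambda s: s.startswith('"') or s.endswith("-lex")
    new = []
    for sym in rhs:
        if is_token(sym):
            new.append("X")
        new.append(sym)
    if lhs == top_sort:
        new.append("X")     # allow trailing comments in a program
    return new

# Reproduces the three context-free rules of the resulting grammar:
print(scaffold(["C-lex"], "C", "D"))          # ['X', 'C-lex']
print(scaffold(['"b"'], "B", "D"))            # ['X', '"b"']
print(scaffold(["B", '"a"', "C"], "D", "D"))  # ['B', 'X', '"a"', 'C', 'X']
```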
Some readers may worry about the performance of the parser generated from the heavily reengineered grammar, since the scaffolding sort has been added at virtually every location in the grammar. Comments and other intermediate results in program code cannot be ignored when we have to reengineer it, and software renovation deals with large systems in general. First and foremost, we do not care much about performance, since automated renovation should be compared to renovation by hand: even a tool with very bad performance is faster and more accurate than manual renovation. Moreover, progress in hardware development solves a number of performance issues just by waiting. Nevertheless, for the sake of the (in our opinion irrelevant) argument, we have done a few measurements. We took a 108,771 LOC COBOL/SQL file and parsed it. This took 37 CPU minutes. The maximal usage of internal memory was 568 Mb (on a machine with 576 Mb of internal memory; we were also using swap space). We store the files in a binary form of ATF (Annotated Term Format [5]) called BAF (Binary ATF) that uses term sharing heavily [4,29]. This leads to a file of 1.7 Mb. The text file of 108,771 LOC is 4.7 Mb, so the parsed version is 2.8 times smaller than the text file. Indeed, many empty nodes are produced by the scaffolding, but they are shared in the binary ATF version. For more information on scaffolding we refer the reader to a paper focusing entirely on that issue [51].
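The compression comes from maximal subterm sharing. The idea behind it can be sketched with a few lines of hash-consing in Python; this is our illustration of the principle, not the actual BAF implementation:

```python
# Hash-consing sketch: every distinct (function, arguments) node is
# stored exactly once, so equal subterms are physically shared, as in
# the maximal sharing used for ATF/BAF terms.

_nodes = {}

def mk(fun, *args):
    key = (fun, args)                 # args are themselves shared nodes
    return _nodes.setdefault(key, key)

t1 = mk("plus", mk("num", "1"), mk("num", "2"))
t2 = mk("minus", mk("num", "1"), mk("num", "2"))
print(t1[1][0] is t2[1][0])  # True: the subterm for "1" is stored once
```

The many identical empty scaffolding nodes thus collapse to a single stored node each, which is why the binary form can be smaller than the source text.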
Since parser generators are often somewhat limited in expressive power, it is wise to be prepared for unusual situations. We scanned the comments in the parser source code for words like NOTE or should. We detected two irregularities due to the nature of the parser generator algorithm used. Fortunately, everything was documented very well, so we could modify the two BNF rules to what they should be; our parser generation algorithm has no problems with those rules. We asked the compiler specialists whether there were other irregularities that we had missed; there is no indication that this was the case. The +8 MLOC test also provided an indication that there were no further irregularities.
In some cases it is useful to modularize the grammar, that is, to put certain parts of the grammar in separate files. We do this since modular grammars make it possible to deal with different dialects of a language. In our case study, the compiler contains more than one grammar, and depending on the sector it chooses a subparser. We cannot use subparser technology, since some changes affect more than one sector. We also cannot simply unite the grammars, since some production rules occur in more than one sector; this causes true ambiguities, so we have to remove rules that occur more than once. We have a tool that lists double productions. The duplicates are good candidates for modules: they can then be imported by the parts that contained them in the original situation. In this way we also modularized the SSL grammar. To give the reader an idea, Figure 1 depicts the import graph of the part of the SSL grammar that is suited for certain large public switches. A box is an SDF module; an arrow from box AN to ID means that ID is imported by AN, and so forth. The information in Figure 1 is generated automatically using a tool for SDF [38]; the visualization is done using dot [22,41]. Figure 1 represents about 2000 production rules. The upper part of the picture has been extracted from the compiler and reengineered. Below the Comment module we automatically generated all the syntax necessary to obtain a reengineering grammar.
We connected all the sector grammars using a super grammar, AN, that we also found in the compiler. In that grammar we detected a few sectors that we had not found yet; in the end we found grammars for all but two. These were ASSMBLR, one of the 4 embedded assembly grammars, and ID, a no-op sector. There was no grammar for ID in the compiler, since it is a no-op sector; we made that one by hand. The assembly grammar was implemented by recursive descent; we generated its grammar from the comments, which contained a form of BNF. Here we see another difference with a compiler grammar: the ID sector is important for version and configuration management and is used by such tools. It does not add to the executable. However, when such code is reengineered, these ID sectors also change, and they should not be removed.
Of course, after each step in this grammar reengineering process we ran a regression test using our test suite, so that irregularities in the process could readily be found. It took about a week to make the tools, build the test suite, perform the tasks, test the various versions of the grammars, and do all the hand work. The total cost of the project, including acquiring hardware, was about 25,000 US$. We heard from Ericsson that Cordell Green estimated the construction of this entire grammar at a few million US$. We should add that this was a ballpark estimate, and we think he based it on hand work. Cordell Green told us that 25,000 US$ is negligible for such a grammar.
Now the grammar is reengineered and it is ready to be used to generate a software renovation factory. We discuss this briefly in the next section.
We recall that in [46] it is mentioned that reusing a parser is a real problem; the authors strongly suggest that parsing and analysis should be separated. The basic impediment to extending a program understanding tool called CMS2REV is the tight coupling of language syntax and semantic processing: CMS2REV does not maintain a separation of parsing and analysis functions, and its parser actions directly populate idiosyncratic data structures used to produce the desired reports [46]. Our approach is even more rigorous than the ones proposed in [46] or, recently, [55]: we separate even tree building from the syntax description. Since this is done automatically under the surface, we can reuse parsers by extracting just the syntax rules and migrating them to the SDF formalism supported by the ASF+SDF Meta-Environment. This component-based approach makes parser reuse effortless.
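The separation advocated above can be illustrated with a minimal sketch. The generic node shape and the two toy analyses are our own invention (not the data structures of CMS2REV or of the ASF+SDF Meta-Environment); the point is that the parser's only job is to build a generic tree, and analyses are plain functions over that tree that know nothing about the parser.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Generic parse tree node: a label plus children, nothing parser-specific."""
    label: str
    children: list = field(default_factory=list)

def count_nodes(tree, label):
    """Analysis 1: count subtrees carrying a given label."""
    own = 1 if tree.label == label else 0
    return own + sum(count_nodes(c, label) for c in tree.children)

def collect_leaves(tree):
    """Analysis 2: collect leaf labels in left-to-right order."""
    if not tree.children:
        return [tree.label]
    return [leaf for c in tree.children for leaf in collect_leaves(c)]

# Any parser that emits Node values can feed both analyses unchanged:
tree = Node("assign", [Node("id", [Node("x")]),
                       Node("expr", [Node("id", [Node("y")])])])
print(count_nodes(tree, "id"))   # 2
print(collect_leaves(tree))      # ['x', 'y']
```

Because neither analysis touches parser internals, swapping in a parser for another language, or a regenerated parser, requires no change to the analyses.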
We think that parser reuse among other tools is possible along lines similar to those discussed in this paper. First, extract the grammar rules from the compiler front-end. Then migrate them to the input format of the parser generator that is used to generate a parser for the program understanding tool. Add a skeleton for the idiosyncratic semantic actions and complete that part by hand. Then a parser can be generated that is geared towards program understanding or another target. Another idea would be to dump the parse tree in a language-independent format, as proposed in [46], [5], and [55], so that other tools can use this format.
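A language-independent dump of the kind just mentioned can be as simple as an s-expression. The concrete format below is our own illustration, not the one used in [46], [5], or [55]; it shows that a tree of labels can be serialized and read back by any tool, with a lossless round trip.

```python
def to_sexpr(tree):
    """Serialize a (label, children) tuple tree to an s-expression string,
    a simple language-independent exchange format other tools can read."""
    label, children = tree
    if not children:
        return label
    return "(" + label + " " + " ".join(to_sexpr(c) for c in children) + ")"

def from_sexpr(text):
    """Parse the s-expression back into the tuple representation."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    def parse(i):
        if tokens[i] == "(":
            label = tokens[i + 1]
            i += 2
            children = []
            while tokens[i] != ")":
                child, i = parse(i)
                children.append(child)
            return (label, children), i + 1
        return (tokens[i], []), i + 1
    tree, _ = parse(0)
    return tree

tree = ("assign", [("id", []), ("expr", [("num", [])])])
dumped = to_sexpr(tree)
print(dumped)                      # (assign id (expr num))
assert from_sexpr(dumped) == tree  # the round trip is lossless
```

A program understanding tool only needs `from_sexpr` (or its equivalent for whatever exchange format is chosen) to consume trees produced by any front-end.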
In order to go from the compiler source code to its corresponding software renovation factory, we need to reverse engineer the source code of the compiler. Normally it is difficult to reverse engineer systems software, in particular when the high-level design is unknown. For compilers the design is almost never unknown: compiler design is taught everywhere, and the reference architecture of compilers is well-known and described in many textbooks. In our case the source code of the compiler was about 63 Mb. It took us less than 5 minutes to find the part where the front-end code resided. There we found almost all grammar files. We then needed to implement about 6 small grammars: one for lexical definitions and five for various dialects of context-free definitions; these took 15 minutes each. One of the 5 assembly sectors was implemented as a recursive descent parser. This recursive descent parser, too, was carefully designed: concretely, the BNF was included in comments, and we could use these comments to generate a new grammar. If the recursive descent parser had been implemented without such comments (which is less careful), we could have used a C renovation factory (generated from the source code of the C compiler) to write a few simple transformations that take the recursive descent code and turn it into BNF. This would have ruled out instantaneous generation of the renovation factory, but it would not have prevented generation at all; it would just have taken more time: generating a C factory and making the appropriate transformations for extracting the BNF from, say, recursive descent parser source code.
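Recovering a grammar from the comments of a carefully written recursive descent parser can be sketched as follows. The `-- BNF:` comment marker and the parser fragment are hypothetical stand-ins for whatever convention the compiler writers actually used; the point is that one pass over the source suffices to harvest the rules without analyzing the parsing code itself.

```python
import re

def extract_bnf(source):
    """Return the BNF rules embedded in '-- BNF:' comment lines
    (a hypothetical commenting convention for this sketch)."""
    return [m.group(1).strip()
            for m in re.finditer(r"--\s*BNF:\s*(.+)", source)]

parser_source = """
-- BNF: expr ::= term { '+' term }
PROCEDURE ParseExpr; ...

-- BNF: term ::= factor { '*' factor }
PROCEDURE ParseTerm; ...
"""
print(extract_bnf(parser_source))
# ["expr ::= term { '+' term }", "term ::= factor { '*' factor }"]
```

When no such comments exist, the same harvesting step would instead be a transformation over the parser code itself, as described above.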
We note that if lexical definitions are not defined in a natural fashion, this part of the translation may better be done by hand. Since lexical definitions are usually very small this is not a problem, although the repeatability of the generation process is lost.
Once we can parse all the raw source code containing the grammar information, we map the code to a domain-specific language: in our case an extended form of BNF. These transformations are straightforward. The difficult part of the transformation process is going from this EBNF to SDF and, of course, from SDF to a renovation factory. In our case study, transforming the raw source code to EBNF took about 20 minutes for the entire grammar. The move to SDF took about half a day. The generation of a software renovation factory takes about 50 hours of computation time.
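The simplest part of the EBNF-to-SDF step can be sketched as follows. In SDF a production is written the other way around from BNF: the body comes first, then `->` and the defined sort. The converter below handles only flat alternatives; real EBNF repetition and option operators, priorities, and lexical syntax would need additional cases, so this is an illustration of the direction of the mapping, not our actual tool.

```python
def ebnf_to_sdf(rule):
    """Convert a flat EBNF rule 'A ::= B C | D' into SDF productions.
    SDF reverses BNF: 'B C -> A' and 'D -> A'. Only plain alternatives
    are handled in this sketch."""
    lhs, rhs = (part.strip() for part in rule.split("::="))
    return ["%s -> %s" % (alt.strip(), lhs) for alt in rhs.split("|")]

print(ebnf_to_sdf("Stmt ::= Assign | IfStmt"))
# ['Assign -> Stmt', 'IfStmt -> Stmt']
print(ebnf_to_sdf('Assign ::= Id ":=" Expr'))
# ['Id ":=" Expr -> Assign']
```

Each alternative becomes a separate SDF production, which is why a rule with n alternatives yields n lines in the SDF module.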
As noted in the introduction, Jones estimates that 30 percent of the Year 2000 problems reside in obscure languages. Our experience and intuition tell us that the more obscure a language is, the more readily the source code of its compiler can be made available to support solving the Year 2000 problem. We have a few reasons to believe this. First, the owners of the compiler will be glad that they can deliver a Y2K solution for their limited set of customers. If the owners of the compiler also own the code that needs Y2K repair, there should be no problem at all. For very common languages such as COBOL, on the other hand, there is already a lot of tool support.
In our opinion, a possible scenario could be to attack the problem with a combination of tools that reuse parsers. For instance, the GENOA/GENII combination could be used to reuse the output of the parser so that an impact analysis can be done. Since GENOA is independent of the front-end, a generic Year 2000 search engine should be possible. In the meantime, a process as described in this paper could be started to implement a software renovation factory. Another possibility is to gear a CALE factory towards adding parsers to commercial systems such as COSMOS [20]. We note that according to Gartner [23,28], COSMOS is leading technology providing Year 2000 search and repair engines. Since COSMOS separates parsing from its Y2K search engines, it is in principle possible to generate from a compiler front-end a skeleton parser generator file, to be completed by hand by COSMOS developers. We expect that this approach can also be applied to the Revolution 2000 Maintenance Environment [40].
We note that it is not necessary to use the implementation platform that we used. In fact, we think that this approach can also be implemented using, for instance, Reasoning's Software Refinery. Since the implementation of our CALE Factory did not take much time, we think that other sophisticated programming environment development tools can also be used to generate automated Year 2000 search and repair support by reusing the source code of existing parsers rather than their output.