

Cracking the 500 Language Problem

R. Lämmel*,** and C. Verhoef**
*Centrum voor Wiskunde en Informatica,

Kruislaan 413, 1098 SJ Amsterdam, The Netherlands
**Free University of Amsterdam, Department of Mathematics and Computer Science,

De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands
ralf@cs.vu.nl, x@cs.vu.nl

Abstract:

We explain what the 500 language problem is, why it is a relevant problem, and why solutions are needed. We propose a solution: rapid development of renovation parsers by stealing grammars. We illustrate this approach by applying it to two non-trivial but representative languages: a proprietary real-time language from the telecommunications industry, and a well-known dialect of the most popular language in the world, IBM's VS Cobol II. We share the lessons we learned from our efforts to solve the 500 language problem.

   
Introduction

Capers Jones estimates that there are at least 500 languages and dialects available in commercial form or in the public domain. On top of that, he estimates that some 200 proprietary languages have been developed by corporations for their own use [1, p. 321]. In his book on estimating the costs of the Year 2000 problem [2] he furthermore indicated that systems written in all those 500 plus 200 languages were affected. The findings of Jones inspired many Y2K whistle-blowers to cite his estimates as a major impediment to solving the Year 2000 problem. Let us have a look at what these people had to say. For instance, Ed Yourdon replied with a boilerplate email to anyone who sent him mail containing the words Y2K and solution. He mentions the 500 language problem, and it is worthwhile to quote this part in its entirety:

I recognize that there is always a chance that someone will come up with a brilliant solution that everyone else has overlooked, but at this late date, I think it's highly unlikely. In particular, I think the chances of a ``silver bullet'' solution that will solve ALL y2k problems is virtually zero. If you think you have such a solution, I have two words for you: embedded systems. If that's not enough, I have three words for you: 500 programming languages. The immense variety of programming languages (yes, there really are 500!), hardware platforms, operating systems, and environmental conditions virtually eliminates any chance of a single tool, method, or technique being universally applicable.

The number 500 should be taken poetically, like the 1000 in the preserving process for so-called thousand-year-old eggs, which takes only about 100 days. For a start, the 200 proprietary languages should be added; moreover, other estimates indicate that 700 is rather conservative: Weinberg estimated already in 1971 that in 1972 programming languages would be invented at the rate of one per week--or more, if we consider the ones which never make it to the literature, and enormously more if we consider dialects, too [3, p. 242]. Peter de Jager also created awareness of the 500 language problem. He writes about the availability of Y2K tools [4]:

There are close to 500 programming languages used to develop applications. Most of these conversion or inventory tools are directed toward a very small subset of those 500 languages. A majority of the tools are focused on Cobol, the most popular business programming language in the world. Very few tools, if any have been designed to help in the area of APL or JOVIAL for example.

If everyone were using Cobol, and only a few systems were written in uncommon languages, the 500 language problem would be of limited importance. It is therefore useful to know the actual language distribution of installed software. First, there are about 300 Cobol dialects: each compiler product has a few versions, with many patch levels, and Cobol often contains embedded languages like DMS, DML, CICS and SQL. So there is no such thing as the Cobol language. Cobol is a polyglot: a confusing mixture of dialects and embedded languages--a 500 language problem of its own. Second, Yourdon and de Jager were right about the importance of the 500 language problem: 40% of all software is written in less common languages. To be precise, the distribution of the world's installed software by language according to Capers Jones is as follows:

In contrast, Y2K search engines existed for about 50 languages, and automated repair engines for about 10 languages [2, p. 325]. So automated modification support exists for only a very small fraction of the languages. This lack is what caused people to worry about the 500 language problem.

What is the 500 Language Problem?

What is it, and is it still relevant? We entered the new millennium without too much trouble, so one could conclude that maybe there was this 500 language problem, but that, whatever it was, it is no longer relevant. Of course, the 500 language problem already existed before it was popularized by the Y2K gurus, and it did not go away when we entered the new millennium. Capers Jones identified and named the problem. But what is a good description of it? Here is a succinct formulation:

The 500 language problem is defined as the most prominent impediment to constructing tools that analyze and modify existing software written in those languages.

Removing this impediment solves the 500 language problem. Let us illustrate what the impediment comprises. If you want tools that accurately probe and manipulate source code, a prerequisite is that the code be converted from text format into a tree format. To make this conversion you need a so-called syntactic analyzer, or parser. Constructing a parser for analysis or modification is a major effort, and in many cases the up-front investment hampers initiatives by commercial tool builders, which explains the lack of tools. Indeed, Tom McCabe told us that McCabe & Associates developed parsers for 23 languages, which was already a huge investment; 500 would be insurmountable. Therefore, he dubbed the 500 language problem ``the number one problem in software renovation''.

A solution sometimes heard for the 500 language problem is to simply convert the uncommon languages to mainstream ones for which tool support is available. Eliminating all these languages would make the 500 language problem go away. This is not a solution, however: you need a full-blown tool suite to make the conversion, including a serious parser, and obtaining a parser is itself part of the 500 language problem. So language conversion will not eliminate the 500 language problem; on the contrary, you need a solution to the 500 language problem to aid in solving conversion problems.

A second suggestion for solving the 500 language problem has been reported in Usenet discussions, where a researcher proposed to generate grammars from the source code alone, in the same way linguists try to derive a grammar from a piece of natural language. In search of solutions, we studied this idea and consulted the relevant literature. We did not find any successful effort in which the linguistic approach helped to create a grammar for a parser in a cost-effective way. Our conclusion is that the linguistic approach does not lead to useful grammar inferences from which you can build parsers [5].

Another suggestion for solving the 500 language problem is to reuse the parser of a compiler. This approach works somewhat better: you tap the parser output from a compiler and feed it to a renovation tool. In fact, this is what Prem Devanbu does with his GENOA/GENII system [6]. He developed a programmable tool that can turn the idiosyncratic output format of a parser into another format more suitable for code analysis. There is, however, one major drawback to this approach: as Devanbu points out in his paper [6], the GENOA system does not allow for modifying code. This is not a surprise, since a compiler removes comments, expands macros, includes files, minimizes syntax, and thus irreversibly deforms the original source code. The intermediate format is good enough for analysis in some cases, but the code can never be turned back into acceptable text format. Another very real limitation is that for many compilers you do not get access to the sources (for economic reasons). In renovation projects, such as Y2K, Euro conversion, code restructuring, language conversion, and so on, it is a requirement that you can modify code automatically: the code volume prohibits effective and efficient renovation by hand.

Summarizing, grammars are scarcely available, and the known routes to obtain them are far from optimal. The 500 language problem is a real problem; it was a real problem before the Y2K gurus baptized it, and it did not go away after the millennium passed. Its solution is a first step towards enabling tool support for analyzing and modifying the 800-900 billion LOC of existing software assets written in numerous languages.

   
How We Are Cracking the 500 Language Problem

Ed Yourdon claimed that the large number of programming languages would virtually eliminate any chance of a single tool, method, or technique being universally applicable. It turns out that the 500 language problem does have a single solution, and it is not too hard either. This is what we mean by the word solution:

The 500 language problem is cracked when there is a cheap, rapid and reliable method to produce grammars for the myriad of languages, so that analysis and modification of existing code is enabled.

Let us explain what this means. Cheap is in the 25.000 ± 5.000 US dollar range; rapid is in the two-week range (one person); and reliable means that the parser based on the produced grammar passes the test of parsing millions of lines of code provided by the customer in need of tool support. Why is this a solution? After all, a grammar is hardly a Euro conversion tool or a Y2K analyzer. Below we explain that the dominating factor in constructing renovation tools is constructing the underlying parser.

From Grammar to Renovation Tool

Renovation tools routinely comprise the following main components: preprocessors, parsers, analyzers, transformers, visualizers, pretty printers, and postprocessors. In many cases, language-parametrized (or generic) tools are available to construct these components: think of parser generators, pretty-printer generators, graph visualization packages, rewrite engines, generic data flow analyzers, and the like. Workbenches providing this functionality are, for instance, Elegant, Refine, and ASF+SDF, but there are many more. This is the generic core of all renovation tools. In Figure 1 we depict how you go from a grammar to actual renovation tools.


  
[Figure 1: Effort shift for renovation tool development]

We express effort by the length of the arrows (longer arrows imply more effort). As you can see, once you have a generic core and a grammar, it does not take much effort to construct parsers, tree walkers, pretty printers, and so on. Although these components depend on a particular language, their implementation uses generic language technology: a parser is generated using a parser generator, a pretty printer is generated using a formatter generator [7], and tree walkers for analysis or modification are generated similarly [8]. What all these generators share is that they rely heavily on the grammar. Once you have the grammar and the relevant generators, you can rapidly set up this core for developing software renovation tools. You could call this the grammar-centric approach. Leading Y2K companies indeed constructed generic Y2K analyzers, so that dealing with a new language ideally reduced to constructing a parser. The bottleneck is obtaining complete and correct grammar specifications. The dashed part in Figure 1 expresses the current situation: it takes a lot of effort to create those grammars. In Table 1 we quantify the effort for a typical Cobol renovation project using the solution we propose. We discuss the project shortly, but first notice that the grammar part took two weeks. Implementing a quality Cobol parser by hand can take 2 to 3 years, as Vadim Maslov of Siber Systems posted on Usenet (he constructed Cobol parsers for about 16 dialects). Adapting an existing Cobol parser to cope with new dialects also easily takes 3 to 5 months, as we learned from several estimates done by others. Moreover, patching existing grammars using mainstream parser technology leads to unmaintainable grammars [9,10], which significantly increases the time it takes to adapt parsers. Using our approach this effort is reduced significantly (in this example to 2 weeks), so that you can start developing actual renovation tools much more quickly.
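
To make the grammar-centric core concrete, here is a minimal sketch in Python of one of the generation steps mentioned above: deriving a default tree-walker skeleton from a grammar. The grammar encoding (a mapping from nonterminals to lists of alternatives) and all names are illustrative assumptions of ours, not the format of any particular workbench, and the real generators [7,8] are far more elaborate.

# Minimal sketch: derive a default tree walker from a grammar description.
# The grammar encoding (nonterminal -> list of alternatives, each alternative
# a list of symbols) is an illustrative assumption, not a workbench format.

GRAMMAR = {
    "plex-program": [["program-header", "statement-row", "'END'", "'PROGRAM'", "';'"]],
    "statement-row": [["statement"], ["statement-row", "statement"]],
}

class DefaultWalker:
    """Generic top-down walker over parse trees shaped as (nonterminal, children)."""
    def walk(self, node):
        if isinstance(node, str):                    # a token: nothing to descend into
            return
        label, children = node
        handler = getattr(self, "visit_" + label.replace("-", "_"), None)
        if handler is not None:                      # language-specific hook, if any
            handler(node)
        for child in children:                       # otherwise just descend
            self.walk(child)

def generate_walker_stubs(grammar):
    """Emit one empty visit_ method per nonterminal; a tool builder only fills
    in the handful of methods the renovation task actually needs."""
    lines = ["class MyRenovationWalker(DefaultWalker):"]
    for nonterminal in grammar:
        lines.append("    def visit_%s(self, node):" % nonterminal.replace("-", "_"))
        lines.append("        pass  # analysis or transformation goes here")
    return "\n".join(lines)

if __name__ == "__main__":
    print(generate_walker_stubs(GRAMMAR))

Running the sketch prints a walker class with one empty method per nonterminal; for a typical renovation task only a few of these methods need real code, which is why the effort collapses once the grammar is available.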

To illustrate how to go from a grammar to an actual renovation task, we briefly describe a Cobol renovation project [11] in which others applied our grammar-centric approach. The project concerned one of the largest financial enterprises in the world, which needed an automatic converter from Cobol 85 back to Cobol 74 (the 8574 project). The Cobol 85 code was machine generated by a 4GL tool (KEY), so the conversion problem was fortunately limited, due to the limited vocabulary of the code generator. It took some time to find solutions for intricate problems, such as how to simulate Cobol 85 features like explicit scope terminators (END-IF, END-ADD), or how to express the INITIALIZE statement in the less rich Cobol 74 dialect. The solutions were discussed with the customers and tested for equivalence. Once these problems were solved, it was not much work to implement the components, thanks to the generic core assets generated from the recovered Cobol 85 grammar. The problem could be cut into 6 separate tools, which took 5 days to implement. The code written by hand was limited (less than 500 LOC), but it compiled into about 100.000 lines of C code and 5000 lines of makefile code (linking in all the generated generic renovation functionality). After compilation into 6 executables (2.6 Mb each), it took 25 lines of code to coordinate them into a distributed component-based software renovation factory that converts Cobol 85 code back to Cobol 74 at a rate of 500.000 LOC/hour, using 11 Sun workstations.
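
The coordination layer itself can indeed be tiny. The sketch below is not the actual 25-line script; the tool names, host names and the ssh-based invocation are hypothetical, and it merely illustrates how a handful of generated component executables can be chained into a pipeline and farmed out over a number of workstations.

# Hypothetical sketch of a small coordination layer: run six generated
# component executables as a pipeline on each source file, farming the work
# out over several worker hosts via ssh. All names and paths are invented.
import subprocess, sys
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["sun%02d" % i for i in range(1, 12)]        # 11 workstations (invented names)
PIPELINE = ["tool1", "tool2", "tool3", "tool4", "tool5", "tool6"]

def convert(host, path):
    """Pipe one Cobol 85 file through the six tools on a remote host."""
    stages = " | ".join(PIPELINE)
    cmd = ["ssh", host, "cat %s | %s" % (path, stages)]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def main(files):
    with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
        jobs = [pool.submit(convert, HOSTS[i % len(HOSTS)], f)
                for i, f in enumerate(files)]
        for f, job in zip(files, jobs):
            with open(f + ".cob74", "w") as out:     # write the Cobol 74 result
                out.write(job.result())

if __name__ == "__main__":
    main(sys.argv[1:])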


 
Table 1: Effort for the 8574 project.

  8574 project    effort
  ----------------------
  grammar         2 weeks
  generation      1 day
  6 tools         5 days
  assemblage      1 hour
  ----------------------
  total           3 weeks

 

Measuring this and other projects, it became clear to us that the total effort of writing a grammar by hand is orders of magnitude larger than that of constructing the renovation tools themselves. So the dominating factor in producing renovation tools is constructing the parser. With our approach, building the parser takes effort of the same order of magnitude as constructing the renovation tools. Building the parser itself is not hard: use a parser generator. But the input of a parser generator is a grammar description. So the most important artifacts we need to enable tool support for software renovation are complete and correct grammars. If we find an effective way of producing grammars quickly for many languages, we remove the largest impediment to constructing tools for those languages, and we thus solve the 500 language problem.

But how can grammars be produced quickly? Recapturing the syntax of an existing language is usually done by hand: take a huge amount of sources, manuals, and books, plus a parser generator, and start working. We, and many others, have worked like this for years. But then we realized that this hand-work is not necessary. Since we are dealing with existing languages, the grammars have already been constructed. This is what we discovered about grammars:

do not create them, steal them, and then massage them to your needs.

Grammar Stealing Covers Almost All Languages

We commence with an important argument showing that our approach covers virtually all languages: so far we have found only two problematic cases. We discuss the coverage diagram depicted in Figure 2.


  
[Figure 2: Coverage diagram for grammar stealing]

Recall that we need to produce grammars for existing software, e.g., legacy systems, so the deployed software is compilable (or can be interpreted). After passing the start here box, we enter the compiler sources diamond. There are two possibilities: either the source code of the compiler is available to you or it is not. First we discuss the yes path. Then the only thing you have to do is find the part that turns the text into an intermediate form; that part contains the grammar. You can find it by grepping the compiler sources for keywords of the language.

There are three possibilities: either the grammar part is hard-coded, or a parser generator is used, or both (in a complex multi-language compiler, for instance). We only need to cover the first two cases (they are the ones present in the diagram). In the hard-coded case, you have to reverse engineer the actual grammar from the hand-written code. Fortunately, the comments of such code often provide BNF rules, giving you an indication of what the grammar comprises. Moreover, compiler construction is a well-understood subject: there is even a known reference architecture, so compilers are often implemented with well-known algorithms. Hence the quality of a hard-coded parser is usually good, e.g., a recursive descent algorithm is used. In such cases you can easily recover the grammar from the code, from the comments, or from both. We know of one case where the grammar is not easily extractable: the language perl [12]. In all other cases we encountered, the quality of the code was sufficient to recover the grammar.

If the parser is not hard-coded, it is generated (the BNF branch in Figure 2). But then there must be some BNF description of it in the compiler sources. With a simple tool that parses this BNF notation itself, you can extract the BNF. So in all cases where we have the compiler sources we can recover the grammar, except for perl. This finishes the case where you have access to the source code of a compiler; later on we discuss a published recovery effort from compiler sources to give you an idea of grammar stealing when the compiler sources are at your disposal.
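
As a rough illustration of locating and extracting such a BNF description, the sketch below greps a compiler source tree for language keywords and pulls candidate productions out of yacc-style grammar files. Real compilers use a variety of home-grown BNF dialects (for PLEX alone we wrote six small BNF parsers), so the keyword list and the rule pattern here are illustrative assumptions only.

# Illustrative sketch: locate grammar-related files in a compiler source tree
# and extract candidate productions from yacc-style files. The keyword list
# and patterns are assumptions; real compilers use home-grown BNF dialects.
import os, re, sys

KEYWORDS = ("PROGRAM", "PROCEDURE", "END", "IF", "WHILE")   # keywords of the language

def grammar_candidates(root):
    """Yield files whose contents mention several of the language keywords."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                text = open(path, errors="ignore").read()
            except OSError:
                continue
            if sum(1 for kw in KEYWORDS if kw in text) >= 3:
                yield path, text

RULE = re.compile(r"^\s*([A-Za-z][\w-]*)\s*:\s*(.*?);", re.S | re.M)

def yacc_rules(text):
    """Very rough extraction of 'name : alternatives ;' productions,
    after dropping embedded (non-nested) action code between braces."""
    no_actions = re.sub(r"\{[^{}]*\}", "", text)
    return [(m.group(1), " ".join(m.group(2).split())) for m in RULE.finditer(no_actions)]

if __name__ == "__main__":
    for path, text in grammar_candidates(sys.argv[1]):
        rules = yacc_rules(text)
        if rules:
            print("%s: %d candidate productions" % (path, len(rules)))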

Now we look at the case where there is no access to the compiler sources (we enter the language reference manual diamond in Figure 2). There are two possibilities: there is a language reference manual or there is not. Let us first discuss the case where a language reference manual is available. This can be a compiler-vendor manual or an official language standard. There are three possibilities: either the language is explained by example, or by general rules, or both. We only need to treat the first two cases. Let us first assume that there are general rules. Then there is the quality issue: reference manuals and language standards are known to be full of errors. To our surprise, we discovered that these myriads of errors are of a repairable category. We were surprised because in 1998 we had experienced a total failure to recover a grammar from the manual of a proprietary language, for which--obviously--the compiler sources were also available (so that case is covered by the upper half of Figure 2). As you can see in the coverage diagram, we have not found low-quality language reference manuals containing general rules in cases where we did not have access to compiler sources. This is clarified as follows: compiler vendors do not give away the source code of a compiler for economic reasons, but in order to be successful as a company, accurate documentation explaining the entire language is necessary. We discovered that the quality level of those manuals is good enough to recover the grammar. Later on we discuss a published recovery effort from a language reference manual to give you an idea of grammar stealing when no compiler sources are available.

For an uncommon language it is much rarer to have a high-quality manual: either there is none (e.g., if the language is proprietary), or the company has only a few customers. In the proprietary case you have the compiler sources, so we have coverage by the upper half of the diagram. In the other case you can buy the sources, since their business value is not too high. For instance, when Wang went bankrupt, their important clients bought the sources of the Wang operating system and the Wang compilers to create their own platform and dialect migration tools. This explains why we do not know of low-quality manuals containing general rules. We know of one case where the language is explained by code examples and general rules are absent: RPG. We think that it is possible to extract the general rules of RPG from the code examples in a systematic manner; we plan to examine this case in more detail in a future renovation project involving a large amount of RPG code.

Finally, we have to deal with the case without access to the compiler and without access to a language reference manual. We have not (yet) seen such cases. Capers Jones mailed us that ``For a significant number of applications with Y2K problems, the compilers may no longer be available either because the companies that wrote them have gone out of business or for other reasons.'' But he did not come up with actual examples. We just mentioned the Wang case, where you could buy the sources, and hence solve the problem using the upper half of Figure 2. But we do not exclude that occasionally such a thing can happen. In any case, one lesson that can be learned from this is that contracts between you and a business-critical language vendor should include a solution for source access in case of bankruptcy or terminated support (we have seen examples where the sealed sources were given to key customers).

Summarizing, our coverage diagram shows that virtually all grammars are in a class where you can recover them, as we will see later on.

But What About Semantics?

Some people think that you need up-front, in-depth knowledge of the semantics of a programming language in order to change code. When the BNF is recovered, you can generate a syntax analyzer that produces trees, but the trees are not decorated with extra knowledge, such as control flow, data flow, type annotation, name resolution, and so on. Some people also think that you need a lot of semantical knowledge to analyze and modify existing software. We experienced that this is not true. There are three levels on which you can try to capture semantical knowledge of a language: the language as a whole, with all its compilers and dialects; a single compiler, with its versions and flags; or the specific code base of a single renovation project.

Recall that the 500 language problem is about removing the impediment to building tools that work on existing software. So we are not trying to build a compiler, in which all semantical knowledge has to be implemented. Consider the following Cobol excerpt:

01 A PIC X(5) JUSTIFIED RIGHT VALUE 'IEEE'.
DISPLAY A.

The OS/VS Cobol compiler prints the expected result, namely ' IEEE', which is right justified. The Cobol/370 compiler displays the output 'IEEE ' with a trailing space, which is left justified. There are many more such cases, so trying to deal with the semantics of all compilers up-front is not feasible. Even when you restrict yourself to one compiler, this problem does not go away. Consider the Cobol fragment below:

01 A PIC 9999999.
MOVE ALL '123' TO A.
DISPLAY A.

Depending on the compiler flags, this code displays the number 3123123 or the entirely different number 1231231 (the literal is repeated to fill the seven-digit field, and the flags determine whether the leftmost or the rightmost seven characters of the repetition survive). There are hundreds of such problems, so even for a single compiler it is infeasible to capture the semantics up-front. So there is no single semantics available, and gathering all variants is prohibitively expensive and prone to error, given the semantical differences between compilers, compiler versions, and even the compiler flags used.

The good news is that you can deal with semantics on a per-project basis. It is our experience that you only need specific, ad hoc elements of the semantics; we call this demand-driven semantics. Let us explain. For instance, NEXT SENTENCE in Cobol directs control to the statement after the next separation period (denoted with a dot). So, depending on where people put an optional dot, the code jumps directly behind that dot; omitting a dot can lead to different behavior. One of our customers wanted tools to get rid of this implicit jump instruction. Indeed, you have to do an in-depth analysis of what disasters can possibly happen. Luckily, it turned out that for this project the implicit jump instruction could be replaced by the innocent no-op CONTINUE. So, after our semantical investigation, we knew we could safely ignore the hazardous jumping behavior and use a relatively dumb tool to make the change. In another project this tool might break down, depending on the source code or the semantics of the compiler. To give you an idea how far you can get with per-project, demand-driven semantics: for some Cobol systems we developed relatively dumb tools able to wipe out very complex GO TO logic [13]. Overall, it is our experience that for many and diverse tasks you do need to know the semantics, but it is not necessary to encode it up-front in a parse tree or otherwise: the tool developer uses the knowledge to construct the proper (mostly syntactic) tools.
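
To show how ``dumb'' such a tool can afford to be once the semantic homework has been done, here is a deliberately naive sketch of the change just described. It works on raw text and is safe only because the project-specific analysis established that the substitution was harmless for this system; our actual tool operated on parse trees produced with the recovered grammar.

# Deliberately naive sketch: replace the implicit jump NEXT SENTENCE by the
# no-op CONTINUE. This is only safe because a project-specific semantic
# analysis showed the two to be equivalent for the system at hand; the real
# tool worked on parse trees rather than raw text.
import re, sys

PATTERN = re.compile(r"\bNEXT\s+SENTENCE\b", re.IGNORECASE)

def rewrite(line):
    return PATTERN.sub("CONTINUE", line)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as f:
            lines = [rewrite(line) for line in f]
        with open(path, "w") as f:
            f.writelines(lines)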

How Grammar Stealing Works in Practice

We--and others from industry and academia--have applied grammar stealing successfully to a number of languages, among which Java, PL/I, Ericsson PLEX, C++, Ada 95, VS Cobol II, Lucent SDL, AT&T SDL, SWIFT messages, and more. To show how the solution works out in practice, we share our experience with these projects; in particular, we focus on the two main branches of Figure 2: one proprietary, nontrivial real-time embedded-systems language (for which the compiler sources were accessible) and one well-known business language at the other end of the gamut (for which no compiler sources were available). Both languages, PLEX and VS Cobol II, are used for business-critical systems. PLEX (Programming Language for EXchanges) is used in the AXE 10 public branch exchange, and VS Cobol II systems run on IBM mainframes (AXE 10 is the name of the switch; it is not an abbreviation).

Our approach uses a unique combination of powerful techniques. They are:

1.
automated extraction of the grammar from compiler sources or reference manuals;
2.
powerful parser technology that accepts arbitrary context-free grammars;
3.
automated testing of the grammar against large amounts of source code;
4.
tool-supported transformation of grammar specifications.

If one of these ingredients is omitted, the synergy is gone. Extraction by hand is error prone. Basic parsing technology limits you to grammars in severely restricted formats; with powerful technology you can work with arbitrary context-free grammars, so you can test them directly, irrespective of their format. Without automated testing you never find so many errors in a short time. Without tool support to transform grammar specifications, analyses are inaccurate and corrections are not done consistently; and without transformations you cannot repeat what you have done or easily change initial decisions (for we record transformations in scripts). This also gives you an idea of the kind of people who are capable of stealing grammars: they should know about grammars, powerful parsing techniques, how to set up testing, and automated transformations.

Grammar Stealing from Compiler Sources

We illustrate grammar stealing with a published industrial project [14] in which we had access to the compiler sources. We applied our approach to the exceptionally complex proprietary language PLEX, used by Ericsson to program public telephone switches. PLEX consists of about 20 (sub)languages, called sectors; in fact, we are dealing with a mixed language containing high-level programming sectors, assembly sectors, finite-state-machine sectors, marshaling sectors, etc. What we did is straightforward, and can be summarized in a list:

1.
we reverse engineered the PLEX compiler on-site (63 MB of source code) to look for grammar-related files;
2.
we found the majority of the grammars in some BNF form;
3.
we found a hand-written proprietary assembly parser with erroneous BNF in the comments;
4.
we wrote 6 BNF parsers (there were 6 different BNF dialects used);
5.
we extracted the plain BNF from the compiler sources, and converted it to another syntax definition formalism (SDF) for technical reasons;
6.
we found the lexer files and converted them to SDF;
7.
we combined all the converted grammars into one overall grammar;
8.
we generated an overall parser with a sophisticated parser generator;
9.
we successfully parsed 8 million lines of PLEX code as a test.

The total effort was 2 weeks for two persons, including constructing tools, testing time, etc. It was done for 25.000 US dollars. We heard from Ericsson that a cutting-edge reengineering company had earlier estimated this task at a few million US dollars. When we contacted this company, they told us that 25.000 US dollars was nothing for such a grammar.

To illustrate the limited complexity of the work, consider a fragment of raw compiler source:

<plex-program>
        = <program-header> <statement-row> 'END' 'PROGRAM' ';'
          %% xnsmtopg(1) ; %%
--      <= sect
--         Compound(
--             Reverse(STATEMENT-ROW.stat_list) =>
--                PROGRAM-HEADER.sect.as_prog_stat : ix_stat_list_p
--             PROGRAM-HEADER.sect : ix_sect_node_p)
        ;

This expresses that a PLEX program consists of a program header, a list of statements, and the phrase END PROGRAM (followed by a semicolon) to end the program. The remaining code deals with semantic actions that are relevant only for the compiler. Our tools converted this to a common BNF while removing the idiosyncratic semantic actions:

plex-program ::= program-header statement-row 'END' 'PROGRAM' ';'

Then our tools converted this into SDF, which was subsequently fed to a sophisticated parser generator accepting arbitrary context-free grammars:

Program-header Statement-row "END" "PROGRAM" ";" -> Plex-program

We show only one production rule to give you an idea of the low complexity. The tools automatically recovered the majority of the 3000+ production rules in an afternoon. Then we tested each sector grammar separately, and we used a duplicate detector to weed out productions that were used in more than one sector grammar, so that we could construct an overall grammar able to parse complete PLEX programs. One assembly-sector parser was hard-coded (viz. Figure 2), so we had to recover its grammar by reverse engineering. We had no problems with this task: the comments accompanying the code contained BNF, so the effort was very limited. With all the sector grammars combined, we generated a parser and tested it with an 8 MLOC PLEX test suite. Two files did not parse: they were compiler test files that were not supposed to parse. The rest passed the test. In addition, we generated a web-enabled version of the BNF description as a basis for a complete and correct manual.
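
The clean-up that our extraction tools perform is essentially mechanical. The following minimal sketch reproduces the conversion for the single rule shown above: it drops the semantic actions and the attribute-evaluation comments, and prints the rule both as plain BNF and in SDF style. The input format mirrors the PLEX fragment; the BNF parsers we actually wrote handled six different dialects and many more details.

# Minimal sketch of the clean-up shown above: strip semantic actions
# (%% ... %%) and attribute-evaluation comments (lines starting with --) from
# a raw compiler rule, then print it as plain BNF and as an SDF production.
# The input format mirrors the PLEX fragment above; details are illustrative.
import re

RAW_RULE = """<plex-program>
        = <program-header> <statement-row> 'END' 'PROGRAM' ';'
          %% xnsmtopg(1) ; %%
--      <= sect
--         Compound( ... )
        ;"""

def to_bnf(raw):
    text = "\n".join(l for l in raw.splitlines() if not l.lstrip().startswith("--"))
    text = re.sub(r"%%.*?%%", "", text, flags=re.S)         # drop semantic actions
    lhs, rhs = text.split("=", 1)
    rhs = rhs.rsplit(";", 1)[0]                             # drop the rule-closing ';'
    clean = lambda s: re.sub(r"<([\w-]+)>", r"\1", " ".join(s.split()))
    return clean(lhs), clean(rhs)

def to_sdf(lhs, rhs):
    def sym(s):                                             # 'END' -> "END", foo-bar -> Foo-bar
        return '"%s"' % s.strip("'") if s.startswith("'") else s.capitalize()
    return " ".join(sym(s) for s in rhs.split()) + " -> " + sym(lhs)

if __name__ == "__main__":
    lhs, rhs = to_bnf(RAW_RULE)
    print(lhs, "::=", rhs)
    print(to_sdf(lhs, rhs))

Running it prints exactly the two lines shown above: the plain BNF rule and its SDF counterpart.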

Although this project was successful [14], an earlier (published) attempt to recover the PLEX grammar had failed. In that first attack [15,16] we failed to recover the PLEX grammar from on-line PLEX manuals: those manuals were not good enough to reconstruct the language from. Later, we could check that the manual lacked over 50% of the language definition, so that the recovery process was bound to be incomplete. Initially we concluded that you cannot recover grammars from manuals. But we concluded this too quickly, as we will see next.

Grammar Stealing from Reference Manuals

We illustrate grammar stealing without access to the compiler sources with another published industrial project [5]. Some of our colleagues felt a little fooled by the PLEX result: ``they are not really constructing a parser, they only convert an existing one. Hey, we can do that, too! Now try it without the compiler.'' Indeed, at first sight, not having this valuable knowledge source available is a different issue (after all, we had failed doing exactly this for PLEX). However, we discovered that this failure was not due to the tools we developed, but due to the nature of proprietary manuals: their audience is so limited that major omissions can go unnoticed for a long time. When there is a large audience, the language vendor has to deliver better quality.

In another two-week effort we recovered the VS Cobol II grammar [17] from the manual, again by stealing it. The fully recovered VS Cobol II grammar can be browsed at www.cs.vu.nl/grammars/vs-cobol-ii/. Again, what we did is straightforward, and can be summarized in a list:


  
[Figure 3: The original syntax diagram for the SEARCH statement]

1.
we retrieved the on-line VS Cobol II manual from www.ibm.com;
2.
we extracted its syntax diagrams;
3.
we wrote a parser for the syntax diagrams;
4.
we extracted the BNF from the diagrams;
5.
we added 17 lexical rules by hand;
6.
we corrected the BNF using grammar transformations;
7.
we generated an error-detection parser;
8.
we incrementally parsed 2 million lines of VS Cobol II code;
9.
we reiterated steps 6-8 until all errors vanished;
10.
we converted the BNF to SDF for technical reasons;
11.
we generated a production parser;
12.
we incrementally parsed VS Cobol II code to detect ambiguities;
13.
we solved ambiguities using grammar transformations;
14.
we reiterated steps 11-13 until no more ambiguities were found.

So apart from some error-correction cycles and ambiguity-removal sessions, the process is the same as when you have access to the compiler sources. An error-detection parser is a parser meant to detect errors in the grammar it is generated from. In this case, we used an inefficient top-down parser with infinite lookahead. It accepts practically all context-free grammars and does not bother about ambiguities at all. We use this kind of parser to test the grammar, not to produce parse trees. Since the code is correct according to some compiler, every error this parser detects points at a potential grammar problem. In this way we were pointed to the majority of the omissions, except ambiguities. When all our test code passed the top-down parser, we converted the grammar to SDF and generated a parser that detects ambiguities. In the same vein we corrected the ambiguities. This project also took us two weeks of effort (two persons) in total, including the construction of tools, testing, and so on. It was done for zero US dollars, so we could freely publish the grammar on the Internet [18], as a gift for the 40th birthday of Cobol.
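
For concreteness, the sketch below shows the idea behind such an error-detection parser: a brute-force, backtracking top-down recognizer for an arbitrary context-free grammar (left recursion excluded), which in effect has unlimited lookahead and ignores ambiguity because it only answers whether the input is derivable at all. The grammar encoding and the toy grammar, loosely modeled on the SEARCH statement discussed below, are illustrative assumptions; the real infrastructure was considerably more elaborate.

# Sketch of an error-detection parser: a brute-force backtracking recognizer
# for an arbitrary context-free grammar (no left recursion). It builds no
# trees and ignores ambiguity; it only answers whether a token sequence is
# derivable. Since the test code compiles, every rejection points at a
# potential grammar problem. The grammar encoding is an illustrative assumption.
def derive(grammar, symbols, tokens, pos):
    """Yield every input position reachable after matching `symbols` at `pos`."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head in grammar:                                # nonterminal: try each alternative
        for alternative in grammar[head]:
            for mid in derive(grammar, alternative, tokens, pos):
                yield from derive(grammar, rest, tokens, mid)
    elif pos < len(tokens) and tokens[pos] == head:    # terminal: must match next token
        yield from derive(grammar, rest, tokens, pos + 1)

def recognizes(grammar, start, tokens):
    return any(end == len(tokens) for end in derive(grammar, [start], tokens, 0))

# Toy grammar, loosely modeled on the SEARCH statement discussed below.
GRAMMAR = {
    "search": [["SEARCH", "id", "when-clauses", "END-SEARCH"],
               ["SEARCH", "id", "when-clauses"]],
    "when-clauses": [["when-clause"], ["when-clause", "when-clauses"]],
    "when-clause": [["WHEN", "cond", "stats"]],
    "stats": [["stat"], ["stat", "stats"]],
}

if __name__ == "__main__":
    ok = ["SEARCH", "id", "WHEN", "cond", "stat", "stat", "END-SEARCH"]
    bad = ["SEARCH", "WHEN", "cond", "stat"]
    print(recognizes(GRAMMAR, "search", ok))           # True
    print(recognizes(GRAMMAR, "search", bad))          # False: a potential grammar problem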

To give you an idea of the limited complexity of this effort, we show the original syntax diagram [17] in Figure 3. After conversion to BNF, and correction, it looks like this:

search-statement =
   "SEARCH" identifier ["VARYING" (identifier | index-name)]
   [["AT"] "END" statement-list]
   {"WHEN" condition (statement-list | "NEXT" "SENTENCE")}+
   ["END-SEARCH"]

A dash has been removed between NEXT and SENTENCE. Furthermore, both occurrences of imperative-statement have been replaced by statement-list. The latter is an example of a diagram that was too restrictive: only one statement was allowed, but from the informal text we learned that ``A series of imperative statements can be specified whenever an imperative statement is allowed.'' Both errors were found using our error-detection parser. We parsed code where NEXT SENTENCE was used without a dash; upon inspection of the manual and the grammar, we wrote a grammar transformation repairing the error. The other error was detected in the same way: we parsed code where the compiler allowed more statements than the manual did, and we repaired the error with a grammar transformation.

After all these errors were corrected, we removed ambiguities in a separate phase. We illustrate this phase with an example. In the Cobol CALL statement the following fragment of a syntax diagram is present:

_____________identifier_____________________
          |__ADDRESS__OF__identifier__| 
          |__file-name________________|

This stack of three alternatives can lead to an ambiguity: both identifier and file-name eventually reduce to the same lexical category. So when we parsed a CALL statement without an occurrence of ADDRESS OF, the parser reported an ambiguity, since the other two alternatives were both valid. Without using type information we cannot separate an identifier from a file-name. We first show the ambiguous extracted BNF fragment:

 (identifier | "ADDRESS" "OF" identifier | file-name)

With a grammar transformation we eliminated the file-name alternative.

  (identifier | "ADDRESS" "OF" identifier)

With the adapted grammar the same language is recognized as before; only the ambiguity is gone. Note that this approach is much simpler than tweaking the parser and scanner to deal with the types of names. In this way we recovered the entire VS Cobol II grammar and tested it with all the Cobol code from our earlier software renovation projects, plus code from colleagues who were curious about the outcome of this project. For the final test we used about 2 million lines of pure VS Cobol II code. As in the PLEX case, we generated a fully web-enabled version of both the corrected BNF and the syntax diagrams, which could serve as the core of a complete and correct language reference manual. It is freely available on the Internet [18].
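
The corrections above are typical instances of scripted grammar transformations: substitute one nonterminal for another, or drop an alternative. Below is a minimal sketch of such operations over the same simple grammar encoding as in the previous sketch; the nonterminal name call-argument and the helper names are ours, purely for illustration, and our real transformations were recorded with dedicated grammar-transformation tooling rather than in Python.

# Sketch of scripted grammar transformations over a grammar encoded as a dict
# of nonterminal -> list of alternatives (each a list of symbols). The two
# operations mirror the corrections described above; names are illustrative.
def rename_symbol(grammar, old, new):
    """Replace every occurrence of a symbol, e.g. imperative-statement -> statement-list."""
    return {lhs: [[new if s == old else s for s in alt] for alt in alts]
            for lhs, alts in grammar.items()}

def drop_alternative(grammar, lhs, alternative):
    """Remove one alternative of a rule, e.g. the ambiguous file-name branch."""
    fixed = dict(grammar)
    fixed[lhs] = [alt for alt in grammar[lhs] if alt != alternative]
    return fixed

CALL_ARG = {
    "call-argument": [["identifier"],
                      ["ADDRESS", "OF", "identifier"],
                      ["file-name"]],
}

if __name__ == "__main__":
    print(drop_alternative(CALL_ARG, "call-argument", ["file-name"])["call-argument"])
    print(rename_symbol({"when-phrase": [["WHEN", "condition", "imperative-statement"]]},
                        "imperative-statement", "statement-list"))

Because each transformation is an ordinary script, the whole correction history can be replayed on a fresh extraction, which is what makes the process repeatable.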

   
Conclusion

We explained what the 500 language problem is and how it can be solved in general, by showing that our method covers almost all languages. We illustrated the general solution with two very complex and representative cases: one proprietary language with access to the compiler sources (PLEX), and one well-known business language with access only to a reference manual (VS Cobol II). Both cases are backed up by publications in the scientific literature. We sketched a cost-effective and accurate method to quickly produce parsers for the myriad of languages, so that existing code can be analyzed and modified using tools that work on the trees produced by those parsers. We illustrated our solution by providing details of the process for PLEX and Cobol. Apart from those two languages, more grammars have been recovered by us and by others. From our effort in solving the 500 language problem we learned two interesting lessons:

Acknowledgements

Thanks to Terry Bollinger, Prem Devanbu, Capers Jones, Tom McCabe, Harry Sneed, Ed Yourdon, and the reviewers for their substantial contributions.

Bibliography

1
C. Jones.
Estimating Software Costs.
McGraw-Hill, 1998.

2
C. Jones.
The Year 2000 Software Problem - Quantifying the Costs and Assessing the Consequences.
Addison-Wesley, 1998.

3
G.M. Weinberg.
The Psychology of Computer Programming.
Van Nostrand Reinhold, 1971.

4
P. de Jager.
You've got to be kidding!, 1997.
Retrieved via: http://www.year2000.com/archive/NFkidding.html.

5
R. Lämmel and C. Verhoef.
Semi-automatic Grammar Recovery.
Software--Practice & Experience, 2001.
To Appear. Available via: http://www.cs.vu.nl/~x/ge/ge.pdf.

6
P.T. Devanbu.
GENOA - A Customizable, front-end Retargetable Source Code Analysis Framework.
ACM Transactions on Software Engineering and Methodology, 8(2):177-212, 1999.

7
M.G.J. van den Brand and E. Visser.
Generation of formatters for context-free languages.
ACM Transactions on Software Engineering and Methodology, 5:1-41, 1996.

8
M.G.J. van den Brand, M.P.A. Sellink, and C. Verhoef.
Generation of components for software renovation factories from context-free grammars.
Science of Computer Programming, 36(2-3):209-266, 2000.
Available at http://www.cs.vu.nl/~x/scp/scp.html.

9
M.G.J. van den Brand, M.P.A. Sellink, and C. Verhoef.
Current parsing techniques in software renovation considered harmful.
In S. Tilley and G. Visaggio, editors, Proceedings of the Sixth International Workshop on Program Comprehension, pages 108-117. IEEE Computer Society Press, 1998.
Available at http://www.cs.vu.nl/~x/ref/ref.html.

10
D. Blasband.
Parsing in a hostile world.
In P. Aiken, E. Burd, and R. Koschke, editors, Proceedings of the Working Conference on Reverse Engineering. IEEE Computer Society, 2001.
To appear.

11
J. Brunekreef and B. Diertens.
Towards a user-controlled software renovation factory.
In P. Nesi and C. Verhoef, editors, Proceedings of the Third European Conference on Maintenance and Reengineering, pages 83-90. IEEE Computer Society Press, 1999.

12
L. Wall, T. Christiansen, and R.L. Schwartz.
Programming Perl, 2nd Edition.
O'Reilly & Associates, Inc., 1996.

13
M.P.A. Sellink, H.M. Sneed, and C. Verhoef.
Restructuring of Cobol/CICS legacy systems.
In P. Nesi and C. Verhoef, editors, Proceedings of the Third European Conference on Maintenance and Reengineering, pages 72-82. IEEE Computer Society Press, 1999.
Available at http://www.cs.vu.nl/~x/cics/cics.html.

14
M.P.A. Sellink and C. Verhoef.
Generation of software renovation factories from compilers.
In H. Yang and L. White, editors, Proceedings of the International Conference on Software Maintenance, pages 245-255. IEEE Computer Society Press, 1999.
Available via http://www.cs.vu.nl/~x/com/com.html.

15
M.P.A. Sellink and C. Verhoef.
Development, assessment, and reengineering of language descriptions - extended abstract.
In B. Nuseibeh, D. Redmiles, and A. Quilici, editors, Proceedings of the 13th International Automated Software Engineering Conference, pages 314-317. IEEE Computer Society Press, 1998.
Available at: http://www.cs.vu.nl/~x/ase/ase.html.

16
M.P.A. Sellink and C. Verhoef.
Development, assessment, and reengineering of language descriptions.
In J. Ebert and C. Verhoef, editors, Proceedings of the Fourth European Conference on Software Maintenance and Reengineering, pages 151-160. IEEE Computer Society, March 2000.
Available at: http://www.cs.vu.nl/~x/cale/cale.html.

17
IBM Corporation.
VS COBOL II Reference Summary, Version 1.2. Publication number SX26-3721-05, 1993.

18
R. Lämmel and C. Verhoef.
VS COBOL II grammar Version 1.0.3, 1999.
Available at: http://www.cs.vu.nl/~x/grammars/vs-cobol-ii/.
