The problem: You are using Antlr 1.33 to help you design and implement a new programming language; you have a base document describing your language and you have an Antlr parser that includes all the headers, token definitions, C++ actions, and so on. Now you make a change to the language definition in your base document; your compiler is now out of date. What do you do?
USQAGMS is designed to solve the problem of keeping the compiler's parser up to date after a change in the language definition.
The core of the system is
Components 4 through 6 depend on Word Perfect and the specific publication format that I chose for my language definition document. You will have to modify the components if you use a different publication format, and/or write another extraction macro if you use something other than Word Perfect for your language definition. (Components useful for other formats and/or word processors gratefully accepted for future releases of USQAGMS.)
This is the central program in the suite. It is designed to compare two files containing Antlr rules, one of which, the `old' version, has added actions. It writes a rule file containing the same rules as the `new' file, but with actions transferred from the `old' file wherever rules or alternatives are the same in both files. (Note that these are NOT complete Antlr files, just the rule sections - but the rest of the suite is designed to operate in a way that takes care of this.)
The command syntax is:
gramupd old.g new.g keepfile newer.g
where old.g and new.g are the old and new files respectively, and newer.g is new.g with actions from old.g inserted where appropriate. The intention is that new.g should be a `bare' rule file extracted automatically from your language definition document.
keepfile is a file containing a list of rules that are to be transferred unmodified from old.g into newer.g, irrespective of whether the rule is modified in new.g. (This handles cases where the publication rule is intended to differ from the Antlr rule; for example, you might describe a rule called `digit' as `any digit', whereas the Antlr file will, of course, want a specific character range.)
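The keep-list lookup can be pictured with a small shell sketch. This is an illustration only, not the code inside gramupd: the helper name `is_kept` is hypothetical, and it assumes the keep file is a plain whitespace-separated list of rule names.

```shell
#!/bin/sh
# Hypothetical helper, not part of USQAGMS: succeed if rule name $1 is
# listed in keep file $2. Assumes whitespace-separated rule names.
is_kept() {
    # Put each name on its own line, then look for an exact whole-line match.
    tr -s ' \t' '\n\n' < "$2" | grep -Fqx -- "$1"
}

# Demo with a throwaway keep file.
keep=$(mktemp)
printf 'identifier character number string_literal operator_name exception_id\n' > "$keep"

for rule in number joe; do
    if is_kept "$rule" "$keep"; then
        echo "$rule: keep the old.g version unmodified"
    else
        echo "$rule: take the rule from new.g and transfer matching actions"
    fi
done
```

Matching whole names rather than substrings matters here: a rule called `num` must not be treated as kept just because `number` is in the list.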
Example: Suppose we have three files:
---------------------- old.g -----------------------
// Leading stuff
// leading stuff #2
fred [ int i] :
/*()*/ ( ( "word" | "hello" ) )?( w:"word" | ( "hello" ) // test add
| ( { "that" | mary } | "sweetheart" ) << do some code >>
)+ << and some more >>
; << final stuff for fred >>
identifier :
special stuff
;
number :
A rule that must not be overridden // with comments
;
joe :
/*<>*/ ( "A" )? => << stuff >>? aa: "a" | bb: "c" << do something with #bb and #aa >>
;
<< final stuff for joe >>
---------------------- end of old.g -----------------------
---------------------- new.g -----------------------
fred :
( "word" | ( "hello" )
| ( { "that" | mary } | "sweetheart" )
)+
;
number :
A rule you SHOULD NOT SEE !
;
mary :
"this" "," "that"
;
joe :
"d" | "a"
;
---------------------- end of new.g -----------------------
---------------------- keeplist -----------------------
identifier character number string_literal operator_name exception_id
---------------------- end of keeplist -----------------------
Executing the command:
gramupd old.g new.g keeplist newer.g
Creates the following file:
---------------------- newer.g -----------------------
// Leading stuff
// leading stuff #2
fred [ int i] :
/*()*/ ( ( "word" | "hello" ) )?
(
w: "word"
|
(
"hello"
) // test add
|
(
{
"that"
|
mary
}
|
"sweetheart"
) << do some code >>
)+ << and some more >>
; << final stuff for fred >>
number :
A
rule
that
must
not
be
overridden // with comments
;
mary /********* THIS RULE IS NEW *********/ :
"this"
","
"that"
;
joe :
/* THIS ALTERNATIVE DELETED FROM THIS RULE:
bb: "c" << do something with #bb and #aa >>
*/
// THIS ALTERNATIVE IS NEW!
"d"
|
/*<>*/ ( "A" )? => << stuff >>?
aa: "a"
;
<< final stuff for joe >>
//**************************************************
//**************************************************
// THE FOLLOWING RULES DELETED IN THE NEW VERSION::
//**************************************************
//**************************************************
#ifdef DELETED_RULES
// The following rule was in the keep file:
identifier :
special
stuff
;
#endif
//****************************************************
//****************************************************
// THE FOLLOWING KEEPLIST RULES DON'T EXIST ANYWHERE:
//****************************************************
//****************************************************
#ifdef NON_EXISTENT_RULES
character :
string_literal :
operator_name :
exception_id :
#endif
---------------------- end of newer.g -----------------------
As the above shows, the output is fully commented with highly visible notes about deleted, modified, and added rules or alternatives. The program does not attempt to be `creative' by doing anything that is rightfully a human decision: all it does is transfer actions (including comments and predicates) where a rule or an alternative is exactly the same in both versions. After running gramupd, the compiler writer should take a look at the warning comments, deleting them as each change is verified for correctness. After the first run, the file will be formatted in a style that helps make the differences between rules and actions clear. Basically, each line holds one rule element, with any actions to its right.
Comments are treated like actions and transferred across accordingly. There is only one circumstance that causes the program some confusion, namely syntactic predicates, as these look superficially like rules, and yet they are a species of action. The program expects a convention to be used to warn it of an approaching syntactic predicate. The comment:
/*()*/
should be placed immediately before the syntactic predicate. This is illustrated in the example above.
The highly visible comments about changes are designed to make it easy for you to proceed quickly through the modified file checking that everything has worked okay, and deleting the `waffle' as you go. For example, if a rule flagged as deleted really was intended to be deleted, then the rule and the `deleted' message can themselves be deleted as soon as you have checked it out.
IMPORTANT WARNING: A subsequent modification run should not be made until the file has indeed been cleaned up as just described, or the program will get very confused indeed!
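One way to guard against running the update on a file that has not been cleaned up is to search it for the change notes first. The sketch below is an assumption-laden illustration: the helper name `leftover_notes` is hypothetical, and the marker strings are simply those visible in the example output above, so the list may not cover every note gramupd can write.

```shell
#!/bin/sh
# Hypothetical helper, not part of USQAGMS: print (with line numbers) any
# change notes still present in grammar file $1. Exit status is 0 when at
# least one note is found, non-zero otherwise.
leftover_notes() {
    grep -F -n \
        -e 'THIS RULE IS NEW' \
        -e 'THIS ALTERNATIVE IS NEW' \
        -e 'THIS ALTERNATIVE DELETED FROM THIS RULE' \
        -e 'THE FOLLOWING RULES DELETED IN THE NEW VERSION' \
        -e "THE FOLLOWING KEEPLIST RULES DON'T EXIST ANYWHERE" \
        "$1"
}

# Demo with two throwaway files.
work=$(mktemp -d)
printf 'mary /********* THIS RULE IS NEW *********/ :\n"this"\n;\n' > "$work/dirty.g"
printf 'mary :\n"this"\n;\n' > "$work/clean.g"

for f in "$work/dirty.g" "$work/clean.g"; do
    if leftover_notes "$f" > /dev/null; then
        echo "$f: still has change notes; clean up before the next run"
    else
        echo "$f: no change notes found"
    fi
done
```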
In essence, the idea is to split your Antlr grammar into the following pieces. (I shall use the name "YOURLANG" throughout as the name of the language under development and of its grammar file, but you must change this to the name of your language, including editing the shell scripts, as described shortly.) Here are the pieces:
COMPONENTS = cnv-data/g-header1 \
cnv-data/keywords \
cnv-data/g-header2 \
YOURLANG.g \
cnv-data/g-footer
g-header1, g-header2, and g-footer are parts of the entire Antlr file, as described below, keywords is automatically extracted by USQAGMS, and YOURLANG.g is the bare `rules' part of the Antlr grammar. You no longer edit the entire ".g" file as before: you edit instead the files g-header1, g-header2, and g-footer (but not keywords) and you may edit YOURLANG.g only for the purpose of adding or changing the actions (but not to change the Antlr rules themselves). The rules themselves are edited in the word processor document describing your language; this is the big benefit of USQAGMS, even if it requires discipline: your language definition and your compiler will always be in sync.
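As a rough sketch of how the pieces fit together, the complete Antlr file can be thought of as a plain concatenation in COMPONENTS order. The helper name `assemble_grammar` and the output file name `full.g` are assumptions made for this illustration; the build files supplied with the suite may do this differently.

```shell
#!/bin/sh
# Hypothetical helper, not part of USQAGMS: concatenate the five pieces,
# in COMPONENTS order, into one complete Antlr grammar file.
#   $1 = development directory, $2 = language name, $3 = output file
assemble_grammar() {
    cat "$1/cnv-data/g-header1" \
        "$1/cnv-data/keywords" \
        "$1/cnv-data/g-header2" \
        "$1/$2.g" \
        "$1/cnv-data/g-footer" > "$3"
}

# Demo in a throwaway directory with one-line stand-ins for each piece.
demo=$(mktemp -d)
mkdir "$demo/cnv-data"
printf 'header1\n'  > "$demo/cnv-data/g-header1"
printf 'keywords\n' > "$demo/cnv-data/keywords"
printf 'header2\n'  > "$demo/cnv-data/g-header2"
printf 'rules\n'    > "$demo/YOURLANG.g"
printf 'footer\n'   > "$demo/cnv-data/g-footer"

assemble_grammar "$demo" YOURLANG "$demo/full.g"
cat "$demo/full.g"
```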
Let's start with just the final (Antlr) part of the process, the part you will most likely perform under Linux or some other Unix.
The supplied directory cnv-data contains a sample of the kinds of conversion files you will need to run your system. You must first set up a cnv-data directory subsidiary to the directory where you are developing your compiler (I'll call this development directory COMPILER from now on; in other words, the cnv-data directory is COMPILER/cnv-data). In cnv-data, you need the following files:
g-header1: Your grammar file from the beginning up to the point where you want automatically-generated token definitions inserted.
g-header2: Your grammar file from after the automatically generated token definitions up to the start of the grammar proper.
g-footer: Your grammar file from the end of the grammar rules to the end of the file.
keeplist: A list of rules that are to be retained unaltered from the previous version of the grammar, even if they have been modified in the publication document. Commonly, these are the rules that are written in English in the publication document, or rules for which you have worked out a better compilation rule than the one you wish to see in your language definition. (I have used Word Perfect's hidden text feature to reduce the frequency of this, as I can have two versions of a rule in the definition document, but only one version is seen in a printed document.)
Also set up two empty directories, cnv-data-bak1 and cnv-data-bak2 in the COMPILER directory. These are used to hold the previous two generations of your compiler in case something goes wrong. Yes, there's the version control system also, but it can be handy to have the previous version directly to hand. I've never had an error that caused me to use these backups, but then I'm paranoid. :-)
Next, make a copy of the USQAGMS/conversion directory in the COMPILER directory. There are four shell scripts to perform various phases of the update process. You must modify these files to suit your compiler. Basically, each script must be edited to replace the word "YOURLANG" with the name of your language. The scripts assume that the COMPILER directory contains a file, YOURLANG.g, which is the target file for the conversion process. Each script should be executed from within the conversion directory.
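The renaming can be done by hand in an editor, or with something like the sketch below. The helper name `set_language_name`, the language name "Mylang", and the one-line stand-in for script3 are all made up for this illustration; check each real script afterwards, since other edits (such as directories) are still needed.

```shell
#!/bin/sh
# Hypothetical helper, not part of USQAGMS: replace every occurrence of
# YOURLANG with the name in $1 in each file named after it.
# Assumes the name contains no "/" or "&" (both are special to sed).
set_language_name() {
    name=$1
    shift
    for f in "$@"; do
        # Write through a temporary file, then copy back so the original
        # file keeps its permissions.
        sed "s/YOURLANG/$name/g" "$f" > "$f.tmp" &&
        cat "$f.tmp" > "$f" &&
        rm -f "$f.tmp"
    done
}

# Demo on a made-up one-line stand-in for a conversion script.
conv=$(mktemp -d)
printf 'cp ../YOURLANG.g ../cnv-data/old-rules.g\n' > "$conv/script3"

set_language_name Mylang "$conv/script3"
cat "$conv/script3"
```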
script0: This script backs up ../cnv-data to ../cnv-data-bak1 and ../cnv-data-bak1 to ../cnv-data-bak2 and puts a copy of ../YOURLANG.g into ../cnv-data-bak1.
script1: This script converts the file gram.grm to ../cnv-data/gram01.g. gram.grm is the name of your publication-format grammar file extracted from your language definition (you must edit the script to name the directory where it resides). It also does a test run of Antlr; edit this command to suit or delete it if not appropriate.
script2: This script converts ../cnv-data/gram01.g to ../cnv-data/gram02.g extracting key words and writing to ../cnv-data/keywords.
script3: This script makes a copy of ../YOURLANG.g under the name ../cnv-data/old-rules.g, then converts ../cnv-data/gram02.g to ../YOURLANG.g merging actions from ../cnv-data/old-rules.g except for rules named in ../cnv-data/keeplist.
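The backup rotation described for script0 might look something like the following. This is a sketch written from the description above, not the supplied script: the helper name `rotate_backups` is hypothetical, and the order of the copies (older generation first) is an assumption about what makes the rotation safe.

```shell
#!/bin/sh
# Hypothetical sketch of the script0 rotation, not the supplied script.
#   $1 = development directory (COMPILER), $2 = language name
rotate_backups() {
    bak1=$1/cnv-data-bak1
    bak2=$1/cnv-data-bak2
    # Save the older generation first, so bak1 is copied before it is replaced.
    rm -rf "$bak2" && mkdir "$bak2" && cp -R "$bak1/." "$bak2/" &&
    rm -rf "$bak1" && mkdir "$bak1" && cp -R "$1/cnv-data/." "$bak1/" &&
    cp "$1/$2.g" "$bak1/"
}

# Demo in a throwaway directory.
top=$(mktemp -d)
mkdir "$top/cnv-data" "$top/cnv-data-bak1" "$top/cnv-data-bak2"
printf 'current\n'  > "$top/cnv-data/keeplist"
printf 'previous\n' > "$top/cnv-data-bak1/keeplist"
printf 'rules\n'    > "$top/YOURLANG.g"

rotate_backups "$top" YOURLANG
cat "$top/cnv-data-bak1/keeplist" "$top/cnv-data-bak2/keeplist"
```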
After you have done all these things, you should be able to test an upgrade to your compiler by running each script in order.
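Running the scripts in order, and stopping at the first one that fails, can be sketched as below. The wrapper name `run_all_scripts` is hypothetical and the stub scripts in the demo only record that they ran; the real scripts are the ones in your conversion directory.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of USQAGMS: run script0..script3 in order
# from inside the conversion directory $1, stopping at the first failure.
run_all_scripts() {
    (
        cd "$1" || exit 1
        for s in script0 script1 script2 script3; do
            echo "running $s"
            sh "./$s" || { echo "$s failed; stopping" >&2; exit 1; }
        done
    )
}

# Demo with stub scripts that just log their own names.
conv=$(mktemp -d)
for s in script0 script1 script2 script3; do
    printf 'echo %s >> run.log\n' "$s" > "$conv/$s"
done

run_all_scripts "$conv"
cat "$conv/run.log"
```

Stopping on the first failure keeps a broken intermediate file from being merged into YOURLANG.g by a later script.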
The file DOS/getgramr.wpm is a Word Perfect 6.0 for DOS macro that extracts any text enclosed in the style "GRules" and writes it to a file, GRAM.GRM. By using hidden text, you can include text that will not be seen in the printed language definition, or you can, with a hidden comment symbol, exclude text from your compiler that is visible in the document. Similar macros for other versions of Word Perfect, for other word processors, or for Latex will be gratefully accepted and included in a later version of USQAGMS.
DOS/check.c is a quick and dirty program (mentioned above) for doing some simple sanity checks on a grammar in my publication format. This might not be useful to you, or you might want to write one of your own in Antlr. Basically, my format uses a variant of BNF as follows:
Nonterminals are represented by lower-case identifiers, including hyphens.
Terminals are represented by quoted strings, for example, "function".
Optional parts of definitions are represented inside square brackets: [ optional ].
Subsections of a definition to be treated as a unit are written inside parentheses: ( unit ).
"&" connects two parts which must both appear, but in either order: one & another.
"|" connects alternative derivations: this | that.
"&|" connects two parts, either or both of which must appear, in either order: one &| another.
Sections of a definition which may occur one or more times are written inside braces: { list }. A semicolon or comma immediately before the "}" indicates the separator between items of the list; the separator may not follow the final list item. If both a semicolon and a comma are shown, then the list may be punctuated by either semicolons or commas, but the same symbol must be used throughout any single list. Such a semicolon or comma is not written in quotes.
Example:
identifier = letter [ { letter | digit } ]
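As an illustration of the kind of simple sanity check that is possible on this format, the sketch below tests whether the three bracket pairs occur in matching numbers on one definition line. It does not claim to reproduce what check.c does; the helper names `count_char` and `balanced_line` are hypothetical, and the check only compares counts, not nesting order.

```shell
#!/bin/sh
# Hypothetical helpers, not part of USQAGMS.

# Count how many times character $2 occurs in string $1.
count_char() {
    printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d ' '
}

# Succeed if [ ], ( ) and { } each occur in equal numbers on line $1.
balanced_line() {
    # Drop quoted terminals first, so a "(" inside quotes is not counted.
    bare=$(printf '%s\n' "$1" | sed 's/"[^"]*"//g')
    [ "$(count_char "$bare" '[')" -eq "$(count_char "$bare" ']')" ] &&
    [ "$(count_char "$bare" '(')" -eq "$(count_char "$bare" ')')" ] &&
    [ "$(count_char "$bare" '{')" -eq "$(count_char "$bare" '}')" ]
}

for rule in 'identifier = letter [ { letter | digit } ]' \
            'broken = ( letter [ digit )'; do
    if balanced_line "$rule"; then
        echo "ok:    $rule"
    else
        echo "check: $rule"
    fi
done
```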
Checkers and/or converter for different publication formats will also be gratefully accepted for future versions.
This system takes a little setting up, but then compiler writers are pretty cluey people, so I don't expect there will be any big problems. The main update program itself has been run successfully on huge files with massive differences as well as on files with just an odd change. Errors are more likely to come from editing the wrong file, so please take care to follow the above instructions.
I suspect there is no reason a system like this couldn't be developed for YACC. If anyone modifies this system and gets it going for YACC, please let me know. An Antlr V2 variant would be nice as well.
And lastly, bugs may not be wanted, but bug reports will also be received with thanks.
Good Luck!
Mr Ron House house@usq.edu.au
Created: 28/1/98 Modified: 2/2/98