Validated the practicality of ANTLR4 C++ parser

twoflat
May 20, 2022

Overview

This article summarizes what I learned when I evaluated ANTLR as a way to build a C++ parser. I am publishing it in the hope that it will be useful to someone.
The results are near the end, so if you are in a hurry, skip to the end.

ANTLR is a parser generator

Do you know ANTLR (ANother Tool for Language Recognition)? It is a tool that helps you process structured text, such as source code, as data.
Making a computer recognize the structure of structured text is called parsing.
The structure of text differs from one programming language to another, so a parser has to be tailored to the grammar of each language.
ANTLR lowers the bar for writing a parser a bit.
With ANTLR you define the structure in its own EBNF-like grammar language (ANTLR4 grammar files use the g4 extension), and ANTLR generates parser source code in the target language you specify. Because it generates parsers, it is called a parser generator.
The features of ANTLR are described in the following excellent article, which I found while researching.

https://future-architect.github.io/articles/20200903/
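
To make "parsing" concrete, here is a minimal hand-written sketch in Java of the kind of code a parser generator saves you from writing. It is not ANTLR output and does not use the ANTLR runtime. The toy grammar in the comment and the class name TinyRecognizer are assumptions made up for this illustration.

```java
// Hand-written recognizer for a toy grammar (NOT generated by ANTLR):
//   expr : term (('+' | '-') term)* ;
//   term : DIGITS | '(' expr ')' ;
final class TinyRecognizer {
    private final String input;
    private int pos = 0;

    private TinyRecognizer(String input) {
        this.input = input;
    }

    /** Returns true if the whole text (spaces ignored) matches the expr rule. */
    static boolean accepts(String text) {
        TinyRecognizer r = new TinyRecognizer(text.replace(" ", ""));
        return r.expr() && r.pos == r.input.length();
    }

    // expr : term (('+' | '-') term)* ;
    private boolean expr() {
        if (!term()) {
            return false;
        }
        while (pos < input.length() && (peek() == '+' || peek() == '-')) {
            pos++; // consume the operator
            if (!term()) {
                return false;
            }
        }
        return true;
    }

    // term : DIGITS | '(' expr ')' ;
    private boolean term() {
        if (pos < input.length() && peek() == '(') {
            pos++; // consume '('
            if (!expr() || pos >= input.length() || peek() != ')') {
                return false;
            }
            pos++; // consume ')'
            return true;
        }
        int start = pos;
        while (pos < input.length() && Character.isDigit(peek())) {
            pos++;
        }
        return pos > start; // at least one digit is required
    }

    private char peek() {
        return input.charAt(pos);
    }

    public static void main(String[] args) {
        System.out.println(accepts("1 + (2 - 3)")); // well-formed input
        System.out.println(accepts("1 + (2 - "));   // unbalanced parenthesis
    }
}
```

Each grammar rule becomes one method, which is also how the parser classes generated by ANTLR are organized. For a grammar as large as C++, writing these methods by hand is what makes a generator attractive.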

The goal is to parse the C++ source code

Since ANTLR generates parsers with a high degree of freedom, it is relatively easy to define a parser for your own programming language. Here, however, the goal is to parse C++ source code.
The reason is that in software development for embedded devices I mainly use C/C++, and if I can analyze that code, I expect to be able to create my own checks that the compiler does not cover.
However, defining the structure of C++ from scratch is a daunting task. Therefore, I will use the C++ grammar definition maintained by volunteers in grammars-v4:

https://github.com/antlr/grammars-v4/tree/master/cpp

Use Eclipse for development environment

This article uses Eclipse as the ANTLR development environment.
This is because I am accustomed to Eclipse and information about it is easy to find on the Internet.
To set up the environment, I referred to the following:

URL1

URL2

Software version

  • Windows10 Pro
  • JDK 1.8
  • Eclipse 2019-12 (4.14.0)

Construction procedure

To install ANTLR in Eclipse:

  • Search for ANTLR in Help -> Eclipse Marketplace
  • Install ANTLR 4 IDE 0.3.6. For the prompts that appear along the way, proceed with Install anyway
  • After restarting Eclipse, go to Window -> Show View -> Other and select ANTLR 4 Parse Tree and Syntax Diagram

When you have completed all the steps, the Eclipse display should look like the one shown in the red frame.

Parsing from the command line

Most compilers can be run from the command line. It would be convenient if the C++ parser could be run the same way:

java -jar xxx.jar filepath

It would also be convenient to report whether any error occurred, so that the result is visible after parsing.
Java is the language used for parser development here, so a jar file is needed to enable command-line execution.
From here on, I explain how to create a jar file that incorporates the Java source code generated from the two g4 files in cpp.

Java project creation and ANTLR runtime settings

You do not need a Java project if you only want to store the g4 files.
However, a Java project is convenient when creating a jar file, so create one that also holds the g4 files, and set up the ANTLR runtime in it.
The procedure is as follows.

  • File -> New -> Project -> Java -> Java Project, enter any name in Project name, and click Finish
  • Make a note of the version number shown in Window -> Preferences -> ANTLR4 -> Tool -> ANTLR Tool -> Version
  • Outside Eclipse, download from here the runtime.jar that matches the version noted above and save it to any local path.
    Note that for version 0.3.6 of the ANTLR 4 IDE, the runtime version was 4.4.
  • Back in Eclipse, right-click the project created above -> Build Path -> Configure Build Path... -> Libraries -> Add External JARs... and select the runtime.jar downloaded above

Copy g4 file

Once the project has been created in the previous step, you can copy the g4 files anywhere in it.
I created an antlr folder for the g4 files at the same level as the src folder and copied them there, so the result looks like this.

Check source code generation from g4 file

Source code generation may run automatically right after copying. To make the generated source code easier to reference from the classes created later, modify both CPP14Lexer.g4 and CPP14Parser.g4 to add a package name with @header.
The namespace that follows package can be anything you like.

CPP14Parser.g4

...
parser grammar CPP14Parser;
@header {
package antlr.cpp14.parser;
}
...

CPP14Lexer.g4

...
lexer grammar CPP14Lexer;
@header {
package antlr.cpp14.parser;
}
...

After making both changes, select File -> Save All.
After saving, confirm that the source code has been generated automatically.
You should see multiple generated files, as shown in the red frame.

With the default settings, the files are generated deep in the hierarchy, for example target\generated-sources\antlr4**\CPP14Lexer.java. You can make the path shallower by specifying the generation path in the preferences.

Create the package for these files under the src folder, copy the generated files into it, and you are done.

Get parsing error in listener

As mentioned at the beginning, the tool should report whether any error occurred during parsing. The CPP14Parser class generated by ANTLR lets you register listeners that are notified of syntax errors. A listener only has to implement the ANTLRErrorListener interface, so we prepare one here. The implementation records in a private field whether at least one error occurred, so that other classes can query it after parsing.

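import java.util.BitSet;

import org.antlr.v4.runtime.ANTLRErrorListener;
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.antlr.v4.runtime.atn.ATNConfigSet;
import org.antlr.v4.runtime.dfa.DFA;

public class Cpp14ErrorListener implements ANTLRErrorListener {

    // Becomes true once at least one syntax error has been reported.
    private boolean error = false;

    // Called for every syntax error, so this is where the error is recorded.
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine,
            String msg, RecognitionException e) {
        error = true;
    }

    // The three callbacks below are prediction diagnostics, not syntax errors, so they are ignored.
    @Override
    public void reportAmbiguity(Parser recognizer, DFA dfa, int startIndex, int stopIndex, boolean exact,
            BitSet ambigAlts, ATNConfigSet configs) {
    }

    @Override
    public void reportAttemptingFullContext(Parser recognizer, DFA dfa, int startIndex, int stopIndex,
            BitSet conflictingAlts, ATNConfigSet configs) {
    }

    @Override
    public void reportContextSensitivity(Parser recognizer, DFA dfa, int startIndex, int stopIndex, int prediction,
            ATNConfigSet configs) {
    }

    // Lets other classes ask, after parsing, whether any error was recorded.
    public boolean hasError() {
        return error;
    }
}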

Create a class with main method

To run the jar from the command line, you need a class with a main method.
Right-click the package -> New -> Class, enter the class name in the Name box of the dialog that appears, check the public static void main checkbox, and click Finish.

Call parser from main method

Now, to be able to run

java -jar xxx.jar filepath

the main method must treat its argument as a file path and pass that file to the parser.
With reference to

https://stackoverflow.com/questions/34409396/how-to-read-input-from-a-file-using-antlrinputstream

I implemented the main method as follows.

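// Imports for the class that contains main:
import java.io.FileInputStream;
import java.io.IOException;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;

import antlr.cpp14.parser.CPP14Lexer;
import antlr.cpp14.parser.CPP14Parser;

public static void main(String[] args) {
    if (args.length < 1) {
        System.err.println("Usage: java -jar xxx.jar filepath");
        return;
    }
    // try-with-resources closes the stream even if reading or parsing throws.
    try (FileInputStream fis = new FileInputStream(args[0])) {
        // Characters -> tokens -> parser.
        ANTLRInputStream stream = new ANTLRInputStream(fis);
        CPP14Lexer lexer = new CPP14Lexer(stream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        CPP14Parser parser = new CPP14Parser(tokens);

        // Register the listener that records whether an error was reported.
        Cpp14ErrorListener listener = new Cpp14ErrorListener();
        parser.addErrorListener(listener);

        // Parse starting from the root rule of the grammar.
        parser.translationUnit();

        // 0 = no error recorded, 1 = at least one error recorded.
        int errorCode = listener.hasError() ? 1 : 0;
        System.out.println("Finished. Result: " + errorCode + " file:///" + args[0]);
    } catch (IOException e) {
        // Also covers FileNotFoundException, which is a subclass of IOException.
        e.printStackTrace();
    }
}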

To make the parser perform parsing, call the method named after the root rule of the grammar:

parser.translationUnit();

ANTLR generates one method per parser rule, named after the rule.
This convention is implicit in ANTLR, so keep it in mind when creating your own language.

Make a jar file

Now that the necessary code is complete, the next step is to create a jar file. The jar must include runtime.jar.
The procedure is as follows.

  • Right-click the project -> Run As -> Java Application, select Console, and click OK. This step creates the Launch Configuration entry that you will select in a later step.
  • Right-click the project name -> Export... -> Java -> select the Runnable JAR file entry
  • Select Console - antlr.cpp14 from the Launch Configuration pull-down
  • Select any path for Export destination
  • Under Library handling, confirm that Extract required ~ is selected, and click Finish

To make the jar runnable on its own, you can also use Package ~ instead of Extract ~.
You can check the details here.

If you want to distribute it as an exe file, please refer to here.

Validate parser

The target source code is ESP32

Now that we are ready to parse, let's verify the parser's performance.
For verification, I looked for source code for embedded devices that includes C++ and is regularly maintained.

https://github.com/espressif/arduino-esp32

This repository seemed suitable, so I will use it.
It is C/C++ source code developed and maintained for the ESP32, a Raspberry Pi-like microcontroller board.
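
To feed the parser, the C/C++ files in a local checkout of the repository have to be listed first. The following is a minimal sketch of one way to do that with the standard library only. The class name SourceFileCollector and the list of file extensions are assumptions for illustration; the article does not state which extensions were counted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SourceFileCollector {
    // Assumption: these extensions identify C/C++ sources; adjust as needed.
    private static final List<String> EXTENSIONS = Arrays.asList(".c", ".cpp", ".h", ".hpp");

    /** True if the file name ends with one of the assumed C/C++ extensions. */
    static boolean isCOrCppSource(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return EXTENSIONS.stream().anyMatch(name::endsWith);
    }

    /** Walks the directory tree under root and returns the matching regular files, sorted. */
    static List<Path> collect(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(SourceFileCollector::isCOrCppSource)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // The first argument is the checkout directory; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : collect(root)) {
            System.out.println(p);
        }
    }
}
```

Each printed path can then be passed to java -jar xxx.jar as the filepath argument.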

About 70% passed

The metric is the percentage of successful parses, that is, (number of files parsed without error / total number of C/C++ source files parsed) * 100.
The resulting success rate was 71% (102 files without error out of 143 files in total).
The compiler compiles 100% of the ESP32 source code successfully, so this is a fairly low number by comparison.
A low figure may be unavoidable because a C++ parser is being used to analyze C source code as well, but as it stands the parser cannot be called practical.
So, next, I will try various improvements to eliminate the causes of the parse errors.
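
The metric can be computed from the lines that the main method prints. The following is a minimal sketch, assuming the output lines have been collected into a list; the class name ParseSuccessRate and the sample file names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ParseSuccessRate {
    // Prefix printed by the main method of the parser jar:
    // "Finished. Result: <code> file:///<path>"
    private static final String PREFIX = "Finished. Result: ";

    /** True if the line reports result code 0, meaning no error was recorded. */
    static boolean isSuccess(String line) {
        return line.startsWith(PREFIX + "0 ");
    }

    /** (succeeded / total) * 100, computed in floating point. */
    static double percent(long succeeded, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * succeeded / total;
    }

    public static void main(String[] args) {
        // Hypothetical output lines, for illustration only.
        List<String> lines = Arrays.asList(
                "Finished. Result: 0 file:///a.cpp",
                "Finished. Result: 1 file:///b.c",
                "Finished. Result: 0 file:///c.h");
        long ok = lines.stream().filter(ParseSuccessRate::isSuccess).count();
        System.out.printf("%d of %d files parsed without error%n", ok, lines.size());

        // The figures reported in this article: 102 of 143 files, which rounds to 71.
        System.out.println(Math.round(percent(102, 143)) + "%");
    }
}
```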

Other

The original article (in Japanese) is here.

twoflat

I am a developer of software for embedded devices. I would like to share what I have practiced and learned about programming languages, AI, etc.