”工欲善其事,必先利其器。“—孔子《论语.录灵公》
首页 > 编程 > 第一部分:实现用于构建 DSL 的表达式解释器 - 介绍 PEG 解析器

第一部分:实现用于构建 DSL 的表达式解释器 - 介绍 PEG 解析器

发布于2024-08-28
浏览:110

Part I: Implement an expression interpreter for building DSL - Introduce the PEG parser

In the last post I have introduced you the WHY and HOW I start the this project and WHAT the DSL looks like in the end. Starting from this post I will share the implementation of the whole project with you.

Generally, when we implement a language, the first thing comes in our mind is lexer and then parser. So in this post I will introduce to you how I implement my DSL with specified detail but less conception so that it won't be too confused I hope.

What's lexer?

In general, lexer is used for Lexical Analysis or tokenization if you will. Let's take the clause "We will rock you!" (the famous rock and roll music from Queen) as an example. After we tokenize it following the grammar rule of English, it shall output a list [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation("!")]. So a lexer is mostly used for classify texts into certain types by its lexical meaning. This is significant for us since a syntax is actually consist of lexical elements instead of characters or words.

Usually we implement a lexer with some code generator that can parse Regular Express like Ragel, nex and so on. But I think you will be surprised how easy it is to implement a lexer after checking Lexical Scanning in Go from Rob Pike. He had introduced an interesting pattern to implement a finite-state machine which I think is the core of an lexer.

How about the parser?

So how about a parser? What is it used for? Basically, a parser is used to identify a list of lexical elements with the specified pattern which we also called it grammar. Take the example "We will rock you!" we introduced before, it produces a sequence of [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation("!")] which matches the pattern of 'Future tense' grammar in English. So this is exactly what a parser do and is so called 'syntax analysis'.

Let's take another example in a more computer way, how about an expression like 1 2 * 3 ? It is obvious that they will be translated into [Number(1), Operator( ), Number(2), Operator(*), Number(3)] by the lexer, but what this sequence will be translated into by a parser with a common mathematical expression grammar? In general, a lexical elements sequence will got translated into an Abstract Syntax Tree(or AST in short) by parser like this:

      *
     / \
        3
   /  \
  1    2

With an abstract syntax tree we can analyze the syntax in the correct order according to the rule of our grammar.

Let's implement a parser

Now we are gonna implement an parser on our own. Well not entirely on our own, we still need some tool to help our to generate code for our parser since it is hard to implement it correctly and a hand-writing parser might be difficult to maintain, even if you do, the performance might be poor.

Luckily, there are many mature parser generative tools that helps us to generate golang code with a grammar definition file. The first choice came in my mind is go-yacc which is the official tool to generate go code for parser. I used to use it to write a SQL analyzer and the it wasn't a pleasure because it:

  • requires an external lexer.
  • lack of documentation.
  • union definition and value type declaration are both quite confusing.

"Why not try something new?" I think, so there we go, with the amazing tool peg I was able to implement both of the lexer and parser in one single grammar file, one single interface. Let's take a closer look at the peg.

A closer look at PEG

PEG stands for Parsing Expression Grammar introduced by Bryan Ford in 2004, which is an alternative to the traditional Context Free Grammar used for describing and expressing programming language syntax and protocol.

For the last decades, we have been using parser generative tool like yacc, bison that supports CFG to generate parser code, and if you have used them before you might find it difficult to avoid ambiguity and integrate them with the lexer or the regular expression. In fact, the syntax of a programming language is not just patterns of lexical elements but also the rules of those lexical elements which somehow the CFG is missing so when we use tools like yacc we will have to implement lexer by our self. Further more, to avoid ambiguity(like the precedence between plus and multiply, check this out) in CFG we have to define the precedence for each operator. All of these crucial fact makes it unnecessarily difficult to develop a parser.

But thanks to Bryan Ford, now we have another good choice, the PEG that allows us to define the lexical and syntax rule all in one single file with a tight DSL and resolve ambiguity in an elegant and simple way. Let me show you how easy it can be done with peg.

Example and code come first

I gonna take examples from my gendsl library which implements a lisp-like syntax(you can check it out here). Here is a simple snippet that can parse hex and decimal number literals in the golang style:

package playground

type parser Peg {
}

Script          



The first line package gendsl is package declaration which decides which package the generated golang file belongs to. The following type declaration type parser Peg {} used to define the parser type, which we will use it later for evaluation but you can ignore it for now.

After the parser type declaration we can start to define your syntax rule till the end. This is different from the workflow I used to work with with yacc when I have to define a union type and a lot of token types before I can actually define my grammar, which could be really confusing. Anyway, let's take a quick look at the grammar definition.

The very first rule

If you have worked with CFG before you might find the definition DSL syntax quite familiar. The right hand side of the '

Let's pay attention to the first pattern rule here since the first rule is always to entry point of the parser. The entry point Script here is consist of two parts, one is a rule refers to Value which is consist of a sequence of specified characters(we will get back to this later), the other one EOF is kind of interesting. Let's jump to the next rule to find the pattern of EOF. As you can see, EOF is consist of !.. What does !. mean? The !actually means NOT, and . means any character, so !. means NOTHING AT ALL or End Of File if you will. As a result whenever the parser find there is no character to read, it will stop here and treat it as an dummy rule call EOF which might produces the rule Script. This is quite a common pattern for PEG.

More about the rule syntax

So much like the regular expression(RE), the syntax of PEG is simple:

  • . stands for any character.
  • character set like [a-z] is also supported.
  • X matches one character if X is a character or a pattern when X is the name of an rule.
  • X? matches one character or pattern or none, while X* matches zero or more and 'X ' matches one or more.
  • X \ Y matches X or Y where X and Y could be characters, patterns or rule name.

Take the rule of DecimalNumeral as an example. The first part [1-9] means the start of an DecimalNumeral must be one of a digit ranging from 1 to 9, ([_]* [0-9])* means starting from the second position, all character, if there is any, must all be digit(0-9) that has might have no '_' or more than one '_' as its prefix so it could match string like "10_2_3". Otherwise, indicated by the operator \, it could also just be one single character '0' which means 0 obviously .

Resolving ambiguity

I'd like to spend more time to explain the or operator \, since it is quite important as the solution to the ambiguity. The PEG will always try to match the first pattern and then the second, the third and so on til it finds one matched, which is considered as earliest-match-first. For example, a string "ab" will never be able to match the grammar G

There is no much syntax left, you can explore them yourself in the pointlander/peg README or peg doc.

Let's have a try

Now that we already have a simple syntax rule prepared above, though it is not the whole grammar for the gendsl project but it can still parse some numbers. Anyway let's generate some code and see if it works as we expect.

Preparation

First we have to install the peg binary tool for code generate following this guide, then we gonna setup our workspace directory for playing:

> mkdir peg_playground && peg_playground 
> go mod init peg_playground 
> touch grammar.peg

Paste the grammar we have before into the peg_playground/grammar.peg, now we should be able to genreate the code using the generate tool but why not make a Makefile in peg_playground/makefile

GO := go

.SUFFIXES: .peg .go

grammar.go: grammar.peg
    peg -switch -inline -strict -output ./$@ $



Generate and test the parser

Now that we have everything ready, let's generate the code for parser:

make grammar.go

After running the command, you should see a generated grammar.go in the workspace directory. Let's write a function to parse a string and access our parser:

// peg_playground/parser.go
package playground

func PrintAST(script string) error {
    parser := &parser{
        Buffer: script,
        Pretty: true,
    }

    if err := parser.Init(); err != nil {
        return err
    }
    if err := parser.Parse(); err != nil {
        return err
    }

    parser.PrintSyntaxTree()
    return nil
}

The snippet here is pretty simple, it initializes the parser, parses the script we pass to it and print the syntax tree in final. Let's write an unit test to see if it works:

// peg_playground/parser_test.go
package playground

import (
    "testing"
)

func TestPrintTree(t *testing.T) {
    if err := PrintAST(`0x123`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")

    if err := PrintAST(`10_2_3`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")
}

The test function TestPrintTree calls the PrintAST and check the error. Let's run it now and see what it gonna print:

go test . -v

Now we should see the whole syntax tree in the output:

=== RUN   TestPrintTree
Script "0x123"
 Value "0x123"
  IntegerLiteral "0x123"
   HexNumeral "123"
    HexDigit "1"
    HexDigit "2"
    HexDigit "3"
    parser_test.go:11: -----------------------------------------------------
Script "10_2_3"
 Value "10_2_3"
  IntegerLiteral "10_2_3"
   DecimalNumeral "10_2_3"
    parser_test.go:16: -----------------------------------------------------
--- PASS: TestPrintTree (0.00s)
PASS
ok      playground      0.649s

It looks great, right? Everything works as we expected. No syntax error thrown and it prints every rule matched and the string it matches in a format of tree, which could be really useful when debugging.

Wrap it up

In this post, I have introduced you the two basic but significant parts of interpreter programming language:

  • Lexer, for lexical analysis that transforms a string into a sequence of lexical elements.
  • Parser, for syntax analysis that identify the the pattern (so called grammar) in the lexical elements and produces a syntax tree.

And then I introduce the PEG for parser code generating, and address its advantages comparing the traditional CFG:

  • Lexer rule integrated, no standalone lexer need to be implemented.
  • Simple, regular expression like syntax to start up fast.
  • No ambiguity, no reduce/shift conflict, always earlier-match-first.

Finally I have a tiny demonstration of how to generate parser with PEG, which is the basis of our interpreter.
In next post, I will walk you through the gendsl grammar in detail.
Thank you for checking this post, hope you enjoy it.

版本声明 本文转载于:https://dev.to/ccbhj/part-i-implement-an-expression-interpreter-for-building-dsl-introduce-the-peg-parser-omf?1如有侵犯,请联系[email protected]删除
最新教程 更多>
  • 为什么我会收到MySQL错误#1089:错误的前缀密钥?
    为什么我会收到MySQL错误#1089:错误的前缀密钥?
    mySQL错误#1089:错误的前缀键错误descript [#1089-不正确的前缀键在尝试在表中创建一个prefix键时会出现。前缀键旨在索引字符串列的特定前缀长度长度,可以更快地搜索这些前缀。了解prefix keys `这将在整个Movie_ID列上创建标准主键。主密钥对于唯一识别...
    编程 发布于2025-03-14
  • 如何使用不同数量列的联合数据库表?
    如何使用不同数量列的联合数据库表?
    合并列数不同的表 当尝试合并列数不同的数据库表时,可能会遇到挑战。一种直接的方法是在列数较少的表中,为缺失的列追加空值。 例如,考虑两个表,表 A 和表 B,其中表 A 的列数多于表 B。为了合并这些表,同时处理表 B 中缺失的列,请按照以下步骤操作: 确定表 B 中缺失的列,并将它们添加到表的末...
    编程 发布于2025-03-14
  • 如何使用PHP从XML文件中有效地检索属性值?
    如何使用PHP从XML文件中有效地检索属性值?
    从php $xml = simplexml_load_file($file); foreach ($xml->Var[0]->attributes() as $attributeName => $attributeValue) { echo $attributeName,...
    编程 发布于2025-03-14
  • Android如何向PHP服务器发送POST数据?
    Android如何向PHP服务器发送POST数据?
    在android apache httpclient(已弃用) httpclient httpclient = new defaulthttpclient(); httppost httppost = new httppost(“ http://www.yoursite.com/script.p...
    编程 发布于2025-03-14
  • 大批
    大批
    [2 数组是对象,因此它们在JS中也具有方法。 切片(开始):在新数组中提取部分数组,而无需突变原始数组。 令ARR = ['a','b','c','d','e']; // USECASE:提取直到索引作...
    编程 发布于2025-03-14
  • 我可以将加密从McRypt迁移到OpenSSL,并使用OpenSSL迁移MCRYPT加密数据?
    我可以将加密从McRypt迁移到OpenSSL,并使用OpenSSL迁移MCRYPT加密数据?
    将我的加密库从mcrypt升级到openssl 问题:是否可以将我的加密库从McRypt升级到OpenSSL?如果是这样,如何?答案:是的,可以将您的Encryption库从McRypt升级到OpenSSL。可以使用openssl。附加说明: [openssl_decrypt()函数要求iv参...
    编程 发布于2025-03-14
  • Java是否允许多种返回类型:仔细研究通用方法?
    Java是否允许多种返回类型:仔细研究通用方法?
    在Java中的多个返回类型:一种误解类型:在Java编程中揭示,在Java编程中,Peculiar方法签名可能会出现,可能会出现,使开发人员陷入困境,使开发人员陷入困境。 getResult(string s); ,其中foo是自定义类。该方法声明似乎拥有两种返回类型:列表和E。但这确实是如此吗...
    编程 发布于2025-03-14
  • 为什么PYTZ最初显示出意外的时区偏移?
    为什么PYTZ最初显示出意外的时区偏移?
    与pytz 最初从pytz获得特定的偏移。例如,亚洲/hong_kong最初显示一个七个小时37分钟的偏移: 差异源利用本地化将时区分配给日期,使用了适当的时区名称和偏移量。但是,直接使用DateTime构造器分配时区不允许进行正确的调整。 example pytz.timezone(...
    编程 发布于2025-03-14
  • 如何检查对象是否具有Python中的特定属性?
    如何检查对象是否具有Python中的特定属性?
    方法来确定对象属性存在寻求一种方法来验证对象中特定属性的存在。考虑以下示例,其中尝试访问不确定属性会引起错误: >>> a = someClass() >>> A.property Trackback(最近的最新电话): 文件“ ”,第1行, AttributeError: SomeClass...
    编程 发布于2025-03-14
  • 如何在Java字符串中有效替换多个子字符串?
    如何在Java字符串中有效替换多个子字符串?
    在java 中有效地替换多个substring,需要在需要替换一个字符串中的多个substring的情况下,很容易求助于重复应用字符串的刺激力量。 However, this can be inefficient for large strings or when working with nu...
    编程 发布于2025-03-14
  • 如何使用Regex在PHP中有效地提取括号内的文本
    如何使用Regex在PHP中有效地提取括号内的文本
    php:在括号内提取文本在处理括号内的文本时,找到最有效的解决方案是必不可少的。一种方法是利用PHP的字符串操作函数,如下所示: 作为替代 $ text ='忽略除此之外的一切(text)'; preg_match('#((。 &&& [Regex使用模式来搜索特...
    编程 发布于2025-03-14
  • 如何使用替换指令在GO MOD中解析模块路径差异?
    如何使用替换指令在GO MOD中解析模块路径差异?
    在使用GO MOD时,在GO MOD 中克服模块路径差异时,可能会遇到冲突,其中可能会遇到一个冲突,其中3派对软件包将另一个带有导入套件的path package the Imptioned package the Imptioned package the Imported tocted pac...
    编程 发布于2025-03-14
  • 如何使用PHP将斑点(图像)正确插入MySQL?
    如何使用PHP将斑点(图像)正确插入MySQL?
    essue VALUES('$this->image_id','file_get_contents($tmp_image)')";This code builds a string in PHP, but the function call ...
    编程 发布于2025-03-14
  • 对象拟合:IE和Edge中的封面失败,如何修复?
    对象拟合:IE和Edge中的封面失败,如何修复?
    To resolve this issue, we employ a clever CSS solution that solves the problem:position: absolute;top: 50%;left: 50%;transform: translate(-50%, -50%)...
    编程 发布于2025-03-14

免责声明: 提供的所有资源部分来自互联网,如果有侵犯您的版权或其他权益,请说明详细缘由并提供版权或权益证明然后发到邮箱:[email protected] 我们会第一时间内为您处理。

Copyright© 2022 湘ICP备2022001581号-3