”工欲善其事,必先利其器。“—孔子《论语.录灵公》
首页 > 编程 > 第一部分:实现用于构建 DSL 的表达式解释器 - 介绍 PEG 解析器

第一部分:实现用于构建 DSL 的表达式解释器 - 介绍 PEG 解析器

发布于2024-08-28
浏览:937

Part I: Implement an expression interpreter for building DSL - Introduce the PEG parser

In the last post I have introduced you the WHY and HOW I start the this project and WHAT the DSL looks like in the end. Starting from this post I will share the implementation of the whole project with you.

Generally, when we implement a language, the first thing comes in our mind is lexer and then parser. So in this post I will introduce to you how I implement my DSL with specified detail but less conception so that it won't be too confused I hope.

What's lexer?

In general, lexer is used for Lexical Analysis or tokenization if you will. Let's take the clause "We will rock you!" (the famous rock and roll music from Queen) as an example. After we tokenize it following the grammar rule of English, it shall output a list [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation("!")]. So a lexer is mostly used for classify texts into certain types by its lexical meaning. This is significant for us since a syntax is actually consist of lexical elements instead of characters or words.

Usually we implement a lexer with some code generator that can parse Regular Express like Ragel, nex and so on. But I think you will be surprised how easy it is to implement a lexer after checking Lexical Scanning in Go from Rob Pike. He had introduced an interesting pattern to implement a finite-state machine which I think is the core of an lexer.

How about the parser?

So how about a parser? What is it used for? Basically, a parser is used to identify a list of lexical elements with the specified pattern which we also called it grammar. Take the example "We will rock you!" we introduced before, it produces a sequence of [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation("!")] which matches the pattern of 'Future tense' grammar in English. So this is exactly what a parser do and is so called 'syntax analysis'.

Let's take another example in a more computer way, how about an expression like 1 2 * 3 ? It is obvious that they will be translated into [Number(1), Operator( ), Number(2), Operator(*), Number(3)] by the lexer, but what this sequence will be translated into by a parser with a common mathematical expression grammar? In general, a lexical elements sequence will got translated into an Abstract Syntax Tree(or AST in short) by parser like this:

      *
     / \
        3
   /  \
  1    2

With an abstract syntax tree we can analyze the syntax in the correct order according to the rule of our grammar.

Let's implement a parser

Now we are gonna implement an parser on our own. Well not entirely on our own, we still need some tool to help our to generate code for our parser since it is hard to implement it correctly and a hand-writing parser might be difficult to maintain, even if you do, the performance might be poor.

Luckily, there are many mature parser generative tools that helps us to generate golang code with a grammar definition file. The first choice came in my mind is go-yacc which is the official tool to generate go code for parser. I used to use it to write a SQL analyzer and the it wasn't a pleasure because it:

  • requires an external lexer.
  • lack of documentation.
  • union definition and value type declaration are both quite confusing.

"Why not try something new?" I think, so there we go, with the amazing tool peg I was able to implement both of the lexer and parser in one single grammar file, one single interface. Let's take a closer look at the peg.

A closer look at PEG

PEG stands for Parsing Expression Grammar introduced by Bryan Ford in 2004, which is an alternative to the traditional Context Free Grammar used for describing and expressing programming language syntax and protocol.

For the last decades, we have been using parser generative tool like yacc, bison that supports CFG to generate parser code, and if you have used them before you might find it difficult to avoid ambiguity and integrate them with the lexer or the regular expression. In fact, the syntax of a programming language is not just patterns of lexical elements but also the rules of those lexical elements which somehow the CFG is missing so when we use tools like yacc we will have to implement lexer by our self. Further more, to avoid ambiguity(like the precedence between plus and multiply, check this out) in CFG we have to define the precedence for each operator. All of these crucial fact makes it unnecessarily difficult to develop a parser.

But thanks to Bryan Ford, now we have another good choice, the PEG that allows us to define the lexical and syntax rule all in one single file with a tight DSL and resolve ambiguity in an elegant and simple way. Let me show you how easy it can be done with peg.

Example and code come first

I gonna take examples from my gendsl library which implements a lisp-like syntax(you can check it out here). Here is a simple snippet that can parse hex and decimal number literals in the golang style:

package playground

type parser Peg {
}

Script          



The first line package gendsl is package declaration which decides which package the generated golang file belongs to. The following type declaration type parser Peg {} used to define the parser type, which we will use it later for evaluation but you can ignore it for now.

After the parser type declaration we can start to define your syntax rule till the end. This is different from the workflow I used to work with with yacc when I have to define a union type and a lot of token types before I can actually define my grammar, which could be really confusing. Anyway, let's take a quick look at the grammar definition.

The very first rule

If you have worked with CFG before you might find the definition DSL syntax quite familiar. The right hand side of the '

Let's pay attention to the first pattern rule here since the first rule is always to entry point of the parser. The entry point Script here is consist of two parts, one is a rule refers to Value which is consist of a sequence of specified characters(we will get back to this later), the other one EOF is kind of interesting. Let's jump to the next rule to find the pattern of EOF. As you can see, EOF is consist of !.. What does !. mean? The !actually means NOT, and . means any character, so !. means NOTHING AT ALL or End Of File if you will. As a result whenever the parser find there is no character to read, it will stop here and treat it as an dummy rule call EOF which might produces the rule Script. This is quite a common pattern for PEG.

More about the rule syntax

So much like the regular expression(RE), the syntax of PEG is simple:

  • . stands for any character.
  • character set like [a-z] is also supported.
  • X matches one character if X is a character or a pattern when X is the name of an rule.
  • X? matches one character or pattern or none, while X* matches zero or more and 'X ' matches one or more.
  • X \ Y matches X or Y where X and Y could be characters, patterns or rule name.

Take the rule of DecimalNumeral as an example. The first part [1-9] means the start of an DecimalNumeral must be one of a digit ranging from 1 to 9, ([_]* [0-9])* means starting from the second position, all character, if there is any, must all be digit(0-9) that has might have no '_' or more than one '_' as its prefix so it could match string like "10_2_3". Otherwise, indicated by the operator \, it could also just be one single character '0' which means 0 obviously .

Resolving ambiguity

I'd like to spend more time to explain the or operator \, since it is quite important as the solution to the ambiguity. The PEG will always try to match the first pattern and then the second, the third and so on til it finds one matched, which is considered as earliest-match-first. For example, a string "ab" will never be able to match the grammar G

There is no much syntax left, you can explore them yourself in the pointlander/peg README or peg doc.

Let's have a try

Now that we already have a simple syntax rule prepared above, though it is not the whole grammar for the gendsl project but it can still parse some numbers. Anyway let's generate some code and see if it works as we expect.

Preparation

First we have to install the peg binary tool for code generate following this guide, then we gonna setup our workspace directory for playing:

> mkdir peg_playground && peg_playground 
> go mod init peg_playground 
> touch grammar.peg

Paste the grammar we have before into the peg_playground/grammar.peg, now we should be able to genreate the code using the generate tool but why not make a Makefile in peg_playground/makefile

GO := go

.SUFFIXES: .peg .go

grammar.go: grammar.peg
    peg -switch -inline -strict -output ./$@ $



Generate and test the parser

Now that we have everything ready, let's generate the code for parser:

make grammar.go

After running the command, you should see a generated grammar.go in the workspace directory. Let's write a function to parse a string and access our parser:

// peg_playground/parser.go
package playground

func PrintAST(script string) error {
    parser := &parser{
        Buffer: script,
        Pretty: true,
    }

    if err := parser.Init(); err != nil {
        return err
    }
    if err := parser.Parse(); err != nil {
        return err
    }

    parser.PrintSyntaxTree()
    return nil
}

The snippet here is pretty simple, it initializes the parser, parses the script we pass to it and print the syntax tree in final. Let's write an unit test to see if it works:

// peg_playground/parser_test.go
package playground

import (
    "testing"
)

func TestPrintTree(t *testing.T) {
    if err := PrintAST(`0x123`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")

    if err := PrintAST(`10_2_3`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")
}

The test function TestPrintTree calls the PrintAST and check the error. Let's run it now and see what it gonna print:

go test . -v

Now we should see the whole syntax tree in the output:

=== RUN   TestPrintTree
Script "0x123"
 Value "0x123"
  IntegerLiteral "0x123"
   HexNumeral "123"
    HexDigit "1"
    HexDigit "2"
    HexDigit "3"
    parser_test.go:11: -----------------------------------------------------
Script "10_2_3"
 Value "10_2_3"
  IntegerLiteral "10_2_3"
   DecimalNumeral "10_2_3"
    parser_test.go:16: -----------------------------------------------------
--- PASS: TestPrintTree (0.00s)
PASS
ok      playground      0.649s

It looks great, right? Everything works as we expected. No syntax error thrown and it prints every rule matched and the string it matches in a format of tree, which could be really useful when debugging.

Wrap it up

In this post, I have introduced you the two basic but significant parts of interpreter programming language:

  • Lexer, for lexical analysis that transforms a string into a sequence of lexical elements.
  • Parser, for syntax analysis that identify the the pattern (so called grammar) in the lexical elements and produces a syntax tree.

And then I introduce the PEG for parser code generating, and address its advantages comparing the traditional CFG:

  • Lexer rule integrated, no standalone lexer need to be implemented.
  • Simple, regular expression like syntax to start up fast.
  • No ambiguity, no reduce/shift conflict, always earlier-match-first.

Finally I have a tiny demonstration of how to generate parser with PEG, which is the basis of our interpreter.
In next post, I will walk you through the gendsl grammar in detail.
Thank you for checking this post, hope you enjoy it.

版本声明 本文转载于:https://dev.to/ccbhj/part-i-implement-an-expression-interpreter-for-building-dsl-introduce-the-peg-parser-omf?1如有侵犯,请联系[email protected]删除
最新教程 更多>
  • 如何修复 Matplotlib 中的“无显示名称且无 $DISPLAY 环境变量”错误?
    如何修复 Matplotlib 中的“无显示名称且无 $DISPLAY 环境变量”错误?
    "_tkinter.TclError: no display name and no $DISPLAY 环境变量"使用 Matplotlib 运行 Python 脚本时通常会发生此错误在没有图形显示的服务器上。 Matplotlib 依赖后端来渲染绘图,默认情况下,它选择 Xwi...
    编程 发布于2024-11-05
  • 您的第一个使用 Node.js 的后端应用程序
    您的第一个使用 Node.js 的后端应用程序
    您是否正在学习 Web 开发并对如何启动 Node.js 项目感到困惑?别担心,我有你!我将指导您只需 5 个步骤即可使用 Node.js 和 Express.js 创建您的第一个后端。 ️5个关键步骤: 第 1 步:设置项目 第 2 步:整理文件夹 第3步:创建server.js文...
    编程 发布于2024-11-05
  • 跨域场景下CORS何时使用预检请求?
    跨域场景下CORS何时使用预检请求?
    CORS:了解跨域请求的“预检”请求跨域资源共享 (CORS) 在制作 HTTP 时提出了挑战跨域请求。为了解决这些限制,引入了预检请求作为解决方法。预检请求说明预检请求是先于实际请求(例如 GET 或 POST)的 OPTIONS 请求)并用于与服务器协商请求的权限。这些请求包括两个附加标头:Ac...
    编程 发布于2024-11-05
  • 如何使用 PHP 的 glob() 函数按扩展名过滤文件?
    如何使用 PHP 的 glob() 函数按扩展名过滤文件?
    在 PHP 中按扩展名过滤文件使用目录时,通常需要根据扩展名检索特定文件。 PHP 提供了一种使用 glob() 函数来完成此任务的有效方法。要按扩展名过滤文件,请使用语法:$files = glob('/path/to/directory/*.extension');例如,要检索目录 /path/...
    编程 发布于2024-11-05
  • 理解 JavaScript 中的 Promise 和 Promise Chaining
    理解 JavaScript 中的 Promise 和 Promise Chaining
    什么是承诺? JavaScript 中的 Promise 就像你对未来做某事的“承诺”。它是一个对象,表示异步任务的最终完成(或失败)及其结果值。简而言之,Promise 充当尚不可用但将来可用的值的占位符。 承诺国家 Promise 可以存在于以下三种状态之一: ...
    编程 发布于2024-11-05
  • 安全分配
    安全分配
    今天,关于 JavaScript 中安全赋值运算符 (?=) 的新提案引起了热烈讨论。我喜欢 JavaScript 随着时间的推移而不断改进,但这也是我最近在一些情况下遇到的问题。我应该将快速示例实现作为函数,对吧? 如果您还没有阅读该提案,以下是其建议: const [error, value] ...
    编程 发布于2024-11-05
  • 创建队列接口
    创建队列接口
    创建字符队列的接口。 需要开发的三个实现: 固定大小的线性队列。 循环队列(复用数组空间)。 动态队列(根据需要增长)。 1 创建一个名为 ICharQ.java 的文件 // 字符队列接口。 公共接口 ICharQ { // 向队列中插入一个字符。 void put(char ch); ...
    编程 发布于2024-11-05
  • Pip 的可编辑模式何时对本地 Python 包开发有用?
    Pip 的可编辑模式何时对本地 Python 包开发有用?
    使用 Pip 在 Python 中利用可编辑模式进行本地包开发在 Python 的包管理生态系统中,Pip 拥有“-e”(或'--editable') 特定场景的选项。什么时候使用这个选项比较有利?答案在于可编辑模式的实现,官方文档中有详细说明:“从本地以可编辑模式安装项目(即 se...
    编程 发布于2024-11-05
  • 当您在浏览器中输入 URL 时会发生什么?
    当您在浏览器中输入 URL 时会发生什么?
    您是否想知道当您在浏览器中输入 URL 并按 Enter 键时幕后会发生什么?该过程比您想象的更加复杂,涉及多个步骤,这些步骤无缝地协同工作以提供您请求的网页。在本文中,我们将探讨从输入 URL 到查看完全加载的网页的整个过程,阐明使这一切成为可能的技术和协议。 第 1 步:输入 U...
    编程 发布于2024-11-05
  • 如何有效管理大量小HashMap对象的“OutOfMemoryError:超出GC开销限制”?
    如何有效管理大量小HashMap对象的“OutOfMemoryError:超出GC开销限制”?
    OutOfMemoryError: Handling Garbage Collection Overhead在Java中,当过多时会出现“java.lang.OutOfMemoryError: GC Overhead limit allowed”错误根据 Sun 的文档,时间花费在垃圾收集上。要解决...
    编程 发布于2024-11-05
  • 为什么在 Python 列表初始化中使用 [[]] * n 时列表会链接在一起?
    为什么在 Python 列表初始化中使用 [[]] * n 时列表会链接在一起?
    使用 [[]] * n 进行列表初始化时的列表链接问题使用 [[]] 初始化列表列表时 n,程序员经常会遇到一个意想不到的问题,即列表似乎链接在一起。出现这种情况是因为 [x]n 语法创建对同一基础列表对象的多个引用,而不是创建不同的列表实例。为了说明该问题,请考虑以下代码:x = [[]] * ...
    编程 发布于2024-11-05
  • Python 变得简单:从初学者到高级 |博客
    Python 变得简单:从初学者到高级 |博客
    Python Course Code Examples This is a Documentation of the python code i used and created , for learning python. Its easy to understand and L...
    编程 发布于2024-11-05
  • 简化 TypeScript 中的类型缩小和防护
    简化 TypeScript 中的类型缩小和防护
    Introduction to Narrowing Concept Typescript documentation explains this topic really well. I am not going to copy and paste the same descrip...
    编程 发布于2024-11-05
  • 何时应该使用 session_unset() 而不是 session_destroy() ,反之亦然?
    何时应该使用 session_unset() 而不是 session_destroy() ,反之亦然?
    理解 PHP 中 session_unset() 和 session_destroy() 的区别PHP 函数 session_unset() 和 session_destroy() 有不同的用途管理会话数据。尽管它们在清除会话变量方面有明显相似之处,但它们具有不同的效果。session_unset(...
    编程 发布于2024-11-05
  • 如何选择在 C++ 中解析 INI 文件的最佳方法?
    如何选择在 C++ 中解析 INI 文件的最佳方法?
    在 C 中解析 INI 文件:各种方法指南在 C 中处理初始化 (INI) 文件时,开发人员经常遇到有效解析这些文件以提取所需信息的挑战。本文探讨了用 C 解析 INI 文件的不同方法,讨论了它们的优点和注意事项。本机 Windows API 函数一种方法是利用 Windows API 函数INI ...
    编程 发布于2024-11-05

免责声明: 提供的所有资源部分来自互联网,如果有侵犯您的版权或其他权益,请说明详细缘由并提供版权或权益证明然后发到邮箱:[email protected] 我们会第一时间内为您处理。

Copyright© 2022 湘ICP备2022001581号-3