」工欲善其事,必先利其器。「—孔子《論語.錄靈公》
首頁 > 程式設計 > 第一部分:實作用於建立 DSL 的表達式解釋器 - 介紹 PEG 解析器

第一部分:實作用於建立 DSL 的表達式解釋器 - 介紹 PEG 解析器

發佈於2024-08-28
瀏覽:245

Part I: Implement an expression interpreter for building DSL - Introduce the PEG parser

In the last post I have introduced you the WHY and HOW I start the this project and WHAT the DSL looks like in the end. Starting from this post I will share the implementation of the whole project with you.

Generally, when we implement a language, the first thing comes in our mind is lexer and then parser. So in this post I will introduce to you how I implement my DSL with specified detail but less conception so that it won't be too confused I hope.

What's lexer?

In general, lexer is used for Lexical Analysis or tokenization if you will. Let's take the clause "We will rock you!" (the famous rock and roll music from Queen) as an example. After we tokenize it following the grammar rule of English, it shall output a list [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation("!")]. So a lexer is mostly used for classify texts into certain types by its lexical meaning. This is significant for us since a syntax is actually consist of lexical elements instead of characters or words.

Usually we implement a lexer with some code generator that can parse Regular Express like Ragel, nex and so on. But I think you will be surprised how easy it is to implement a lexer after checking Lexical Scanning in Go from Rob Pike. He had introduced an interesting pattern to implement a finite-state machine which I think is the core of an lexer.

How about the parser?

So how about a parser? What is it used for? Basically, a parser is used to identify a list of lexical elements with the specified pattern which we also called it grammar. Take the example "We will rock you!" we introduced before, it produces a sequence of [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation("!")] which matches the pattern of 'Future tense' grammar in English. So this is exactly what a parser do and is so called 'syntax analysis'.

Let's take another example in a more computer way, how about an expression like 1 2 * 3 ? It is obvious that they will be translated into [Number(1), Operator( ), Number(2), Operator(*), Number(3)] by the lexer, but what this sequence will be translated into by a parser with a common mathematical expression grammar? In general, a lexical elements sequence will got translated into an Abstract Syntax Tree(or AST in short) by parser like this:

      *
     / \
        3
   /  \
  1    2

With an abstract syntax tree we can analyze the syntax in the correct order according to the rule of our grammar.

Let's implement a parser

Now we are gonna implement an parser on our own. Well not entirely on our own, we still need some tool to help our to generate code for our parser since it is hard to implement it correctly and a hand-writing parser might be difficult to maintain, even if you do, the performance might be poor.

Luckily, there are many mature parser generative tools that helps us to generate golang code with a grammar definition file. The first choice came in my mind is go-yacc which is the official tool to generate go code for parser. I used to use it to write a SQL analyzer and the it wasn't a pleasure because it:

  • requires an external lexer.
  • lack of documentation.
  • union definition and value type declaration are both quite confusing.

"Why not try something new?" I think, so there we go, with the amazing tool peg I was able to implement both of the lexer and parser in one single grammar file, one single interface. Let's take a closer look at the peg.

A closer look at PEG

PEG stands for Parsing Expression Grammar introduced by Bryan Ford in 2004, which is an alternative to the traditional Context Free Grammar used for describing and expressing programming language syntax and protocol.

For the last decades, we have been using parser generative tool like yacc, bison that supports CFG to generate parser code, and if you have used them before you might find it difficult to avoid ambiguity and integrate them with the lexer or the regular expression. In fact, the syntax of a programming language is not just patterns of lexical elements but also the rules of those lexical elements which somehow the CFG is missing so when we use tools like yacc we will have to implement lexer by our self. Further more, to avoid ambiguity(like the precedence between plus and multiply, check this out) in CFG we have to define the precedence for each operator. All of these crucial fact makes it unnecessarily difficult to develop a parser.

But thanks to Bryan Ford, now we have another good choice, the PEG that allows us to define the lexical and syntax rule all in one single file with a tight DSL and resolve ambiguity in an elegant and simple way. Let me show you how easy it can be done with peg.

Example and code come first

I gonna take examples from my gendsl library which implements a lisp-like syntax(you can check it out here). Here is a simple snippet that can parse hex and decimal number literals in the golang style:

package playground

type parser Peg {
}

Script          



The first line package gendsl is package declaration which decides which package the generated golang file belongs to. The following type declaration type parser Peg {} used to define the parser type, which we will use it later for evaluation but you can ignore it for now.

After the parser type declaration we can start to define your syntax rule till the end. This is different from the workflow I used to work with with yacc when I have to define a union type and a lot of token types before I can actually define my grammar, which could be really confusing. Anyway, let's take a quick look at the grammar definition.

The very first rule

If you have worked with CFG before you might find the definition DSL syntax quite familiar. The right hand side of the '

Let's pay attention to the first pattern rule here since the first rule is always to entry point of the parser. The entry point Script here is consist of two parts, one is a rule refers to Value which is consist of a sequence of specified characters(we will get back to this later), the other one EOF is kind of interesting. Let's jump to the next rule to find the pattern of EOF. As you can see, EOF is consist of !.. What does !. mean? The !actually means NOT, and . means any character, so !. means NOTHING AT ALL or End Of File if you will. As a result whenever the parser find there is no character to read, it will stop here and treat it as an dummy rule call EOF which might produces the rule Script. This is quite a common pattern for PEG.

More about the rule syntax

So much like the regular expression(RE), the syntax of PEG is simple:

  • . stands for any character.
  • character set like [a-z] is also supported.
  • X matches one character if X is a character or a pattern when X is the name of an rule.
  • X? matches one character or pattern or none, while X* matches zero or more and 'X ' matches one or more.
  • X \ Y matches X or Y where X and Y could be characters, patterns or rule name.

Take the rule of DecimalNumeral as an example. The first part [1-9] means the start of an DecimalNumeral must be one of a digit ranging from 1 to 9, ([_]* [0-9])* means starting from the second position, all character, if there is any, must all be digit(0-9) that has might have no '_' or more than one '_' as its prefix so it could match string like "10_2_3". Otherwise, indicated by the operator \, it could also just be one single character '0' which means 0 obviously .

Resolving ambiguity

I'd like to spend more time to explain the or operator \, since it is quite important as the solution to the ambiguity. The PEG will always try to match the first pattern and then the second, the third and so on til it finds one matched, which is considered as earliest-match-first. For example, a string "ab" will never be able to match the grammar G

There is no much syntax left, you can explore them yourself in the pointlander/peg README or peg doc.

Let's have a try

Now that we already have a simple syntax rule prepared above, though it is not the whole grammar for the gendsl project but it can still parse some numbers. Anyway let's generate some code and see if it works as we expect.

Preparation

First we have to install the peg binary tool for code generate following this guide, then we gonna setup our workspace directory for playing:

> mkdir peg_playground && peg_playground 
> go mod init peg_playground 
> touch grammar.peg

Paste the grammar we have before into the peg_playground/grammar.peg, now we should be able to genreate the code using the generate tool but why not make a Makefile in peg_playground/makefile

GO := go

.SUFFIXES: .peg .go

grammar.go: grammar.peg
    peg -switch -inline -strict -output ./$@ $



Generate and test the parser

Now that we have everything ready, let's generate the code for parser:

make grammar.go

After running the command, you should see a generated grammar.go in the workspace directory. Let's write a function to parse a string and access our parser:

// peg_playground/parser.go
package playground

func PrintAST(script string) error {
    parser := &parser{
        Buffer: script,
        Pretty: true,
    }

    if err := parser.Init(); err != nil {
        return err
    }
    if err := parser.Parse(); err != nil {
        return err
    }

    parser.PrintSyntaxTree()
    return nil
}

The snippet here is pretty simple, it initializes the parser, parses the script we pass to it and print the syntax tree in final. Let's write an unit test to see if it works:

// peg_playground/parser_test.go
package playground

import (
    "testing"
)

func TestPrintTree(t *testing.T) {
    if err := PrintAST(`0x123`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")

    if err := PrintAST(`10_2_3`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")
}

The test function TestPrintTree calls the PrintAST and check the error. Let's run it now and see what it gonna print:

go test . -v

Now we should see the whole syntax tree in the output:

=== RUN   TestPrintTree
Script "0x123"
 Value "0x123"
  IntegerLiteral "0x123"
   HexNumeral "123"
    HexDigit "1"
    HexDigit "2"
    HexDigit "3"
    parser_test.go:11: -----------------------------------------------------
Script "10_2_3"
 Value "10_2_3"
  IntegerLiteral "10_2_3"
   DecimalNumeral "10_2_3"
    parser_test.go:16: -----------------------------------------------------
--- PASS: TestPrintTree (0.00s)
PASS
ok      playground      0.649s

It looks great, right? Everything works as we expected. No syntax error thrown and it prints every rule matched and the string it matches in a format of tree, which could be really useful when debugging.

Wrap it up

In this post, I have introduced you the two basic but significant parts of interpreter programming language:

  • Lexer, for lexical analysis that transforms a string into a sequence of lexical elements.
  • Parser, for syntax analysis that identify the the pattern (so called grammar) in the lexical elements and produces a syntax tree.

And then I introduce the PEG for parser code generating, and address its advantages comparing the traditional CFG:

  • Lexer rule integrated, no standalone lexer need to be implemented.
  • Simple, regular expression like syntax to start up fast.
  • No ambiguity, no reduce/shift conflict, always earlier-match-first.

Finally I have a tiny demonstration of how to generate parser with PEG, which is the basis of our interpreter.
In next post, I will walk you through the gendsl grammar in detail.
Thank you for checking this post, hope you enjoy it.

版本聲明 本文轉載於:https://dev.to/ccbhj/part-i-implement-an-expression-interpreter-for-building-dsl-introduce-the-peg-parser-omf?1如有侵犯,請洽study_golang @163.com刪除
最新教學 更多>
  • React:了解 React 的事件系統
    React:了解 React 的事件系統
    Overview of React's Event System What is a Synthetic Event? Synthetic events are an event-handling mechanism designed by React to ach...
    程式設計 發佈於2024-11-05
  • 為什麼在使用 Multipart/Form-Data POST 請求時會收到 301 Moved Permanently 錯誤?
    為什麼在使用 Multipart/Form-Data POST 請求時會收到 301 Moved Permanently 錯誤?
    Multipart/Form-Data POSTsMultipart/Form-Data POSTs嘗試使用multipart/form-data POST 資料時,可能會出現類似所提供的錯誤訊息遭遇。理解問題需要檢視問題的構成。遇到的錯誤是 301 Moved Permanently 回應,表示資...
    程式設計 發佈於2024-11-05
  • 如何使用日期和時間物件來確定 PHP 中的時間邊界?
    如何使用日期和時間物件來確定 PHP 中的時間邊界?
    確定PHP 中的時間邊界在此編程場景中,我們的任務是確定給定時間是否在預先定義的範圍內。具體來說,我們得到三個時間字串:當前時間、日出和日落。我們的目標是確定當前時間是否位於日出和日落的邊界時間之間。 為了應對這個挑戰,我們將使用 DateTime 類別。這個類別使我們能夠表示和操作日期和時間。我們...
    程式設計 發佈於2024-11-05
  • 如何使用 CSS 變換比例修復 jQuery 拖曳/調整大小問題?
    如何使用 CSS 變換比例修復 jQuery 拖曳/調整大小問題?
    jQuery 使用CSS 轉換縮放拖曳/調整大小問題: 當應用CSS 轉換時,特別是變換:矩陣(0.5, 0, 0, 0.5, 0, 0);,對於一個div 並在子元素上使用jQuery 的draggable() 和resizing() 插件,jQuery 所做的更改變得與滑鼠位置「不同步”。 解決...
    程式設計 發佈於2024-11-05
  • 如何修復 TensorFlow 中的「ValueError:無法將 NumPy 陣列轉換為張量(不支援的物件類型浮點)」錯誤?
    如何修復 TensorFlow 中的「ValueError:無法將 NumPy 陣列轉換為張量(不支援的物件類型浮點)」錯誤?
    TensorFlow:解決「ValueError: Failed to Convert NumPy Array to Tensor (Unsupported Object Type Float)」工作時遇到的常見錯誤TensorFlow 的錯誤是「ValueError:無法將NumPy 陣列轉換為T...
    程式設計 發佈於2024-11-05
  • 如何有效率判斷本機儲存項目是否存在?
    如何有效率判斷本機儲存項目是否存在?
    確定本地儲存專案是否存在使用 Web 儲存時,在存取或修改特定專案之前驗證它們是否存在至關重要。在本例中,我們想要確定 localStorage 中是否設定了特定項目。 當前方法檢查項目是否存在的當前方法似乎是:if (!(localStorage.getItem("infiniteScr...
    程式設計 發佈於2024-11-05
  • Java 中的原子是什麼?了解 Java 中的原子性和線程安全
    Java 中的原子是什麼?了解 Java 中的原子性和線程安全
    1. Java 原子簡介 1.1 Java 中什麼是原子? 在Java中,java.util.concurrent.atomic套件提供了一組支援對單一變數進行無鎖定線程安全程式設計的類別。這些類別統稱為原子變數。最常使用的原子類別包括 AtomicInteger ...
    程式設計 發佈於2024-11-05
  • 前端/後端主要設定檔
    前端/後端主要設定檔
    從 DevOps 的角度來看,了解 Java 和 Node.js(後端和前端)程式碼庫中的設定檔對於管理建置流程、部署和環境設定至關重要。以下是在 Java 和 Node.js 應用程式中需要注意的設定檔的完整清單: Java 應用程式 後端 pom.xml (Maven): 管理依...
    程式設計 發佈於2024-11-05
  • Python 中出現「意外縮排」錯誤的原因以及如何解決?
    Python 中出現「意外縮排」錯誤的原因以及如何解決?
    Python 中意外縮排的意義是什麼? 在 Python 程式設計領域,精心製作的縮排起著至關重要的作用定義程式碼的結構和流程。當這個縮排不經意地被打亂時,就會出現「unexpected indent」錯誤,提示需要立即修正。 錯誤訊息背後:Unexpected Indent本質Python 的語法...
    程式設計 發佈於2024-11-05
  • 在 Node.js 中什麼時候應該使用 `setImmediate` 和 `process.nextTick`?
    在 Node.js 中什麼時候應該使用 `setImmediate` 和 `process.nextTick`?
    了解setImmediate 和nextTick 之間的差異了解setImmediate 和nextTick 之間的差異Node.js 版本0.10 引入了setImmediate,這是一個旨在補充process.nextjs 版本的新API。這兩個函數都提供了非同步執行回呼的方法,但它們具有控制其...
    程式設計 發佈於2024-11-05
  • jQuery中如何有效率地取得隱藏元素的高度?
    jQuery中如何有效率地取得隱藏元素的高度?
    在 jQuery 中獲取隱藏元素的高度處理隱藏元素時,檢索其高度可能具有挑戰性。暫時顯示元素以測量其高度然後再次隱藏它的傳統方法似乎效率低下。有沒有更優化的解決方案? jQuery 1.4.2 方法這是一個使用 jQuery 1.4.2 的範例:$select.show(); optionHeigh...
    程式設計 發佈於2024-11-05
  • 為什麼我不能在 Go Struct 標籤中使用變數?
    為什麼我不能在 Go Struct 標籤中使用變數?
    在Go 結構體標籤中使用變數在Go 中,結構體標籤用於指定有關結構體中字段的元數據。雖然可以使用字串文字定義標籤,但嘗試在其位置使用變數會導致錯誤。 無效用法:const ( TYPE = "type" ) type Shape struct { Type str...
    程式設計 發佈於2024-11-05
  • Qopy:身為開發人員我最喜歡的剪貼簿管理器
    Qopy:身為開發人員我最喜歡的剪貼簿管理器
    身為開發人員,我一直在尋找可以讓我的工作流程更順暢、更有效率的工具。最近,我偶然發現了 Qopy,一個可以在 Linux 和 Windows 上運行的開源剪貼簿管理器。 什麼是Qopy? Qopy 是一個簡單的剪貼簿管理器,旨在改善標準剪貼簿體驗。它的設計宗旨是用戶友好、可靠且快速...
    程式設計 發佈於2024-11-05
  • 為什麼我的按鈕上的懸停效果不起作用?
    為什麼我的按鈕上的懸停效果不起作用?
    更改懸停時的按鈕顏色:替代解決方案嘗試更改懸停時按鈕的顏色時,如果出現以下情況,可能會令人沮喪該解決方案未能產生預期的效果。考慮提供的範例程式碼:a.button { ... } a.button a:hover{ background: #383; }此解決方案嘗試在連結懸停在「按...
    程式設計 發佈於2024-11-05
  • 僅使用 Python 建構前端
    僅使用 Python 建構前端
    對於專注於後端的開發人員來說,前端開發可能是一項艱鉅的、甚至是噩夢般的任務。在我職業生涯的早期,前端和後端之間的界線是模糊的,每個人都被期望能夠處理這兩者。 CSS,尤其是,是一場持續不斷的鬥爭;這感覺像是一個不可能的任務。 雖然我喜歡前端工作,但 CSS 對我來說仍然是一個複雜的挑戰,特別是因為...
    程式設計 發佈於2024-11-05

免責聲明: 提供的所有資源部分來自互聯網,如果有侵犯您的版權或其他權益,請說明詳細緣由並提供版權或權益證明然後發到郵箱:[email protected] 我們會在第一時間內為您處理。

Copyright© 2022 湘ICP备2022001581号-3