\b Boundaries with Latin Characters in Go Regex
In Go's regexp package, the \b word-boundary assertion has a quirk when dealing with non-ASCII Latin characters. The issue arises when a word contains accented vowels or other special characters, because \b does not treat them as part of a word.
Consider the following example, where we want to match "vis" only as a whole word using \b:
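package main

import (
	"fmt"
	"regexp"
)

func main() {
	// \b is an ASCII-only word boundary in Go's regexp package.
	r := regexp.MustCompile(`\b(vis)\b`)
	fmt.Println(r.MatchString("re vis e")) // true
	fmt.Println(r.MatchString("revise"))   // false
	fmt.Println(r.MatchString("révisé"))   // true, although "vis" sits inside a word
}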
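Surprisingly, "révisé" does not return false as expected. Instead, it returns true. This is because Go's \b is an ASCII word boundary: it is defined in terms of \w, which covers only [0-9A-Za-z_]. The accented "é" therefore counts as a non-word character, so \b matches on both sides of "vis".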
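To see this classification directly, it can help to test single characters against \w and against the Unicode letter class \p{L}. The following is a minimal sketch; the helper names isWordChar and isLetter are made up for this illustration and are not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWord matches a string consisting of exactly one \w character.
var asciiWord = regexp.MustCompile(`^\w$`)

// unicodeLetter matches a string consisting of exactly one Unicode letter.
var unicodeLetter = regexp.MustCompile(`^\p{L}$`)

// isWordChar (hypothetical helper) reports whether s is a single \w character.
func isWordChar(s string) bool { return asciiWord.MatchString(s) }

// isLetter (hypothetical helper) reports whether s is a single Unicode letter.
func isLetter(s string) bool { return unicodeLetter.MatchString(s) }

func main() {
	// "\u00e9" is the precomposed form of é.
	for _, s := range []string{"v", "\u00e9"} {
		fmt.Printf("%q: \\w=%v \\p{L}=%v\n", s, isWordChar(s), isLetter(s))
	}
}
```

For "v" both checks report true, while for "é" only the \p{L} check does, which is why \b sees a boundary next to an accented letter.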
To work around this, we can replace \b with explicit alternatives that delimit the word by whitespace or by the start or end of the string. Here's an example:
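package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The word must be preceded by start of string or whitespace,
	// and followed by whitespace or end of string.
	r := regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)
	fmt.Println(r.MatchString("vis"))
	fmt.Println(r.MatchString("re vis e"))
	fmt.Println(r.MatchString("revise"))
	fmt.Println(r.MatchString("révisé"))
}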
With this modification, the regex recognizes the start and end of a word using a combination of start of string (\A), end of string (\z), and whitespace (\s). The program reports true for "vis" and "re vis e", and false for "revise" and "révisé":
true
true
false
false
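One detail worth keeping in mind is that the delimiters in this pattern are consumed, so the overall match can include the surrounding whitespace. The sketch below shows one way to read the word from the capture group instead; extractWord is a hypothetical helper name chosen for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordRe is the whitespace-delimited pattern from the example above.
var wordRe = regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)

// extractWord (hypothetical helper) returns the text of capture group 1
// and whether the pattern matched at all.
func extractWord(s string) (string, bool) {
	m := wordRe.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	m := wordRe.FindStringSubmatch("re vis e")
	// m[0] is the whole match including whitespace; m[1] is the word only.
	fmt.Printf("%q %q\n", m[0], m[1]) // prints " vis " "vis"

	_, ok := extractWord("révisé")
	fmt.Println(ok) // prints false
}
```

Because the whitespace is part of the match, two adjacent occurrences separated by a single space cannot both be found by FindAllString-style calls; this sketch only covers the single-match case.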
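This technique gives accurate whole-word matching for whitespace-delimited text, even when the neighbouring characters are accented Latin letters. It treats only whitespace as a delimiter, so a word followed directly by punctuation, as in "vis,", does not match.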
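If punctuation should also count as a delimiter, one possible variation is to define the boundary as any character that is not a Unicode letter, digit, or underscore. This is a sketch under that assumption rather than a drop-in replacement for \b; the name matchWord is hypothetical, and combining marks (\p{M}) are deliberately left out of the word class here.

```go
package main

import (
	"fmt"
	"regexp"
)

// unicodeWordRe requires "vis" to be surrounded by the start or end of the
// string, or by a character that is not a Unicode letter, digit, or underscore.
var unicodeWordRe = regexp.MustCompile(`(?:\A|[^\p{L}\p{N}_])(vis)(?:[^\p{L}\p{N}_]|\z)`)

// matchWord (hypothetical helper) reports whether s contains "vis" as a whole word.
func matchWord(s string) bool { return unicodeWordRe.MatchString(s) }

func main() {
	inputs := []string{"vis", "re vis e", "(vis),", "revise", "r\u00e9vis\u00e9"}
	for _, s := range inputs {
		fmt.Printf("%q: %v\n", s, matchWord(s))
	}
}
```

With these inputs, the first three report true and the last two report false. As with the whitespace version, the delimiter characters are consumed by the match, so the word itself should be read from the capture group.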