Converting Surrogate Pairs to Normal String in Python
This question seeks a method to transform a Python Unicode string containing surrogate pairs into a standard string representation. The goal is to obtain an intelligible Unicode character or a standardized hexadecimal format.
The provided code snippet presents a Python string that includes a surrogate pair representing an emoji:
emoji = "This is \ud83d\ude4f, an emoji."
To resolve the issue, it is crucial to distinguish between literal surrogate pair strings in a JSON file on disk (six characters) and single-character surrogate pair strings in memory (one character).
If the string is a single-character surrogate pair found in Python source code (such as the example provided), it indicates a potential bug upstream. If this is encountered and cannot be resolved, the surrogatepass error handler can be employed:
"\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
This will output the corresponding Unicode character, represented as a question mark (?):
'?'
In the case of literal surrogate pair strings in a JSON file on disk, the surrogate pair should not be present after loading the JSON data:
ascii(json.loads(r'"\ud83d\ude4f"'))
This will output the standardized hexadecimal format for the Unicode character:
'\U0001f64f'
Understanding this distinction is essential for handling surrogate pairs in Python and converting them to a usable format.
Disclaimer: All resources provided are partly from the Internet. If there is any infringement of your copyright or other rights and interests, please explain the detailed reasons and provide proof of copyright or rights and interests and then send it to the email: [email protected] We will handle it for you as soon as possible.
Copyright© 2022 湘ICP备2022001581号-3