Python: how to convert Chinese characters to Unicode code points, and code points back to characters


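In the previous article, 苏南大叔 guessed that when LabelEncoder encodes Chinese characters, the resulting classes are sorted in Unicode code point order. Is that guess correct? That question leads to the topic of this article: how do you convert a Chinese character to its Unicode code point, and how do you convert the code point back into the character?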

(Figure 3-1: Chinese characters to code points)

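This is 苏南大叔's blog “程序如此灵动”, a record of his programming notes. Test environment: win10, python@3.12.0. The core functions in this article are ord() and chr(), which work much like their counterparts in many other high-level languages.

In the standard notation, the hexadecimal code is prefixed with a backslash and the letter u, for example \u82cf. Reference: https://tool.chinaz.com/tools/unicode.aspx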

Characters to code points

Test code:

a = ord("苏")       
b = ord("南")
c = ord("大")
d = ord("叔")
print(a,b,c,d)   #  33487 21335 22823 21460

Written in the more common Unicode notation:

su = ord("苏")                                # 33487
su_hex = hex(ord("苏"))                       # 0x82cf
su_unicode = "\\u" + hex(ord("苏"))[2:]       # \u82cf
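
One caveat: hex() does not zero-pad, so the slice above only produces a valid four-digit \u escape for code points from U+1000 to U+FFFF. The following is a minimal sketch, assuming a hypothetical helper named to_escape (not part of the original article), that pads with a format specifier and switches to the eight-digit \U form for code points above U+FFFF:

```python
def to_escape(ch: str) -> str:
    """Return a Python-style escape for one character (hypothetical helper)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        # Four hex digits, zero-padded on the left
        return f"\\u{cp:04x}"
    # Code points beyond the Basic Multilingual Plane need eight digits
    return f"\\U{cp:08x}"

print(to_escape("苏"))   # \u82cf
print(to_escape("A"))    # \u0041
print(to_escape("😀"))   # \U0001f600
```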

(Figure 3-2: cn-unicode)

ord() accepts exactly one character at a time; passing a whole string raises an error.

print(ord("苏南大叔"))
# TypeError: ord() expected a character, but string of length 4 found

To get the code points of a whole series of characters, a generator expression like the following is convenient:

aaa = list(ord(n) for n in list("苏南大叔"))                           
#  [33487, 21335, 22823, 21460]

aaa_hex = list(hex(ord(n)) for n in list("苏南大叔"))                  
#  ['0x82cf', '0x5357', '0x5927', '0x53d4']

aaa_unicode = list("\\u" + hex(ord(n))[2:] for n in list("苏南大叔"))  
#  ['\\u82cf', '\\u5357', '\\u5927', '\\u53d4']
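
As an alternative to building each escape by hand, Python's built-in unicode_escape codec can escape a whole string in one call. This is only a sketch of that approach; the variable names are illustrative, and the result is one joined string rather than a list:

```python
text = "苏南大叔"

# Encoding with unicode_escape yields ASCII bytes; decode them to get a str
escaped = text.encode("unicode_escape").decode("ascii")
print(escaped)   # \u82cf\u5357\u5927\u53d4

# Plain ASCII letters pass through unchanged
mixed = "py苏".encode("unicode_escape").decode("ascii")
print(mixed)     # py\u82cf
```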

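Code points to characters

Depending on the form of the input, there are different ways to get back to the character:

Decimal number

If the input is a decimal number, for example 33487, use chr() to convert it back.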

a = chr(33487)                                             # 苏
b = chr(21335)                                             # 南
aaa = list(chr(n) for n in [33487, 21335, 22823, 21460])   # ['苏', '南', '大', '叔']
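
If a single string is wanted rather than a list of characters, the results of chr() can be joined. A small sketch with illustrative variable names, including a round-trip check through ord():

```python
codes = [33487, 21335, 22823, 21460]

# Join the individual characters back into one string
name = "".join(chr(n) for n in codes)
print(name)                                # 苏南大叔

# Converting back with ord() gives the original numbers
print([ord(ch) for ch in name] == codes)   # True
```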

Hexadecimal number

If the input is a hexadecimal string, for example '0x82cf', there are two ways to convert it back:

a = chr(int('0x82cf',16))          # 苏
b = chr(int('0x5357',16))          # 南
aaa = list(chr(int(n,16)) for n in ['0x82cf', '0x5357', '0x5927', '0x53d4'])
# ['苏', '南', '大', '叔']

Or:

a = ('\\u'+'0x82cf'[2:]).encode('utf-8').decode('unicode_escape')          # 苏
b = ('\\u'+'0x5357'[2:]).encode('utf-8').decode('unicode_escape')          # 南
aaa = list(('\\u'+n[2:]).encode('utf-8').decode('unicode_escape') for n in ['0x82cf', '0x5357', '0x5927', '0x53d4'])
# ['苏', '南', '大', '叔']

(Figure 3-3: code2cn)

Unicode escape format

If the input is a string in Unicode escape format, for example '\\u82cf' (that is, the literal characters backslash, u, 8, 2, c, f), it can be converted back like this:

a = chr(int('0x' + '\\u82cf'[2:],16))                      # 苏
b = chr(int('0x' + '\\u5357'[2:],16))                      # 南
aaa = list(chr(int('0x'+n[2:],16)) for n in ['\\u82cf', '\\u5357', '\\u5927', '\\u53d4'])   
# ['苏', '南', '大', '叔']

It can also be converted back like this:

a = '\\u82cf'.encode('utf-8').decode('unicode_escape')                      # 苏
b = '\\u5357'.encode('utf-8').decode('unicode_escape')                      # 南
aaa = list(n.encode('utf-8').decode('unicode_escape') for n in ['\\u82cf', '\\u5357', '\\u5927', '\\u53d4'])   # ['苏', '南', '大', '叔']
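
The encode('utf-8').decode('unicode_escape') route works when the input contains only ASCII characters, as in the examples above. If the input mixes escapes with real non-ASCII text, the codec reads the UTF-8 bytes as Latin-1 and the result is garbled. A minimal sketch of a regex-based alternative, assuming a hypothetical helper named decode_escapes that only handles the four-digit \u form:

```python
import re

def decode_escapes(s: str) -> str:
    """Replace each backslash-u escape with its character (hypothetical helper)."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        s,
    )

fixed = decode_escapes("你好，\\u82cf\\u5357")
print(fixed)                 # 你好，苏南

# The codec-based route garbles the characters that were not escaped
broken = "你好\\u82cf".encode("utf-8").decode("unicode_escape")
print(broken == "你好苏")    # False
```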

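Aside: hexadecimal and decimal

Hexadecimal (abbreviated hex, or written with a subscript 16) is a base-16 positional number system: each position carries over at 16. It is usually written with the digits 0 to 9 and the letters A, B, C, D, E, F (or a, b, c, d, e, f), where A to F stand for 10 to 15. These are called hexadecimal digits.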

a16 = hex(10)                 # 0xa
a10 = int(0xa)                # 10
print(a16,a10)

# a10_2 = int(0xa,16)         # TypeError: int() can't convert non-string with explicit base
a10_2 = int("0xa",16)         # 10
a10_3 = int("0xa",base=16)    # 10
print(a10_2,a10_3)

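This error is a little surprising at first. The reason is that 0xa is already an int literal, and int() only accepts an explicit base when the value is passed as a string.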
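
A few related conversions may help here. This is only a sketch using built-in functions, with illustrative variable names: base 0 lets int() infer the base from the prefix, and format specifiers produce hexadecimal text with or without the 0x prefix.

```python
from_prefixed = int("0xa", 0)     # base 0: infer the base from the "0x" prefix
from_bare = int("a", 16)          # no prefix needed when the base is given
print(from_prefixed, from_bare)   # 10 10

bare_hex = format(33487, "x")     # hex digits without the prefix
prefixed_hex = f"{33487:#06x}"    # with prefix, padded to 6 characters in total
print(bare_hex, prefixed_hex)     # 82cf 0x82cf

# A hex literal and a decimal literal are the same int
print(0x82cf == 33487)            # True
```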

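Back to the previous question

Returning to the article linked below: the order of LabelEncoder's .classes_ is very likely just Unicode code point order (for digits, letters, and Chinese characters alike).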

https://newsn.net/say/sklearn-label-encoder.html

aaa = list(ord(n) for n in list("苏南大叔"))    # [33487, 21335, 22823, 21460]
print(aaa)

aaa.sort()                                      # [21335, 21460, 22823, 33487]      
print(aaa)

bbb = list(chr(n) for n in aaa)                 # ['南', '叔', '大', '苏']
print(bbb)
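
The same order can be checked without going through ord() and chr(), because Python compares strings by code point. The sketch below does not call scikit-learn, so it only shows that Python's own string sort agrees with the code point sort above; it does not by itself confirm how LabelEncoder orders .classes_.

```python
chars = list("苏南大叔")

by_string = sorted(chars)               # plain string comparison
by_code_point = sorted(chars, key=ord)  # explicit sort by code point

print(by_string)                        # ['南', '叔', '大', '苏']
print(by_string == by_code_point)       # True
```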

https://newsn.net/say/python2-encoding.html

Conclusion

For more Python articles by 苏南大叔, follow the link below:

https://newsn.net/tag/python/
