How do you change a DataFrame's dtype to Categorical? Custom sorting and filtering


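This post introduces a dtype that may be new to you: Categorical. What is the Categorical data type, when should you use it, and how do you use .astype() to change the dtype of a DataFrame column? Those are the questions this post works through.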

[Figure 2-1: custom ordering with CategoricalDtype]

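苏南大叔's blog "程序如此灵动" records the programming stories he comes across and thinks about. Test environment for this post: win10, python@3.12.0, scikit-learn@1.3.2, pandas@2.1.3. The pandas version matters here, because many of the solutions circulating online were written for early pandas versions and no longer work.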

Recap of earlier posts

As usual, this first section revisits related posts written earlier. If you need the background, read these first.

"How to define a DataFrame": https://newsn.net/say/pandas-dataframe.html

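A pandas DataFrame resembles a two-dimensional numpy ndarray; you can think of it as an enhanced 2-D ndarray with labeled rows and columns, where each column can carry its own dtype. The posts below explain dtype in terms of ndarray rather than DataFrame, so treat them as background only: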

"How to understand an ndarray's dtype": https://newsn.net/say/ndarray-dtype.html
"Changing an ndarray's dtype with np.astype()": https://newsn.net/say/numpy-astype.html

Supporting cast

Once again, 苏南大叔's list of backflipping pets, all graduates of a dedicated pet school, plays the supporting cast for the DataFrame:

from pandas import DataFrame
df = DataFrame([
        ('虎子',"小学"),
        ('老许',"高中"),
        ('二赖子',"初中"),
        ('老白',"文盲"),
        ('小黑',"幼儿园"),
    ],
    columns=('name','education')
)
print(df)
print(df.dtypes)
print(df["name"].dtype)
print(df["education"].dtype)

Output:

  name education
0   虎子        小学
1   老许        高中
2  二赖子        初中
3   老白        文盲
4   小黑       幼儿园

name         object
education    object
dtype: object

object
object

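So the first conclusions about dtype are:

On a DataFrame the attribute is .dtypes; on a single column it is .dtype. (Note that the spelling differs by one letter, s.)
In this pandas version, string columns are reported with dtype object.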
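To make the .dtypes versus .dtype distinction concrete, here is a minimal sketch. The two-row DataFrame is a cut-down copy of the pet list above, and the variable names column_dtypes and education_dtype are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [("虎子", "小学"), ("老许", "高中")],
    columns=("name", "education"),
)

# DataFrame.dtypes is a Series that maps each column name to its dtype.
column_dtypes = df.dtypes

# Series.dtype is the single dtype of one column.
education_dtype = df["education"].dtype

print(type(column_dtypes).__name__)                   # Series
print(list(column_dtypes.index))                      # ['name', 'education']
print(column_dtypes["education"] == education_dtype)  # True
```

In other words, df.dtypes["education"] and df["education"].dtype are two routes to the same value.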

Categorical: the ordered categorical data type

To 苏南大叔, Categorical looks a lot like an enumeration type, but the two are not the same thing.

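As an everyday English word, "categorical" means "unambiguous; absolute". In pandas it is used in the statistical sense: a value that belongs to one of a fixed set of categories.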

A Categorical dtype is defined like this:

from pandas.api.types import CategoricalDtype
_dtype_order = CategoricalDtype(categories=["文盲","幼儿园","小学","初中","高中"], ordered=True)
print(_dtype_order,type(_dtype_order))

Output:

category <class 'pandas.core.dtypes.dtypes.CategoricalDtype'>

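Two things worth noting:

Printing a variable of this type always shows just category, which looks impressive but tells you nothing about the categories or their order. A slightly sloppy design choice; repr() shows the full definition.
The ordered parameter is not explored in depth here. Just keep it set to True, because the comparison-based filtering later in this post requires an ordered Categorical.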
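The sketch below looks inside a CategoricalDtype and shows what the ordered flag changes in practice. The names levels, ordered_dtype, unordered_dtype and the small Series are made up for illustration; only the category list comes from this post.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["文盲", "幼儿园", "小学", "初中", "高中"]
ordered_dtype = CategoricalDtype(categories=levels, ordered=True)
unordered_dtype = CategoricalDtype(categories=levels, ordered=False)

# print() only shows "category"; the details live in these attributes.
print(list(ordered_dtype.categories))  # ['文盲', '幼儿园', '小学', '初中', '高中']
print(ordered_dtype.ordered)           # True
print(repr(ordered_dtype))             # full definition, including the categories

# With ordered=False, inequality comparisons are rejected.
unordered = pd.Series(["小学", "高中"]).astype(unordered_dtype)
try:
    unordered >= "小学"
    comparison_failed = False
except TypeError as exc:
    comparison_failed = True
    print("TypeError:", exc)
```

So ordered=True is what makes >= and < meaningful for the column.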

[Figure 2-2: custom sorting code]

Setting the `dtype`

df["education"] = df["education"].astype(_dtype_order)
print(df.dtypes)
print(df["education"].dtype)

Output:

name           object
education    category
dtype: object

category

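Note that the result of .astype() is assigned back to the original column, because .astype() returns a new Series rather than modifying the column in place. The column's dtype changes from object to category.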
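One side effect of the conversion is easy to miss: values that are not listed in the categories do not raise an error, they become missing values. A minimal sketch follows; the value "博士" is deliberately outside the category list, and the names levels, raw and converted are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = CategoricalDtype(
    categories=["文盲", "幼儿园", "小学", "初中", "高中"], ordered=True
)

raw = pd.Series(["小学", "高中", "博士"])  # "博士" is not one of the categories
converted = raw.astype(levels)

# astype() returned a new Series; the source Series is unchanged.
print(raw.tolist())                  # ['小学', '高中', '博士']

# The unknown value is now missing, and its integer code is -1.
print(converted.isna().tolist())     # [False, False, True]
print(converted.cat.codes.tolist())  # [2, 4, -1]
```

The codes are positions in the category list, which is what sorting and comparison rely on.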

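Custom sorting

On the surface this defines a dtype, but what it really defines is the sort order for the column. After .astype(), the column sorts in the custom order. Compare the result before and after the conversion: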

from pandas import DataFrame
df = DataFrame([
        ('虎子',"小学"),
        ('老许',"高中"),
        ('二赖子',"初中"),
        ('老白',"文盲"),
        ('小黑',"幼儿园"),
    ],
    columns=('name','education')
)

sorted_df = df.sort_values('education', ascending=True)
print(sorted_df)

from pandas.api.types import CategoricalDtype
_dtype_order = CategoricalDtype(categories=["文盲","幼儿园","小学","初中","高中"], ordered=True)
df["education"] = df["education"].astype(_dtype_order)

sorted_df = df.sort_values('education', ascending=True)
print(sorted_df)

Output:

  name education
2  二赖子        初中
0   虎子        小学
4   小黑       幼儿园
3   老白        文盲
1   老许        高中

  name education
3   老白        文盲
4   小黑       幼儿园
0   虎子        小学
2  二赖子        初中
1   老许        高中

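The first result orders the strings by character code; the second follows the declared education levels. Clearly, the new ordering makes much more sense.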
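The same ordering also drives descending sorts and min()/max(). A small sketch under the same category list; the Series name education and the variable levels are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = CategoricalDtype(
    categories=["文盲", "幼儿园", "小学", "初中", "高中"], ordered=True
)
education = pd.Series(["小学", "高中", "初中", "文盲", "幼儿园"]).astype(levels)

# Descending sort follows the declared order in reverse.
descending = education.sort_values(ascending=False).tolist()
print(descending)  # ['高中', '初中', '小学', '幼儿园', '文盲']

# min() and max() use the category order, not the character codes.
print(education.min(), education.max())  # 文盲 高中
```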

Custom filtering

With the dtype set to an ordered Categorical, the column also supports sensible comparison-based filtering. For example:

print(df["education"]>="初中")
print(df[df["education"]>="初中"])

Output:

0    False
1     True
2     True
3    False
4    False
Name: education, dtype: bool

  name education
1   老许        高中
2  二赖子        初中
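Comparisons can be combined into a range filter in the usual way. The sketch below rebuilds the pet list so that it runs on its own. It also shows one limitation: comparing against a value that is not among the categories ("博士" here, chosen for illustration) raises a TypeError.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame(
    [("虎子", "小学"), ("老许", "高中"), ("二赖子", "初中"),
     ("老白", "文盲"), ("小黑", "幼儿园")],
    columns=("name", "education"),
)
levels = CategoricalDtype(
    categories=["文盲", "幼儿园", "小学", "初中", "高中"], ordered=True
)
df["education"] = df["education"].astype(levels)

# Range filter: at least 幼儿园 but below 高中.
mask = (df["education"] >= "幼儿园") & (df["education"] < "高中")
middle = df[mask]
print(middle["name"].tolist())  # ['虎子', '二赖子', '小黑']

# A value outside the categories has no position in the order.
try:
    df["education"] >= "博士"
    unknown_failed = False
except TypeError as exc:
    unknown_failed = True
    print("TypeError:", exc)
```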

Problems you may run into

If you use the old syntax:

df['education'].astype('category',categories=[],ordered=True)

you may hit the following error:

TypeError: NDFrame.astype() got an unexpected keyword argument 'categories'

Reference:

https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#removal-of-prior-version-deprecations-changes

Removed the previously deprecated ordered and categories keyword arguments in astype (GH 17742)
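For anyone migrating old code, here is a minimal sketch of the failure next to two replacements that should produce the same column: passing a CategoricalDtype to .astype(), as this post does, and building the column with pd.Categorical. The names s, levels, via_astype and via_constructor are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["文盲", "幼儿园", "小学", "初中", "高中"]
s = pd.Series(["小学", "高中", "文盲"])

# Old style: the extra keyword arguments are no longer accepted.
try:
    s.astype("category", categories=levels, ordered=True)
    old_style_failed = False
except TypeError:
    old_style_failed = True

# Replacement 1: describe the categories in a CategoricalDtype.
via_astype = s.astype(CategoricalDtype(categories=levels, ordered=True))

# Replacement 2: construct the Categorical directly.
via_constructor = pd.Series(pd.Categorical(s, categories=levels, ordered=True))

print(old_style_failed)                        # True
print(via_astype.dtype == via_constructor.dtype)  # True
print(via_astype.equals(via_constructor))      # True
```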

Effect on the one-hot encoder's .categories_ attribute

One more example:

from pandas import DataFrame
from pandas.api.types import CategoricalDtype
from sklearn import preprocessing

df = DataFrame([
        ("小学"),
        ("文盲"),
        ("幼儿园"),
    ]
)

# Plain object column: the encoder sorts the unique strings itself.
one = preprocessing.OneHotEncoder()
a = one.fit_transform(df).toarray()
print(a)
'''
[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
'''
print(one.categories_)
# [array(['小学', '幼儿园', '文盲'], dtype=object)]

# "小学" is not among these categories, so astype() turns it into NaN.
_dtype_order = CategoricalDtype(categories=["文盲","幼儿园","高中","硕士"], ordered=True)
df[0] = df[0].astype(_dtype_order)

# Categorical column: judging by the output below, the encoder still derives
# its categories from the values present, not from the declared order.
two = preprocessing.OneHotEncoder()
a = two.fit_transform(df).toarray()
print(a)
'''
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
'''
print(two.categories_)
# [array(['幼儿园', '文盲', nan], dtype=object)]
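If the goal is a one-hot matrix whose columns follow the declared categories, one pandas-only alternative is pd.get_dummies. This is a different technique swapped in for illustration, not what the OneHotEncoder code above does. As far as I can tell it takes its columns from the Categorical's categories, including ones that never occur. The names levels, s and dummies are illustrative. (scikit-learn's OneHotEncoder also accepts an explicit categories argument, which is not shown here.)

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = CategoricalDtype(categories=["文盲", "幼儿园", "高中", "硕士"], ordered=True)

# "小学" is outside the categories and becomes NaN after the conversion.
s = pd.Series(["小学", "文盲", "幼儿园"]).astype(levels)

dummies = pd.get_dummies(s)

# Columns follow the declared categories, in the declared order.
print(list(dummies.columns))                # ['文盲', '幼儿园', '高中', '硕士']

# The NaN row is all zeros; unobserved categories are all-zero columns.
print(dummies.astype(int).values.tolist())  # [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
```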

Closing

For more Python notes, see:

https://newsn.net/tag/python/
