Machine learning: how do you load the Titanic dataset with fetch_openml?
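Author: 苏南大叔 Source: 程序如此灵动~
The sklearn package does not ship a titanic.csv dataset file. However, the fetch_openml() function gives access to many more datasets, and titanic is one of them. On the openml website, the Titanic dataset also exists in several versions. So how do you choose which version of an openml dataset to load? That is the question this article addresses.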
Hello everyone, this is 苏南大叔's blog "程序如此灵动", which tells the story of 苏南大叔 and computer code. This article focuses on the Titanic dataset. Test environment: python@3.6.8, pandas@1.1.5, scikit-learn@1.3.2.
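The Titanic dataset
First, like the iris dataset, the Titanic dataset exists in several versions, and different websites offer different data sources. This article focuses on the two versions of the Titanic data provided on the openml website; they are the main subject of the experiments here. Note that the data is stored neither as csv nor as txt, but in the .arff format (ARFF, Attribute-Relation File Format), so the usual way of loading a csv dataset cannot parse it.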
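To make the difference from csv concrete, here is a minimal sketch of what an ARFF file looks like and how its header could be separated from its data. The parse_arff helper and the @RELATION name are assumptions made for illustration; the attribute lines mirror the version 2 header listed later in this article, and the three data rows are taken from the sample output. This is not how fetch_openml() parses ARFF internally, and it handles only unquoted names.

```python
import csv
import io

# A tiny ARFF document. The @RELATION name is an assumption; the attribute
# lines follow the version 2 header quoted in this article.
ARFF_TEXT = """\
@RELATION titanic
@ATTRIBUTE Class NUMERIC
@ATTRIBUTE Age NUMERIC
@ATTRIBUTE Sex NUMERIC
@ATTRIBUTE class {-1,1}
@DATA
-1.87,-0.228,0.521,-1
-0.923,-0.228,-1.92,1
-0.923,-0.228,-1.92,1
"""


def parse_arff(text):
    """Split a simple ARFF document into (attributes, rows).

    Hypothetical helper: supports only unquoted attribute names and
    comma-separated data lines.
    """
    attributes = []
    data_lines = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            data_lines.append(line)
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True  # everything after @DATA is csv-like
    rows = list(csv.reader(io.StringIO("\n".join(data_lines))))
    return attributes, rows


attributes, rows = parse_arff(ARFF_TEXT)
print(attributes)
print(rows[0])
```

Only the part after @DATA is csv-like; the header lines carry the column names and types, which is why a plain csv reader pointed at the whole file gets confused.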
For the meaning of the Titanic dataset's fields, see the article:
For the parameters and return value of fetch_openml(), compare with:
Titanic dataset version 1, 14 fields
Documentation:
Download:
Fields:
@attribute 'pclass' numeric
@attribute 'survived' {0,1}
@attribute 'name' string
@attribute 'sex' {'female','male'}
@attribute 'age' numeric
@attribute 'sibsp' numeric
@attribute 'parch' numeric
@attribute 'ticket' string
@attribute 'fare' numeric
@attribute 'cabin' string
@attribute 'embarked' {'C','Q','S'}
@attribute 'boat' string
@attribute 'body' numeric
@attribute 'home.dest' string
Loading code:
from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=1)
Titanic dataset version 2, 4 fields
Documentation:
Download:
Fields:
@ATTRIBUTE Class NUMERIC
@ATTRIBUTE Age NUMERIC
@ATTRIBUTE Sex NUMERIC
@ATTRIBUTE class {-1,1}
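Note the two similarly named fields. The last field, lowercase class, is the target, so it plays the role of the survived field. The first field, capitalized Class, is an ordinary feature.
Loading code: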
from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=2)
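Besides name plus version, fetch_openml() also accepts a numeric data_id, which identifies one specific dataset version. The sketch below collects both ways of pinning a version in one place. The helper names titanic_kwargs and load_titanic are hypothetical; the two IDs come from this article's own output (the version 2 details report id 40704, and its DESCR points to https://www.openml.org/d/40945 for version 1). The download itself is kept inside a function because it needs network access.

```python
# Dataset IDs as they appear in this article's output; treat them as
# assumptions if the openml site changes.
TITANIC_DATA_IDS = {1: 40945, 2: 40704}


def titanic_kwargs(version, by_id=False):
    """Build keyword arguments for fetch_openml (hypothetical helper)."""
    if version not in TITANIC_DATA_IDS:
        raise ValueError(f"unknown titanic version: {version!r}")
    if by_id:
        # data_id alone identifies the dataset, so no name is passed.
        return {"data_id": TITANIC_DATA_IDS[version], "parser": "auto"}
    return {"name": "titanic", "version": version, "parser": "auto"}


def load_titanic(version, by_id=False):
    """Download the dataset; requires network access and scikit-learn."""
    from sklearn.datasets import fetch_openml  # imported lazily on purpose

    return fetch_openml(**titanic_kwargs(version, by_id))


print(titanic_kwargs(2))
print(titanic_kwargs(2, by_id=True))
```

Calling load_titanic(2) or load_titanic(2, by_id=True) should then fetch the same 4-field dataset, assuming the IDs above are still valid.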
Possible problems (parameters)
Code:
from sklearn.datasets import fetch_openml
t1 = fetch_openml("titanic")
Warning:
FutureWarning: The default value of `parser` will change from `'liac-arff'` to `'auto'` in 1.4. You can set `parser='auto'` to silence this warning. Therefore, an `ImportError` will be raised from 1.4 if the dataset is dense and pandas is not installed. Note that the pandas parser may return different data types. See the Notes Section in fetch_openml's API doc for details.
The fix is to add the extra parameter parser='auto'.
from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser='auto')
Warning:
UserWarning: Multiple active versions of the dataset matching the name titanic exist. Versions may be fundamentally different, returning version 2.
The fix is to name the version explicitly, for example with the extra parameter version=2.
from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=2)
The return value is, once again, a Bunch.
<class 'sklearn.utils._bunch.Bunch'>
{'data': Class Age Sex
0 -1.8700 -0.228 0.521
1 -0.9230 -0.228 -1.920
2 -0.9230 -0.228 -1.920
3 0.9650 -0.228 0.521
4 0.0214 -0.228 0.521
... ... ... ...
2196 0.9650 -0.228 0.521
2197 -0.9230 -0.228 0.521
2198 -1.8700 -0.228 0.521
2199 0.9650 -0.228 0.521
2200 -0.9230 -0.228 -1.920
[2201 rows x 3 columns], 'target': 0 -1
1 1
2 1
3 1
4 -1
..
2196 -1
2197 -1
2198 -1
2199 -1
2200 1
Name: class, Length: 2201, dtype: category
Categories (2, object): ['-1', '1'], 'frame': Class Age Sex class
0 -1.8700 -0.228 0.521 -1
1 -0.9230 -0.228 -1.920 1
2 -0.9230 -0.228 -1.920 1
3 0.9650 -0.228 0.521 1
4 0.0214 -0.228 0.521 -1
... ... ... ... ...
2196 0.9650 -0.228 0.521 -1
2197 -0.9230 -0.228 0.521 -1
2198 -1.8700 -0.228 0.521 -1
2199 0.9650 -0.228 0.521 -1
2200 -0.9230 -0.228 -1.920 1
[2201 rows x 4 columns], 'categories': None, 'feature_names': ['Class', 'Age', 'Sex'], 'target_names': ['class'], 'DESCR': 'PMLB version of the Titanic dataset, which only uses 3 features. See version 1 for the complete version: https://www.openml.org/d/40945\n\nDownloaded from openml.org.', 'details': {'id': '40704', 'name': 'Titanic', 'version': '2', 'description_version': '1', 'format': 'ARFF', 'upload_date': '2017-04-06T12:38:28', 'licence': 'public', 'url': 'https://api.openml.org/data/v1/download/4965305/Titanic.arff', 'parquet_url': 'https://openml1.win.tue.nl/datasets/0004/40704/dataset_40704.pq', 'file_id': '4965305', 'default_target_attribute': 'class', 'tag': ['Computer Systems', 'derived', 'Machine Learning'], 'visibility': 'public', 'minio_url': 'https://openml1.win.tue.nl/datasets/0004/40704/dataset_40704.pq', 'status': 'active', 'processing_date': '2018-10-04 07:15:38', 'md5_checksum': '08416114dd85d0ebd932fcb1d87650c1'}, 'url': 'https://www.openml.org/d/40704'}
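The Bunch printed above is a dict subclass whose keys can also be read as attributes, which is why this article can write both t["data"] and t.feature_names. A minimal sketch, using a hand-built Bunch with values copied from the output above instead of a downloaded one:

```python
from sklearn.utils import Bunch

# Toy stand-in for the object returned by fetch_openml(); nothing is
# downloaded here.
b = Bunch(feature_names=["Class", "Age", "Sex"], target_names=["class"])

# Key access and attribute access return the very same object.
print(b["feature_names"] is b.feature_names)

# Because Bunch subclasses dict, the usual dict methods are available.
print(isinstance(b, dict))
print(sorted(b.keys()))
```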
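Breaking down the returned data
Following the pattern of load_iris() for the iris dataset, fetch_openml() can also return the following values:
data (the features used for computation)
print(type(t["data"]),t["data"])
Output: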
<class 'pandas.core.frame.DataFrame'>
Class Age Sex
0 -1.8700 -0.228 0.521
1 -0.9230 -0.228 -1.920
2 -0.9230 -0.228 -1.920
3 0.9650 -0.228 0.521
4 0.0214 -0.228 0.521
... ... ... ...
2196 0.9650 -0.228 0.521
2197 -0.9230 -0.228 0.521
2198 -1.8700 -0.228 0.521
2199 0.9650 -0.228 0.521
2200 -0.9230 -0.228 -1.920
[2201 rows x 3 columns]
target (the target to be predicted)
print(type(t["target"]),t["target"])
Output:
<class 'pandas.core.series.Series'>
0 -1
1 1
2 1
3 1
4 -1
..
2196 -1
2197 -1
2198 -1
2199 -1
2200 1
Name: class, Length: 2201, dtype: category
Categories (2, object): ['-1', '1']
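As the output shows, target is a categorical Series whose categories are the strings '-1' and '1', not integers. If numeric labels are needed, one possible conversion is sketched below. The Series is rebuilt by hand from the first five values shown above, so no download is involved.

```python
import pandas as pd

# First five target values from the output above, as a categorical Series
# with string categories (toy data standing in for t["target"]).
y = pd.Series(["-1", "1", "1", "1", "-1"], name="class", dtype="category")
print(y.dtype)

# Convert via str first, so the result does not depend on how a given
# pandas version casts categoricals directly to int.
y_int = y.astype(str).astype(int)
print(y_int.tolist())
```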
target_names
print(type(t["target_names"]),t["target_names"])
Output:
<class 'list'> ['class']
feature_names
print(type(t.feature_names),t.feature_names)
Output:
<class 'list'> ['Class', 'Age', 'Sex']
frame (data plus target)
print(type(t.frame),t.frame)
Output:
<class 'pandas.core.frame.DataFrame'>
Class Age Sex class
0 -1.8700 -0.228 0.521 -1
1 -0.9230 -0.228 -1.920 1
2 -0.9230 -0.228 -1.920 1
3 0.9650 -0.228 0.521 1
4 0.0214 -0.228 0.521 -1
... ... ... ... ...
2196 0.9650 -0.228 0.521 -1
2197 -0.9230 -0.228 0.521 -1
2198 -1.8700 -0.228 0.521 -1
2199 0.9650 -0.228 0.521 -1
2200 -0.9230 -0.228 -1.920 1
DESCR
print(type(t.DESCR),t.DESCR)
Output:
<class 'str'>
PMLB version of the Titanic dataset, which only uses 3 features. See version 1 for the complete version: https://www.openml.org/d/40945
Downloaded from openml.org.
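The as_frame parameter, default 'auto'
as_frame controls the types of the returned members. The return value as a whole is still a Bunch. With the default 'auto', a dense dataset such as this one is returned in pandas containers, just as with as_frame=True.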
from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=2)
t = fetch_openml("titanic", parser="auto", version=2, as_frame=False)
# print(type(t), t)
print(type(t["data"]))
print(type(t["target"]))
print(type(t["target_names"]))
print(type(t["feature_names"]))
print(type(t["frame"]))
as_frame | data | target | target_names | feature_names | frame |
---|---|---|---|---|---|
'auto' (default) / True | DataFrame | Series | list | list | DataFrame |
False | ndarray | ndarray | list | list | NoneType |
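The return_X_y parameter, default False
Setting return_X_y=True overturns all of the conclusions above: the function returns an (X, y) pair instead of a Bunch. Here X is the former data, y is the former target, and X plus y together make up the former frame.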
from sklearn.datasets import fetch_openml
t = fetch_openml("titanic", parser="auto", version=2)
print(type(t))
X,y = fetch_openml("titanic", parser="auto", version=2, return_X_y=True)
print(type(X), X)
print(type(y), y)
Output:
<class 'sklearn.utils._bunch.Bunch'>
<class 'pandas.core.frame.DataFrame'>
Class Age Sex
0 -1.8700 -0.228 0.521
1 -0.9230 -0.228 -1.920
2 -0.9230 -0.228 -1.920
3 0.9650 -0.228 0.521
4 0.0214 -0.228 0.521
... ... ... ...
2196 0.9650 -0.228 0.521
2197 -0.9230 -0.228 0.521
2198 -1.8700 -0.228 0.521
2199 0.9650 -0.228 0.521
2200 -0.9230 -0.228 -1.920
[2201 rows x 3 columns]
<class 'pandas.core.series.Series'>
0 -1
1 1
2 1
3 1
4 -1
..
2196 -1
2197 -1
2198 -1
2199 -1
2200 1
Name: class, Length: 2201, dtype: category
Categories (2, object): ['-1', '1']
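The relationship between X, y and frame described in this section can be checked with plain pandas. The sketch below rebuilds the first three rows from the output above by hand, so it runs without a download. It also shows that the column labels Class and class are distinct because pandas labels are case-sensitive.

```python
import pandas as pd

# First three rows of the version 2 data, copied from the output above.
X = pd.DataFrame(
    {
        "Class": [-1.87, -0.923, -0.923],
        "Age": [-0.228, -0.228, -0.228],
        "Sex": [0.521, -1.92, -1.92],
    }
)
y = pd.Series(["-1", "1", "1"], name="class", dtype="category")

# X plus y, side by side, has the same layout as the frame member.
frame = pd.concat([X, y], axis=1)
print(frame.columns.tolist())

# Going the other way: dropping the lowercase target column leaves the
# capitalized Class feature untouched.
X_again = frame.drop(columns="class")
y_again = frame["class"]
print(X_again.equals(X), y_again.equals(y))
```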
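Conclusion
For more articles on machine learning experience, see 苏南大叔's blog:
This blog does not welcome mirroring or scraping of any kind. Please respect original content, and keep the author link when reposting.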