How do you use lightgbm's LGBMClassifier to make predictions on the Iris dataset?
Published by 苏南大叔. Source: 程序如此灵动~
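Following the previous post on XGBoost, here is another prediction model that is not included in sklearn: LightGBM, from microsoft. Sounds impressive, right? In practice it is just as simple to use, and the workflow is basically the same as for other models. The one peculiarity is that a couple of parameters need to be set explicitly, otherwise training prints a lot of warning messages.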
苏南大叔's blog "程序如此灵动" records 苏南大叔's thoughts and notes on code. Test environment for this post: win10, python@3.12.0, pandas@2.1.3, scikit-learn@1.3.2, LightGBM@4.1.0.
LightGBM
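LightGBM is currently maintained by Microsoft, so it comes with a solid pedigree. Reference link:
The introduction on the official page reads: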
LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:
- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel, distributed, and GPU learning.
- Capable of handling large-scale data.
LightGBM is still not bundled with sklearn, so it has to be installed separately.
pip install lightgbm
This post again uses LightGBM on the Iris dataset to see how well it performs.
Loading the Iris dataset
This is the familiar part. The code is as follows:
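from sklearn.model_selection import train_test_split
import pandas as pd
# Iris training CSV hosted by TensorFlow
data_url = "http://download.tensorflow.org/data/iris_training.csv"
# Sepal length, sepal width, petal length, petal width, species
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
# header=0 discards the file's own first row and applies column_names instead
data = pd.read_csv(data_url, header=0, names=column_names)
X = data.iloc[:, :-1].values             # the four feature columns
y = data.iloc[:, -1:].values.flatten()   # the label column as a 1-D array
# Hold out 20% of the rows for evaluation
X_train, X_true, y_train, y_true = train_test_split(X, y, test_size=0.2, random_state=8)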
Readers who are unsure about any of this can refer to the following posts:
- https://newsn.net/say/sklearn-load_iris.html
- https://newsn.net/say/sklearn-csv.html
- https://newsn.net/say/sklearn-train_test_split.html
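If the CSV URL is unreachable, a similar split can be sketched offline with the copy of Iris that ships with scikit-learn. This is an assumption for illustration, not the data used in this post: the bundled dataset has 150 rows, so the numbers it produces will differ from the outputs shown here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data (150 rows) replaces the downloaded CSV.
X, y = load_iris(return_X_y=True)

# Same split settings as in the post: 20% held out, fixed random seed.
X_train, X_true, y_train, y_true = train_test_split(X, y, test_size=0.2, random_state=8)

print(X_train.shape, X_true.shape)
# Number of held-out samples per class (classes are encoded as 0, 1, 2).
print(np.bincount(y_true))
```

The variable names match the ones used in the post, so the later steps can be run against this split as well.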
Prediction with the LGBMClassifier model
import lightgbm as lgb
model = lgb.LGBMClassifier(verbose=-1, num_threads=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_true)
print(y_pred)
print("LGBM prediction accuracy:", model.score(X_true, y_true))
Output:
[1 2 2 2 1 1 0 0 1 1 0 2 2 0 0 2 2 2 2 0 0 2 0 1]
LGBM prediction accuracy: 0.9166666666666666
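Possible issues
This is where LGBMClassifier behaves a little differently. If no parameters are set, it prints a number of warnings and log messages while processing the Iris dataset. For example: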
D:\Program Files\Python312\Lib\site-packages\joblib\externals\loky\backend\context.py:136: UserWarning: Could not find the number of physical cores for the following reason:
found 0 physical cores < 1
Returning the number of logical cores instead. You can silence this warning by setting LOKY_MAX_CPU_COUNT to the number of cores you want to use.
warnings.warn(
File "D:\Program Files\Python312\Lib\site-packages\joblib\externals\loky\backend\context.py", line 282, in _count_physical_cores
raise ValueError(f"found {cpu_count_physical} physical cores < 1")
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000049 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 80
[LightGBM] [Info] Number of data points in the train set: 96, number of used features: 4
[LightGBM] [Info] Start training from score -1.037988
[LightGBM] [Info] Start training from score -1.232144
[LightGBM] [Info] Start training from score -1.037988
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
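The fix is to set two parameters: verbose=-1 suppresses the [LightGBM] Info and Warning messages, and num_threads=2 sets the number of threads explicitly. The right thread count depends on how many cores the test machine's cpu has, so adjust it to your own situation.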
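Rather than hard-coding the thread count, it can be derived from the machine. The sketch below uses only the standard library. The "leave one core free" policy and the params dictionary are hypothetical choices for illustration, and LOKY_MAX_CPU_COUNT is the environment variable that the joblib warning quoted above suggests setting.

```python
import os

# Logical core count as reported by the OS; os.cpu_count() may return None.
logical_cores = os.cpu_count() or 1

# Hypothetical policy: leave one core free, but always use at least one thread.
num_threads = max(1, logical_cores - 1)

# The joblib/loky warning suggests this variable; set it before training starts.
# setdefault() keeps any value that is already present in the environment.
os.environ.setdefault("LOKY_MAX_CPU_COUNT", str(num_threads))

# Keyword arguments in the same form as used in the post.
params = {"verbose": -1, "num_threads": num_threads}
print(params)
```

The dictionary can then be passed along as lgb.LGBMClassifier(**params).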
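Model evaluation
Scoring a model is also a fixed routine and feels very formulaic. This post uses classification_report(), which was covered in a recent post, for the evaluation.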
from sklearn.metrics import classification_report
report = classification_report(y_true, y_pred)
print(report)
Output:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       1.00      0.75      0.86         8
           2       0.80      1.00      0.89         8

    accuracy                           0.92        24
   macro avg       0.93      0.92      0.92        24
weighted avg       0.93      0.92      0.92        24
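To see where these numbers come from, they can be recomputed by hand. The sketch below uses a confusion matrix inferred from the report and from the printed y_pred (8, 6 and 10 predictions for classes 0, 1 and 2). The matrix is a reconstruction and therefore an assumption, not output copied from the run.

```python
# Confusion matrix inferred from the report (rows: true class, columns: predicted class).
# Assumption: reconstructed from the printed numbers, not produced by the original run.
confusion = [
    [8, 0, 0],
    [0, 6, 2],
    [0, 0, 8],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))

rows = []
for k in range(n_classes):
    support = sum(confusion[k])                                 # true samples of class k
    predicted = sum(confusion[i][k] for i in range(n_classes))  # samples predicted as k
    precision = confusion[k][k] / predicted
    recall = confusion[k][k] / support
    f1 = 2 * precision * recall / (precision + recall)
    rows.append((precision, recall, f1, support))
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")

accuracy = correct / total
# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro = [sum(r[i] for r in rows) / n_classes for i in range(3)]
weighted = [sum(r[i] * r[3] for r in rows) / total for i in range(3)]
print(f"accuracy={accuracy:.2f}")
print("macro avg:", [round(v, 2) for v in macro])
print("weighted avg:", [round(v, 2) for v in weighted])
```

Because every class has the same support of 8, the macro and weighted averages coincide here.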
For an explanation of this result, see the article:
Conclusion
Machine learning is abbreviated as ml. Link:
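This blog does not welcome mirroring or scraping of any kind. Please respect the original content and keep the author link when reposting.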