news 2025/7/11 10:28:05

Data Preprocessing Tools

1. Importing the libraries

Three libraries are imported here: numpy, pyplot from matplotlib, and pandas.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2. Loading the data

The data is stored in Data.csv, a CSV file, and is read with the pandas read_csv function. Once it has loaded, head() previews the first 5 rows.

dataset = pd.read_csv('Data.csv')
dataset.head()

	Country	Age	Salary	Purchased
0	France	44.0	72000.0	No
1	Spain	27.0	48000.0	Yes
2	Germany	30.0	54000.0	No
3	Spain	38.0	61000.0	No
4	Germany	40.0	NaN	Yes
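The preview already shows one NaN, so it can be useful to count the missing values per column right after loading. The sketch below is only an illustration: since Data.csv itself is not included here, it assumes an inline copy of the ten rows printed in this article, read through io.StringIO instead of from disk.

```python
import io
import pandas as pd

# Inline stand-in for Data.csv (assumption: the ten rows printed in this article).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN; isna().sum() counts them per column.
missing_per_column = dataset.isna().sum()
print(dataset.shape)
print(missing_per_column)
```

Age and Salary each contain one missing entry, which is what the imputation step later has to fill.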

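The first three columns become X and the last column becomes y. A handy trick is used here: the index -1 selects the last column, and the slice :-1 keeps every column before it.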

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
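The same slicing can be tried on a tiny DataFrame to see what each expression returns. This is a minimal sketch that assumes a hand-typed copy of the first three rows rather than the real file.

```python
import pandas as pd

# Hand-typed copy of the first three rows (assumption, for illustration only).
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values   # all rows, every column except the last
y = dataset.iloc[:, -1].values    # all rows, only the last column

print(X.shape, X.dtype)
print(y.shape)
```

Because X mixes text and numbers, it comes back as an object array, which is why the printed X shows quoted country names next to floats.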

Inspect X; nan marks a missing value.

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

Inspect y.

print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

3. Handling missing values

Missing values are handled with the SimpleImputer class from sklearn.impute.

from sklearn.impute import SimpleImputer

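Here the mean of each column is used as the fill value for its missing entries. Only the numeric columns (Age and Salary, indices 1 and 2) are passed to the imputer.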

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

X[:, 1:3] = imputer.transform(X[:, 1:3])

Inspect X again: the missing values have been filled with the column means.

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
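To see where 38.777... and 63777.777... come from, the fitted imputer can be compared with a plain NaN-aware mean. The sketch below assumes the Age and Salary values printed above, typed in as a small array; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the article, including the two missing entries.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# The learned fill values should equal the per-column means that ignore NaN.
column_means = np.nanmean(numeric, axis=0)
print(imputer.statistics_)
print(filled[6, 0], filled[4, 1])
```

Each mean is taken over the nine known values of its column, so the missing age becomes 349/9 and the missing salary becomes 574000/9.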

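4. Encoding categorical variables

4.1 Encoding X

ColumnTransformer from sklearn.compose, combined with OneHotEncoder, re-encodes the Country column (index 0) as dummy variables, while remainder='passthrough' keeps the other columns unchanged.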

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

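Inspect the re-encoded X. With three countries, a single pass of the encoder yields three dummy columns followed by Age and Salary, five columns in all. The output below has six columns, which is consistent with the encoding cell having been run twice on the same X: its first two columns re-encode the first dummy column.

print(X)

[[0.0 1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 1.0 0.0 0.0 35.0 58000.0]
 [1.0 0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 0.0 48.0 79000.0]
 [1.0 0.0 1.0 0.0 50.0 83000.0]
 [0.0 1.0 0.0 0.0 37.0 67000.0]]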
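The column order of the dummy variables is easier to follow on the Country column alone. The following is a small sketch, assuming a four-row array of country names; it shows that one fit_transform over three distinct countries produces exactly three dummy columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column only, as a 2-D array (one feature, four samples).
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
dummies = encoder.fit_transform(countries).toarray()

# Categories are sorted, which fixes the order of the dummy columns.
print(encoder.categories_[0])
print(dummies)
```

The columns correspond to France, Germany, and Spain in that order, and every row contains a single 1.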

4.2 Encoding y

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

Inspect the encoded result: 'No' has become 0 and 'Yes' has become 1.

print(y)

[0 1 0 0 1 1 0 1 0 1]
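The mapping between labels and integers is stored on the fitted encoder, so it can be inspected and reversed. A minimal sketch, assuming a short hand-typed list of labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_[i] is the original label that was mapped to the integer i.
print(le.classes_)

# inverse_transform turns the integers back into the original labels.
decoded = le.inverse_transform(encoded)
print(decoded)
```

This is handy later on, when model predictions expressed as 0/1 need to be reported as 'No'/'Yes' again.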

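5. Splitting the data into a training set and a test set

80% of the data goes into the training set and the remaining 20% into the test set. Setting random_state makes the split reproducible.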

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

Inspect the training set of X.

print(X_train)

[[1.0 0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 0.0 48.0 79000.0]
 [1.0 0.0 1.0 0.0 50.0 83000.0]
 [0.0 1.0 0.0 0.0 35.0 58000.0]]

Inspect the test set of X.

print(X_test)

[[1.0 0.0 1.0 0.0 30.0 54000.0]
 [0.0 1.0 0.0 0.0 37.0 67000.0]]

Inspect the training set of y.

print(y_train)

[0 1 0 0 1 1 0 1]

Inspect the test set of y.

print(y_test)

[0 1]
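The split sizes and the effect of random_state can be checked on placeholder data. The sketch below assumes a made-up array of ten samples (X_demo and y_demo are illustrative names, not part of the original data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # ten samples, two features
y_demo = np.arange(10)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# The same random_state gives the same shuffling on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

With ten samples and test_size=0.2, eight rows land in the training set and two in the test set, which matches the sizes printed above.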

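6. Feature scaling (standardization)

The columns of X are measured in different units and on different scales, so they need to be standardized before they can be compared with one another or used for modeling. The scaler is fitted on the training set only, and the same fitted scaler is then applied to the test set.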

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

Inspect the training set of X: the data has been standardized. Because the X printed earlier has six columns, the slice 3: also covers the last dummy column (index 3), so that column is standardized along with Age and Salary; with a five-column X, 3: would select only Age and Salary.

print(X_train)

[[1.0 0.0 0.0 1.2909944487358056 -0.19159184384578545 -1.0781259408412425]
 [1.0 0.0 1.0 -0.7745966692414834 -0.014117293757057777 -0.07013167641635372]
 [0.0 1.0 0.0 -0.7745966692414834 0.566708506533324 0.633562432710455]
 [1.0 0.0 0.0 1.2909944487358056 -0.30453019390224867 -0.30786617274297867]
 [1.0 0.0 0.0 1.2909944487358056 -1.9018011447007988 -1.420463615551582]
 [0.0 1.0 0.0 -0.7745966692414834 1.1475343068237058 1.232653363453549]
 [1.0 0.0 1.0 -0.7745966692414834 1.4379472069688968 1.5749910381638885]
 [0.0 1.0 0.0 -0.7745966692414834 -0.7401495441200351 -0.5646194287757332]]

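The test set of X has been standardized in the same way, using the mean and standard deviation learned from the training set.

print(X_test)

[[1.0 0.0 1.0 -0.7745966692414834 -1.4661817944830124 -0.9069571034860727]
 [0.0 1.0 0.0 -0.7745966692414834 -0.44973664397484414 0.2056403393225306]]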
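The standardized values can be reproduced by hand as z = (x - mean) / std, with both statistics taken from the training rows. The sketch below assumes only the Age and Salary columns of the training and test rows printed earlier, typed in as arrays, and compares StandardScaler with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the eight training rows and the two test rows.
train = np.array([
    [349 / 9, 52000.0], [40.0, 574000 / 9], [44.0, 72000.0], [38.0, 61000.0],
    [27.0, 48000.0], [48.0, 79000.0], [50.0, 83000.0], [35.0, 58000.0],
])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)   # learns mean and std from train only
test_scaled = sc.transform(test)         # reuses the training statistics

# Manual z-score using the population standard deviation (ddof=0).
mean, std = train.mean(axis=0), train.std(axis=0)
manual_test = (test - mean) / std
print(train_scaled[0])
print(test_scaled[0])
```

Applying transform (not fit_transform) to the test set keeps information from the test rows out of the scaling statistics.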

This completes the data preprocessing.
