Chapter 7 Unsupervised Methods

7.1 Clustering

7.1.1 K-Means Clustering:

K-means clustering is a widely used clustering method that partitions the data into a predefined number of clusters, assigning each data point to the cluster whose centroid is nearest.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

# Create the K-means clustering model
kmeans = KMeans(n_clusters=4)

# Fit the model
kmeans.fit(X)
## KMeans(n_clusters=4)

# Retrieve the cluster labels
labels = kmeans.labels_

print("Cluster labels:")
## Cluster labels:
print(labels)
## [3 2 0 2 0 1 3 0 2 2 3 2 0 2 1 0 0 1 3 3 1 1 0 3 3 3 1 0 3 0 2 2 0 2 2 2 2
##  2 3 1 0 3 2 0 3 3 2 3 2 1 3 1 2 1 0 3 2 3 2 1 2 0 2 3 3 3 2 1 2 3 0 0 2 3
##  3 2 3 0 1 2 1 0 1 1 2 0 1 0 2 2 0 1 2 3 3 0 1 1 2 3 2 1 2 1 0 1 1 0 2 0 3
##  3 1 2 1 1 2 1 1 0 0 1 0 1 1 1 1 3 1 3 2 3 3 1 2 3 3 2 3 2 2 3 0 3 0 3 2 0
##  2 2 2 0 0 0 1 3 2 3 1 0 2 0 0 1 0 3 1 0 1 0 0 2 1 0 0 2 1 1 0 3 1 0 3 3 0
##  0 0 0 1 2 3 3 0 0 3 3 3 0 3 2 0 3 1 3 0 3 3 2 0 2 0 3 0 0 2 3 3 1 1 0 2 1
##  1 3 1 3 3 2 2 0 0 2 0 1 3 0 1 3 2 3 1 0 1 2 2 2 2 3 3 0 0 3 1 0 3 3 0 1 1
##  2 0 0 3 1 2 3 0 2 0 1 1 3 3 1 1 1 1 0 2 2 1 1 0 1 1 1 2 0 2 0 1 1 2 2 2 1
##  1 0 2 3]
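
Beyond the labels, the fitted model also exposes the learned centroids and the inertia (the within-cluster sum of squared distances), which is commonly used to choose the number of clusters via the elbow method. A minimal sketch reusing the X generated above (exact values vary between runs):

# Inspect the fitted model: centroid coordinates and within-cluster SSE
print(kmeans.cluster_centers_)
print(kmeans.inertia_)

# Elbow method: watch where the inertia stops dropping sharply as k grows
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)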

7.1.2 Hierarchical Clustering:

Hierarchical clustering builds a tree-like structure that divides the data into clusters level by level. Different linkage criteria (such as single, complete, or average linkage) can be used to decide which clusters to merge; see the comparison sketch after the example below.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

# Create the agglomerative (hierarchical) clustering model
agg_clustering = AgglomerativeClustering(n_clusters=4)

# Fit the model
agg_clustering.fit(X)
## AgglomerativeClustering(n_clusters=4)

# Retrieve the cluster labels
labels = agg_clustering.labels_

print("Cluster labels:")
## Cluster labels:
print(labels)
## [0 1 0 1 0 2 3 0 1 1 3 1 0 1 2 0 0 2 0 3 2 2 0 3 3 0 2 0 3 0 1 1 0 1 1 1 1
##  1 3 2 0 3 1 0 3 0 1 3 1 2 3 2 1 2 0 0 1 3 1 2 1 0 1 0 3 0 1 2 1 3 0 0 1 3
##  3 1 3 0 2 1 2 0 2 2 1 0 2 0 1 1 0 2 1 3 0 0 2 2 0 3 1 2 1 2 0 2 2 0 1 0 3
##  3 2 1 2 2 1 2 0 0 0 2 0 2 0 2 2 3 2 3 1 3 3 2 1 3 3 1 0 1 1 0 0 0 0 3 1 0
##  1 1 1 0 0 0 2 0 1 3 2 0 1 0 0 0 0 3 3 0 2 0 0 1 2 0 0 1 2 2 0 0 2 0 3 3 0
##  0 0 0 2 1 0 0 0 0 3 3 3 0 3 1 0 0 2 3 0 0 3 1 0 1 0 3 0 0 1 0 0 2 2 0 1 2
##  2 3 2 3 0 1 1 0 0 1 0 2 0 0 2 0 1 3 2 0 2 1 1 1 1 3 0 0 0 3 2 0 3 0 0 2 2
##  1 0 0 3 2 1 0 0 1 0 2 2 3 3 2 2 2 2 0 1 1 2 2 0 2 2 2 1 0 1 0 2 2 1 1 1 2
##  2 0 1 3]
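
The linkage criteria mentioned above are selected through the linkage parameter; scikit-learn supports 'ward' (the default), 'complete', 'average', and 'single'. A minimal sketch comparing them on the same data:

# Compare linkage criteria on the same data (reusing X from above)
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=4, linkage=linkage).fit(X)
    print(linkage, model.labels_[:10])  # first ten labels under each criterion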

7.1.3 DBSCAN (Density-Based Clustering):

DBSCAN is a density-based clustering method that can automatically discover clusters of irregular shape and is robust to noisy data; points that fall in no dense region are labeled as noise (-1).

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate half-moon shaped data
X, _ = make_moons(n_samples=200, noise=0.05)

# Create the DBSCAN model
dbscan = DBSCAN(eps=0.3, min_samples=5)

# Fit the model
dbscan.fit(X)
## DBSCAN(eps=0.3)

# Retrieve the cluster labels
labels = dbscan.labels_

print("Cluster labels:")
## Cluster labels:
print(labels)
## [0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
##  1 1 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0
##  1 1 1 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1
##  0 1 1 0 0 1 1 1 1 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1 0
##  0 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 1 1
##  1 0 0 1 1 0 0 1 0 0 0 1 1 1 1]
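
Since DBSCAN marks noise points with the label -1, the number of clusters it finds is not fixed in advance. A minimal sketch for counting the clusters and noise points in the result above:

import numpy as np

# Label -1 denotes noise, so exclude it when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)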

7.1.4 Spectral Clustering:

Spectral clustering is a graph-based clustering method. It represents the data points as nodes of a graph and partitions them into clusters using the eigenvectors of the graph Laplacian.

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

# Create the spectral clustering model
spectral = SpectralClustering(n_clusters=4, eigen_solver='arpack', affinity="nearest_neighbors")

# Fit the model
spectral.fit(X)
## SpectralClustering(affinity='nearest_neighbors', eigen_solver='arpack',
##                    n_clusters=4)

# Retrieve the cluster labels
labels = spectral.labels_

print("Cluster labels:")
## Cluster labels:
print(labels)
## [2 3 2 3 2 1 0 2 3 3 0 3 2 3 1 2 2 1 0 0 1 1 2 0 0 0 1 2 0 2 3 3 2 3 3 3 3
##  3 0 1 2 0 3 2 0 0 3 0 3 1 0 1 3 1 2 0 3 0 3 1 3 2 3 0 0 0 3 1 3 0 2 2 3 0
##  0 3 0 2 1 3 1 2 1 1 3 2 1 2 3 3 2 1 3 0 0 2 1 1 3 0 3 1 3 1 2 1 1 2 3 2 0
##  0 1 3 1 1 3 1 2 2 2 1 2 1 1 1 1 0 1 0 3 0 0 1 3 0 0 3 2 3 3 0 2 0 2 0 3 2
##  3 3 3 2 2 2 1 0 3 0 1 2 3 2 2 1 2 0 1 2 1 2 2 3 1 2 2 3 1 1 2 0 1 2 0 0 2
##  2 2 2 1 3 2 0 2 2 0 0 0 2 0 3 2 2 1 0 2 0 0 3 2 3 2 0 2 2 3 0 0 1 1 2 3 1
##  1 0 1 0 2 3 3 2 2 3 2 1 0 2 1 0 3 0 1 2 1 3 3 3 3 0 0 2 2 0 1 2 0 0 2 1 1
##  3 2 2 0 1 3 0 2 3 2 1 1 0 0 1 1 1 1 2 3 3 1 1 2 1 1 1 3 2 3 2 1 1 3 3 3 1
##  1 2 3 0]
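
Because make_blobs also returns the ground-truth assignments, the clustering can be scored against them, for example with the adjusted Rand index. A minimal sketch that regenerates the data so the true labels are kept:

from sklearn.metrics import adjusted_rand_score

# Regenerate the data, this time keeping the ground-truth labels
X, y_true = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
labels = SpectralClustering(n_clusters=4, eigen_solver='arpack',
                            affinity="nearest_neighbors").fit_predict(X)

# 1.0 means a perfect match with the true grouping (label permutations ignored)
print("ARI:", adjusted_rand_score(y_true, labels))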

7.2 Dimensionality Reduction

Scikit-Learn (Sklearn) provides a variety of dimensionality reduction methods, including principal component analysis (PCA), linear discriminant analysis (LDA), t-distributed stochastic neighbor embedding (t-SNE), and multidimensional scaling (MDS).
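
t-SNE is listed above but has no subsection of its own, so here is a minimal sketch on the iris data, assuming scikit-learn's TSNE with default settings apart from the target dimensionality:

from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
X = iris.data

# Non-linear, stochastic embedding of the 4-dimensional data into 2 dimensions
tsne = TSNE(n_components=2, random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)  # (150, 2)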

7.2.1 Principal Component Analysis (PCA):

PCA is a widely used linear dimensionality reduction method. It applies a linear transformation that maps high-dimensional data into a lower-dimensional space while retaining as much of the variance as possible.

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
X = iris.data

# Create the PCA model, reducing the data to 2 dimensions
pca = PCA(n_components=2)

# Fit the model and transform the data
X_reduced = pca.fit_transform(X)

print("Reduced data:")
## Reduced data:
print(X_reduced)
## [[-2.68412563  0.31939725]
##  [-2.71414169 -0.17700123]
##  [-2.88899057 -0.14494943]
##  [-2.74534286 -0.31829898]
##  [-2.72871654  0.32675451]
##  [-2.28085963  0.74133045]
##  [-2.82053775 -0.08946138]
##  [-2.62614497  0.16338496]
##  [-2.88638273 -0.57831175]
##  [-2.6727558  -0.11377425]
##  [-2.50694709  0.6450689 ]
##  [-2.61275523  0.01472994]
##  [-2.78610927 -0.235112  ]
##  [-3.22380374 -0.51139459]
##  [-2.64475039  1.17876464]
##  [-2.38603903  1.33806233]
##  [-2.62352788  0.81067951]
##  [-2.64829671  0.31184914]
##  [-2.19982032  0.87283904]
##  [-2.5879864   0.51356031]
##  [-2.31025622  0.39134594]
##  [-2.54370523  0.43299606]
##  [-3.21593942  0.13346807]
##  [-2.30273318  0.09870885]
##  [-2.35575405 -0.03728186]
##  [-2.50666891 -0.14601688]
##  [-2.46882007  0.13095149]
##  [-2.56231991  0.36771886]
##  [-2.63953472  0.31203998]
##  [-2.63198939 -0.19696122]
##  [-2.58739848 -0.20431849]
##  [-2.4099325   0.41092426]
##  [-2.64886233  0.81336382]
##  [-2.59873675  1.09314576]
##  [-2.63692688 -0.12132235]
##  [-2.86624165  0.06936447]
##  [-2.62523805  0.59937002]
##  [-2.80068412  0.26864374]
##  [-2.98050204 -0.48795834]
##  [-2.59000631  0.22904384]
##  [-2.77010243  0.26352753]
##  [-2.84936871 -0.94096057]
##  [-2.99740655 -0.34192606]
##  [-2.40561449  0.18887143]
##  [-2.20948924  0.43666314]
##  [-2.71445143 -0.2502082 ]
##  [-2.53814826  0.50377114]
##  [-2.83946217 -0.22794557]
##  [-2.54308575  0.57941002]
##  [-2.70335978  0.10770608]
##  [ 1.28482569  0.68516047]
##  [ 0.93248853  0.31833364]
##  [ 1.46430232  0.50426282]
##  [ 0.18331772 -0.82795901]
##  [ 1.08810326  0.07459068]
##  [ 0.64166908 -0.41824687]
##  [ 1.09506066  0.28346827]
##  [-0.74912267 -1.00489096]
##  [ 1.04413183  0.2283619 ]
##  [-0.0087454  -0.72308191]
##  [-0.50784088 -1.26597119]
##  [ 0.51169856 -0.10398124]
##  [ 0.26497651 -0.55003646]
##  [ 0.98493451 -0.12481785]
##  [-0.17392537 -0.25485421]
##  [ 0.92786078  0.46717949]
##  [ 0.66028376 -0.35296967]
##  [ 0.23610499 -0.33361077]
##  [ 0.94473373 -0.54314555]
##  [ 0.04522698 -0.58383438]
##  [ 1.11628318 -0.08461685]
##  [ 0.35788842 -0.06892503]
##  [ 1.29818388 -0.32778731]
##  [ 0.92172892 -0.18273779]
##  [ 0.71485333  0.14905594]
##  [ 0.90017437  0.32850447]
##  [ 1.33202444  0.24444088]
##  [ 1.55780216  0.26749545]
##  [ 0.81329065 -0.1633503 ]
##  [-0.30558378 -0.36826219]
##  [-0.06812649 -0.70517213]
##  [-0.18962247 -0.68028676]
##  [ 0.13642871 -0.31403244]
##  [ 1.38002644 -0.42095429]
##  [ 0.58800644 -0.48428742]
##  [ 0.80685831  0.19418231]
##  [ 1.22069088  0.40761959]
##  [ 0.81509524 -0.37203706]
##  [ 0.24595768 -0.2685244 ]
##  [ 0.16641322 -0.68192672]
##  [ 0.46480029 -0.67071154]
##  [ 0.8908152  -0.03446444]
##  [ 0.23054802 -0.40438585]
##  [-0.70453176 -1.01224823]
##  [ 0.35698149 -0.50491009]
##  [ 0.33193448 -0.21265468]
##  [ 0.37621565 -0.29321893]
##  [ 0.64257601  0.01773819]
##  [-0.90646986 -0.75609337]
##  [ 0.29900084 -0.34889781]
##  [ 2.53119273 -0.00984911]
##  [ 1.41523588 -0.57491635]
##  [ 2.61667602  0.34390315]
##  [ 1.97153105 -0.1797279 ]
##  [ 2.35000592 -0.04026095]
##  [ 3.39703874  0.55083667]
##  [ 0.52123224 -1.19275873]
##  [ 2.93258707  0.3555    ]
##  [ 2.32122882 -0.2438315 ]
##  [ 2.91675097  0.78279195]
##  [ 1.66177415  0.24222841]
##  [ 1.80340195 -0.21563762]
##  [ 2.1655918   0.21627559]
##  [ 1.34616358 -0.77681835]
##  [ 1.58592822 -0.53964071]
##  [ 1.90445637  0.11925069]
##  [ 1.94968906  0.04194326]
##  [ 3.48705536  1.17573933]
##  [ 3.79564542  0.25732297]
##  [ 1.30079171 -0.76114964]
##  [ 2.42781791  0.37819601]
##  [ 1.19900111 -0.60609153]
##  [ 3.49992004  0.4606741 ]
##  [ 1.38876613 -0.20439933]
##  [ 2.2754305   0.33499061]
##  [ 2.61409047  0.56090136]
##  [ 1.25850816 -0.17970479]
##  [ 1.29113206 -0.11666865]
##  [ 2.12360872 -0.20972948]
##  [ 2.38800302  0.4646398 ]
##  [ 2.84167278  0.37526917]
##  [ 3.23067366  1.37416509]
##  [ 2.15943764 -0.21727758]
##  [ 1.44416124 -0.14341341]
##  [ 1.78129481 -0.49990168]
##  [ 3.07649993  0.68808568]
##  [ 2.14424331  0.1400642 ]
##  [ 1.90509815  0.04930053]
##  [ 1.16932634 -0.16499026]
##  [ 2.10761114  0.37228787]
##  [ 2.31415471  0.18365128]
##  [ 1.9222678   0.40920347]
##  [ 1.41523588 -0.57491635]
##  [ 2.56301338  0.2778626 ]
##  [ 2.41874618  0.3047982 ]
##  [ 1.94410979  0.1875323 ]
##  [ 1.52716661 -0.37531698]
##  [ 1.76434572  0.07885885]
##  [ 1.90094161  0.11662796]
##  [ 1.39018886 -0.28266094]]
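
How much variance the two components retain can be read off the fitted model. A minimal sketch:

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)        # for iris, roughly [0.92, 0.05]
print(pca.explained_variance_ratio_.sum())  # variance retained by the 2-D projection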

7.2.2 Linear Discriminant Analysis (LDA):

LDA is a supervised dimensionality reduction method. It reduces the dimensionality while maximizing the separation between classes, so it is typically used for classification problems.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create the LDA model, reducing the data to 2 dimensions
lda = LinearDiscriminantAnalysis(n_components=2)

# Fit the model (LDA is supervised, so the labels y are required) and transform the data
X_reduced = lda.fit_transform(X, y)

print("Reduced data:")
## Reduced data:
print(X_reduced)
## [[ 8.06179978e+00  3.00420621e-01]
##  [ 7.12868772e+00 -7.86660426e-01]
##  [ 7.48982797e+00 -2.65384488e-01]
##  [ 6.81320057e+00 -6.70631068e-01]
##  [ 8.13230933e+00  5.14462530e-01]
##  [ 7.70194674e+00  1.46172097e+00]
##  [ 7.21261762e+00  3.55836209e-01]
##  [ 7.60529355e+00 -1.16338380e-02]
##  [ 6.56055159e+00 -1.01516362e+00]
##  [ 7.34305989e+00 -9.47319209e-01]
##  [ 8.39738652e+00  6.47363392e-01]
##  [ 7.21929685e+00 -1.09646389e-01]
##  [ 7.32679599e+00 -1.07298943e+00]
##  [ 7.57247066e+00 -8.05464137e-01]
##  [ 9.84984300e+00  1.58593698e+00]
##  [ 9.15823890e+00  2.73759647e+00]
##  [ 8.58243141e+00  1.83448945e+00]
##  [ 7.78075375e+00  5.84339407e-01]
##  [ 8.07835876e+00  9.68580703e-01]
##  [ 8.02097451e+00  1.14050366e+00]
##  [ 7.49680227e+00 -1.88377220e-01]
##  [ 7.58648117e+00  1.20797032e+00]
##  [ 8.68104293e+00  8.77590154e-01]
##  [ 6.25140358e+00  4.39696367e-01]
##  [ 6.55893336e+00 -3.89222752e-01]
##  [ 6.77138315e+00 -9.70634453e-01]
##  [ 6.82308032e+00  4.63011612e-01]
##  [ 7.92461638e+00  2.09638715e-01]
##  [ 7.99129024e+00  8.63787128e-02]
##  [ 6.82946447e+00 -5.44960851e-01]
##  [ 6.75895493e+00 -7.59002759e-01]
##  [ 7.37495254e+00  5.65844592e-01]
##  [ 9.12634625e+00  1.22443267e+00]
##  [ 9.46768199e+00  1.82522635e+00]
##  [ 7.06201386e+00 -6.63400423e-01]
##  [ 7.95876243e+00 -1.64961722e-01]
##  [ 8.61367201e+00  4.03253602e-01]
##  [ 8.33041759e+00  2.28133530e-01]
##  [ 6.93412007e+00 -7.05519379e-01]
##  [ 7.68823131e+00 -9.22362309e-03]
##  [ 7.91793715e+00  6.75121313e-01]
##  [ 5.66188065e+00 -1.93435524e+00]
##  [ 7.24101468e+00 -2.72615132e-01]
##  [ 6.41443556e+00  1.24730131e+00]
##  [ 6.85944381e+00  1.05165396e+00]
##  [ 6.76470393e+00 -5.05151855e-01]
##  [ 8.08189937e+00  7.63392750e-01]
##  [ 7.18676904e+00 -3.60986823e-01]
##  [ 8.31444876e+00  6.44953177e-01]
##  [ 7.67196741e+00 -1.34893840e-01]
##  [-1.45927545e+00  2.85437643e-02]
##  [-1.79770574e+00  4.84385502e-01]
##  [-2.41694888e+00 -9.27840307e-02]
##  [-2.26247349e+00 -1.58725251e+00]
##  [-2.54867836e+00 -4.72204898e-01]
##  [-2.42996725e+00 -9.66132066e-01]
##  [-2.44848456e+00  7.95961954e-01]
##  [-2.22666513e-01 -1.58467318e+00]
##  [-1.75020123e+00 -8.21180130e-01]
##  [-1.95842242e+00 -3.51563753e-01]
##  [-1.19376031e+00 -2.63445570e+00]
##  [-1.85892567e+00  3.19006544e-01]
##  [-1.15809388e+00 -2.64340991e+00]
##  [-2.66605725e+00 -6.42504540e-01]
##  [-3.78367218e-01  8.66389312e-02]
##  [-1.20117255e+00  8.44373592e-02]
##  [-2.76810246e+00  3.21995363e-02]
##  [-7.76854039e-01 -1.65916185e+00]
##  [-3.49805433e+00 -1.68495616e+00]
##  [-1.09042788e+00 -1.62658350e+00]
##  [-3.71589615e+00  1.04451442e+00]
##  [-9.97610366e-01 -4.90530602e-01]
##  [-3.83525931e+00 -1.40595806e+00]
##  [-2.25741249e+00 -1.42679423e+00]
##  [-1.25571326e+00 -5.46424197e-01]
##  [-1.43755762e+00 -1.34424979e-01]
##  [-2.45906137e+00 -9.35277280e-01]
##  [-3.51848495e+00  1.60588866e-01]
##  [-2.58979871e+00 -1.74611728e-01]
##  [ 3.07487884e-01 -1.31887146e+00]
##  [-1.10669179e+00 -1.75225371e+00]
##  [-6.05524589e-01 -1.94298038e+00]
##  [-8.98703769e-01 -9.04940034e-01]
##  [-4.49846635e+00 -8.82749915e-01]
##  [-2.93397799e+00  2.73791065e-02]
##  [-2.10360821e+00  1.19156767e+00]
##  [-2.14258208e+00  8.87797815e-02]
##  [-2.47945603e+00 -1.94073927e+00]
##  [-1.32552574e+00 -1.62869550e-01]
##  [-1.95557887e+00 -1.15434826e+00]
##  [-2.40157020e+00 -1.59458341e+00]
##  [-2.29248878e+00 -3.32860296e-01]
##  [-1.27227224e+00 -1.21458428e+00]
##  [-2.93176055e-01 -1.79871509e+00]
##  [-2.00598883e+00 -9.05418042e-01]
##  [-1.18166311e+00 -5.37570242e-01]
##  [-1.61615645e+00 -4.70103580e-01]
##  [-1.42158879e+00 -5.51244626e-01]
##  [ 4.75973788e-01 -7.99905482e-01]
##  [-1.54948259e+00 -5.93363582e-01]
##  [-7.83947399e+00  2.13973345e+00]
##  [-5.50747997e+00 -3.58139892e-02]
##  [-6.29200850e+00  4.67175777e-01]
##  [-5.60545633e+00 -3.40738058e-01]
##  [-6.85055995e+00  8.29825394e-01]
##  [-7.41816784e+00 -1.73117995e-01]
##  [-4.67799541e+00 -4.99095015e-01]
##  [-6.31692685e+00 -9.68980756e-01]
##  [-6.32773684e+00 -1.38328993e+00]
##  [-6.85281335e+00  2.71758963e+00]
##  [-4.44072512e+00  1.34723692e+00]
##  [-5.45009572e+00 -2.07736942e-01]
##  [-5.66033713e+00  8.32713617e-01]
##  [-5.95823722e+00 -9.40175447e-02]
##  [-6.75926282e+00  1.60023206e+00]
##  [-5.80704331e+00  2.01019882e+00]
##  [-5.06601233e+00 -2.62733839e-02]
##  [-6.60881882e+00  1.75163587e+00]
##  [-9.17147486e+00 -7.48255067e-01]
##  [-4.76453569e+00 -2.15573720e+00]
##  [-6.27283915e+00  1.64948141e+00]
##  [-5.36071189e+00  6.46120732e-01]
##  [-7.58119982e+00 -9.80722934e-01]
##  [-4.37150279e+00 -1.21297458e-01]
##  [-5.72317531e+00  1.29327553e+00]
##  [-5.27915920e+00 -4.24582377e-02]
##  [-4.08087208e+00  1.85936572e-01]
##  [-4.07703640e+00  5.23238483e-01]
##  [-6.51910397e+00  2.96976389e-01]
##  [-4.58371942e+00 -8.56815813e-01]
##  [-6.22824009e+00 -7.12719638e-01]
##  [-5.22048773e+00  1.46819509e+00]
##  [-6.80015000e+00  5.80895175e-01]
##  [-3.81515972e+00 -9.42985932e-01]
##  [-5.10748966e+00 -2.13059000e+00]
##  [-6.79671631e+00  8.63090395e-01]
##  [-6.52449599e+00  2.44503527e+00]
##  [-4.99550279e+00  1.87768525e-01]
##  [-3.93985300e+00  6.14020389e-01]
##  [-5.20383090e+00  1.14476808e+00]
##  [-6.65308685e+00  1.80531976e+00]
##  [-5.10555946e+00  1.99218201e+00]
##  [-5.50747997e+00 -3.58139892e-02]
##  [-6.79601924e+00  1.46068695e+00]
##  [-6.84735943e+00  2.42895067e+00]
##  [-5.64500346e+00  1.67771734e+00]
##  [-5.17956460e+00 -3.63475041e-01]
##  [-4.96774090e+00  8.21140550e-01]
##  [-5.88614539e+00  2.34509051e+00]
##  [-4.68315426e+00  3.32033811e-01]]
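
Unlike PCA, a fitted LDA model is also a classifier, so the same object can predict class labels directly. A minimal sketch:

# LDA doubles as a classifier: predict classes and check training accuracy
y_pred = lda.predict(X)
print("training accuracy:", (y_pred == y).mean())

# Fraction of between-class variance explained by each discriminant axis
print(lda.explained_variance_ratio_)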

7.2.3 Multidimensional Scaling (MDS):

MDS is a distance-based dimensionality reduction method that tries to preserve the pairwise distances between data points in the low-dimensional space.

from sklearn.manifold import MDS
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
X = iris.data

# Create the MDS model, reducing the data to 2 dimensions
mds = MDS(n_components=2)

# Fit the model and transform the data
X_reduced = mds.fit_transform(X)

print("Reduced data:")
## Reduced data:
print(X_reduced)
## [[ 2.21493806  1.56217939]
##  [ 2.49838014  1.10287551]
##  [ 2.61874174  1.25449374]
##  [ 2.58278832  1.00804328]
##  [ 2.24596805  1.60849812]
##  [ 1.63420577  1.76735088]
##  [ 2.54589824  1.29123899]
##  [ 2.2380282   1.39501926]
##  [ 2.82250689  0.86264598]
##  [ 2.42293458  1.15205048]
##  [ 1.88361653  1.78104739]
##  [ 2.29587818  1.26172236]
##  [ 2.587974    1.08660399]
##  [ 3.08692761  1.09068784]
##  [ 1.74415403  2.33552187]
##  [ 1.43597459  2.35197647]
##  [ 1.89619249  2.00952465]
##  [ 2.18325475  1.54578571]
##  [ 1.49952171  1.83692885]
##  [ 2.0236492   1.72292355]
##  [ 1.83008783  1.45398861]
##  [ 2.0193518   1.62705907]
##  [ 2.7776783   1.69103698]
##  [ 1.97229094  1.18967335]
##  [ 2.1096466   1.04198191]
##  [ 2.30308522  1.01131513]
##  [ 2.1067539   1.29884887]
##  [ 2.08496426  1.54241903]
##  [ 2.18042156  1.5306755 ]
##  [ 2.41872618  1.06684872]
##  [ 2.38153582  1.03956875]
##  [ 1.90041588  1.53691655]
##  [ 1.98089043  2.00851255]
##  [ 1.75264661  2.22452415]
##  [ 2.38751445  1.13720282]
##  [ 2.50861915  1.43506322]
##  [ 1.98960887  1.85738546]
##  [ 2.34409928  1.5993994 ]
##  [ 2.86521364  0.98757605]
##  [ 2.17670458  1.4323303 ]
##  [ 2.31650207  1.56458354]
##  [ 3.01272842  0.44001755]
##  [ 2.82491888  1.12296046]
##  [ 2.00620076  1.33295392]
##  [ 1.67607979  1.51049146]
##  [ 2.52777042  1.04146628]
##  [ 1.97332751  1.69789467]
##  [ 2.62059539  1.15687624]
##  [ 1.95152672  1.73824026]
##  [ 2.33489203  1.38165898]
##  [-1.49589323  0.10832104]
##  [-0.97464306 -0.15167731]
##  [-1.56492402 -0.13793114]
##  [ 0.25895092 -0.79924461]
##  [-1.03957724 -0.32218333]
##  [-0.3520002  -0.66821403]
##  [-1.10081448 -0.29937134]
##  [ 1.14685902 -0.51259291]
##  [-1.09280496 -0.1160219 ]
##  [ 0.40038967 -0.69288939]
##  [ 1.07718996 -0.87745191]
##  [-0.39474224 -0.33152812]
##  [ 0.20241771 -0.86082334]
##  [-0.80732185 -0.54887631]
##  [ 0.26198891 -0.08207754]
##  [-1.07492667  0.07261712]
##  [-0.36477337 -0.73650175]
##  [-0.04447532 -0.33278004]
##  [-0.47000889 -1.07201094]
##  [ 0.24993886 -0.49607067]
##  [-0.85889808 -0.83301636]
##  [-0.31074303 -0.12452513]
##  [-1.05471024 -0.71820737]
##  [-0.72599856 -0.5367    ]
##  [-0.73569458 -0.07883377]
##  [-0.9909382  -0.01707652]
##  [-1.38748727 -0.15627628]
##  [-1.51856419 -0.43577495]
##  [-0.62710623 -0.53515573]
##  [ 0.4588692  -0.14784975]
##  [ 0.41249826 -0.55485446]
##  [ 0.51630368 -0.49170472]
##  [ 0.03056565 -0.28743678]
##  [-1.00017658 -1.02676465]
##  [-0.2049495  -0.90202194]
##  [-0.81699949 -0.16675016]
##  [-1.29181916 -0.15010509]
##  [-0.65600032 -0.32646214]
##  [-0.0949578  -0.3225071 ]
##  [ 0.18225328 -0.64020644]
##  [-0.06769147 -0.81639118]
##  [-0.76203694 -0.44987582]
##  [ 0.00497144 -0.43077573]
##  [ 1.11221625 -0.54373983]
##  [-0.07336693 -0.58762878]
##  [-0.18989463 -0.32190738]
##  [-0.18882237 -0.42887293]
##  [-0.58754584 -0.21528147]
##  [ 1.17631482 -0.23131518]
##  [-0.0972932  -0.42960566]
##  [-2.1237449  -1.64665366]
##  [-0.94880662 -1.24080969]
##  [-2.49811599 -0.86813504]
##  [-1.62726082 -1.14769922]
##  [-2.04220904 -1.21722256]
##  [-3.31159784 -1.05393405]
##  [ 0.17755187 -1.44020491]
##  [-2.85742976 -0.9597933 ]
##  [-1.94938728 -1.37533653]
##  [-2.96943328 -0.72624427]
##  [-1.58602412 -0.6277047 ]
##  [-1.49097129 -1.00566805]
##  [-2.02388492 -0.81227744]
##  [-0.78931756 -1.3940889 ]
##  [-1.03964147 -1.55094484]
##  [-1.72856464 -0.99398756]
##  [-1.73968111 -0.879427  ]
##  [-3.6653959  -0.52750496]
##  [-3.55551349 -1.50100433]
##  [-0.70361443 -1.36419053]
##  [-2.30496422 -0.91677447]
##  [-0.70598124 -1.27253819]
##  [-3.40985364 -1.16554119]
##  [-1.15409045 -0.78085312]
##  [-2.15055346 -0.89441774]
##  [-2.61972942 -0.60680272]
##  [-1.03020632 -0.74173423]
##  [-1.0568331  -0.79844499]
##  [-1.77504469 -1.19399926]
##  [-2.43433081 -0.43176833]
##  [-2.80866661 -0.74995722]
##  [-3.51754372 -0.24123269]
##  [-1.80294283 -1.23672919]
##  [-1.23485318 -0.7182038 ]
##  [-1.25273004 -1.48643705]
##  [-3.11705297 -0.67394425]
##  [-1.82136834 -1.45391363]
##  [-1.69033354 -0.91073282]
##  [-0.91865621 -0.79544208]
##  [-2.04607638 -0.66006853]
##  [-2.12554761 -1.04965446]
##  [-1.96560986 -0.44065068]
##  [-0.94880662 -1.24080969]
##  [-2.36411804 -1.10360043]
##  [-2.24445629 -1.16883995]
##  [-1.86718566 -0.68961676]
##  [-1.18745167 -1.02729761]
##  [-1.59770886 -0.77371442]
##  [-1.59959451 -1.35354741]
##  [-1.04412492 -1.05155794]]
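
How faithfully the embedding preserves the original pairwise distances is summarized by the stress value on the fitted model; lower is better. Note that MDS is stochastic, so passing random_state to MDS makes runs reproducible. A minimal sketch:

# Final stress: residual mismatch between the original pairwise
# distances and those in the 2-D configuration (lower is better)
print(mds.stress_)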

These examples walk through the main dimensionality reduction methods in Sklearn, including their parameter settings. Different methods suit different data and problems, so choose the one that fits your needs.

7.3 Outlier Detection

7.3.1 Isolation Forest:

Isolation Forest is a tree-based method that detects outliers by building random trees. It measures how anomalous an observation is by how many random splits are needed to isolate it: outliers are isolated in fewer splits.

from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.3, random_state=0)

# Create the Isolation Forest model
clf = IsolationForest(contamination=0.05, random_state=0)

# Fit the model and predict outliers (1 = inlier, -1 = outlier)
y_pred = clf.fit_predict(X)

print("Outlier predictions:")
## Outlier predictions:
print(y_pred)
## [ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1 -1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1
##   1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1 -1  1 -1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1]
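
Beyond the hard 1/-1 labels, the model exposes continuous anomaly scores through decision_function; negative scores correspond to points flagged as outliers. A minimal sketch:

import numpy as np

# Continuous anomaly scores: the lower, the more anomalous
scores = clf.decision_function(X)
print("most anomalous point:", X[np.argmin(scores)], "score:", scores.min())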

7.3.2 One-Class SVM:

One-Class SVM is a support vector machine method for separating normal points from outliers. It detects outliers by finding a boundary that encloses the majority of the data points.

from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.3, random_state=0)

# Create the One-Class SVM model
clf = OneClassSVM(nu=0.05)

# Fit the model and predict outliers (1 = inlier, -1 = outlier)
y_pred = clf.fit_predict(X)

print("Outlier predictions:")
## Outlier predictions:
print(y_pred)
## [ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1
##   1  1 -1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1
##   1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1
##   1  1  1  1  1  1 -1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1 -1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1]
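
A fitted One-Class SVM can also score previously unseen points. A minimal sketch with two hypothetical query points (coordinates chosen purely for illustration):

import numpy as np

# Score new, unseen points: 1 = looks normal, -1 = flagged as an outlier
new_points = np.array([[0.0, 4.0], [5.0, 5.0]])  # hypothetical coordinates
print(clf.predict(new_points))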

7.3.3 Local Outlier Factor (LOF):

LOF is a method based on local density: it computes an outlier score for each data point by comparing its density with that of its neighboring points.

from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.3, random_state=0)

# Create the LOF model
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

# Fit the model and predict outliers (1 = inlier, -1 = outlier)
y_pred = clf.fit_predict(X)

print("Outlier predictions:")
## Outlier predictions:
print(y_pred)
## [ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1
##   1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1
##  -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1 -1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1
##   1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1
##   1  1  1  1  1 -1  1  1  1  1  1  1]
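
The underlying scores from the fit above are available as negative_outlier_factor_ (values well below -1 indicate outliers). To score unseen data, LOF must instead be constructed with novelty=True before fitting. A minimal sketch (the query point is hypothetical):

# Per-sample scores: the more negative, the more anomalous
print(clf.negative_outlier_factor_[:10])

# Novelty mode: a separate LOF instance that can score unseen points
novelty_clf = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)
print(novelty_clf.predict([[0.0, 4.0]]))  # hypothetical query point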