Data Mining (DM) - Python

The ".NET 開発基盤部会 Wiki" is operated by "Open棟梁Project" and "OSSコンソーシアム .NET開発基盤部会".

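Overview

Open-source software created at the University of Konstanz in Germany.
A platform for analysis, reporting, and integration.
It provides a GUI and can run the entire data analysis workflow.
There is no Japanese version, so the interface is in English.

Details

Installation

Because of the nature of machine learning and deep learning, an interactive environment makes them quicker to learn.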

Basic operations

Startup

Initialization

Installing packages

Data analysis

scikit-learn
```
!pip install scikit-learn
```

For reading Excel files
```
!pip install openpyxl
```

For scatter plots
```
!pip install seaborn
```

For visualizing classification results
```
!pip install mlxtend
```

Teaching-oriented library
```
!pip install mglearn
```

Natural language processing

Natural Language Toolkit
```
!pip install nltk -q
```

Japanese morphological analysis
```
!pip install janome
```

Suppressing warnings

import warnings
warnings.filterwarnings('ignore')

Imports

Basic libraries
NumPy, Pandas, Matplotlib, mglearn

import pandas as pd
import numpy as np
import mglearn
import matplotlib.pyplot as plt
%matplotlib inline

scikit-learn

Preprocessing

from sklearn.preprocessing import StandardScaler            # standardization
from sklearn.model_selection import train_test_split        # data splitting

Models

from sklearn.linear_model import LinearRegression           # linear regression
from sklearn.preprocessing import PolynomialFeatures        # feature transform for polynomial regression
from sklearn.linear_model import Ridge                      # Ridge regression (for polynomial regression)
from sklearn.linear_model import Lasso                      # LASSO regression (for polynomial regression)
from sklearn.linear_model import ElasticNet                 # ElasticNet regression (for polynomial regression)
from sklearn.linear_model import Perceptron                 # simple perceptron linear classifier
from sklearn.linear_model import LogisticRegression         # logistic regression
from sklearn.svm import SVC                                 # support vector machine (SVM) classifier
from sklearn.tree import DecisionTreeClassifier             # decision tree (classification tree)
from sklearn.ensemble import RandomForestClassifier         # random forest (classification)
from sklearn.ensemble import GradientBoostingClassifier     # gradient boosted trees (classification)
from sklearn.decomposition import PCA                       # principal component analysis
from sklearn.cluster import KMeans                          # k-means cluster analysis
from sklearn.feature_extraction.text import CountVectorizer # NLP: bag-of-words vectorization
from sklearn.feature_extraction.text import TfidfTransformer # NLP: TF-IDF weighting of count vectors
from sklearn.decomposition import LatentDirichletAllocation # NLP: LDA topic extraction from vectors

Accuracy evaluation

from sklearn import metrics                                 # model evaluation
from sklearn.metrics import mean_squared_error as mse       # accuracy evaluation (MSE)
from sklearn.metrics import silhouette_samples              # silhouette coefficients
from sklearn.model_selection import cross_val_score         # cross-validation
from sklearn.model_selection import KFold                   # k-fold cross-validation
from sklearn.model_selection import StratifiedKFold         # stratified cross-validation
from sklearn.model_selection import GridSearchCV            # grid search

Dataset generation

from sklearn.datasets import make_regression                # regression dataset
from sklearn.datasets import make_blobs                     # classification dataset

Others

Visualization

import seaborn as sns                                       # matplotlib wrapper
from mlxtend.plotting import plot_decision_regions          # decision-region plotting function
from matplotlib import cm                                   # colormap handling

Miscellaneous

from numpy import linalg as LA # linear algebra library

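(Data acquisition and processing)

(This step is not defined in CRISP-DM.)

Generation
Loading data

Processing
- Slicing
- Joining

Data understanding

Prepare a table (a Pandas DataFrame) of the explanatory variables and the target variable, then work through the following checks.

Compute and check the basic statistics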
```
>>>df.describe()
```

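Correlation
Find the explanatory variables that correlate strongly with the target variable.

Correlation coefficient matrix

With only two columns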
```
df.corr()
```

After dropping unneeded columns

df.drop('row_number_column', axis=1).corr()
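
As a minimal sketch of how the correlation matrix can be used to pick candidates, the following ranks explanatory variables by the absolute value of their correlation with the target. The toy DataFrame and the column names `x1`, `x2`, `x3`, `y` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'y' is the target, the rest are explanatory variables.
df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [5, 3, 4, 1, 2],
    'x3': [2, 2, 3, 2, 3],
    'y':  [2, 4, 6, 8, 10],
})

# Take the 'y' column of the correlation matrix and drop y's correlation with itself.
corr_with_y = df.corr()['y'].drop('y')

# Rank by strength of correlation regardless of sign, strongest first.
ranking = corr_with_y.abs().sort_values(ascending=False)
print(ranking)
```

Sorting on the absolute value keeps strongly negative correlations (here `x2`) near the top, which a plain descending sort would hide.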

Display as a heat map

Scatter-plot matrix
Note that this can hang when there are many columns.

Basic usage
```
>>>sns.pairplot(df)
>>>plt.show()
```

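Colored by category

>>>sns.pairplot(df, hue='category_column')
>>>plt.show()

Computing and checking the missing-value rate

Count missing values
Print the number of missing values in each column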
```
df.isnull().sum()
```
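
The count above can be turned into a rate by averaging the boolean mask instead of summing it. A small sketch on a made-up DataFrame (the columns `a`, `b`, `c` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': [1, 2, 3, 4],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True.
missing_rate = df.isnull().mean()
print(missing_rate)
```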

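Visualization -> Matplotlib, seaborn

Data preparation

Data cleaning

Listwise deletion

Delete rows that contain missing values
```
df.dropna()
```
Delete rows that have a missing value in the specified columns
```
df.dropna(subset=['column_name'])
```
Delete rows that have fewer non-missing values than the given number
```
df.dropna(thresh=n_columns)
```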

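Replace with imputed values

Mean imputation
```
df.fillna(df.mean(numeric_only=True))  # numeric_only avoids errors on non-numeric columns
```
Linear interpolation
```
df.interpolate(method='linear')
```

Imputation with an arbitrary value
・All columns

df.fillna(constant)

・Per column

df.fillna({'column1':constant1, 'column2':constant2, ..., 'column_n':constant_n})

Advanced imputation

Data construction

Standardization and normalization

ss = StandardScaler()
# the argument is a numpy.ndarray
x = ss.fit_transform(x)
y = ss.fit_transform(y)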

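Single-attribute conversion

Category → number

df['column'] = df['column'].map({'category1':number1, 'category2':number2, ..., 'category_n':number_n})

Number → category

df['column'] = df['column'].map({number1:'category1', number2:'category2', ..., number_n:'category_n'})

Data integration

One-hot encoding
- For a Pandas DataFrame, use get_dummies.
- For a NumPy array of integer labels, use to_categorical (a Keras utility, not a NumPy function).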
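
A short illustration of the Pandas route, assuming a made-up DataFrame with one categorical column `color`:

```python
import pandas as pd

# Hypothetical data: 'color' is categorical, 'size' is numeric.
df = pd.DataFrame({'color': ['red', 'green', 'red'], 'size': [1, 2, 3]})

# Expand only the categorical column; one indicator column is created per category.
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)
```

The new columns are named `<column>_<category>` and are appended after the untouched columns.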

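Evaluation

See the code for each model.

How a model is evaluated depends on its algorithm, so it differs from model to model.
How accuracy is evaluated is based on the outputs, so it is model-independent (it differs by the type of target variable).

Deployment

Procedures (modeling)

Simple regression analysis

(Data acquisition and processing)
Pick a dataset that lends itself to simple linear regression.

Data understanding

Data preparation

Splitting the data
Starting from the table (a Pandas DataFrame) of explanatory and target variables,
perform feature selection and engineering (data understanding, data preparation),
then split it again into an explanatory-variable table and a target-variable table and convert them to numpy.ndarray.
```
# Slice the data and convert the type (DataFrame -> ndarray)
x = np.array(df.loc[:, ['column1']])
y = np.array(df.loc[:, ['column2']])
# Check by slicing.
x[:5]
y[:5]
```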

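Execution

Training
```
lr = LinearRegression()
lr.fit(x, y)
```
Note: fit supports multiple variables, so x must be a two-dimensional array.

Inference

new_val = np.array([[value_of_explanatory_variable]])
pred_val = lr.predict(new_val)
print(pred_val)

Note: predict supports multiple variables, so the input must be a two-dimensional array.

Evaluation

Regression line

# y = lr.coef_[0] * x + lr.intercept_
print('coefficient = ', lr.coef_[0]) # coefficient
print('intercept = ', lr.intercept_) # intercept

Coefficient of determination (R^2)

# x_train, x_test, y_train, y_test come from a hold-out split (see the multiple regression section)
print('R^2')
print('train: %.3f' % lr.score(x_train, y_train))
print('test : %.3f' % lr.score(x_test, y_test))

Note: Check that overfitting or similar problems are not occurring.

Add the regression line to a scatter plot

Prepare x values for plotting
You can reuse the training data, or prepare separate values.
```
x_plot = np.arange(lower, upper, step)[:, np.newaxis]
```
Note: predict supports multiple variables, so the input must be a two-dimensional array.

Plot the inference result over the scatter plot

plt.scatter(x, y, color = 'blue')
plt.plot(x_plot, lr.predict(x_plot), color = 'red')
plt.grid()
plt.show()

The correlation coefficient has already been computed; in simple regression its square equals the coefficient of determination (R^2).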
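
The relationship between the correlation coefficient and R^2 can be checked numerically. This is a sketch on synthetic data (the slope 2.0, the noise level and the seed are arbitrary assumptions), scored on the same data the model was fitted on:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.RandomState(0)
x = rng.normal(size=(50, 1))
y = 2.0 * x + rng.normal(scale=0.5, size=(50, 1))

lr = LinearRegression().fit(x, y)

r2 = lr.score(x, y)                          # coefficient of determination
r = np.corrcoef(x[:, 0], y[:, 0])[0, 1]      # Pearson correlation coefficient
print(r2, r ** 2)
```

The equality holds for a single explanatory variable with an intercept, evaluated on the training data; on held-out data the two values generally differ.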

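Multiple regression analysis

(Data acquisition and processing)
For example, the Boston housing price dataset.

Data understanding

Data preparation

Splitting the data
Starting from the table (a Pandas DataFrame) of explanatory and target variables,
perform feature selection and engineering (data understanding, data preparation),
then split it again into an explanatory-variable table and a target-variable table and convert them to numpy.ndarray.
```
# Slice the data and convert the type (DataFrame -> ndarray)
x = np.array(df.loc[:, ['column1', 'column2', ...]])
y = np.array(df.loc[:, ['column3']])
# Check by slicing.
x[:5]
y[:5]
```

Standardizing the data
Here it is done solely to support additional feature engineering based on the partial regression coefficients.
Without standardization, the partial regression coefficients cannot tell you which explanatory variables are important.

Standardization
Standardize both x and y.

ss = StandardScaler()
# the argument is a numpy.ndarray
x = ss.fit_transform(x)
y = ss.fit_transform(y)

Check
・The mean is approximately 0
```
x.mean()
y.mean()
```
・The standard deviation is approximately 1
```
x.std()
y.std()
```

Train/test split

Hold-out method

# test_size = 0.3: split train:test = 7:3.
# random_state = 0: split into the same samples every time.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

Checking the result of the split
It can be checked with the shape attribute of numpy.ndarray.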
```
x_train.shape
y_train.shape
x_test.shape
y_test.shape
```

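Execution

Training
Train on the training data.

lr = LinearRegression()
lr.fit(x_train, y_train)

Inference
i and j are values (scalars) of the explanatory variables.
```
y_pred = lr.predict(np.array([[i, j]]))
```

Evaluation

Regression hyperplane

# y = lr.coef_[0] * x1 + lr.coef_[1] * x2 + ... + lr.intercept_
print('coefficient = ', lr.coef_) # partial regression coefficients
print('intercept = ', lr.intercept_) # constant term

If the data has been standardized, the partial regression coefficients
can serve as a guide for additional feature engineering.
(Before standardization, the weight per unit of each explanatory variable is on a different scale.)

Because suppressor variables exist (explanatory variables that matter for prediction despite a low correlation coefficient),
also train models that include low-correlation explanatory variables,
and adopt the model with the larger adjusted coefficient of determination.

Compute, for example, the ratio of the measured values to the predicted values.
```
y_pred = lr.predict(x_test)
ratio = y_test / y_pred  # measured / predicted
```

Compute the coefficient of determination (R^2) for train and test.
In multiple regression, R^2 is no longer the square of the correlation coefficient between a single explanatory variable and the target.

Coefficient of determination (R^2)

print('R^2')
print('train: %.3f' % lr.score(x_train, y_train))
print('test : %.3f' % lr.score(x_test, y_test))

Note: Check that overfitting or similar problems are not occurring.

Adjusted coefficient of determination (adjusted R^2)
Function definition

def adjusted(score, n_sample, n_explanatory_variables):
    adjusted_score = 1 - (1 - score) * ((n_sample - 1) / (n_sample - n_explanatory_variables - 1))
    return adjusted_score

Running the calculation

print('train: %.3f' % adjusted(lr.score(x_train, y_train), len(y_train), x_train.shape[1]))
print('test : %.3f' % adjusted(lr.score(x_test, y_test), len(y_test), x_test.shape[1]))

Compute RMSE and similar metrics.

print('train: %.3f' % (mse(y_train, lr.predict(x_train)) ** (1/2)))
print('test : %.3f' % (mse(y_test, lr.predict(x_test)) ** (1/2)))

Visualizing the regression hyperplane
- Possible here because there are only three variables: y, x1, x2. Not possible with four or more.
- Obtain w0 to w2 as follows and compute.
```
w0 = lr.intercept_
w1 = lr.coef_[0, 0]
w2 = lr.coef_[0, 1]
y = w0 + w1*x1 + w2*x2
```

Plot the residuals.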

Polynomial regression analysis

<Source data>
As an improvement over the multiple regression analysis,
the Boston housing price dataset or similar is a good choice.

Target variable: y_org
Explanatory variables: x_org
Single explanatory variables extracted from the above:
- x1_org
- x2_org

<Single-variable implementation and comparison>
Use only x1_org (an n-th degree function of the form y = x^n).

Preparation

Train/test split with the hold-out method

x_train_lin, x_test_lin, y_train, y_test = train_test_split(x1_org, y_org, test_size = 0.3, random_state = 0)

Apply the polynomial basis-function transform
(it simply computes and returns the powers of the given variable)
・Quadratic

quad = PolynomialFeatures(degree=2)
x1_quad = quad.fit_transform(x1_org)
x_train_quad, x_test_quad, y_train, y_test = train_test_split(x1_quad, y_org, test_size = 0.3, random_state = 0)

・Cubic

cubic = PolynomialFeatures(degree=3)
x1_cubic = cubic.fit_transform(x1_org)
x_train_cubic, x_test_cubic, y_train, y_test = train_test_split(x1_cubic, y_org, test_size = 0.3, random_state = 0)
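
To make concrete what the transform returns, here is a tiny sketch with two hand-picked input values (2 and 3, chosen only for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples of a single explanatory variable.
x = np.array([[2.0], [3.0]])

cubic = PolynomialFeatures(degree=3)
x_cubic = cubic.fit_transform(x)

# Each row becomes [x^0, x^1, x^2, x^3]; the leading 1 is the bias column.
print(x_cubic)
```

The bias column is why a degree-3 transform of one variable yields four columns while the model still has three explanatory terms.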

Execution
Train in the same way as simple regression

Train model_lin on x_train_lin

model_lin = LinearRegression()
model_lin.fit(x_train_lin, y_train)

Train model_quad on x_train_quad

model_quad = LinearRegression()
model_quad.fit(x_train_quad, y_train)

Train model_cubic on x_train_cubic

model_cubic = LinearRegression()
model_cubic.fit(x_train_cubic, y_train)

Inference
Add the regression equations to the scatter plot.

Regression line with model_lin
Regression curve with model_quad (quadratic)

Regression curve with model_cubic (cubic)

# scatter plot
plt.scatter(x1_org, y_org, color='lightgray', label='data')
# plot the regression equations
x = np.arange(lower, upper, step)[:, np.newaxis] # make it two-dimensional
x_quad = quad.transform(x) # note: do not forget the quadratic feature transform
x_cubic = cubic.transform(x) # note: do not forget the cubic feature transform
plt.plot(x, model_lin.predict(x), color='red', label='linear') # regression line
plt.plot(x, model_quad.predict(x_quad), color='green', label='quad') # regression curve (quadratic)
plt.plot(x, model_cubic.predict(x_cubic), color='blue', label='cubic') # regression curve (cubic)
# show the graph
plt.xlabel('explanatory variable')
plt.ylabel('target variable')
plt.legend(loc = 'upper right')
plt.show()

Evaluation
Compare the adjusted coefficients of determination of model_lin, model_quad, and model_cubic

model_lin

print('train: %.3f' % adjusted(model_lin.score(x_train_lin, y_train), len(y_train), 1))
print('test : %.3f' % adjusted(model_lin.score(x_test_lin, y_test), len(y_test), 1))

model_quad

print('train: %.3f' % adjusted(model_quad.score(x_train_quad, y_train), len(y_train), 2))
print('test : %.3f' % adjusted(model_quad.score(x_test_quad, y_test), len(y_test), 2))

model_cubic

print('train: %.3f' % adjusted(model_cubic.score(x_train_cubic, y_train), len(y_train), 3))
print('test : %.3f' % adjusted(model_cubic.score(x_test_cubic, y_test), len(y_test), 3))

<Multi-variable implementation and comparison>
Only x1 is expanded polynomially; x2_org is added as is.

Preparation

Combining the explanatory variables

x_org = np.hstack((x1_org, x2_org))
x_quad = np.hstack((x1_quad, x2_org))
x_cubic = np.hstack((x1_cubic, x2_org))

Train/test split with the hold-out method

x_train, x_test, y_train, y_test = train_test_split(x_org, y_org, test_size = 0.3, random_state = 0)
x_train_quad, x_test_quad, y_train, y_test = train_test_split(x_quad, y_org, test_size = 0.3, random_state = 0)
x_train_cubic, x_test_cubic, y_train, y_test = train_test_split(x_cubic, y_org, test_size = 0.3, random_state = 0)

Execution
Train in the same way as simple regression (same as above, with model_lin trained on x_train)

Evaluation
Compare the adjusted coefficients of determination of
model_lin, model_quad, and model_cubic

model_lin

print('train: %.3f' % adjusted(model_lin.score(x_train, y_train), len(y_train), 2))
print('test : %.3f' % adjusted(model_lin.score(x_test, y_test), len(y_test), 2))

model_quad

print('train: %.3f' % adjusted(model_quad.score(x_train_quad, y_train), len(y_train), 3))
print('test : %.3f' % adjusted(model_quad.score(x_test_quad, y_test), len(y_test), 3))

model_cubic

print('train: %.3f' % adjusted(model_cubic.score(x_train_cubic, y_train), len(y_train), 4))
print('test : %.3f' % adjusted(model_cubic.score(x_test_cubic, y_test), len(y_test), 4))

<Overall evaluation>

Comparing residual plots
- Compare the residual plots of the linear regression and of the cubic regression curve.
- When a curvilinear relationship is captured by introducing nonlinearity, the curved trend also disappears from the residuals.

Overfitting
- Raising the degree adds flexibility, which also makes overfitting more likely.
- Keep the degree moderate, using the adjusted coefficient of determination on the test data as the criterion.

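Regularized regression analysis

Training, inference, and evaluation for each method

Create test data beforehand

def function(x):
    y = 0.0001 * (x**3 + x**2 + x + 1)
    return y

Ordinary polynomial regression (degree 7)

Preparation
```
pol = PolynomialFeatures(degree=7)
```

Execution

Training

x_pol = pol.fit_transform(x) # feature transform
lr = LinearRegression()
lr.fit(x_pol, y) # training

Inference

x_plot_pol = pol.transform(x_plot) # feature transform
y_plot_pol = lr.predict(x_plot_pol) # inference

Evaluation
Add the result to the test-data graph and plot it (to check for overfitting)

# function used to generate the test data
plt.plot(x_plot, y_plot, color='gray')
# regression equation of the polynomial regression (degree 7)
plt.plot(x_plot, y_plot_pol, color='green')

Ridge regression on the polynomial regression (degree 7)

Preparation
Run repeatedly while varying alpha (the coefficient λ of the regularization term)
```
model_ridge = Ridge(alpha=1000)
```

Execution

Training
```
model_ridge.fit(x_pol, y)
```

Inference

y_plot_pol_ridge = model_ridge.predict(x_plot_pol)

Evaluation

Check the scores

print('R^2: %.3f' % model_ridge.score(x_pol, y))
print('adjusted R^2: %.3f' % adjusted(model_ridge.score(x_pol, y), len(y), 7))

Plot the result (to check the effect of regularization)
(added to the test-data graph)

# function used to generate the test data
plt.plot(x_plot, y_plot, color='gray')
# regression equation of the polynomial regression (degree 7)
plt.plot(x_plot, y_plot_pol, color='green')
# regression equation of the Ridge regression (degree 7)
plt.plot(x_plot, y_plot_pol_ridge, color='red')

Check

# weights (no regularization)
lr.coef_
# weights (Ridge regression)
model_ridge.coef_
# L2 norm (no regularization)
LA.norm(lr.coef_)
# L2 norm (Ridge regression)
LA.norm(model_ridge.coef_) # confirm that the L2 norm has shrunk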
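
The "vary alpha and rerun" step can be written as a loop. The following is a sketch on synthetic data; the data-generating function, the seed, the alpha values, and `include_bias=False` are assumptions made for the example, not settings from the text above.

```python
import numpy as np
from numpy import linalg as LA
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Synthetic samples of a noisy cubic on [-1, 1].
rng = np.random.RandomState(3)
x = rng.uniform(-1, 1, size=(30, 1))
y = x[:, 0] ** 3 + rng.normal(scale=0.1, size=30)

# Degree-7 features; the bias column is left out because Ridge fits its own intercept.
x_pol = PolynomialFeatures(degree=7, include_bias=False).fit_transform(x)

alphas = [0.001, 0.1, 10, 1000]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(x_pol, y)
    norms.append(LA.norm(model.coef_))  # L2 norm of the weights
    print('alpha=%g  L2 norm=%.4f' % (alpha, norms[-1]))
```

A larger alpha penalizes the weights more heavily, so the L2 norm shrinks toward zero as alpha grows.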

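LASSO regression on the polynomial regression (degree 7)
Performs dimensionality reduction automatically through feature selection.
It is most useful when analyzing sparse data that has too many features relative to the number of samples.

Preparation
Run repeatedly while varying alpha (the coefficient λ of the regularization term)
```
model_lasso = Lasso(alpha=1000)
```

Execution

Training
```
model_lasso.fit(x_pol, y)
```

Inference

y_plot_pol_lasso = model_lasso.predict(x_plot_pol)

Evaluation

Check the scores

print('R^2: %.3f' % model_lasso.score(x_pol, y))
print('adjusted R^2: %.3f' % adjusted(model_lasso.score(x_pol, y), len(y), 7))

Plot the result (to check the effect of regularization)
(added to the test-data graph)

# function used to generate the test data
plt.plot(x_plot, y_plot, color='gray')
# regression equation of the polynomial regression (degree 7)
plt.plot(x_plot, y_plot_pol, color='green')
# regression equation of the LASSO regression (degree 7)
plt.plot(x_plot, y_plot_pol_lasso, color='red')

Check
Dimensionality reduction through feature selection, and shrinkage of the L1 norm

# weights (no regularization)
lr.coef_
# weights (LASSO regression)
model_lasso.coef_ # confirm the dimensionality reduction
# L1 norm (no regularization)
LA.norm(lr.coef_, ord=1)
# L1 norm (LASSO regression)
LA.norm(model_lasso.coef_, ord=1) # confirm that the L1 norm has shrunk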
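
The feature-selection effect is easiest to see by counting coefficients that are exactly zero. A sketch on synthetic data where only the first two of ten features carry signal (the data, seed, and alpha values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Ten features, but the target depends only on the first two.
rng = np.random.RandomState(0)
x = rng.normal(size=(100, 10))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=0.1, size=100)

zero_counts = {}
for alpha in [0.01, 0.5, 10]:
    model = Lasso(alpha=alpha).fit(x, y)
    # LASSO sets weights to exactly 0.0, so an equality test is meaningful here.
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print('alpha=%g  zero weights=%d' % (alpha, zero_counts[alpha]))
```

With a very large alpha every weight is driven to zero and the model predicts a constant, which is why alpha has to be tuned rather than simply maximized.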

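ElasticNet on the polynomial regression (degree 7)
It penalizes a weighted combination of the L1 norm and the L2 norm, so it
performs dimensionality reduction that is stronger than Ridge regression but weaker than LASSO.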
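
For reference, the objective minimized by scikit-learn's ElasticNet can be written as follows, where $n$ is the number of samples, $w$ the weight vector, $\alpha$ the `alpha` argument, and $\rho$ the `l1_ratio` argument (the symbols $n$, $w$, $\rho$ are notation chosen for this sketch):

```latex
\min_{w}\;
\frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting $\rho = 1$ leaves only the L1 term (LASSO), while $\rho = 0$ leaves only the L2 term (a Ridge-style penalty).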

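Preparation
Run repeatedly while varying alpha (the coefficient λ of the regularization term) and l1_ratio
```
model_en= ElasticNet(alpha=1000, l1_ratio=0.9)
```

Execution

Training
```
model_en.fit(x_pol, y)
```

Inference

y_plot_pol_en = model_en.predict(x_plot_pol)

Evaluation

Check the scores

print('R^2: %.3f' % model_en.score(x_pol, y))
print('adjusted R^2: %.3f' % adjusted(model_en.score(x_pol, y), len(y), 7))

Plot the result (to check the effect of regularization)
(added to the test-data graph)

# function used to generate the test data
plt.plot(x_plot, y_plot, color='gray')
# regression equation of the polynomial regression (degree 7)
plt.plot(x_plot, y_plot_pol, color='green')
# regression equation of ElasticNet (degree 7)
plt.plot(x_plot, y_plot_pol_en, color='red')

Check
Dimensionality reduction, and shrinkage of the L1 and L2 norms

# weights (no regularization)
lr.coef_
# weights (ElasticNet)
model_en.coef_ # confirm the dimensionality reduction

# L2 norm (no regularization)
LA.norm(lr.coef_)
# L2 norm (ElasticNet)
LA.norm(model_en.coef_) # confirm that the L2 norm has shrunk

# L1 norm (no regularization)
LA.norm(lr.coef_, ord=1)
# L1 norm (ElasticNet)
LA.norm(model_en.coef_, ord=1) # confirm that the L1 norm has shrunk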

Simple perceptron linear classifier

Overview
- A linear classifier built on a simple perceptron
- Being a single-layer perceptron, it can only solve linearly separable problems.

Preparation

Loading the data
Iris dataset
```
df = pd.read_csv('work/iris.csv')
```

Feature selection and engineering

Check

df.describe()
sns.pairplot(df, hue='Species')
plt.show()

Selection

np_arr=np.array(df)
# select the PetalLengthCm and PetalWidthCm columns
x=np_arr[:100, 3:5] # change 100 to 150 for three-class classification
# select the Species column
y=np_arr[:100, 5:6] # change 100 to 150 for three-class classification

Conversion

y[y=='Iris-setosa']=0
y[y=='Iris-versicolor']=1
# y[y=='Iris-virginica']=2
y=np.array(y,dtype='int64')

Standardizing the data
Categorical data is excluded
```
ss = StandardScaler()
ss.fit(x)
x_std = ss.transform(x)
```

Show in a scatter plot
Confirm that the classes can be separated by a linear classifier

plt.scatter(x_std[:50, 0], x_std[:50,1], color="red", marker="s", label="setosa")
plt.scatter(x_std[50:100, 0], x_std[50:100,1], color="blue", marker="x", label="versicolor")
#plt.scatter(x_std[100:150, 0], x_std[100:150,1], color="yellow", marker="o", label="virginica")
plt.xlabel("PetalLengthCm")
plt.ylabel("PetalWidthCm")
plt.legend(loc="upper left")
plt.show()

Train/test split with the hold-out method

x_train, x_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3, random_state=0)

Execution

Training

ppn = Perceptron(eta0=0.1) # learning rate 0.1
ppn.fit(x_train, y_train)

Inference
Here, classification

index = 10
print('answer : %d' % y_test[index][0])
print('predict: %d' % ppn.predict([x_test[index]])[0])

Evaluation

Accuracy on the datasets

print('train acc: %.3f' % ppn.score(x_train, y_train))
print('test acc: %.3f' % ppn.score(x_test, y_test))

Visualize the learned decision boundary

plot_decision_regions(x_std, y.flatten(), ppn)
plt.xlabel("PetalLengthCm")
plt.ylabel("PetalWidthCm")
plt.legend(loc="upper left")
plt.title('Iris')
plt.show()

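Support vector machine classifier

Overview
This classifier also handles nonlinear problems.

Procedure
Only the steps that differ from the simple perceptron linear classifier are described below.

Training

Basic

svc = SVC(kernel='linear')
svc.fit(x_train, y_train)

Introducing slack variables (tolerating misclassification)
C: regularization parameter that controls misclassification
(small values are lenient toward misclassification, large values are strict)
```
svc = SVC(kernel='linear', C=1.0)
svc.fit(x_train, y_train) 
```
Introducing the kernel method (nonlinear support)
C: strictness toward misclassification
gamma: complexity of the decision curve
```
svc = SVC(kernel='rbf', gamma=0.1, C=10)
svc.fit(x_train, y_train) 
```

Inference
```
Run with ppn replaced by svc
```

Evaluation
```
Run with ppn replaced by svc
```

Note: With the kernel method, using the following test data
makes the nonlinearity of the decision boundary clearly visible.

# create the XOR data
np.random.seed(0)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)
# scatter the data
plt.scatter(X_xor[y_xor == 1, 0], X_xor[y_xor == 1, 1], c='b', marker='x', label='1')
plt.scatter(X_xor[y_xor == -1, 0], X_xor[y_xor == -1, 1], c='r', marker='s', label='-1')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend(loc='best') # let matplotlib pick the least-obstructive legend position
plt.show()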
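
To see the difference numerically rather than only visually, the same XOR data can be fitted with a linear kernel and an RBF kernel. This is a sketch that scores on the training data itself, with C=10 chosen for both models as an assumption:

```python
import numpy as np
from sklearn.svm import SVC

# Same XOR data as above.
np.random.seed(0)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)

# A straight line cannot separate XOR; an RBF kernel can bend the boundary.
svc_linear = SVC(kernel='linear', C=10).fit(X_xor, y_xor)
svc_rbf = SVC(kernel='rbf', gamma=0.1, C=10).fit(X_xor, y_xor)

acc_linear = svc_linear.score(X_xor, y_xor)
acc_rbf = svc_rbf.score(X_xor, y_xor)
print('linear: %.3f  rbf: %.3f' % (acc_linear, acc_rbf))
```

Passing `svc_rbf` to `plot_decision_regions(X_xor, y_xor, svc_rbf)` then shows the curved boundary described in the note.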

Logistic regression analysis

Overview
- Like the simple perceptron linear classifier, this is also a linear classifier.

However, this one
- predicts the probability of a binary classification.
- uses the breast cancer dataset here.

Procedure
Only the steps that differ from the simple perceptron linear classifier are described below.

Preparation

Feature selection and engineering
・Check

df_pickup = df.loc[:, ['perimeter_worst', 'concave points_worst', 'radius_worst', 'concave points_mean', 'diagnosis']]
sns.pairplot(df_pickup, hue='diagnosis')
plt.show()

・Selection

x = df.loc[:, ['perimeter_worst', 'concave points_mean']].values
y = df.loc[:, ['diagnosis']].values

・Conversion

y[y=='M']=0
y[y=='B']=1
y=np.array(y,dtype='int64')

Standardizing the data
Categorical data is excluded
```
ss = StandardScaler()
ss.fit(x)
x_std = ss.transform(x)
```

Train/test split with the hold-out method

x_train, x_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3, random_state=0)

Training

lr = LogisticRegression(C=1.0)
lr.fit(x_train, y_train)

Inference
Here, binary classification

Run

index = 10
print('answer : %d' % y_test[index][0])
print('predict: %d' % lr.predict([x_test[index]])[0])

Probability
```
lr.predict_proba([x_test[index]])[0]
```
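
For a binary problem, the predicted probability is the sigmoid of the linear decision value. The sketch below checks this using scikit-learn's bundled breast cancer data and its first two columns as a stand-in for the CSV used above (that substitution is an assumption of this example):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Bundled dataset: target 0 = malignant, 1 = benign.
data = load_breast_cancer()
x_std = StandardScaler().fit_transform(data.data[:, :2])
y = data.target

lr = LogisticRegression(C=1.0).fit(x_std, y)

# z = w . x + b for the first five samples.
z = lr.decision_function(x_std[:5])
# Sigmoid of z gives the probability of class 1.
proba_manual = 1.0 / (1.0 + np.exp(-z))

proba = lr.predict_proba(x_std[:5])  # columns: P(class 0), P(class 1)
print(proba[:, 1], proba_manual)
```

`predict` simply returns the class whose probability exceeds 0.5, i.e. the sign of `z`.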

Evaluation

Accuracy on the datasets

print('train acc: %.3f' % lr.score(x_train, y_train))
print('test acc: %.3f' % lr.score(x_test, y_test))

Visualize the learned decision boundary

plot_decision_regions(x_std, y.flatten(), lr)
plt.xlabel("perimeter_worst")
plt.ylabel("concave points_mean")
plt.legend(loc="upper left")
plt.title('cancer')
plt.show()

Displaying metrics
・Function definition

def print_metrics(model, data, label):
    from sklearn import metrics
    pred = model.predict(data)
    print('accuracy: %.3f' % metrics.accuracy_score(label, pred)) # accuracy
    print('recall: %.3f' % metrics.recall_score(label, pred, average='macro')) # recall (macro average)
    print('precision: %.3f' % metrics.precision_score(label, pred, average='macro')) # precision (macro average)
    print('f1_score: %.3f' % metrics.f1_score(label, pred, average='macro')) # F1 score (macro average)

・Checking the metrics

print_metrics(lr, x_train, y_train)
print_metrics(lr, x_test, y_test)

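Cluster analysis

Preparation
Use the iris data after feature extraction with principal component analysis.

Training
Run cluster analysis with the k-means method.

km = KMeans(n_clusters=3,   # number of clusters
            init='random',  # how the initial centroids are chosen
            n_init=10,      # number of runs with different initial centroids
            max_iter=300,   # maximum number of iterations for a single run
            tol=1e-04,      # relative tolerance used to declare convergence
            random_state=0, # state of the random generator used to initialize the centroids
           )

y_km = km.fit_predict(x_pca[:, 0:2]) # use PC1 and PC2 only

Results

Display the results.

Function definition

def kmeans_plot(n_clusters, km, x):
    # compute the predicted cluster for each sample
    y_km = km.fit_predict(x)
    
    # scatter each cluster (zip limits this to 5 clusters; add more colors and markers to support more)
    for i, color, marker in zip(range(n_clusters), 'rgbcm', '>o+xv'):
        plt.scatter(x[y_km==i, 0],            # horizontal axis values
                    x[y_km==i, 1],            # vertical axis values
                    color=color,              # marker color
                    marker=marker,            # marker shape
                    label='cluster ' + str(i) # label
                   )
    
    # scatter the cluster centers
    plt.scatter(km.cluster_centers_[:, 0],    # horizontal axis values
                km.cluster_centers_[:, 1],    # vertical axis values
                color='y',                    # marker color
                marker='*',                   # marker shape
                label='centroids',            # label
                s=300,                        # make the markers larger
               )
   
    plt.legend()
    plt.grid()
    plt.show()

Plot the results

kmeans_plot(3, km, x_pca[:, 0:2]) # use PC1 and PC2 only

Compute the accuracy.

Function definition

def kmeans_score(y_km, y):
    y=y.flatten()
    correct_ans = 0
    for i in range(len(y)):
        if y_km[i] == y[i]:
            correct_ans += 1
    return correct_ans / len(y)

Showing the result

# swap back the cluster labels that came out in a different order (here 1 and 2)
y_km[y_km==2]=3
y_km[y_km==1]=2
y_km[y_km==3]=1
# show the score
kmeans_score(y_km, y)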
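
The manual label swap depends on how a particular run happened to number the clusters. One alternative, sketched here and not part of the original procedure, is to derive the mapping from the confusion matrix with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`). The helper name `align_cluster_labels` is hypothetical, and the sketch assumes both label sets use the integers 0..k-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def align_cluster_labels(y_true, y_cluster):
    """Rename cluster ids so that they agree with the true labels as much as possible."""
    # Rows are true labels, columns are cluster ids.
    cm = confusion_matrix(y_true, y_cluster)
    # Maximize the matched count by minimizing its negative.
    true_idx, cluster_idx = linear_sum_assignment(-cm)
    mapping = dict(zip(cluster_idx, true_idx))
    return np.array([mapping[c] for c in y_cluster])

# Toy example: clusters 1 and 2 are swapped relative to the true labels.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_km = np.array([0, 0, 2, 2, 1, 1])

y_aligned = align_cluster_labels(y_true, y_km)
accuracy = np.mean(y_aligned == y_true)
print(y_aligned, accuracy)
```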

Weaknesses and countermeasures

The final clusters can change substantially depending on the initial centroids

Run k-means several times with different initial centroids (n_init) and
adopt the run with the smallest within-cluster sum of squared errors (SSE).

Run the k-means++ method by setting init='k-means++'.
The default value of init is 'k-means++'.

The number of clusters has to be chosen by the user

Elbow method

# number of clusters and SSE
distortions = []
for k  in range(1,11):              # 1 to 10 clusters
    km = KMeans(n_clusters=k,       # number of clusters
                init='random',      # how the initial centroids are chosen
                n_init=10,          # number of runs with different initial centroids
                max_iter=300,       # maximum number of iterations per run
                random_state=0)     # random generator state
    km.fit(x_pca[:, 0:2])           # run the clustering
    distortions.append(km.inertia_) # store the SSE in the list
    # inertia_ is the within-cluster sum of squared distances (SSE), also called distortion

# output the result as a graph
plt.plot(range(1,11), distortions,marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

Note: The elbow falls at 2 or 3, which is fairly ambiguous.

Silhouette analysis

# compute the silhouette coefficients
silhouettes = silhouette_samples(x_pca[:, 0:2], y_km, metric='euclidean')

# display the silhouette coefficients
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]

yticks = []
y_ax_lower, y_ax_upper = 0, 0

for i, cluster_label in enumerate(cluster_labels):
   # extract the coefficients of this cluster
   cluster_silhouettes = silhouettes[y_km==cluster_label]
   # set the upper edge of the drawing range
   y_ax_upper += len(cluster_silhouettes)
   # set the color used for drawing
   color = cm.jet(float(i) / n_clusters)
   # draw the cluster as horizontal bars
   cluster_silhouettes.sort()
   plt.barh(range(y_ax_lower, y_ax_upper), 
            cluster_silhouettes, 
            height=1.0, 
            edgecolor='none', 
            color=color)
   # position of the cluster label
   yticks.append((y_ax_lower + y_ax_upper) / 2.)
   # start position of the next drawing
   y_ax_lower += len(cluster_silhouettes)

plt.axvline(np.mean(silhouettes), color='red', linestyle='--')
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
plt.show()
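
Because the elbow can be ambiguous, the mean silhouette coefficient per candidate k offers a single number to compare. A sketch on synthetic blobs with three well-separated centers (the centers, spread, and range of k are assumptions for the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three clearly separated blobs.
x, _ = make_blobs(n_samples=150,
                  centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5,
                  random_state=0)

scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels = km.fit_predict(x)
    # Mean silhouette coefficient over all samples; closer to 1 is better.
    scores[k] = silhouette_score(x, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data such as the iris principal components the peak is usually less pronounced, so it is worth reading this score together with the elbow plot.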

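Decision tree analysis

Overview
Highly interpretable tree-based algorithms
- Decision tree
- Random forest
- Gradient boosted trees

Procedure
Only the steps that differ from the simple perceptron linear classifier are described below.

Training

Basic

tree = DecisionTreeClassifier(random_state=0)
tree.fit(x_train, y_train)

Pruning
Specify max_depth

tree = DecisionTreeClassifier(random_state=0, max_depth=3)
tree.fit(x_train, y_train)

Random forest
Specify n_estimators

tree = RandomForestClassifier(random_state=0, n_estimators=10)
tree.fit(x_train, y_train)

Gradient boosted trees
Specify learning_rate.

tree = GradientBoostingClassifier(random_state=0, max_depth=3, learning_rate=0.1)
tree.fit(x_train, y_train)

Inference
Run with ppn replaced by tree

Evaluation

Run with ppn replaced by tree

Predicted probabilities can be output
・All three classifiers provide predict_proba; for a single decision tree the values are the class fractions in the leaf, so the ensemble methods give smoother estimates.
```
tree.predict_proba(x_test[11].reshape(1, -1))
```
・The output below means 0% for class 0, 100% for class 1, and 0% for class 2.
```
array([[0., 1., 0.]])
```

Note: Output the importance of each feature
(the random forest's values are the more reliable)

Data preparation
Evaluate using all columns of the
breast cancer dataset from the
logistic regression section.

Training
Specify n_estimators=100.

tree = RandomForestClassifier(random_state=0, n_estimators=100)
tree.fit(x_train, y_train)

Output

Display as an array
```
print(tree.feature_importances_)
```

Visualization (bar chart)

x_columns = len(df_x.columns)
plt.figure(figsize=(12, 8))
plt.barh(range(x_columns), tree.feature_importances_ , align='center')
plt.yticks(np.arange(x_columns), df_x.columns)
plt.show()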

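Principal component analysis

Preparation
Use the same iris data as the simple perceptron linear classifier.

Training
n_components must not exceed the number of explanatory variables.
To specify 4, keep four columns in x.
```
pca = PCA(n_components=4) # obtain up to four principal components
x_pca = pca.fit_transform(x_std)
```

Results

Visualization

Scattering the above on the first and second principal components
shows that the three classes are separated well.

plt.figure()
for target, marker, color in zip(range(3), '>ox', 'rgb'): # three-class classification
    # y==target returns a boolean numpy.ndarray vector, which is used to select the rows.
    plt.scatter(x_pca[y==target, 0], x_pca[y==target, 1], marker=marker, color=color)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()

Plotting the above on the third and fourth principal components
shows that the classes are not separated well.

Contribution ratio

Ordinary contribution ratio (explained variance ratio)
```
pca.explained_variance_ratio_
```

Cumulative contribution ratio (sums to 100%)

 np.cumsum(pca.explained_variance_ratio_)

Factor loadings
Factor loading of explanatory variable i on principal component j = √l_j × h_ji
(l_j: eigenvalue, i.e. the variance of principal component j; h_ji: i-th element of its eigenvector)
```
pca.components_ * np.sqrt(pca.explained_variance_)[:, np.newaxis]
```
Note: In the output matrix, row n (0-based) holds the factor loadings of the (n+1)-th principal component,
and column n holds the correlation between that component and the (n+1)-th explanatory variable.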
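
The statement that factor loadings are correlations can be verified directly. The sketch below uses scikit-learn's bundled iris data in place of the CSV (an assumption of this example). One detail matters: StandardScaler divides by the population standard deviation (ddof=0) while `explained_variance_` uses ddof=1, so the loadings are rescaled by the sample standard deviation before comparing.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=4)
x_pca = pca.fit_transform(x_std)

# Factor loadings: row j = principal component j, column i = variable i.
loadings = pca.components_ * np.sqrt(pca.explained_variance_)[:, np.newaxis]

# Correlation between each principal component score and each variable.
corr = np.array([[np.corrcoef(x_pca[:, j], x_std[:, i])[0, 1]
                  for i in range(4)]
                 for j in range(4)])

# Sample standard deviation of each standardized variable (slightly above 1).
scale = x_std.std(axis=0, ddof=1)
print(np.round(loadings / scale, 3))
```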

Natural language processing

<Sentiment analysis with a logistic regression classifier>
Perform supervised learning that takes a review as input and judges whether it is positive or negative.

Preparation

Imports

Regular expression module
```
import re
```

Natural Language Toolkit

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

Cleansing
・Remove HTML tags
・Remove characters such as ", :, - (emoticons are kept)
・Convert text written in upper case to lower case
・Stemming (strip affixes to reduce words to their stems)
・Remove stopwords such as "the" and "to"

Function definitions

def remove_html_tag(text):
    pattern = re.compile(r"<[^>]*>")
    removed = re.sub(pattern, " ", text)
    return removed

def remove_punct(text):
    pattern = re.compile(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)")
    emoticons = pattern.findall(text)
    lower = text.lower()
    removed = re.sub(r"[\W]+", " ", lower)
    emoticons = " ".join(emoticons)
    emoticons = emoticons.replace("-","")
    connected = removed + ' ' + emoticons
    return connected

def porter_stem(text):
    stemmer = nltk.PorterStemmer()
    words = []
    for word in text.split(' '):
        try:
            words.append(stemmer.stem(word))
        except Exception:
            # keep the original word if stemming fails
            words.append(word)
    return " ".join(words)

def strip_stop(text):
    words = []
    stop = stopwords.words("english")
    for word in text.split(' '):
        if word not in stop:
            words.append(word)
    return " ".join(words)

Running the processing

df["review"] = df["review"].apply(remove_html_tag)
df["review"] = df["review"].apply(remove_punct)
df["review"] = df["review"].apply(porter_stem)
df["review"] = df["review"].apply(strip_stop)
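
To show what the first two cleansing steps do to a single review, here is a self-contained sketch that repeats the two regex-based functions and applies them to a made-up sentence (stemming and stopword removal are left out so that no NLTK download is needed):

```python
import re

def remove_html_tag(text):
    # Replace every <...> tag with a space.
    return re.sub(r"<[^>]*>", " ", text)

def remove_punct(text):
    # Remember emoticons such as :-) before punctuation is stripped.
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lower-case, then collapse runs of non-word characters into one space.
    removed = re.sub(r"[\W]+", " ", text.lower())
    # Append the emoticons without their "nose".
    return removed + ' ' + " ".join(emoticons).replace("-", "")

# Hypothetical review text.
raw = "<p>This movie was GREAT :-) </p>"
cleaned = remove_punct(remove_html_tag(raw))
print(cleaned.split())
```

The emoticon survives as the token `:)` at the end of the text, which lets the classifier use it as a sentiment cue.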

Converting to vectors

cv = CountVectorizer(max_df=0.3, min_df=5, stop_words='english')
cv.fit(df['review'])
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
bow = cv.transform(df['review']).toarray()

Train/test split with the hold-out method

x_train, x_test, y_train, y_test = train_test_split(bow, df["sentiment"], test_size=0.3, random_state=0)

Execution
Logistic regression

Training

lr = LogisticRegression()
lr.fit(x_train, y_train)

Inference
```
y_pred_train = lr.predict(x_train)
```

Evaluation
Slightly overfitted.

Training data
```
print_metrics(lr, x_train, y_train)
```
Test data
```
print_metrics(lr, x_test, y_test)
```

TF-IDF

Overview
Weights words according to how rare they are.

TF (Term Frequency)
The number of times the word appears in a single document
DF (Document Frequency)
The number of documents, out of all documents, in which the word appears
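
As a worked example of how DF turns into a weight, the sketch below compares the `idf_` attribute of TfidfTransformer with the smoothed formula ln((1 + n) / (1 + df)) + 1 that scikit-learn uses by default. The small count matrix is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Rows are documents, columns are words (raw counts).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

tfidf = TfidfTransformer()  # defaults: smooth_idf=True, norm='l2'
x_tfidf = tfidf.fit_transform(counts).toarray()

n_docs = counts.shape[0]
df_counts = (counts > 0).sum(axis=0)   # number of documents containing each word
idf_manual = np.log((1 + n_docs) / (1 + df_counts)) + 1

print(tfidf.idf_, idf_manual)
```

The first word appears in every document and gets the smallest IDF, so its large raw counts are weighted down relative to the rarer words.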

Procedure

Conversion

tfidf = TfidfTransformer(sublinear_tf=True)
x_train_tfidf = tfidf.fit_transform(x_train.astype('f')).toarray()
x_test_tfidf = tfidf.transform(x_test.astype('f')).toarray()  # reuse the IDF learned from the training data

Training

lr_tfidf = LogisticRegression()
lr_tfidf.fit(x_train_tfidf, y_train)

Evaluation
・The overfitting is slightly reduced.

print_metrics(lr_tfidf, x_train_tfidf, y_train)
print_metrics(lr_tfidf, x_test_tfidf, y_test)

・Check the words with high importance.

mglearn.tools.visualize_coefficients(lr.coef_, feature_names, n_top_features=30)
mglearn.tools.visualize_coefficients(lr_tfidf.coef_, feature_names, n_top_features=30)

<LDA topic model (document clustering)>
Can be used for topic analysis, recommendation, similar-document search, machine translation, and so on.

Preparation
Use x_train_tfidf.

Execution

Extract five characteristic topics

lda = LatentDirichletAllocation(n_components=5, learning_method='batch', random_state=0)
document_topics = lda.fit_transform(x_train_tfidf)

Evaluation

The five topic weights for each review
```
document_topics
```
The words and their importance within the five topics
```
lda.components_
```

Print the five words most strongly related to each topic

mglearn.tools.print_topics(
    topics=range(5),
    feature_names=np.array(feature_names),
    sorting=np.argsort(lda.components_, axis=1)[:, ::-1],
    topics_per_chunk=5, n_words=5)

Using LDA for dimensionality reduction
Try classification (supervised learning) after reducing the dimensionality with LDA.

Dimensionality reduction with LDA
Reduce from 5000 words to 25 topics

lda2 = LatentDirichletAllocation(n_components=25, learning_method='batch', random_state=0)
x_train_lda = lda2.fit_transform(x_train_tfidf)
x_test_lda = lda2.transform(x_test_tfidf)  # project the test data onto the topics learned from the training data

Feed it to the classifier.
Accuracy drops, but the computing resources required can be reduced.

lr_lda = LogisticRegression()
lr_lda.fit(x_train_lda, y_train)
print_metrics(lr_lda, x_train_lda, y_train)
print_metrics(lr_lda, x_test_lda, y_test)

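Deep learning

TensorFlow・Keras

Procedures (performance)

Cross-validation

k-fold cross-validation
Good to try with the multiple regression example.

A performance evaluation method that splits the data into k folds and rotates which fold is the test data,
so that every sample is used as test data once.

It gives a more reliable evaluation than the hold-out method.

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(lr, x, y, cv=kf)
scores
scores.mean() # mean of the cross-validation scores
scores.std() # standard deviation of the cross-validation scores

Stratified cross-validation
Good to try with the support vector machine classifier example.

A cross-validation method suited to classification problems.

It adds the condition that the class labels are distributed evenly across the folds.

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(svc, x, y, cv=kf)
scores
scores.mean() # mean of the cross-validation scores
scores.std() # standard deviation of the cross-validation scores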
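
What cross_val_score does internally can be written out as an explicit loop over the folds. This sketch uses a synthetic regression dataset (its size, noise, and seed are assumptions) and checks that the loop reproduces the scores:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

x, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in kf.split(x):
    # Fit on four folds, score (R^2) on the remaining fold.
    model = LinearRegression().fit(x[train_idx], y[train_idx])
    manual_scores.append(model.score(x[test_idx], y[test_idx]))

scores = cross_val_score(LinearRegression(), x, y, cv=kf)
print(np.round(manual_scores, 3), np.round(scores, 3))
```

Writing the loop by hand is useful when extra steps, such as fitting a scaler on the training folds only, need to happen inside each fold.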

Grid search

Good to try with the support vector machine classifier example.

Overview
- A technique for determining good hyperparameter values by search.
- Combinations of hyperparameters are searched exhaustively over the points of a grid.

Procedure

Preparation

Create the grid search object
```
gs_svc = GridSearchCV(SVC(), param_grid, cv=kf)
```
・SVC()
　・Any estimator can be specified.
　・Here, an SVM instance is specified.
・param_grid
　The grid of hyperparameters.
　Here, the SVM hyperparameters are specified.
```
param_grid = {'C': [0.1, 1.0, 10, 100, 1000, 10000],
              'gamma': [0.001, 0.01, 0.1, 1, 10]}
```
・cv=kf
　Specifies how the data is split.
```
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
```

Train/test split with the hold-out method

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

Training
Train with grid search
```
gs_svc.fit(x_train, y_train)
```

Evaluation

Output of the results

# the combination with the highest accuracy
gs_svc.best_params_
# the score at that point
gs_svc.best_score_
# accuracy on the test dataset
gs_svc.score(x_test, y_test)
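
When the data is standardized, one option is to wrap the scaler and the classifier in a Pipeline so that each cross-validation fold fits the scaler on its own training part. This is a sketch with scikit-learn's bundled iris data and a smaller grid; the step names `scaler` and `svc` and the `stratify=y` argument are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y)

# Scaler and classifier are tuned and refitted together.
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {'svc__C': [0.1, 1.0, 10, 100],
              'svc__gamma': [0.001, 0.01, 0.1, 1]}

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
gs = GridSearchCV(pipe, param_grid, cv=kf)
gs.fit(x_train, y_train)

test_acc = gs.score(x_test, y_test)
print(gs.best_params_, gs.best_score_, test_acc)
```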

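Output of the decision boundary

plot_decision_regions(x_std, y.flatten(), gs_svc)
...

Feature selection and feature extraction

Using iris, compare the accuracy obtained with the features from

feature selection, and
feature extraction by principal component analysis.

The support vector machine classifier is used as the algorithm.

Training with x_std

scores_std = cross_val_score(SVC(), x_std[:, [0, 2]], y, cv=5)
print('feature selection: {}'.format(scores_std.mean()))

Training with x_pca

scores_pca = cross_val_score(SVC(), x_pca[:, 0:2], y, cv=5)
print('feature extraction: {}'.format(scores_pca.mean()))

Note: This case is low-dimensional, so the feature-selected x_std gives the better accuracy.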

Creating test data

Regression dataset

Generation

x, y, coef = make_regression(random_state=12, 
                       n_samples=100,   # number of samples: 100
                       n_features=4,    # number of features: 4
                       n_informative=2, # number of features that actually drive the target: 2
                       noise=10.0,      # noise: 10.0
                       bias=-0.0,
                       coef=True)

Display

df_x=pd.DataFrame(x,columns=['a','b','c','d'])
df_y=pd.DataFrame(y,columns=['y'])
df=pd.concat([df_x, df_y],axis=1)

sns.pairplot(df)
plt.show()
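
The returned `coef` tells which columns are the informative ones, which is handy for checking whether a feature selection step found the right variables. A minimal sketch with the same arguments as above:

```python
import numpy as np
from sklearn.datasets import make_regression

x, y, coef = make_regression(random_state=12,
                             n_samples=100,
                             n_features=4,
                             n_informative=2,
                             noise=10.0,
                             coef=True)

# Non-informative features have a true coefficient of exactly 0.
informative = np.flatnonzero(coef)
print(coef, informative)
```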

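Classification dataset generation

Generation

x, y = make_blobs(random_state=8,
                  n_samples=100,   # number of samples: 100
                  n_features=2,    # number of features: 2
                  cluster_std=1.5, # standard deviation
                  centers=3)       # number of blobs: 3

Display

Function

def cluster_plot(n_clusters, x, y):
    plt.figure()
    for target, marker, color in zip(range(n_clusters), '>ox', 'rgb'): # markers and colors cover up to three classes
        # y==target returns a boolean numpy.ndarray vector, which is used to select the rows.
        plt.scatter(x[y==target, 0], x[y==target, 1], marker=marker, color=color)
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.show()

Scatter
```
cluster_plot(3, x, y)
```

Datasets that can be approximated by a function

Define some function

def function(x):
    y = ...  # some expression of x
    return y

Plotting the function

# specify the range of x
x_plot = np.arange(-25, 25, 0.1)
# compute y with the function
y_plot = function(x_plot)
# reshape for machine learning
x_plot = x_plot.reshape(-1, 1)

Generating the samples

# fix the state of the random number generator
np.random.seed(3)
# generate X data points following a normal distribution
x = np.random.normal(0, 10, X)
# generate the corresponding y with the function
y = function(x)
# add noise that follows a normal distribution
y += np.random.normal(0, 0.25, len(y))
# reshape for machine learning
x = x.reshape(-1, 1)

Visualize by drawing the function and a scatter plot

# draw the function
plt.plot(x_plot, y_plot, color='gray')

# scatter the samples
plt.scatter(x, y)

# show the graph
plt.show()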

References

scikit-learn
