python決策樹iris_基於python的決策樹能進行多分類嗎

Ⅰ python中的sklearn中決策樹使用的是哪一種演算法

sklearn中決策樹分為DecisionTreeClassifier和DecisionTreeRegressor，所以用的演算法是CART演算法，也就是分類與回歸樹演算法(classification and regression tree,CART)，劃分標准默認使用的也是Gini，ID3和C4.5用的是信息熵，為何要設置成ID3或者C4.5呢

Ⅱ 基於python的決策樹能進行多分類嗎

決策樹主文件 tree.py

[python] view plain

#coding:utf-8
frommathimportlog
importjson
fromplotimportcreatePlot
classDecisionTree():
def__init__(self,criterion="entropy"):
self.tree=None
self.criterion=criterion
def_is_continuous_value(self,a):
#判斷一個值是否是連續型變數
iftype(a).__name__.lower().find('float')>-1or
type(a).__name__.lower().find('int')>-1:
returnTrue
else:
returnFalse
def_calc_entropy(self,dataset):
#計算數據集的香農熵
classes=dataset.ix[:,-1]
total=len(classes)
cls_count={}
forclsinclasses:
ifclsnotincls_count.keys():
cls_count[cls]=0
cls_count[cls]+=1
entropy=1.0
forkeyincls_count:
prob=float(cls_count[key])/total
entropy-=prob*log(prob,2)
returnentropy
def_calc_gini(self,dataset):
#計算數據集的Gini指數
classes=dataset.ix[:,-1]
total=len(classes)
cls_count={}
forclsinclasses:
ifclsnotincls_count.keys():
cls_count[cls]=0
cls_count[cls]+=1
gini=1.0
forkeyincls_count:
prob=float(cls_count[key])/total
gini-=prob**2
returngini
def_split_data_category(self,dataset,feature,value):
#對分類變數進行拆分
#將feature列的值為value的記錄抽取出來，同時刪除feature列

Ⅲ python sklearn 如何用測試集數據畫出決策樹（非開發樣本）

#coding=utf-8

from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

from sklearn.externals.six import StringIO
import pydot

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph[0].write_dot('iris_simple.dot')
graph[0].write_png('iris_simple.png')

Ⅳ python實現的決策樹怎麼可視化

常用的幾種決策樹演算法有ID3、C4.5、CART：
ID3：選擇信息熵增益最大的feature作為node，實現對數據的歸納分類。
C4.5：是ID3的一個改進，比ID3准確率高且快，可以處理連續值和有缺失值的feature。
CART：使用基尼指數的劃分准則，通過在每個步驟最大限度降低不純潔度，CART能夠處理孤立點以及能夠對空缺值進行處理。

Ⅳ python sklearn決策樹的圖怎麼畫

Ⅵ python中使用scikit-learn的決策樹演算法運行錯誤

你好，你有讀寫的許可權嗎？嘗試普通的文件讀寫操作：
f=open('a.txt', 'w')
如果不能正常運行，那麼嘗試用管理員許可權運行你的程序。或者，修改保存的文件名，'iris.doct'修改為其它的名字，如'abcd.dot'.

Ⅶ 如何用python實現隨機森林分類

大家如何使用scikit-learn包中的類方法來進行隨機森林演算法的預測。其中講的比較好的是各個參數的具體用途。
這里我給出我的理解和部分翻譯：
參數說明：
最主要的兩個參數是n_estimators和max_features。
n_estimators：表示森林裡樹的個數。理論上是越大越好。但是伴隨著就是計算時間的增長。但是並不是取得越大就會越好，預測效果最好的將會出現在合理的樹個數。
max_features：隨機選擇特徵集合的子集合，並用來分割節點。子集合的個數越少，方差就會減少的越快，但同時偏差就會增加的越快。根據較好的實踐經驗。如果是回歸問題則：
max_features＝n_features，如果是分類問題則max_features＝sqrt(n_features)。

如果想獲取較好的結果，必須將max_depth＝None,同時min_sample_split=1。
同時還要記得進行cross_validated（交叉驗證），除此之外記得在random forest中，bootstrap=True。但在extra-trees中，bootstrap=False。

這里也給出一篇老外寫的文章：調整你的隨機森林模型參數http://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/

這里我使用了scikit-learn自帶的iris數據來進行隨機森林的預測：

[python]view plain

fromsklearn.
fromsklearn.
importnumpyasnp
fromsklearn.datasetsimportload_iris
iris=load_iris()
#printiris#iris的4個屬性是：萼片寬度萼片長度花瓣寬度花瓣長度標簽是花的種類：setosaversicolourvirginica
printiris['target'].shape
rf=RandomForestRegressor()#這里使用了默認的參數設置
rf.fit(iris.data[:150],iris.target[:150])#進行模型的訓練
#
#隨機挑選兩個預測不相同的樣本
instance=iris.data[[100,109]]
printinstance
print'instance0prediction；',rf.predict(instance[0])
print'instance1prediction；',rf.predict(instance[1])
printiris.target[100],iris.target[109]

返回的結果如下：

(150,)

[[ 6.3 3.3 6. 2.5]

[ 7.2 3.6 6.1 2.5]]

instance 0 prediction； [ 2.]

instance 1 prediction； [ 2.]

在這里我有點困惑，就是在scikit-learn演算法包中隨機森林實際上就是一顆顆決策樹組成的。但是之前我寫的決策樹博客中是可以將決策樹給顯示出來。但是隨機森林卻做了黑盒處理。我們不知道內部的決策樹結構，甚至連父節點的選擇特徵都不知道是誰。所以我給出下面的代碼（這代碼不是我的原創），可以顯示的顯示出所有的特徵的貢獻。所以對於貢獻不大的，甚至是負貢獻的我們可以考慮刪除這一列的特徵值，避免做無用的分類。

[python]view plain

fromsklearn.cross_validationimportcross_val_score,ShuffleSplit
X=iris["data"]
Y=iris["target"]
names=iris["feature_names"]
rf=RandomForestRegressor()
scores=[]
foriinrange(X.shape[1]):
score=cross_val_score(rf,X[:,i:i+1],Y,scoring="r2",
cv=ShuffleSplit(len(X),3,.3))
scores.append((round(np.mean(score),3),names[i]))
printsorted(scores,reverse=True)

顯示的結果如下：

[(0.934, 'petal width (cm)'), (0.929, 'petal length (cm)'), (0.597, 'sepal length (cm)'), (0.276, 'sepal width (cm)')]

這里我們會發現petal width、petal length這兩個特徵將起到絕對的貢獻，之後是sepal length，影響最小的是sepal width。這段代碼將會提示我們各個特徵的貢獻，可以讓我們知道部分內部的結構。

Ⅷ 使用python+sklearn的決策樹方法預測是否有信用風險

import numpy as np11
import pandas as pd11
names=("Balance,Duration,History,Purpose,Credit amount,Savings,Employment,instPercent,sexMarried,Guarantors,Residence ration,Assets,Age,concCredit,Apartment,Credits,Occupation,Dependents,hasPhone,Foreign,lable").split(',')11
data=pd.read_csv("Desktop/sunshengyun/data/german/german.data",sep='\s+',names=names)11
data.head()11

Balance
Duration
History
Purpose
Credit amount
Savings
Employment
instPercent
sexMarried
Guarantors
…
Assets
Age
concCredit
Apartment
Credits
Occupation
Dependents
hasPhone
Foreign
lable

0
A11 6 A34 A43 1169 A65 A75 4 A93 A101 … A121 67 A143 A152 2 A173 1 A192 A201 1

1
A12 48 A32 A43 5951 A61 A73 2 A92 A101 … A121 22 A143 A152 1 A173 1 A191 A201 2

2
A14 12 A34 A46 2096 A61 A74 2 A93 A101 … A121 49 A143 A152 1 A172 2 A191 A201 1

3
A11 42 A32 A42 7882 A61 A74 2 A93 A103 … A122 45 A143 A153 1 A173 2 A191 A201 1

4
A11 24 A33 A40 4870 A61 A73 3 A93 A101 … A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns
data.Balance.unique()11
array([『A11』, 『A12』, 『A14』, 『A13』], dtype=object)data.count()11
Balance 1000 Duration 1000 History 1000 Purpose 1000 Credit amount 1000 Savings 1000 Employment 1000 instPercent 1000 sexMarried 1000 Guarantors 1000 Residence ration 1000 Assets 1000 Age 1000 concCredit 1000 Apartment 1000 Credits 1000 Occupation 1000 Dependents 1000 hasPhone 1000 Foreign 1000 lable 1000 dtype: int64#部分變數描述性統計分析
data.describe()1212

Duration
Credit amount
instPercent
Residence ration
Age
Credits
Dependents
lable

count
1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000

mean
20.903000 3271.258000 2.973000 2.845000 35.546000 1.407000 1.155000 1.300000

std
12.058814 2822.736876 1.118715 1.103718 11.375469 0.577654 0.362086 0.458487

min
4.000000 250.000000 1.000000 1.000000 19.000000 1.000000 1.000000 1.000000

25%
12.000000 1365.500000 2.000000 2.000000 27.000000 1.000000 1.000000 1.000000

50%
18.000000 2319.500000 3.000000 3.000000 33.000000 1.000000 1.000000 1.000000

75%
24.000000 3972.250000 4.000000 4.000000 42.000000 2.000000 1.000000 2.000000

max
72.000000 18424.000000 4.000000 4.000000 75.000000 4.000000 2.000000 2.000000

data.Duration.unique()11
array([ 6, 48, 12, 42, 24, 36, 30, 15, 9, 10, 7, 60, 18, 45, 11, 27, 8, 54, 20, 14, 33, 21, 16, 4, 47, 13, 22, 39, 28, 5, 26, 72, 40], dtype=int64)data.History.unique()11
array([『A34』, 『A32』, 『A33』, 『A30』, 『A31』], dtype=object)data.groupby('Balance').size().order(ascending=False)11
c:\python27\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: order is deprecated, use sort_values(…) if __name__ == 『__main__』: Balance A14 394 A11 274 A12 269 A13 63 dtype: int64data.groupby('Purpose').size().order(ascending=False)11
c:\python27\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: order is deprecated, use sort_values(…) if __name__ == 『__main__』: Purpose A43 280 A40 234 A42 181 A41 103 A49 97 A46 50 A45 22 A44 12 A410 12 A48 9 dtype: int64data.groupby('Apartment').size().order(ascending=False)11
c:\python27\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: order is deprecated, use sort_values(…) if __name__ == 『__main__』: Apartment A152 713 A151 179 A153 108 dtype: int64import matplotlib.pyplot as plt
%matplotlib inline
data.plot(x='lable', y='Age', kind='scatter',
alpha=0.02, s=50);12341234
![png](output_13_0.png)data.hist('Age', bins=15);11
![png](output_14_0.png)target=data.lable11
features_data=data.drop('lable',axis=1)11
numeric_features = [c for c in features_data if features_data[c].dtype.kind in ('i', 'f')] # 提取數值類型為整數或浮點數的變數11
numeric_features11
[『Duration』, 『Credit amount』, 『instPercent』, 『Residence ration』, 『Age』, 『Credits』, 『Dependents』]numeric_data = features_data[numeric_features]11
numeric_data.head()11

Duration
Credit amount
instPercent
Residence ration
Age
Credits
Dependents

0
6 1169 4 4 67 2 1

1
48 5951 2 2 22 1 1

2
12 2096 2 3 49 1 2

3
42 7882 2 4 45 1 2

4
24 4870 3 4 53 2 2

categorical_data = features_data.drop(numeric_features, axis=1)11
categorical_data.head()11

Balance
History
Purpose
Savings
Employment
sexMarried
Guarantors
Assets
concCredit
Apartment
Occupation
hasPhone
Foreign

0
A11 A34 A43 A65 A75 A93 A101 A121 A143 A152 A173 A192 A201

1
A12 A32 A43 A61 A73 A92 A101 A121 A143 A152 A173 A191 A201

2
A14 A34 A46 A61 A74 A93 A101 A121 A143 A152 A172 A191 A201

3
A11 A32 A42 A61 A74 A93 A103 A122 A143 A153 A173 A191 A201

4
A11 A33 A40 A61 A73 A93 A101 A124 A143 A153 A173 A191 A201

categorical_data_encoded = categorical_data.apply(lambda x: pd.factorize(x)[0]) # pd.factorize即可將分類變數轉換為數值表示
# apply運算將轉換函數應用到每一個變數維度
categorical_data_encoded.head(5)123123

Balance
History
Purpose
Savings
Employment
sexMarried
Guarantors
Assets
concCredit
Apartment
Occupation
hasPhone
Foreign

0
0 0 0 0 0 0 0 0 0 0 0 0 0

1
1 1 0 1 1 1 0 0 0 0 0 1 0

2
2 0 1 1 2 0 0 0 0 0 1 1 0

3
0 1 2 1 2 0 1 1 0 1 0 1 0

4
0 2 3 1 1 0 0 2 0 1 0 1 0

features = pd.concat([numeric_data, categorical_data_encoded], axis=1)#進行數據的合並
features.head()
# 此處也可以選用one-hot編碼來表示分類變數，相應的程序如下：
# features = pd.get_mmies(features_data)
# features.head()1234512345

Duration
Credit amount
instPercent
Residence ration
Age
Credits
Dependents
Balance
History
Purpose
Savings
Employment
sexMarried
Guarantors
Assets
concCredit
Apartment
Occupation
hasPhone
Foreign

0
6 1169 4 4 67 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0

1
48 5951 2 2 22 1 1 1 1 0 1 1 1 0 0 0 0 0 1 0

2
12 2096 2 3 49 1 2 2 0 1 1 2 0 0 0 0 0 1 1 0

3
42 7882 2 4 45 1 2 0 1 2 1 2 0 1 1 0 1 0 1 0

4
24 4870 3 4 53 2 2 0 2 3 1 1 0 0 2 0 1 0 1 0

X = features.values.astype(np.float32) # 轉換數據類型
y = (target.values == 1).astype(np.int32) # 1:good,2:bad1212
from sklearn.cross_validation import train_test_split # sklearn庫中train_test_split函數可實現該劃分

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0) # 參數test_size設置訓練集佔比
1234512345
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score

clf = DecisionTreeClassifier(max_depth=8) # 參數max_depth設置樹最大深度

# 交叉驗證，評價分類器性能，此處選擇的評分標準是ROC曲線下的AUC值，對應AUC更大的分類器效果更好
scores = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
print("ROC AUC Decision Tree: {:.4f} +/-{:.4f}".format(
np.mean(scores), np.std(scores)))123456789123456789
ROC AUC Decision Tree: 0.6866 +/-0.0105

#利用learning curve，以樣本數為橫坐標，訓練和交叉驗證集上的評分為縱坐標，對不同深度的決策樹進行對比（判斷是否存在過擬合或欠擬合）
from sklearn.learning_curve import learning_curve

def plot_learning_curve(estimator, X, y, ylim=(0, 1.1), cv=3,
n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5),
scoring=None):
plt.title("Learning curves for %s" % type(estimator).__name__)
plt.ylim(*ylim); plt.grid()
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, validation_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes,
scoring=scoring)
train_scores_mean = np.mean(train_scores, axis=1)
validation_scores_mean = np.mean(validation_scores, axis=1)

plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, validation_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
print("Best validation score: {:.4f}".format(validation_scores_mean[-1]))2223
clf = DecisionTreeClassifier(max_depth=None)
plot_learning_curve(clf, X_train, y_train, scoring='roc_auc')
# 可以注意到訓練數據和交叉驗證數據的得分有很大的差距，意味著可能過度擬合訓練數據了123123
Best validation score: 0.6310

clf = DecisionTreeClassifier(max_depth=10)
plot_learning_curve(clf, X_train, y_train, scoring='roc_auc')1212
Best validation score: 0.6565

clf = DecisionTreeClassifier(max_depth=8)
plot_learning_curve(clf, X_train, y_train, scoring='roc_auc')1212
Best validation score: 0.6762

clf = DecisionTreeClassifier(max_depth=5)
plot_learning_curve(clf, X_train, y_train, scoring='roc_auc')1212
Best validation score: 0.7219

clf = DecisionTreeClassifier(max_depth=4)
plot_learning_curve(clf, X_train, y_train, scoring='roc_auc')1212
Best validation score: 0.7226

Ⅸ python 決策樹可以調什麼參數

調用這個包：
sklearn.treesklearn(scikit-learn)可以去下載，解壓後放入C:Python27Libsite-packages直接使用。需要用同樣的方法額外下載numpy和scipy包，不然會報錯。

例子：

fromsklearn.datasetsimportload_iris
fromsklearn.model_selectionimportcross_val_score
fromsklearn.
clf=DecisionTreeClassifier(random_state=0)
iris=load_iris()
cross_val_score(clf,iris.data,iris.target,cv=10)

Ⅹ 如何將python生成的決策樹利用graphviz畫出來

#這里有一個示例，你可以看一下。
#http://scikit-learn.org/stable/moles/tree.html
>>>fromIPython.displayimportImage
>>>dot_data=tree.export_graphviz(clf,out_file=None,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True,rounded=True,
special_characters=True)
>>>graph=pydotplus.graph_from_dot_data(dot_data)
>>>Image(graph.create_png())

導航:首頁 > 編程語言 > python決策樹iris

python決策樹iris

與python決策樹iris相關的資料