Study Notes: scikit-learn
Published: 2019-06-25


scikit-learn: machine learning in Python — scikit-learn 0.20.0 documentation

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

scikit-learn - Wikipedia

  • Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Installing scikit-learn — scikit-learn 0.20.2 documentation

Check scikit-learn version

  • import sklearn
  • print(sklearn.__version__)

A Complete Overview of the Machine Learning Package scikit-learn - AI遇见机器学习 (AI Meets Machine Learning)

Get Started with sklearn and Machine Learning in 15 Minutes - Classification Algorithms

A Close Look at Python, Part 8 - Sklearn (盘一盘 Python 系列 8)


  • Identifying which category an object belongs to.

1. Supervised learning — scikit-learn 0.20.2 documentation

  • https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

1.6. Nearest Neighbors — scikit-learn 0.20.2 documentation

  • https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification
  • Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
  • scikit-learn implements two different nearest neighbors classifiers: 
    • KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. 
    • RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
  • Nearest Neighbors Classification — scikit-learn 0.20.2 documentation
    • https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py
    • Sample usage of Nearest Neighbors classification. It will plot the decision boundaries for each class.
  • sklearn.neighbors.KNeighborsClassifier — scikit-learn 0.20.2 documentation
    • https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
    • class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
    • Classifier implementing the k-nearest neighbors vote (a minimal usage sketch follows after this list).
    • fit(X, y)
      • Fit the model using X as training data and y as target values
    • predict(X)
      • Predict the class labels for the provided data
    • score(X, y, sample_weight=None)
      • Returns the mean accuracy on the given test data and labels.
      • In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
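Building on the KNeighborsClassifier API listed above, here is a minimal fit/predict/score sketch. The iris dataset, the 25% test split and the explicitly written default parameters are illustrative assumptions, not part of the original notes:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Hold out part of the bundled iris data for evaluation
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # k = 5 neighbors with uniform weights (the defaults quoted above)
    knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
    knn.fit(X_train, y_train)          # fit(X, y): store the training instances
    y_pred = knn.predict(X_test)       # predict(X): majority vote among the 5 nearest neighbors
    print(knn.score(X_test, y_test))   # score(X, y): mean accuracy on the held-out data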

Classifier comparison — scikit-learn 0.20.2 documentation

  • https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
  • A comparison of several classifiers in scikit-learn on synthetic datasets. The point of this example is to illustrate the nature of decision boundaries of different classifiers. This should be taken with a grain of salt, as the intuition conveyed by these examples does not necessarily carry over to real datasets. (A small sketch combining make_classification with the predict / predict_proba / decision_function APIs follows after this list.)
  • sklearn.datasets.make_classification — scikit-learn 0.20.2 documentation
    • https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn-datasets-make-classification
    • sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    • Generate a random n-class classification problem.
  • Decision boundary - Wikipedia
    • https://en.wikipedia.org/wiki/Decision_boundary
    • In a statistical-classification problem with two classes, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.
    • A decision boundary is the region of a problem space in which the output label of a classifier is ambiguous.
    • If the decision surface is a hyperplane, then the classification problem is linear, and the classes are linearly separable.
    • Decision boundaries are not always clear cut. That is, the transition from one class in the feature space to another is not discontinuous, but gradual. This effect is common in fuzzy logic based classification algorithms, where membership in one class or another is ambiguous.
  • What is the difference between decision_function, predict_proba, and predict function for logistic regression problem? - Cross Validated
    • https://stats.stackexchange.com/questions/329857/what-is-the-difference-between-decision-function-predict-proba-and-predict-fun
  • What's the difference between predict_proba and decision_function in scikit-learn? - Stack Overflow
    • https://stackoverflow.com/questions/36543137/whats-the-difference-between-predict-proba-and-decision-function-in-scikit-lear
  • python - Scikit Learn SVC decision_function and predict - Stack Overflow
    • https://stackoverflow.com/questions/20113206/scikit-learn-svc-decision-function-and-predict
  • machine learning - Negative decision_function values - Stack Overflow
    • https://stackoverflow.com/questions/46820154/negative-decision-function-values
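As a rough illustration of the links above, the sketch below builds a synthetic dataset with make_classification and compares decision_function, predict_proba and predict on a fitted classifier. The choice of LogisticRegression and all parameter values are assumptions for illustration only:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic two-class problem, in the spirit of the classifier-comparison example
    X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                               n_redundant=0, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    clf = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

    print(clf.decision_function(X_test[:3]))  # signed distance to the decision boundary; can be negative
    print(clf.predict_proba(X_test[:3]))      # class probabilities derived from that score
    print(clf.predict(X_test[:3]))            # hard labels, i.e. thresholding / argmax of the above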

  • Comparing, validating and choosing parameters and models.

3. Model selection and evaluation — scikit-learn 0.20.3 documentation

  • https://scikit-learn.org/stable/model_selection.html#model-selection

3.1. Cross-validation: evaluating estimator performance — scikit-learn 0.20.3 documentation

  • https://scikit-learn.org/stable/modules/cross_validation.html
  • sklearn.model_selection.train_test_split — scikit-learn 0.20.2 documentation
    • https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    • sklearn.model_selection.train_test_split(*arrays, **options)
    • Split arrays or matrices into random train and test subsets
    • Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a one-liner. (A minimal usage sketch follows below.)
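A minimal train_test_split sketch; the toy arrays, test_size and random_state below are assumptions for illustration:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
    y = np.arange(10)

    # Hold out 30% of the samples; fixing random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)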

3.3. Model evaluation: quantifying the quality of predictions — scikit-learn 0.20.3 documentation

  • https://scikit-learn.org/stable/modules/model_evaluation.html
  • There are 3 different APIs for evaluating the quality of a model’s predictions:
    • Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. This is not discussed on this page, but in each estimator’s documentation.
    • Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section The scoring parameter: defining model evaluation rules.
    • Metric functions: The metrics module implements functions assessing prediction error for specific purposes. These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics.
  • Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions. (A consolidated sketch of the main classification metrics appears after this list.)
  • 3.3.2. Classification metrics
    • 3.3.2.2. Accuracy score
      • sklearn.metrics.accuracy_score — scikit-learn 0.20.2 documentation
        • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html  
        • sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
        • Accuracy classification score.
        • In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
    • 3.3.2.5. Confusion matrix 
      • sklearn.metrics.confusion_matrix — scikit-learn 0.20.2 documentation
        • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html  
        • sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
        • Compute confusion matrix to evaluate the accuracy of a classification
    • 3.3.2.6. Classification report
      • The classification_report function builds a text report showing the main classification metrics.
    • 3.3.2.9. Precision, recall and F-measures
      • https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures
      • Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.
        • sklearn.metrics.precision_score — scikit-learn 0.20.3 documentation
          • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn-metrics-precision-score
          • sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
          • Compute the precision
          • The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
          • The best value is 1 and the worst value is 0.
        • sklearn.metrics.recall_score — scikit-learn 0.20.3 documentation
          • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn-metrics-recall-score
          • sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
          • Compute the recall
          • The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
          • The best value is 1 and the worst value is 0.  
      • The F-measures (Fβ and F1 measures) can be interpreted as a weighted harmonic mean of the precision and recall. An Fβ measure reaches its best value at 1 and its worst score at 0. With β=1, Fβ and F1 are equivalent, and the recall and the precision are equally important.
        • sklearn.metrics.f1_score — scikit-learn 0.20.3 documentation
          • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn-metrics-f1-score
          • sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
          • Compute the F1 score, also known as balanced F-score or F-measure
          • The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:
            • F1 = 2 * (precision * recall) / (precision + recall)
          • In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter.
      • The precision_recall_curve function computes a precision-recall curve from the ground truth label and a score given by the classifier by varying a decision threshold.
        • sklearn.metrics.precision_recall_curve — scikit-learn 0.20.3 documentation
          • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve
          • sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)
      • The average_precision_score function computes the average precision (AP) from prediction scores. The value is between 0 and 1 and higher is better. AP is defined as
        • AP = ∑n (Rn − Rn−1) Pn
        • where Pn and Rn are the precision and recall at the nth threshold. With random predictions, the AP is the fraction of positive samples. 
        • sklearn.metrics.average_precision_score — scikit-learn 0.20.3 documentation
          • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score
          • sklearn.metrics.average_precision_score(y_true, y_score, average=’macro’, pos_label=1, sample_weight=None)
          • Area under the precision-recall curve
      • Precision-Recall — scikit-learn 0.20.3 documentation
        • https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html  
        • Example of Precision-Recall metric to evaluate classifier output quality.
        • Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned.
        • The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).
    • 3.3.2.13. Receiver operating characteristic (ROC)
      • https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc
      • The function roc_curve computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia:
        • “A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.”
      • This function requires the true binary value and the target scores, which can either be probability estimates of the positive class, confidence values, or binary decisions.
        • sklearn.metrics.roc_curve — scikit-learn 0.20.3 documentation
          • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
          • sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
      • The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the ROC curve, the curve information is summarized in one number. For more information see the Wikipedia article on AUC.
        • sklearn.metrics.roc_auc_score — scikit-learn 0.20.3 documentation
          • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
          • sklearn.metrics.roc_auc_score(y_true, y_score, average=’macro’, sample_weight=None, max_fpr=None)  
      • Receiver Operating Characteristic (ROC) — scikit-learn 0.20.3 documentation
        • https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
        • Example of Receiver Operating Characteristic (ROC) metric to evaluate classifier output quality.
        • ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.
        • The “steepness” of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing the false positive rate.  
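As promised above, here is a consolidated sketch of the classification metrics collected in this section (accuracy, confusion matrix, classification report, precision/recall/F1, average precision, ROC AUC). The toy labels and scores are made-up assumptions, only meant to show the call signatures:

    from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                                 precision_score, recall_score, f1_score,
                                 precision_recall_curve, average_precision_score,
                                 roc_curve, roc_auc_score)

    # Toy ground truth, hard predictions, and positive-class scores
    y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
    y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]
    y_score = [0.1, 0.6, 0.8, 0.9, 0.3, 0.4, 0.7, 0.2]

    print(accuracy_score(y_true, y_pred))                   # fraction of exactly matching labels
    print(confusion_matrix(y_true, y_pred, labels=[0, 1]))  # rows = true class, columns = predicted class
    print(classification_report(y_true, y_pred))            # text report of the main metrics

    print(precision_score(y_true, y_pred))   # tp / (tp + fp)
    print(recall_score(y_true, y_pred))      # tp / (tp + fn)
    print(f1_score(y_true, y_pred))          # 2 * precision * recall / (precision + recall)

    # Threshold-based metrics take the scores, not the hard labels
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    print(average_precision_score(y_true, y_score))  # AP = sum over thresholds of (Rn - Rn-1) * Pn

    fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
    print(roc_auc_score(y_true, y_score))            # area under the ROC curve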

  • Feature extraction and normalization.

4.3. Preprocessing data — scikit-learn 0.20.2 documentation

  • https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
  • 4.3.1.1. Scaling features to a range
    • https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range
    • An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively. (A minimal MinMaxScaler sketch follows after this list.)
    • sklearn.preprocessing.MinMaxScaler — scikit-learn 0.20.3 documentation
      • https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn-preprocessing-minmaxscaler
      • class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
      • Transforms features by scaling each feature to a given range.
      • This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
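A minimal MinMaxScaler sketch, as mentioned above; the training matrix and the probe sample are assumptions for illustration:

    from sklearn.preprocessing import MinMaxScaler

    X_train = [[1.0, -5.0],
               [3.0,  0.0],
               [5.0,  5.0]]

    scaler = MinMaxScaler(feature_range=(0, 1))
    print(scaler.fit_transform(X_train))   # each column is rescaled to span exactly [0, 1]

    # transform reuses the per-feature minimum and range learned from the training set
    print(scaler.transform([[2.0, 0.0]]))  # -> [[0.25 0.5]]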

4.4. Imputation of missing values — scikit-learn 0.20.2 documentation

  • https://scikit-learn.org/stable/modules/impute.html
  • sklearn.impute.SimpleImputer — scikit-learn 0.20.2 documentation
    • https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn-impute-simpleimputer
    • class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)
    • Imputation transformer for completing missing values (a minimal usage sketch follows below).
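A minimal SimpleImputer sketch using the default mean strategy quoted above; the sample matrix is an assumption for illustration:

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])

    # Replace each NaN with the mean of its column (strategy='mean' is the default)
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    print(imputer.fit_transform(X))   # -> [[1. 2.], [4. 3.], [7. 2.5]]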

How to fix ValueError: need more than 1 value to unpack when calling tn, fp, fn, tp = confusion_matrix(y_actual, y_predict).ravel()?

  • To force it to output both classes even when one of them is not predicted, pass the labels parameter.
    • tn, fp, fn, tp = confusion_matrix(y_actual, y_predict, labels=[0,1]).ravel()
  • python - How to make sklearn.metrics.confusion_matrix() to always return TP, TN, FP, FN? - Stack Overflow
    • https://stackoverflow.com/questions/46229965/how-to-make-sklearn-metrics-confusion-matrix-to-always-return-tp-tn-fp-fn
    • y_actual, y_predict = [0,0,0,0], [0,0,0,0]
    • confusion_matrix(y_actual, y_predict, labels=[0,1])
    • >> array([[4, 0], [0, 0]])
    • tn, fp, fn, tp = confusion_matrix(y_actual, y_predict, labels=[0,1]).ravel()
    • >> tn, fp, fn, tp == 4, 0, 0, 0

Reposted from: https://www.cnblogs.com/pegasus923/p/9997485.html
