Python Machine Learning Programming > Keywords
Python Machine Learning Programming by Sebastian Raschka et al.
I started reading (2016/10/29)
These notes pick up only keywords.
Chapter 1 Giving Computers the Ability to Learn from Data
--Supervised learning (p2)
--Unsupervised learning (p2)
--Reinforcement learning (p2)
--Supervised learning, regression, output: continuous value (p3)
--Negative class (p4)
--Positive class (p4)
--Decision boundary (p4)
--Predictor variable (p5)
--Response variable (p5)
--Explanatory variable (p5)
--Result variable (outcome) (p5)
--Reinforcement learning, Goal (p6)
--Environment (p6)
--Agent (p6)
--Reward (p6)
--Clustering (p7)
--Dimensionality reduction (p8)
- dimensionality reduction
- dimension reduction
--Unsupervised dimensionality reduction, features, preprocessing (p8)
--Iris dataset (p9)
- $X \in \mathbb{R}^{150 \times 4}$: set of real numbers, 150x4 matrix (p9)
- $x^{(i)}$: the i-th training sample (p10)
- $x_j$: the j-th dimension of the training dataset (p10)
- $\boldsymbol{x}$: vector (lowercase bold) (p10)
- $\boldsymbol{X}$: matrix (uppercase bold) (p10)
- $x$: a single element of a vector or matrix (italic) (p10)
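-(Supplement) A minimal sketch of the notation above, using scikit-learn's bundled Iris data (the book loads it from UCI; this loader is just a convenient stand-in):

```python
# Load the Iris data to see the 150x4 feature matrix and the notation in practice.
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data              # X in R^{150x4}: 150 samples, 4 features
x_50 = X[49]               # x^(50): the 50th training sample (a row vector)
x_1 = X[:, 0]              # x_1: the first feature column (sepal length)
print(X.shape)             # (150, 4)
```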
--Predictive modeling (p10)
--Preprocessing is the most (p11)
--Depending on the extracted features, certain overlap may be observed due to the high correlation. In such a case (p11)
--David Wolpert, "No Free Lunch Theorem" (p12)
--To address this issue (p12)
--Model generalization performance (p12)
--Hyperparameter optimization (p12)
--Generalization error (p13)
--NumPy, SciPy, Fortran, C, Implementation (p13)
--Differences between Python 3.4 and Python 2.7, Summary (p13)
- pandas (p14)
- matplotlib (p14)
Chapter 2 Classification Problems-Training Machine Learning Algorithms
- ADALINE (Adaptive Linear Neuron) (p17)
- scikit-learn (p17)
--MCP Neuron (McCulloch-Pitts Neuron) (p17)
- Warren McCulloch
- Walter Pitts
--Frank Rosenblatt, Perceptron, Learning Rules, Algorithms (p18)
--Two classes (p18)
- 1 (positive class)
- -1 (negative class)
--Total input (net input) (p18)
- $\theta$: Threshold (p19)
--Unit step function (p19)
--Heaviside step function
- $\hat{y}$: Output value (p21)
- $\eta$: Learning rate (a constant between 0.0 and 1.0) (p21)
--fit method (p24)
--predict method (p24)
--Underscore (e.g. self.w_): by convention, the xxx attribute has xxx (p24)
--Review (p24)
- http://wiki.scipy.org/Tentative_NumPy_Tutorial
- http://pandas.pydata.org/pandas-docs/stable/tutorials.html
- http://matplotlib.org/users/beginner.html
for _ in range(self.n_iter):
(p24)
-(Supplement) It seems to be written when loop variables are not used.
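-(Supplement) A rough sketch of the fit-loop pattern above (not the book's full implementation): attributes created inside fit get a trailing underscore, and `_` names a loop counter that is never read.

```python
import numpy as np

class PerceptronSketch(object):
    """Illustrative perceptron skeleton; hyperparameters are toy defaults."""
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])   # trailing underscore: set during fit
        self.errors_ = []
        for _ in range(self.n_iter):         # the counter itself is unused
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, -1)
```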
--One-vs-All: OvA method (p27)
--UCI Machine Learning Repository, Iris Dataset (p27)
--plt.scatter (omitted because it is long) (p28)
markers = ('s', 'x', 'o', '^', 'v')
(p30)
--Numpy's meshgrid function (p31)
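-(Supplement) A small sketch of the meshgrid idea: build a grid over the feature space, classify every grid point, and reshape the predictions back to the grid shape (the classifier is assumed to be already fitted).

```python
import numpy as np

x1_min, x1_max, x2_min, x2_max = -1.0, 1.0, -1.0, 1.0
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
grid = np.array([xx1.ravel(), xx2.ravel()]).T   # one row per grid point
# Z = classifier.predict(grid).reshape(xx1.shape)  # then plot with contourf
```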
--The learning rules of Perceptron converge (p32)
--Linear hyperplane (p32)
- ADALINE (ADAptive LInear NEuron) (p32)
- Bernard Widrow, Tedd Hoff
--Can be regarded as xxx
--The main difference is (p32)
--ADALINE learning rules
--Rosenblatt's Perceptron
--Identity function (p32)
--Widrow-Hoff rule (p32)
--Quantizer (p32)
--Objective function (p33)
--Cost function (p33)
--ADALINE, cost function J (p33)
--Sum of Squared Error (SSE) (p33)
--The main advantage of the continuous value linear activation function is (p33)
--Another feature of this cost function is (p33)
--Unit step function, definition formula (p33)
--Partial derivative of the cost function of the error sum of squares for the jth weight, equation transformation (p34)
--"Batch" gradient descent (p35)
self.cost_ = []
(p35)
-(Supplement) ~~This meaning has not been learned~~
-(Supplement) I understand that I will make an empty list
--self.w_[1:] += (omitted)
(p36)
-(Supplement) Operates on the weights from index 1 onward
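-(Supplement) A hedged sketch of the batch update the snippet above belongs to: `w[1:]` are the feature weights, `w[0]` is the bias, and the SSE cost is what gets appended to `cost_` each epoch (the helper name is mine, not the book's).

```python
import numpy as np

def adaline_batch_step(w, X, y, eta=0.01):
    """One batch gradient-descent step for an ADALINE-style model."""
    output = np.dot(X, w[1:]) + w[0]       # identity activation: the net input
    errors = y - output
    w[1:] += eta * X.T.dot(errors)         # gradient step for the feature weights
    w[0] += eta * errors.sum()             # gradient step for the bias weight
    cost = (errors ** 2).sum() / 2.0       # sum-of-squared-errors cost for this epoch
    return w, cost
```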
--The value of the hyperparameter that optimizes the performance of the classification model is (p37).
--Two types of problems (p37)
--Scaling method, standardization (p38)
--What are its characteristics?
--Equation (p39)
--Numpy mean method, std method (p39)
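-(Supplement) The standardization step with NumPy's mean and std methods, as a tiny sketch on toy data:

```python
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_std = np.copy(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()
```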
--Stochastic gradient descent (p40)
--Iterative gradient descent (p40)
--On-line gradient descent (p40)
--The stochastic gradient descent method can be regarded as xxx (p40)
--Shuffle training data to avoid circulation (p41)
--Adaptive learning rate (p41)
--Another advantage of stochastic gradient descent (p41)
--Online learning (p41)
--This is especially useful (p41)
--Also, you will be able to xxx (p41)
--Since it can be xxx, the calculation efficiency of the learning algorithm can be further improved (p41).
--Option to shuffle training data before each epoch (p41)
--The _shuffle method used in the AdalineSGD classifier produces xxx (p42)
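-(Supplement) The shuffling idea boils down to drawing a random permutation of the row indices before each epoch; a minimal sketch (the function name is mine):

```python
import numpy as np

def shuffle_data(X, y, seed=1):
    """Return X and y reordered by the same random permutation."""
    rgen = np.random.RandomState(seed)
    r = rgen.permutation(len(y))
    return X[r], y[r]
```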
Chapter 3 Classification Problems-Using the Machine Learning Library scikit-learn
--Training of machine learning algorithms, 5 major steps (p48)
-(Supplement) Is it a mistranslation of "algorithm training?"
--The scikit-learn library has xxx as well as xxx (p48)
np.unique(y)
(p49)
--Using the StandardScaler class of scikit-learn's preprocessing module (p50)
--Call the transform method (p50)
--Be careful here: this is (p50)
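-(Supplement) A short sketch of that caution: fit StandardScaler on the training data only, then reuse the same mean/std for both sets (cross_validation is the module name in the book's scikit-learn version; newer releases use model_selection).

```python
from sklearn import datasets
from sklearn.cross_validation import train_test_split  # model_selection in newer versions
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, [2, 3]], iris.target, test_size=0.3, random_state=0)

sc = StandardScaler()
sc.fit(X_train)                      # learn mean and std from the training set only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)    # same parameters, so the two sets stay comparable
```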
--This can be xxx because it uses the one-to-many (OvR) method (p50)
--Using the random_state parameter to allow xxx (p51)
--Model accuracy rate, error classification rate (p51)
--Overfitting means xxx (p51)
--Example of using numpy mesh grid (p52)
--For datasets that do not allow perfect linear separation (p53)
--Even if the classes cannot be completely linearly separated (p53)
--Perceptron learning rules, the biggest problem (p54)
--Logistic regression (p54)
--Logistic regression, high performance is demonstrated (p54)
--Odds ratio; the odds ratio is (p54)
- $p$: Represents the probability of the positive event (p54)
-(Supplement) TODO: Definition of "positive event". Examine other than the following explanation
--The positive event is (p54)
--Logit function, expression (p54)
--Logarithmic odds, equation (p54)
--Logistic function, expression (p55)
--Sigmoid function (p55)
--Implementation of ADALINE, identity function (p56)
--Output and interpretation of sigmoid function (p57)
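-(Supplement) The sigmoid function as a one-liner; its output in (0, 1) is read as the probability of the positive class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5: a net input of 0 maps to probability 0.5
```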
--Likelihood $L$, definition, equation $L(w)$ (p58)
--In fact, xxx is easy (p58)
--Log likelihood function, equation $l(w)$ (p58)
--Applying a logarithmic function reduces the likelihood of xxx (p58)
--Cost function J (p58)
--If you want to implement logistic regression yourself (p59)
--sklearn.linear_model.LogisticRegression class (p59)
--"What is this mysterious parameter C?" (P60)
--predict_proba method (p61)
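-(Supplement) A usage sketch tying the two items above together: C is the inverse regularization strength, and predict_proba returns per-class probabilities (the values here are toy choices).

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(iris.data, iris.target)
print(lr.predict_proba(iris.data[:1]))   # one row of class-membership probabilities
```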
--Partial derivative of the log-likelihood function with respect to the j-th weight, equation (p61)
--Partial derivative of sigmoid function (p61)
--"High variance" (p62)
--Underfitting (p63)
-(Supplement) Is it 7of9?
--"High bias" (p63)
--What is Variance (p63)
--If the variance is large (p63)
--In contrast, what is bias (p63)
--As one of the ways to find the trade-off between bias and variance (p64)
--Regularization (p64)
--Collinearity (p64)
--What is collinearity (p64)
--The idea behind regularization is (p64)
--Most general regularization (p64)
--L2 regularization (p64)
--L2 Shrinkage
--Weight decay
-(Supplement) @ Deep Learning by Takayuki Okatani
- $\lambda$: Regularization parameter (p64)
--The generalization error of the model is decomposed as follows (p64)
--Bias quantifies xxx (p64)
--Only xxx is required to apply regularization (p65)
- $C$: Implemented in scikit-learn's LogisticRegression class (p65)
--Directly related to the regularization parameter $\lambda$, expression
--Inverse regularization parameter $C$, to reduce (p65)
--To visualize the strength of regularization (p65)
--Support Vector Machine (SVM) (p66)
-(Supplement) Not Separation of Variable Method
--Can be regarded as SVM, (p66)
--SVM, purpose of optimization (p66)
--SVM, Margin, Definition (p66)
--Hyperplane (decision boundary) (p66)
--Support vector, illustrated (p67)
--Models with small margins, tend to fall into xxx (p67)
--Positive and negative hyperplanes (p67)
--Hyperplane, equation (p67)
--Vector length, equation (p67)
--Left side of expression, interpretation (p68)
--Two equations (3.4.6), are shown (p68)
--Simply put, equation (3.4.7) (p68)
--xxx is easy (p68)
--Quadratic programming (p68)
--By Vladimir Vapnik
--Christopher J.C. Burges's paper
--Slack variable $\xi$ (p68)
--1995, Vladimir Vapnik
--Soft margin classification
--Slack variable, because it was needed (p68)
-If the value of $C$ is large, it means xxx, which means xxx (p69)
- $C$, can be adjusted (p69)
-When the value of $\lambda$ becomes large (p69)
--Logistic regression tries to maximize xxx (p70)
--Therefore, it becomes more susceptible to xxx
--scikit-learn, LogisticRegression class, LIBLINEAR library (p71)
--scikit-learn, SVM training, SVC class, LIBSVM library (p71)
--Computer memory (p71)
--SGDClassifier class, alternative implementation (p71)
--SVM, popularity, reason (p71)
--Kernel SVM (kernel SVM) (p71)
X_xor = np.random.randn(200, 2)
(p72)
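-(Supplement) The XOR toy data is typically completed along these lines (a sketch, not necessarily the book's exact code): label each point by the XOR of the signs of its two coordinates.

```python
import numpy as np

np.random.seed(0)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)   # map True/False to the +1/-1 class labels
```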
--Projection function $\phi(\cdot)$, high dimension, linear separation (p73)
--Separate classes (p73)
--Projection method, problem (p74)
--Kernel trick (p74)
-(Supplement) [Related article](http://qiita.com/kilometer/items/66e6116cc661019ead59)
--Radial Basis Function kernel (p74)
-(Supplement) Is it related to Vector Spherical Harmonics?
-(Supplement) Expanded in Vector Spherical Harmonics
- $\gamma$: Expression, hyperparameter to be optimized (p74)
--Kernel, Interpretation (p74)
--Kernel, minus sign (p74)
--1 (exactly the same sample) (p74)
--0 (a completely different sample) (p74)
- $\gamma$: Kernel function, cutoff (p75)
-(Supplement) Remember the cutoff frequency that appears in the electric circuit
-(Supplement) Article, 3dB
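-(Supplement) A small usage sketch connecting $\gamma$ to scikit-learn's SVC: a larger gamma tightens the RBF "cutoff" and produces a more tightly fitted boundary (the gamma and C values here are arbitrary).

```python
from sklearn.svm import SVC

svm = SVC(kernel='rbf', gamma=0.10, C=10.0, random_state=0)
# svm.fit(X_xor, y_xor)   # e.g. on the XOR data from the earlier sketch
```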
--Decision tree classifier (p77)
--Interpretability (p77)
--Decision tree, can be thought of as xxx (p77)
--Decision tree, information gain (p78)
--Information gain (decrease in xxx) (p78)
--Root (root) (p78)
--Leaf (p78)
--To divide the node by the feature with the highest information gain (p78)
--Information gain, equation $IG(D_p, f)$ (p78)
- $f$: Feature used for the split (p78)
- $D_p$: Parent dataset (p78)
- $D_j$: Dataset of the j-th child node (p78)
- $I$: Impurity (p79)
- $N_p$: Total number of samples at the parent node (p79)
- $N_j$: Number of samples at the j-th child node (p79)
--Thus, the information gain is only xxx (p79)
- $D_{left}$, $D_{right}$ (p79)
--Binary tree, impure index or division condition (p79)
--Gini impurity (p79)
--Entropy (p79)
-(Supplement) Ricci flow by Grisha (Perelman)
--Classification error
- $I_E$: Classification error (p79)
- $I_H$: (Supplement) I don't know what the formula is (p79)
- $p(i=1|t)$, $p(i=0|t)$ (p79)
--The entropy is 1 in binary classification (p79)
--The maximum of the Gini impurity is (p79)
--Another indicator of impureness, classification error (p80)
-Equation using $I_E$ (p80)
- $D_p$: Let's look at the parent node's dataset (p80)
--Information gain (difference between the impurity of the parent node and the sum of the impurities of the child nodes) (p80)
-(Supplement) After that, about 15 related expressions continue
--To visually compare the above three impurity measures (p82)
--Add xxx to confirm that Gini impurity lies somewhere between entropy and classification error (p82)
--
# Entropy (two variants), Gini impurity, and classification error are each plotted in a loop
(p83)
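-(Supplement) A sketch of those three impurity measures for a binary node as functions of $p(i=1|t)$ (close in spirit to the book's comparison, but written from scratch here):

```python
import numpy as np

def gini(p):
    return p * (1 - p) + (1 - p) * (1 - (1 - p))

def entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def error(p):
    return 1 - np.max([p, 1 - p])

x = np.arange(0.0, 1.0, 0.01)
ent = [entropy(p) if p != 0 else None for p in x]    # entropy
sc_ent = [e * 0.5 if e else None for e in ent]       # entropy scaled by 0.5 (the second variant)
gi = [gini(p) for p in x]
err = [error(p) for p in x]
# plotting these against x shows Gini impurity between the scaled entropy and the error
```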
--Decision tree, overfitting (p84)
--Feature scaling, decision tree (p84)
--As a decision tree peculiar (p84)
--scikit-learn, post-training decision tree, export (p85)
- GraphViz (p85)
--Random forest, features (p86)
--Random Forest, intuitively (p86)
--The idea behind ensemble learning (p86)
--Weak learning algorithm, strong learning algorithm (p86)
--Generalization error, overfitting (p86)
--Random Forest Algorithm, 4 Steps (p86)
--Sampling without replacement (p87)
--Majority vote, assign class label (p87)
--Random forest, advantages (p87)
--No need to xxx (p87)
--Can be optimized (p87)
--Bootstrap sample size (p87)
--Scikit-learn, RandomForestClassifier implementation (p87)
- $d$: Number of features used for each split (p87)
--Total number of features in the training dataset (p87)
- $d = \sqrt{m}$ (p87)
- $m$: Number of features in the training dataset (p87)
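-(Supplement) A usage sketch for those defaults: max_features='sqrt' asks for $d = \sqrt{m}$ features per split (the hyperparameter values are illustrative).

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='entropy', n_estimators=10,
                                max_features='sqrt', random_state=1, n_jobs=2)
# forest.fit(X_train, y_train)
```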
--This allows you to xxx (p88)
--k-nearest neighbor classifier (p89)
- KNN
--KNN, lazy learner (p89)
--What is called "laziness" (p89)
--Parametric model, nonparametric model (p89)
--Perceptron, Logistic Regression, Linear SVM (p89)
--Decision Tree / Random Forest, Kernel SVM (p89)
--Instance-base learning (p89)
--Remember the training dataset (p89)
--The main advantages of the memory-based approach (p90)
--When the majority vote is the same (p91)
--In the implementation of scikit-learn's KNN algorithm
--Euclidean distance (p91)
--Minkowski distance (p91)
--Manhattan distance (p91)
--Minkowski distance, equation (p91)
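-(Supplement) A sketch of those distance settings: with metric='minkowski', p=2 gives the Euclidean distance and p=1 the Manhattan distance.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
# knn.fit(X_train_std, y_train)
```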
--Curse of dimensionality (p92)
--Curse of dimensionality, representing the xxx phenomenon (p92)
--By using xxx, you can escape from the curse of dimensionality (p92)
Chapter 4 Data Preprocessing-Building a Better Training Set
--Missing value (p93)
--Blank in data table (p93)
- NaN (Not a Number) (p93)
--Placeholder (provisional) string (p93)
--Ignoring missing values (p93)
--
# If you are using Python 2.7, you need to convert the string to unicode
(p94)
--StringIO function, when used (p94)
--Using the isnull method (p94)
--Data preprocessing, pandas DataFrame class (p95)
--DataFrame object, values attribute (p95)
df.dropna()
(p95)
--If you set the axis argument to 1 (p95)
df.dropna(how='all')
(p95)
df.dropna(thresh=4)
(p95)
df.dropna(subset=['C'])
(p95)
--Deletion of missing data, problem (p96)
--Interpolation technique (p96)
--Mean imputation (p96)
-(Supplement) Isn't "complement" a mistake of "interpolation"?
--scikit-learn Imputer class (p96)
--strategy argument
- median
- most_frequent
--most_frequent is useful for xxx (p96)
--So-called transformer class (p96)
--Transformer, fit, transform (p96)
--Transformer, fit method is (p96)
--Transformer, transform method is (p96)
--Estimator, predict method (p97)
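-(Supplement) A hedged sketch of the transformer pattern above, using the Imputer class from the book's scikit-learn version (newer releases replace it with SimpleImputer):

```python
import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0]])
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr.fit(X)                    # fit: learn the column means
imputed = imr.transform(X)    # transform: fill in the missing values
```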
--Category data, nominal features (p98)
--Category data, ordinal features (p98)
--Order features, example (p98)
--Numerical features (p98)
--Class label (p98)
--Category string, convert to integer, required (p99)
--Dictionary for reverse mapping inv_size_mapping (p99)
--Many machine learning libraries, requesting xxx (p99)
--To revert the converted class label to its original string representation (p100)
--LabelEncoder, a convenient class implemented directly in scikit-learn (p100)
--One of the most common mistakes in processing categorical data (p101)
--Avoid xxx problems, one-hot encoding (p101)
--Dummy feature (p101)
--scikit-learn, OneHotEncoder class (p101)
--OneHotEncoder class returns a sparse matrix when xxx (p102)
--get_dummies function implemented in pandas (p102)
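-(Supplement) A minimal sketch of one-hot encoding a nominal feature with pandas get_dummies, which only converts the string columns:

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'], 'size': [1, 2, 3]})
print(pd.get_dummies(df))   # 'color' becomes three 0/1 dummy columns; 'size' is left alone
```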
--Wine dataset (p102)
- UCI Machine Learning Repository (p102)
-(Supplement) http://archive.ics.uci.edu/ml/
--Randomly split into test and training datasets (p104)
--train_test_split function (p104)
--scikit-learn, cross_validation module (p104)
--Dataset, Split, Attention (p104)
--Accuracy of generalization error estimation, trade-off (p104)
--xxx would be good (p104)
--Feature scaling (p105)
--Decision tree and random forest, without xxx (p105)
--Most of xxx, works much better with xxx (p105)
--Feature Scaling, Importance (p105)
--Scale (p105)
--Normalization
--Standardization
--Normalization, meaning xxx (p105)
--xxx special case (p105)
- $x_{norm}^{(i)}$: New value for sample $x^{(i)}$, equation (p105)
--min-max scaling, scikit-learn (p105)
--Bounded section (within a certain range) (p106)
--Useful for normalization by min-max scaling (p106)
--xxx may be more practical, reason (p106)
--Many linear models, including xxx, are xxx (p106)
--When using standardization (p106)
--Standardization procedure, equation (p106)
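-(Supplement) A short sketch of both scaling options on a toy column: min-max scaling maps to [0, 1], standardization centers to mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
print(MinMaxScaler().fit_transform(X).ravel())    # [0.   0.25 0.5  0.75 1.  ]
print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1
```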
--Overfitting (p107)
--Overfitting, cause (p107)
--General methods for reducing generalization error (p107)
--L2 regularization, equation (p107)
--L1 regularization, equation (p107)
--Returned by L1 regularization, (p107)
--L1 regularization, how to promote sparseness (p108)
--Regularization, Geometric Interpretation (p108)
--Regularization, think as follows (p108)
--Regularization parameter $\lambda$, by strengthening (p108)
--L2 Penalty Term Concept, Illustrated (p108)
--Here xxx cannot exceed xxx (p109)
--On the other hand, I want to minimize xxx (p109)
--The goal here is (p109)
--If there is no xxx, it can be understood as xxx (p109)
--L1 regularization, sparseness (p109)
--Similar to xxx. However, xxx (p109)
--The term L2 is xxx (p109)
--Rhombus (p109)
--L1 diamond (p110)
--The optimization condition is likely to be in xxx (p110)
--Why L1 regularization leads to sparse solutions (p110)
--Trevor Hastie et al. "The Elements of Statistical Learning" Section 3.4
--scikit-learn, L1 regularization (p110)
--penalty argument
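-(Supplement) A usage sketch for the penalty argument: L1-regularized logistic regression yields a sparse weight vector and so doubles as a feature selector (the C value is arbitrary).

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
# lr.fit(X_train_std, y_train); many entries of lr.coef_ end up exactly zero
```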
--The regularization path is (p112)
--Dimension reduction by feature selection (p113)
--Dimensionality reduction (p113)
--Feature selection
--Feature extraction
--For feature selection (p113)
--In feature extraction (p113)
--Typical feature selection algorithm (p113)
--Sequential selection algorithm (p113)
--Greedy search (p113)
--d dimension, k dimension (k <d) (p113)
--Feature selection algorithm, two purposes (p113)
--The latter is useful for xxx (p114)
--Sequential Backward Selection (SBS) (p114)
--SBS, Purpose (p114)
--Exhaustive search algorithm (p114)
--Not xxx in terms of xxx (p114)
--SBS, Algorithm, 4 Simple Steps (p114)
--Let's implement it with SBS, Python (p115)
--Features, subsets, classification problems, estimators (p116)
--In the while loop of the fit method, it is reduced to xxx (p116)
--Test dataset, training dataset, split (p117)
--To prevent the original test dataset from becoming part of the training dataset (p117)
--Because the number of features has been reduced (p117)
--KNN algorithm, curse of dimensionality (p117)
--Various feature selection methods, comprehensive explanation (p119)
- http://scikit-learn.org/stable/modules/feature_selection.html
--L1 Logistic regression with regularization, irrelevant features, SBS algorithm, feature selection (p119)
--Feature selection, random forest (p119)
--Random Forest, Ensemble Method (p119)
--xxx Even without making assumptions (p119)
indices = np.argsort(importance)[::-1]
(p120)
--n_jobs=-1, all cores (p120)
--Random Forest, Note xxx, Important (p120)
--L1 regularization, useful for xxx (p122)
--Sequential feature selection algorithm, SBS (p122)
Chapter 10 Regression Analysis - Prediction of Objective Variables with Continuous Values
--Regression analysis (p265)
--Explanatory variable, objective variable, figure (p266)
--Regression line (p266)
--Offset, residual (p266)
--Simple linear regression (p266)
--Multiple linear regression (p266)
--Housing dataset (p267)
- UCI Machine Learning Repository
--MEDV: Median Home Prices (p267)
--pandas DataFrame object (p267)
--TODO: Learning pandas
--Exploratory Data Analysis (EDA) (p268)
--Recommended as EDA, xxx (p268)
--Relationship between outliers, data distribution, and features (p268)
--Scatter plot matrix, xxx can be visualized (p268)
--Scatterplot matrix, pairplot function of seaborn library (p268)
pip install seaborn
(p268)
--xxx changes when importing seaborn library (p269)
--RM (Average number of rooms per unit) (p270)
--In contrast to popular belief, xxx is not necessary (p270)
--Correlation matrix (p270)
--Correlation matrix, covariance matrix, intuitively (p270)
--Pearson product-moment correlation coefficient, square matrix (p270)
--Pearson's r (p270)
--Correlation coefficient, range (p270)
--Positive correlation, negative correlation (p270)
- r = 0 (p270)
--Pearson's product moment correlation coefficient, equation (p270)
- $\mu$: Sample mean of the corresponding feature (p270)
- $\sigma_{xy}$: Covariance between features $x$ and $y$
- $\sigma_x$ and $\sigma_y$: Standard deviations of the features
--Pearson's product-moment correlation coefficient, covariance, standard deviation product (p270)
--NumPy corrcoef function (p271)
--seaborn heatmap function (p271)
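-(Supplement) A sketch of those two functions on stand-in data: np.corrcoef expects one row per feature (hence the transpose), and seaborn's heatmap draws the resulting square matrix.

```python
import numpy as np
import seaborn as sns

rng = np.random.RandomState(0)
data = rng.randn(100, 3)               # stand-in for e.g. the RM, LSTAT, MEDV columns
cm = np.corrcoef(data.T)               # Pearson correlation matrix, features as rows
sns.heatmap(cm, annot=True,
            xticklabels=['RM', 'LSTAT', 'MEDV'],
            yticklabels=['RM', 'LSTAT', 'MEDV'])
```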
--Fit a linear regression model, focus (p272)
--Ordinary Least Squares (OLS) (p272)
-(Supplement) Is there an Extraordinary ...?
--OLS, Interpretation (p273)
--Regression analysis, more efficient implementation (p277)
--Least Squares, Closed Form Solution (p278)
--Introduction to Statistics Textbook
--Linear regression, greatly influenced by xxx (p278)
--Alternative method for removing outliers (p278)
--RANSAC (RANdom SAmple Consensus) algorithm (p278)
--Inlier (a normal value, not an outlier) (p279)
--lambda function, callable (p279)
--Calculate lambda function, xxx (p279)
--MAD, median absolute deviation of objective value y (p279)
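-(Supplement) A hedged sketch of RANSAC on toy data; residual_threshold is a plain constant here instead of the MAD-based value used in the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3.0 * X.ravel() + rng.randn(100) * 0.1
ransac = RANSACRegressor(LinearRegression(), max_trials=100, min_samples=50,
                         residual_threshold=5.0, random_state=0)
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_   # boolean mask of the detected inliers
```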
--Linear regression line (to be exact, hyperplane) (p281)
--In the case of xxx, the residual is 0, in a real application (p282)
--For a good regression model (p282)
--Model performance, quantification (p283)
--Mean Squared Error (MSE) (p283)
--Useful for MSE, (p283)
--Coefficient of determination $R^2$ (p283)
--The coefficient of determination can be thought of as xxx (p283)
--SSE, Residual Sum of Squares (p283)
--SST (Sum of Squared Total), Equation (p283)
--That is (p283)
--In other words, $R^2$: equation transformation (p284)
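-(Supplement) Quantifying model performance with the two metrics above, on toy values:

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(mean_squared_error(y_true, y_pred))   # MSE
print(r2_score(y_true, y_pred))             # coefficient of determination R^2
```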
--Model extreme parameter weights, penalties (p284)
--Regularized Linear Regression, 3 (p284)
--Ridge regression (p284)
- LASSO (Least Absolute Shrinkage and Selection Operator) (p284)
--Elastic Net method (p284)
--Model with L2 penalty (p284)
- $J(w)_{Ridge}$
- L2
--Increase, increase, decrease (p285)
--LASSO, constraint, when m> n (p285)
--Ridge Regression, LASSO, Elastic Net (p285)
--Elastic Net, L1 Penalty, L2 Penalty (p285)
--Sparseness, number of variables selected xxx partially overcome (p285)
--k-fold cross-validation, parameter $\lambda$, regularization strength (p285)
--Regularization strength, $\lambda$ parameter, $\alpha$ parameter (p285)
--LASSO Regressor in linear_model submodule (p285)
--ElasticNet, l1_ratio argument (p285)
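-(Supplement) A usage sketch for the three regularized regressors: alpha plays the role of the regularization strength $\lambda$, and l1_ratio mixes the L1 and L2 penalties (the values are arbitrary).

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
# each is then fitted with .fit(X, y) like any other scikit-learn regressor
```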
--Polynomial regression, finding curves (p286)
--Linear regression coefficient w, multiple regression model (p286)
--scikit-learn, PolynomialFeatures converter class (p286)
--How to compare polynomial regression and linear regression (p286)
--linear fit, quadratic fit, training points, figure (p287)
--Coefficient of determination ($R^2$), linear model, quadratic polynomial model, fit (p288)
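-(Supplement) A sketch of that comparison on toy data: expand the explanatory variable to quadratic terms with PolynomialFeatures, then fit an ordinary linear model on the expanded features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 6, dtype=float)[:, np.newaxis]   # toy explanatory variable
y = np.array([1.2, 3.9, 9.1, 15.8, 25.3])         # roughly quadratic toy target

quadratic = PolynomialFeatures(degree=2)
X_quad = quadratic.fit_transform(X)               # adds x^0, x^1, x^2 columns
lr = LinearRegression().fit(X, y)                 # linear fit
pr = LinearRegression().fit(X_quad, y)            # quadratic fit on expanded features
```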
--Added polynomial features, model complexity, overfitting (p289)
--Polynomial features, not always the best choice (p289)
--Convert explanatory variables to logarithms and be able to xxx (p290)
--Random Forest Regression (p290)
--Random Forest, Decision Tree, Ensemble (p290)
--Random forest, sum of piecewise linear functions, i.e. (p290)
--Advantages of decision tree algorithm (p290)
--Decision tree, to stretch (p290)
--Decision tree, entropy (p290)
--Entropy, xxx (p290)
--To use a decision tree for regression (p291)
- $I(t)$: Entropy, the impurity index of node $t$ in the equation ... (p291)
- $N_t$: Number of training samples at node $t$ (p291)
- $D_t$: Training subset at node $t$ (p291)
- $y^{(i)}$: True target value (p291)
- $\hat{y}_t$: Predicted target value (sample mean) (p291)
--MSE, Node distribution after split (p291)
--Variance reduction (p291)
--scikit-learn, DecisionTreeRegressor class (p291)
--Decision tree, model, constraint (p292)
--Decision tree depth, overfitting, lack of learning (p292)
--Random forest, decision tree, generalization (p292)
- Reason
--Random forest, advantages (p292)
--Random forest, parameters, experiments required (p292)
--Random forest, algorithm, algorithm for classification (p292)
--The only difference
--Random forest, predicted objective variable, calculated by xxx (p292)
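-(Supplement) A usage sketch for the regression forest: the impurity criterion defaults to the MSE ('mse' in the book-era scikit-learn, 'squared_error' in newer releases), and the prediction is the average over the trees.

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100, random_state=1, n_jobs=-1)
# forest.fit(X_train, y_train); y_pred = forest.predict(X_test)
```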
--SVM, Nonlinear Regression (p294)
--SVM, Regression, S.R.Gunn (p294)
--SVM Regressor, scikit-learn (p294)