Python Machine Learning Programming > Keywords
Python Machine Learning Programming by Sebastian Raschka et al.
I started reading (2016/10/29)
These notes pick up only keywords.
Chapter 1 Giving Computers the Ability to Learn from Data
--Supervised learning (p2)
--Unsupervised learning (p2)
--Reinforcement learning (p2)
--Supervised learning, regression, output: continuous value (p3)
--Negative class (p4)
--Positive class (p4)
--Decision boundary (p4)
--Predictor variable (p5)
--Response variable (p5)
--Explanatory variable (p5)
--Result variable (outcome) (p5)
--Reinforcement learning, Goal (p6)
--Environment (p6)
--Agent (p6)
--Reward (p6)
--Clustering (p7)
--Dimensionality reduction (p8)
- dimensionality reduction
- dimension reduction
--Unsupervised dimensionality reduction, features, preprocessing (p8)
--Iris dataset (p9)
- $X \in \mathbb{R}^{150 \times 4}$: set of real numbers, 150x4 matrix (p9)
- $x^{(i)}$: the i-th training sample (p10)
- $x_j$: the j-th dimension of the training dataset (p10)
- $\boldsymbol{x}$: vector (lowercase bold) (p10)
- $\boldsymbol{X}$: matrix (uppercase bold) (p10)
- $x$: a single element of a vector or matrix (italic) (p10)
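-(Supplement) A minimal sketch of the notation above, using scikit-learn's bundled Iris data (the book loads it from UCI; this loader is just a convenient stand-in):

```python
# Load the Iris data to see the 150x4 feature matrix and the notation in practice.
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data              # X in R^{150x4}: 150 samples, 4 features
x_50 = X[49]               # x^(50): the 50th training sample (a row vector)
x_1 = X[:, 0]              # x_1: the first feature column (sepal length)
print(X.shape)             # (150, 4)
```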
--Predictive modeling (p10)
--Preprocessing is the most (p11)
--Depending on the extracted features, certain overlap may be observed due to the high correlation. In such a case (p11)
--David Wolpert, "No Free Lunch Theorem" (p12)
--To address this issue (p12)
--Model generalization performance (p12)
--Hyperparameter optimization (p12)
--Generalization error (p13)
--NumPy, SciPy, Fortran, C, Implementation (p13)
--Differences between Python 3.4 and Python 2.7, Summary (p13)
- pandas (p14)
- matplotlib (p14)
Chapter 2 Classification Problems-Training Machine Learning Algorithms
- ADALINE (Adaptive Linear Neuron) (p17)
- scikit-learn (p17)
--MCP Neuron (McCulloch-Pitts Neuron) (p17)
- Warren McCulloch
- Walter Pitts
--Frank Rosenblatt, Perceptron, Learning Rules, Algorithms (p18)
--Two classes (p18)
- 1 (positive class)
- -1 (negative class)
--Total input (net input) (p18)
- $\theta$: Threshold (p19)
--Unit step function (p19)
--Heaviside step function
- $\hat{y}$: Output value (p21)
- $\eta$: Learning rate (a constant between 0.0 and 1.0) (p21)
--fit method (p24)
--predict method (p24)
--Underscore (e.g. self.w_): by convention, the xxx attribute has xxx (p24)
--Review (p24)
- http://wiki.scipy.org/Tentative_NumPy_Tutorial
- http://pandas.pydata.org/pandas-docs/stable/tutorials.html
- http://matplotlib.org/users/beginner.html
for _ in range(self.n_iter):
(p24)
-(Supplement) It seems to be written when loop variables are not used.
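-(Supplement) A rough sketch of the fit-loop pattern above (not the book's full implementation): attributes created inside fit get a trailing underscore, and `_` names a loop counter that is never read.

```python
import numpy as np

class PerceptronSketch(object):
    """Illustrative perceptron skeleton; hyperparameters are toy defaults."""
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])   # trailing underscore: set during fit
        self.errors_ = []
        for _ in range(self.n_iter):         # the counter itself is unused
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, -1)
```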
--One-vs-All: OvA method (p27)
--UCI Machine Learning Repository, Iris Dataset (p27)
--plt.scatter (omitted because it is long) (p28)
markers = ('s', 'x', 'o', '^', 'v')
(p30)
--Numpy's meshgrid function (p31)
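-(Supplement) A small sketch of the meshgrid idea: build a grid over the feature space, classify every grid point, and reshape the predictions back to the grid shape (the classifier is assumed to be already fitted).

```python
import numpy as np

x1_min, x1_max, x2_min, x2_max = -1.0, 1.0, -1.0, 1.0
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
grid = np.array([xx1.ravel(), xx2.ravel()]).T   # one row per grid point
# Z = classifier.predict(grid).reshape(xx1.shape)  # then plot with contourf
```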
--The learning rules of Perceptron converge (p32)
--Linear hyperplane (p32)
- ADALINE (ADAptive LInear NEuron) (p32)
- Bernard Widrow, Tedd Hoff
--Can be regarded as xxx
--The main difference is (p32)
--ADALINE learning rules
--Rosenblatt's Perceptron
--Identity function (p32)
--Widrow-Hoff rule (p32)
--Quantizer (p32)
--Objective function (p33)
--Cost function (p33)
--ADALINE, cost function J (p33)
--Sum of Squared Error (SSE) (p33)
--The main advantage of the continuous value linear activation function is (p33)
--Another feature of this cost function is (p33)
--Unit step function, definition formula (p33)
--Partial derivative of the cost function of the error sum of squares for the jth weight, equation transformation (p34)
--"Batch" gradient descent (p35)
self.cost_ = []
(p35)
-(Supplement) ~~This meaning has not been learned~~
-(Supplement) I understand that I will make an empty list
--self.w_[1:] += (omitted)
(p36)
-(Supplement) Operates on the weights from index 1 onward
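-(Supplement) A hedged sketch of the batch update the snippet above belongs to: `w[1:]` are the feature weights, `w[0]` is the bias, and the SSE cost is what gets appended to `cost_` each epoch (the helper name is mine, not the book's).

```python
import numpy as np

def adaline_batch_step(w, X, y, eta=0.01):
    """One batch gradient-descent step for an ADALINE-style model."""
    output = np.dot(X, w[1:]) + w[0]       # identity activation: the net input
    errors = y - output
    w[1:] += eta * X.T.dot(errors)         # gradient step for the feature weights
    w[0] += eta * errors.sum()             # gradient step for the bias weight
    cost = (errors ** 2).sum() / 2.0       # sum-of-squared-errors cost for this epoch
    return w, cost
```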
--The value of the hyperparameter that optimizes the performance of the classification model is (p37).
--Two types of problems (p37)
--Scaling method, standardization (p38)
--What are its characteristics?
--Equation (p39)
--Numpy mean method, std method (p39)
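-(Supplement) The standardization step with NumPy's mean and std methods, as a tiny sketch on toy data:

```python
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_std = np.copy(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()
```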
--Stochastic gradient descent (p40)
--Iterative gradient descent (p40)
--On-line gradient descent (p40)
--The stochastic gradient descent method can be regarded as xxx (p40)
--Shuffle training data to avoid circulation (p41)
--Adaptive learning rate (p41)
--Another advantage of stochastic gradient descent (p41)
--Online learning (p41)
--This is especially useful (p41)
--Also, you will be able to xxx (p41)
--Since it can be xxx, the calculation efficiency of the learning algorithm can be further improved (p41).
--Option to shuffle training data before each epoch (p41)
--The _shuffle method used in the AdalineSGD classifier produces xxx (p42)
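-(Supplement) The shuffling idea boils down to drawing a random permutation of the row indices before each epoch; a minimal sketch (the function name is mine):

```python
import numpy as np

def shuffle_data(X, y, seed=1):
    """Return X and y reordered by the same random permutation."""
    rgen = np.random.RandomState(seed)
    r = rgen.permutation(len(y))
    return X[r], y[r]
```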
Chapter 3 Classification Problems-Using the Machine Learning Library scikit-learn
--Training of machine learning algorithms, 5 major steps (p48)
-(Supplement) Is it a mistranslation of "algorithm training?"
--The scikit-learn library has xxx as well as xxx (p48)
np.unique(y)
(p49)
--Using the StandardScaler class of scikit-learn's preprocessing module (p50)
--Call the transform method (p50)
--Be careful here: this is (p50)
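-(Supplement) A short sketch of that caution: fit StandardScaler on the training data only, then reuse the same mean/std for both sets (cross_validation is the module name in the book's scikit-learn version; newer releases use model_selection).

```python
from sklearn import datasets
from sklearn.cross_validation import train_test_split  # model_selection in newer versions
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, [2, 3]], iris.target, test_size=0.3, random_state=0)

sc = StandardScaler()
sc.fit(X_train)                      # learn mean and std from the training set only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)    # same parameters, so the two sets stay comparable
```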
--This can be xxx because it uses the one-to-many (OvR) method (p50)
--Using the random_state parameter to allow xxx (p51)
--Model accuracy rate, error classification rate (p51)
--Overfitting means xxx (p51)
--Example of using numpy mesh grid (p52)
--For datasets that do not allow perfect linear separation (p53)
--Even if the classes cannot be completely linearly separated (p53)
--Perceptron learning rules, the biggest problem (p54)
--Logistic regression (p54)
--Logistic regression, high performance is demonstrated (p54)
--Odds ratio; the odds ratio is (p54)
- $p$: Represents the probability of the positive event (p54)
-(Supplement) TODO: Definition of "positive event". Examine other than the following explanation
--The positive event is (p54)
--Logit function, expression (p54)
--Logarithmic odds, equation (p54)
--Logistic function, expression (p55)
--Sigmoid function (p55)
--Implementation of ADALINE, identity function (p56)
--Output and interpretation of sigmoid function (p57)
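-(Supplement) The sigmoid function as a one-liner; its output in (0, 1) is read as the probability of the positive class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5: a net input of 0 maps to probability 0.5
```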
--Likelihood $L$, definition, equation $L(w)$ (p58)
--In fact, xxx is easy (p58)
--Log likelihood function, equation $l(w)$ (p58)
--Applying a logarithmic function reduces the likelihood of xxx (p58)
--Cost function J (p58)
--If you want to implement logistic regression yourself (p59)
--sklearn.linear_model.LogisticRegression class (p59)
--"What is this mysterious parameter C?" (P60)
--predict_proba method (p61)
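-(Supplement) A usage sketch tying the two items above together: C is the inverse regularization strength, and predict_proba returns per-class probabilities (the values here are toy choices).

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(iris.data, iris.target)
print(lr.predict_proba(iris.data[:1]))   # one row of class-membership probabilities
```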
--Partial derivative of the log-likelihood function with respect to the j-th weight, equation (p61)
--Partial derivative of sigmoid function (p61)
--"High variance" (p62)
--Underfitting (p63)
-(Supplement) Is it 7of9?
--"High bias" (p63)
--What is Variance (p63)
--If the variance is large (p63)
--In contrast, what is bias (p63)
--As one of the ways to find the trade-off between bias and variance (p64)
--Regularization (p64)
--Collinearity (p64)
--What is collinearity (p64)
--The idea behind regularization is (p64)
--Most general regularization (p64)
--L2 regularization (p64)
--L2 Shrinkage
--Weight decay
-(Supplement) @ Deep Learning by Takayuki Okatani
- $\lambda$: Regularization parameter (p64)
--The generalization error of the model is decomposed as follows (p64)
--Bias quantifies xxx (p64)
--Only xxx is required to apply regularization (p65)
- $C$: Implemented in scikit-learn's LogisticRegression class (p65)
--Directly related to the regularization parameter $\lambda$, expression
--Inverse regularization parameter $C$, to reduce (p65)
--To visualize the strength of regularization (p65)
--Support Vector Machine (SVM) (p66)
-(Supplement) Not Separation of Variable Method
--Can be regarded as SVM, (p66)
--SVM, purpose of optimization (p66)
--SVM, Margin, Definition (p66)
--Hyperplane (decision boundary) (p66)
--Support vector, illustrated (p67)
--Models with small margins, tend to fall into xxx (p67)
--Positive and negative hyperplanes (p67)
--Hyperplane, equation (p67)
--Vector length, equation (p67)
--Left side of expression, interpretation (p68)
--Two equations (3.4.6), are shown (p68)
--Simply put, equation (3.4.7) (p68)
--xxx is easy (p68)
--Quadratic programming (p68)
--By Vladimir Vapnik
--Christopher J.C. Burges's paper
--Slack variable $\xi$ (p68)
--1995, Vladimir Vapnik
--Soft margin classification
--Slack variable, because it was needed (p68)
-If the value of $C$ is large, it means xxx, which means xxx (p69)
- $C$, can be adjusted (p69)
-When the value of $\lambda$ becomes large (p69)
--Logistic regression tries to maximize xxx (p70)
--Therefore, it becomes more susceptible to xxx
--scikit-learn, LogisticRegression class, LIBLINEAR library (p71)
--scikit-learn, SVM training, SVC class, LIBSVM library (p71)
--Computer memory (p71)
--SGDClassifier class, alternative implementation (p71)
--SVM, popularity, reason (p71)
--Kernel SVM (kernel SVM) (p71)
X_xor = np.random.randn(200, 2)
(p72)
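-(Supplement) The XOR toy data is typically completed along these lines (a sketch, not necessarily the book's exact code): label each point by the XOR of the signs of its two coordinates.

```python
import numpy as np

np.random.seed(0)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)   # map True/False to the +1/-1 class labels
```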
--Projection function $\phi(\cdot)$, high dimension, linear separation (p73)
--Separate classes (p73)
--Projection method, problem (p74)
--Kernel trick (p74)
-(Supplement) [Related article](http://qiita.com/kilometer/items/66e6116cc661019ead59)
--Radial Basis Function kernel (p74)
-(Supplement) Is it related to Vector Spherical Harmonics?
-(Supplement) Expanded in Vector Spherical Harmonics
- $\gamma$: Expression, hyperparameter to be optimized (p74)
--Kernel, Interpretation (p74)
--Kernel, minus sign (p74)
--1 (exactly the same sample) (p74)
--0 (a completely different sample) (p74)
- $\gamma$: Kernel function, cutoff (p75)
-(Supplement) Remember the cutoff frequency that appears in the electric circuit
-(Supplement) Article, 3dB
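-(Supplement) A small usage sketch connecting $\gamma$ to scikit-learn's SVC: a larger gamma tightens the RBF "cutoff" and produces a more tightly fitted boundary (the gamma and C values here are arbitrary).

```python
from sklearn.svm import SVC

svm = SVC(kernel='rbf', gamma=0.10, C=10.0, random_state=0)
# svm.fit(X_xor, y_xor)   # e.g. on the XOR data from the earlier sketch
```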
--Decision tree classifier (p77)
--Interpretability (p77)
--Decision tree, can be thought of as xxx (p77)
--Decision tree, information gain (p78)
--Information gain (decrease in xxx) (p78)
--Root (root) (p78)
--Leaf (p78)
--To divide the node by the feature with the highest information gain (p78)
--Information gain, equation $IG(D_p, f)$ (p78)
- $f$: Feature used for the split (p78)
- $D_p$: Parent dataset (p78)
- $D_j$: Dataset of the j-th child node (p78)
- $I$: Impurity (p79)
- $N_p$: Total number of samples at the parent node (p79)
- $N_j$: Number of samples at the j-th child node (p79)
--Thus, the information gain is only xxx (p79)
- $D_{left}$, $D_{right}$ (p79)
--Binary tree, impure index or division condition (p79)
--Gini impurity (p79)
--Entropy (p79)
-(Supplement) Ricci flow by Grisha (Perelman)
--Classification error
- $I_E$: Classification error (p79)
- $I_H$: (Supplement) I don't know what the formula is (p79)
- $p(i=1|t)$, $p(i=0|t)$ (p79)
--The entropy is 1 in binary classification (p79)
--The maximum of the Gini impurity is (p79)
--Another indicator of impureness, classification error (p80)
-Equation using $I_E$ (p80)
- $D_p$: Let's look at the parent node's dataset (p80)
--Information gain (difference between the impurity of the parent node and the sum of the impurities of the child nodes) (p80)
-(Supplement) After that, about 15 related expressions continue
--To visually compare the above three impurity measures (p82)
--Add xxx to confirm that Gini impurity lies somewhere between entropy and classification error (p82)
--
# Entropy (two variants), Gini impurity, and classification error are each plotted in a loop
(p83)
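-(Supplement) A sketch of those three impurity measures for a binary node as functions of $p(i=1|t)$ (close in spirit to the book's comparison, but written from scratch here):

```python
import numpy as np

def gini(p):
    return p * (1 - p) + (1 - p) * (1 - (1 - p))

def entropy(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def error(p):
    return 1 - np.max([p, 1 - p])

x = np.arange(0.0, 1.0, 0.01)
ent = [entropy(p) if p != 0 else None for p in x]    # entropy
sc_ent = [e * 0.5 if e else None for e in ent]       # entropy scaled by 0.5 (the second variant)
gi = [gini(p) for p in x]
err = [error(p) for p in x]
# plotting these against x shows Gini impurity between the scaled entropy and the error
```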
--Decision tree, overfitting (p84)
--Feature scaling, decision tree (p84)
--As a decision tree peculiar (p84)
--scikit-learn, post-training decision tree, export (p85)
- GraphViz (p85)
--Random forest, features (p86)
--Random Forest, intuitively (p86)
--The idea behind ensemble learning (p86)
--Weak learning algorithm, strong learning algorithm (p86)
--Generalization error, overfitting (p86)
--Random Forest Algorithm, 4 Steps (p86)
--Sampling without replacement (p87)
--Majority vote, assign class label (p87)
--Random forest, advantages (p87)
--No need to xxx (p87)
--Can be optimized (p87)
--Bootstrap sample size (p87)
--Scikit-learn, RandomForestClassifier implementation (p87)
- $d$: Number of features used for each split (p87)
--Total number of features in the training dataset (p87)
- $d = \sqrt{m}$ (p87)
- $m$: Number of features in the training dataset (p87)
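-(Supplement) A usage sketch for those defaults: max_features='sqrt' asks for $d = \sqrt{m}$ features per split (the hyperparameter values are illustrative).

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='entropy', n_estimators=10,
                                max_features='sqrt', random_state=1, n_jobs=2)
# forest.fit(X_train, y_train)
```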
--This allows you to xxx (p88)
--k-nearest neighbor classifier (p89)
- KNN
--KNN, lazy learner (p89)
--What is called "laziness" (p89)
--Parametric model, nonparametric model (p89)
--Perceptron, Logistic Regression, Linear SVM (p89)
--Decision Tree / Random Forest, Kernel SVM (p89)
--Instance-base learning (p89)
--Remember the training dataset (p89)
--The main advantages of the memory-based approach (p90)
--When the majority vote is the same (p91)
--In the implementation of scikit-learn's KNN algorithm
--Euclidean distance (p91)
--Minkowski distance (p91)
--Manhattan distance (p91)
--Minkowski distance, equation (p91)
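-(Supplement) A sketch of those distance settings: with metric='minkowski', p=2 gives the Euclidean distance and p=1 the Manhattan distance.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
# knn.fit(X_train_std, y_train)
```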
--Curse of dimensionality (p92)
--Curse of dimensionality, representing the xxx phenomenon (p92)
--By using xxx, you can escape from the curse of dimensionality (p92)
Chapter 4 Data Preprocessing-Building a Better Training Set
--Missing value (p93)
--Blank in data table (p93)
- NaN (Not a Number) (p93)
--Placeholder (provisional) string (p93)
--Ignoring missing values (p93)
--
# If you are using Python 2.7, you need to convert the string to unicode
(p94)
--StringIO function, when used (p94)
--Using the isnull method (p94)
--Data preprocessing, pandas DataFrame class (p95)
--DataFrame object, values attribute (p95)
df.dropna()
(p95)
--If you set the axis argument to 1 (p95)
df.dropna(how='all')
(p95)
df.dropna(thresh=4)
(p95)
df.dropna(subset=['C'])
(p95)
--Deletion of missing data, problem (p96)
--Interpolation technique (p96)
--Mean imputation (p96)
-(Supplement) Isn't "complement" a mistake of "interpolation"?
--scikit-learn Imputer class (p96)
--strategy argument
- median
- most_frequent
--most_frequent is useful for xxx (p96)
--So-called transformer class (p96)
--Transformer, fit, transform (p96)
--Transformer, fit method is (p96)
--Transformer, transform method is (p96)
--Estimator, predict method (p97)
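-(Supplement) A hedged sketch of the transformer pattern above, using the Imputer class from the book's scikit-learn version (newer releases replace it with SimpleImputer):

```python
import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0]])
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr.fit(X)                    # fit: learn the column means
imputed = imr.transform(X)    # transform: fill in the missing values
```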
--Category data, nominal features (p98)
--Category data, ordinal features (p98)
--Order features, example (p98)
--Numerical features (p98)
--Class label (p98)
--Category string, convert to integer, required (p99)
--Dictionary for reverse mapping inv_size_mapping (p99)
--Many machine learning libraries, requesting xxx (p99)
--To revert the converted class label to its original string representation (p100)
--LabelEncoder, a convenient class implemented directly in scikit-learn (p100)
--One of the most common mistakes in processing categorical data (p101)
--Avoid xxx problems, one-hot encoding (p101)
--Dummy feature (p101)
--scikit-learn, OneHotEncoder class (p101)
--OneHotEncoder class returns a sparse matrix when xxx (p102)
--get_dummies function implemented in pandas (p102)
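-(Supplement) A minimal sketch of one-hot encoding a nominal feature with pandas get_dummies, which only converts the string columns:

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'], 'size': [1, 2, 3]})
print(pd.get_dummies(df))   # 'color' becomes three 0/1 dummy columns; 'size' is left alone
```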
--Wine dataset (p102)
- UCI Machine Learning Repository (p102)
-(Supplement) http://archive.ics.uci.edu/ml/
--Randomly split into test and training datasets (p104)
--train_test_split function (p104)
--scikit-learn, cross_validation module (p104)
--Dataset, Split, Attention (p104)
--Accuracy of generalization error estimation, trade-off (p104)
--xxx would be good (p104)
--Feature scaling (p105)
--Decision tree and random forest, without xxx (p105)
--Most of xxx, works much better with xxx (p105)
--Feature Scaling, Importance (p105)
--Scale (p105)
--Normalization
--Standardization
--Normalization, meaning xxx (p105)
--xxx special case (p105)
- $x_{norm}^{(i)}$: New value for sample $x^{(i)}$, equation (p105)
--min-max scaling, scikit-learn (p105)
--Bounded section (within a certain range) (p106)
--Useful for normalization by min-max scaling (p106)
--xxx may be more practical, reason (p106)
--Many linear models, including xxx, are xxx (p106)
--When using standardization (p106)
--Standardization procedure, equation (p106)
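-(Supplement) A short sketch of both scaling options on a toy column: min-max scaling maps to [0, 1], standardization centers to mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
print(MinMaxScaler().fit_transform(X).ravel())    # [0.   0.25 0.5  0.75 1.  ]
print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1
```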
--Overfitting (p107)
--Overfitting, cause (p107)
--General methods for reducing generalization error (p107)
--L2 regularization, equation (p107)
--L1 regularization, equation (p107)
--Returned by L1 regularization, (p107)
--L1 regularization, how to promote sparseness (p108)
--Regularization, Geometric Interpretation (p108)
--Regularization, think as follows (p108)
--Regularization parameter $\lambda$, by strengthening (p108)
--L2 Penalty Term Concept, Illustrated (p108)
--Here xxx cannot exceed xxx (p109)
--On the other hand, I want to minimize xxx (p109)
--The goal here is (p109)
--If there is no xxx, it can be understood as xxx (p109)
--L1 regularization, sparseness (p109)
--Similar to xxx. However, xxx (p109)
--The term L2 is xxx (p109)
--Rhombus (p109)
--L1 diamond (p110)
--The optimization condition is likely to be in xxx (p110)
--Why L1 regularization leads to sparse solutions (p110)
--Trevor Hastie et al. "The Elements of Statistical Learning" Section 3.4
--scikit-learn, L1 regularization (p110)
--penalty argument
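-(Supplement) A usage sketch for the penalty argument: L1-regularized logistic regression yields a sparse weight vector and so doubles as a feature selector (the C value is arbitrary).

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
# lr.fit(X_train_std, y_train); many entries of lr.coef_ end up exactly zero
```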
--The regularization path is (p112)
--Dimension reduction by feature selection (p113)
--Dimensionality reduction (p113)
--Feature selection
--Feature extraction
--For feature selection (p113)
--In feature extraction (p113)
--Typical feature selection algorithm (p113)
--Sequential selection algorithm (p113)
--Greedy search (p113)
--d dimension, k dimension (k <d) (p113)
--Feature selection algorithm, two purposes (p113)
--The latter is useful for xxx (p114)
--Sequential Backward Selection (SBS) (p114)
--SBS, Purpose (p114)
--Exhaustive search algorithm (p114)
--Not xxx in terms of xxx (p114)
--SBS, Algorithm, 4 Simple Steps (p114)
--Let's implement it with SBS, Python (p115)
--Features, subsets, classification problems, estimators (p116)
--In the while loop of the fit method, it is reduced to xxx (p116)
--Test dataset, training dataset, split (p117)
--To prevent the original test dataset from becoming part of the training dataset (p117)
--Because the number of features has been reduced (p117)
--KNN algorithm, curse of dimensionality (p117)
--Various feature selection methods, comprehensive explanation (p119)
- http://scikit-learn.org/stable/modules/feature_selection.html
--L1 Logistic regression with regularization, irrelevant features, SBS algorithm, feature selection (p119)
--Feature selection, random forest (p119)
--Random Forest, Ensemble Method (p119)
--xxx Even without making assumptions (p119)
indices = np.argsort(importance)[::-1]
(p120)
--n_jobs=-1, all cores (p120)
--Random Forest, Note xxx, Important (p120)
--L1 regularization, useful for xxx (p122)
--Sequential feature selection algorithm, SBS (p122)
Chapter 10 Regression Analysis - Prediction of Objective Variables with Continuous Values
--Regression analysis (p265)
--Explanatory variable, objective variable, figure (p266)
--Regression line (p266)
--Offset, residual (p266)
--Simple linear regression (p266)
--Multiple linear regression (p266)
--Housing dataset (p267)
- UCI Machine Learning Repository
--MEDV: Median Home Prices (p267)
--pandas DataFrame object (p267)
--TODO: Learning pandas
--Exploratory Data Analysis (EDA) (p268)
--Recommended as EDA, xxx (p268)
--Relationship between outliers, data distribution, and features (p268)
--Scatter plot matrix, xxx can be visualized (p268)
--Scatterplot matrix, pairplot function of seaborn library (p268)
pip install seaborn
(p268)
--xxx changes when importing seaborn library (p269)
--RM (Average number of rooms per unit) (p270)
--In contrast to popular belief, xxx is not necessary (p270)
--Correlation matrix (p270)
--Correlation matrix, covariance matrix, intuitively (p270)
--Pearson product-moment correlation coefficient, square matrix (p270)
--Pearson's r (p270)
--Correlation coefficient, range (p270)
--Positive correlation, negative correlation (p270)
- r = 0 (p270)
--Pearson's product moment correlation coefficient, equation (p270)
- $\mu$: Sample mean of the corresponding feature (p270)
- $\sigma_{xy}$: Covariance between features $x$ and $y$
- $\sigma_x$ and $\sigma_y$: Standard deviations of the features
--Pearson's product-moment correlation coefficient, covariance, standard deviation product (p270)
--NumPy corrcoef function (p271)
--seaborn heatmap function (p271)
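-(Supplement) A sketch of those two functions on stand-in data: np.corrcoef expects one row per feature (hence the transpose), and seaborn's heatmap draws the resulting square matrix.

```python
import numpy as np
import seaborn as sns

rng = np.random.RandomState(0)
data = rng.randn(100, 3)               # stand-in for e.g. the RM, LSTAT, MEDV columns
cm = np.corrcoef(data.T)               # Pearson correlation matrix, features as rows
sns.heatmap(cm, annot=True,
            xticklabels=['RM', 'LSTAT', 'MEDV'],
            yticklabels=['RM', 'LSTAT', 'MEDV'])
```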
--Fit a linear regression model, focus (p272)
--Ordinary Least Squares (OLS) (p272)
-(Supplement) Is there an Extraordinary ...?
--OLS, Interpretation (p273)
--Regression analysis, more efficient implementation (p277)
--Least Squares, Closed Form Solution (p278)
--Introduction to Statistics Textbook
--Linear regression, greatly influenced by xxx (p278)
--Alternative method for removing outliers (p278)
--RANSAC (RANdom SAmple Consensus) algorithm (p278)
--Inlier (a normal value, not an outlier) (p279)
--lambda function, callable (p279)
--Calculate lambda function, xxx (p279)
--MAD, median absolute deviation of objective value y (p279)
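-(Supplement) A hedged sketch of RANSAC on toy data; residual_threshold is a plain constant here instead of the MAD-based value used in the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3.0 * X.ravel() + rng.randn(100) * 0.1
ransac = RANSACRegressor(LinearRegression(), max_trials=100, min_samples=50,
                         residual_threshold=5.0, random_state=0)
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_   # boolean mask of the detected inliers
```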
--Linear regression line (to be exact, hyperplane) (p281)
--In the case of xxx, the residual is 0, in a real application (p282)
--For a good regression model (p282)
--Model performance, quantification (p283)
--Mean Squared Error (MSE) (p283)
--Useful for MSE, (p283)
--Coefficient of determination $R^2$ (p283)
--The coefficient of determination can be thought of as xxx (p283)
--SSE, Residual Sum of Squares (p283)
--SST (Sum of Squared Total), Equation (p283)
--That is (p283)
--In other words, $R^2$: equation transformation (p284)
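-(Supplement) Quantifying model performance with the two metrics above, on toy values:

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(mean_squared_error(y_true, y_pred))   # MSE
print(r2_score(y_true, y_pred))             # coefficient of determination R^2
```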
--Model extreme parameter weights, penalties (p284)
--Regularized Linear Regression, 3 (p284)
--Ridge regression (p284)
- LASSO (Least Absolute Shrinkage and Selection Operator) (p284)
--Elastic Net method (p284)
--Model with L2 penalty (p284)
- $J(w)_{Ridge}$
- L2
--Increase, increase, decrease (p285)
--LASSO, constraint, when m> n (p285)
--Ridge Regression, LASSO, Elastic Net (p285)
--Elastic Net, L1 Penalty, L2 Penalty (p285)
--Sparseness, number of variables selected xxx partially overcome (p285)
--k-fold cross-validation, parameter $\lambda$, regularization strength (p285)
--Regularization strength, $\lambda$ parameter, $\alpha$ parameter (p285)
--LASSO Regressor in linear_model submodule (p285)
--ElasticNet, l1_ratio argument (p285)
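-(Supplement) A usage sketch for the three regularized regressors: alpha plays the role of the regularization strength $\lambda$, and l1_ratio mixes the L1 and L2 penalties (the values are arbitrary).

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
# each is then fitted with .fit(X, y) like any other scikit-learn regressor
```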
--Polynomial regression, finding curves (p286)
--Linear regression coefficient w, multiple regression model (p286)
--scikit-learn, PolynomialFeatures converter class (p286)
--How to compare polynomial regression and linear regression (p286)
--linear fit, quadratic fit, training points, figure (p287)
--Coefficient of determination ($R^2$), linear model, quadratic polynomial model, fit (p288)
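-(Supplement) A sketch of that comparison on toy data: expand the explanatory variable to quadratic terms with PolynomialFeatures, then fit an ordinary linear model on the expanded features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 6, dtype=float)[:, np.newaxis]   # toy explanatory variable
y = np.array([1.2, 3.9, 9.1, 15.8, 25.3])         # roughly quadratic toy target

quadratic = PolynomialFeatures(degree=2)
X_quad = quadratic.fit_transform(X)               # adds x^0, x^1, x^2 columns
lr = LinearRegression().fit(X, y)                 # linear fit
pr = LinearRegression().fit(X_quad, y)            # quadratic fit on expanded features
```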
--Added polynomial features, model complexity, overfitting (p289)
--Polynomial features, not always the best choice (p289)
--Convert explanatory variables to logarithms and be able to xxx (p290)
--Random Forest Regression (p290)
--Random Forest, Decision Tree, Ensemble (p290)
--Random forest, sum of piecewise linear functions, i.e. (p290)
--Advantages of decision tree algorithm (p290)
--Decision tree, to stretch (p290)
--Decision tree, entropy (p290)
--Entropy, xxx (p290)
--To use a decision tree for regression (p291)
- $I(t)$: Entropy, the impurity index of node $t$ in the equation ... (p291)
- $N_t$: Number of training samples at node $t$ (p291)
- $D_t$: Training subset at node $t$ (p291)
- $y^{(i)}$: True target value (p291)
- $\hat{y}_t$: Predicted target value (sample mean) (p291)
--MSE, Node distribution after split (p291)
--Variance reduction (p291)
--scikit-learn, DecisionTreeRegressor class (p291)
--Decision tree, model, constraint (p292)
--Decision tree depth, overfitting, lack of learning (p292)
--Random forest, decision tree, generalization (p292)
- Reason
--Random forest, advantages (p292)
--Random forest, parameters, experiments required (p292)
--Random forest, algorithm, algorithm for classification (p292)
--The only difference
--Random forest, predicted objective variable, calculated by xxx (p292)
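-(Supplement) A usage sketch for the regression forest: the impurity criterion defaults to the MSE ('mse' in the book-era scikit-learn, 'squared_error' in newer releases), and the prediction is the average over the trees.

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100, random_state=1, n_jobs=-1)
# forest.fit(X_train, y_train); y_pred = forest.predict(X_test)
```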
--SVM, Nonlinear Regression (p294)
--SVM, Regression, S.R.Gunn (p294)
--SVM Regressor, scikit-learn (p294)