(Maybe) This is all you need to pass the Python 3 Engineer Certification Data Analysis Exam

Edit history

2020.8.11 I will share the spreadsheet link because the table is hard to see.

Introduction

This article summarizes the knowledge needed for the Python 3 Engineer Certification Data Analysis Exam, which began on June 8, 2020. It organizes information from Prime Strategy's practice exams and various web pages. The term "textbook" in the article refers to the following book, which is the main teaching material.

Main teaching material: Released on September 19, 2018 (2,678 yen including tax) "A new textbook for data analysis using Python" (Shoeisha) Authors: Manabu Terata, Shingo Tsuji, Takanori Suzuki, Shintaro Fukushima (honorific title omitted)

Question range and question distribution

Question range                                          Number of questions  Question distribution
1 Role of data engineer                                 2                    5.00%
2 Python and environment
  1 Execution environment construction                  1                    2.50%
  2 Python basics                                       3                    7.50%
  3 Jupyter Notebook                                    1                    2.50%
3 Foundations of mathematics
  1 Basic knowledge for reading mathematical formulas   1                    2.50%
  2 Linear algebra                                      2                    5.00%
  3 Basic analysis                                      1                    2.50%
  4 Probability and statistics                          2                    5.00%
4 Analysis practice by library
  1 NumPy                                               6                    15.00%
  2 pandas                                              7                    17.50%
  3 Matplotlib                                          6                    15.00%
  4 scikit-learn                                        8                    20.00%

Learning method

  1. Read the main teaching material and type out its sample code by hand
  2. Solve Prime Strategy's mock exam and sort out the points you do not understand.
  3. Solve PyQ's "Python 3 Engineer Certification Data Analysis Test PyQ Quest Support"

Organize knowledge according to the range of questions

Major items Sub-item Overview Details reference
2 Python and environment pip The pip command is a utility that installs Python packages published in The Python Package Index. Use the pip install command to install the package.
About pip's -U option

Ex.)
pip install -U numpy pandas
Adding the -U option to the pip install command updates the installed libraries to the latest version.

To install the latest version explicitly, it looks like this.
PEP8 PEP8 is the standard coding convention. Importing several names from the same module on one line is allowed, but different modules should be imported on separate lines. [Python coding conventions] Read PEP8 - Qiita https://qiita.com/simonritchie/items/bb06a7521ae6560738a7
Log level There are five logging levels in Python, listed here from most to least severe.

1. CRITICAL
2. ERROR
3. WARNING
4. INFO
5. DEBUG
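The ordering of the levels above can be checked with the standard logging module. This is a minimal sketch; the logger name "exam_demo" is a hypothetical name chosen for illustration.

```python
import logging

# Levels from most to least severe, with their numeric values
levels = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]
values = [getattr(logging, name) for name in levels]
print(dict(zip(levels, values)))

# "exam_demo" is a hypothetical logger name used only for this sketch
logger = logging.getLogger("exam_demo")
logger.setLevel(logging.WARNING)

# Messages below the logger's level are ignored
info_enabled = logger.isEnabledFor(logging.INFO)
error_enabled = logger.isEnabledFor(logging.ERROR)
print(info_enabled, error_enabled)
```

With the level set to WARNING, INFO and DEBUG messages are dropped while WARNING, ERROR, and CRITICAL messages pass through.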
Convenient module The pickle module can serialize Python objects so that they can be read and written in files. Boolean values, numbers, character strings, etc. can be pickled.

The pathlib module is useful for working with file paths. The glob method also accepts the wildcard (*) in file names.
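As a rough illustration of both modules, the sketch below pickles a dictionary to a binary file inside a temporary directory and then lists files with a wildcard. The file names sample.pkl and notes.txt and the dictionary contents are made up for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Sample object (hypothetical contents): booleans, numbers and strings can be pickled
data = {"flag": True, "count": 3, "name": "numpy"}

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    pkl_path = base / "sample.pkl"

    # Serialize to a binary file ("wb"), then read it back ("rb")
    with pkl_path.open("wb") as f:
        pickle.dump(data, f)
    with pkl_path.open("rb") as f:
        restored = pickle.load(f)

    # A second file, so that the wildcard has something to filter out
    (base / "notes.txt").write_text("hello")

    # glob() accepts the * wildcard
    pkl_files = sorted(p.name for p in base.glob("*.pkl"))

print(restored == data, pkl_files)
```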
ravel and flatten of NumPy are functions that make an array one-dimensional. ravel() returns a view whenever possible, whereas flatten() always returns a copy. Like ravel(), reshape() also returns a view whenever possible. If you assign an array to another variable, that variable refers to the original array. To create a separate object, use copy() or copy.deepcopy().
Reading and writing data To read or write a binary file, open it with the b mode of the open function, which returns a file object; then read with read() and write with write().
strip method

Ex.)
bird = ' Condor Penguin Duck '

print("before strip: {}".format(bird))

print("after strip: {}".format(bird.strip()))
Whitespace characters at both ends are removed.
Regular expressions

Symbol  Meaning                     Pattern  Matches
.       Any one character           a.c      abc, acc, aac
^       Beginning of the line       ^abc     abcdef
$       End of the line             abc$     defabc
*       Repeat 0 or more times      ab*      a, ab, abb, abbb
+       Repeat 1 or more times      ab+      ab, abb, abbb
?       0 times or 1 time           ab?      a, ab
{m}     Repeat m times              a{3}     aaa
{m,n}   Repeat m to n times         a{2,4}   aa, aaa, aaaa
[★]     Any one character in ★      [a-c]    a, b, c
★|★     Either of ★                 a|b      a, b
Regular expression special sequences

\d  Any digit                         [0-9]
\D  Anything other than a digit       [^0-9]
\s  Any whitespace character          [ \t\n\r\f\v]
\S  Any non-whitespace character      [^ \t\n\r\f\v]
\w  Any alphanumeric character        [a-zA-Z0-9_]
\W  Any non-alphanumeric character    [^a-zA-Z0-9_]
\A  Beginning of the string           ^
\Z  End of the string                 $
Regular expressions

findall() → Returns a list of all matching substrings (the re module has no find(); str.find() is a string method that returns an index)
match() → Checks whether the beginning of the string matches
fullmatch() → Checks whether the entire string matches
search() → Checks for a match anywhere, not just at the beginning. Used when you want to extract part of a string
replace() → Replaces a substring (a str method, not part of re)
sub() → Replaces the parts that match the pattern. The replaced string is returned.
subn() → Returns a tuple of the replaced string (the same as the return value of sub()) and the number of replaced parts (the number of parts that matched the pattern).


match/search returns a match object. The following methods can be used for match objects.

Get the matched position: start(), end(), span()
Get the matched string: group()
Get the string for each group: groups()

* If you enclose part of the regular expression pattern in parentheses (), that part is treated as a group. groups() then returns the strings matched by each group as a tuple.

When the pattern is grouped with parentheses (), sub can use the matched strings inside the replacement string.
By default, \1, \2, \3, ... correspond to the parts matched by the first (), second (), third (), and so on. Note that in a normal string that is not a raw string, the backslash must be escaped, as in '\\1'. If you name a group by writing ?P<name> at the beginning of the (), you can refer to it by name, as in \g<name>, instead of by a number such as \1.
import re
re.search("category/(.+?)/", "https://foo.com/category/books/murakami").group(1)
# Obtained string: 'books'

>>> import re
>>> text = "123456abcedf789ghi"
>>> matchobj = re.search(r'[a-z]+', text)
>>> if matchobj:
...     print(matchobj.group())
...     print(matchobj.start())
...     print(matchobj.end())
...     print(matchobj.span())
* Note that re.search retrieves information only for the first matched string.
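The group references described above can be tried with a short sketch. The date string and the group names year, month, and day are made up for illustration.

```python
import re

text = "2020-06-08"  # sample input chosen for this sketch

# Numbered groups: \1, \2, \3 refer to the 1st, 2nd and 3rd ()
swapped = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3/\2/\1", text)

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement
named = re.sub(
    r"(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)",
    r"\g<month>/\g<day>/\g<year>",
    text,
)

# subn returns (replaced string, number of replacements)
masked, count = re.subn(r"\d", "#", text)

# groups() returns the string matched by each group as a tuple
parts = re.match(r"(\d+)-(\d+)-(\d+)", text).groups()

# findall returns every match, whereas search stops at the first one
words = re.findall(r"[a-z]+", "123456abcedf789ghi")

print(swapped, named, masked, count, parts, words)
```

Raw strings (r"...") are used for both the pattern and the replacement so that the backslashes do not need to be doubled.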

The syntax of replace is target_string.replace(old, new[, count]).
>>> raw_abc = r"aaaaabbbbbccccc"
>>> rep_raw_abc = raw_abc.replace("c", "C")
>>> print("Before change:", raw_abc, "After change:", rep_raw_abc)
Before change: aaaaabbbbbccccc After change: aaaaabbbbbCCCCC

Note the difference in argument order between replace and re.sub(pattern, replacement, target_string[, count]).
[Python] Very convenient regular expressions! - Qiita https://qiita.com/hiroyuki_mrp/items/29e87bf5fe46de62983c
Regular expression flag

Limited to ASCII characters: re.ASCII
Case insensitive: re.IGNORECASE
Match the beginning and end of each line: re.MULTILINE
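Specify multiple flags: combine them with the | operator
Compiling the pattern

import re
s = 'user@example.com'  # sample input (assumed; the original does not define s)
p = re.compile(r'([a-z]+)@([a-z]+)\.com')  # the dot is escaped so that it matches a literal "."
m = p.match(s)
result = p.sub('new-address', s)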
Virtual environment venv can isolate the modules to be installed for each virtual environment. Use pyenv or Anaconda to switch the Python interpreter. https://tinyurl.com/y4ypsz9r
% and %% are the prefixes of magic commands.
Prefix a line with ! to execute an OS shell command.
Shift + Tab displays the docstring.
How to use the magic command (magic function) of Jupyter Notebook https://miyukimedaka.com/2019/07/28/blog-0083-jupyter-notebook-magic-command-explanation/
Frequently used magic commands

%time: Measures the execution time of the code that follows and displays the result.
%timeit: Measures the execution time of the following code several times and displays the fastest result and average.
%env: You can get and set environment variables.
%who: Shows the currently declared variables.
%whos: Shows the currently declared variables, their types, and their contents.
%pwd: Shows the current directory.
%history: Displays a list of code cell execution histories.
%ls: Shows a list of files in the current directory.
%matplotlib inline: Graphs drawn with pyplot and similar tools would otherwise open in a separate window; with this magic command they are displayed inside the notebook.

%%timeit: Applies %timeit to all the code in the cell.

%%html, %%HTML: Allows you to write and execute html code.
Jupyter notebook storage format The notebook format (.ipynb) is a JSON file.
3 Foundations of mathematics Matrix For matrix multiplication: "Commutative law: x", "Associative law: ○", "Distributive law: ○"

The commutative law does not hold in general (note that it does hold for some pairs of matrices).

A matrix with one row or one column is a vector.

If the number of columns of the matrix equals the size of the vector, their product is defined, and the result is a vector whose size equals the number of rows of the original matrix.
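These properties can be confirmed numerically with NumPy. The matrices A, B, C, M and the vector v are arbitrary values picked for this sketch.

```python
import numpy as np

# Arbitrary sample matrices for illustration
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [1, 3]])

# Commutative law: does not hold in general
commutes = np.array_equal(A @ B, B @ A)

# ... but it does hold for some pairs, e.g. with the identity matrix
I = np.eye(2, dtype=int)
commutes_with_identity = np.array_equal(A @ I, I @ A)

# Associative and distributive laws hold
associative = np.array_equal((A @ B) @ C, A @ (B @ C))
distributive = np.array_equal(A @ (B + C), A @ B + A @ C)

# A (3 x 2) matrix times a vector of size 2 gives a vector of size 3
M = np.arange(6).reshape(3, 2)
v = np.array([1, 1])
Mv = M @ v

print(commutes, commutes_with_identity, associative, distributive, Mv)
```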
Common logarithm and natural logarithm The common logarithm is the base 10 logarithm. The natural logarithm is based on e.
Euclidean distance Straight-line distance
Manhattan distance Zigzag distance along a grid (named after Manhattan's street grid)
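A small worked example of the two distances, using two made-up points p and q:

```python
import numpy as np

# Two sample points chosen for this sketch
p = np.array([1, 2])
q = np.array([4, 6])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Manhattan distance: sum of absolute differences along each axis
manhattan = np.sum(np.abs(p - q))

print(euclidean, manhattan)  # 5.0 7
```

The differences are 3 and 4, so the straight-line distance is 5 while the grid distance is 3 + 4 = 7.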
Function When differentiating F(x) gives f(x), F is called the primitive function of f, and f is called the derivative of F.
Integral An integral whose range of integration is not defined is called an indefinite integral. Since an arbitrary constant is differentiated to 0, the indefinite integral usually includes the constant of integration "C".
Differentiation and integration The derivative can be regarded as the slope, and the integral as the area. In data analysis and machine learning, the point that the slope of the function is 0 is used as useful information.
Partial differential The derivative of a multivariable function with two or more variables is called the partial derivative. In partial differentiation, it is necessary to show which variable was differentiated.
Probability The expected value of a dodecahedral (12-sided) die is 6.5. For random variables: discrete → probability mass function, continuous → probability density function
Factorial Note that 0! = 1. Also remember that the logarithm of 1 is 0.
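The three facts above can be verified with a few lines of standard-library Python. The sketch assumes a fair die whose faces are numbered 1 to 12.

```python
import math

# Assumption: a fair 12-sided die with faces 1..12
faces = range(1, 13)
expected = sum(faces) / len(faces)  # each face has probability 1/12

zero_factorial = math.factorial(0)
log_of_one = math.log(1)

print(expected, zero_factorial, log_of_one)  # 6.5 1 0.0
```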
sin and cos sin/cos are called sine and cosine, respectively. tan is tangent.
4 Analysis practice by library NumPy dtype attribute The dtype attribute lets you check the data type of the elements of an ndarray.
Convenient ways to generate an ndarray

# Define an array of numbers from -5 up to (but not including) 5 in steps of 0.1
x = np.arange(-5, 5, 0.1)

# Generate an arithmetic progression from 1 to 10 with the number of elements given by num
np.linspace(1, 10)
The syntax is np.linspace(start, stop, num=50, endpoint=True). num specifies the number of elements and defaults to 50.
np.random module Note that, unlike randint in the standard random module, np.random.randint does not include the upper bound (high) in the generated values.


random.random() / np.random.rand(size) generate random numbers from 0 (inclusive) to 1 (exclusive).

import numpy as np
import random

print(random.random())
# 0.9733563537374995

print(np.random.rand(3))
# [ 0.69962497 0.61950126 0.7944733 ]

print(np.random.rand(2, 3))
# [[ 0.29315148 0.06560949 0.56110618]
# [ 0.62784039 0.19218867 0.07087768]]

np.random.randn(size) generates random numbers that follow the standard normal distribution.
print(np.random.randn(3, 3)) #3x3 array with standard normal distribution
# [[-0.52434526 0.16597271 -2.22295048]
# [ 0.46995083 -0.64576356 -2.73155503]
# [ 1.04575168 0.05712791 -0.46522963]]

To generate random numbers that follow a normal distribution with a given mean and standard deviation, do as follows.
np.random.normal(mu, sd, 10000)

When generating integer random numbers
np.random.randint(low, high, size)

np.random.randint(1, 10, 2) # Generates an ndarray of two integers from 1 to less than 10.
np.random.randint(1, 10, (2, 3)) # Generates a 2-by-3 ndarray.
np.random.randint(2, size=8) # If high is omitted, the value of low is treated as high.
# array([1, 0, 0, 0, 1, 1, 1, 0])
np.random.randint(1, size=8) # Only integers less than 1, that is, 0.
# array([0, 0, 0, 0, 0, 0, 0, 0])

choice differs from the standard module as follows.
random.choice(seq) selects one element from seq
np.random.choice(a, size) can select multiple elements from a

seq1 = [0, 1, 2, 3]

random.choice(seq1) # Choose one element

random.choice("hello") # Choose 1 letter out of 5

np.random.choice(seq1, 4) # Array of 4 elements chosen with replacement

np.random.choice([2, 4, 6], 2) # Array of 2 elements chosen with replacement

np.random.choice([0, 1], (3, 3)) # Fill a 3x3 array with 0s and 1s

np.random.choice(5, 2) # Equivalent to np.random.randint(0, 5, 2)
How to use NumPy (12) Random numbers, random-Remrin's python capture diary http://python-remrin.hatenadiary.jp/entry/2017/04/26/233717
Conversion to a one-dimensional array You can use the ravel or flatten method to convert a two-dimensional NumPy array to one dimension. The ravel method returns a view (reference) whenever possible, and the flatten method returns a copy.
Copy and reference

a = np.array([1, 2, 3])
b = a ①
b = a.copy() ②
① is a reference and ② is a copy. Note that slicing a Python standard list gives a copy, but slicing a NumPy array gives a reference (view).
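The difference between views and copies becomes visible when an element is modified. The following sketch uses small made-up arrays.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
r = a.ravel()     # a view here, because a is contiguous in memory
f = a.flatten()   # always a copy

r[0] = 100        # also changes a
f[1] = 200        # does not change a
print(a[0, 0], a[0, 1])  # 100 2

# Slicing a standard list gives a copy
py_list = [1, 2, 3]
py_slice = py_list[:2]
py_slice[0] = 99
print(py_list)  # [1, 2, 3]

# Slicing an ndarray gives a view
arr = np.array([1, 2, 3])
np_slice = arr[:2]
np_slice[0] = 99
print(arr)  # [99  2  3]
```

np.shares_memory(a, r) can be used to check whether two arrays refer to the same data.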
Matrix splitting

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
first1, second1 = np.vsplit(a, [2])
first2, second2 = np.hsplit(second1, [2])
print(second2)
The vsplit function splits the matrix in the row direction (vertically), and the hsplit function splits it in the column direction (horizontally).
About the output of print

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([7,8,9])
print(a[-1:, [1,2]], b.shape)
[[5 6]] (3,)

a[-1:, [1,2]] takes the last row ([4, 5, 6]) and then columns 1 and 2, so 5 and 6 are extracted; because -1: is a slice, the result stays two-dimensional. Note that b is one-dimensional because it has only one pair of brackets, so b.shape is (3,).
Number of elements generated by np.arange x = np.arange(0.0, 1.5, 0.1) gives array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4]), which has 15 elements. If the middle argument (stop) is 15, there are 10 times as many, that is, 150. np.sin(x) interprets x in radians.
pandas Date array

dates = pd.date_range(start="2020-01-01", end="2020-12-31")

print(dates)
date_range() generates an array of dates. You can specify the start and end dates and times with start and end.
DataFrame concatenation and join

Concatenation: Connect the contents of the data in a certain direction as they are. pd.concat, DataFrame.append

Join: Connect the contents of the data by associating them through the value of some key. pd.merge, DataFrame.join
pd.concat([df1, df2]) concatenates vertically; add axis=1 to concatenate horizontally. If nothing is specified, the result is a full outer join, so specify join='inner' for an inner join. It is also possible to specify the rows/columns to join on, as in join_axes=[df1.index] (join_axes was removed in pandas 1.0).

Vertically, you can also simply concatenate with df1.append(df2) (DataFrame.append was removed in pandas 2.0; use pd.concat there). Passing a Series in place of df2 adds a row. Note that unless ignore_index=True is specified, the index is concatenated as it is.

Joins are done with merge. The syntax is pd.merge(left, right, on='key', how='inner'). how can be inner/left/right/outer. To join on multiple keys, pass a list to on. DataFrame.join is convenient when joining on the index. Its default is a left outer join, which can be changed with how (left.join(right, how='inner')).
Python pandas: Data concatenation / join processing seen in diagrams - StatsFragments http://sinhrks.hatenablog.com/entry/2015/01/28/073327
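A compact sketch of concatenation versus joining, using two small made-up DataFrames. The column names key, x, and y are assumptions for the example, and pd.concat is used for the vertical case.

```python
import pandas as pd

# Two small sample frames (hypothetical data)
df1 = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

# Concatenation: stack the rows as they are; missing columns become NaN
stacked = pd.concat([df1, df2], ignore_index=True)

# Join on a key column
inner = pd.merge(df1, df2, on="key", how="inner")  # only key "b"
outer = pd.merge(df1, df2, on="key", how="outer")  # keys "a", "b", "c"

# Join on the index; DataFrame.join defaults to a left outer join
left_joined = df1.set_index("key").join(df2.set_index("key"))

print(stacked, inner, outer, left_joined, sep="\n\n")
```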
read_html() If there are multiple tables, get them as a list of DataFrames
Missing value processing With the fillna() arguments method='ffill' and method='bfill', missing elements in the same column can be filled with different values. With method='ffill', a missing value is filled with the value of the preceding element (smaller index); with method='bfill', it is filled with the value of the following element (larger index).

data['Age'].fillna(20) #Fill in the missing values in column Age with 20

data['Age'].fillna(data['Age'].mean()) #Fill in the missing values in column Age with the average value of Age

data['Age'].fillna(data['Age'].median()) #Fill in the missing values in column Age with the median of Age

data['Age'].fillna(data['Age'].mode()[0]) # Fill in the missing values in column Age with the mode of Age (mode() returns a Series, so take its first element)
Missing value handling with Pandas- Qiita https://qiita.com/0NE_shoT_/items/8db6d909e8b48adcb203
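The forward and backward fill behavior can be seen on a tiny Series. This sketch uses the ffill() and bfill() methods, which fill in the same way as method='ffill' and method='bfill'; the values are made up.

```python
import numpy as np
import pandas as pd

# Sample column with missing values (hypothetical data)
age = pd.Series([20.0, np.nan, np.nan, 35.0, np.nan])

forward = age.ffill()    # take the value from the preceding element
backward = age.bfill()   # take the value from the following element
by_mean = age.fillna(age.mean())  # the mean ignores NaN: (20 + 35) / 2

print(forward.tolist())
print(backward.tolist())
print(by_mean.tolist())
```

The last element stays missing after bfill() because no later value exists to copy from.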
Mutual conversion between NumPy and pandas The pandas → NumPy conversion uses the values attribute of the DataFrame; the reverse is done by passing the ndarray as an argument to pd.DataFrame().
Index names and column names are not retained when converting to NumPy.
DataFrame.describe() describe returns statistics for each column, such as the count, mean, standard deviation, minimum, and maximum. std is the standard deviation. For non-numeric columns, top is the mode. https://tinyurl.com/y3gn3dz4
How to use groupby and Grouper

import numpy as np
import pandas as pd
np.random.seed(123)
dates = pd.date_range(start="2017-04-01", periods=365)
df = pd.DataFrame(np.random.randint(1, 31, 365), index=dates, columns=["rand"])
df_year = pd.DataFrame(df.groupby(pd.Grouper(freq='W-SAT')).sum(), columns=["rand"])
Grouper allows flexible grouping by specifying the frequency with freq.

* The 5th line creates a DataFrame that uses the date as an index. Each value in the rand column is a random integer from 1 to 30.
Matplotlib MATLAB style and OOP (object-oriented) style The former needs less code but does not allow detailed settings. Basically, the latter should be used.

Users do not need to prepare Figures or Axes to create a single graph. These objects are automatically generated.
Generation of drawing objects and subplot objects

fig, axes = plt.subplots(2)
As shown above, fig and axes can be generated at once. It is also possible to add subplots to fig individually with fig.add_subplot().

■ When creating fig and ax individually
# Create the Figure, the area in which Axes are placed
fig = plt.figure(facecolor = "lightgray")

# Add Axes to the Figure
ax = fig.add_subplot(111)

subplots(2) gives two rows of subplots; specifying ncols=2 gives two columns.
How to arrange multiple subplots

ax_1 = fig.add_subplot(221)
ax_2 = fig.add_subplot(222)
ax_3 = fig.add_subplot(223)

#Plot the data in Axes in the 3rd row and 2nd column

ax[2, 1].plot(x, y)
pyplot.subplots() can create multiple Axes objects at once. For the first argument nrows and the second argument ncols, pass the number of Axes in the row direction and in the column direction, respectively. [Matplotlib] OOP and MATLAB style https://python.atelierkobato.com/matplotlib/
Axis settings

#Axes settings

ax.grid() #Show grid

ax.set_title("Axes sample", fontsize=14) #Show title

ax.set_xlim([-5, 5]) #x-axis range

ax.set_ylim([-5, 5]) #y-axis range
Formatting a Figure object

# Create and format the Figure object
fig = plt.figure(
    # Size
    figsize = (5, 5),
    # Fill color
    facecolor = "lightgray",
    # Show border
    frameon = True,
    # Border color
    edgecolor = "black",
    # Border thickness
    linewidth = 4)

# Add Axes (a subplot) to the Figure
ax = fig.add_subplot(
    # Number of rows, number of columns, Axes number
    111,
    # Fill color
    facecolor = "lightgreen",
    # x-axis and y-axis range
    xlim = [-4,4], ylim = [0,40])
Graph display

plt.show()
Display the graph with the show method.
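import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = [1, 2, 3]
y1 = [10, 2, 3]
y2 = [5, 3, 6]
labels = ['Setosa', 'Versicolor', 'Virginica']
# Total height of each stacked bar (assumed definition; the original omits it)
y_total = [num1 + num2 for num1, num2 in zip(y1, y2)]
ax.bar(x, y_total, tick_label=labels, label='y1')
ax.bar(x, y2, label='y2')
ax.legend()
plt.show()
Note that y1 is not passed to ax.bar directly: y_total is drawn first and y2 is drawn over it, which produces a stacked bar chart.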
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
mu = 100
sigma = 15
x = np.random.normal(mu, sigma, 1000)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(x, bins=25, orientation='horizontal')

# Print the frequency table: lower bound, upper bound, and count of each bin
for i, num in enumerate(n):
    print('{:.2f} - {:.2f} {}'.format(bins[i], bins[i + 1], num))

plt.show()
The default value for bins is 10. See textbook P192. The bins return value holds the bin boundaries, so its length is the number of bins + 1.

The variable mu is the mean and the variable sigma is the standard deviation.
The histogram is drawn horizontally.
Of the return values n, bins, and patches of the hist method, bins contains the bin boundary values, of which there are 26.
When this script is executed, a frequency distribution table is output in addition to the histogram.

The print statement in the code above displays the frequency distribution table.
51.53 - 55.62 2.0

55.62 - 59.70 3.0

59.70 - 63.78 6.0

63.78 - 67.86 7.0

67.86 - 71.94 16.0

71.94 - 76.02 29.0

76.02 - 80.11 37.0
Pie chart display See textbook P198. To maintain the aspect ratio, call ax.axis('equal'). autopct displays each value as a percentage. Use explode to emphasize a slice.

Example: plt.pie(x, labels=label, counterclock=False, startangle=90)Draw clockwise from directly above
https://tinyurl.com/yyl8yml6
Scikit-learn DBSCAN The DBSCAN method, which is one of unsupervised learning, is a density-based clustering algorithm that focuses on the distance between feature vectors.
Evaluation scale of classification

Precision
Recall
F1 Score
Accuracy (rate of correct answers)
Precision and Recall are in a trade-off relationship. Therefore, you should also look at the F1 Score.

Taking the common example of cancer diagnosis:
Precision → Emphasized when you want to reduce false positives (misdiagnosing a healthy person)
Recall → Emphasized when you want to avoid overlooking actual positives
Accuracy → General index for checking the accuracy of classification
Machine learning practice (supervised learning: classification)- KIKAGAKU https://www.kikagaku.ai/tutorial/basic_of_machine_learning/learn/machine_learning_classification
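To make the four metrics concrete, the sketch below computes them by hand from a made-up set of ten predictions (1 = positive, 0 = negative). The labels are invented for illustration only.

```python
# Hypothetical ground truth and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / len(y_true)

print(precision, recall, f1, accuracy)  # 0.75 0.75 0.75 0.8
```

scikit-learn offers the same calculations in sklearn.metrics (precision_score, recall_score, f1_score, accuracy_score).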
Evaluation scale of regression model MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and MAE (Mean Absolute Error) are well known. https://tinyurl.com/y2xc9c58
https://tinyurl.com/y5k8gc9a
Meaning of various errors (RMSE, MAE, etc.)-Mathematics learned with concrete examples https://mathwords.net/rmsemae#:~:text=MAE%EF%BC%88Mean%20Absolute%20Error%EF%BC%89,-%E3%83%BB%E5%AE%9A%E7%BE%A9%E5%BC%8F%E3%81%AF&text=%E3%83%BB%E5%B9%B3%E5%9D%87%E7%B5%B6%E5%AF%BE%E8%AA%A4%E5%B7%AE%E3%81%A8%E3%82%82%E8%A8%80%E3%81%84,%E3%81%A8%E3%81%97%E3%81%A6%E6%89%B1%E3%81%86%E5%82%BE%E5%90%91%E3%81%8C%E3%81%82%E3%82%8A%E3%81%BE%E3%81%99%E3%80%82
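The three error measures can be computed directly with NumPy. The true and predicted values here are made up for the example.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_pred - y_true          # [-1, 0, 2, 1]

mse = np.mean(errors ** 2)        # mean of squared errors
rmse = np.sqrt(mse)               # square root of MSE, same unit as y
mae = np.mean(np.abs(errors))     # mean of absolute errors

print(mse, rmse, mae)
```

Because the errors are squared, MSE and RMSE react more strongly to the single error of 2 than MAE does.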
Datasets bundled with scikit-learn

load_iris
load_boston
iris records the length and width of the sepals and petals of 150 irises, together with the species of each flower: 4 explanatory variables and 1 objective variable. boston is a dataset that records 14 attributes by region on the outskirts of Boston, USA, including housing prices, the crime rate per capita, and the average number of rooms per dwelling. Note that load_boston was removed in scikit-learn 1.2.
Decision tree

An algorithm for regression and classification. It has the advantage of being easy to interpret and requiring little preprocessing.
Textbook P235. Information gain = impurity of the parent node - sum of the impurities of the child nodes. If it is positive, the node should be split into child nodes; if it is negative, it should not be split. Tree structure (data structure) - Wikipedia https://ja.wikipedia.org/wiki/%E6%9C%A8%E6%A7%8B%E9%80%A0_(%E3%83%87%E3%83%BC%E3%82%BF%E6%A7%8B%E9%80%A0)#%E7%94%A8%E8%AA%9E
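As a worked example of information gain, the sketch below uses Gini impurity and weights each child node by its share of the samples. Both choices are assumptions made for illustration (entropy is another common impurity measure), and the helper names gini and information_gain are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def information_gain(parent, children):
    """Parent impurity minus the sample-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# Hypothetical node with 4 samples of class 0 and 4 of class 1
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A mediocre split and a perfect split
gain_mixed = information_gain(parent, [[0, 0, 0, 1], [0, 1, 1, 1]])
gain_pure = information_gain(parent, [[0, 0, 0, 0], [1, 1, 1, 1]])

print(gain_mixed, gain_pure)  # 0.125 0.5
```

The perfect split removes all impurity, so its gain equals the parent's impurity of 0.5.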
SVM

Draw a decision boundary so that the margin is maximized. The technique for making linearly inseparable data linearly separable is called the kernel trick.
from sklearn.svm import SVC

svc = SVC()

C is a cost parameter and means a penalty for false predictions. If it is too large, it causes overfitting. gamma determines the complexity of the model. The larger the value, the more complicated it becomes and overfitting occurs.
Sigmoid function

y = 1 / (1 + exp(-x)). The curve passes through (0, 0.5), and 0 < y < 1.
The sigmoid function is used in models that perform binary classification. Three-class classification can be handled by performing binary classification once per class.
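The shape of the standard sigmoid, 1 / (1 + exp(-x)), can be checked numerically. The input values are arbitrary.

```python
import numpy as np


def sigmoid(x):
    """Standard sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


x = np.array([-10.0, 0.0, 10.0])  # arbitrary sample inputs
y = sigmoid(x)

print(y)
```

The middle value is exactly 0.5, and the outputs approach 0 and 1 for large negative and positive inputs without reaching them.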
Normalization Well-known methods are standardization to mean 0 and variance 1 [StandardScaler] and normalization to maximum 1 and minimum 0 [MinMaxScaler].
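Both scalers can be tried on a single made-up feature column. This is a minimal sketch using scikit-learn's preprocessing module.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with five sample values (hypothetical data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

standardized = StandardScaler().fit_transform(X)  # mean 0, variance 1
scaled = MinMaxScaler().fit_transform(X)          # minimum 0, maximum 1

print(standardized.ravel())
print(scaled.ravel())
```

In practice the scaler is usually fitted on the training data only and then applied to the test data with transform().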
Separation of training data and test data from sklearn.model_selection import train_test_split
Linear model Linear regression (LinearRegression) is divided into simple regression, with one explanatory variable, and multiple regression, with several explanatory variables.
Principal component analysis This is a method of compressing data to a dimension equal to or lower than the original by looking for the directions in which the variance is largest.

Principal component analysis can be executed using the PCA class of scikit-learn's decomposition module.
Grid search

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

clf = DecisionTreeClassifier()
param_grid = {'max_depth': [3, 4, 5]}
cv = GridSearchCV(clf, param_grid=param_grid, cv=10)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
In the code above, the optimal depth of the decision tree may change on each run. For reproducibility, do as follows.

clf = DecisionTreeClassifier(random_state=0)
Parameter explanation of decision tree analysis – S-Analysis http://data-analysis-stats.jp/2019/01/14/%E6%B1%BA%E5%AE%9A%E6%9C%A8%E5%88%86%E6%9E%90%E3%81%AE%E3%83%91%E3%83%A9%E3%83%A1%E3%83%BC%E3%82%BF%E8%A7%A3%E8%AA%AC/
Clustering k-means is a method of first randomly allocating cluster centers, modifying the cluster centers while calculating the distance to each data, and recalculating and clustering until the final cluster centers converge.

Clustering can be broadly divided into partitioning-optimization clustering and hierarchical clustering. Partitioning-optimization clustering prepares in advance a function that measures the goodness of a clustering and seeks the clustering that minimizes the value of that function. Hierarchical clustering, on the other hand, builds clusters hierarchically by dividing or merging clusters.

Hierarchical clustering is further divided into the agglomerative type and the divisive type. The agglomerative type starts with each data point as its own cluster and successively merges similar clusters. The divisive type starts with all data points in a single cluster and successively splits off groups of dissimilar data points.

The divisive type tends to require more computation than the agglomerative type.
https://tinyurl.com/y6cgp24f
https://tinyurl.com/y2df2w4c
