SIGNATE practice: predicting wine varieties.
RandomForest
Of the three algorithms I tried, random forest was the most accurate, so I adopted it as the final classifier.
wine-learning.py
import pandas as pd

wine_data = pd.read_csv('train.tsv', sep='\t')
wine_test = pd.read_csv('test.tsv', sep='\t')
Last time I used read_table, but this time I tried read_csv as well, since it was a good opportunity. I find read_table a little easier. Either way, both calls produce the same result here, so neither is more correct than the other.
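A minimal check of that equivalence, using a tiny in-memory TSV as a stand-in for train.tsv (the sample data here is made up for illustration):

```python
import io
import pandas as pd

tsv_text = "A\tB\n1\t2\n3\t4\n"  # a tiny tab-separated sample standing in for train.tsv

df_csv = pd.read_csv(io.StringIO(tsv_text), sep='\t')  # read_csv with an explicit separator
df_table = pd.read_table(io.StringIO(tsv_text))        # read_table defaults to sep='\t'

print(df_csv.equals(df_table))  # → True
```

read_table is simply read_csv with a tab as the default separator, which is why the two DataFrames come out identical.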
wine-learning.py
X = wine_data.loc[:,['Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids','Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue','OD280/OD315 of diluted wines','Proline']].values
y = wine_data.loc[:,'Y'].values
This line gets unwieldy when there are many variables, so I want to find a better way; I will look into it in the next task. By the way, the test data finally makes its appearance here.
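One possible shortening (a sketch, not from the original post): since every column except the target 'Y' is a feature, the long column list can be replaced by dropping 'Y'. The toy DataFrame below stands in for the real train.tsv:

```python
import pandas as pd

# Stand-in for the real training data, with two feature columns and the target 'Y'.
wine_data = pd.DataFrame({'Alcohol': [13.2, 12.8],
                          'Hue': [1.05, 0.98],
                          'Y': [1, 2]})

X = wine_data.drop(columns='Y').values  # all feature columns at once
y = wine_data['Y'].values

print(X.shape)  # → (2, 2)
```

This stays correct even if feature columns are later added or renamed, as long as the target column is still 'Y'.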
wine-learning.py
Xt = wine_test.loc[:,['Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols','Flavanoids','Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue','OD280/OD315 of diluted wines','Proline']].values
wine-learning.py
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
This time as well, the data was split 80/20 into training and test sets.
wine-learning.py
import numpy as np

X_train = X_train[:, ~np.isnan(X_train).any(axis=0)]
X_test = X_test[:, ~np.isnan(X_test).any(axis=0)]
Xt = Xt[:, ~np.isnan(Xt).any(axis=0)]
Missing values that were not present before the split suddenly appeared. I could not work out the cause, so I will investigate it later; for now I simply dropped any column containing a missing value.
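Note that the column-wise deletion above is computed separately for each array, so train, test, and submission data can end up with different (misaligned) columns. A safer sketch, filling missing values with the per-column means learned from the training data instead of dropping columns (toy arrays below are made up for illustration):

```python
import numpy as np

X_train = np.array([[1.0, np.nan], [3.0, 4.0]])  # toy training data with a missing value
X_test = np.array([[np.nan, 6.0]])               # toy test data with a missing value

col_means = np.nanmean(X_train, axis=0)  # per-column means, ignoring NaNs

# Replace NaNs in both splits with means learned from the training data only,
# so every array keeps the same columns.
X_train = np.where(np.isnan(X_train), col_means, X_train)
X_test = np.where(np.isnan(X_test), col_means, X_test)

print(X_train)  # → [[1. 4.] [3. 4.]]
print(X_test)   # → [[2. 6.]]
```

Computing the fill values from the training split alone avoids leaking information from the test data into the model.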
SVC
wine-learning.py
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
LogisticRegression
wine-learning.py
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
RandomForest
wine-learning.py
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)
`random_state` was set to 0, and `n_estimators` (the number of decision trees) was set to 500.
wine-learning.py
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Correct answer rate= ', accuracy)
As before, the accuracy was calculated with the `accuracy_score` function.
Correct answer rate= 0.6111111111111112 (SVC)
Correct answer rate= 0.8888888888888888 (LogisticRegression)
Correct answer rate= 1.0 (RandomForestClassifier)
wine-learning.py
X_pred = np.array(Xt)  # Xt is already a NumPy array, so this copy is optional
y_pred = clf.predict(X_pred)
print(y_pred)
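To turn these predictions into a submission file, something like the following sketch could work. The exact file layout SIGNATE expects is an assumption here (row index plus predicted class, no header); adjust it to the competition's rules:

```python
import io
import pandas as pd

y_pred = [1, 3, 2]  # stand-in for clf.predict(X_pred)

submission = pd.DataFrame({'prediction': y_pred})

# Write index + prediction with no header row; in practice this would go to
# a file, e.g. submission.to_csv('submission.csv', header=False).
buf = io.StringIO()
submission.to_csv(buf, header=False)

print(buf.getvalue())  # → "0,1\n1,3\n2,2\n"
```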
It worked! (clap clap)
- Choose the learning algorithm according to the number of variables.
- Reduce the number of variables (dimensionality reduction) before applying the learning algorithm.
- Understand the characteristics of each algorithm in the first place.
- Investigate the cause of the missing values that appeared when the data was split.
- Try using a confusion matrix.
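To follow up on the last item above, a minimal confusion-matrix sketch with scikit-learn, using toy labels in place of the wine predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 2, 3, 1, 2, 3]
y_hat = [1, 2, 3, 1, 3, 3]  # one class-2 sample misclassified as class 3

# Rows are true classes, columns are predicted classes (sorted label order).
cm = confusion_matrix(y_true, y_hat)
print(cm)
```

Unlike a single accuracy number, the matrix shows which classes get confused with which, so it would reveal, for example, if one wine variety absorbs all the errors.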