To study machine learning and Bayesian optimization, I tried Kaggle's tutorial-style competition "Titanic: Machine Learning from Disaster" with a neural network. For hyperparameter optimization, I used Preferred Networks' Optuna library ([official site](https://preferred.jp/ja/projects/optuna/ "optuna library")).
Public Score : 0.7655
Here is the link to the Kaggle notebook: kaggle notebook
First comes preprocessing: handling missing values and dropping features that seem unlikely to be relevant. I chose them by intuition.
train = train.fillna({'Age':train['Age'].mean()})
X_df = train.drop(columns=['PassengerId','Survived', 'Name', 'Ticket', 'Cabin', 'Embarked'])
y_df = train['Survived']
Next, convert the Sex column into a numeric dummy variable.
X_df = X_df.replace('male', 0)
X_df = X_df.replace('female', 1)
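As a small illustration of the two preprocessing steps above, here is a sketch on a made-up three-row frame. The values are invented; only the column names come from the Titanic data. Note that it uses `Series.map` instead of `DataFrame.replace`. That is my substitution, not what the code above does, and it has the side effect of touching only the Sex column.

```python
import pandas as pd

# Made-up rows; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 38.0],
    "Cabin": [None, "C85", None],
})

# Fill the missing Age with the column mean (mean of 22 and 38 is 30).
toy = toy.fillna({"Age": toy["Age"].mean()})

# Encode Sex as 0/1, limited to that one column.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Drop a feature judged irrelevant.
toy = toy.drop(columns=["Cabin"])

print(toy)
```

Any value that is not in the mapping dictionary becomes NaN with `map`, so it is worth checking for unexpected categories first.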
Split the data into training and validation sets.
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_df.values, y_df.values, test_size=0.25, shuffle=True, random_state=0)
Let's take a look at the contents of X_train. The column names are Pclass, Sex, Age, SibSp, Parch, Fare.
[[ 3. 0. 28. 0. 0. 7.8958 ]
[ 3. 1. 17. 4. 2. 7.925 ]
[ 3. 0. 30. 1. 0. 16.1 ]
...
[ 3. 0. 29.69911765 0. 0. 7.7333 ]
[ 3. 1. 36. 1. 0. 17.4 ]
[ 2. 0. 60. 1. 1. 39. ]]
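Now build the neural network model. It uses only fully connected layers. Optuna also optimizes the number of hidden layers and the number of units per layer.
# Assumes the Keras that ships with TensorFlow; adjust the imports if you use standalone Keras.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

def create_model(activation, num_hidden_layer, num_hidden_unit):
    inputs = Input(shape=(X_train.shape[1],))
    model = inputs
    # Note: range(1, n) runs n - 1 times, so this adds num_hidden_layer - 1 hidden layers.
    for i in range(1, num_hidden_layer):
        model = Dense(num_hidden_unit, activation=activation)(model)
    # A single sigmoid unit outputs the survival probability.
    model = Dense(1, activation='sigmoid')(model)
    model = Model(inputs, model)
    return model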
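One detail of the loop in `create_model` is easy to miss: `range(1, num_hidden_layer)` iterates one time fewer than its argument. The sketch below only counts iterations (the helper name `count_hidden_layers` is mine, not part of the original code), so it runs without Keras.

```python
def count_hidden_layers(num_hidden_layer):
    """Number of Dense hidden layers that `for i in range(1, num_hidden_layer)` adds."""
    return len(range(1, num_hidden_layer))

# Optuna samples num_hidden_layer from 1 to 5 in this article.
for n in range(1, 6):
    print(f"num_hidden_layer={n} -> {count_hidden_layers(n)} hidden layer(s)")
```

So a trial with `num_hidden_layer=1` is plain logistic regression on the six inputs, and the best trial's value of 5 corresponds to four hidden layers. If you want the parameter to match the actual depth, `range(num_hidden_layer)` would do it, at the cost of changing the search space from the one that produced the results reported here.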
Define the range of each parameter for Optuna to search. Optuna minimizes or maximizes the return value of the objective function; the default is to minimize. To maximize instead, pass direction='maximize' to create_study, which appears later.
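# Assumes the Keras that ships with TensorFlow; adjust the imports if you use standalone Keras.
import numpy as np
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import SGD, Adagrad, Adam, RMSprop

def objective(trial):
    # Reset Keras' global state left over from the previous trial.
    K.clear_session()

    activation = trial.suggest_categorical('activation', ['relu', 'tanh', 'linear'])
    optimizer = trial.suggest_categorical('optimizer', ['adam', 'rmsprop', 'adagrad', 'sgd'])
    num_hidden_layer = trial.suggest_int('num_hidden_layer', 1, 5, step=1)
    num_hidden_unit = trial.suggest_int('num_hidden_unit', 10, 100, step=10)
    # Log-uniform sampling; suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
    learning_rate = trial.suggest_float('learning_rate', 0.00001, 0.1, log=True)

    if optimizer == 'adam':
        optimizer = Adam(learning_rate=learning_rate)
    elif optimizer == 'adagrad':
        optimizer = Adagrad(learning_rate=learning_rate)
    elif optimizer == 'rmsprop':
        optimizer = RMSprop(learning_rate=learning_rate)
    elif optimizer == 'sgd':
        optimizer = SGD(learning_rate=learning_rate)

    model = create_model(activation, num_hidden_layer, num_hidden_unit)
    model_list.append(model)  # keep every trial's model for retraining later
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['acc', 'mape'])
    es = EarlyStopping(monitor='val_acc', patience=50)
    history = model.fit(X_train, y_train, validation_data=(X_val, y_val), verbose=0, epochs=200, batch_size=20, callbacks=[es])
    history_list.append(history)

    # Optuna minimizes by default, so return the error rate (1 - accuracy) of the last epoch.
    val_acc = np.array(history.history['val_acc'])
    return 1 - val_acc[-1]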
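The objective above scores each trial by the validation accuracy of its last epoch. With a patience of 50, that last epoch can be well past the best one. Here is a minimal sketch of the difference, using invented per-epoch numbers in place of `history.history['val_acc']`:

```python
# Invented validation accuracies, one per epoch (a stand-in for history.history['val_acc']).
val_acc = [0.70, 0.78, 0.83, 0.81, 0.80]

# What the objective above returns: error rate at the final epoch.
last_error = 1 - val_acc[-1]

# Alternative: score the trial by its best epoch instead.
best_error = 1 - max(val_acc)
best_epoch = val_acc.index(max(val_acc))

print(f"last-epoch error: {last_error:.2f}")
print(f"best-epoch error: {best_error:.2f} (epoch index {best_epoch})")
```

Which one to return is a design choice. If you score by the best epoch, Keras' `EarlyStopping(restore_best_weights=True)` keeps the stored model consistent with that score.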
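Run the optimization. Every trial's model is stored in a list so that the best one can easily be retrained afterwards. The 50 trials took about 6 minutes and 12 seconds.
import optuna

# Filled by objective(): one model and one training history per trial.
model_list = []
history_list = []

study_name = 'titanic_study'
# Results are stored in SQLite; load_if_exists=True resumes an existing study with the same name.
study = optuna.create_study(study_name=study_name, storage='sqlite:///../titanic_study.db', load_if_exists=True)
study.optimize(objective, n_trials=50)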
See the result of the optimization.
print(study.best_params)
print('')
print(study.best_value)
Here is the optimization result. Sorry for the rough formatting. The first line shows the best hyperparameters; the second is the objective value, that is, 1 - validation accuracy (an error rate), not the accuracy itself.
{'activation': 'relu', 'learning_rate': 0.004568302718922509, 'num_hidden_layer': 5, 'num_hidden_unit': 50, 'optimizer': 'rmsprop'}
0.17937219142913818
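To read that number as an accuracy, a little arithmetic helps. The sketch below assumes the standard Kaggle train.csv with 891 rows, and relies on scikit-learn rounding the validation share up when `test_size` is a fraction.

```python
import math

n_rows = 891                        # rows in Kaggle's Titanic train.csv (assumed unchanged)
n_val = math.ceil(n_rows * 0.25)    # size of the validation split with test_size=0.25

best_value = 0.17937219142913818    # study.best_value reported above
val_accuracy = 1 - best_value       # the objective returns 1 - val_acc
n_wrong = round(best_value * n_val) # misclassified validation passengers

print(f"validation rows: {n_val}")
print(f"validation accuracy: {val_accuracy:.4f}")
print(f"misclassified: {n_wrong}")
```

Under those assumptions, the best trial got about 82% of the validation passengers right, which is noticeably higher than the public score of 0.7655.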
Now predict on the test data. Before that, train the model with the best parameters for a sufficient number of epochs. The test data is preprocessed in almost the same way as the training data, except that PassengerId is kept in a separate DataFrame for the submission file.
# Trial numbers start at 0, so the best trial's model sits at index best_trial.number
# (this assumes model_list was filled by a fresh study in this session).
best_model = model_list[study.best_trial.number]
# Note: passing the optimizer by name recompiles with its default learning rate, not the tuned one.
best_model.compile(optimizer=study.best_trial.params['optimizer'], loss='binary_crossentropy', metrics=['acc', 'mape'])
es = EarlyStopping(monitor='val_acc', patience=100)
history = best_model.fit(X_train, y_train, validation_data=(X_val, y_val), verbose=1, epochs=400, batch_size=20, callbacks=[es])
predicted = best_model.predict(X_test.values)
predicted_survived = np.round(predicted).astype(int)
Finally, join the PassengerId column with the predicted survival values and write the result to a CSV file.
df = pd.concat([test_df_index,pd.DataFrame(predicted_survived, columns=['Survived'])], axis=1)
df.to_csv('gender_submission.csv', index=False)
df
| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |
| 417 | 1309 | 0 |
418 rows × 2 columns
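For reference, here is the prediction-to-submission step on made-up numbers. The probabilities are invented, and `test_df_index` is rebuilt here as a hypothetical three-row frame, since the article does not show how it was created. The sketch thresholds at 0.5 explicitly instead of calling `np.round`, and flattens the (n, 1) output to one dimension.

```python
import numpy as np
import pandas as pd

# Invented sigmoid outputs with shape (n, 1), like a one-unit Keras model returns.
predicted = np.array([[0.12], [0.91], [0.47]])

# Explicit threshold at 0.5, flattened to a 1-D array of 0/1 labels.
predicted_survived = (predicted > 0.5).astype(int).ravel()

# Hypothetical stand-in for the PassengerId frame saved during preprocessing.
test_df_index = pd.DataFrame({"PassengerId": [892, 893, 894]})

# Attach the labels as a new column; row order must match the prediction order.
submission = test_df_index.assign(Survived=predicted_survived)
print(submission)
```

Calling `submission.to_csv(..., index=False)` afterwards gives the two-column file that Kaggle expects.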
Public Score : 0.7655
The result was mediocre, but getting there was very easy. I worry that I will end up relying on Optuna for the rest of my life and that my own tuning skills will not improve. Is it really okay to optimize everything automatically?
This article was very easy to understand and helpful: [Introduction to Optuna](https://qiita.com/studio_haneya/items/2dc3ba9d7cafa36ddffa "Introduction to Optuna")