[PYTHON] Challenge Kaggle Titanic

This article

A memo from taking on Kaggle to study data analysis. I worked through the "Titanic" tutorial, but there is still a lot I don't understand, such as pandas and scikit-learn. I thought the code below would do the job, but the score is not good.

Problem

The task is to predict which Titanic passengers survived (roughly put, at the risk of oversimplifying). Train data and test data are given, and both contain features such as gender and age. However, the train data has the survival label (0/1) while the test data does not. ***In other words, the problem is to build a survival model from the train data and predict survival on the test data.*** (The accuracy of the predictions can be checked by submitting them on the website.)
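As a quick check (a minimal sketch; the ./data paths match the ones used in the code below), the 'Survived' column exists only in the train file:

import pandas as pd

train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
print('Survived' in train.columns)  # True  -- the label is in the train data
print('Survived' in test.columns)   # False -- this is what we have to predict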

Code

# Imports needed to run this script
import csv

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


# Preparation
def df_cleaner(df):
    # Fill in the missing values
    # Age: use the median of the non-null ages
    median_age = np.median(df[df['Age'].notnull()]['Age'])
    for passenger in df[df['Age'].isnull()].index:  # .index gives the rows where Age is null
        df.loc[passenger, 'Age'] = median_age
    # Fare: same idea, median of the non-null fares
    median_fare = np.median(df[df['Fare'].notnull()]['Fare'])
    for passenger in df[df['Fare'].isnull()].index:
        df.loc[passenger, 'Fare'] = median_fare

    # Convert string data to numerical data
    df.loc[df['Sex'] == 'male', 'Sex'] = 0
    df.loc[df['Sex'] == 'female', 'Sex'] = 1
    df.loc[df['Sex'].isnull(), 'Sex'] = 2
    df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
    df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
    df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2
    df.loc[df['Embarked'].isnull(), 'Embarked'] = 3
    # The columns are still object dtype after the replacement, so cast them
    df['Sex'] = df['Sex'].astype(int)
    df['Embarked'] = df['Embarked'].astype(int)

    return df

# Make a CSV for submission
def make_csv(file_path, passengerId, predicts):
    # Open in text mode with newline='' (required by the csv module in Python 3)
    with open(file_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["PassengerId", "Survived"])
        for row, survived in zip(passengerId, predicts):
            writer.writerow([row, survived])

# Check the performance of the model we made (simple accuracy)
def getScore(answer, predicts):
    sum_p = 0.0
    total = 0.0
    for (row, predict) in zip(answer, predicts):
        if row == predict:
            sum_p += 1.0
        total += 1.0
    return sum_p / total

def main():
    # Read in the training and test data
    train = pd.read_csv('./data/train.csv')
    test = pd.read_csv('./data/test.csv')
    # Drop the columns that (presumably) won't help
    train.drop(['Name', 'PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)
    test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

    # Clean up
    train = df_cleaner(train)
    test = df_cleaner(test)
    x_train = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
    y_train = train[['Survived']]
    # Build a model with a random forest
    scores = []
    for trees in range(1, 100):
        model = RandomForestClassifier(n_estimators=trees)
        model.fit(x_train, np.ravel(y_train))
        # Check the match rate (note: this is accuracy on the training data itself)
        pre = model.predict(x_train)
        scores.append(getScore(y_train['Survived'], pre))
    plt.plot(scores, '-r')
    plt.show()

    # Prepare the actual test data
    x_test = test[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
    label = test[['PassengerId']]
    # Predict with the last-fitted model (n_estimators=99)
    output = model.predict(x_test)
    # Make a CSV for submission
    make_csv("./output/random_forest.csv", label['PassengerId'], output.astype(int))

if __name__ == '__main__':
    main()
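Incidentally, the per-row loops in df_cleaner can also be written with pandas' vectorized operations. A minimal sketch using fillna and map (the function name is mine; the behavior should match the original for these columns):

# Vectorized version of the missing-value and encoding steps in df_cleaner
def df_cleaner_vectorized(df):
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    # map() leaves unmatched values as NaN, which fillna then encodes
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).fillna(2).astype(int)
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).fillna(3).astype(int)
    return df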

Source code on GitHub

Score

Code above: 0.75120
Tutorial copy-and-paste: 0.76555

Impressions

My version is worse... I don't think the algorithm itself is wrong, so it seems I need to dig into scikit-learn's random forest part a little more.
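One thing worth checking: the match rate plotted above is measured on the training data itself, so it tends to be optimistic. A minimal sketch (assuming x_train, y_train, and np prepared as in main() above) of a fairer estimate using scikit-learn's cross_val_score:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: fit on 4/5 of the data, score on the held-out 1/5
model = RandomForestClassifier(n_estimators=100)
cv_scores = cross_val_score(model, x_train, np.ravel(y_train), cv=5)
print("CV accuracy: %.5f (+/- %.5f)" % (cv_scores.mean(), cv_scores.std()))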

Another case

I created a site for visualizing and understanding algorithms. It doesn't cover the Titanic problem, but the San Francisco problem is related, so I'll list it here. Library of Algorithms: a site for understanding algorithms visually.
