[PYTHON] I tried predicting game software sales with VARISTA, following a Codexa article

This time, I would like to use VARISTA to build a model that predicts game software sales, following the Codexa article below. The article uses AWS SageMaker, but I will try the same task with VARISTA.

I tried to predict the sales of game software with XGBoost [Amazon SageMaker notebook + model training + model hosting]

However, the data needed some formatting, which meant writing a little code, so I did that part in Google Colaboratory. (For the record, I have only dabbled in Python.)

Time required

Hands-on work: about 10 minutes
Training: about 1 to 2 minutes at level 1, about 1.5 hours at level 3

Incurred costs

Free with VARISTA's Free account

Download data

Download the dataset from the following Kaggle page: Video Game Sales with Ratings

A description of the data contained in Kaggle

This dataset was scraped from Metacritic. Unfortunately, Metacritic covers only a subset of platforms, so the aggregated data has gaps. In addition, some games are missing the variables described below.

Critic_score -- Aggregate score compiled by Metacritic staff
Critic_count -- Number of critics used to calculate Critic_score
User_score -- Score given by Metacritic subscribers
User_count -- Number of users who voted for the user score
Developer -- Company that developed the game
Rating -- ESRB rating

So keep in mind that **quite a few values are missing and that this is not sales data for every game.**
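To get a feel for how much is missing before modelling, the missing values can be counted per column. The following is a minimal sketch, not part of the original workflow: the helper name `missing_summary` and the tiny sample table are made up for illustration, and the column names assume the capitalisation used in the CSV header.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    return summary.sort_values("missing", ascending=False)


# Tiny made-up sample standing in for the Kaggle CSV (values are illustrative only).
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Critic_Score": [76.0, None, None, 80.0],
    "User_Count": [322.0, None, 15.0, 40.0],
    "Rating": ["E", None, None, None],
})

summary = missing_summary(sample)
print(summary)
```

On the real data, passing the DataFrame loaded from the downloaded CSV instead of `sample` would give the same kind of table.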

Data processing

This time, I will process the data slightly, as in the article I referred to. A title with more than 1 million sales is defined as a hit, so I add a new column that is yes for titles above 1 million and no for the rest.

I don't like setting up a local environment, so I wrote this code in Google Colaboratory to process the data.

Colaboratory - Google Colab

import pandas as pd

# Path to the CSV downloaded from Kaggle
filename = './sample_data/kaggle/Video_Games_Sales_as_at_22_Dec_2016.csv'
data = pd.read_csv(filename)

# Set the target: y is 'yes' when Global_Sales exceeds 1 (1 million), otherwise 'no'
data['y'] = 'no'
data.loc[data['Global_Sales'] > 1, 'y'] = 'yes'

# View the data (show at most 20 rows)
pd.set_option('display.max_rows', 20)
data

# Save the processed data as a new CSV.
# The row index is written as well, so the file gains an unnamed first column.
data.to_csv('sample_data/kaggle/Add_y_Column_Video_Games_Sales.csv')

You can see that **y** has been added as the rightmost column. image.png

When you execute the above code, a file called "Add_y_Column_Video_Games_Sales.csv" will be generated, so download it.

Upload data to VARISTA

Click here for VARISTA image.png

Create a new project in VARISTA and upload the **Add_y_Column_Video_Games_Sales.csv** file you created. This time, select **y** as the column to predict.

Data confirmation

The outline of the data is as follows. image.png

The number of releases seems to peak in 2008-2010. image.png

PS2 and DS are the most common platforms, followed by PS3. Smartphone titles do not seem to be included. image.png

The distribution of genres is like this. image.png

EA seems to be at the top by number of published titles. I'm glad that there are many Japanese game companies. image.png

As for hits, the gate is quite narrow: 2,057 of 16,719 titles. I used to develop smartphone games, and my impression was that a million seller takes either capital or luck. On top of that, this data covers console games, so it is even harder... image.png
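As a quick check on how narrow that gate is, the hit rate can be worked out from the two counts quoted above. This is plain arithmetic on those numbers, not output from VARISTA.

```python
# Counts quoted in the text above.
hits = 2057
total = 16719

misses = total - hits
hit_rate = hits / total

print(f"hit rate: {hit_rate:.1%}")  # prints "hit rate: 12.3%"
print(f"no : yes = {misses / hits:.1f} : 1")
```

Roughly one title in eight is a hit, so the two classes are clearly imbalanced.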

See the correlation

yes (yellow): million hit reached
no (light blue): million hit not reached

Platform: NES and GB (Game Boy) have a high million-hit rate. Is it because there weren't many other options when these consoles were popular? image.png

Publisher / Developer: As a Japanese person, I naturally pay attention to Nintendo and Square Enix, and it is striking that every title they developed in this data is a million hit. As this graph suggests, Nintendo may be good at planning and developing its own games but less good at selling games developed by other companies. image.png image.png

The difference between Publisher and Developer is that Publisher is the company that sells and provides games, and Developer is the company that develops games. In some cases, Developer is also Publisher.
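The per-category hit rates read off the charts above can be approximated outside VARISTA with a pandas group-by. This is only a sketch: `hit_rate_by` is a hypothetical helper, and the six rows are invented so that the example runs on its own.

```python
import pandas as pd


def hit_rate_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Share of rows with y == 'yes' for each value of `column`, plus the row count."""
    flagged = df.assign(is_hit=df["y"].eq("yes"))
    result = flagged.groupby(column)["is_hit"].agg(["mean", "size"])
    result.columns = ["hit_rate", "titles"]
    return result.sort_values("hit_rate", ascending=False)


# Invented rows for illustration; the real file has one row per title.
sample = pd.DataFrame({
    "Platform": ["NES", "NES", "PS2", "PS2", "PS2", "DS"],
    "y": ["yes", "yes", "no", "yes", "no", "no"],
})

rates = hit_rate_by(sample, "Platform")
print(rates)
```

The `titles` column matters when reading the result: a category with only a handful of titles can show a very high or very low hit rate by chance. The same helper would work for `Publisher` or `Developer`.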

Critic_score & Critic_count image.png

image.png

User_score image.png

image.png

Training

Training was run at **level 3**. I have used these detailed parameter settings since the Titanic experiment. Level 1 training finishes in a few minutes, but with this setting it took about an hour because a large number of parameters are searched. image.png

Also, I excluded from the dataset the columns that are directly tied to the predicted column (Unnamed0, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales). Leaving the sales columns in would let the model read the answer straight from the inputs. image.png
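In VARISTA the columns are switched off in the UI. If the same exclusion were done in pandas before uploading, it might look like the sketch below. The helper name and the sample row are hypothetical, and the name `Unnamed: 0` assumes the saved index column is read back by pandas, which labels it that way.

```python
import pandas as pd

# Sales columns reveal the target; the saved row index carries no information.
LEAKY_COLUMNS = [
    "Unnamed: 0",
    "NA_Sales",
    "EU_Sales",
    "JP_Sales",
    "Other_Sales",
    "Global_Sales",
]


def drop_leaky_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without the leaky columns; names that are absent are skipped."""
    return df.drop(columns=LEAKY_COLUMNS, errors="ignore")


# One invented row, just to show which columns survive.
sample = pd.DataFrame({
    "Unnamed: 0": [0],
    "Name": ["Some Game"],
    "Platform": ["PS2"],
    "NA_Sales": [0.5],
    "Global_Sales": [1.2],
    "y": ["yes"],
})

features = drop_leaky_columns(sample)
print(list(features.columns))
```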

Check the result

When I check the score, it looks like this.

image.png

The confusion matrix is displayed like this. Of the 103 cases used as test data for validation, 82 appear to be hits and 21 misses. image.png

Also, this time I got a warning that the training data is imbalanced. The only remedy is to adjust the amount of data, for example by undersampling, and I would like to see what actually happens when I do. I will make time to try it another day.
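For reference, random undersampling could be sketched in pandas as follows. This was not run as part of this article; the function name and the ten-row sample are made up, and it simply shrinks every class to the size of the smallest one.

```python
import pandas as pd


def undersample(df: pd.DataFrame, target: str = "y", random_state: int = 0) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    smallest = df[target].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Shuffle so that the classes are not stored in consecutive blocks.
    combined = pd.concat(parts).sample(frac=1, random_state=random_state)
    return combined.reset_index(drop=True)


# Invented imbalanced sample: 8 'no' rows and 2 'yes' rows.
sample = pd.DataFrame({
    "Name": [f"game_{i}" for i in range(10)],
    "y": ["no"] * 8 + ["yes"] * 2,
})

balanced = undersample(sample)
print(balanced["y"].value_counts())
```

The trade-off is that most of the majority-class rows are thrown away, so whether it actually helps would have to be checked against the unbalanced result.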

image.png

The threshold used to judge Yes / No also seems to be adjusted automatically. In this case, a row seems to be judged Yes if its score exceeds 0.222.
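The effect of such a threshold can be illustrated in a few lines. This is a sketch of the general idea, not VARISTA's implementation: the function and the scores are invented, and only the value 0.222 comes from the result screen.

```python
def classify(probabilities, threshold=0.222):
    """Label a row 'yes' when its predicted probability exceeds the threshold."""
    return ["yes" if p > threshold else "no" for p in probabilities]


# Hypothetical predicted probabilities of being a million hit.
scores = [0.05, 0.222, 0.23, 0.60]

labels = classify(scores)
labels_at_half = classify(scores, threshold=0.5)

print(labels)          # ['no', 'no', 'yes', 'yes']
print(labels_at_half)  # ['no', 'no', 'no', 'yes']
```

A threshold below 0.5 labels more rows as Yes, which is a common way to compensate when the Yes class is rare.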

image.png

Since there is no separate test set, one would normally have to be carved out of the training data. This was only a quick trial, so I validated with the data that VARISTA splits off automatically.
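Carving out a test set by hand could look like the sketch below. The helper name, the 20% ratio, and the sample rows are assumptions for illustration; how VARISTA performs its own split is not covered here.

```python
import pandas as pd


def holdout_split(df: pd.DataFrame, test_ratio: float = 0.2, random_state: int = 0):
    """Randomly set aside a share of the rows as test data (requires a unique index)."""
    test = df.sample(frac=test_ratio, random_state=random_state)
    train = df.drop(test.index)
    return train, test


# Invented sample of ten rows.
sample = pd.DataFrame({
    "Name": [f"game_{i}" for i in range(10)],
    "y": ["no"] * 8 + ["yes"] * 2,
})

train, test = holdout_split(sample)
print(len(train), len(test))  # 8 2
```

This split is purely random. With a rare Yes class, a split that keeps the Yes / No ratio the same in both parts would be the safer choice.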

If this article made you want to try VARISTA, please use the link below. Both you and I earn $7 in credits! m(_ _)m

https://console.varista.ai/welcome/jamaica-draft-coach-cup-blend


Reference article: I tried to predict the sales of game software with XGBoost [Amazon SageMaker notebook + model training + model hosting]
