[PYTHON] (Definitive edition: updated from time to time) A collection of useful tutorials for data analysis hackathons by Team AI

We, Team AI, hold machine learning study sessions and data analysis hackathons every day in Shibuya. Our goal is to build a community of 1 million people, centered on Tokyo.

We hope that this data analysis movement will spread throughout Japan and around the world. We have compiled tutorials that are useful when running a data analysis hackathon. It's a lot of fun, so everyone, especially people in local communities, should definitely try running one themselves! Team AI will be glad to cooperate.

If you are new to Kaggle, and especially to Kaggle Kernels, please take a quick look at the links below.

Watch this first! A walkthrough of Kaggle Kernel features by Ishii (it boosts productivity!) => https://www.youtube.com/watch?v=HkJmnpBjiI0



There are lots of datasets here. Click on an interesting dataset with lots of likes; datasets can also be searched by keyword. https://www.kaggle.com/datasets

Detailed Kaggle commentary by full-time Kaggler Curry-chan: https://note.mu/currypurin/n/nf390914c721e

Curry-chan also posts Kaggle information on Twitter: https://twitter.com/currypurin

2018/9/6: Google announced a cross-site search engine for datasets. It's very convenient. https://toolbox.google.com/datasetsearch

What is Kaggle?

Getting started with Kaggle http://qiita.com/taka4sato/items/802c494fdebeaa7f43b7

If you want to become a data scientist, start with Kaggle


Kaggle Slack Group

Global group (3,000 people) https://kagglenoobs.herokuapp.com/

Mainly Japanese group, high level (400 people) http://kaggler-ja.herokuapp.com/

Fintech Data Hackathon

The datasets we are using

Bitcoin Price Prediction (LightWeight CSV) https://www.kaggle.com/team-ai/bitcoin-price-prediction

Uniqlo (FastRetailing) Stock Price Prediction


Foreign Exchange (FX) Prediction - USD/JPY https://www.kaggle.com/team-ai/foreign-exchange-fx-prediction-usdjpy

Foreign Exchange (FX) Prediction - EUR/USD https://www.kaggle.com/meehau/EURUSD/kernels

A fairly carefully written Kernel, but is its claimed prediction accuracy of 99.7% really true? https://www.kaggle.com/daiearth22/eurusd-15-minute-interval-price-prediction?scriptVersionId=8708587

Kaggle datasets in the finance category (competition datasets are heavy) https://www.kaggle.com/tags/finance

Credit Card Fraud: credit card fraud detection data (66 MB, so heavy) https://www.kaggle.com/mlg-ulb/creditcardfraud

StockPrice and News: correlation analysis of news and stock prices (6 MB) https://www.kaggle.com/aaron7sun/stocknews

Loan Data for risk analysis: lending risk calculation data (6 KB, light) https://www.kaggle.com/zhijinzhai/loandata

Loan Data for risk analysis (heavy data): loan risk calculation data (240 MB, very heavy) https://www.kaggle.com/wendykan/lending-club-loan-data

Good blog posts to read

A story about predicting exchange rates with Deep Learning http://qiita.com/ognek/items/1b776d504d20bd6f6d7d

I tested a paper on stock price forecasting with Twitter sentiment analysis and was able to predict rises and falls with about 70% accuracy http://qiita.com/ryo_grid/items/5a5ecc602186a3381c87

Formatting and plotting time series data with different scales and units in Python (Matplotlib) http://qiita.com/zaburo/items/00f364422ef3fe64f156
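As a rough sketch of the idea in the last link (not necessarily how that article does it), one common way to plot two series with different scales on the same time axis is a second y-axis created with Matplotlib's twinx(). The series names "price" and "volume", the random data, and the output file name are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: two daily series whose values differ by orders of magnitude.
dates = pd.date_range("2018-01-01", periods=30, freq="D")
rng = np.random.default_rng(0)
price = 110 + rng.normal(0, 0.5, size=30).cumsum()   # values around 110
volume = rng.integers(1000, 10000, size=30)          # values in the thousands

fig, ax_left = plt.subplots(figsize=(8, 4))

# Left y-axis: the first series.
ax_left.plot(dates, price, color="tab:blue")
ax_left.set_ylabel("price", color="tab:blue")

# Right y-axis: twinx() creates a second axes that shares the same x-axis.
ax_right = ax_left.twinx()
ax_right.plot(dates, volume, color="tab:orange")
ax_right.set_ylabel("volume", color="tab:orange")

fig.autofmt_xdate()            # rotate date labels so they do not overlap
fig.savefig("two_scales.png")  # hypothetical output file name
```

Each axis keeps its own scale and unit, so neither series is flattened by the other.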

2018/10/19 postscript

Financial data provider Quandl: https://www.quandl.com/

I received some useful information from a day trader.

AlphaAI, an open source project for stock price forecasting that covers everything from data preprocessing to LSTM training (claims 98% accuracy) https://github.com/VivekPa/AlphaAI

FinPy, a Finance x Python mokumoku-kai (quiet co-working study meetup) https://fin-py.connpass.com/

Quantopian mokumoku-kai https://quantopian-tokyo.connpass.com/

Stream, a zero-commission stock trading app https://smartplus-sec.com/stream/

Python day trader Doriran on Twitter https://twitter.com/patraqushe?lang=en

Day trading engineer Shinseitaro on Twitter https://twitter.com/shinseitaro

2018/9/21 FinTech postscript

MyTrade, an investor support app that can be used for free https://mytrade.jp/

Dragon King theory, which predicts economic crises using the concept of anomaly detection (similar to the Black Swan) https://www.ted.com/talks/didier_sornette_how_we_can_predict_the_next_financial_crisis/transcript?language=ja#t-6583

Dragon King theory paper https://arxiv.org/abs/0907.4290

2018/2/16 added

I tried analyzing card payment default data with Excel (statistics it feels too late to ask about) https://medium.com/team-ai-math/data-analysis-by-excel-b90fcbd7f4fe

Survey of 25 overseas FinTech investments, Jan 2018 https://medium.com/team-ai-fintech/fintech-investment-jan-35d2424f22f4

20 notable overseas FinTech services https://medium.com/team-ai-fintech/fintech-startups-20-2c21b27ea003

Medical Data Hackathon

Synchronized brainwave dataset: EEG data https://www.kaggle.com/berkeley-biosense/synchronized-brainwave-dataset

Breast Cancer Wisconsin (Diagnostic) Data Set: breast cancer data https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Hospital General Information: hospital data https://www.kaggle.com/cms/hospital-general-information

Zika Virus Epidemic: Zika fever data https://www.kaggle.com/cdc/zika-virus-epidemic

Cervical Cancer Risk Classification: cervical cancer data https://www.kaggle.com/loveall/cervical-cancer-risk-classification

Medical Appointment No Shows: analysis of patients who miss their appointments https://www.kaggle.com/joniarroba/noshowappointments

Mental Health in Tech Survey https://www.kaggle.com/osmi/mental-health-in-tech-survey

2018/6/18 Added from Medical Data Hackathon

Google's cool data visualization tool FACETS https://pair-code.github.io/facets/

RandomForest's Regressor gives a rough estimate of the importance of variables (useful!) http://scikit-learn.org/…/sklearn.ensemble.RandomForestRegr…
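A minimal sketch of that idea, assuming scikit-learn and NumPy are installed: fit a RandomForestRegressor and read its feature_importances_ attribute. The synthetic data and the column names x0, x1, x2 are made up for illustration; with real hackathon data you would pass your own feature matrix and target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): the target depends
# strongly on x0, weakly on x1, and not at all on x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

feature_names = ["x0", "x1", "x2"]  # hypothetical column names

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds impurity-based importances that sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))

# Print the variables from most to least important.
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

These importances are only a rough guide: they are computed on the training data and can be misleading when features are strongly correlated, so treat them as a starting point for exploration.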

Pandas Profiling to get an overview of the data you have obtained https://wonderwall.hatenablog.com/entry/2018/02/12/171500
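If Pandas Profiling is not installed, plain pandas already gives a rough first overview. The sketch below is a stand-in using only pandas built-ins, not the Pandas Profiling API itself; the toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; with a real dataset you would load it with pd.read_csv(...).
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "income": [320, 410, 560, 380, np.nan],
    "segment": ["a", "b", "a", "c", "b"],
})

print(df.shape)                       # number of rows and columns
print(df.dtypes)                      # type of each column
print(df.describe(include="all"))     # summary statistics for every column

missing = df.isna().sum()             # count of missing values per column
print(missing)

print(df["segment"].value_counts())   # frequency of each category
```

Checking shapes, types, missing values, and category counts first usually tells you which columns need cleaning before any modeling.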

Pharmaceutical open data DrugBank https://www.drugbank.ca/

Open protein data: Protein Data Bank https://www.rcsb.org/

Google's free GPU cloud Colaboratory is super convenient http://itsukara.hateblo.jp/entry/2018/02/05/214949

NASA/Space Data Hackathon

Exoplanet Hunting in Deep Space: planetary exploration data https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data

Solar Radiation Prediction: solar radiation data https://www.kaggle.com/dronio/SolarEnergy

Climate Change: Earth Surface Temperature Data https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data

Meteorite Landings: meteorite impact data https://www.kaggle.com/nasa/meteorite-landings

UFO Sightings: UFO sighting data https://www.kaggle.com/NUFORC/ufo-sightings

Open Exoplanet Catalogue: exoplanet data https://www.kaggle.com/mrisdal/open-exoplanet-catalogue

Kepler Exoplanet Search Results: exoplanet data, part 2 https://www.kaggle.com/nasa/kepler-exoplanet-search-results/kernels

Details of NASA's Kepler Space Telescope exoplanet exploration mission https://japanese.engadget.com/2018/03/15/9-4500/

2018/12/23 added

Tellus, Sakura Internet's platform for making use of satellite data https://www.sakura.ad.jp/information/pressreleases/2018/07/31/1968197591/

Google Earth Engine API https://developers.google.com/earth-engine/

Marketing/Retail Data Hackathon

Springleaf Marketing Response: direct mail response analysis (150 MB) https://www.kaggle.com/c/springleaf-marketing-response/kernels

Coupon Purchase Prediction: data from Recruit's Ponpare https://www.kaggle.com/c/coupon-purchase-prediction

Airbnb New User Bookings: Airbnb booking data analysis (where will a new guest book their first travel experience?) https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

Rossmann Store Sales: retail sales forecast https://www.kaggle.com/c/rossmann-store-sales/data

Home Depot Product Search Relevance: predict the relevance of search results on homedepot.com https://www.kaggle.com/c/home-depot-product-search-relevance

Acquire Valued Shoppers Challenge: predict which shoppers will become repeat buyers https://www.kaggle.com/c/acquire-valued-shoppers-challenge

Getting real about fake news https://www.kaggle.com/mrisdal/fake-news

Starbucks Locations Worldwide https://www.kaggle.com/starbucks/store-locations

Retail rocket recommendation system dataset https://www.kaggle.com/retailrocket/ecommerce-dataset

Grupo Bimbo Inventory Demand: maximize sales and minimize returns of bakery goods (3 GB of training data available) https://www.kaggle.com/c/grupo-bimbo-inventory-demand

Innerwear Data from Victoria's Secret https://www.kaggle.com/PromptCloudHQ/innerwear-data-from-victorias-secret-and-others

NLP (Natural Language Processing) Data Hackathon

Natural language processing tutorial => https://qiita.com/daisuke-team-ai/items/d2e18f07a08d9b4cb783

Summary of typical NLP approaches + Code (Kaggle Kernel) Recommended


NLP data:

Shinzo Abe Twitter Data (Prime Minister Abe's Twitter data) https://www.kaggle.com/team-ai/shinzo-abe-japanese-prime-minister-twitter-nlp/version/1

World News on Reddit: analysis of news data from the message board https://www.kaggle.com/rootuser/worldnews-on-reddit

South Park Dialogue: identify the speaker from the dialogue data of the animation scripts https://www.kaggle.com/tovarischsukhov/southparklines

Deep NLP: analysis of chatbot and resume data https://www.kaggle.com/samdeeplearning/deepnlp

Python Questions from Stack Overflow: analysis of questions about Python on the programming Q&A site https://www.kaggle.com/stackoverflow/pythonquestions

Japanese English Bilingual Corpus (Wikipedia Corpus in Japanese and English) https://www.kaggle.com/team-ai/japaneseenglish-bilingual-corpus

Japanese lemma frequency: a list of the 15,000 most common word forms in Japanese https://www.kaggle.com/rtatman/japanese-lemma-frequency

Japanese Whisky Review dataset: 1,000+ reviews of Japanese whisky (written in English) https://www.kaggle.com/koki25ando/japanese-whisky-review

(For advanced users) A competition to classify similar questions on the Q&A site Quora https://www.kaggle.com/c/quora-question-pairs

Extra: President Trump's Twitter AI. Talk to it and it will answer right away! https://twitter.com/TrumpSidekik

HR Data

Kaggle ML and Data Science Survey, 2017: a big picture view of the state of data science and machine learning https://www.kaggle.com/kaggle/kaggle-survey-2017

U.S. Incomes by Occupation and Gender: analyze the gender gap and differences in incomes across industries https://www.kaggle.com/jonavery/incomes-by-career-and-gender

Daily Happiness & Employee Turnover: is there a relationship between employee happiness and job turnover? https://www.kaggle.com/harriken/employeeturnover

IBM HR Analytics Employee Attrition & Performance: IBM turnover analysis (predict attrition of your valuable employees) https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Human Resources Analytics: why are our best and most experienced employees leaving prematurely? https://www.kaggle.com/ludobenistant/hr-analytics

2016 New Coder Survey: a survey of 15,000+ people who are new to software development https://www.kaggle.com/freecodecamp/2016-new-coder-survey-

Good articles to refer to

Get time series data from k-db.com in Python


Recommended dataset

Great information in English

If you install the Google Translate extension for Chrome, you can translate pages automatically in one go!

Quora has a lot of know-how on time series forecasting (for FinTech): https://www.google.co.jp/search?q=how+to+predict+time+series+quora&rlz=1C5CHFA_enJP747JP747&oq=how+to+predict+time+series+quora&aqs=chrome..69i57.8273j0j7&sourceid=chrome&ie=UTF-8

List of mathematical approaches

(Definitive edition: for beginners) A list of machine learning / data analysis articles to read, by Team AI


Python packages


pandas official site http://pandas.pydata.org/

An easygoing pandas cheat sheet


Remember this much and you can get by with pandas



seaborn official site https://seaborn.pydata.org/

Beautiful graph drawing with Python: seaborn makes data analysis and visualization easier, Part 1


Beautiful graph drawing with Python: seaborn makes data analysis and visualization easier, Part 2


Japanese settings for matplotlib and Seaborn axes

