[PYTHON] I tried cluster analysis of the weather map

Introduction

Have you ever heard the terms "west-high, east-low" or "winter-type pressure distribution"? The pressure patterns near Japan fall into several types, and the winter-type pattern called "west high, east low" is probably the most famous (see the figure below). There are several other types as well, such as the summer-type pattern in which the Pacific High covers Japan. In this article, I try to classify these patterns with unsupervised learning. img.png (from weathernews)

I did the following three things this time:

- Scraping satellite images
- The elbow method
- Cluster analysis (unsupervised learning)

Acquisition of satellite imagery

The satellite images were acquired from the website of the Ebayama Museum of Meteorology. Downloading raw data would have been heavy, and this site offers nicely processed images that are easy to scrape. Strictly speaking, it would probably be more appropriate to purchase the data from the Meteorological Business Support Center, so this was my own judgment call. The scraping code is not shown here, but it is available on GitHub.

I used images of the area around Japan at 12:00 JST, which look like this (854 × 480 px). 20190721.jpg
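Since the scraping code itself is only on GitHub, here is a rough sketch of what downloading one image per day could look like. The base URL and path scheme below are hypothetical placeholders, not the museum site's real endpoint.

```python
# All URLs and the path scheme here are hypothetical placeholders,
# not the real endpoint of the museum's site.
import urllib.request
from datetime import date, timedelta

BASE_URL = "https://example.com/satellite"  # placeholder

def date_range(start, end):
    """Yield every date from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def image_url(d):
    """Build the (assumed) URL of the 12:00 JST image for a given day."""
    return f"{BASE_URL}/{d:%Y%m%d}.jpg"

def download_all(start, end, out_dir="pictures"):
    """Download one image per day into out_dir, named YYYYMMDD.jpg."""
    for d in date_range(start, end):
        urllib.request.urlretrieve(image_url(d), f"{out_dir}/{d:%Y%m%d}.jpg")
```

The `YYYYMMDD.jpg` naming matters later, because the analysis code parses the date back out of each filename.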

Preprocessing


import glob
from os import makedirs

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from tqdm import tqdm

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Each 854x480 image is grayscaled, halved to 427x240, then flattened to one row.
x = np.empty((0, 240 * 427))

paths = glob.glob("pictures/*.jpg")  # note: no trailing space in the pattern
for path in tqdm(paths):
    img = Image.open(path)
    img = img.convert('L')  # grayscale
    img = img.resize((img.width // 2, img.height // 2))
    x = np.append(x, np.array(img).reshape(1, -1), axis=0)

Each image is loaded, reshaped into a single row, and appended to a NumPy array. Processing took too long at full resolution, so I converted the images to grayscale and halved their size.

Elbow method


# Elbow method: find a reasonable number of clusters
distortions = []
for k in tqdm(range(1, 20)):
    kmeans = KMeans(n_clusters=k, n_init=10, max_iter=100)
    kmeans.fit(x)
    distortions.append(kmeans.inertia_)  # sum of squared distances to centroids

fig = plt.figure(figsize=(12, 8))
plt.xticks(range(1, 20))
plt.plot(range(1, 20), distortions)
plt.savefig("elbow.jpg")  # note: no trailing space in the filename
plt.close()

I estimated a reasonable number of clusters with the elbow method. Running it up to k = 20 took about 10 minutes, so stopping around 10 would probably have been enough. The result is shown below. elbow.jpg There was no clear elbow, but I decided to go with 4 clusters this time.
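The preprocessing block imports `silhouette_score` but never uses it; when the elbow is ambiguous like this, the silhouette score is a complementary way to pick k. Below is a minimal sketch on synthetic data standing in for the image matrix `x` (two well-separated blobs, so the expected answer is k = 2).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the flattened image matrix x: two well-separated blobs.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 0.3, (50, 4)),
               rng.normal(5, 0.3, (50, 4))])

# Silhouette score is defined for k >= 2; higher is better (max 1.0).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    scores[k] = silhouette_score(x, labels)

best_k = max(scores, key=scores.get)
```

Unlike inertia, which always decreases as k grows, the silhouette score peaks at a specific k, so no visual "elbow" judgment is needed.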

Cluster analysis


k_means = KMeans(n_clusters=4).fit(x)
y_pred = k_means.predict(x)
print(k_means.labels_)
print(pd.Series(k_means.labels_, name='cluster_number').value_counts(sort=False))

out = pd.DataFrame()
out["picture"] = paths
out["classnumber"] = y_pred
# Extract "20190721" from a path like "pictures\20190721.jpg" (Windows separator)
out["date"] = pd.to_datetime(out["picture"].str.split("\\", expand=True).iloc[:, 1]
                             .str.split(".", expand=True).iloc[:, 0])
out.to_csv("out.csv")

The clusters contained 139, 61, 68, and 98 elements respectively. The sizes look reasonably balanced, which is promising.


# Save the images into one directory per cluster
for i in range(4):
    makedirs(str(i) + "_pictures", exist_ok=True)
for row in out.itertuples():
    img = Image.open(row.picture)
    img.save(str(row.classnumber) + "_" + row.picture)

# Plot the monthly distribution of each cluster
out["month"] = out["date"].dt.month
for i in range(4):
    sns.countplot(x="month", data=out[out["classnumber"] == i])
    plt.title(i)
    plt.savefig("Monthly distribution" + str(i))
    plt.close()

Let's save the images of each class separately and look at each cluster's monthly distribution along with some concrete examples.
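Another quick way to interpret the clusters is to render each centroid back into an image, since every centroid has the same shape as a flattened picture. This is a sketch using random data as a stand-in for `k_means.cluster_centers_`, assuming the halved 427 × 240 size used in preprocessing.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Random stand-in for k_means.cluster_centers_: each row is a
# flattened 240x427 grayscale image (the halved size used above).
h, w = 240, 427
centers = np.random.rand(4, h * w) * 255

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for i, ax in enumerate(axes):
    ax.imshow(centers[i].reshape(h, w), cmap="gray", vmin=0, vmax=255)
    ax.set_title(f"cluster {i}")
    ax.axis("off")
fig.savefig("cluster_centers.jpg")
plt.close(fig)
```

With real centroids, this produces one "average weather map" per cluster, which makes the cloud patterns behind each cluster visible at a glance.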

Cluster No.0

月分布0.png There seem to be many in winter and few in summer. Is this the winter-type pressure distribution? It is curious that it appears even in summer, albeit rarely.

The images that belong to this cluster are as follows, for example.

2020/1/13 2020/1/19
20200113.jpg 20200119.jpg

These show a typical west-high, east-low pressure distribution, with a cold northwest wind producing clouds over the Japanese archipelago.

In addition, the figures belonging to this cluster that are not in winter are as shown in the figure below.

2020/6/26 2019/10/26
20200626.jpg 20191026.jpg

The overall impression is clouds over the continent and over Japan, with clear skies over the Pacific. Although the cloud types differ, the placement of the clouds does look genuinely similar.

Cluster No.1

月分布1.png It peaks in April and November. From this graph alone, I couldn't figure out what these images have in common.

The images that belong to this cluster are as follows, for example.

2019/11/2 2020/4/29
20191102.jpg 20200429.jpg

There was no clear pressure-distribution feature. As for the images themselves, the area around Japan was mostly clear, and many had diagonal bands of cloud to the southeast of Japan. Some of these clouds form along the edge of the Pacific High depending on the season, but similar-looking clouds may simply have been grouped together by chance. If anything, this cluster felt like the leftovers of the other clusters.

Cluster No.2

月分布2.png This one is common during the rainy season. Is it the pressure distribution when the Baiu (rainy season) front is present? Notably, no images at all appeared in February, August, or September.

2020/6/28 2020/7/4
20200628.jpg 20200704.jpg

As expected, many images in this cluster show the Baiu front. The rainy season doesn't fit into the four categories of spring, summer, autumn, and winter, but I think this shows that its meteorological characteristics are distinct.

The images of other seasons belonging to this cluster were as follows.

2019/10/20 2020/3/19
20191020.jpg 20200319.jpg

Clouds shaped much like a front spread over the Japanese archipelago, so it is understandable that these were classified into this cluster.

Cluster No.3

月分布3.png This cluster is overwhelmingly summer. It seems to represent the summer-type pressure distribution, with the Pacific High extending over Japan.

Looking at the images actually classified into this cluster, they were full of summer-type character, as shown below.

2019/7/29 2019/8/21
20190729.jpg 20190821.jpg

Many of the images from other seasons also showed widespread clear skies.

2019/10/14 2019/11/1
20191014.jpg 20191101.jpg

Summary

From the above results, cluster analysis was able to capture the general tendencies of pressure distribution, and the outliers could be interpreted. However, satellite images do not directly represent atmospheric pressure, so the clusters are really divided by the distribution of clouds; images with similar-looking clouds can therefore be misclassified. To truly classify pressure distributions, there is room to consider features that capture not only clouds but also atmospheric pressure itself. The code for this article is on GitHub.

References

- weathernews, "What is the 'west-high, east-low' pressure distribution you often hear in the weather forecast?" (https://weathernews.jp/s/topics/201610/140215/)
