[PYTHON] Kaggle Summary: Planet, Understanding the Amazon from Space

1. Introduction

This series summarizes past Kaggle competitions. This article introduces the data of Planet: Understanding the Amazon from Space and picks up the topics that were discussed most prominently in the forum. The winner's approach will be covered in a separate article.

2. Background

Screen Shot 2017-07-17 at 21.44.54.png

A large amount of forest is lost every day in the Amazon basin. This causes damage to ecosystems worldwide, habitat loss, climate change, and many other devastating effects. If deforestation and human encroachment on forests can be tracked in data, governments and local stakeholders can respond more quickly and effectively. Using satellite images with a resolution of 3-5 m, this competition aims to make it possible to detect small-scale deforestation and to tell human causes apart from natural ones. The task, therefore, is to label the land surface seen in satellite images.

The characteristic points of this time are as follows.

3. Evaluation index

The evaluation metric this time is the F2 score. The F1 score is the more common choice for scoring, but the F2 score weights recall more heavily than precision.

Screen Shot 2017-07-17 at 21.54.54.png
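To make the recall weighting concrete, here is a minimal sketch of the general F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 2. The function name fbeta and the example counts are my own illustration, not taken from the competition code, and how the competition averages the score over images is not covered here.

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta score from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0  # no correct predictions: define the score as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: same number of hits, but errors on opposite sides
high_recall = fbeta(tp=8, fp=8, fn=2)     # precision 0.5, recall 0.8
high_precision = fbeta(tp=8, fp=2, fn=8)  # precision 0.8, recall 0.5
print(high_recall, high_precision)
```

With the same number of correct predictions, the high-recall case scores higher under F2, so predicting a rare label too often costs less than missing it.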

4. Introduction of data

The data this time consists of satellite photographs of the Amazon basin (Brazil, Peru, Uruguay, Colombia, Venezuela, Guyana, Bolivia, and Ecuador). Each photo has one or more labels drawn from three groups (1. weather conditions, 2. common land surface, 3. rare land surface).

chips.jpg

Below, we will introduce each label.

4.1. Cloudy label

Clouds are the main barrier in satellite photography. There are three types of cloudy labels here.

4.1.1. Cloudy Scenes

cloudy_1.jpg

4.1.2. Partly Cloudy Scenes

pc1.jpg

4.1.3. Hazy Scenes (lightly cloudy)

haze1.jpg

4.2. General geographic labels

Here are seven common labels.

4.2.1. Rainforest

This is the most frequent label in the data.

primary.jpg

4.2.2. Water (rivers, lakes)

Labels indicating rivers, reservoirs, and oxbow lakes, which are the most important features of the Amazon basin.

river.jpg

4.2.3. Residence

A label indicating homes or buildings. It covers everything from densely populated areas to rural villages. Labeling becomes more difficult for small, isolated dwellings that appear only as single spots.

habitation1.jpg

4.2.4. Agricultural land

This label marks land cleared for commercial crops. Commercial agriculture is an important industry in the Amazon, so identifying this land is an important part of the task.

agg1.jpg

4.2.5. Road

road.jpg

4.2.6. Cultivated land

Shifting cultivation is a subset of agricultural land and is usually practiced by individuals or families in rural areas.

cultivation.jpg

4.2.7. Bare ground

Bare ground applies to all naturally occurring treeless areas, that is, areas that are not the result of human activity. Some of these areas occur naturally in the Amazon, and others are small patches of biomes similar to the Pantanal wetlands or the Cerrado savanna.

bare.jpg

4.3. Unusual geographic labels

4.3.1. Slash and burn

Slash and burn is a part of shifting cultivation. The burned patches appear dark brown or black.

slashburn1.jpg

4.3.2. Logging

A label that indicates areas where high-value tree species, such as teak and mahogany, have been selectively cut down. It appears as a winding dirt road adjacent to a bare brown patch.

logging1.jpg

4.3.3. Blooming

A natural phenomenon often seen in the Amazon, in which flowering trees bloom at the same time.

bloom.jpg

4.3.4. Conventional mining

There are large conventional mines in the Amazon.

mine1.jpg

4.3.5. Artisanal mining

This label covers small-scale mining. It is often found in areas with gold deposits, such as the foothills of the Andes. Artisanal mining often involves illegal activity and causes soil erosion in the surrounding area.

artmine1.jpg

4.3.6. Blow Down

A natural phenomenon, also known as windthrow. Dry, cold air from the Andes descends suddenly and locally as strong, fast winds (over 100 MPH), knocking down large rainforest trees.

blowdown.jpg

5. Exploratory data analysis (EDA)

I will introduce two notebooks. Section 5.1 is anokas's EDA, and section 5.2 is Philipp Schmidt's EDA.

5.1. Creating a label histogram

Click here for the original notebook.

Image analysis using satellite photographs has appeared in many competitions for some time, and it seems to be a recent trend. What makes this competition a little unusual is that it is a multi-label classification problem: every image receives several labels. Therefore, we first briefly analyze how the labels are distributed overall.

https://gist.github.com/TomHortons/d766738d4ce4bd564a96bbdd5529bfaa

Satellite photo names and labels are provided in a separate file from the images. Load them and create a histogram with plotly.
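The counting step behind such a histogram can be sketched as follows. This is only an illustration: the tag strings are made-up rows in the same space-separated format as the tags column, and the plotting itself is left out.

```python
from collections import Counter

# Hypothetical rows in the same format as the "tags" column of train_v2.csv
tags_column = [
    "haze primary",
    "agriculture clear primary water",
    "clear primary",
    "cloudy",
]

# Split every tag string and count how often each label appears
label_counts = Counter(tag for row in tags_column for tag in row.split(" "))

# Most frequent labels first: this is the data a bar chart would show
for label, count in label_counts.most_common():
    print(label, count)
```

With the real data, the list above would be replaced by the tags column of the loaded label file.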

スクリーンショット 2017-07-18 16.10.04 1.png

As noted in the data introduction, the rainforest label (primary) and the weather condition labels (cloudy, clear) are the most common. Since the evaluation metric this time is the F2 score, recall is emphasized. The challenge is how well the rare geographic labels can be detected without being drowned out by the common ones.

https://gist.github.com/TomHortons/68023b27e469bdb25e0249477ba9f123

Next, check how often labels appear together. For example, images of primary rainforest should also carry labels such as mining, shifting cultivation, or blow down. In this way, "co-occurrence" can be information in itself.
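A co-occurrence count can be sketched with the standard library alone. The tag strings below are invented for illustration; the linked gist builds its own matrix and plot.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag strings in the competition's space-separated format
tags_column = [
    "agriculture clear primary road",
    "agriculture clear cultivation primary",
    "clear primary",
    "clear habitation primary road",
]

# Count every unordered pair of labels that appears on the same image
pair_counts = Counter()
for row in tags_column:
    labels = sorted(set(row.split(" ")))  # sort so each pair has one fixed order
    pair_counts.update(combinations(labels, 2))

print(pair_counts[("clear", "primary")])     # pairs are stored in alphabetical order
print(pair_counts[("agriculture", "road")])
```

Sorting the labels first means a pair is always looked up in alphabetical order, so ("primary", "clear") would not be found but ("clear", "primary") is.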

スクリーンショット 2017-07-18 16.21.48.png

You can see that certain combinations tend to appear together, such as "agriculture with cultivation and road" and "road with habitation and conventional mining".

5.2. Image clustering and visualization of vegetation distribution

This is the source notebook. I have split it into parts and uploaded each part to gist, so if a gist does not work, copy the code from the original notebook.

5.2.1. Checking labels and images

from glob import glob
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from subprocess import check_output
import matplotlib.pyplot as plt
from scipy.stats import bernoulli
import seaborn as sns
print(check_output(["ls", "../input"]).decode("utf8"))
%matplotlib inline


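# Label table: one row per image with a space-separated tag string
df = pd.read_csv('../input/train_v2.csv')
# Use only the first 1000 JPG images
image_paths = sorted(glob('../input/train-jpg/*.jpg'))[0:1000]
# Strip the directory and the ".jpg" extension to get the image names
image_names = list(map(lambda row: row.split("/")[-1][:-4], image_paths))
image_names[0:10]

# Show the first six images with their tags as titles
plt.figure(figsize=(12,8))
for i in range(6):
    plt.subplot(2,3,i+1)
    plt.imshow(plt.imread(image_paths[i]))
    plt.title(str(df[df.image_name == image_names[i]].tags.values))

スクリーンショット 2017-07-18 18.26.11.png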

The images all appear to be the same size. You can also see that each image has multiple tags.

5.2.2. Try to create a simple classifier

https://gist.github.com/TomHortons/d7e9dc9382f6fcd4763163e1a7db9f99

The above code is sample code that classifies the images with scikit-learn's LogisticRegression. The data preprocessing part, shown below, is written in a somewhat dense way. df is the label table read in the previous sample.

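import cv2
from sklearn.preprocessing import MinMaxScaler, MultiLabelBinarizer

# n_samples and rescaled_dim must be defined beforehand (not shown in this excerpt)
df['split_tags'] = df['tags'].map(lambda row: row.split(" "))
lb = MultiLabelBinarizer()
y = lb.fit_transform(df['split_tags'])[:n_samples]  # one 0/1 column per tag
# Read each image, shrink it, and flatten it into one row of pixel values
X = np.array([cv2.resize(plt.imread('../input/train-jpg/{}.jpg'.format(name)),
                         (rescaled_dim, rescaled_dim),
                         interpolation=cv2.INTER_LINEAR).reshape(-1)
              for name in df.head(n_samples)['image_name'].values])
X = MinMaxScaler().fit_transform(X)  # scale every feature to [0, 1]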

After splitting the training data tags with df['tags'].map(lambda row: row.split(" ")), vectorize them with MultiLabelBinarizer. Resize each image with cv2.resize, flatten it into a single feature vector per image, and finally rescale the data to the range 0 to 1 with MinMaxScaler.
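To show what the label vectorization produces, here is a small sketch on three made-up tag strings. Only the MultiLabelBinarizer step is shown; the tag values are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag strings in the competition's space-separated format
tags = ["haze primary", "clear primary water", "cloudy"]
split_tags = [row.split(" ") for row in tags]

lb = MultiLabelBinarizer()
y = lb.fit_transform(split_tags)  # one row per image, one 0/1 column per label

print(list(lb.classes_))  # column order: labels sorted alphabetically
print(y)
```

Each column of y corresponds to one entry of lb.classes_, which is what allows a per-tag score to be reported later.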

The result of the execution is as follows.

Average F2 test score 0.6798866894323994
F2 test scores per tag:
[('primary', 0.96369809349699664),
 ('clear', 0.87778940027894004),
 ('cloudy', 0.60324825986078878),
 ('agriculture', 0.38493549729504789),
 ('road', 0.26332094175960347),
 ('partly_cloudy', 0.22288261515601782),
 ('water', 0.1915041782729805),
 ('habitation', 0.17467248908296948),
 ('cultivation', 0.054811205846528627),
 ('bare_ground', 0.032894736842105261),
 ('haze', 0.0090090090090090089),
 ('slash_burn', 0.0),
 ('conventional_mine', 0.0),
 ('selective_logging', 0.0),
 ('blow_down', 0.0),
 ('blooming', 0.0),
 ('artisinal_mine', 0.0)]

The rare geographic labels are almost entirely missed. The average F2 score is about 0.68, but it was measured on a hold-out split of the training data, so the score on the actual test data will probably be lower.

5.2.3. Clustering image data

Here the image data is clustered as it is, without feature extraction.

https://gist.github.com/TomHortons/7786a01020de08ee4d14cfaee0ebe142

We look at hierarchical clustering, t-SNE, and outliers.

As a procedure,

  1. Reshape the image to np.array
  2. Calculate the distance between images with pairwise distance
  3. Hierarchical clustering with seaborn cluster map
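The three steps above can be sketched on synthetic data as follows. This is not the gist's code: random arrays stand in for the images, and scipy's linkage and fcluster are used instead of the seaborn cluster map so that the result can be checked without plotting.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Stand-in for real images: three dark and three bright 4x4 RGB arrays
dark = rng.uniform(0.0, 0.2, size=(3, 4, 4, 3))
bright = rng.uniform(0.8, 1.0, size=(3, 4, 4, 3))
imgs = np.concatenate([dark, bright])

# 1. Flatten every image into one row vector
flat = imgs.reshape(len(imgs), -1)
# 2. Pairwise squared Euclidean distances as a square matrix
sq_dists = squareform(pdist(flat, metric="sqeuclidean"))
# 3. Hierarchical clustering, then cut the tree into two clusters
tree = linkage(flat, method="average")
cluster_ids = fcluster(tree, t=2, criterion="maxclust")
print(cluster_ids)
```

On real images the same sq_dists matrix can be passed to a heatmap-style cluster plot.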

Here is the clustering result.

ダウンロード (1).png

The other approach is clustering with t-SNE. Embed the image distance data created earlier into 3D with t-SNE and plot it.
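A minimal sketch of this embedding step with scikit-learn's TSNE is shown below. The random data, the perplexity value, and the variable names are assumptions for illustration; the gist may use different settings.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
flat = rng.random((40, 48))  # stand-in for 40 flattened images
sq_dists = squareform(pdist(flat, metric="sqeuclidean"))

# metric="precomputed" tells t-SNE that the input is already a distance matrix;
# a precomputed metric requires init="random"
tsne = TSNE(n_components=3, metric="precomputed", init="random",
            perplexity=10, random_state=0)
embedding = tsne.fit_transform(sq_dists)
print(embedding.shape)  # one 3D point per image
```

Each row of embedding can then be drawn as a point in a 3D scatter plot.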

https://gist.github.com/TomHortons/da0094c11bf77108c41529cf515f78df

When executed, it looks like this.

[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 600 / 600
[t-SNE] Mean sigma: 2.026685
[t-SNE] Iteration 25: error = 0.9391028, gradient norm = 0.0111124
[t-SNE] Iteration 50: error = 0.8462010, gradient norm = 0.0094869
[t-SNE] Iteration 75: error = 0.6071504, gradient norm = 0.0019813
[t-SNE] Iteration 100: error = 0.5155882, gradient norm = 0.0052735
[t-SNE] KL divergence after 100 iterations with early exaggeration: 0.515588
[t-SNE] Iteration 125: error = 0.4028054, gradient norm = 0.0006756
[t-SNE] Iteration 125: gradient norm 0.000676. Finished.
[t-SNE] Error after 125 iterations: 0.515588
スクリーンショット 2017-07-19 13.12.26.png

We will search for outliers using the distance information created earlier.

https://gist.github.com/TomHortons/3aec684789d64cc0631ad5834fa67222

maximally_dissimilar_image_idx = np.nanargmax(np.nanmean(sq_dists, axis=1))

Simply calculate each image's average distance to the other images from the sq_dists created earlier and pick the maximum (or minimum) value. The left side of the figure below is the image with the maximum average distance, and the right side is the one with the minimum.

ダウンロード (1).png

The minimum here means the smallest average pairwise distance to all other images, that is, the most typical image. Conversely, the image with the maximum average distance is a rare one. However, since the distance is computed directly on raw pixel values, this is different from "rareness" in the sense of unusual buildings or strangely shaped roads.
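The selection of the most dissimilar and the most typical item can be checked on a toy example. The five 2D points below are made up and stand in for flattened images; the diagonal is set to NaN so that an item's zero distance to itself does not enter the average.

```python
import numpy as np

# Four points close together and one far away (a stand-in for image vectors)
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)

# Pairwise squared Euclidean distances via broadcasting
diff = points[:, None, :] - points[None, :, :]
sq_dists = (diff ** 2).sum(axis=-1)
np.fill_diagonal(sq_dists, np.nan)  # ignore self-distances in the averages

mean_dists = np.nanmean(sq_dists, axis=1)
maximally_dissimilar_idx = int(np.nanargmax(mean_dists))  # the outlier
maximally_similar_idx = int(np.nanargmin(mean_dists))     # the most typical point
print(maximally_dissimilar_idx, maximally_similar_idx)
```

The far-away point has the largest average distance, and the point nearest to it among the cluster has the smallest.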

Finally, the result of mapping with t-SNE is plotted using the images themselves.

https://gist.github.com/TomHortons/8afcf86ee53e073ad21b7d3d0eb4a9fa

ダウンロード (1).png

5.2.4. NDVI (Normalized Difference Vegetation Index)

Vegetation activity is often quantified from satellite images and from aerial photographs taken with drones. NDVI is a commonly used index for this. It is computed from the near-infrared (NIR) and red bands as NDVI = (NIR - Red) / (NIR + Red).

スクリーンショット 2017-07-19 14.46.06.png

NDVI is calculated using this data.

https://gist.github.com/TomHortons/8d3de37dee1dc7962abe1380f0ff0395

The NDVI calculation itself is very simple and fits in one line.

ndvis = [(img[:,:,3] - img[:,:,0])/((img[:,:,3] + img[:,:,0])) for img in imgs]
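A slightly more defensive version of the same calculation is sketched below. It assumes, like the one-liner, that channel 3 is near-infrared and channel 0 is the red band; check the band order of your own TIFF files, since this is an assumption and not something confirmed here. The function name ndvi and the test pixels are hypothetical.

```python
import numpy as np


def ndvi(img, red_band=0, nir_band=3):
    """NDVI = (NIR - Red) / (NIR + Red), with 0 where both bands are 0."""
    # Cast to float first: unsigned integer subtraction would wrap around
    red = img[:, :, red_band].astype(np.float64)
    nir = img[:, :, nir_band].astype(np.float64)
    denom = nir + red
    out = np.zeros_like(denom)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out


# Hypothetical 2x2 image with 4 channels stored as uint16
img = np.zeros((2, 2, 4), dtype=np.uint16)
img[0, 0] = [100, 0, 0, 300]  # NIR > red
img[0, 1] = [300, 0, 0, 100]  # NIR < red
img[1, 0] = [200, 0, 0, 200]  # NIR == red
result = ndvi(img)
print(result)
```

The float cast and the guarded division avoid wrap-around and division-by-zero NaNs, which the later np.nan_to_num call would otherwise have to absorb.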

After that, placing the NDVI images, the normalized tiff images, and the jpg images side by side gives a good feel for what NDVI captures.

ダウンロード (2).png

In the NDVI image, only the vegetated (green) areas light up.

Furthermore, if you calculate the mean NDVI for all the images and plot the values as a histogram, you get a rough picture of the distribution of vegetation activity.

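import seaborn as sns
# Mean NDVI per image; NaN values (from division by zero) become 0
mndvis = np.nan_to_num([ndvi.mean() for ndvi in ndvis])
plt.figure(figsize=(12,8))
# histplot is used because distplot is deprecated in recent seaborn versions
sns.histplot(mndvis, kde=True)
plt.title('distribution of mean NDVIs')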

ダウンロード (3).png

Finally, let's rank by NDVI.

sorted_idcs = np.argsort(mndvis)
print(len(sorted_idcs))
plt.figure(figsize=(12,8))
plt.subplot(221)
plt.imshow(imgs[sorted_idcs[0]])
plt.subplot(222)
plt.imshow(imgs[sorted_idcs[50]])
plt.subplot(223)
plt.imshow(imgs[sorted_idcs[-30]])
plt.subplot(224)
plt.imshow(imgs[sorted_idcs[-11]])

ダウンロード (4).png

The whitish image in the upper left is a cloudy image with a negative average NDVI. The normalized, reddish-looking image in the lower right is a forest image with one of the highest average NDVI values.
