[PYTHON] I analyzed Airbnb data for those who want to stay in Amsterdam

Do you want to stay in Amsterdam?

Amsterdam, the capital of the Netherlands, is a famous tourist destination with a beautiful cityscape. The city is known for its many canals, which are characteristic of Europe, and it is so popular that overtourism has become a problem.

netherlands-happiest-coutries-2018-super-169.jpg

Airbnb

Airbnb is a well-known home-sharing service. The name comes from "AirBed & Breakfast". The service is said to have started when Brian Chesky rented out space in his loft from time to time. It is a mainstream accommodation option overseas.

Purpose

The purpose is to understand what tourists can expect when they try to book an Airbnb in Amsterdam. I analyzed Airbnb accommodation data for Amsterdam to find out what characterizes the listings and which variables affect the price.

Target data

[Inside Airbnb - Adding data to the debate] http://insideairbnb.com/get-the-data.html

Inside Airbnb is a site that publishes actual Airbnb data. The data is very well organized and provided in CSV format, so even beginners like me can analyze it easily.

Reference

https://towardsdatascience.com/exploring-machine-learning-for-airbnb-listings-in-toronto-efdbdeba2644
https://note.com/ryohei55/n/n56f723bc3f90

Time series data analysis

import pandas as pd
import matplotlib.pyplot as plt

calendar = pd.read_csv('calendar.csv')
print(calendar.date.nunique(), 'days', calendar.listing_id.nunique(), 'unique listings')

366 days 20025 unique listings

The data runs from 2019-12-08 to 2020-12-06. For some reason the count comes to 366 days, which is slightly off, but the data itself looks fine, so I will proceed. There are 20025 listings, and I am grateful for such a large amount of data.
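
One way to look into a day count like this is to compare the number of distinct dates with the inclusive span between the first and last date. The sketch below uses a small made-up frame in place of calendar.csv, and the helper name `date_coverage` is hypothetical.

```python
import pandas as pd

def date_coverage(dates: pd.Series) -> dict:
    """Summarise first date, last date, inclusive span and distinct-day count."""
    d = pd.to_datetime(dates)
    span_days = (d.max() - d.min()).days + 1  # +1 so both endpoints are counted
    return {
        "first": d.min().date().isoformat(),
        "last": d.max().date().isoformat(),
        "span_days": span_days,
        "unique_days": int(d.dt.normalize().nunique()),
    }

# Toy stand-in for calendar['date']: two listings over overlapping days
toy = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2],
    "date": ["2020-02-28", "2020-02-29", "2020-03-01", "2020-02-29", "2020-03-01"],
})
summary = date_coverage(toy["date"])
print(summary)
```

If `span_days` and `unique_days` agree, every day in the range is present and the range really is that long. If they differ, some dates are missing from the file.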

calendar.head(5)

Screenshot 2019-12-27 at 17.21.52.png

Availability graph: can I make a reservation?

I plotted, in chronological order, what share of Airbnb listings is already reserved and what share is still available.

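# Copy the two columns so that adding 'busy' does not write into a view of calendar
calendar_new = calendar[['date', 'available']].copy()
# busy = 0 when the listing is available ('t'), 1 otherwise
calendar_new['busy'] = calendar_new.available.map(lambda x: 0 if x == 't' else 1)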
calendar_new = calendar_new.groupby('date')['busy'].mean().reset_index()
calendar_new['date'] = pd.to_datetime(calendar_new['date'])

plt.figure(figsize=(10, 5))
plt.plot(calendar_new['date'], calendar_new['busy'])
plt.title('Airbnb Amsterdam Calendar')
plt.ylabel('Busy %')
plt.show()

download (12).png

**Consideration** With an occupancy rate of over 80% overall, Airbnb in Amsterdam appears to be busy all year round. It gets especially crowded around New Year, which may be due to tourists who come to see the New Year fireworks.

The occupancy rate rises after March, and a similar sudden rise is seen in June. However, these rises may not reflect actual bookings: hosts whose own schedules are not yet fixed may simply not have opened their rooms for dates that are still far away.
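
The occupancy figure discussed above is just the mean of a 0/1 flag per date. A minimal sketch on invented rows, assuming the same 't'/'f' encoding of `available` as in calendar.csv:

```python
import pandas as pd

# Invented data: four listings on each of two dates
toy = pd.DataFrame({
    "date": ["2020-01-01"] * 4 + ["2020-01-02"] * 4,
    "available": ["f", "f", "f", "t", "t", "f", "t", "t"],
})

# A listing counts as "busy" on a date when it is not available
toy["busy"] = (toy["available"] != "t").astype(int)

# Mean of the 0/1 flag per date = share of listings that are busy
busy_rate = toy.groupby("date")["busy"].mean()
print(busy_rate)
```

Note that "busy" here covers both booked nights and nights the host has blocked, which is why the rate can overstate real demand.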

Monthly price comparison

calendar['date'] = pd.to_datetime(calendar['date'])
# Strip the currency symbol and thousands separators, then convert to float
calendar['price'] = calendar['price'].str.replace(r'[$,]', '', regex=True).astype(float)

mean_of_month = calendar.groupby(calendar['date'].dt.strftime('%B'), sort=False)['price'].mean()

mean_of_month.plot(kind = 'barh', figsize=(12, 7))
plt.xlabel('Average Monthly Price')

download (13).png

**Consideration** The average price of an Airbnb in Amsterdam is around 160 euros (¥18000) per night throughout the year. If anything, January and February seem to be slightly cheaper.
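
As a small worked example of the monthly averaging, the sketch below parses price strings and groups them by calendar month. The rows are invented, and grouping by a year-month period (rather than by month name) is my own choice so that the months stay in chronological order.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2019-12-30", "2019-12-31", "2020-01-01", "2020-01-02", "2020-02-01"]
    ),
    "price": ["$150.00", "$170.00", "$140.00", "$1,160.00", "$155.00"],
})

# Remove '$' and ',' before converting the strings to numbers
toy["price"] = pd.to_numeric(toy["price"].str.replace(r"[$,]", "", regex=True))

# Group by year-month period; the result is sorted chronologically
monthly = toy.groupby(toy["date"].dt.to_period("M"))["price"].mean()
print(monthly)
```

The January value in this toy data also shows how a single expensive night can pull a monthly mean upward.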

Price by day of the week

# dt.weekday_name was removed in pandas 1.0; dt.day_name() is the replacement
calendar['dayofweek'] = calendar.date.dt.day_name()
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Average price per weekday, ordered Monday to Sunday
price_week = calendar.groupby('dayofweek')['price'].mean().reindex(cats)
price_week.plot(grid=True)

ticks = list(range(0,7,1))
labels = "Mon Tues Weds Thurs Fri Sat Sun".split()
plt.xticks(ticks, labels)

download (14).png

**Consideration** From Monday to Thursday the average stays below €170, but prices are markedly higher for Friday and Saturday nights. Demand for Airbnb is probably concentrated on weekends because people travel on Fridays and Saturdays, when schools and companies are closed.
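
The weekday comparison can be reproduced on a handful of invented rows. This is only a sketch of the grouping step: `day_name()` labels each date, and `reindex` puts the result in Monday-to-Sunday order.

```python
import pandas as pd

# Invented prices: 2020-01-03 and 2020-01-10 are Fridays, 01-04 and 01-11 Saturdays,
# 2020-01-06 is a Monday
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-03", "2020-01-04", "2020-01-06", "2020-01-10", "2020-01-11"]
    ),
    "price": [180.0, 185.0, 160.0, 190.0, 175.0],
})

toy["dayofweek"] = toy["date"].dt.day_name()
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Weekdays with no rows in the toy data come out as NaN after reindexing
by_day = toy.groupby("dayofweek")["price"].mean().reindex(order)
print(by_day)
```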

Listing data analysis

The listings file contains data about each accommodation.

import numpy as np
import seaborn as sns

listings = pd.read_csv('listings.csv')
print('We have', listings.id.nunique(), 'listings in the listing data.')
listings.head(5)

Screenshot 2019-12-27 at 17.21.52.png

It looks like this.

Top 10 neighbourhoods by number of listings

listings.groupby(by = 'neighbourhood_cleansed').count()[['id']].sort_values(by='id', ascending=False).head(10)

Screenshot 2019-12-27 at 18.07.21.png

Price distribution

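# Assumption: in the detailed listings file, price is a string such as '$1,000.00'; convert it first if so
if not pd.api.types.is_numeric_dtype(listings['price']):
    listings['price'] = pd.to_numeric(listings['price'].replace(r'[$,]', '', regex=True))
listings.loc[(listings.price <= 1000) & (listings.price > 0)].price.hist(bins=200)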
plt.ylabel('Count')
plt.xlabel('Listing price in EUR')
plt.title('Histogram of listing prices')

download (15).png

The price distribution looks like this.

Price box plot by region

# Keep listings priced above 0 and at most 1000 EUR
priced = listings.loc[(listings.price <= 1000) & (listings.price > 0)]

# Keep only neighbourhoods with at least 100 such listings
listings_neighbourhood_over_100 = priced.groupby('neighbourhood_cleansed').filter(lambda x: len(x) >= 100)

# Order neighbourhoods by median price, most expensive first
sort_price = listings_neighbourhood_over_100.groupby('neighbourhood_cleansed')['price'].median().sort_values(ascending=False).index

sns.boxplot(y='price', x='neighbourhood_cleansed', data=listings_neighbourhood_over_100,
            order=sort_price)

ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()

download (16).png

**Consideration** As Centrum-West and Centrum-Oost show, prices near the central station are quite high. The cheapest price range is found in areas such as Bijlmer, which is about 30 minutes away by tram. Broadly, the price of an Airbnb seems to be determined by its distance from the central station.

![xxamsterdam-train-stations-map.jpg.pagespeed.ic.POsCpucKFr.jpg](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/505543/2a9486e8-df14-64f2-c508-7e172c11656b.jpeg)
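
The two steps behind the box plot, dropping small groups and ordering the rest by median price, can be sketched on invented data. The threshold `MIN_LISTINGS` is a hypothetical name; the article uses 100, lowered here to fit the toy frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Centrum-West"] * 3 + ["Bijlmer-Centrum"] * 3 + ["Noord-Oost"],
    "price": [200.0, 250.0, 300.0, 60.0, 80.0, 70.0, 500.0],
})
MIN_LISTINGS = 3  # the article uses 100; lowered for the toy data

# Drop neighbourhoods with too few listings to give a meaningful box
kept = toy.groupby("neighbourhood_cleansed").filter(lambda g: len(g) >= MIN_LISTINGS)

# Order the remaining neighbourhoods by median price, most expensive first
order = (
    kept.groupby("neighbourhood_cleansed")["price"]
    .median()
    .sort_values(ascending=False)
    .index.tolist()
)
print(order)
```

The single 500 EUR listing is dropped with its neighbourhood, so one unusual listing cannot put a sparsely covered area at the top of the ranking.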

Price box plot for each type of accommodation

# Keep listings priced above 0 and at most 1000 EUR
priced = listings.loc[(listings.price <= 1000) & (listings.price > 0)]

# Keep only property types with at least 20 such listings
listings_property_over_20 = priced.groupby('property_type').filter(lambda x: len(x) >= 20)

# Order property types by median price, most expensive first
sort_price = listings_property_over_20.groupby('property_type')['price'].median().sort_values(ascending=False).index

sns.boxplot(y='price', x='property_type', data=listings_property_over_20,
            order=sort_price)

ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()

download (17).png

**Consideration** First, a note on how to read a box plot: it shows the spread of the data. The line inside the box is the median, the bottom edge of the box is the first quartile, and the top edge is the third quartile. A hostel is a cheap type of accommodation that is common in Europe. Yet even among listings classified as hostels, there are quite a few in Amsterdam that cost up to 1000 EUR. Keep in mind that listings above 1000 EUR were excluded as outliers this time.

The Hotel data also varies widely, probably because some hotels are high-end. However, since the median itself is about 180 EUR, Airbnb listings classified as Hotel seem to be basically on the cheaper side.
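
To make the box plot description concrete, the quartiles can be computed directly. The prices below are invented; the 1.5 x IQR rule for marking outliers is the default in matplotlib and seaborn box plots.

```python
import numpy as np

prices = np.array([60, 80, 100, 120, 150, 180, 250, 900], dtype=float)

# Box edges and centre line
q1, median, q3 = np.percentile(prices, [25, 50, 75])

# Points beyond q3 + 1.5 * IQR are drawn as individual outlier markers
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr
outliers = prices[prices > upper_limit]

print(q1, median, q3, upper_limit, outliers)
```

Here the 900 EUR listing hardly moves the median, which is why the article compares medians rather than means across property types.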

Price graph by room type

listings.loc[(listings.price <= 1000) & (listings.price > 0)].pivot(columns='room_type', values='price').plot.hist(stacked=True, bins=100)
plt.xlabel('Listing Price in EUR')

download (18).png

**Consideration** The first thing you notice is that there are few Shared rooms and Hotel rooms. Hosts either rent out the entire home/apartment or only a room, and most listings appear to be entire homes/apartments. If you want to keep costs down, it seems more efficient to narrow your search to private rooms. Naturally, renting an entire home/apartment is more expensive than renting only a room.

Amenity number graph

pd.Series(np.concatenate(listings['amenities'].map(lambda amns: amns.split(",")))).value_counts().head(20).plot(kind='bar')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=12)
plt.show()

download (19).png

**Consideration** Wifi is very common. Winters in the Netherlands are cold, so most listings have Heating. Many places do not list amenities such as irons, shampoo and hair dryers, so it is worth checking beforehand.

As for Family/kid friendly ... I would rather not be disturbed at an Airbnb, so let's count that as an advantage. You can also see that free parking is not near the top, so be careful if you come by car.
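
The amenity counts depend on how the `amenities` string is split. Assuming the column is stored in a form like `{TV,"Cable TV",Wifi}` (an assumption about the file format, so check your own download), splitting on commas alone leaves braces and quotes attached to some names. A minimal sketch of a cleaner split on invented data:

```python
import pandas as pd

# Invented examples in the assumed '{a,"b c",d}' format
toy = pd.Series(['{TV,"Cable TV",Wifi,Heating}', '{Wifi,Heating}', '{}'])

tokens = (
    toy.str.strip("{}")                     # remove the surrounding braces
       .str.replace('"', "", regex=False)   # remove quotes around multi-word names
       .str.split(",")                      # one list of amenity names per listing
       .explode()                           # one row per (listing, amenity)
)
tokens = tokens[tokens != ""]               # listings without amenities yield ''

counts = tokens.value_counts()
print(counts)
```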

Relationship between price and amenities

amenities = np.unique(np.concatenate(listings['amenities'].map(lambda amns: amns.split(","))))
amenity_prices = [(amn, listings[listings['amenities'].map(lambda amns: amn in amns)]['price'].mean()) for amn in amenities if amn != ""]
amenity_srs = pd.Series(data=[a[1] for a in amenity_prices], index=[a[0] for a in amenity_prices])

amenity_srs.sort_values(ascending=False)[:20].plot(kind='bar')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=12)
plt.show()

download (20).png

**Consideration** I am not sure why Washer / Dryer is the amenity most associated with price. What does feel typical of Amsterdam, though, is Suitable for events. Many events are held in Amsterdam, and rooms that are conveniently located for attending them seem to be expensive. Apart from those two, the average prices are almost uniform across amenities.

Relationship between number of beds and price


listings.loc[(listings.price <= 1000)&(listings.price > 0)].pivot(columns = 'beds', values='price').plot.hist(stacked=True, bins=100)
plt.xlabel('Listing price in EUR')

download (21).png

**Consideration** Most listings have one or two beds. Is this the result you expected? Incidentally, I wondered about the listing with 32 beds, so I looked it up. https://www.airbnb.jp/rooms/779175?source_impression_id=p3_1577402659_vntGlW7Yj5I5pX4U

It turned out to be a ferry with 32 beds. I was surprised.

Visualize the relationships between variables with a heatmap

col = ['host_listings_count', 'accommodates', 'bedrooms', 'price', 'number_of_reviews', 'review_scores_rating']
corr = listings.loc[(listings.price<=1000)&(listings.price > 0)][col].dropna().corr()
plt.figure(figsize=(6,6))
sns.set(font_scale=1)
sns.heatmap(corr, cbar=True, annot=True, square=True, fmt='.2f', xticklabels=col, yticklabels=col)
plt.show()

**Consideration** This heat map uses color to make each correlation in the listings data easy to see. In this case, most pairs of variables show almost no correlation. The exception is a strong correlation between bedrooms and accommodates. Since these are the number of bedrooms and the number of guests who can stay, the correlation is understandable. However, because hosts set the guest capacity according to the number of bedrooms, this relationship holds largely by construction and tells us little that is new.
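
A toy example of a correlation that holds by construction: in the invented frame below, `accommodates` is always twice `bedrooms`, so their correlation is 1 regardless of the prices. The numbers are made up and only illustrate the point.

```python
import pandas as pd

toy = pd.DataFrame({
    "bedrooms":     [1, 1, 2, 2, 3, 4],
    "accommodates": [2, 2, 4, 4, 6, 8],   # host rule in this toy data: two guests per bedroom
    "price":        [90, 120, 150, 140, 260, 240],
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))
```

When two columns are this closely tied, including both in a model adds little information beyond including one of them.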

Predict prices using decision trees

The following is the data preparation. Categorical data is converted into dummy variables.


from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(tokenizer=lambda x:x.split(','))
amenities = count_vectorizer.fit_transform(listings['amenities'])
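# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
df_amenities = pd.DataFrame(amenities.toarray(), columns=count_vectorizer.get_feature_names_out())
# Drop the empty-string token, if present
df_amenities = df_amenities.drop(columns='', errors='ignore')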

columns = ['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic', 'is_location_exact', 'requires_license', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification']
for c in columns:
    # Map the 't'/'f' flags to 1/0; missing values stay NaN
    listings[c] = listings[c].map({'f': 0, 't': 1})

# Remove '$' and ',' from the fee columns, convert to float, and treat missing fees as 0
for c in ['security_deposit', 'cleaning_fee']:
    listings[c] = listings[c].replace(r'[$,]', '', regex=True).astype(float).fillna(0)

# .copy() so that the fillna assignments below do not write into a view of listings
listings_new = listings[['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic','is_location_exact',
                         'requires_license', 'instant_bookable', 'require_guest_profile_picture',
                         'require_guest_phone_verification', 'security_deposit', 'cleaning_fee',
                         'host_listings_count', 'host_total_listings_count', 'minimum_nights',
                         'bathrooms', 'bedrooms', 'guests_included', 'number_of_reviews','review_scores_rating', 'price']].copy()

for col in listings_new.columns[listings_new.isnull().any()]:
    listings_new[col] = listings_new[col].fillna(listings_new[col].median())

for cat_feature in ['zipcode', 'property_type', 'room_type', 'cancellation_policy', 'neighbourhood_cleansed', 'bed_type']:
    listings_new = pd.concat([listings_new, pd.get_dummies(listings[cat_feature])], axis=1)

listings_new = pd.concat([listings_new, df_amenities], axis=1, join='inner')
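
What "made into a dummy variable" means can be seen on a three-row example. The data is invented; the `prefix` argument is my own addition, used so that dummy columns from different categorical features cannot end up with the same name.

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "bedrooms": [2, 1, 3],
})

# One 0/1 column per category value
dummies = pd.get_dummies(toy["room_type"], prefix="room_type", dtype=int)

# Numeric columns and dummy columns side by side, ready for a model
features = pd.concat([toy[["bedrooms"]], dummies], axis=1)
print(features)
```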

We will use RandomForestRegressor.


from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

y = listings_new['price']
x = listings_new.drop('price', axis=1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=123)
rf = RandomForestRegressor(n_estimators=500, random_state=123, n_jobs=-1)
rf.fit(X_train, y_train)
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
rmse_rf = (mean_squared_error(y_test, y_test_pred))**(1/2)

print('RMSE test: %.3f' % rmse_rf)
print('R^2 test: %.3f' % (r2_score(y_test, y_test_pred)))

RMSE test: 73.245
R^2 test: 0.479

The result looks like this. R^2 on the test set is 0.479, so the model explains roughly half of the variance in price, which is only moderately accurate. For now, let's look at which features the random forest judged to be important.

coefs_df = pd.DataFrame()
coefs_df['est_int'] = X_train.columns
coefs_df['coefs'] = rf.feature_importances_
coefs_df.sort_values('coefs', ascending=False).head(20)

Screenshot 2019-12-27 at 19.10.04.png

**Consideration** You can see that the number of bedrooms has a large effect on the predicted price. Also, on Airbnb a cleaning fee is charged separately from the room rate, and you can see that this fee is strongly related to the price as well. This seems to be a fairly direct relationship.
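
As a side note on the RMSE and R^2 values reported for the random forest, both can be computed by hand. The sketch below uses four invented prices and predictions, not the model's output.

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0])  # invented actual prices
y_pred = np.array([110.0, 140.0, 220.0, 230.0])  # invented predictions

residuals = y_true - y_pred

# RMSE: typical size of a prediction error, in the same unit as the price
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (model's squared error / squared error of always predicting the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(round(rmse, 3), round(r2, 3))
```

Read this way, an R^2 of 0.479 means the squared error is still about 52% of what you would get by predicting the average price for every listing.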

Lasso regression


from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(X_train, y_train)

# Regression coefficients
print(lasso.coef_)
# Intercept
print(lasso.intercept_)
# Coefficient of determination (R^2) on the test set
print(lasso.score(X_test, y_test))

[ 1.85022916e-03 1.31073590e+00 -0.00000000e+00 0.00000000e+00 5.23464952e+00 5.97640655e-01 6.42296851e-01 3.67942959e+01 8.80302532e+00 -3.96520183e-02 8.39294507e-01]
-30.055848397234712
0.27054071146797

I also tried a linear model (Lasso regression), but it did not improve the accuracy. Well, perhaps that is unavoidable when so many of the features are dummy variables, isn't it?
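
One thing that might be worth trying with Lasso, though I have not tested it on the Airbnb data, is standardising the features first. The L1 penalty acts on the size of each coefficient, so features on very different scales are penalised unevenly. The sketch below uses synthetic data, and the `alpha` value is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic features on very different scales (made-up data, not the listings)
X = rng.normal(size=(n, 5)) * np.array([1.0, 10.0, 100.0, 1.0, 1.0])
# Only the first three features influence the target
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.02 * X[:, 2] + rng.normal(scale=0.5, size=n)

X_train, X_test = X[:225], X[225:]
y_train, y_test = y[:225], y[225:]

# Scale each feature to zero mean and unit variance, then fit Lasso
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
coefs = model.named_steps["lasso"].coef_  # coefficients on the standardised scale
print(round(r2, 3), coefs.round(2))
```

On standardised features the coefficients are directly comparable, and the two irrelevant features are shrunk towards zero.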

Summary

I've only just started learning data analysis, but Inside Airbnb provides very well-organized data, which I am grateful for as a beginner. When you think "I want to analyze something like this!", suitable open data is a little hard to find, so I hope this serves as a reference.

This time I copied many parts from the references, but I'm glad that I was able to learn what each step is for and how to process the data.

I would appreciate any advice!
