[PYTHON] [Data analysis] Should I buy a Harumi Flag condominium?

0. Summary

"Harumi Flag" is a large-scale condominium development in central Tokyo, built on the site of the Olympic Village. At a 20-minute walk from the nearest station, it has few precedents in terms of location. I wondered whether the pricing was reasonable, so I ran a multiple regression analysis in Python. The conclusion is that the base price setting is reasonable compared with similar properties, so it is basically a property you are unlikely to lose money on. However, the price premium for the buildings with a good view is high. If you like the view and are comfortable with the price, it is a good property.

1. Introduction

The Olympic Games will finally be held in Tokyo this year. I am looking forward to the athletes' performances, including in badminton. Among the many Olympics-related topics, I am personally interested in Harumi Flag, on the site of the Olympic Village, because it is a big project in the city center. However, it is about a 20-minute walk from the nearest station, Kachidoki. That might be understandable in the suburbs, but how is a large-scale development in the city center priced when it is a 20-minute walk from the station? I didn't know, so I decided to analyze the data.

2. Data analysis flow

The data analysis was performed according to the following flow. (After analysis, verification was also performed.) (1) Data collection (scraping) (2) Data preprocessing (3) Data analysis

3. Data collection

The data was collected by scraping the SUUMO site. Thank you, SUUMO! I really wanted to use the prices of new condominiums, but many of those prices are still undecided and little data is available, so I used the prices of pre-owned condominiums instead. At first, I also targeted Minato Ward and Shinagawa Ward, which face Tokyo Bay. However, that pulled in condominiums in areas such as Azabu and Osaki, which have a different character from the target. So, in the end, I targeted only Koto Ward. Although the Harumi Flag buildings have fewer than 20 floors, I chose properties with more than 20 floors because I had the image of a high-rise condominium, and properties with more than 100 units in total because I wanted to target large-scale developments. The floor plans range from 2LDK to 4LDK (LDK includes K and DK). The conditional search returned 438 properties in total. More would be better, but this is the amount of data I collected. There are many helpful articles on scraping. Here is a well-known one for reference.

[I used machine learning to find a bargain rental property in the 23 wards of Tokyo](http://www.analyze-world.com/entry/2017/11/09/061023)

The site is updated quite often. At the time of writing, SUUMO's listing data was stored in elements with the class dottable dottable--cassette, so it can be extracted as follows.

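import requests
from bs4 import BeautifulSoup

# url: the SUUMO search-results page to scrape (must be defined beforehand)
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "html.parser")
# Each listing is a "dottable dottable--cassette" div inside the results container
summary = soup.find("div", {'id': 'js-bukkenList'})
cassetteitems = summary.find_all("div", {'class': 'dottable dottable--cassette'})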

4. Data preprocessing

This time, the area, the distance from the station (minutes on foot), and the age of the building were used as explanatory variables. I considered using the floor plan as an explanatory variable as well, but excluded it because it is likely to be multicollinear with area (multicollinearity: when similar explanatory variables are used, they correlate strongly with each other and the analysis does not work well). The objective variable is, of course, the price (10,000 yen).

Data preprocessing is part of any data analysis, and is said to be the plainest but most important step. To be honest, though, it is not interesting. SUUMO's listings have few missing values, so they are easy to handle, but the building age needed processing and was converted to months. Normally it may be better to standardize the data (rescale to a mean of 0 and a standard deviation of 1). However, I wanted to see directly how the price changes when, for example, a property is one minute farther from the station, so I decided not to standardize this time. There are easy-to-understand articles on data preprocessing, so please refer to them.
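The conversion of building age to months described above might look like the following minimal sketch. It assumes the scraped built-date text has the form "YYYY年M月" (an assumption about the listing format, not something confirmed here), and the function name `age_in_months` and the reference date are hypothetical.

```python
import re
from datetime import date


def age_in_months(built: str, today: date) -> int:
    """Convert a built-date string such as '2005年3月' into building age in months.

    The 'YYYY年M月' format is an assumption about the scraped text.
    """
    match = re.search(r"(\d{4})年(\d{1,2})月", built)
    if match is None:
        raise ValueError(f"unrecognized built-date format: {built!r}")
    year, month = int(match.group(1)), int(match.group(2))
    # Whole months elapsed between the built month and the reference month
    return (today.year - year) * 12 + (today.month - month)


# Hypothetical example: completed in March 2005, evaluated in June 2020
print(age_in_months("2005年3月", date(2020, 6, 1)))  # 183
```

Keeping the age in months rather than years gives the regression a finer-grained variable, at the cost of a smaller coefficient per unit.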

[Real estate data analysis example [python data preprocessing]](https://sinyblog.com/python/real_estate_analysis_002/)

[Data acquisition and analysis of real estate information using Python (5) [Property for sale / Data preprocessing]](https://akatak.hatenadiary.jp/entry/2018/09/15/090032)

5. Data analysis

After scraping and preprocessing, the data is put into a pandas DataFrame, df. This time, only the following four columns are used.

df=df.loc[:,['Age of construction','Time required(Minutes)','area','price(Ten thousand yen)']]

Checking the contents of the data with df.head() gives the following.

   Age of construction  Time required(Minutes)  area(㎡)  price(Ten thousand yen)
0                  183                       3    64.79                     4780
1                   61                       8    55.92                     5190
2                   61                       8    65.88                     5190
3                   61                       8    55.38                     5440
4                  143                       8    78.70                     5480

Some properties have the same age and required time (minutes) and the same price, even though their areas differ. I thought this was a little strange and looked into it. The unit with the smaller floor area had a larger balcony area (not included in this analysis), which seems to be why the prices are the same. For this analysis, let's not worry about such details.

If you want to estimate the price from a single variable, for example area, you use simple regression analysis. In that case, the area is the explanatory variable and the price is the objective variable.
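As a rough illustration of simple regression, the sketch below fits price against area using only the five rows shown by df.head(). Five rows are far too few for a meaningful estimate, so the fitted numbers are only a demonstration of the mechanics. The helper name `simple_regression` is hypothetical.

```python
def simple_regression(xs, ys):
    """Ordinary least squares for one explanatory variable.

    Returns (slope, intercept) of the line y = slope * x + intercept.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (the common 1/n factor cancels out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The five rows from df.head(): area (m2) and price (10,000 yen)
areas = [64.79, 55.92, 65.88, 55.38, 78.70]
prices = [4780, 5190, 5190, 5440, 5480]

slope, intercept = simple_regression(areas, prices)
print(f"price = {slope:.2f} * area + {intercept:.2f}")
```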

This time, there are three explanatory variables (age of construction, required time (minutes), area (㎡)), so multiple regression analysis is performed. The objective variable is the price. For the multiple regression analysis, we use scikit-learn (sklearn), a standard machine learning library.

import pandas as pd
from sklearn import linear_model

clf = linear_model.LinearRegression()

# Explanatory variables: every column except price(Ten thousand yen)
df2 = df.drop('price(Ten thousand yen)', axis=1)
X = df2.to_numpy()  # as_matrix() was removed in pandas 1.0

# Objective variable: price(Ten thousand yen)
Y = df['price(Ten thousand yen)'].to_numpy()

# Create a predictive model
clf.fit(X, Y)

# Partial regression coefficients
print(pd.DataFrame({"Name": df2.columns,
                    "Coefficients": clf.coef_}).sort_values(by='Coefficients'))

# Intercept
print(clf.intercept_)

The result is expressed as follows.

price(Ten thousand yen) = Time required(Minutes) * (-144.791875)
                        + Age of construction * (-11.745408)
                        + area(㎡) * 90.448675
                        + 2205.2165149154216

In short, since the price is in units of 10,000 yen, this analysis says the following. (1) Each additional minute of walking from the station lowers the price by about 1.45 million yen. (2) Each month that passes after construction lowers the price by about 120,000 yen. (3) Each additional 1 ㎡ of area raises the price by about 900,000 yen.
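To make the equation easier to reuse, here is a minimal sketch that wraps the reported coefficients in a function and reads off the effect of changing one variable at a time. The function name `predict_price` and the baseline unit (10 minutes, 120 months, 70 ㎡) are hypothetical choices for illustration.

```python
# Coefficients copied from the regression result above (units: 10,000 yen)
COEF_MINUTES = -144.791875   # per minute of walking time
COEF_AGE = -11.745408        # per month of building age
COEF_AREA = 90.448675        # per square meter of floor area
INTERCEPT = 2205.2165149154216


def predict_price(minutes: float, age_months: float, area_m2: float) -> float:
    """Estimated price in units of 10,000 yen from the fitted equation."""
    return (COEF_MINUTES * minutes
            + COEF_AGE * age_months
            + COEF_AREA * area_m2
            + INTERCEPT)


# Hypothetical baseline unit: 10 minutes on foot, 120 months old, 70 m2
base = predict_price(10, 120, 70)

# Change one variable by one unit and compare with the baseline
print(round(predict_price(11, 120, 70) - base, 2))  # -144.79 (one more minute)
print(round(predict_price(10, 121, 70) - base, 2))  # -11.75 (one more month)
print(round(predict_price(10, 120, 71) - base, 2))  # 90.45 (one more m2)
```

Because the model is linear, these differences are the same whatever baseline is chosen.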

6. Verification

Let's apply the results of the multiple regression analysis. Since these are newly built properties, the age of construction is naturally zero. The units considered are assumed to be on a middle floor (the 9th floor in an 18-story building). Prices for Harumi Flag have already been partially announced, including some units in Buildings A to F of PARK VILLAGE. The verification results are explained in the following order. (1) Properties that fit the multiple regression equation well (2) Properties that the multiple regression equation does not fit (3) Why is there a difference in the fit of the multiple regression equation?

(1) Properties that fit the multiple regression equation well

The multiple regression equation fits Buildings B and C of PARK VILLAGE relatively well. Take, for example, a 75.46㎡ unit in Building B. Building B is a 20-minute walk from the station, so the multiple regression equation gives

20(Minutes)*(-144.791875)+75.46(㎡)*90.448675+2205.2165149154216 = 6,135 (ten thousand yen), or about 61.35 million yen

Since the selling price is 61.3 million yen, it is almost the same as the result of the regression equation.

Next, a 78.56㎡ unit in Building C. Building C is a 19-minute walk from the station, so the multiple regression equation gives

19(Minutes)*(-144.791875)+78.56(㎡)*90.448675+2205.2165149154216 = 6,560 (ten thousand yen), or about 65.6 million yen

The selling price is 65.6 million yen, which matches the result of the regression equation.

If you look, there are also units that are a good deal relative to the price given by the multiple regression equation. Take, for example, an 87.43㎡ unit in Building C. The multiple regression equation gives

19(Minutes)*(-144.791875)+87.43(㎡)*90.448675+2205.2165149154216 = 7,362 (ten thousand yen), or about 73.62 million yen

Since the selling price is 64.9 million yen, about 8.7 million yen below the regression result, this unit is favorably priced.

(2) Properties that the multiple regression equation does not fit

On the other hand, the equation does not fit Buildings A and F, where prices are set higher than the result of the multiple regression equation. Take, for example, an 86.55㎡ unit in Building A. Building A is a 21-minute walk from the station, so the multiple regression equation gives

21(Minutes)*(-144.791875)+86.55(㎡)*90.448675+2205.2165149154216 = 6,993 (ten thousand yen), or about 69.93 million yen

Since the selling price is 101 million yen, it is about 31 million yen higher than the regression result.

Then there is an 81.76㎡ unit in Building F. Building F is also a 21-minute walk from the station, so the multiple regression equation gives

21(Minutes)*(-144.791875)+81.76(㎡)*90.448675+2205.2165149154216 = 6,560 (ten thousand yen), or about 65.6 million yen

Since the selling price is 72 million yen, it is 6.4 million yen higher than the regression result.

However, the gap for Building F is not as large as for Building A. I was surprised at how far the selling price of Building A deviated from the result of the multiple regression equation. In descending order of deviation, the buildings rank as follows. Building A > Building F > Building B, Building C
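The comparison above can be reproduced in one pass with the following sketch, which recomputes the estimate for each of the five units quoted in this section and sorts them by the gap between the listed price and the estimate. The variable and function names are hypothetical. The prices are the announced figures quoted above, converted to units of 10,000 yen.

```python
# Coefficients from the regression result in section 5 (units: 10,000 yen)
COEF_MINUTES = -144.791875
COEF_AREA = 90.448675
INTERCEPT = 2205.2165149154216

# (building, walking minutes, area in m2, listed price in 10,000 yen)
units = [
    ("Building B", 20, 75.46, 6130),
    ("Building C", 19, 78.56, 6560),
    ("Building C", 19, 87.43, 6490),
    ("Building A", 21, 86.55, 10100),
    ("Building F", 21, 81.76, 7200),
]


def predicted_price(minutes, area_m2):
    """Regression estimate for a new unit (age of construction = 0)."""
    return COEF_MINUTES * minutes + COEF_AREA * area_m2 + INTERCEPT


rows = []
for building, minutes, area_m2, listed in units:
    estimate = predicted_price(minutes, area_m2)
    rows.append((building, area_m2, listed, estimate, listed - estimate))

# Largest premium over the regression estimate first
rows.sort(key=lambda row: row[4], reverse=True)

for building, area_m2, listed, estimate, gap in rows:
    print(f"{building}  {area_m2:6.2f} m2  listed {listed:6d}  "
          f"estimate {estimate:8.1f}  gap {gap:+8.1f}")
```

A positive gap means the unit is listed above the regression estimate, and a negative gap means it is listed below.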

(3) Why is there a difference in the fit of the multiple regression equation?

At first, I thought that the gap between the multiple regression equation and the selling price was due to differences in building specifications. Still, I couldn't identify any specific difference. I finally noticed the cause while looking at the map of Harumi Flag: the price difference comes from the difference in the view.

The following article clearly describes the characteristics of each building. [[HARUMI FLAG] SEA and PARK VILLAGE first term price list will be released](https://wangantower.com/?p=16436)

The features of each building described in the above article can be summarized as follows.

・ I want to see the Rainbow Bridge from the front row every day → Building A
・ I want to see the Rainbow Bridge, but I can't pay as much as for Building A → Building F
・ I want to keep the total price modest and can compromise on the view to some extent → Buildings B and C

In other words, the difference is whether or not you can see the Rainbow Bridge. Since the Rainbow Bridge cannot be seen from Buildings B and C, their selling prices do not differ much from the results of the multiple regression equation. This means the base price is set at a reasonable level. In Buildings A and F, on the other hand, the selling point of Harumi Flag, the "good view," appears to be added as a price premium on top of the regression result. Incidentally, I have heard that in Kagoshima house prices differ depending on whether or not you can see Sakurajima. The cause of the price differences at Harumi Flag was the "view": whether the Rainbow Bridge can be seen or not.

7. Impressions

Few precedents ⇒ little data ⇒ no decent data analysis possible. I was worried about what I would do if the results came out messy. The amount of data was not large, but I personally think the result is reasonable.

Even so, every time I analyze data, I feel that there is a limit to how close you can get to the truth through data analysis alone. This time too, I wasn't sure at first what caused the price difference. I understand that apartments with a good view are expensive, but honestly, I'm surprised that the premium is this large. I had actually assumed that the unit price per tsubo ≒ construction cost. Well, I envy those who can buy one.

8. Lessons learned

When you have a question, consider whether it can be verified by data analysis. However, the world is not so simple that the true cause can be found by data analysis alone.

9. Finally

I would like to thank SUUMO and the authors of all the sites I referred to. If you have any opinions, please let me know.
