Nice to meet you. My name is S.I., and I am a third-year university student in the Department of Computer Science. My Python experience is limited to a little use in university lab experiments.

The data science division of Cacco Inc., where I am interning, assigns interns a task during the trial period: build a crawler, then collect, process, and visualize data. This article briefly describes what I learned.

Task

Theme

A college friend of mine is about to start living alone. However, real estate websites list too many properties to choose from. Solve this problem through data analysis.

Constraint

Within 60 minutes of commuting time from JR Kanamachi Station

Background

Drawing on my own experience of searching for a property, I thought that the information my friend wants to know is the set of search conditions to give the real estate agent who brokers the deal, so I decided to find those conditions through data analysis.

Policy

Crawl the property site "Sumaity" and save each page as an HTML file
Scrape each variable from the saved HTML files
Analyze the data
Present the conditions for finding a property

Crawling

This time, I crawl the results of Sumaity's "Commuting / School Time Search" for properties within 60 minutes of Kanamachi Station.

Save the total number of properties posted on the site in a text file
Specify the URL of the first page and save it as an HTML file
Get the URL of the next page from the pagination and move to it
Save that page as an HTML file
Repeat the previous two steps until there are no more pages

The crawling code looks like this:

`crawling.py`


import requests
from bs4 import BeautifulSoup
import time
import os
import datetime

def crawling():
    #Path of directory for saving html files
    dirname = './html_files'
    if not os.path.exists(dirname):
        #Create directory if it does not exist
        os.mkdir(dirname)

    #Fetch the first page
    url = "https://sumaity.com/chintai/commute_list/list.php?search_type=c&text_from_stname%5B%5D=%E9%87%91%E7%94%BA&cost_time%5B%5D=60&price_low=&price_high="
    response = requests.get(url)
    time.sleep(1)
    #Save to file
    page_count = 1    #Page count
    with open('./html_files/page{}.html'.format(page_count), 'w', encoding='utf-8') as file:
        file.write(response.text)

    #Get the total number of properties (expected value), used later as an acceptance condition
    soup = BeautifulSoup(response.content, "lxml")
    num_bukken = int(soup.find(class_='searchResultHit').contents[1].text.replace(',', ''))
    print("Total number of properties within 60 minutes of commuting time:", num_bukken)
    #Save the total number of properties in a text file as it will be used to check the acceptance conditions when scraping.
    path = './data.txt'
    with open(path, mode='w') as f:
        f.write("{}\n".format(num_bukken))

    #Crawling on the second and subsequent pages, continue until the next page runs out
    while True:
        page_count += 1

        #Find the next url
        next_url = soup.find("li", class_="next")

        #Break and finish when the next page runs out
        if next_url is None:
            print("Total number of pages:", page_count-1)
            with open(path, mode='a') as f:
                f.write("{}\n".format(page_count-1))
            break

        #Get the next page url and save it as an html file
        url = next_url.a.get('href')
        response = requests.get(url)
        time.sleep(1)
        with open('./html_files/page{}.html'.format(page_count), 'w', encoding='utf-8') as file:
            file.write(response.text)

        #Prepare for analysis to get the url of the next page
        soup = BeautifulSoup(response.content, "lxml")

        #Crawling progress output
        if page_count % 10 == 0:
            print(page_count, 'pages fetched')

#Main function
if __name__ == "__main__":
    date_now = datetime.datetime.now()
    print("Start crawling:", date_now)
    crawling()
    date_now = datetime.datetime.now()
    print("Finished crawling:", date_now)
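
One detail the crawler takes on trust is that the `href` in the pagination is an absolute URL. As a minimal sketch using only the standard library, the next-page link can be resolved against the current page URL so that relative links also work. The `NextLinkParser` class, the `find_next_url` function, and the sample markup are hypothetical names and data for illustration, not part of crawling.py.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collects the href of the first <a> found inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self._in_next_li = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self._in_next_li = True
        elif tag == "a" and self._in_next_li and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_next_li = False


def find_next_url(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.href is None:
        return None
    # urljoin leaves absolute hrefs untouched and resolves relative ones
    return urljoin(page_url, parser.href)


# Hypothetical markup, modeled on the li.next element the crawler looks for
sample = (
    '<ul><li class="prev"><a href="?page=1">prev</a></li>'
    '<li class="next"><a href="/chintai/list.php?page=3">next</a></li></ul>'
)
print(find_next_url("https://example.com/chintai/list.php?page=2", sample))
print(find_next_url("https://example.com/chintai/list.php?page=9", "<ul></ul>"))
```

The second call returns `None`, which corresponds to the condition that ends the `while` loop in the crawler.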

Scraping

The variables scraped this time are the columns listed in the CSV header in the code below. The procedure is as follows.

Scrape each variable for each property
When all the variables are available, add them as a record to the CSV file
Check consistency: compare the total number of properties obtained while crawling with the number of records

The scraping code looks like this:

`scraping.py`


from bs4 import BeautifulSoup
import datetime
import csv
import re

#Regular expression for dividing an address into a prefecture and a city
#(the pattern is matched against the Japanese address text on the page, so its literals must stay in Japanese)
pat = '(...??[都道府県])((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村|十日町|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松|岩国|田川|大村|宮古|富良野|別府|佐伯|黒部|小諸|塩尻|玉野|周南)市|(?:余市|高市|[^市]{2,3}?)郡(?:玉村|大町|.{1,5}?)[町村]|(?:.{1,4}市)?[^町]{1,4}?区|.{1,7}?[市町村])(.+)'

def scraping(total_page, room_num):
    #Initialization of the number of properties
    room_count = 0

    #Preparation of csv file (add header)
    with open('room_data.csv', 'w', newline='', encoding='CP932') as file:
        header = ['No', 'building_name', 'category', 'prefecture', 'city', 'station_num', 'station', 'method', 'time', 'age', 'total_stairs', 'stairs', 'layout', 'room_num', 'space', 'south', 'corner', 'rent', 'unit_price', 'url']
        writer = csv.DictWriter(file, fieldnames=header)
        writer.writeheader()


    for page_num in range(total_page):
        #Scraping progress output
        if page_num % 10 == 0:
            print(page_num , '/', total_page)

        #Open the html file to be scraped with Beautiful Soup
        with open('./html_files/page{}.html'.format(page_num + 1), 'r', encoding='utf-8') as file:
            page = file.read()
        soup = BeautifulSoup(page, "lxml")

        #Get information for each building
        building_list = soup.find_all("div", class_="building")
        for building in building_list:
            #Building category: Condominium or apartment or detached house
            buildingCategory = building.find(class_="buildingCategory").getText()

            #Building name
            buildingName = building.find(class_="buildingName").h3.getText().replace("{}".format(buildingCategory), "").replace("New arrival", "")

            #Extraction of candidates for the nearest station and the distance from the station
            traffic = building.find("ul", class_="traffic").find_all("li")
            #Number of nearest stations
            station_num = len(traffic)
            #Extract those with short walking time
            min_time = 1000000    #Initialize the minimum required time
            for j in range(station_num):
                traffic[j] = traffic[j].text
                figures = re.findall(r'\d+', traffic[j])
                time = 0
                for figure in figures:
                    #Calculation of required time
                    time += int(figure)
                #Store minimum time required and index if minimum
                if time < min_time:
                    min_time = time
                    index = j

            #If you have station or route information
            if len(traffic[index].split(' ')) > 1:
                #Route decision
                line = traffic[index].split(' ')[0]
                #Determining the nearest station
                station = traffic[index].split(' ')[1].split('station')[0]
                #Obtaining transportation (bus, car, walking) to the station
                if len(traffic[index].split(' ')) > 2:
                    if "bus" in traffic[index].split(' ')[1]:
                        method = "bus"
                    elif "car" in traffic[index].split(' ')[2]:
                        method = "car"
                    else:
                        method = "walk"
                #No transportation information to the station
                else:
                    method = None
            #If there is no station or route information
            else:
                station = None
                line = None
                method = None
                min_time = None    #The csv 'time' column is written from min_time, so clear it here

            #Street address
            address = building.find(class_="address").getText().replace('\n','')
            address = re.split(pat, address)
            if len(address) < 3:
                prefecture = "Tokyo"
                city = "Adachi Ward"
            else:
                prefecture = address[1]
                city = address[2]

            #Details of the building (age, structure, total number of floors)
            building_detail = building.find(class_="detailData").find_all("td")
            for j in range(len(building_detail)):
                building_detail[j] = building_detail[j].text

            # ----Get only the number of age----
            #Age unknown
            if 'Unknown construction' == building_detail[0]:
                building_detail[0] = None
            #0 years old
            elif 'Less than' in building_detail[0]:
                building_detail[0] = 0
            #Normal value
            else:
                building_detail[0] = int(re.findall(r'\d+', building_detail[0])[0])

            #Get only the total number of floors
            building_detail[2] = int(re.findall(r'\d+', building_detail[2])[0])


            # ----Get room details----
            rooms = building.find(class_="detail").find_all("tr",
                                                            {'class': ['estate applicable', 'estate applicable gray']})
            for j in range(len(rooms)):
                #Counting the number of properties
                room_count += 1

                # ----Number of floors----
                stairs = rooms[j].find("td", class_="roomNumber").text
                #Get only numbers (delete "floor", process missing values)
                if "-" == stairs:
                    stairs = None
                else:
                    stairs = int(re.findall(r'\d+', stairs)[0])

                #Make the rent an integer type
                price = rooms[j].find(class_="roomPrice").find_all("p")[0].text
                price = round(10000 * float(price.split('Ten thousand')[0]))

                #Management fee
                kanri_price = rooms[j].find(class_="roomPrice").find_all("p")[1].text
                #Unify the notation: strip the yen unit and treat "-" and "0 Yen" as 0
                if "-" in kanri_price or "0 Yen" == kanri_price:
                    kanri_price = 0
                else:
                    kanri_price = int(kanri_price.split('Yen')[0].replace(',',''))

                #Room type (floor plan)
                room_type = rooms[j].find(class_="type").find_all("p")[0].text
                if room_type == "Studio":
                    room_type = "1R"
                #number of rooms
                num_of_rooms = int(re.findall(r'\d+', room_type)[0])


                #Room area, deletion of unit "m2"
                room_area = rooms[j].find(class_="type").find_all("p")[1].text
                room_area = float(room_area.split('m')[0])

                #South facing corner room
                special = rooms[j].find_all("span", class_="specialLabel")
                south = 0
                corner = 0
                for label in range(len(special)):
                    if "South facing" in special[label].text:
                        south = 1
                    if "Corner room" in special[label].text:
                        corner = 1

                #Get detailed url
                room_url = rooms[j].find("td", class_="btn").a.get('href')

                #rent = rent + management fee
                rent = price + kanri_price

                #Rent per 1 m^2 (unit price)
                unit_price = rent / room_area

                #Output to csv file: the default encoding is "utf-8"; use "cp932" when handling Japanese on Windows
                with open('room_data.csv', 'a', newline='', encoding='CP932') as file:
                    writer = csv.DictWriter(file, fieldnames=header)
                    writer.writerow(
                        {'No':room_count, 'building_name':buildingName, 'category':buildingCategory, 'prefecture':prefecture, 'city':city, 'station_num':station_num, 'station':station,
                              'method':method, 'time':min_time, 'age':building_detail[0], 'total_stairs':building_detail[2], 'stairs':stairs,
                              'layout':room_type, 'room_num':num_of_rooms, 'space':room_area, 'south':south, 'corner':corner, 'rent':rent, 'unit_price':unit_price, 'url':room_url})

    print("Acquired data for {} properties.".format(room_count))
    #Confirmation of acceptance conditions
    if room_count == room_num:
        print("Acceptance condition met")
    else:
        print("The counts differ by {}. The acceptance condition has not been met.".format(abs(room_count-room_num)))

if __name__ == "__main__":
    date_now = datetime.datetime.now()
    print("Start scraping:", date_now)
    #Pass the total number of pages and the number of properties to the scraping function (acceptance condition)
    path = './data.txt'
    with open(path) as f:
        data = f.readlines()
    scraping(int(data[1].replace("\n","")), int(data[0].replace("\n","")))
    date_now = datetime.datetime.now()
    print("Finished scraping:", date_now)
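
The rent and management-fee conversions in scraping.py are done inline, which makes them hard to check on their own. As a rough sketch, they could be factored into small functions that can be tested without any HTML. This assumes the page shows rent in units of 万円 (10,000 yen) and fees in 円, which is what the translated string literals in the code suggest. The function names `parse_rent` and `parse_management_fee` are hypothetical.

```python
import re


def parse_rent(text):
    """Convert a rent such as '5.6万円' (units of 10,000 yen) to integer yen."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*万", text)
    if match is None:
        raise ValueError("unrecognized rent: {!r}".format(text))
    return round(float(match.group(1)) * 10000)


def parse_management_fee(text):
    """Convert a fee such as '3,000円' to integer yen; no digits means no fee."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else 0


# Assumed input formats, for illustration only
print(parse_rent("5.6万円") + parse_management_fee("3,000円"))
print(parse_rent("12万円") + parse_management_fee("-"))
```

Raising an error on an unrecognized rent, rather than silently writing a wrong value, keeps bad rows from reaching the CSV unnoticed.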

Data visualization

First, I checked a histogram of the rent distribution and removed properties whose rent was too high, since they are unsuitable for living alone.
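
As a rough illustration of this step, the sketch below bins rents into 10,000-yen classes and drops those above a cutoff, using only the standard library. The rents are made-up values, and the 150,000-yen cutoff is an assumption for the example, since the threshold actually used is not recorded here.

```python
from collections import Counter


def rent_histogram(rents, bin_width=10000):
    """Count rents per bin; each key is the lower edge of a bin in yen."""
    return Counter((rent // bin_width) * bin_width for rent in rents)


def drop_expensive(rents, cutoff):
    """Keep only rents at or below the cutoff."""
    return [rent for rent in rents if rent <= cutoff]


# Hypothetical rents in yen (not the scraped data)
rents = [48000, 52000, 56000, 58000, 63000, 71000, 250000]
bin_width = 10000
hist = rent_histogram(rents, bin_width)
for lower in sorted(hist):
    print("{:>7}-{:>7}: {}".format(lower, lower + bin_width - 1, "#" * hist[lower]))

# 150000 is an assumed cutoff for illustration only
kept = drop_expensive(rents, cutoff=150000)
print(len(kept), "of", len(rents), "kept")
```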

From here, let's see how each variable affects rent.

Floor plan

Let's look at the number of properties and the distribution of rent for each floor plan.

A bar graph of the number of properties per floor plan shows that floor plans from 1R to 3LDK account for 98% of the total. A violin plot of rent for those floor plans shows that the rent distribution differs by floor plan. The floor plan is therefore likely to be a variable that affects rent.
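
A figure such as "98% of the total" is the combined share of the selected floor plans. A minimal sketch of that calculation follows. The layout list is made up and the function names `layout_shares` and `coverage` are hypothetical.

```python
from collections import Counter


def layout_shares(layouts):
    """Return each floor plan's share of all properties, most common first."""
    counts = Counter(layouts)
    total = sum(counts.values())
    return {layout: count / total for layout, count in counts.most_common()}


def coverage(shares, selected):
    """Combined share of the selected floor plans."""
    return sum(shares.get(layout, 0.0) for layout in selected)


# Hypothetical floor plans (not the scraped data)
layouts = ["1K"] * 6 + ["1R"] * 2 + ["2LDK"] + ["5LDK"]
shares = layout_shares(layouts)
covered = coverage(shares, ["1R", "1K", "2LDK"])
print(shares)
print("covered: {:.0%}".format(covered))
```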

Location

Let's see where the properties are concentrated.

By prefecture, most properties were in Tokyo and Chiba, with Saitama at about 3%. Looking at individual municipalities, Adachi-ku, Katsushika-ku, Matsudo-shi, Kashiwa-shi, and Arakawa-ku each have more than 1,000 properties, so they look like good places to search. Let's look at the rent distribution in each of these areas.

The rent histogram by prefecture shows that Tokyo has many properties but many of them have high rent, while Chiba has many cheaper properties. The rent box plot for each municipality shows that the green boxes for the Chiba area sit lower. Cheap properties seem to be available in Matsudo, Kashiwa, Nagareyama, Ichikawa, Abiko, Yoshikawa, and Soka. Because the rent distribution differs by municipality, the location of the property also appears to affect rent.

Travel time from the station and means of transportation

There is a weak negative correlation between travel time and rent: the longer the travel time, the cheaper the rent tends to be. I also plotted how rent differs by the means of transportation used, such as bus or walking. Rent for properties reached on foot, shown in blue, is higher than for those reached by bus. Both the means of transportation and the travel time are therefore likely to affect rent.
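
For reference, the strength of such a relationship is usually summarized with Pearson's correlation coefficient. The sketch below computes it from first principles on made-up data. The `pearson_r` function and the sample values are assumptions for illustration, and the coefficient it prints is not the one from the analysis.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical minutes from the station and rents in yen
minutes = [3, 5, 8, 10, 15, 20]
rents = [66000, 61000, 64000, 58000, 59000, 55000]
r = pearson_r(minutes, rents)
print("r = {:.2f}".format(r))
```

A value between -1 and 0 indicates that rent tends to fall as travel time grows, with values nearer 0 meaning a weaker relationship.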

Age

I grouped building age into 5-year bins and drew a box plot of rent for each bin.

Rent gradually becomes cheaper for properties older than 15 years. Building age is therefore also likely to be a variable that affects rent.
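
The 5-year grouping can be sketched as follows. The example reports the median rent per bin instead of drawing a box plot, and skips properties whose age is unknown (the scraper records those as `None`). The records and the function name `median_rent_by_age_bin` are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_rent_by_age_bin(records, bin_years=5):
    """Map the first year of each age bin to the median rent in that bin."""
    groups = defaultdict(list)
    for age, rent in records:
        if age is None:
            continue  # age unknown, so the property cannot be binned
        groups[(age // bin_years) * bin_years].append(rent)
    return {start: median(rents) for start, rents in sorted(groups.items())}


# Hypothetical (age in years, rent in yen) pairs
records = [(0, 70000), (3, 68000), (7, 66000), (12, 64000),
           (16, 58000), (18, 56000), (24, 52000), (None, 60000)]
by_bin = median_rent_by_age_bin(records)
for start, rent in by_bin.items():
    print("{:>2}-{:>2} years: {}".format(start, start + 4, rent))
```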

Total floors and types of buildings

Let's look at the distribution of rent by the total number of floors of the building and the histogram of the total number of floors.

Looking at rent by total number of floors, properties in buildings of up to 2 stories tend to be cheap. The histogram of total floors shows that most two-story properties are apartments. In addition, 95% of properties are in buildings of 10 stories or fewer. Someone living alone for the first time may long for a high-rise, but such properties seem hard to find among those suited to living alone. These results show that building information also affects rent.

South facing

Next, I check whether facing south, one of the property's features, affects rent. I drew histograms for south-facing properties and for the rest.

The histograms look similar, so I tested whether the difference in mean rent is significant. I first tested for equal variances with an F-test; equal variances was not rejected, so I ran a t-test assuming equal variances. The difference in mean rent between south-facing and other properties was significant, with south-facing properties about 1,500 yen cheaper. Whether a property faces south therefore affects rent.
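
The two statistics behind this procedure can be sketched with the standard library alone: the variance ratio used by the F-test, and the pooled-variance (Student's) t statistic. The data and the function names `f_ratio` and `pooled_t` are hypothetical. Turning the statistics into p-values needs the F and t distributions, for example from SciPy, and the tool actually used for the analysis is not recorded here.

```python
from math import sqrt
from statistics import mean, variance


def f_ratio(a, b):
    """Larger sample variance over the smaller, with (df_numerator, df_denominator)."""
    var_a, var_b = variance(a), variance(b)
    if var_a >= var_b:
        return var_a / var_b, (len(a) - 1, len(b) - 1)
    return var_b / var_a, (len(b) - 1, len(a) - 1)


def pooled_t(a, b):
    """Student's t statistic assuming equal variances, and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2


# Hypothetical rents in yen (not the scraped data)
south = [55000, 57000, 59000, 61000]
other = [56000, 58000, 60000, 62000, 64000]
f_value, f_df = f_ratio(south, other)
t_value, t_df = pooled_t(south, other)
print("F = {:.3f}, df = {}".format(f_value, f_df))
print("t = {:.3f}, df = {}".format(t_value, t_df))
```

A negative t here means the first group (south-facing) has the lower mean rent.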

Corner room

Similarly, let's look at the effect of being a corner room.

The F-test rejected equal variances, so I ran a t-test that does not assume equal variances. The difference was significant, with corner rooms about 2,000 yen more expensive. Being a corner room therefore also affects rent.
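
A t-test that does not assume equal variances is commonly Welch's t-test. Assuming that is the variant meant here, the sketch below computes its statistic and the Welch-Satterthwaite degrees of freedom on made-up data. The function name `welch_t` and the sample rents are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each sample mean
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical rents in yen (not the scraped data)
corner = [60000, 64000, 68000]
other = [57000, 58000, 59000, 60000, 61000]
t_value, df = welch_t(corner, other)
print("t = {:.3f}, df = {:.2f}".format(t_value, df))
```

Unlike the pooled test, the degrees of freedom are generally not an integer and shrink toward the smaller sample when the variances differ a lot.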

Data visualization 2

Based on the results so far, I analyze the data again to pin down the specific property conditions I would recommend to my university friend, who is actually struggling with the search.

In what district should I actually look for a property?

The relationship between the number of properties and the average rent is plotted for each city.

You can see that Matsudo City has many properties and the average rent is low.

What is the floor plan?

The number of properties and average rent for each floor plan are as follows.

The average rent for 1R, 2K, and 3K is lower, but 1K has by far the most properties. Someone living alone does not need that much space, so I think a 1K floor plan is a good choice. Using the median, the market rent for a 1K property in Matsudo was 56,000 yen.
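
A median "market rent" for one city and floor plan can be read straight from the CSV written by scraping.py. The sketch below uses the `city`, `layout`, and `rent` columns from that file's header, but the rows are made up and the function name `market_rent` is hypothetical. In the real file the city values would appear as scraped, so the filter strings would need to match them.

```python
import csv
import io
from statistics import median


def market_rent(rows, city, layout):
    """Median rent of the rows matching the city and floor plan, or None."""
    rents = [int(row["rent"]) for row in rows
             if row["city"] == city and row["layout"] == layout]
    return median(rents) if rents else None


# Hypothetical rows standing in for room_data.csv
sample_csv = """city,layout,rent
Matsudo,1K,52000
Matsudo,1K,56000
Matsudo,1K,61000
Matsudo,2K,58000
Katsushika,1K,66000
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(market_rent(rows, "Matsudo", "1K"))
print(market_rent(rows, "Kashiwa", "1K"))
```

The median is less sensitive than the mean to a few very expensive properties, which suits a rough market price.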

How old is it?

The box plot shows how the rent distribution of 1K properties in Matsudo City changes with building age.

Many properties that are 15 years old or older are below the market rent. I think it is a good idea to look for properties around 15 years old.

How long does it take?

I drew a bar graph of the number of properties by travel time from the station, color-coded by means of transportation.

It shows that 95% of properties for living alone are within a 20-minute walk. I also plotted the number of properties and the average rent for each travel time from the station. Rent becomes cheaper if the search is widened to within 15 minutes.

Result

Based on the above, the conditions for finding a property to tell a friend are as follows.

Since living alone does not require that much space, and considering rent and the number of properties, the floor plan should be **1K**
The location should be **Matsudo**, which is next to Kanamachi: its 1K market rent of **56,000 yen** is lower than the overall 1K market rent of 63,000 yen, and it has many properties
The building should be about 15 years old; if age is not a concern, searching at 15 years or older is recommended because rent is often cheaper
Rent is high within a 10-minute walk of the station, so widen the search to within **15 minutes**
Facing south does not raise the rent, so it can be added as an optional condition

Searching under these conditions will mostly turn up **apartment**-type rooms. With these conditions, I think my friend can find a good room by visiting a real estate agent in Matsudo.

In conclusion

Reflections

My main regret is that I analyzed the data before deciding on a policy. That is fine for a hobby, but it produces charts with no clear use, which wastes a great deal of time. Data must be analyzed with a purpose.

What I learned from the feedback

**Code review** Improve readability by adding comments and following coding standards
**Presentation review** The same content can come across well or badly depending on how the results are presented. I need to check whether the results actually serve the purpose, build a story, and present it in an easy-to-understand way.

Other

This task was a valuable experience because I went beyond analysis: I created materials, gave a presentation, and received feedback. I learned how long it takes to collect and analyze data and how difficult it is to convey results to others, and I would like to make use of this in the future.

That's all.

[PYTHON] [First data science ⑤] I tried to help my friend find the first property by data analysis.