[PYTHON] I tried scraping conversation data from ASKfm

Scraping text from ASKfm

I've been studying deep learning recently. I played with seq2seq from the TensorFlow tutorial and implemented something that can respond in a conversation. However, I couldn't find Japanese conversation data anywhere... so I decided to collect it myself by scraping.

Implementation

The full source code is here: https://github.com/ryosuke1217/askfm_q-a_scraper/blob/master/askfm.py

The scraping is done by driving Chrome with Selenium.

askfm.py


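import sys
from selenium import webdriver

word = sys.argv[1]  # assumed: the part of the URL after "ask.fm/"
driver = webdriver.Chrome()
driver.get("https://ask.fm/" + word)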

For word, pass the part of the target URL that follows "ask.fm/" on the command line.
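
As a rough sketch of that step, the argument could be checked and turned into a URL as below. The helper name build_profile_url and the validation are my own assumptions for illustration; the original script simply concatenates "https://ask.fm/" and word.

```python
BASE_URL = "https://ask.fm/"


def build_profile_url(argv):
    """Return the ask.fm URL for the profile named in argv[1].

    Hypothetical helper, not part of the original askfm.py.
    """
    if len(argv) < 2 or not argv[1].strip():
        raise SystemExit("usage: python askfm.py <part of the URL after ask.fm/>")
    # Tolerate a leading slash so "/partyhike" and "partyhike" both work.
    word = argv[1].strip().lstrip("/")
    return BASE_URL + word
```

In the script it would be called as build_profile_url(sys.argv), and the result passed to driver.get().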

askfm.py


while True:
    # vertical scroll position before scrolling
    previous_h = driver.execute_script("return window.pageYOffset")
    # scroll to the current bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(3)  # wait for more items to load (requires: from time import sleep)
    # vertical scroll position after scrolling
    after_h = driver.execute_script("return window.pageYOffset")
    if previous_h == after_h:
        break
print('load complete')

The loop reads the scroll position (window.pageYOffset) before and after each scroll and repeats until it no longer changes, so the page ends up scrolled all the way to the bottom with every item loaded.
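
The same idea can be wrapped in a function, which makes it easier to try out. This is only a sketch under a few assumptions of mine: it compares document.body.scrollHeight instead of window.pageYOffset (a substitution, not what the original does), and the names scroll_to_bottom, pause and max_rounds are hypothetical. The max_rounds cap is there so that a page that never stops growing cannot loop forever.

```python
from time import sleep


def scroll_to_bottom(driver, pause=3.0, max_rounds=1000):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is any object with Selenium's execute_script(script) method.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return max_rounds  # gave up after max_rounds scrolls
```

With a real browser it would be called as scroll_to_bottom(driver) right after driver.get().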

askfm.py


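from selenium.webdriver.common.by import By

# find_elements_by_* was removed in recent Selenium 4 releases; find_elements(By...) works in both
questions = driver.find_elements(By.CLASS_NAME, "streamItemContent-question")
answers = driver.find_elements(By.CLASS_NAME, "answerWrapper")

qas = [(q.find_element(By.TAG_NAME, 'h2').text, a.find_element(By.TAG_NAME, 'p').text) for q, a in zip(questions, answers)]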

This collects the question and answer elements from the loaded page and pairs up their text by position.
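
One thing to watch with zip is that it stops silently at the shorter list, so if the page ever yields a different number of question and answer elements, every later pair would be shifted. I have not checked whether that actually happens on ASKfm, but a small guard like the following sketch would at least make it visible. The function name is hypothetical and it works on plain strings rather than Selenium elements.

```python
def pair_questions_and_answers(question_texts, answer_texts):
    """Pair question and answer strings by position, refusing uneven input."""
    if len(question_texts) != len(answer_texts):
        raise ValueError(
            "got %d questions but %d answers; pairs would be misaligned"
            % (len(question_texts), len(answer_texts))
        )
    return list(zip(question_texts, answer_texts))
```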

askfm.py


with codecs.open('data/askfm_data_' + word + '.txt', 'w', 'utf-8') as f:
    for q, a in qas:
        if q == '' or a == '' or 'http' in q or 'http' in a:
            continue
        q = q.replace('\n', '')
        a = a.replace('\n', '')
        f.write(q)
        f.write('\n')
        f.write(a)
        f.write('\n')
        f.write('\n')

driver.quit()

Finally, the script filters and formats the pairs, writes them to a file, and quits the browser.
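
To make the filtering rules easier to see, here is the same logic pulled out of the file-writing loop as a sketch. The function names clean_pairs and format_pairs are mine, not from the original script; the rules themselves (skip empty text, skip anything containing 'http', strip newlines) are the ones in the loop above.

```python
def clean_pairs(qas):
    """Yield (question, answer) pairs using the same filter as the loop above."""
    for q, a in qas:
        # Skip empty entries and anything that contains a link.
        if q == '' or a == '' or 'http' in q or 'http' in a:
            continue
        # Put each question and each answer on a single line.
        yield q.replace('\n', ''), a.replace('\n', '')


def format_pairs(qas):
    """Return the file contents: question line, answer line, blank line."""
    return ''.join(q + '\n' + a + '\n\n' for q, a in clean_pairs(qas))
```

For example, format_pairs([("Q1", "A1")]) gives "Q1\nA1\n\n", which is what the loop writes for one pair.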

Running it

askfm_data_partyhike.txt


When I was young, Ayumi Hamasaki hated the rear aura, but recently Ayumi Hamasaki feels blues.
Isn't it a debooth, not a blues?

I was tired of job hunting. Please give me some advice ...
The hard work of this time is 90% of the rest of my life. It's better to keep running without giving up even if you overdo it a little. If you think that the remaining decades will be decided in a few months at most, you should be able to do your best.

Occasionally, people are invited to drink, but how many people will participate each time?
Regardless of gender, I only drink it by hand. If you do more than one, there will be a mix of people who take voyeurs and write personal information on 2channel. I get 5 to 10 DMs every time, but most of the time I don't get it because I don't get many people who I can trust.

Would you like to elope?
I don't think.

Lips, lips, eyes, eyes, hands, hands Isn't God banning anything?
I love you ~ × 3

Is your uncle taking any measures against false accusations on the commuter train?
I rarely get on a crowded train because I come to work late, but once in a while I grab a strap with both hands and protect myself completely.
・
・
The rest is omitted because the file is huge.

This is question-and-answer text rather than real conversation, but I think I can make good use of it, so that's fine.
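
For reference, reading the file back into pairs for training could look something like this. It is a minimal sketch that assumes the layout shown above (question line, answer line, blank line); the function name parse_qa_text is hypothetical.

```python
def parse_qa_text(text):
    """Split the saved text into a list of (question, answer) tuples."""
    pairs = []
    for block in text.split('\n\n'):
        lines = [line for line in block.split('\n') if line]
        # Keep only well-formed blocks of exactly one question and one answer.
        if len(lines) == 2:
            pairs.append((lines[0], lines[1]))
    return pairs
```

The text itself would come from reading a file such as askfm_data_partyhike.txt with UTF-8 encoding.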

Thank you for reading.
