Scraping text from ASKfm

This article is not directly about deep learning.

I've been studying deep learning recently. After playing with the seq2seq model from the TensorFlow tutorial, I implemented something that can respond to conversation. However, I couldn't find Japanese conversation data anywhere, so I decided to collect some by scraping.

Implementation

The full source code is here: https://github.com/ryosuke1217/askfm_q-a_scraper/blob/master/askfm.py

The scraping is done by driving Chrome with Selenium.

`askfm.py`


from selenium import webdriver

# Launch Chrome and open the target profile page.
driver = webdriver.Chrome()
driver.get("https://ask.fm/" + word)

For `word`, pass the part of the target URL that follows "ask.fm/" as a command-line argument.
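As a rough illustration of how `word` might be turned into the page URL, here is a minimal sketch. The helper name `profile_url` and its error handling are assumptions made for this example and are not taken from the original script.

```python
BASE_URL = "https://ask.fm/"


def profile_url(argv):
    """Build the profile URL from an argument list shaped like sys.argv.

    Hypothetical helper: expects exactly one argument after the script name.
    """
    if len(argv) != 2 or not argv[1].strip("/"):
        raise ValueError("usage: python askfm.py <part of the URL after ask.fm/>")
    # Tolerate stray slashes such as "/partyhike/".
    return BASE_URL + argv[1].strip("/")


# Example: the argument list for "python askfm.py partyhike"
url = profile_url(["askfm.py", "partyhike"])
```

In a script, the list passed in would normally be `sys.argv`.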

`askfm.py`


while True:
    # Remember the vertical scroll offset before scrolling.
    previous_h = driver.execute_script("return window.pageYOffset;")
    # Jump to the bottom so the page loads the next batch of posts.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(3)  # give the new posts time to load
    after_h = driver.execute_script("return window.pageYOffset;")
    # No movement means nothing new was loaded: we are at the bottom.
    if previous_h == after_h:
        break
print('load complete')

The loop compares the scroll position before and after each scroll and repeats until it stops changing, which means the page has been scrolled all the way to the bottom.
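The same "scroll until nothing moves" idea can be sketched as a small function that does not depend on Selenium, which makes it easy to try without a browser. The function name `scroll_until_stable`, its parameters, and the `max_rounds` safety limit are assumptions added for this illustration.

```python
from time import sleep


def scroll_until_stable(get_offset, scroll_down, wait=lambda: sleep(3), max_rounds=1000):
    """Scroll repeatedly until the scroll offset stops changing.

    get_offset and scroll_down are callables supplied by the caller.
    Returns the number of scrolls that actually moved the page.
    """
    moved = 0
    for _ in range(max_rounds):  # hypothetical guard against an endless feed
        before = get_offset()
        scroll_down()
        wait()  # let newly loaded content appear
        if get_offset() == before:
            break
        moved += 1
    return moved


# With Selenium, the callables could wrap the same scripts used in the loop:
# scroll_until_stable(
#     lambda: driver.execute_script("return window.pageYOffset;"),
#     lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"),
# )
```

Passing the page operations in as callables keeps the stopping rule separate from the browser, so the waiting time or the limit can be changed in one place.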

`askfm.py`


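from selenium.webdriver.common.by import By

# find_elements(By...) works in Selenium 3 and 4; the find_elements_by_* helpers were removed in later Selenium 4 releases.
questions = driver.find_elements(By.CLASS_NAME, "streamItemContent-question")
answers = driver.find_elements(By.CLASS_NAME, "answerWrapper")

# Pair each question heading with the first paragraph of its answer.
qas = [(q.find_element(By.TAG_NAME, 'h2').text, a.find_element(By.TAG_NAME, 'p').text) for q, a in zip(questions, answers)]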

This collects the question and answer elements from the loaded page and pairs up their text.

`askfm.py`


with codecs.open('data/askfm_data_' + word + '.txt', 'w', 'utf-8') as f:
    for q, a in qas:
        if q == '' or a == '' or 'http' in q or 'http' in a:
            continue
        q = q.replace('\n', '')
        a = a.replace('\n', '')
        f.write(q)
        f.write('\n')
        f.write(a)
        f.write('\n')
        f.write('\n')

driver.quit()

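Finally, the script skips empty entries and anything containing a URL, strips line breaks, writes each question and answer to the file on consecutive lines followed by a blank line, and closes the browser.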
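For reference, a file in this layout (question line, answer line, blank line) could be read back into pairs roughly as follows. This is only a sketch: `read_pairs` is a hypothetical helper that is not part of the original script, and it silently drops any block that does not have exactly two lines.

```python
def read_pairs(lines):
    """Yield (question, answer) tuples from question/answer/blank-line text."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            block.append(line)
            continue
        # A blank line closes the current block.
        if len(block) == 2:
            yield (block[0], block[1])
        block = []
    # Handle a final block that is not followed by a blank line.
    if len(block) == 2:
        yield (block[0], block[1])


# Example with an in-memory sample instead of a real file:
sample = ["Q1\n", "A1\n", "\n", "Q2\n", "A2\n", "\n"]
pairs = list(read_pairs(sample))

# With a real file:
# with open('data/askfm_data_partyhike.txt', encoding='utf-8') as f:
#     pairs = list(read_pairs(f))
```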

Running it

`askfm_data_partyhike.txt`


When I was young, Ayumi Hamasaki hated the rear aura, but recently Ayumi Hamasaki feels blues.
Isn't it a debooth, not a blues?

I was tired of job hunting. Please give me some advice ...
The hard work of this time is 90% of the rest of my life. It's better to keep running without giving up even if you overdo it a little. If you think that the remaining decades will be decided in a few months at most, you should be able to do your best.

Occasionally, people are invited to drink, but how many people will participate each time?
Regardless of gender, I only drink it by hand. If you do more than one, there will be a mix of people who take voyeurs and write personal information on 2channel. I get 5 to 10 DMs every time, but most of the time I don't get it because I don't get many people who I can trust.

Would you like to elope?
I don't think.

Lips, lips, eyes, eyes, hands, hands Isn't God banning anything?
I love you ~ × 3

Is your uncle taking any measures against false accusations on the commuter train?
I rarely get on a crowded train because I come to work late, but once in a while I grab a strap with both hands and protect myself completely.
(The rest is omitted because the file is huge.)

It's question-and-answer text rather than real conversation, but I can still make good use of it, so that's fine.

Thank you for reading.
