[PYTHON] Parse the response so you can scrape Wantedly's own story

Preface

Nijibox uses Wantedly for its mid-career recruiting. (If you are interested, take a look.)

So I wondered whether I could notify our company Slack whenever a story is updated, and looked at the page source: no RSS feed exists.[^1] That being the case, I would have to make full use of a parser and shape the data structure properly myself...

This article is a record of what I did to get there.

Today's start and goal

The material is Nijibox's "stories" on Wantedly. https://www.wantedly.com/companies/nijibox/feed

Starting from there, we will write code that extracts a data structure sturdy enough to be reused for other kinds of output.

Answer (= finished product)

feed-from-wantedly.py


import json
import pprint
import requests
from bs4 import BeautifulSoup

URL = 'https://www.wantedly.com/companies/nijibox/feed'

resp = requests.get(URL)
soup = BeautifulSoup(resp.content, 'html.parser')
# The <script data-placeholder-key="wtd-ssr-placeholder"> tag holds the content we want
# Its content is a JSON string, except that it starts with '// ', which must be removed before parsing
feed_raw = soup.find('script', {'data-placeholder-key': "wtd-ssr-placeholder"}).string[3:]
feeds = json.loads(feed_raw)
# Inside the whole JSON, 'body' is a dict keyed by what looks like a company-specific key
# There seems to be only one such key, so extract it very roughly
feed_body = feeds['body'][list(feeds['body'].keys())[0]]

# The items that correspond to the pinned ("featured") posts
pprint.pprint(feed_body['latest_pinnable_posts'])

Running this produces output like the following:

$ python3 feed-from-wantedly.py
[{'id': 188578,
  'image': {'id': 4141479,
            'url': 'https://d2v9k5u4v94ulw.cloudfront.net/assets/images/4141479/original/9064f3ba-9327-4fce-9724-c11bf1ea71e2?1569833471'},
  'post_path': '/companies/nijibox/post_articles/188578',
  'title': 'Feel free to start with a casual interview! What Nijibox wants to convey to job seekers and their thoughts on hiring'},
 {'id': 185158,
  'image': {'id': 4063780,
            'url': 'https://d2v9k5u4v94ulw.cloudfront.net/assets/images/4063780/original/44109f75-6590-43cb-a631-cb8b719564d4?1567582305'},
  'post_path': '/companies/nijibox/post_articles/185158',
  'title': '[For beginners] Design is not "sense" but "theory". You can do it from today! How to become a UI designer'},
 {'id': 185123,
  'image': {'id': 4062946,
            'url': 'https://d2v9k5u4v94ulw.cloudfront.net/assets/images/4062946/original/ff2169c7-568e-4992-b082-56f1e1be2780?1567573415'},
  'post_path': '/companies/nijibox/post_articles/185123',
  'title': 'We had a React study session with Mr. Ikeda of ICS!'}]

Preparation

This time, I worked in the following environment: Python 3, with the requests and beautifulsoup4 packages installed.

Looking at it in order

Until "Receive the response with requests and parse it with BeautifulSoup4" is so-called common, so I will skip it this time.

Deciding where to parse

This time, I want to find and parse the "featured posts" area, but there are two problems here.

--There is an easy-to-spot element with `id="posts"`, but it contains a surprising number of nested divs, which is a pain.
--**At the time of the response, the `body` part is almost empty**

The latter is especially troublesome: the usual approach of chasing tags with soup.find simply does not work.[^2]
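As a quick illustration, here is a minimal check. This is just a sketch based on the behavior described above, not part of the finished script:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.wantedly.com/companies/nijibox/feed')
soup = BeautifulSoup(resp.content, 'html.parser')
# The '#posts' container comes back as little more than an empty wrapper,
# because the stories are filled in client-side by JS
print(soup.find(id='posts'))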

Then, where should we parse?

(Screenshot: the relevant part of the page source)

Here it is.

Thoughts on Wantedly's SSR (from the source alone)

(Screenshot: Google search results for "Nijibox" and "Wantedly")

This is the result of a Google search for "Nijibox" and "Wantedly" as mentioned above, and the description text that does not appear in the body tag of the response is properly indexed. Wantedly's site appears to be built so that the content itself is rendered when the JS runs, using JSON embedded in the page as its raw material.

Extract the corresponding item with Beautiful Soup

This is the only line where BeautifulSoup actually does its job.

feed_raw = soup.find('script', {'data-placeholder-key': "wtd-ssr-placeholder"}).string[3:]

BeautifulSoup's find matches not only tag names but also attributes, so you can get the element you want in one shot. Very handy. The reason for string[3:] is that this content starts with //, an annoyance when parsing it as JSON.[^3]
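Incidentally, if hard-coding the 3-character slice feels fragile, a slightly more defensive variant (just a sketch, not what the finished script above does) would strip the prefix conditionally:

raw = soup.find('script', {'data-placeholder-key': "wtd-ssr-placeholder"}).string
# Strip the '// ' prefix only if it is actually present
if raw.startswith('//'):
    raw = raw[2:].lstrip()
feeds = json.loads(raw)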

After that, all that remains is to turn the JSON string into an object and pick out what we need...

Roughly speaking, the contents of the parsed object look like this.

{
  "router": { ...omitted... },
  "page": "companies#feed",
  "auth": { ...omitted... },
  "body": {
    "c29bc423-7f81-41c2-8786-313d0998988c": {
      "company": { ...omitted... }
    }
  }
}

A mysterious UUID. It is probably used as an identifier separate from the company ID.

So we need to dig into this content.

feed_body = feeds['body'][list(feeds['body'].keys())[0]]

Fortunately, body seems to contain a key for only this one company, so I grab the first key rather roughly and dig in.
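The same extraction can be written a little more tidily. This is an equivalent sketch under the same single-key assumption:

# Take the sole key of 'body' with next(iter(...)),
# still assuming it holds exactly one company entry
feed_body = feeds['body'][next(iter(feeds['body']))]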

Finally, extract the items

For now, there are two items that seem to be useful.

--`posts`: all the stories so far?
--`latest_pinnable_posts`: the part corresponding to the "featured posts"

This time I decided I only needed the bare minimum, so outputting latest_pinnable_posts finishes the job. Thanks for reading this far.

pprint.pprint(feed_body['latest_pinnable_posts'])
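If you want the "data structure that can withstand another output" from the goal at the top, one possible shape is the following sketch. The field names follow the pprint output above; the normalized keys and the BASE constant are my own choices:

BASE = 'https://www.wantedly.com'
# Normalize each pinned post into a flat record with absolute URLs
stories = [
    {
        'id': post['id'],
        'title': post['title'],
        'url': BASE + post['post_path'],
        'image_url': post['image']['url'],
    }
    for post in feed_body['latest_pinnable_posts']
]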

What about Slack notifications?

At the time of writing, I have not built it yet.

--Diff against the previous run's parse results and notify only about new posts
--Turn the data into an RSS feed and throw it at Slack's RSS integration [^4]

Approaches like these come to mind; for the time being, neither is implemented. (A rough sketch of the first idea follows.)
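For illustration only, a minimal sketch of the diff-and-notify idea. Everything here is an assumption: SEEN_FILE and WEBHOOK_URL are hypothetical names, and the URL is a placeholder for a real Slack incoming webhook:

import json
import pathlib

import requests

SEEN_FILE = pathlib.Path('seen_posts.json')  # hypothetical state file
WEBHOOK_URL = 'https://hooks.slack.com/services/...'  # placeholder webhook URL

def notify_new_posts(posts):
    # Load the IDs we have already announced, if any
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    new_posts = [p for p in posts if p['id'] not in seen]
    for p in new_posts:
        text = 'New story: {} https://www.wantedly.com{}'.format(p['title'], p['post_path'])
        requests.post(WEBHOOK_URL, json={'text': text})
    # Remember everything we have notified about so far
    SEEN_FILE.write_text(json.dumps(sorted(seen | {p['id'] for p in new_posts})))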

Looking back

It had been a while since I last touched BeautifulSoup4, but it really does have all the features you need and is easy to use.

[^1]: Searching the source of the Wantedly top page or a company page for "application/rss+xml" does turn up a feed, but it does not seem to contain anything in particular.
[^2]: Of course, you could run the JS in a headless browser or the like and have it build the DOM a browser would see, but I did not adopt that this time.
[^3]: I don't know the reason.
[^4]: Since the data we can collect carries no posting date/time, the bottleneck is that you cannot do things like "notify again when a post is edited".
