Nijibox uses Wantedly for career recruitment. (Here is the link, if you are interested.)

I wanted to post a notification to our internal Slack whenever a story is updated, but when I looked at the page source, there is no RSS feed. [^1] That means I have to break out a parser and massage the data structure myself. This article records what I did for that.
The material is Nijibox's "stories" on Wantedly: https://www.wantedly.com/companies/nijibox/feed
From here, let's write code that extracts a data structure sturdy enough to be re-emitted in another format.
feed-from-wantedly.py

```python
import json
import pprint

import requests
from bs4 import BeautifulSoup

URL = 'https://www.wantedly.com/companies/nijibox/feed'

resp = requests.get(URL)
soup = BeautifulSoup(resp.content, 'html.parser')

# Fetch the contents of <script data-placeholder-key="wtd-ssr-placeholder">.
# The content of this tag is a JSON string, but it starts with '// ',
# which has to be removed before it can be parsed.
feed_raw = soup.find('script', {'data-placeholder-key': "wtd-ssr-placeholder"}).string[3:]
feeds = json.loads(feed_raw)

# The overall JSON holds various things, but the body itself is a dict
# keyed by what looks like a per-company key. There appears to be only
# one entry, so it is extracted very roughly.
feed_body = feeds['body'][list(feeds['body'].keys())[0]]

# The items that appear to be the pinned ("featured") posts
pprint.pprint(feed_body['latest_pinnable_posts'])
```
Running this produces output like the following:
```
$ python3 feed-from-wantedly.py
[{'id': 188578,
  'image': {'id': 4141479,
            'url': 'https://d2v9k5u4v94ulw.cloudfront.net/assets/images/4141479/original/9064f3ba-9327-4fce-9724-c11bf1ea71e2?1569833471'},
  'post_path': '/companies/nijibox/post_articles/188578',
  'title': 'Feel free to start with a casual interview! What Nijibox wants to convey to job seekers and their thoughts on hiring'},
 {'id': 185158,
  'image': {'id': 4063780,
            'url': 'https://d2v9k5u4v94ulw.cloudfront.net/assets/images/4063780/original/44109f75-6590-43cb-a631-cb8b719564d4?1567582305'},
  'post_path': '/companies/nijibox/post_articles/185158',
  'title': '[For beginners] Design is not "sense" but "theory". You can do it from today! How to become a UI designer'},
 {'id': 185123,
  'image': {'id': 4062946,
            'url': 'https://d2v9k5u4v94ulw.cloudfront.net/assets/images/4062946/original/ff2169c7-568e-4992-b082-56f1e1be2780?1567573415'},
  'post_path': '/companies/nijibox/post_articles/185123',
  'title': 'We had a React study session with Mr. Ikeda of ICS!'}]
```
This time, I built it in the following environment.

The steps up to "receive the response with requests and parse it with BeautifulSoup4" are commonplace, so I will skip over them here.
The goal is to find and parse the "Featured Posts", but there are roughly two issues here:
- There is an easy-to-spot element with `id="posts"`, but it contains quite a few `div`s, which is a pain.
- **At the time the response arrives, the `body` part is almost empty.**

The latter is especially troublesome: the usual approach of chasing tags with `soup.find` simply does not work. [^2]
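The "almost empty body" situation can be reproduced with a toy page: the visible markup contains nothing useful, and the real content sits in a `<script>` tag as a JSON payload. The HTML below is my own made-up miniature, not Wantedly's actual markup, and this stdlib-only parser is just a sketch of the idea (BeautifulSoup does the same thing far more concisely).

```python
from html.parser import HTMLParser

# Toy reproduction (not Wantedly's real markup): the body is nearly
# empty, and the content lives in a <script> tag as prefixed JSON.
PAGE = """
<html><body>
<div id="posts"></div>
<script data-placeholder-key="wtd-ssr-placeholder">// {"page": "companies#feed"}</script>
</body></html>
"""


class PlaceholderScriptFinder(HTMLParser):
    """Collect the text of the <script> tag carrying a given attribute."""

    def __init__(self, key):
        super().__init__()
        self.key = key
        self._in_target = False
        self.payload = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'script' and ('data-placeholder-key', self.key) in attrs:
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_target = False

    def handle_data(self, data):
        # Script contents arrive here as raw text
        if self._in_target:
            self.payload = data


finder = PlaceholderScriptFinder('wtd-ssr-placeholder')
finder.feed(PAGE)
print(finder.payload)
```

No JavaScript execution is needed: the data is already in the page, just not in the DOM a naive tag search would look at.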
So, where do we parse?

Here it is: the Google search results for "Nijibox" and "Wantedly" mentioned above. The summaries, which do not appear in the `body` tag of the response, are listed there properly.
Wantedly's site apparently works by embedding the content itself in the page as JSON and rendering it when the JavaScript runs.
This is the only line where BeautifulSoup really does its job:

```python
feed_raw = soup.find('script', {'data-placeholder-key': "wtd-ssr-placeholder"}).string[3:]
```
BeautifulSoup's `find` matches not only on tag names but also at the attribute level, so you can grab the element you want in one shot. Very convenient.

The reason for `string[3:]` is that the content begins with `// `, which gets in the way of parsing it as JSON. [^3]
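The `string[3:]` slice assumes the prefix is always exactly three characters. As a slightly more defensive sketch (the sample payload below is made up, shaped like the real one), a regex can strip any leading slashes and whitespace before handing the string to `json.loads`:

```python
import json
import re

# Hypothetical payload shaped like the one Wantedly embeds: a JSON
# object prefixed with '// ', which json.loads cannot parse as-is.
raw = '// {"page": "companies#feed", "body": {}}'

# Strip a leading comment marker of any width, then parse.
cleaned = re.sub(r'^\s*//\s*', '', raw)
data = json.loads(cleaned)
print(data['page'])  # -> companies#feed
```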
Roughly speaking, the parsed object looks like this:

```json
{
  "router": { "(omitted)" },
  "page": "companies#feed",
  "auth": { "(omitted)" },
  "body": {
    "c29bc423-7f81-41c2-8786-313d0998988c": {
      "company": { "(omitted)" }
    }
  }
}
```
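Since `body` is a dict with a single UUID-like key, `next(iter(...))` grabs the sole value without building a list of keys first. This is just an alternative sketch; the sample data below is invented for illustration, shaped like the structure above:

```python
# Made-up sample mirroring the structure of the parsed feed JSON
feeds = {
    "body": {
        "c29bc423-7f81-41c2-8786-313d0998988c": {
            "company": {"name": "Nijibox"},
            "latest_pinnable_posts": [],
        }
    }
}

# Take the single value without materializing the key list.
# (If more than one key ever appears, this silently takes the first.)
feed_body = next(iter(feeds["body"].values()))
print(feed_body["company"]["name"])  # -> Nijibox
```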
A mysterious UUID. Perhaps it is used for something, separately from the company ID.
So we need to dig one level into this content:

```python
feed_body = feeds['body'][list(feeds['body'].keys())[0]]
```

Fortunately, the key used inside `body` seems to exist only for this one company, so I pull out its value very roughly.
For now, two items look useful:

- `posts`: all stories so far?
- `latest_pinnable_posts`: the part corresponding to "Featured Posts"
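For a Slack notification, each `latest_pinnable_posts` entry already carries enough: a title and a relative `post_path` that can be joined with the site root into an absolute link. A small sketch, using a sample entry shaped like the output shown earlier (the `BASE` URL is my own assumption):

```python
from urllib.parse import urljoin

BASE = 'https://www.wantedly.com'

# Sample entry shaped like the pprint output above
posts = [
    {'id': 188578,
     'post_path': '/companies/nijibox/post_articles/188578',
     'title': 'Feel free to start with a casual interview!'},
]

# Pair each title with an absolute URL for use in a notification
links = [(p['title'], urljoin(BASE, p['post_path'])) for p in posts]
for title, url in links:
    print(f'{title}\n{url}')
```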
This time I decided that the bare minimum was enough, so I print `latest_pinnable_posts` and call it done. Thanks for reading.

```python
pprint.pprint(feed_body['latest_pinnable_posts'])
```
Things I have not built yet at this point:

- Diff against the previous run's parse result and notify only about new posts
- Turn the data into an RSS feed and hand it to Slack's RSS integration [^4]

Those are the likely approaches, but they are out of scope for now.
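The diff-based approach could look like the sketch below: persist the post IDs seen last time and set-diff them against the current scrape. The state file name and layout are my own choice, not from anything Wantedly provides:

```python
import json
from pathlib import Path

# Hypothetical state file remembering the IDs from the previous run
STATE = Path('seen_posts.json')


def new_posts(current_posts):
    """Return entries whose id was not present in the previous run,
    then persist the current ids for next time."""
    seen = set(json.loads(STATE.read_text())) if STATE.exists() else set()
    fresh = [p for p in current_posts if p['id'] not in seen]
    STATE.write_text(json.dumps([p['id'] for p in current_posts]))
    return fresh


# Example: first run reports everything, second run reports nothing new
posts = [{'id': 188578, 'title': 'a'}, {'id': 185158, 'title': 'b'}]
print(len(new_posts(posts)))
print(len(new_posts(posts)))
```

Only the genuinely new entries would then be forwarded to Slack.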
It had been a while since I last touched BeautifulSoup4, but it really does have everything you need and is easy to use.
[^1]: Searching the source of the top page or a company page for "application/rss+xml" does turn up a feed, but its content appears to be empty.

[^2]: Of course, you could run the JavaScript in a headless browser or the like and let it build the DOM the browser would see, but I did not take that approach this time.

[^3]: I do not know the reason.

[^4]: Since the scraped data carries no post date, the drawback is that you cannot do things like "re-notify when a post is edited".