[PYTHON] A championship (code only): who's pushing intra-organization likes on the Advent Calendar?

Introduction

The Advent Calendar season is always exciting, and personally it was a month where I enjoyed the buzz around the VIP and stock-price-AI posts. My own posts didn't buzz at all, but I'm quite fond of Employee 2vec. I've finished my work for the year, so I'm writing this article over a highball while thinking about what to do next.

I usually work on machine learning and computer vision, but once in a while it's nice to build something fun with the same pure enthusiasm you had when you first started programming.

・ ・ ・

Rather than closing out the Advent Calendar on a beautiful note, I'd like to end it by leaving a little parting shot.

On the Advent Calendar, calendars are ranked by likes, and the companies at the top look good and get noticed. But doesn't something bother you?

**"Aren't they just liking each other's posts?"**

Today, I'm going to hunt for companies where these intra-organization "likes" are happening. This is a mean-spirited article, so if that sort of thing offends you, I recommend closing the page right about here.

Let's try it!

Extracting company information

This list doesn't seem to be available through the Qiita API, so I'll scrape it instead. The companies participating in the Advent Calendar can be fetched as follows.

import requests
import bs4

# Get the list of companies participating in the Advent Calendar
ret = requests.get('https://qiita.com/advent-calendar/2019/categories/company')

soup = bs4.BeautifulSoup(ret.text, "html.parser")
companies = soup.find_all("a", class_="ac-Item_name")
companies = [{'name': c.text.strip(), 'href': c['href']} for c in companies]

Next, I'll fetch the list of articles each company wrote for its calendar. For entries that link somewhere other than Qiita, only the user ID is saved.


target_list = []
for company in companies:
    ret = requests.get('https://qiita.com' + company['href'])

    soup = bs4.BeautifulSoup(ret.text, "html.parser")
    contents = soup.find_all("div", class_="adventCalendarItem")

    data = []
    for content in contents:
        d = {}
        author = content.find("a", class_='adventCalendarItem_author')
        d['user'] = author['href'][1:]  # strip the leading '/' to get the user ID

        entry = content.find("div", class_="adventCalendarItem_entry")
        if entry is not None:
            item = entry.find('a')
            # Keep the link and title only for articles hosted on Qiita
            if item is not None and 'https://qiita.com' in item['href']:
                d['href'] = item['href']
                d['title'] = item.text.strip()
        data.append(d)

    target_list.append({'name': company['name'], 'items': data})

Extracting likes

Use the Qiita API to get the users who liked each article. First, fetch the article's total like count, then page through the likes endpoint and collect the liker IDs. A TOKEN can be issued by clicking your icon in the upper right and going to Settings ➝ Applications ➝ Personal Access Token.


import json
import math
import time

import requests


def get_likes(content_id, token):
    headers = {'Authorization': f'Bearer {token}'}
    per_page = 100

    # First, fetch the article itself to learn its total like count
    ret = requests.get(f'https://qiita.com/api/v2/items/{content_id}', headers=headers)
    if ret.status_code != 200:
        print(f'https://qiita.com/api/v2/items/{content_id}')
        raise requests.ConnectionError(f"Expected status code 200, but got {ret.status_code}")
    likes_count = json.loads(ret.content.decode('utf-8'))['likes_count']
    nb_pages = math.ceil(likes_count / per_page)
    #time.sleep(3)

    # Then page through the likes endpoint, 100 likers at a time
    likes = []
    for p in range(nb_pages):
        params = {'page': 1 + p, 'per_page': per_page}
        ret = requests.get(f'https://qiita.com/api/v2/items/{content_id}/likes', params=params, headers=headers)
        if ret.status_code != 200:
            raise requests.ConnectionError(f"Expected status code 200, but got {ret.status_code}")
        likes.extend(json.loads(ret.content.decode('utf-8')))
        #time.sleep(3)

    return likes

Here, a problem arises: the Qiita API only allows 1,000 calls per hour. So I decided to scrape this part too. Given an article URL, fetch the users who liked it from each page of its likers list.


import requests
import bs4


def get_likes_direct(url):
    page = 1
    likes = []
    while True:
        ret = requests.get(f'{url}/likers?page={page}')
        soup = bs4.BeautifulSoup(ret.text, "html.parser")
        # find_all returns an empty list once there are no more likers
        users = soup.find_all("li", class_="GridList__user")
        local_likes = [u.find('h4', class_='UserInfo__name').find('a')['href'][1:] for u in users]
        if len(local_likes) == 0:
            break
        # Mirror the Qiita API payload shape: [{'user': {'id': ...}}, ...]
        local_likes = [{'user': {'id': ll}} for ll in local_likes]
        likes.extend(local_likes)
        page += 1
    return likes
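To tie the two sections together, here is a minimal sketch of a driver that groups per-article liker IDs by company. The function name `collect_company_likes` and the `fetch_likers` parameter are my own; the sketch assumes the `target_list` shape built earlier and the `[{'user': {'id': ...}}]` payload shape returned above.

```python
def collect_company_likes(target_list, fetch_likers):
    """Map each company name to a list of per-article liker-ID sets.

    fetch_likers: a callable taking an article URL and returning a list
    shaped like [{'user': {'id': ...}}, ...] (e.g. get_likes_direct).
    """
    likes_by_company = {}
    for target in target_list:
        per_article = []
        for item in target['items']:
            if 'href' in item:  # only articles hosted on Qiita carry an href
                likers = fetch_likers(item['href'])
                per_article.append({like['user']['id'] for like in likers})
        likes_by_company[target['name']] = per_article
    return likes_by_company
```

Passing the fetcher in as an argument keeps the grouping logic testable without hitting the network.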

Calculating the ranking

Now, this was the part I agonized over the most, but after all that agonizing I kept it simple:

- Extract the users who liked 1/4 or more of the articles in the organization.
- Only consider organizations with more than 5 Qiita articles.

For example, if an organization has eight articles and a user liked two or more of them, I count that user's likes as intra-organization likes. Strictly speaking, I should also check how often that user likes articles from other organizations, but since very few people like a quarter of any calendar's articles, I went with this condition as a first pass.
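As a minimal sketch of the rule above (the function name, the input shape of one liker-ID set per article, and the parameter defaults are my assumptions, not the author's published code):

```python
import math
from collections import Counter


def internal_likers(article_likers, min_articles=5, ratio=0.25):
    """Return users who liked at least `ratio` of an organization's articles.

    article_likers: one set of liker IDs per Qiita article in the organization.
    Organizations with `min_articles` or fewer articles are skipped.
    """
    n_articles = len(article_likers)
    if n_articles <= min_articles:
        return []
    threshold = math.ceil(n_articles * ratio)  # e.g. 8 articles -> 2 likes
    counts = Counter(uid for likers in article_likers for uid in likers)
    return sorted(u for u, c in counts.items() if c >= threshold)
```

With eight articles the threshold works out to 2 likes, matching the example above.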

And the results!

Sorry — I know everyone is curious, but publishing the results as-is wouldn't be a good idea, so please try it yourself; I've put the code here. This next part matters: **only those who move their hands get results.**

By the way, ABEJA came in 27th in the internal-like ranking. Have a great year!
