[PYTHON] I tried making a document search Slack command with Kendra, announced at re:Invent 2019

Hello, this is the day 8 post of the ABEJA Advent Calendar.

Introduction

An interesting service was announced at re:Invent 2019 the other day: the enterprise search service [Amazon Kendra](https://aws.amazon.com/jp/about-aws/whats-new/2019/12/announcing-amazon-kendra-reinventing-enterprise-search-with-machine-learning/). A company uses many communication tools, but the information in them tends to flow past without being organized, which easily leads to information gaps. This is exactly the kind of problem Kendra is made for. When it was announced, I felt I had to try it.

So I made a Slack command that can search documents with natural-language queries, like this. [Screenshot: Slack] Below, I introduce how I built it while trying out Kendra right away.

System configuration

Kendra is in preview as of December 8, 2019, so the services that can be used as data sources are limited to a few, one of which is Amazon S3.

So I decided to enable natural-language search by first gathering in-house documents in S3 and pointing Kendra at them. However, it is not quite that straightforward: as I had guessed, Kendra does not support Japanese. [Screenshot: Amazon Kendra FAQs] I therefore work around this by combining it with Amazon Translate, the AWS translation service. If Translate is applied both when putting documents into S3 and when issuing queries, the interface can be in any language while everything is processed in English internally. (I really hope Kendra supports Japanese soon.) For the interface I chose Slack, because it is our everyday communication tool and commands are easy to implement.

It is a bit of a mishmash, but in summary, the architecture looks like this. [Figure: Architecture]

Document preparation

This covers the upper half of the architecture diagram. Basically, all it has to do is download the specified document, translate it, and upload it to S3. However, Translate limits the size of text that can be translated in one request, so I translate line by line. This may produce slightly odd English, but since this is only a trial I proceed without worrying about it. Also, I wanted the Slack command to return a link to the original document, so the URL is stored in the metadata of the S3 object.

import io
import os

import boto3


def translate_and_upload(url, title, content):
    client = boto3.client('translate')
    bucket = boto3.resource('s3').Bucket('ynaka-kendra')
    # Translate content to English.
    translated = ''
    for line in content.splitlines():
        if len(line) > 0:
            response = client.translate_text(
                Text=line,
                SourceLanguageCode='ja',
                TargetLanguageCode='en'
            )
            translated += (response['TranslatedText'] + os.linesep)
        else:
            translated += os.linesep
    # Upload to S3.
    file_obj = io.BytesIO(translated.encode('utf-8'))
    bucket.upload_fileobj(file_obj, title, ExtraArgs={'Metadata': {'url': url}})
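Translating line by line keeps each request small but loses the context between sentences. One alternative, sketched below, is to pack consecutive lines into chunks that stay under a byte budget and translate each chunk in one request. The helper name `chunk_lines` and the default value of `max_bytes` are assumptions made for this illustration; check the current Amazon Translate quota for the real limit.

```python
def chunk_lines(text, max_bytes=4500):
    """Group consecutive lines into chunks whose UTF-8 size stays under max_bytes.

    A single line longer than max_bytes is emitted as its own chunk
    and would still need further splitting before translation.
    """
    chunks = []
    current = []
    current_size = 0
    for line in text.splitlines():
        # +1 accounts for the newline that joins this line to the chunk.
        line_size = len(line.encode('utf-8')) + 1
        if current and current_size + line_size > max_bytes:
            # The next line would not fit, so close the current chunk.
            chunks.append('\n'.join(current))
            current = []
            current_size = 0
        current.append(line)
        current_size += line_size
    if current:
        chunks.append('\n'.join(current))
    return chunks
```

Each chunk could then be passed as `Text` to `translate_text` in place of a single line.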

Document search

This covers the lower half of the architecture diagram. This is where Kendra, today's star, makes its appearance.

Building Kendra

First, build the Kendra index that the search is based on. Basically, you can do this by following "Getting Started with the Console".

Start by creating the index. Creating a new IAM role for Kendra takes about 30 seconds. Creating the index itself takes about 30 minutes, so take a long coffee break (lol).

Next, select the data source; this time, choose S3 and specify the S3 path to connect. You can also set how often to sync, but since this is just a trial I chose "Run on demand". The settings review screen then appears, and after another slightly long coffee break the data source is created.

The documents are already in S3, so I start the sync with Kendra. Left as is, however, the IAM permissions are insufficient and an error occurs, so I add permissions for Kendra itself and for S3 to the role. With that, the sync finally succeeds.
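The exact policy added to fix the sync error is not shown in this post. As a rough sketch of the kind of policy involved, the function below builds a policy document that lets the data source role read the bucket and write documents into the index. The action list reflects my understanding of the AWS documentation for S3 data sources, and the function name, account ID, region, and index ID are placeholders, so treat all of it as an assumption.

```python
import json


def build_data_source_policy(bucket_name, index_arn):
    """Return an IAM policy document (as a dict) for a Kendra S3 data source role."""
    return {
        'Version': '2012-10-17',
        'Statement': [
            {
                # Read the documents stored in the bucket.
                'Effect': 'Allow',
                'Action': ['s3:GetObject'],
                'Resource': [f'arn:aws:s3:::{bucket_name}/*'],
            },
            {
                # List the bucket so Kendra can discover objects to sync.
                'Effect': 'Allow',
                'Action': ['s3:ListBucket'],
                'Resource': [f'arn:aws:s3:::{bucket_name}'],
            },
            {
                # Add and remove documents in the target index.
                'Effect': 'Allow',
                'Action': ['kendra:BatchPutDocument', 'kendra:BatchDeleteDocument'],
                'Resource': [index_arn],
            },
        ],
    }


if __name__ == '__main__':
    policy = build_data_source_policy(
        'ynaka-kendra',
        'arn:aws:kendra:us-east-1:123456789012:index/EXAMPLE-INDEX-ID',  # placeholder ARN
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy after replacing the placeholder ARN.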

Document search API

The Slack integration is implemented with Slash Commands, using the common API Gateway + Lambda configuration. Don't forget to add Kendra permissions to the role that runs the Lambda function. Also note that Kendra can only be used with the latest boto3.

import boto3
import base64
import json
import logging
import os
import urllib.parse


logger = logging.getLogger()
logger.setLevel(logging.INFO)


def respond(err, res=None):
    return {
        'statusCode': '400' if err else '200',
        'response_type': 'in_channel',
        'body': str(err) if err else json.dumps({
            'response_type': 'in_channel',
            'text': res 
        }), 
        'headers': {
            'Content-Type': 'application/json',
        },  
    }   


def handler(event, context):
    # Prepare clients.
    translate = boto3.client('translate')
    kendra = boto3.client('kendra')
    s3 = boto3.resource('s3')
    # Parse request.
    body = event['body']
    params = urllib.parse.parse_qs(base64.b64decode(body).decode('utf-8'))
    token = params['token'][0]
    if token != os.environ['VERIFY_TOKEN']:
        logger.error(f'Request token ({token}) does not match expected.')
        return respond(Exception('Invalid request token.'))
    if 'text' not in params:
        logger.error('The text should be included in command.')
        return respond(Exception('Need text.'))
    query = params['text'][0]
    # Translate query to English.
    response = translate.translate_text(
        Text=query,
        SourceLanguageCode='ja',
        TargetLanguageCode='en'
    )
    query = response['TranslatedText']
    # Find the most relevant document.
    response = kendra.query(IndexId=os.environ['INDEX_ID'], QueryText=query)
    # Create response message.
    message = ''
    for i, item in enumerate(response['ResultItems'][:3]):
        title = item['DocumentTitle']['Text']
        excerpt = item['DocumentExcerpt']['Text']
        s3_uri = item['DocumentId'].replace('s3://', '').split('/')
        url = s3.Object(s3_uri[0], '/'.join(s3_uri[1:])).metadata['url']
        message_i = f'{i + 1}. *{title}*{os.linesep}url: {url}{os.linesep}```{excerpt}```{os.linesep}{os.linesep}'
        message += message_i
    # Return result.
    return respond(None, message)

Kendra's SDK requires `IndexId`, so set it in an environment variable. I paid attention to the following three points.

- Kendra only supports English, so the query is translated as well.
- Someone else is often looking for the same document, so results are posted with `in_channel`.
- Since many members are English speakers, an excerpt of the translated document that matched is displayed as well.
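The handler above always base64-decodes the request body. A slightly more defensive variant, sketched here, checks the `isBase64Encoded` flag of the event before decoding. The helper name `parse_slash_command` is hypothetical, and the event shape is assumed to be the standard Lambda proxy integration format.

```python
import base64
import urllib.parse


def parse_slash_command(event):
    """Extract the form fields Slack sends for a slash command.

    Slack posts the command as application/x-www-form-urlencoded data.
    Returns a dict mapping each field to its first value.
    """
    body = event.get('body') or ''
    if event.get('isBase64Encoded'):
        # API Gateway may deliver the body base64-encoded.
        body = base64.b64decode(body).decode('utf-8')
    params = urllib.parse.parse_qs(body)
    return {key: values[0] for key, values in params.items()}
```

Because `parse_qs` drops fields with empty values by default, a command sent without any text yields a dict with no `text` key, which matches the check in the handler.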

Result

With Kendra + Translate, you can now search with natural-language queries like this. [Screenshot: Slack] For the question "What is ABEJA's skill map?", the first answer is the skill map currently being put together, and the second is the article on the recently published Technology Stack (/tech-stack-201911), so it seems to be working nicely. Part of the document is blurred, as you would expect, but I hope you get the feel of it.

By the way, you can of course also search in English. [Screenshot: Slack] Kendra may return several hits from different places in the same document, so how to handle this needs to be decided based on what you want the command to achieve.
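One simple policy for the duplicate hits mentioned above is to keep only the highest-ranked hit per document. The following is a minimal sketch, assuming the `ResultItems` list is ordered by relevance and each item carries a `DocumentId` as in the handler; the function name `top_unique_documents` is made up for this example.

```python
def top_unique_documents(result_items, limit=3):
    """Return up to `limit` items, keeping only the first hit for each DocumentId."""
    seen = set()
    unique = []
    for item in result_items:
        if len(unique) >= limit:
            break
        doc_id = item['DocumentId']
        if doc_id in seen:
            # A higher-ranked excerpt from this document was already kept.
            continue
        seen.add(doc_id)
        unique.append(item)
    return unique
```

In the handler, this would replace the `response['ResultItems'][:3]` slice. Whether to drop the extra excerpts or merge them into one message is a design choice that depends on how the command is used.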

The reaction within the company has also been good, which makes me very happy.

Summary

I tried out Amazon Kendra right after its announcement at re:Invent 2019 and built a Slack command for searching in-house documents. The waiting times were long and a mysterious error took a while to resolve, but overall Kendra was easy to use, and being able to search with natural-language queries makes it a very good service. This time, for reasons of time and cost, I only used a subset of our documents. Still, it is exciting to imagine a world where all kinds of information are gathered and searchable from Slack, where people learn from seeing the results of others' searches, and where search history makes the flow of information visible and helps structure it. I hope Japanese will be supported soon.

By the way, note that an Internal Server Error may be returned if you put an unexpected document, such as a Japanese one, into Kendra. That said, it gave me the courage to release things even when they still have errors.
