[PYTHON] I tried to use Twitter Scraper on AWS Lambda and it didn't work.

I want to collect tweets about Aikatsu! every day. Doing that by hand is tedious, so I want to automate it with a job that runs daily on AWS Lambda.

First, I registered the twitterscraper library (1.4.0) as a Lambda Layer, wrote the following rough code just to verify that it works, and ran a test.

from twitterscraper import query_tweets
import datetime as dt

def lambda_handler(event, context):
    
    begin_date = dt.date(2020,6,5)
    end_date = dt.date(2020,6,6)
    pool_size = (end_date - begin_date).days
    
    tweets = query_tweets("Aikatsu", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang="ja")
    tuple_tweet=[(tweet.user_id, tweet.tweet_id, tweet.text.replace("\n","\t"), tweet.timestamp) for tweet in tweets]
      
    return True

The test failed on AWS Lambda with the following error, which says that the name "pool" is not defined.

{
  "errorMessage": "name 'pool' is not defined",
  "errorType": "NameError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n    tweets = query_tweets(\"Aikatsu\", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang=\"ja\")\n",
    "  File \"/opt/python/twitterscraper/query.py\", line 246, in query_tweets\n    pool.close()\n"
  ]
}

The same code runs fine in Jupyter Notebook, so I suspected that something about the Lambda Layers setup was wrong. But what is the variable "pool" in the first place? Let's find out.

It turns out to be a variable in twitterscraper's query.py.

query.py


def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang=''):
    no_days = (enddate - begindate).days
    
    if(no_days < 0):
        sys.exit('Begin date must occur before end date.')
    
    if poolsize > no_days:
        # Since we are assigning each pool a range of dates to query,
        # the number of pools should not exceed the number of dates.
        poolsize = no_days
    dateranges = [begindate + dt.timedelta(days=elem) for elem in linspace(0, no_days, poolsize+1)]

    if limit and poolsize:
        limit_per_pool = (limit // poolsize)+1
    else:
        limit_per_pool = None

    queries = ['{} since:{} until:{}'.format(query, since, until)
               for since, until in zip(dateranges[:-1], dateranges[1:])]

    all_tweets = []
    try:
        pool = Pool(poolsize)
        logger.info('queries: {}'.format(queries))
        try:
            for new_tweets in pool.imap_unordered(partial(query_tweets_once, limit=limit_per_pool, lang=lang), queries):
                all_tweets.extend(new_tweets)
                logger.info('Got {} tweets ({} new).'.format(
                    len(all_tweets), len(new_tweets)))
        except KeyboardInterrupt:
            logger.info('Program interrupted by user. Returning all tweets '
                         'gathered so far.')
    finally:
        pool.close()
        pool.join()

    return all_tweets

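The likely culprit is pool = Pool(poolsize). If Pool() raises inside the try clause, pool is never assigned, and the pool.close() call in the finally clause then fails with the NameError, hiding the original exception. So I moved pool = Pool(poolsize) out of the try clause and ran the function on AWS Lambda again. This time the following error was output.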

{
  "errorMessage": "[Errno 38] Function not implemented",
  "errorType": "OSError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n    tweets = query_tweets(\"Aikatsu\", begindate = begin_date, enddate = end_date, poolsize=pool_size, lang=\"ja\")\n",
    "  File \"/opt/python/twitterscraper/query.py\", line 233, in query_tweets\n    pool = Pool(poolsize)\n",
    "  File \"/opt/python/billiard/pool.py\", line 995, in __init__\n    self._setup_queues()\n",
    "  File \"/opt/python/billiard/pool.py\", line 1364, in _setup_queues\n    self._inqueue = self._ctx.SimpleQueue()\n",
    "  File \"/opt/python/billiard/context.py\", line 150, in SimpleQueue\n    return SimpleQueue(ctx=self.get_context())\n",
    "  File \"/opt/python/billiard/queues.py\", line 390, in __init__\n    self._rlock = ctx.Lock()\n",
    "  File \"/opt/python/billiard/context.py\", line 105, in Lock\n    return Lock(ctx=self.get_context())\n",
    "  File \"/opt/python/billiard/synchronize.py\", line 182, in __init__\n    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)\n",
    "  File \"/opt/python/billiard/synchronize.py\", line 71, in __init__\n    sl = self._semlock = _billiard.SemLock(\n"
  ]
}
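
Why the first run reported a NameError instead of this OSError can be reproduced with a small sketch. It is not twitterscraper code: make_pool is a made-up stand-in that fails the way Pool(poolsize) does on Lambda, and the two functions only mimic the shape of the try/finally in query.py.

```python
def make_pool():
    # Hypothetical stand-in for Pool(poolsize) failing on Lambda.
    raise OSError(38, "Function not implemented")


def pool_inside_try():
    try:
        pool = make_pool()  # raises before 'pool' is ever bound
    finally:
        # 'pool' is unbound here, so this line raises a NameError
        # (UnboundLocalError is a subclass of NameError) that
        # replaces the OSError as the reported exception.
        pool.close()


def pool_outside_try():
    pool = make_pool()  # the OSError propagates directly
    try:
        pass  # the real work would go here
    finally:
        pool.close()


if __name__ == "__main__":
    for func in (pool_inside_try, pool_outside_try):
        try:
            func()
        except Exception as exc:
            print(func.__name__, "->", type(exc).__name__)
```

In the first variant the original OSError is still attached to the NameError as its __context__, but a trimmed-down error report like the one Lambda returned shows only the last exception.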

The error is "Function not implemented", and the stack trace shows that it is raised while billiard, the multiprocessing library used by twitterscraper, is creating a semaphore lock (SemLock). It seems that this kind of multiprocessing pool is not available on AWS Lambda.

https://aws.amazon.com/es/blogs/compute/parallel-processing-in-python-with-aws-lambda/
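
A similar check can be written against the standard library. This is only a sketch under two assumptions of mine: that the stdlib multiprocessing module hits the same limitation as billiard on Lambda, and that creating a Lock is enough to trigger it, since a Lock is built on SemLock just like in the stack trace above. The helper name semaphores_available is made up for this example.

```python
import multiprocessing


def semaphores_available():
    """Return True if a multiprocessing lock can be created here."""
    try:
        # Creating a Lock allocates a semaphore under the hood.
        multiprocessing.Lock()
    except (OSError, ImportError):
        # OSError: the platform refuses to create the semaphore.
        # ImportError: Python was built without semaphore support.
        return False
    return True


if __name__ == "__main__":
    print("semaphores available:", semaphores_available())
```

A handler could call such a check once and fall back to single-process execution when it returns False.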

Workaround

The same problem is described in the issues of twitterscraper.

A pull request on GitHub implements a workaround, so I was able to avoid the error by replacing query.py with the version from that pull request. https://github.com/taspinar/twitterscraper/pull/280/commits/685c5b4f601de58c2b2591219a805839011c5faf

The number of processes is passed to the function "query_tweets" through the variable "poolsize", and the patched version skips multiprocessing when it is explicitly set to 0.
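
The idea can be sketched roughly as follows. This is not the code from the pull request: fetch_once is a made-up stand-in for the per-query scraping function, run_queries is a hypothetical name, and the standard library Pool is used in place of billiard.

```python
from functools import partial
from multiprocessing import Pool


def fetch_once(query, limit=None):
    # Hypothetical stand-in for the real per-query scraper;
    # it just returns fake results so the sketch is runnable.
    return ["{} #{}".format(query, i) for i in range(limit or 1)]


def run_queries(queries, poolsize, limit=None):
    worker = partial(fetch_once, limit=limit)
    results = []

    if poolsize <= 0:
        # Sequential path: no Pool is created, so no semaphores
        # are needed and this also works where Pool() fails.
        for query in queries:
            results.extend(worker(query))
        return results

    # Parallel path: create the pool before the try block so that a
    # failure in Pool() is reported as is.
    pool = Pool(poolsize)
    try:
        for chunk in pool.imap_unordered(worker, queries):
            results.extend(chunk)
    finally:
        pool.close()
        pool.join()
    return results


if __name__ == "__main__":
    print(run_queries(["Aikatsu since:2020-06-05 until:2020-06-06"], poolsize=0))
```

With the patched query.py, the Lambda handler then only needs to pass poolsize=0 to query_tweets instead of the number of days.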
