Twitter doesn't provide an API around bookmarks, so I used Selenium to get all the bookmarks.
Environment
CentOS Linux release 7.7.1908
Python 3.6.8
google-chrome
Install by referring to this article.
ChromeDriver
Be careful about which version you install. If the version does not match your Chrome, it will not work properly. Check the ChromeDriver site and pip install with the version specified.
Example
# google-chrome --version
Google Chrome 78.0.3904.108
# pip install chromedriver-binary==78.0.3904.105
# pip show chromedriver-binary
Name: chromedriver-binary
Version: 78.0.3904.105.0
# chromedriver-path
/usr/lib/python3.6/site-packages/chromedriver_binary
(This path is needed later for executable_path.)
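The version comparison above can also be done in code before installing anything. The following is only a sketch, not part of the original setup: `parse_version` and `same_chrome_build` are hypothetical helpers, and the rule that the first three components (major.minor.build) must agree while the last may differ is an assumption based on the example above. The second driver version is just an illustrative mismatching value.

```python
import re


def parse_version(text):
    """Return the dotted version number found in text as a tuple of ints."""
    match = re.search(r'\d+(?:\.\d+)+', text)
    if match is None:
        raise ValueError('no version number in: %r' % text)
    return tuple(int(part) for part in match.group(0).split('.'))


def same_chrome_build(chrome_version_output, driver_version):
    """True when major.minor.build agree (assumption: the patch part may differ)."""
    chrome = parse_version(chrome_version_output)[:3]
    driver = parse_version(driver_version)[:3]
    return chrome == driver


# Values from the example session above: the builds match.
print(same_chrome_build('Google Chrome 78.0.3904.108', '78.0.3904.105.0'))
# Illustrative mismatch: a driver from a different major version.
print(same_chrome_build('Google Chrome 78.0.3904.108', '79.0.3945.36.0'))
```

The first string is what `google-chrome --version` prints, and the second is the version you intend to pass to pip.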
Selenium
# pip install selenium
I've added a lot of options, but --headless and --no-sandbox may be enough.
In my environment I got an exception without --headless.
For executable_path, specify the chromedriver binary inside the directory printed by chromedriver-path above.
I have saved a screenshot for confirmation.
test.py
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-infobars')
options.add_argument('--disable-extensions')
options.add_argument('--disable-gpu')
options.add_argument('--headless')
# In Selenium 3.x, "options=" replaces the deprecated "chrome_options=" keyword.
driver = webdriver.Chrome(options=options, executable_path='/usr/lib/python3.6/site-packages/chromedriver_binary/chromedriver')
driver.get('https://www.google.com/')
time.sleep(3.0)
driver.save_screenshot('screenshot.png')
driver.close()
driver.quit()
import os
options.add_argument('--user-data-dir=' + os.path.abspath('profile'))
If you specify the profile directory with the option above, cookies are saved there, so you do not have to log in every time the program runs. With that option set, run the Twitter login from the "Bot to reply from twitter login with Python Selenium" article just once. Even when the login seems to have succeeded, it may have stopped at the email address confirmation page, so it is a good idea to run it in Python's interactive mode and check the screenshot and the current URL.
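One way to make sure the login run and the later scraping runs use exactly the same profile is to build the argument list in one place. This is a minimal sketch under that assumption; `chrome_arguments` is a hypothetical helper, not something from Selenium or from the original script.

```python
import os


def chrome_arguments(profile_dir='profile', headless=True):
    """Build the Chrome arguments used in this article as a list of strings."""
    args = [
        '--no-sandbox',
        # An absolute path keeps the profile stable regardless of the working directory.
        '--user-data-dir=' + os.path.abspath(profile_dir),
    ]
    if headless:
        args.append('--headless')
    return args


for argument in chrome_arguments():
    print(argument)
    # With Selenium installed, each one would be passed on like this:
    # options.add_argument(argument)
```

Calling it with `headless=False` gives the same profile without the --headless flag, in case you want to watch the login in a visible browser on a machine with a display.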
In the Twitter timeline and in bookmarks, tweet elements are added and removed dynamically as you scroll.
The following program obtains the URLs of all tweets by repeating these steps:
Get the URLs of the loaded tweets → Scroll so that the bottom tweet is at the top of the page → Wait for the page to load more tweets
def get_list():
    driver.get('https://twitter.com/i/bookmarks')
    time.sleep(10.0)  # wait for the first batch of tweets to render
    status_urls = []
    container_xpath = '//*[@id="react-root"]/div/div/div/main/div/div/div/div[1]/div/div[2]/section/div/div/div'
    container = driver.find_element_by_xpath(container_xpath)  # a vertically long element that contains multiple tweets
    end_count = 0
    while True:
        divs = container.find_elements_by_xpath('./div')
        for div in divs:
            # An element without any img holds no tweet: we may have reached the end.
            if len(div.find_elements_by_tag_name('img')) == 0:
                end_count += 1
                break
            status_url = div.find_element_by_xpath('./div/article/div/div[2]/div[2]/div[1]/div[1]/a').get_attribute('href')
            status_urls.append(status_url)
        if end_count > 8:
            break
        # NOTE: divs[-1] raises IndexError if divs is empty; the length should be checked.
        driver.execute_script('arguments[0].scrollIntoView();', divs[-1])
        print(len(status_urls))
        time.sleep(15.0)  # wait for the next batch of tweets to load
    # The same tweet is collected more than once, so deduplicate via a set (order is not preserved).
    return list(set(status_urls))
When you scroll back to the oldest bookmark, the bottom element no longer contains a tweet, so you can tell that you have reached the end by checking whether div.find_elements_by_tag_name('img') returns an empty list.
I did not mind how long it takes as long as every bookmark is retrieved, so the code is deliberately redundant: it sleeps for a long time and requires the end condition to be met several times (end_count) before stopping.
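The collect → scroll → wait loop can be separated from Selenium so that it is easy to try out. The sketch below is a variation, not the code used above: it stops after several rounds in which no new URL appears (instead of checking for img elements), and it keeps the first-seen order rather than going through a set. `collect_until_idle` and `FakeFeed` are hypothetical names, and the fake feed only imitates a page that shows a small window of items at a time.

```python
def collect_until_idle(read_visible, scroll_down, max_idle_rounds=3):
    """Collect items from a dynamically loaded list until nothing new appears.

    read_visible: callable returning the items currently rendered.
    scroll_down:  callable that scrolls and waits for the next items.
    """
    seen = set()
    ordered = []
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        found_new = False
        for item in read_visible():
            if item not in seen:
                seen.add(item)
                ordered.append(item)
                found_new = True
        # Reset the counter whenever a round produced something new.
        idle_rounds = 0 if found_new else idle_rounds + 1
        scroll_down()
    return ordered


class FakeFeed:
    """Stand-in for the bookmarks page: shows a 3-item window over 7 URLs."""

    def __init__(self):
        self.urls = ['https://example.com/status/%d' % i for i in range(7)]
        self.top = 0

    def read_visible(self):
        return self.urls[self.top:self.top + 3]

    def scroll_down(self):
        # Like scrollIntoView on the last element: the bottom item becomes the top one.
        self.top = min(self.top + 2, len(self.urls) - 1)


feed = FakeFeed()
urls = collect_until_idle(feed.read_visible, feed.scroll_down)
print(len(urls))
```

With a real driver, `read_visible` would return the href values found in the container and `scroll_down` would call scrollIntoView and then sleep, as in get_list().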
- If you do not check the version when installing ChromeDriver, you can easily get stuck for a long time.
- When you load a page with Selenium, you can enter values and click just as you normally would in a browser, which is very convenient.
- Note that the DOM structure of the HTML may change, after which you may no longer be able to access the elements.
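Regarding the last point, one possible way to depend less on the exact DOM layout is to read the href of every link inside the container and keep only those that look like tweet permalinks. The sketch below only shows the filtering step. `pick_status_urls` is a hypothetical helper, the user/status/id URL pattern is an assumption about how tweet links look, and the sample values are made up.

```python
import re

# Assumed shape of a tweet permalink: https://twitter.com/<user>/status/<id>
STATUS_URL = re.compile(r'^https://twitter\.com/[^/]+/status/\d+$')


def pick_status_urls(hrefs):
    """Keep only hrefs that look like tweet permalinks, in order, without duplicates."""
    picked = []
    for href in hrefs:
        if href and STATUS_URL.match(href) and href not in picked:
            picked.append(href)
    return picked


# Made-up hrefs of the kind a tweet element might contain.
sample = [
    'https://twitter.com/example_user',
    'https://twitter.com/example_user/status/1234567890',
    'https://twitter.com/example_user/status/1234567890/photo/1',
    'https://twitter.com/example_user/status/1234567890',
    None,
]
print(pick_status_urls(sample))
```

With a driver, the hrefs could come from calling get_attribute('href') on each `a` element found under the container.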
If you find something wrong, please comment.
Until running Selenium + Python on CentOS7 - Qiita
If you want to keep your site logged in the next time you run Selenium
Bot to reply from twitter login with Python Selenium - Qiita