Do you use Google Colaboratory? It's the best. Google Colaboratory is a Jupyter notebook environment that runs in your browser, and on top of that, GPU and TPU machines can be used for free. It's the best.
However, there are two rules on usage time (described later), and they are a common stumbling block that wears Colaboratory users down.
**"When I wake up in the morning, the connection has been cut off without my noticing ..."**
**"It was reset (forced disconnection) after 12 hours, so I have to start it all over again."**
**"I'd rather not keep a script running locally just to avoid the 90-minute problem."**
I think everyone has run into this kind of trouble.
Many Colaboratory users share their workarounds for these issues.
Summary of articles by pioneers
Following the pioneers, I would like to contribute too.
In this article, by using Selenium in Colaboratory, the above problems will be solved as follows.
- Automatically save/load data and execute files
  - You can run a program that takes more than 12 hours without worrying about the previously unavoidable 12-hour reset.
- The running file itself accesses its own page at regular intervals
  - The 90-minute session does not expire
  - No need to keep a script running on the local PC

- Avoids the 90-minute problem with Colaboratory files alone
- Uses Selenium in Colaboratory and automatically avoids the 12-hour problem by running fileA → fileB → fileA
- Avoids [authentication](https://qiita.com/Ningensei848/items/a7daa6ee4ef692a3d65e#%E7%8F%BE%E5%9C%A8%E3%83%96%E3%83%81%E5%BD%93%E3%81%9F%E3%81%A3%E3%81%A6%E3%81%84%E3%82%8B%E5%A3%81) when opening another Colaboratory file from Selenium in Colaboratory, by using a User Profile created in Chrome (bad Japanese ...)
This article turned out long, so please skip around as needed. Looking at [What you are doing](https://qiita.com/shoyaokayama/items/8869b7dda6deff017046#%E3%82%84%E3%81%A3%E3%81%A6%E3%81%84%E3%82%8B%E3%81%93%E3%81%A8) ・ [Verification result](https://qiita.com/shoyaokayama/items/8869b7dda6deff017046#%E6%A4%9C%E8%A8%BC%E7%B5%90%E6%9E%9C) ・ [Operation check video](https://qiita.com/shoyaokayama/items/8869b7dda6deff017046#%E5%8B%95%E4%BD%9C%E7%A2%BA%E8%AA%8D%E5%8B%95%E7%94%BB) first may make it easier to get the picture.
1.
using Selenium in FileB. Then, to avoid session interruption for both File A and File B, perform the following processing using Selenium.
- Google Colaboratory
- Ubuntu 18.04 (an Ubuntu environment that can be operated through a GUI, built with Docker for Mac; see the article below)
Install Docker to create a GUI-operable Linux (Ubuntu) container
As mentioned above, Google Colaboratory has rules regarding usage time.
12-hour & 90-minute rule
In Google Colaboratory, if either of the following conditions is met, the entire instance state is reset even if a program is running.
- [12-hour rule] 12 hours have passed since the new instance was started.
- [90-minute rule] 90 minutes have passed since the notebook session expired.

◆ [Google Colaboratory 90-minute session disconnection measures [automatic connection]](https://qiita.com/enmaru/items/2770df602dd7778d4ce6#12%E6%99%82%E9%96%9390%E5%88%86%E3%83%AB%E3%83%BC%E3%83%AB)
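To make the two rules concrete, here is a minimal sketch of the decision a keep-alive loop has to make. The function name, the 11-hour and 60-minute safety margins, and the returned labels are assumptions chosen for illustration; Colaboratory itself does not expose anything like this.

```python
def next_action(uptime_hours, idle_minutes, handover_after=11.0, refresh_after=60.0):
    """Decide what a keep-alive loop should do next (hypothetical helper).

    uptime_hours: hours since the instance was started (12-hour rule)
    idle_minutes: minutes since the notebook page was last accessed (90-minute rule)
    """
    # Hand over to another notebook before the 12-hour reset hits
    if uptime_hours > handover_after:
        return "handover"
    # Re-access the notebook page well before the 90-minute limit
    if idle_minutes >= refresh_after:
        return "refresh"
    return "wait"


print(next_action(uptime_hours=3.0, idle_minutes=60.0))
print(next_action(uptime_hours=11.5, idle_minutes=0.0))
```

The margins are deliberately shorter than the real limits so that a slow page load or a missed cycle does not push the notebook past either rule.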
This article shows how to solve the 12-hour rule and 90-minute rule within Colaboratory.
Also, it seems difficult to run Selenium on Google Colaboratory (← it seems to be difficult to start the WebDriver). If it were possible, you could start Selenium in Google Colaboratory_fileA to launch Google Colaboratory_fileB, and before the 12 hours expire, start Selenium in Google Colaboratory_fileB to launch Google Colaboratory_fileC (... repeated endlessly). That is what I wrote, but orz
◆ Think about how to run Google Colaboratory regularly
I used the article above as a reference when I set up periodic execution of Colaboratory. This time, we will actually build the silly program it describes.
The explanation will be given in the above order. If you actually try it, you can reverse 1 and 2.
You need to log in to Google to automatically execute Google Colaboratory files using Selenium. However, according to [this article](https://qiita.com/Ningensei848/items/a7daa6ee4ef692a3d65e#%E7%8F%BE%E5%9C%A8%E3%83%96%E3%83%81%E5%BD%93%E3%81%9F%E3%81%A3%E3%81%A6%E3%81%84%E3%82%8B%E5%A3%81), it seems difficult to get past reCAPTCHA's distorted character string from inside a cell of a Colaboratory file.
As a solution, I decided to use the **User Profile** that Chrome stores for each user. I figured that if I could load it when Selenium runs in Colaboratory, I would not need to authenticate.
A User Profile is a group of files that holds the account information associated with each Chrome user. Bookmarks, cookies (IDs, passwords), and so on are stored inside.
Once you log in to Google or Colaboratory, your login data is stored in it. You therefore need to log in manually once, but after that you can skip the whole authentication process by loading the profile in Selenium.
The machine that runs Google Colaboratory is Ubuntu 18.04 (as of 2019.11.11), so the safe choice is to use a User Profile whose Google and Colaboratory login data was saved in **the same environment**.
Initially, I was trying to load my Mac's User Profile into Selenium inside the Colaboratory, but I was asked to authenticate again.
After that, I built an Ubuntu environment that can be operated by GUI with Docker for Mac, opened Chrome there, and logged in to Google and Colaboratory. And I succeeded in using the created User Profile in Colaboratory.
Install Docker to create a GUI-operable Linux (Ubuntu) container
There are several ways to get a User Profile. I created a new one using the second method.
(1) Start Chrome, type chrome://version in the address bar, and press Enter.
(2) The profile directory path of the user you are using is shown under Profile Path.
◆ About Chrome User Profile
Running the code below creates a User Profile in the current directory. After that, just log in to Google and Colaboratory manually!
When you quit the program, you should find that a User Profile has been created. Next, let's load this User Profile into Colaboratory via a private GitHub repository.
import os
from selenium import webdriver
import chromedriver_binary  # importing this adds the bundled chromedriver to PATH

# Create the User Profile directly under the current directory
userdata_dir = 'UserData'
os.makedirs(userdata_dir, exist_ok=True)

options = webdriver.ChromeOptions()
# User Profile path settings
options.add_argument('--user-data-dir=' + userdata_dir)
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://google.com")
    # Keep the browser open so there is time to log in manually
    input("Log in to Google and Colaboratory, then press Enter to quit: ")
finally:
    # Always close the browser, even if an error occurred
    driver.quit()
The reason why the Mac User Profile didn't work is still unclear ... Is it an OS problem or a ChromeDriver problem? I would be grateful if someone could tell me the details.
Since the User Profile is a **chunk of personal information**, how you load it into Colaboratory is your own responsibility. I chose GitHub because authentication with the alternatives was a hassle and because GitHub allows unlimited private repositories.
- Create a private repository and push your User Profile to it.
- Take this opportunity to turn on two-step verification.
I will omit how to create a private repository. I referred to the following article for how to get an access token.
Create a personal access token for the command line
The following command is used when cloning a private repository with two-step verification.
"git clone https://{Access token}:x-oauth-basic@github.com/{username}/{Repository name}.git"
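As a small sketch of how such a clone URL could be assembled without pasting the token into the notebook, the helper below reads it from an environment variable. The `x-oauth-basic` placeholder password is a commonly used convention for token-based HTTPS clones and is an assumption here, as are the function name, the `GITHUB_TOKEN` variable name, and the example user and repository names.

```python
import os
from urllib.parse import quote


def build_clone_url(token, username, repository):
    """Build an HTTPS clone URL that embeds a personal access token (hypothetical helper)."""
    # Percent-encode every part so unusual characters cannot break the URL
    return "https://{}:x-oauth-basic@github.com/{}/{}.git".format(
        quote(token, safe=""), quote(username, safe=""), quote(repository, safe="")
    )


# GITHUB_TOKEN is an assumed variable name; a dummy value is used when it is not set
token = os.environ.get("GITHUB_TOKEN", "dummy-token")
clone_url = build_clone_url(token, "your-name", "your-profile-repo")
print(clone_url.replace(token, "***"))  # avoid printing the real token
```

Keeping the token out of the notebook source matters here because the notebook itself may be shared, while the repository it unlocks contains login cookies.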
Now you are ready to load your User Profile into Colaboratory!
Thank you very much for reading this far. It's quite tiring to write.
Now it's time to implement it!!!
For the setup for using Selenium in Colaboratory, I referred to the following article.
Poem on Selenium on Colaboratory and Time Limit Avoidance Techniques
The files created in this article are as follows.
GitPython
I also referred to the following for the GitPython implementation.
- Poem on Selenium on Colaboratory and Time Limit Avoidance Techniques
- GitPython documentation
I implemented the clone with GitPython as follows.
import os
import git  # GitPython

%cd /content
repository_path = "{Repository name}"
# Clone only if the directory does not exist yet
if not os.path.isdir(repository_path):
    git.Git().clone("https://{Access token}:x-oauth-basic@github.com/{username}/" + repository_path + ".git")
# Move to the cloned directory (my-selenium-profile in my case)
%cd {repository_path}
The process up to the push looks like this.
class SeleniumColaboratory():
    def __init__(self, mode="2"):
        # Set current directory to path
        self.path = os.getcwd()
        self.store_path = self.path + "/elapsed_time.txt"

    # ~~~~~~~~~~~~~~~~~~~~~~~~ abridged ~~~~~~~~~~~~~~~~~~~~~~~~

    def git_push(self):
        try:
            # Open the repository in the current (cloned) directory
            repo = git.Repo.init()
            repo.index.add([self.store_path])
            repo.index.commit("add elapsed_time.txt")
            origin = repo.remote(name="origin")
            origin.push()
            return "Success"
        except Exception:
            return "Error"

    def main(self):
        # Push to GitHub (training data and intermediate products)
        result = self.git_push()
To load `UserData_Ubuntu` in the directory cloned from GitHub into Selenium, do the following:
# User Profile path
userdata_dir = "UserData_Ubuntu"
options = webdriver.ChromeOptions()
# Load User Profile
options.add_argument("--user-data-dir=" + userdata_dir)
The whole Selenium setup process looks like this.
class SeleniumColaboratory():
    def __init__(self, mode="2"):
        # Set current directory to path
        self.path = os.getcwd()
        self.store_path = self.path + "/elapsed_time.txt"
        # User Profile path
        userdata_dir = "UserData_Ubuntu"
        options = webdriver.ChromeOptions()
        # Load User Profile
        options.add_argument("--user-data-dir=" + userdata_dir)
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        # open it, go to a website, and get results
        self.driver = webdriver.Chrome("chromedriver", options=options)
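The Chrome flags used in the setup can also be collected in one place, which makes it easy to switch headless mode off when debugging on a machine with a display. This is only a sketch: `build_chrome_arguments` is a hypothetical helper, not part of Selenium, and it returns plain strings that would then be passed to `options.add_argument` one by one.

```python
def build_chrome_arguments(userdata_dir, headless=True):
    """Return the Chrome command-line flags used in this article (hypothetical helper)."""
    args = ["--user-data-dir=" + userdata_dir]  # load the saved User Profile
    if headless:
        args.append("--headless")  # no display is available on Colaboratory
    # Flags commonly needed when Chrome runs as root inside a container
    args += ["--no-sandbox", "--disable-dev-shm-usage"]
    return args


for arg in build_chrome_arguments("UserData_Ubuntu"):
    print(arg)  # in the real setup: options.add_argument(arg)
```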
Accessing another Colaboratory file is implemented in the summary code below as follows.
def access_another_colabo(self, path):
    # Open a new tab and switch to it
    self.driver.execute_script("window.open()")  # make new tab
    self.driver.switch_to.window(self.driver.window_handles[-1])  # switch new tab
    self.driver.get(path)
    # Wait until the page body is present (a locator is required for the wait to work)
    WebDriverWait(self.driver, 60).until(EC.presence_of_all_elements_located((By.TAG_NAME, "body")))
    # Check if you can access the specified URL (check whether you were redirected to the authentication page)
    cur_url = self.driver.current_url
    print(cur_url)
    self.click_change_runtime()
    # Run all cells
    self.click_runall()
Keeping the session from being cut off by periodically accessing my own file is implemented in the summary code below as follows.
def auto_access(self, path):
    try:
        # Open a new tab and switch to it
        self.driver.execute_script("window.open()")  # make new tab
        self.driver.switch_to.window(self.driver.window_handles[-1])  # switch new tab
        self.driver.get(path)
        # Wait until the page body is present (a locator is required for the wait to work)
        WebDriverWait(self.driver, 60).until(EC.presence_of_all_elements_located((By.TAG_NAME, "body")))
        # Check if you can access the specified URL (check whether you were redirected to the authentication page)
        cur_url = self.driver.current_url
        print(cur_url)
        self.click_change_runtime()
        time.sleep(30)
    except urllib3.exceptions.NewConnectionError as e:
        print(str(e))
        print("********Portal New connection timed out***********")
        time.sleep(30)
    except urllib3.exceptions.MaxRetryError as e:
        print(str(e))
        print("*********Portal Max tries exceeded************")
        time.sleep(30)
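`auto_access` catches connection errors and waits 30 seconds before giving up on that cycle. If you want it to try again within the same cycle, a generic retry wrapper is one option. The sketch below is an assumption on my part rather than something the summary code does; `call_with_retry` and its parameters are hypothetical names, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time


def call_with_retry(func, retries=3, delay=30, exceptions=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions (hypothetical helper)."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions as e:
            last_error = e
            print("attempt {} failed: {}".format(attempt, e))
            # Wait before the next attempt, but not after the last one
            if attempt < retries:
                sleep(delay)
    raise last_error


# Usage example with a function that fails once and then succeeds
state = {"calls": 0}

def sometimes_fails():
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionError("temporary failure")
    return "connected"

print(call_with_retry(sometimes_fails, delay=0, exceptions=(ConnectionError,)))
```

In the class, the wrapped function would be the page access itself and `exceptions` would list the two `urllib3` exception types caught above.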
You also need to run the other Colaboratory file, not just open it. The execution order is as follows.
Therefore, you have to click the elements underlined in red below.
Clicking "Run all cells" is implemented in the summary code below as follows.
def click_runall(self):
    # Click the Runtime menu
    select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, "runtime-menu-button")))
    select_dropdown.click()
    time.sleep(1)
    # Click "Run all cells"
    select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, ":1s")))
    select_dropdown.click()
To change the runtime type, first set the instance type and then select it with Selenium. In the summary code below, this is implemented as follows.
def set_mode(self, mode):
    if mode == "None":
        self.mode = "1"
    elif mode == "GPU":
        self.mode = "2"
    elif mode == "TPU":
        self.mode = "3"  # was a comparison (==), so TPU was never selected
    else:
        self.mode = "1"

def click_change_runtime(self):
    # Click the Runtime menu
    select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, "runtime-menu-button")))
    select_dropdown.click()
    # Click "Change runtime type"
    select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, ":23")))
    select_dropdown.click()
    # Click the dropdown
    select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, "input-4")))
    select_dropdown.click()
    # Pause, because the next click may otherwise happen too early
    time.sleep(1)
    # I want to avoid XPATH
    # Select the runtime type
    select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='accelerator']/paper-item[" + self.mode + "]")))
    select_dropdown.click()
    # Click the save button
    select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, "ok")))
    select_dropdown.click()
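The mapping from a runtime name to the position of its `paper-item` in the dropdown can also be written as a dictionary lookup, which keeps all the cases in one table. A minimal sketch, assuming the same positions as `set_mode` (None is 1, GPU is 2, TPU is 3); the names `ACCELERATOR_INDEX` and `accelerator_index` are hypothetical.

```python
# Position of each accelerator in the runtime-type dropdown, as used by set_mode
ACCELERATOR_INDEX = {"None": "1", "GPU": "2", "TPU": "3"}


def accelerator_index(mode):
    """Return the dropdown position for a runtime name, falling back to no accelerator."""
    return ACCELERATOR_INDEX.get(mode, "1")


print(accelerator_index("GPU"))
print(accelerator_index("unknown"))
```

The returned value is a string because it is concatenated directly into the XPath expression.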
The full summary code is below.
import os
# set options to be headless, ..
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import subprocess as sp
from datetime import datetime, timedelta, timezone
import urllib3
import time
import git  # GitPython, used by git_push


class SeleniumColaboratory():
    def __init__(self, mode="2"):
        userdata_dir = "UserData_Ubuntu"
        # Set current directory to path
        self.path = os.getcwd()
        self.store_path = self.path + "/elapsed_time.txt"
        options = webdriver.ChromeOptions()
        options.add_argument("--user-data-dir=" + userdata_dir)
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        # open it, go to a website, and get results
        self.driver = webdriver.Chrome("chromedriver", options=options)
        # FileB path
        self.access_path = "https://colab.research.google.com/drive/1s3LeBakro8zDX_FGAZfhmgnNZktjwJtD"
        # FileA path for auto-access
        self.access_path_2 = "https://colab.research.google.com/drive/1wT6ZpKLNr24R5qEfH-0jotifhBVrfA9S"
        self.mode = str(mode)
        jtime = self.get_japan_time()
        initial_text = "------------------ " + jtime.strftime("%Y-%m-%d") + " ------------------\n"
        self.append_time_file(initial_text)

    def click_runall(self):
        # Click the Runtime menu
        select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, "runtime-menu-button")))
        select_dropdown.click()
        time.sleep(1)
        # Click "Run all cells"
        select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, ":1s")))
        select_dropdown.click()

    def click_change_runtime(self):
        # Click the Runtime menu
        select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, "runtime-menu-button")))
        select_dropdown.click()
        # Click "Change runtime type"
        select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, ":23")))
        select_dropdown.click()
        # Click the dropdown
        select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, "input-4")))
        select_dropdown.click()
        # Pause, because the next click may otherwise happen too early
        time.sleep(1)
        # I want to avoid XPATH
        # Select the runtime type
        select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='accelerator']/paper-item[" + self.mode + "]")))
        select_dropdown.click()
        # Click the save button
        select_dropdown = WebDriverWait(self.driver, 20).until(EC.element_to_be_clickable((By.ID, "ok")))
        select_dropdown.click()

    def check_time(self):
        # Returns the time since the instance was launched
        res = sp.Popen(["cat", "/proc/uptime"], stdout=sp.PIPE)
        # The unit is hours
        use_time = float(sp.check_output(["awk", "{print $1 /60 /60 }"], stdin=res.stdout).decode().replace("\n", ""))
        return use_time

    def append_time_file(self, txt):
        with open(self.store_path, mode='a') as f:
            f.write(txt)

    def access_another_colabo(self, path):
        # Open a new tab and switch to it
        self.driver.execute_script("window.open()")  # make new tab
        self.driver.switch_to.window(self.driver.window_handles[-1])  # switch new tab
        self.driver.get(path)
        # Wait until the page body is present (a locator is required for the wait to work)
        WebDriverWait(self.driver, 60).until(EC.presence_of_all_elements_located((By.TAG_NAME, "body")))
        # Check if you can access the specified URL (check whether you were redirected to the authentication page)
        cur_url = self.driver.current_url
        print(cur_url)
        self.click_change_runtime()
        # Run all cells
        self.click_runall()

    def auto_access(self, path):
        try:
            # Open a new tab and switch to it
            self.driver.execute_script("window.open()")  # make new tab
            self.driver.switch_to.window(self.driver.window_handles[-1])  # switch new tab
            self.driver.get(path)
            # Wait until the page body is present (a locator is required for the wait to work)
            WebDriverWait(self.driver, 60).until(EC.presence_of_all_elements_located((By.TAG_NAME, "body")))
            # Check if you can access the specified URL (check whether you were redirected to the authentication page)
            cur_url = self.driver.current_url
            print(cur_url)
            self.click_change_runtime()
            time.sleep(30)
        except urllib3.exceptions.NewConnectionError as e:
            print(str(e))
            print("********Portal New connection timed out***********")
            time.sleep(30)
        except urllib3.exceptions.MaxRetryError as e:
            print(str(e))
            print("*********Portal Max tries exceeded************")
            time.sleep(30)

    def set_mode(self, mode):
        if mode == "None":
            self.mode = "1"
        elif mode == "GPU":
            self.mode = "2"
        elif mode == "TPU":
            self.mode = "3"  # was a comparison (==), so TPU was never selected
        else:
            self.mode = "1"

    def git_push(self):
        try:
            # Open the repository in the current (cloned) directory
            repo = git.Repo.init()
            repo.index.add([self.store_path])
            repo.index.commit("add elapsed_time.txt")
            origin = repo.remote(name="origin")
            origin.push()
            return "Success"
        except Exception:
            return "Error"

    def get_japan_time(self):
        # Time zone generation
        JST = timezone(timedelta(hours=+9), 'JST')
        # GOOD: the time zone is specified, which is also faster
        return datetime.now(JST)

    def main(self):
        while True:
            elapsed_time = self.check_time()
            print(elapsed_time)
            jtime = self.get_japan_time()
            append_text = "File A : " + str(elapsed_time) + " Hour (" + str(jtime.strftime("%H:%M:%S")) + ")\n"
            self.append_time_file(append_text)
            # After 11 hours
            if elapsed_time > 11:
                # Push to GitHub
                result = self.git_push()
                self.set_mode("None")
                # Open Colaboratory file B
                self.access_another_colabo(self.access_path)
                self.set_mode("GPU")
                self.auto_access(self.access_path_2)
                break
            else:
                self.set_mode("GPU")
                self.auto_access(self.access_path_2)
            # Check every 60 minutes
            time.sleep(3600)
        print("Done.")
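`check_time` shells out to `cat` and `awk` to read the uptime. The same value can be obtained in pure Python, since the first field of `/proc/uptime` on Linux is the number of seconds since boot. A minimal sketch; the function names are hypothetical, and the parsing is split out so it can be tried on any machine.

```python
def uptime_hours(uptime_text):
    """Convert the contents of /proc/uptime to hours since boot."""
    # The file looks like "41337.72 82110.15": uptime seconds, then idle seconds
    seconds = float(uptime_text.split()[0])
    return seconds / 3600


def read_uptime_hours(path="/proc/uptime"):
    """Read the uptime of the current Linux machine in hours."""
    with open(path) as f:
        return uptime_hours(f.read())


print(uptime_hours("3600.00 7200.00\n"))
```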
- The shortcut key didn't work
We performed the following processing and verified that the 90-minute and 12-hour problem-solving in this article is useful.
elapsed_time.txt
------------------ {Japan date and time} ------------------
{file name}:{Instance startup time}({Japan time})
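To check the log programmatically rather than by eye, each entry can be parsed with a regular expression. The sketch below assumes the exact line shape that `main` writes (`File A : 1.38115 Hour (13:36:19)`); `LINE_PATTERN` and `parse_log_line` are hypothetical names, and date header lines are simply skipped.

```python
import re

# Matches lines such as "File B : 0.148172 Hour (11:02:53)"
LINE_PATTERN = re.compile(
    r"^(?P<name>.+?) : (?P<hours>\d+(?:\.\d+)?) Hour \((?P<clock>\d{2}:\d{2}:\d{2})\)$"
)


def parse_log_line(line):
    """Return (file name, elapsed hours, clock time), or None for non-entry lines."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # e.g. the dashed date header
    return match.group("name"), float(match.group("hours")), match.group("clock")


sample = [
    "------------------ 2019-11-15 ------------------",
    "File B : 0.148172 Hour (11:02:53)",
    "File A : 11.4827 Hour (23:42:25)",
]
for line in sample:
    print(parse_log_line(line))
```

With the entries parsed, it is straightforward to check that each file's elapsed time keeps increasing and that File A hands over shortly after the 11-hour mark.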
1.
above2.
until the instance is finished

I started with FileB, and if I check the `elapsed_time.txt` pushed to GitHub, I can see that it has gone around twice without any problem.
elapsed_time.txt
------------------ 2019-11-15 ------------------
File B : 0.148172 Hour (11:02:53)
File B : 0.659194 Hour (11:33:32)
File B : 1.16914 Hour (12:04:08)
File B : 1.67921 Hour (12:34:45)
------------------ 2019-11-15 ------------------
File A : 0.371208 Hour (12:35:44)
File A : 1.38115 Hour (13:36:19)
File A : 2.39099 Hour (14:36:55)
File A : 3.40073 Hour (15:37:30)
File A : 4.41073 Hour (16:38:06)
File A : 5.42056 Hour (17:38:41)
File A : 6.43057 Hour (18:39:17)
File A : 7.4407 Hour (19:39:54)
File A : 8.45084 Hour (20:40:30)
File A : 9.46208 Hour (21:41:11)
File A : 10.4723 Hour (22:41:48)
File A : 11.4827 Hour (23:42:25)
------------------ 2019-11-15 ------------------
File B : 0.070075 Hour (23:43:33)
File B : 0.579919 Hour (00:14:09)
File B : 1.09002 Hour (00:44:45)
File B : 1.59973 Hour (01:15:20)
------------------ 2019-11-16 ------------------
File A : 0.0631278 Hour (01:16:18)
File A : 1.07302 Hour (02:16:53)
File A : 2.08286 Hour (03:17:29)
File A : 3.09267 Hour (04:18:04)
File A : 4.10244 Hour (05:18:39)
File A : 5.11231 Hour (06:19:15)
File A : 6.1223 Hour (07:19:51)
File A : 7.13236 Hour (08:20:27)
File A : 8.14249 Hour (09:21:04)
File A : 9.15265 Hour (10:21:40)
File A : 10.163 Hour (11:22:17)
File A : 11.1734 Hour (12:22:55)
You can confirm that the process is continued without the session being cut off.
elapsed_time.txt
------------------ 2019-11-14 ------------------
File A : 0.0933806 Hour (21:44:54)
File A : 1.10327 Hour (22:45:30)
File A : 2.11307 Hour (23:46:05)
File A : 3.12278 Hour (00:46:40)
File A : 4.13269 Hour (01:47:16)
File A : 5.14361 Hour (02:47:55)
File A : 6.15773 Hour (03:48:46)
File A : 7.16782 Hour (04:49:22)
File A : 8.17792 Hour (05:49:58)
File A : 9.18807 Hour (06:50:35)
File A : 10.1983 Hour (07:51:12)
File A : 11.2088 Hour (08:51:50)
The session started from the browser on the local PC has already been reset because 12 hours have passed, but the new file is running in the browser opened by Selenium in Colaboratory, so you can confirm that it is running in a separate session.
colab_selen.ipynb(FileA) https://colab.research.google.com/drive/1wT6ZpKLNr24R5qEfH-0jotifhBVrfA9S
return_selen.ipynb(FileB) https://colab.research.google.com/drive/1s3LeBakro8zDX_FGAZfhmgnNZktjwJtD
By using Selenium, I was able to show how to avoid the 90-minute and 12-hour problems with Colaboratory files alone. So even processing that takes more than 12 hours (e.g. training a deep learning model) can be run automatically!
I have left as many comments as possible in the source code, but I think there are some parts that are lacking in explanation.
I wanted to write it more carefully, but I'm spending a little too much time writing this article, so I'd like to go back before the professor gets angry! See you again!
- Introduce Docker and create a GUI-operable Linux (Ubuntu) container
- Python3 datetime becomes faster just by specifying the time zone
- [Create a personal access token for the command line](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line)
- Open link in another tab with selenium
- I tried (and didn't) use Selenium on Colaboratory - Qiita
- Prepare Selenium execution environment in Google Colaboratory (failure) - Qiita
- Selenium can be used in Colaboratory: Easy scraping of pages generated by javascript - Qiita
- Google Colaboratory 90-minute session disconnection measures [automatic connection] - Qiita
- I want to avoid the 90-minute rule of Google Colaboratory. - Qiita
- Think about how to run Google Colaboratory regularly - Qiita
- Infrastructure development for machine learning with Google Colaboratory - Qiita
- Poem on Selenium on Colaboratory and Time Limit Avoidance Techniques