[PYTHON] Build a data analysis environment that links GitHub authentication and Django with JupyterHub

Introduction

JupyterHub makes Jupyter Notebook, a tool for working with Python and other languages from a web browser, available to multiple users. Information on logging in with GitHub authentication and connecting cleanly to Django's shell environment exists only in fragments and is not organized in one place, so I have summarized it here. We use AWS ELB / EC2 / RDS, with nginx as the web server.

I think this setup is useful in cases like the following: you have already built a Django environment that connects to test data and want to work with it quickly from the web, or you want to separate the environment for each user but do not want to build a complicated authentication mechanism yourself.

When completed, you will be able to quickly do things like the following in your web browser.

(Screenshot: 2017-02-19 15.47.47.png)

Flow of operation

Roughly speaking, it looks like the following.

  1. Access http://jupyter.example.jp/ (It is better to add BASIC authentication etc.)
  2. nginx handles the request and proxies it to JupyterHub running on 127.0.0.1:8000
  3. Authenticate with GitHub to start using Jupyter Notebook
  4. Jupyter Notebook runs with a Kernel connected to the Django process, so you can access the DB immediately.

Remaining issue: SSL environment has not been built

There is probably a way to do it, but with an https connection, in a configuration of SSL → nginx → JupyterHub (Tornado) running in the virtualenv, attaching fails when the Django Kernel is selected in a notebook. Access over http works fine, so some additional setting seems to be needed; perhaps JupyterHub has to be told that the internal connection is http. Because there is access from outside, such as the GitHub integration, it should be https, so it is a bit disappointing that I did not get that far.

Environment construction procedure

Overall structure

In the environment I use, the production DB is masked daily and stored in a separate RDS instance. In the test / staging environment, Django can operate on this masked DB. The goal of this document is to make that Django environment, connected to the masked RDS, usable through JupyterHub.

(Figure: overall JupyterHub configuration)

Environmental assumptions

It is assumed that the Django environment is built under /var/www/jupyter.example.jp/. The parts of the directory structure that are likely to be relevant are shown below.

/var/www/jupyter.example.jp/
├── README.md
├── jupyter  # Application code (manage.py is directly under this directory)
├── requirements.txt
└── virtualenv

We assume that Django is already installed in the virtualenv environment, together with django-extensions so that the models can be loaded using shell_plus.

The virtualenv environment is created as follows.

cd /var/www/jupyter.example.jp/
virtualenv --prompt "(jupyter)" virtualenv

Also, place the JupyterHub configuration file under /etc/jupyterhub.

Create user for JupyterHub

JupyterHub runs as the jupyterhub user.

sudo useradd jupyterhub
sudo usermod -a -G shadow jupyterhub
sudo mkdir /etc/jupyterhub
sudo chown jupyterhub /etc/jupyterhub

Install JupyterHub

Follow the instructions in jupyterhub/jupyterhub: Multi-user server for Jupyter notebooks.

sudo apt-get install npm nodejs-legacy
sudo npm install -g configurable-http-proxy
# Install with pip inside the virtualenv environment
pip install ipython jupyterhub

(This way of installing npm and node feels dated.)

Creating a JupyterHub configuration file

jupyterhub --generate-config

Open /etc/jupyterhub/jupyterhub_config.py and edit the following.

Near L.137
c.JupyterHub.ip = '127.0.0.1'

Near L.106
c.JupyterHub.cookie_secret_file = '/etc/jupyterhub/jupyterhub_cookie_secret'
c.JupyterHub.db_url = '/etc/jupyterhub/jupyterhub.sqlite'

Reference: http://jupyterhub.readthedocs.io/en/latest/config-examples.html

Login with your GitHub account

Configure JupyterHub so that users log in with GitHub via OAuth.

pip install oauthenticator

Add the following around L.58 of jupyterhub_config.py. The values are read from environment variables, so os must be imported.

import os

c.JupyterHub.authenticator_class = 'oauthenticator.GitHubOAuthenticator'
c.GitHubOAuthenticator.oauth_callback_url = os.environ['OAUTH_CALLBACK_URL']
c.GitHubOAuthenticator.client_id = os.environ['GITHUB_CLIENT_ID']
c.GitHubOAuthenticator.client_secret = os.environ['GITHUB_CLIENT_SECRET']

Reference: https://github.com/jupyterhub/oauthenticator
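
Because the configuration reads os.environ[...] directly, a missing variable only shows up as a KeyError when JupyterHub starts. As a minimal sketch (the helper name require_env is hypothetical and not part of JupyterHub or oauthenticator), the three variables could be checked together so that every missing name is reported at once:

```python
import os

# The three variables read by the GitHubOAuthenticator settings above.
REQUIRED = ("OAUTH_CALLBACK_URL", "GITHUB_CLIENT_ID", "GITHUB_CLIENT_SECRET")


def require_env(names, environ=None):
    """Return {name: value} for all names, or raise one error listing every missing one.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


if __name__ == "__main__":
    # Demo with a hand-written mapping instead of the real environment.
    demo_env = {
        "OAUTH_CALLBACK_URL": "http://jupyter.example.jp/hub/oauth_callback",
        "GITHUB_CLIENT_ID": "xxx",
        "GITHUB_CLIENT_SECRET": "xxx",
    }
    print(sorted(require_env(REQUIRED, demo_env)))
```

In jupyterhub_config.py, the returned dictionary could then be used in place of the individual os.environ lookups.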

Spawner

I don't have a deep understanding of what a Spawner is, but it seems that JupyterHub needs one to be defined for process management.

Reference: http://jupyterhub.readthedocs.io/en/latest/spawners.html

pip install git+https://github.com/jupyter/sudospawner

Add the following near L.220 of jupyterhub_config.py.

c.JupyterHub.confirm_no_ssl = True
c.JupyterHub.spawner_class = 'sudospawner.SudoSpawner'
c.Spawner.notebook_dir = '~/notebooks'

c.SudoSpawner.sudospawner_path = '/var/www/jupyter.example.jp/virtualenv/bin/sudospawner'

Add the following to the end with sudo visudo.

## Jupyterhub
# comma-separated whitelist of users that can spawn single-user servers
Runas_Alias JUPYTER_USERS = <GitHub user names to use, comma separated>

# the command(s) the Hub can run on behalf of the above users without needing a password
# the exact pa$
Cmnd_Alias JUPYTER_CMD = /var/www/jupyter.example.jp/virtualenv/bin/sudospawner

# actually give the Hub user permission to run the above command on behalf
# of the above users without$
jupyterhub ALL=(JUPYTER_USERS) NOPASSWD:JUPYTER_CMD

Reference: http://qiita.com/mt08/items/301f9fb93d01e78bda47
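
When the list of allowed users changes often, the sudoers lines can be generated rather than typed by hand. The following is only a sketch: render_sudoers is a hypothetical helper, alice and bob are made-up user names, and the output is meant to be reviewed and pasted in through visudo, not written to /etc/sudoers directly.

```python
def render_sudoers(users, sudospawner_path, hub_user="jupyterhub"):
    """Build the sudoers fragment shown above for the given Linux/GitHub user names.

    Hypothetical helper; it only formats text and never touches /etc/sudoers.
    """
    if not users:
        raise ValueError("at least one user is required")
    lines = [
        "## Jupyterhub",
        "Runas_Alias JUPYTER_USERS = " + ", ".join(users),
        "Cmnd_Alias JUPYTER_CMD = " + sudospawner_path,
        f"{hub_user} ALL=(JUPYTER_USERS) NOPASSWD:JUPYTER_CMD",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_sudoers(
            ["alice", "bob"],  # made-up user names
            "/var/www/jupyter.example.jp/virtualenv/bin/sudospawner",
        )
    )
```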

Start Jupyter Hub

With the settings up to this point, you can start the JupyterHub server. Create a startup script and start it.

sudo -u jupyterhub vi /etc/jupyterhub/launch_notebook.sh
#!/bin/bash

export OAUTH_CALLBACK_URL=http://jupyter.example.jp/hub/oauth_callback
export GITHUB_CLIENT_ID=xxx
export GITHUB_CLIENT_SECRET=xxx

source /var/www/jupyter.example.jp/virtualenv/bin/activate
jupyterhub -f /etc/jupyterhub/jupyterhub_config.py

CLIENT_ID and CLIENT_SECRET required for GitHub integration can be obtained from the following. https://github.com/settings/applications/new

With this in place, run the following.

sudo -u jupyterhub /etc/jupyterhub/launch_notebook.sh

JupyterHub should then start at http://127.0.0.1:8000/.
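
Before moving on to nginx, it can help to confirm that something is actually listening on that address. A minimal sketch using only the standard library (is_listening is a hypothetical helper name; the host and port are the ones used in this article):

```python
import socket


def is_listening(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, and so on.
        return False


if __name__ == "__main__":
    state = "up" if is_listening() else "not reachable"
    print(f"JupyterHub on 127.0.0.1:8000 is {state}")
```

This only checks that the port accepts connections; it does not verify that the process behind it is JupyterHub.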

nginx settings

Once JupyterHub is running, the next step is to route access from the ELB to it. nginx accepts requests on port 80 and proxies those that JupyterHub should handle to port 8000.

sudo vi /etc/nginx/nginx.conf
http {
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }
    ...

Create a conf file under /etc/nginx/sites-enabled/ that describes the settings for jupyter.example.jp. (Example: jupyter.example.jp.conf)

server {
    listen 80;
    server_name jupyter.example.jp;

    # Enable the following setting once an HTTPS environment has been set up
    # add_header Strict-Transport-Security max-age=15768000;

    # Managing literal requests to the JupyterHub front end
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # Managing WebHook/Socket requests between hub user servers and external proxy
    location ~* /(api/kernels/[^/]+/(channels|iopub|shell|stdin)|terminals/websocket)/? {
        proxy_pass http://127.0.0.1:8000;

        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # WebSocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
    }
}
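
To see which requests fall into the second location block, the same regular expression can be tried against a few sample paths. This is a sketch under assumptions: it uses Python's re module with IGNORECASE to stand in for nginx's case-insensitive `~*` match (nginx itself uses PCRE), and the sample paths and the helper name needs_websocket_headers are made up for illustration.

```python
import re

# Same pattern as the second `location ~* ...` block above.
WEBSOCKET_LOCATION = re.compile(
    r"/(api/kernels/[^/]+/(channels|iopub|shell|stdin)|terminals/websocket)/?",
    re.IGNORECASE,
)


def needs_websocket_headers(path):
    """Return True if `path` would be handled by the WebSocket location block."""
    # The pattern has no ^ anchor, so it may match anywhere in the path.
    return WEBSOCKET_LOCATION.search(path) is not None


if __name__ == "__main__":
    samples = [
        "/hub/login",
        "/user/alice/tree",
        "/user/alice/api/kernels/1234/channels",
        "/user/alice/terminals/websocket/1",
    ]
    for sample in samples:
        print(sample, needs_websocket_headers(sample))
```

Only the kernel channel and terminal paths match, so only those requests get the Upgrade and Connection headers.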

Allow login via GitHub

If a Linux user with the same name as the GitHub user name already exists, there is no problem; otherwise, you need to create one.

USER=<GitHub user name to use>
useradd -d /home/${USER} -m ${USER}
mkdir /home/${USER}/notebooks
chown ${USER} /home/${USER}/notebooks
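
For several users, the same three commands can be produced from a list of names. The sketch below only builds and prints the commands as a dry run; user_setup_commands is a hypothetical helper, alice is a made-up name, and actually executing the result (as root) is left to the reader.

```python
import shlex


def user_setup_commands(username, home_root="/home"):
    """Return the useradd / mkdir / chown commands from the text as argument lists."""
    if not username or "/" in username or username.startswith("-"):
        raise ValueError(f"unsafe user name: {username!r}")
    home = f"{home_root}/{username}"
    notebooks = f"{home}/notebooks"
    return [
        ["useradd", "-d", home, "-m", username],
        ["mkdir", notebooks],
        ["chown", username, notebooks],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in user_setup_commands("alice"):
        print(" ".join(shlex.quote(part) for part in command))
```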

Kernel settings

Jupyter Notebook has the concept of a Kernel, which lets you specify which Python environment to use at runtime. By default, this is the plain Python environment used at startup, but if you set up a Kernel that uses Django's shell_plus, notebooks start already connected to the DB with the models imported, which is very convenient.

If JupyterHub is installed in your virtualenv environment, you can create the kernel configuration file with the following command. It prints a WARNING, but the file is still created.

% jupyter kernelspec install-self --user
[InstallNativeKernelSpec] WARNING | `jupyter kernelspec install-self` is DEPRECATED as of 4.0. You probably want `ipython kernel install` to install the IPython kernelspec.
[InstallNativeKernelSpec] Installed kernelspec python3 in /home/your_user/.local/share/jupyter/kernels/python3

If it is installed globally rather than in the virtualenv environment, the following may work instead.

% python -m ipykernel install
Installed kernelspec python3 in /usr/local/share/jupyter/kernels/python3

Then, move the created configuration files.

% cd /usr/local/share
% sudo mv /home/your_user/.local/share/jupyter ./
% sudo chmod 775 jupyter
% sudo chown -R root:root jupyter
% cd /usr/local/share/jupyter/kernels

% ls /usr/local/share/jupyter/kernels
python3

% sudo mv python3 django

% ls /usr/local/share/jupyter/kernels/django
kernel.json  logo-32x32.png  logo-64x64.png
% sudo vi django/kernel.json

The file should initially look like the first block below; change it to match the second block.

{
 "language": "python",
 "argv": [
  "/var/www/jupyter.example.jp/virtualenv/bin/python3.4",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "display_name": "Python 3"
}

{
  "display_name": "Django",
  "language": "python",
  "codemirror_mode": {
    "version": 3,
    "name": "ipython"
  },
  "argv": [
    "/var/www/jupyter.example.jp/virtualenv/bin/python",
    "/var/www/jupyter.example.jp/jupyter/manage.py",
    "shell_plus",
    "--settings=jupyter.settings",
    "--kernel",
    "--connection-file",
    "{connection_file}"
  ]
}

* For --settings=jupyter.settings, specify the path of the Django settings module used by manage.py.
* To use shell_plus, django-extensions must be installed and django_extensions must be listed in INSTALLED_APPS.
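
The Django kernel.json can also be generated instead of edited by hand, which avoids JSON syntax slips. A minimal sketch, assuming the paths and settings module used in this article; django_kernel_spec is a hypothetical helper name.

```python
import json


def django_kernel_spec(python_path, manage_py, settings_module, display_name="Django"):
    """Build the kernel.json content for a shell_plus based Django kernel."""
    return {
        "display_name": display_name,
        "language": "python",
        "codemirror_mode": {"version": 3, "name": "ipython"},
        "argv": [
            python_path,
            manage_py,
            "shell_plus",
            f"--settings={settings_module}",
            "--kernel",
            "--connection-file",
            # Literal placeholder that Jupyter fills in when it launches the kernel.
            "{connection_file}",
        ],
    }


if __name__ == "__main__":
    spec = django_kernel_spec(
        "/var/www/jupyter.example.jp/virtualenv/bin/python",
        "/var/www/jupyter.example.jp/jupyter/manage.py",
        "jupyter.settings",
    )
    print(json.dumps(spec, indent=2))
```

The printed JSON can be saved as /usr/local/share/jupyter/kernels/django/kernel.json.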

Reference 1: https://ipython.org/ipython-doc/3/development/kernels.html
Reference 2: http://stackoverflow.com/questions/31088080/django-extensions-shell-plus-kernel-specify-connection-file (must be "--connection-file" as mentioned in the comments)
Reference 3: http://stackoverflow.com/questions/39007571/running-jupyter-with-multiple-python-and-ipython-paths

Also, installing https://github.com/Cadair/jupyter_environment_kernels may make it easier to switch kernels, but I haven't tried it.

Manage Jupyter Hub with supervisord

You can now start JupyterHub by running launch_notebook.sh, but starting it by hand every time the server restarts is tedious, so we let supervisord start it automatically.

sudo vi /etc/supervisor/conf.d/
[program:notebook]
command=/etc/jupyterhub/launch_notebook.sh
directory=/etc/jupyterhub/
autostart=true
autorestart=true
stopgroup=true
startretries=3
exitcodes=0,2
stopsignal=TERM
user=jupyterhub
group=jupyterhub
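
Before reloading supervisord, the program section can be sanity-checked with Python's configparser, since the file uses INI syntax. This is only a sketch: check_program_section is a hypothetical helper, and the list of keys treated as required is my own assumption, not a supervisord rule.

```python
import configparser

# A shortened copy of the section above, used as sample input.
SUPERVISOR_CONF = """\
[program:notebook]
command=/etc/jupyterhub/launch_notebook.sh
directory=/etc/jupyterhub/
autostart=true
autorestart=true
user=jupyterhub
"""

# Keys this sketch insists on (an assumption made for illustration).
REQUIRED_KEYS = ("command", "user", "autostart")


def check_program_section(text, program="notebook", required=REQUIRED_KEYS):
    """Return a list of problems found in the [program:<name>] section (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = f"program:{program}"
    if not parser.has_section(section):
        return [f"[{section}] section not found"]
    return [f"missing key: {key}" for key in required if key not in parser[section]]


if __name__ == "__main__":
    print(check_program_section(SUPERVISOR_CONF) or "looks fine")
```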

Load the configuration file and start it.

supervisorctl reread
supervisorctl reload
supervisorctl start notebook

It will now start automatically, and the log is written to /var/log/supervisor/notebook-xxx.log.

Appendix

This section summarizes the problems I ran into while building the environment and the links I referred to.

connection_upgrade issues

If you simply forward port 80 access to JupyterHub without any further configuration, you get the following error.

2017/02/03 17:20:44 [emerg] 16297#16297: unknown "connection_upgrade" variable

I solved it by referring to http://mogile.web.fc2.com/nginx/http/websocket.html. The following needs to be added to /etc/nginx/nginx.conf.

http {
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }
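
The map block defines the $connection_upgrade variable from the request's Upgrade header. As a rough illustration in Python (a model of the mapping only, not of how nginx evaluates it; the function name is made up):

```python
def connection_upgrade(http_upgrade):
    """Mimic the nginx map: '' (no Upgrade header) -> 'close', anything else -> 'upgrade'."""
    # nginx exposes a missing header as an empty string, which hits the '' entry;
    # every other value falls through to `default upgrade`.
    return "close" if http_upgrade == "" else "upgrade"


if __name__ == "__main__":
    print(connection_upgrade("websocket"))  # WebSocket handshake request
    print(connection_upgrade(""))           # ordinary HTTP request
```

The second location block passes this value on as the Connection header, so WebSocket handshakes are upgraded while ordinary requests are not.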

Error when connecting to https

If you access over https through the ELB, you can reach JupyterHub, but when you select the Kernel, the following error appears in the supervisord log.

[I 2017-02-05 21:43:24.410 JupyterHub log:100] 200 GET /hub/api/authorizations/cookie/jupyter-hub-token-xxx/[secret]([email protected]) 14.89ms
21:44:26.703 - error: [ConfigProxy] Proxy error:  Error: socket hang up
    at createHangUpError (http.js:1472:15)
    at Socket.socketCloseListener (http.js:1522:23)
    at Socket.EventEmitter.emit (events.js:95:17)
    at TCP.close (net.js:466:12)

The log mentions a cookie, and it seems this could be fixed by changing a setting somewhere, but I could not work it out and gave up. I will try again someday.

Install various libraries for data analysis

This is covered elsewhere, so it could be omitted, but to use matplotlib and similar tools, install the libraries needed for data analysis as follows.

sudo apt-get install -y libpng12-dev libjpeg8-dev libfreetype6-dev libxft-dev
pip install numpy pandas matplotlib seaborn scikit-learn
* Install these into the virtualenv environment and run supervisorctl restart notebook to make them available.
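
To confirm from a notebook cell that the libraries are visible to the kernel, a small check such as the following can be used. It is a sketch with a hypothetical helper name; note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names of the packages installed above (scikit-learn imports as sklearn).
ANALYSIS_PACKAGES = ("numpy", "pandas", "matplotlib", "seaborn", "sklearn")


def missing_packages(names):
    """Return the top-level import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ANALYSIS_PACKAGES)
    print("missing:", ", ".join(missing) if missing else "none")
```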

Other reference materials

- Talks about how cool Jupyter notebook is: "Powerful notepad for modern engineers: Jupyter notebook recommendation"
- Somewhat opinionated, but you can feel the love for Jupyter notebook: "Recommended coding environment Jupyter Notebook for data scientists"
