[PYTHON] I tried sentence generation with GPT-2

Last time I tried to generate sentences with Flair's TextRegressor, but it failed, so this time I will try generating documents with GPT-2.

When I started writing this, I noticed there was a godsend of a library called gpt-2-simple, so I am borrowing it. Here we go.

Try the sample first

Since gpt-2-simple is based on OpenAI's GPT-2, it does not work on the TensorFlow 2.x series. So if you want to use a Docker image, use tensorflow/tensorflow:1.15.2-py3 (for CPU).
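For the CPU image, starting the container looks like this (mirroring the GPU command below):

docker run -it tensorflow/tensorflow:1.15.2-py3 bash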

When using a GPU
When using Docker images (although it seems [Docker on Windows is not supported](https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#is-microsoft-windows-supported)...)
docker run --runtime=nvidia -it tensorflow/tensorflow:1.15.2-gpu-py3 bash

When building an environment quickly with Conda

conda create -n gpt-2 python=3.6
conda activate gpt-2
pip install tensorflow-gpu==1.15

So, install it with pip3 install gpt-2-simple, then train on Shakespeare's text and generate output as shown in the library's Usage. A minimal script along those lines is sketched below; the result of training for only one epoch follows.
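A minimal sketch following the gpt-2-simple README (the corpus file name shakespeare.txt is an assumption; the number of training iterations is passed as steps):

import gpt_2_simple as gpt2

# Download the smallest (124M-parameter) pretrained GPT-2 model
gpt2.download_gpt2(model_name="124M")

sess = gpt2.start_tf_sess()
# Fine-tune on a local plain-text corpus (file name assumed)
gpt2.finetune(sess, "shakespeare.txt", model_name="124M", steps=1)
# Generate text from the fine-tuned checkpoint
gpt2.generate(sess)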

Result of training for only 1 epoch
>>> gpt2.generate(sess)
Cells in the body of a whale are often used as a means to induce sleep in some whales. But in the case of a whale, this particular type of whale is known as a cusps whale because it is usually about the size of a human hair.

The, or humpback whale, is one of the largest cusps whales in the world. It is an extremely large, highly muscled, and highly territorial mammal, with a very large mouth and, in some sections, white, skinned head.

...

Is the output supposed to be about whales? It reads more like a story, but at least this confirms that we can train for one epoch and save the model, so let's move on.
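As a side note, reloading the saved checkpoint in a later session should look roughly like this (run1 is gpt-2-simple's default run name; a sketch, not something from the post):

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
# Load the fine-tuned weights saved under checkpoint/run1
gpt2.load_gpt2(sess, run_name="run1")
gpt2.generate(sess)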

Let's train on our own corpus

The example above fine-tunes the 124M model of GPT-2. Let's feed it Japanese text here and check whether fine-tuning works well.
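What I ran is roughly the following sketch, assuming the corpus is a text file of Japanese sentences pre-tokenized with spaces (the file name is hypothetical, and the 200 "epochs" correspond to the steps argument):

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
# Fine-tune on space-separated Japanese text (file name assumed)
gpt2.finetune(sess, "wiki_ja_spaced.txt", model_name="124M", steps=200)
gpt2.generate(sess)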

Example output after feeding in Japanese sentences separated by spaces and training for 200 epochs
Northeast Regional No. 188, which runs through the corridor, departed Union Station (Washington, D.C.) bound for Pennsylvania Station (New York) and had left Philadelphia's 30th Street Station. The train was hauled by an ACS-64 type electric locomotive (No. 601) manufactured a year earlier, pulling 7 passenger cars.

Approximately 11 minutes later, running on the double-track main line, the train entered a 4-degree left curve (radius approximately 440 m) near the intersection of Frankford Avenue and Wheatsheaf Lane in the Port Richmond district.
(Source text: the 2015 Amtrak derailment accident)

The output looks nice!

Deliverables

https://github.com/ochiba0227/gpt2-simple/blob/master/gpt2_simple.py

Impressions

When I tried fine-tuning with the GPT-2 repository alone, it was very rough going and I gave up partway through... That's when I was lucky enough to come across gpt-2-simple. It's amazing to be able to fine-tune GPT-2 and generate documents with so little code. I'm really grateful to the people who build these libraries! Now that I know how to fine-tune, I'd like to train on and play with the kinds of text I personally want to generate.

What I had written at the beginning (before I gave up partway through)

Try the sample first

I cloned the [GPT-2 repository](https://github.com/openai/gpt-2) and installed the packages according to [DEVELOPERS.md](https://github.com/openai/gpt-2/blob/master/DEVELOPERS.md). But right off the bat, I couldn't install `tensorflow 1.12` ...
# pip install tensorflow==
ERROR: Could not find a version that satisfies the requirement tensorflow== (from versions: 2.2.0rc1, 2.2.0rc2)
ERROR: No matching distribution found for tensorflow==

When I installed the latest version for the time being and proceeded, the following error occurred.

# python3 src/generate_unconditional_samples.py | tee /tmp/samples
Traceback (most recent call last):
  File "src/generate_unconditional_samples.py", line 9, in <module>
    import model, sample, encoder
  File "/target/src/model.py", line 3, in <module>
    from tensorflow.contrib.training import HParams
ModuleNotFoundError: No module named 'tensorflow.contrib'

Upon investigation, `tensorflow.contrib` was [removed in TensorFlow 2.x](https://github.com/tensorflow/tensorflow/issues/31350#issuecomment-518749548) ... There was no way around it, so I switched to installing from the Dockerfile. Here, the tensorflow/tensorflow:1.12.0-py3 image is used. Downloading all the models would be heavy, so I modified it to download only the lightest model.

Dockerfile.cpu


FROM tensorflow/tensorflow:1.12.0-py3

ENV LANG=C.UTF-8
RUN mkdir /gpt-2
WORKDIR /gpt-2
ADD . /gpt-2
RUN pip3 install -r requirements.txt
RUN python3 download_model.py 124M

With that edit in place, it runs on the CPU for the time being; the steps look like the following ...

docker build --tag gpt-2 -f Dockerfile.cpu .
docker run -it gpt-2 bash

export PYTHONIOENCODING=UTF-8
python3 src/generate_unconditional_samples.py | tee /tmp/samples

Then I got the error `AttributeError: module 'tensorflow' has no attribute 'sort'`. Upon investigation, it seems that with tensorflow 1.12.0 the code has to be modified to use `tf.contrib.framework.sort` instead ... It apparently works with `tensorflow 1.14.0`, so this time I decided to fix it on the Dockerfile side instead.
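For reference, a small compatibility sketch of the two APIs (my own illustration, not code from the repository):

import tensorflow as tf

# tf.sort does not exist on TF 1.12; there, the equivalent function
# lives under tf.contrib.framework.sort. Newer 1.x releases (such as
# the 1.14 mentioned above) expose tf.sort in core, which is why
# upgrading the base image also fixes the error.
try:
    sort_fn = tf.sort
except AttributeError:
    sort_fn = tf.contrib.framework.sort

values = tf.constant([3.0, 1.0, 2.0])
sorted_values = sort_fn(values)  # -> [1.0, 2.0, 3.0]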

Dockerfile.cpu


# Since I was at it anyway, I used the latest version of the 1.x series
FROM tensorflow/tensorflow:1.15.2-py3

ENV LANG=C.UTF-8
RUN mkdir /gpt-2
WORKDIR /gpt-2
ADD . /gpt-2
RUN pip3 install -r requirements.txt
RUN python3 download_model.py 124M

So, taking another look and running it again, you get fake-article output like the following!

Example output of a fake article
python3 src/generate_unconditional_samples.py | tee /tmp/samples
======================================== SAMPLE 1 ========================================
 — President Donald Trump tweeted on Saturday that he would not do it again in the 2017 budget.

"Of course, and I bet WA will accept my debt — but if a bad story develops, they'll tell me as long as I am cutting deduction for health care," Trump tweeted on December 19.

If a bad story develops, they'll tell me as long as I am reducing deduction for health care. — President Donald Trump (@realDonaldTrump) December 19, 2017  

The first budget request "is building around a debt epidemic for $3.5 trillion," according to CNN. The problem, it turns out, is that Trump would work with 
Republicans to pass a debt-ceiling increase, despite claims that the written framework can't be passed.

The budget would create $11.1 trillion in government debt, according to PPP , Russia, and China – which have agreed on a plan to get rid of regulations on corporate taxes as part of a five-year plan which includes massive cuts to subsidies for growth to deal with the looming financial crisis.

Today's budget contradicts Cliff's upcoming budget agreement, which to...

Even the smallest model can produce something that feels like a real article ... amazing. Now that I've finally gotten the sample running, I'd like to train it next.

Training with GPT-2

To be honest, there is very little documentation out there, so this is quite difficult ... Maybe because it isn't widely used ...

Using a Japanese corpus

Issues 104 and 114 say the corpus is byte-pair encoded. If you want to build your own model, following this person's approach with [SentencePiece](https://github.com/google/sentencepiece) looks like a good bet.
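As a rough sketch of that route (the file names and vocabulary size are my assumptions, not from the post):

import sentencepiece as spm

# Train a BPE model on a raw Japanese corpus (file name assumed)
spm.SentencePieceTrainer.Train(
    '--input=corpus_ja.txt --model_prefix=ja_bpe '
    '--vocab_size=32000 --model_type=bpe')

# Tokenize text into pieces that can be joined with spaces,
# matching the space-separated input used for fine-tuning above
sp = spm.SentencePieceProcessor()
sp.Load('ja_bpe.model')
pieces = sp.EncodeAsPieces('今日はいい天気です。')
print(' '.join(pieces))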

This time, I will borrow a BERT model trained on Japanese Wikipedia that this person has published. Download the trained model ...
