[PYTHON] Getting familiar (or at least trying to) with the spaCy pipeline

Caution

The following is the content of a Jupyter notebook, downloaded as markdown and pasted here.

The working environment is available in this public repository: https://github.com/booink/spacy-trial1/tree/master

This is just a stream of notes from a trial run: I only spent about 30 minutes at the keyboard and pasted the markdown output as-is, so don't expect a polished read.


https://spacy.io/usage/processing-pipelines

The following passage is copied verbatim from the page above.

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

In short: when you pass text to nlp, it returns the tokenized text as a Doc object. The nlp object has a mechanism called the pipeline, and the Doc is handed from one component to the next like a bucket relay, each component returning the processed Doc. The default pipeline contains a tagger, a parser and an entity recognizer (ner).
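The bucket relay described above can be sketched in plain Python. This is a toy illustration only, not spaCy's implementation: the Doc class, tokenize, make_component and the nlp function below are all hypothetical stand-ins that borrow spaCy's names to show the flow.

```python
class Doc:
    """Toy stand-in for spaCy's Doc: holds tokens plus annotations."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.annotations = {}


def tokenize(text):
    # Whitespace splitting only; spaCy's real tokenizer is far more elaborate.
    return Doc(text.split())


def make_component(name):
    # Each component receives a Doc, records something on it, and returns it.
    def component(doc):
        doc.annotations[name] = "processed by toy {}".format(name)
        return doc

    return component


# The pipeline is an ordered list of (name, component) pairs.
pipeline = [(name, make_component(name)) for name in ("tagger", "parser", "ner")]


def nlp(text):
    doc = tokenize(text)             # step 1: text -> Doc
    for name, component in pipeline:
        doc = component(doc)         # step 2: hand the Doc to the next component
    return doc


doc = nlp("This is a text")
print(doc.tokens)             # ['This', 'is', 'a', 'text']
print(list(doc.annotations))  # ['tagger', 'parser', 'ner']
```

The only point here is the shape of the flow: one Doc is created by the tokenizer, and every component takes that Doc and returns it for the next one.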

I see. Let's look at the type of the doc object.

import spacy

nlp = spacy.load("en")
doc = nlp("This is a text")
type(doc)
---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

<ipython-input-4-69cc80a89d2d> in <module>
      1 import spacy
      2 
----> 3 nlp = spacy.load("en")
      4 doc = nlp("This is a text")
      5 type(doc)


/usr/local/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)
     28     if depr_path not in (True, False, None):
     29         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 30     return util.load_model(name, **overrides)
     31 
     32 


/usr/local/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)
    167     elif hasattr(name, "exists"):  # Path or Path-like to model data
    168         return load_model_from_path(name, **overrides)
--> 169     raise IOError(Errors.E050.format(name=name))
    170 
    171 


OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

spaCy complained that there is no 'en' model.
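Since the error message mentions "a Python package" as one of the places spaCy looks, one way to check beforehand whether a model is installed is to ask Python whether a package with that name can be imported. The sketch below uses only the standard library; is_model_installed is a hypothetical helper name, and this check is an approximation, not the lookup spaCy itself performs.

```python
import importlib.util


def is_model_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Prints True or False depending on what is installed in the environment.
print(is_model_installed("en_core_web_sm"))
```

If this prints False, the model package has not been installed into the interpreter that is running the notebook.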

https://spacy.io/usage/models

Let's follow the Quick Start on that page.

!python -m spacy download en_core_web_sm
Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
      |████████████████████████████████| 12.0 MB 476 kB/s eta 0:00:01
Requirement already satisfied: spacy>=2.2.2 in /usr/local/lib/python3.7/site-packages (from en_core_web_sm==2.2.5) (2.2.4)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.0.3)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (4.44.1)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.18.2)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.23.0)
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (7.4.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.2)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.6.0)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.1.3)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (46.0.0)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.4.1)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2.9)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.25.8)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2019.11.28)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /usr/local/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.6.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.1.0)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.5-py3-none-any.whl size=12011738 sha256=4e741a4ef6924b14806dc4789ff4156bf93b98c79d33f5959516f6a04c73f4bb
  Stored in directory: /tmp/pip-ephem-wheel-cache-yazrb305/wheels/51/19/da/a3885266a3c241aff0ad2eb674ae058fd34a4870fef1c0a5a0
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

The download succeeded. Let's try running the code.

import spacy
nlp = spacy.load("en_core_web_sm")
---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

<ipython-input-6-14d257ed08ca> in <module>
      1 import spacy
----> 2 nlp = spacy.load("en_core_web_sm")


/usr/local/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)
     28     if depr_path not in (True, False, None):
     29         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 30     return util.load_model(name, **overrides)
     31 
     32 


/usr/local/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)
    167     elif hasattr(name, "exists"):  # Path or Path-like to model data
    168         return load_model_from_path(name, **overrides)
--> 169     raise IOError(Errors.E050.format(name=name))
    170 
    171 


OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Hmm. Does it not work when installed from within the Jupyter notebook? I'll add the download step to the Dockerfile and rebuild the image.
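One possible explanation, not verified against this environment, is that the already-running kernel keeps a cached view of the installed packages, so a model installed mid-session is not found until the kernel is restarted (rebuilding the image has the same effect). As a sketch of a workaround, the model package could be imported directly by name and its load() function called. load_model_package is a hypothetical helper, and whether it would have avoided the error here is an assumption.

```python
import importlib


def load_model_package(name):
    """Import an installed model package by name and call its load()."""
    # Ask the import system to re-scan for packages installed after startup.
    importlib.invalidate_caches()
    module = importlib.import_module(name)
    return module.load()


# Usage, assuming the en_core_web_sm package is installed:
# nlp = load_model_package("en_core_web_sm")
```

Restarting the kernel after the download is the simpler thing to try first; the helper is only useful when a restart is inconvenient.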

I rebuilt it. Let's try again.

import spacy
nlp = spacy.load("en_core_web_sm")

No error this time. Success? Let's look at the type of doc.

doc = nlp("This is a text")
type(doc)
spacy.tokens.doc.Doc

It's spacy.tokens.doc.Doc, I see. So what is set in the pipeline?

for p in nlp.pipeline:
    print(p)
('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fc3c78613d0>)
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fc39292ede0>)
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fc3928c5360>)

Hmm. tagger, parser and ner, sure enough.

By the way, according to the Quick Start on the models page, it can apparently also be written like this:

import en_core_web_sm  # Besides passing the model name as a string, the model can apparently be imported as a module
nlp = en_core_web_sm.load()  # load() with no arguments seems to return the nlp object
doc = nlp("This is a text")
print(doc)

for p in nlp.pipeline:
    print(p)
This is a text
('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fc3903805d0>)
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fc3928bad70>)
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fc3928ba9f0>)

So what is the nlp object?

type(nlp)
spacy.lang.en.English

Hmm
