Pre-process the index in Python using Solr's ScriptUpdateProcessor

Reference: http://www.rondhuit.com/scriptupdateprocessor.html

The referenced article uses JavaScript; here I will show how to register fields with Python using Jython. The environment is CentOS 7, with Solr 6 and ManifoldCF 2.4.

** 1. Add an updateRequestProcessorChain to solrconfig.xml **

solrconfig.xml


...
<updateRequestProcessorChain name="script">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">update-script.py</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
...

From Solr 5 onward, solrconfig.xml seems to be located at: var/solr/data/<corename>/conf/solrconfig.xml

** 2. Specify the update chain defined above in the requestHandler **

solrconfig.xml


...
<requestHandler name="/update" class="solr.UpdateRequestHandler">    
  <lst name="defaults">
    <str name="update.chain">script</str>
  </lst>       
</requestHandler>
...

If you crawl PDF or Excel files with ManifoldCF, the documents go through ExtractingRequestHandler, and if you crawl an RDB, they go through DataImportHandler. In either case, add the same update.chain setting shown above to that handler's definition.
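To check the chain without running a ManifoldCF job, one option is to post a single test document to the /update handler by hand. The following is a minimal sketch in Python 3 (run on an ordinary Python, not inside Solr). The core URL, the core name mycore, and the document contents are assumptions for illustration. The update.chain request parameter is redundant if the default from step 2 is already set, but it makes the intent explicit.

```python
import json
import urllib.request


def build_update_request(core_url, document, chain="script"):
    """Build a POST request that sends one JSON document to Solr's /update handler."""
    # update.chain selects the processor chain; commit=true makes the document visible at once.
    url = "%s/update?update.chain=%s&commit=true" % (core_url.rstrip("/"), chain)
    # Solr's JSON update format accepts a list of documents.
    body = json.dumps([document]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core URL and document, for illustration only.
request = build_update_request("http://localhost:8983/solr/mycore", {"id": "test-1"})
```

Passing the resulting request to urllib.request.urlopen would send it to a running Solr instance.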

** 3. Place the Python script in update-script.py **

Placement: var/solr/data/<corename>/conf/update-script.py

To register a field from Python, write doc.setField("field_name", "field_value"). The field_name set by the script must be defined in the schema (the managed-schema file under conf) in advance; if a field with the same name is defined there, Solr maps the value to it. A value that is already in the document can be read with doc.getFieldValue("field_name"). The script is evaluated each time it is run, so you don't have to restart Solr after editing it.
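As a rough sketch of how these calls fit into update-script.py: as far as I know, the script processor looks up a set of named functions (processAdd, processDelete, and so on) and passes the update command to them, so the script below defines all of them and only does real work in processAdd. The field names title and title_normalized are hypothetical, and title_normalized would have to exist in the schema.

```python
def processAdd(cmd):
    # cmd is the add command from Solr; solrDoc is the document about to be indexed.
    doc = cmd.solrDoc
    # "title" is a hypothetical field that is assumed to be already in the document.
    title = doc.getFieldValue("title")
    if title is not None:
        # "title_normalized" is a hypothetical field that must be defined in the schema.
        doc.setField("title_normalized", title.strip().lower())


# The remaining hooks do nothing here, but are defined so the processor can call them.
def processDelete(cmd):
    pass


def processMergeIndexes(cmd):
    pass


def processCommit(cmd):
    pass


def processRollback(cmd):
    pass


def finish():
    pass
```

Keeping the field logic inside processAdd means it can be exercised outside Solr by passing in a stand-in object that has a solrDoc attribute.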

** 4. Install Jython **

I use Jython to run Python on the Java VM. Get it from the link below.

http://www.jython.org/downloads.html

It seems that you need the Standalone version rather than the Installer version. Place the jar under the core's lib directory like this.

var/solr/data/<corename>/lib/jython-standalone-2.X.X.jar

** 5. When linking with a DB **

I needed to get data from a PostgreSQL DB, so I needed a JDBC driver to access the DB from Python.

Get the JDBC driver that matches your JDK and PostgreSQL versions from https://jdbc.postgresql.org/download.html

Place it in var/solr/data/<corename>/lib/postgresql-9.X-XXXX.jar

Example of the import statement and connection

update-script.py


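# zxJDBC is the DB-API module bundled with Jython for talking to JDBC drivers
from com.ziclix.python.sql import zxJDBC

# Connection settings: replace host, port, database name, and credentials with your own
DB_URL = "jdbc:postgresql://yourpostgreshost:port/dbname"
DB_USER = "postgres"
DB_PASS = "password"
DB_DRIVER = "org.postgresql.Driver"
# Arguments: JDBC URL, user, password, driver class name
connection = zxJDBC.connect(DB_URL, DB_USER, DB_PASS, DB_DRIVER)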

Write it like this.
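Once the connection exists, a value looked up in the DB can be registered as a field. The following is a minimal sketch that relies only on the standard DB-API cursor calls, which zxJDBC follows. The table documents, its columns doc_id and category, and the Solr fields id and category are all hypothetical names for illustration.

```python
def lookup_value(connection, sql, key):
    """Return the first column of the first matching row, or None if there is no row."""
    cursor = connection.cursor()
    try:
        # "?" is the placeholder style; the parameters are passed as a list.
        cursor.execute(sql, [key])
        row = cursor.fetchone()
    finally:
        cursor.close()
    if row is None:
        return None
    return row[0]


def add_category_field(doc, connection):
    """Copy a category from the DB into the document being indexed."""
    # "id" and "category" are hypothetical Solr field names.
    doc_id = doc.getFieldValue("id")
    # "documents", "doc_id" and "category" are a hypothetical table and columns.
    category = lookup_value(
        connection, "SELECT category FROM documents WHERE doc_id = ?", doc_id
    )
    if category is not None:
        doc.setField("category", category)
```

From the script's processAdd function, this would be called as add_category_field(cmd.solrDoc, connection).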

** 6. When referencing an external library **

If you want to import an external library, one way is to unzip the Jython standalone jar, place the external library in an appropriate location inside it (usually /Lib), and then jar it up again. There may be a better way...
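Since a jar is a zip archive, the repackaging can also be scripted. The sketch below is a variation on the manual steps, not the same procedure: instead of unzipping and re-jarring, it appends the .py files of a pure-Python package to the jar under Lib/ using the standard zipfile module. It is meant to be run with an ordinary Python on a copy of the jar. The function name and the paths in the usage note are hypothetical.

```python
import os
import zipfile


def add_library_to_jar(jar_path, library_dir):
    """Append the .py files of a package directory to the jar under Lib/<package>/."""
    library_dir = os.path.abspath(library_dir)
    parent = os.path.dirname(library_dir)
    added = []
    # Mode "a" appends new entries and leaves the existing jar contents in place.
    archive = zipfile.ZipFile(jar_path, "a")
    try:
        for root, _dirs, files in os.walk(library_dir):
            for name in sorted(files):
                if not name.endswith(".py"):
                    continue
                full_path = os.path.join(root, name)
                # Entry name inside the jar, e.g. Lib/mylib/__init__.py
                relative = os.path.relpath(full_path, parent)
                arcname = "Lib/" + relative.replace(os.sep, "/")
                archive.write(full_path, arcname)
                added.append(arcname)
    finally:
        archive.close()
    return added
```

For example, add_library_to_jar("jython-standalone-copy.jar", "mylib") would add the files of a local mylib package as Lib/mylib/... entries. Libraries that depend on C extensions will not work under Jython either way.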

That's it. Run a job in ManifoldCF to check the result. If your Python script doesn't work properly, it is a bit tedious, but you may need to proceed by trial and error while looking at the error logs.
