SoFunction
Updated on 2024-11-16

Migrating Solr data online with a Python script: process analysis

Preamble

On a project I ran into a requirement: the data in one online collection had to be migrated into another collection. I searched and read a lot of articles, but most of them use an import method, and I did not find a way to migrate the data online. So I wrote a Python script, and I am sharing it here.

Idea: the collection holds a large amount of data, too much to operate on all at once, so the operation is performed in segments:

First query 1000 documents per segment and process them into JSON data.

Then send the processed JSON data to the destination collection.

Implementation:

I. Query through the HTTP interface

Use the following format to query:

Where: collection_name is the name of the collection you are querying;

rows is the number of rows to return per query, set to 1000 here;

start is the row the query starts from; later in the script this parameter is adjusted to loop through the queries.

http://host:port/solr/collection_name/select?q=*:*&rows=1000&start=0

The query returns data in the format shown below (the original article illustrated it with a screenshot). Inside response there are two key-value pairs we need: numFound (the total number of documents) and docs (a list containing all the JSON documents).

In docs, each document has a _version_ key, which needs to be removed before re-indexing.
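The parsing step described above can be sketched as a small helper: it pulls numFound and docs out of a parsed select response and strips the internal _version_ field. The sample response dict here is a made-up illustration of the structure, not real data.

```python
# Sketch: extract docs from a parsed Solr select response and strip the
# internal _version_ field so the documents can be re-indexed elsewhere.
def extract_docs(response_json):
    """Return (numFound, docs) with _version_ removed from each doc."""
    resp = response_json["response"]
    docs = resp["docs"]
    for doc in docs:
        doc.pop("_version_", None)  # Solr rejects stale _version_ values
    return resp["numFound"], docs

# Made-up sample mirroring the response structure described above
sample = {"response": {"numFound": 2, "docs": [
    {"id": "1", "name": "a", "_version_": 1650000000000000000},
    {"id": "2", "name": "b", "_version_": 1650000000000000001},
]}}
total, docs = extract_docs(sample)
```

Using pop with a default keeps the helper safe even if a document is missing the _version_ key.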


II. Submit data through the HTTP interface

wt=json: submit using the JSON format

http://host:port/solr/collection_name/update?wt=json

The header needs to be set to {"Content-Type": "application/json"}

Submit parameters: with these set, Solr replaces a document if it already exists when indexing. (The parameters can also be appended directly to the URL.)

{"overwrite":"true","commit":"true"}

data_dict is our processed docs data.

Submit the data as: data = {"add": {"doc": data_dict}}
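The submit step above can be sketched as a helper that assembles the URL, headers, parameters, and payload; host, port, and collection_name are placeholders for your own Solr instance, and the actual send (commented out) would use requests.post.

```python
# Sketch: build the pieces of an update request as described above.
# base_url is a placeholder such as "http://host:port/solr/collection_name".
def build_update_request(base_url, doc):
    url = "%s/update?wt=json" % base_url
    headers = {"Content-Type": "application/json"}
    params = {"overwrite": "true", "commit": "true"}
    payload = {"add": {"doc": doc}}
    return url, headers, params, payload

url, headers, params, payload = build_update_request(
    "http://host:port/solr/collection_name", {"id": "1", "name": "a"})
# import requests as r
# r.post(url, json=payload, params=params, headers=headers)  # actual send
```

Keeping the request construction separate from the send makes it easy to inspect or log what will be posted.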

III. The finished script is as follows:

#coding=utf-8
import requests as r
import json
import threading
import time

# Send one document to the destination collection des_url; data_dict is a
# document dictionary with the _version_ key already removed.
def send_data(des_url, data_dict):
    data = {"add": {"doc": data_dict}}
    headers = {"Content-Type": "application/json"}
    params = {"boost": 1.0, "overwrite": "true", "commitWithin": 1000, "commit": "true"}
    url = "%s/update?wt=json" % des_url
    re = r.post(url, json=data, params=params, headers=headers)
    if re.status_code != 200:
        print("Import error.", data)

# Fetch the data page by page and call send_data to send it to the destination
def get_data(des_url, src_url):
    # Define the starting row
    start = 0
    # Get the total number of documents first
    se_data = r.get("%s/select?q=*:*&rows=0&start=%s" % (src_url, start)).text
    se_dict = json.loads(se_data)
    numFound = int(se_dict["response"]["numFound"])
    # while loop, 1000 documents per iteration
    while start < numFound:
        # Define the list that holds the threads
        th_li = []
        # Fetch 1000 documents
        se_data = r.get("%s/select?q=*:*&rows=1000&start=%s" % (src_url, start)).text
        # Convert the fetched data into a dictionary
        se_dict = json.loads(se_data)
        # Get the docs list out of the response
        s_data = se_dict["response"]["docs"]

        # Loop over the documents, delete the _version_ key, and send each one
        # with a threaded call to send_data
        for i in s_data:
            del i["_version_"]
            th = threading.Thread(target=send_data, args=(des_url, i))
            th_li.append(th)

        # Start all threads first, then wait for them, so the sends overlap
        for t in th_li:
            t.start()
        for t in th_li:
            t.join()

        start += 1000
        print(start)

if __name__ == "__main__":
    # Source collection address the data is queried from
    src_url = "http://ip:port/solr/src_connection"
    # Destination collection address the data is imported into
    des_url = "http://ip:port/solr/des_connection"
    start_time = time.time()
    get_data(des_url, src_url)
    end_time = time.time()
    print("Time consuming:", end_time - start_time, "seconds")

Remarks:

I. If your collections are not on the same network and you cannot transfer the data online, you can first write the data (with the _version_ key deleted in the for loop) to a file, copy the file to a server on the destination network, and then read it back in a loop to upload it, as follows. Write the file however you prefer, but after reading you need to convert each line back into a dictionary for uploading:

file = open("", "a+")  # filename left blank in the original; supply your own path
for i in s_data:
    del i["_version_"]
    file.write(str(i) + "\n")
file.close()
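The read-back step mentioned above can be sketched like this: str(i) writes each document in Python literal syntax, so ast.literal_eval can safely turn each line back into a dictionary (unlike eval, it only accepts literals). The file path in the usage comment is a placeholder.

```python
import ast

# Sketch: convert lines written with str(dict) back into dictionaries
# so they can be passed to the upload step.
def parse_doc_lines(lines):
    docs = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            docs.append(ast.literal_eval(line))
    return docs

# Usage (path is a placeholder):
# with open("solr_docs.txt") as f:
#     docs = parse_doc_lines(f)
```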

II. To clear a collection's data, the following method was the most convenient in my tests:

In the Solr admin UI, open the collection you want to clear, select Documents, set the document type to XML, paste the following content into the document field, and click the Submit Document button.

<delete><query>*:*</query></delete>
<commit/>
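The same delete-all command can also be posted to the update handler over HTTP instead of going through the admin UI. A minimal sketch, assuming the placeholder base URL "http://host:port/solr/collection_name":

```python
# Sketch: build the HTTP request that deletes every document in a collection,
# equivalent to submitting the XML above in the admin UI.
def build_clear_request(base_url):
    url = "%s/update?commit=true" % base_url
    headers = {"Content-Type": "text/xml"}
    body = "<delete><query>*:*</query></delete>"
    return url, headers, body

url, headers, body = build_clear_request("http://host:port/solr/collection_name")
# import requests as r
# r.post(url, data=body, headers=headers)  # actual send
```

Passing commit=true in the URL makes the deletion visible immediately, matching the <commit/> element above.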

This is the whole content of this article.