Preface
On a project I needed to migrate the data in one collection to another collection while both stayed online. I searched Baidu and read many articles, but most of them used the import method, and I did not find a way to migrate data online. So I wrote a Python script and am sharing it here.
Idea: the collection holds a large amount of data, too much to process in one operation, so the work is done in segments.
First, query 1000 documents at a time and process them into JSON data.
Then send the processed JSON data to the destination collection.
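The segmentation idea can be sketched as below. This is only an illustration: the helper name batch_starts is hypothetical and does not appear in the script later in this post. It lists the start offsets that a 1000-row loop would visit.

```python
def batch_starts(total, batch_size=1000):
    """Return the start offset of every batch needed to cover `total` rows."""
    # range() steps through 0, batch_size, 2 * batch_size, ... below total
    return list(range(0, total, batch_size))


# A collection of 2500 documents is covered by three queries
print(batch_starts(2500))
```

The last batch is simply shorter than the others, because Solr returns only the rows that exist.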
Implementation:
I. Query through the HTTP interface
Use the following format to query:
http://host:port/solr/collection_name/select?q=*:*&rows=1000&start=0
Where: collection_name is the name of the collection being queried.
rows is the number of rows to return, set to 1000 here.
start is the offset of the first row to return; the script below increments this parameter to loop over the whole collection.
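As a small sketch of how the query URL can be assembled for each segment, the hypothetical helper build_select_url below (not part of the script in this post) uses the standard library to encode the parameters. The base URL is a placeholder.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, start, rows=1000):
    """Build a select URL that returns `rows` documents beginning at `start`."""
    query = urlencode({"q": "*:*", "rows": rows, "start": start})
    return "%s/select?%s" % (base_url, query)


url = build_select_url("http://host:port/solr/collection_name", start=2000)
# Decode the query string again to show the parameters Solr will receive
print(parse_qs(urlsplit(url).query))
```

urlencode percent-encodes the characters in *:*, which Solr decodes back to the same query.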
The query returns data in JSON format.
Inside response there are two key-value pairs that we need: numFound (the total number of documents) and docs (a list that holds all the documents as JSON objects).
In docs, each document has a _version_ key, which needs to be removed before the document is submitted. Otherwise Solr treats the value as an optimistic concurrency check and may reject the document with a version conflict.
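The handling of the response can be sketched as follows. The sample text is made up for illustration and contains only the keys discussed above; extract_docs is a hypothetical helper name.

```python
import json

# Made-up sample shaped like a Solr select response (illustration only)
sample = (
    '{"response": {"numFound": 2, "start": 0, "docs": ['
    '{"id": "1", "_version_": 111}, {"id": "2", "_version_": 222}]}}'
)


def extract_docs(response_text):
    """Return (numFound, docs) with the _version_ key removed from each doc."""
    body = json.loads(response_text)
    num_found = int(body["response"]["numFound"])
    docs = body["response"]["docs"]
    for doc in docs:
        # pop() with a default does not fail if the key is missing
        doc.pop("_version_", None)
    return num_found, docs


print(extract_docs(sample))
```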
II. Submit data through the HTTP interface
wt=json asks Solr to return its response in JSON format.
http://host:port/solr/collection_name/update?wt=json
The header needs to be set to {"Content-Type": "application/json"}
Submit parameters: with overwrite set to true, Solr replaces a document that already exists in the index; commit set to true commits the change right away. (These parameters can also be added directly to the URL.)
{"overwrite":"true","commit":"true"}
data_dict is one document from the processed docs data.
Submit data: data={"add":{ "doc":data_dict}}
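To make the pieces of the submit request concrete, here is a minimal sketch that only assembles them without sending anything. build_update_request is a hypothetical helper, and the destination URL is a placeholder.

```python
import json


def build_update_request(des_url, data_dict):
    """Assemble the URL, headers, parameters and JSON body for one document."""
    url = "%s/update?wt=json" % des_url
    headers = {"Content-Type": "application/json"}
    params = {"overwrite": "true", "commit": "true"}
    body = json.dumps({"add": {"doc": data_dict}})
    return url, headers, params, body


url, headers, params, body = build_update_request(
    "http://host:port/solr/collection_name", {"id": "1"}
)
print(url)
print(body)
```

These four values correspond to the arguments passed to the HTTP POST call when the document is actually sent.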
III. The complete script is as follows:
#coding=utf-8
import requests as r
import json
import threading
import time

# Send data to the destination url des_url. The data_dict parameter is one
# document (a dictionary) with the _version_ key removed.
def send_data(des_url, data_dict):
    data = {"add": {"doc": data_dict}}
    headers = {"Content-Type": "application/json"}
    params = {"boost": 1.0, "overwrite": "true", "commitWithin": 1000, "commit": "true"}
    url = "%s/update?wt=json" % (des_url)
    re = r.post(url, json=data, params=params, headers=headers)
    if re.status_code != 200:
        print("Import error.", data)

# Get data, call send_data to send data to the destination url
def get_data(des_url, src_url):
    # Define the starting row
    start = 0
    # Get the total number of documents first
    se_data = r.get("%s/select?q=*:*&rows=0&start=%s" % (src_url, start)).text
    se_dict = json.loads(se_data)
    numFound = int(se_dict["response"]["numFound"])
    # while loop, 1000 documents per iteration
    while start < numFound:
        # Define the list that holds the threads
        th_li = []
        # Fetch 1000 documents
        se_data = r.get("%s/select?q=*:*&rows=1000&start=%s" % (src_url, start)).text
        # Convert the fetched data into a dictionary
        se_dict = json.loads(se_data)
        # Get the docs data
        s_data = (se_dict["response"]["docs"])
        # Loop over the documents, delete the _version_ key, and create one
        # thread per document that calls send_data.
        for i in s_data:
            del i["_version_"]
            th = threading.Thread(target=send_data, args=(des_url, i))
            th_li.append(th)
        # Start all threads of this batch, then wait for them to finish
        for t in th_li:
            t.start()
        for t in th_li:
            t.join()
        start += 1000
        print(start)

if __name__ == "__main__":
    # Source data, address of the collection to query
    src_url = "http://ip:port/solr/src_connection"
    # Address of the destination collection for the imported data
    des_url = "http://ip:port/solr/des_connection"
    start_time = time.time()
    get_data(des_url, src_url)
    end_time = time.time()
    print("Time consuming:", end_time - start_time, "Seconds.")
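The script creates one thread per document. If that turns out to be too heavy for the machine or for Solr, one possible alternative (not what the script above does) is a bounded thread pool from the standard library. In the sketch below, send_batch is a hypothetical helper and send_one stands in for a call such as lambda doc: send_data(des_url, doc); the worker count of 8 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def send_batch(docs, send_one, max_workers=8):
    """Call send_one(doc) for every doc using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every call and re-raises the first exception, if any
        return list(pool.map(send_one, docs))


# Demo with a stand-in sender that only records what it was given
sent = []
send_batch([{"id": "1"}, {"id": "2"}], sent.append)
print(len(sent))
```

The results come back in the same order as the input documents, which makes it easy to match a failure to the document that caused it.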
Remarks:
I. If your collections are not on the same network and online transfer is not possible, you can first write the data to a file inside the for loop that deletes the _version_ key, copy the file to a server on the destination network, and then read the file in a loop and upload it. The file can be written as follows (the format is up to you), but after reading it back you need to convert each line into a dictionary before uploading:
file = open("", "a+")
for i in s_data:
    del i["_version_"]
    file.write(str(i) + "\n")
file.close()
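Reading the file back could look like the sketch below. It assumes each line was written with str() on a dictionary as above, so ast.literal_eval from the standard library can turn the line back into a dictionary. read_docs and the temporary file name are hypothetical and used only for the demonstration.

```python
import ast
import os
import tempfile


def read_docs(path):
    """Read one dictionary per line from a file written with str(dict)."""
    docs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                # literal_eval parses Python literals only; it runs no code
                docs.append(ast.literal_eval(line))
    return docs


# Round trip: write one document the same way, then read it back
path = os.path.join(tempfile.mkdtemp(), "docs.txt")
with open(path, "a+") as f:
    f.write(str({"id": "1", "title": ["a"]}) + "\n")
print(read_docs(path))
```

Each dictionary returned by read_docs can then be passed to send_data on the destination network.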
II. To clear the data in a collection, you can use the following method in the Solr admin web interface; in my own testing it was the more convenient one:
Open the collection you want to clear.
Select Documents.
For Document Type, select XML.
Copy the following content into the document box, and then click the Submit Document button.
Content for deleting the data through the web interface:
<delete><query>*:*</query></delete>
<commit/>
This is the whole content of this article.