SoFunction
Updated on 2024-11-21

Python3 Crawler: Splash Load Balancing Configuration in Detail

When Splash is used to crawl pages and the volume of crawling tasks is very large, a single Splash service comes under too much pressure. In that case, consider setting up a load balancer to spread the requests across several servers. Multiple machines and multiple services then share the work, which reduces the load on any single Splash service.

1. Configure the Splash service

To set up Splash load balancing, you first need multiple Splash services. Suppose the Splash service is running on port 8050 on four remote hosts, with the service addresses 41.159.27.223:8050, 41.159.27.221:8050, 41.159.27.9:8050, and 41.159.117.119:8050. The four services are identical, all started from the Splash Docker image, so Splash can be used by accessing any one of them.
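Before putting the services behind a load balancer, it can help to confirm that each one responds. The following is a minimal sketch that assumes the four addresses above and Splash's _ping endpoint; the helper names ping_url and is_alive are illustrative and not part of the original setup.

```python
from urllib.request import urlopen
from urllib.error import URLError

# The four Splash services from the text above
SPLASH_HOSTS = [
    "41.159.27.223:8050",
    "41.159.27.221:8050",
    "41.159.27.9:8050",
    "41.159.117.119:8050",
]


def ping_url(host):
    """Build the health-check URL for one Splash service (hypothetical helper)."""
    return f"http://{host}/_ping"


def is_alive(host, timeout=5):
    """Return True if the service answers the ping with HTTP 200."""
    try:
        with urlopen(ping_url(host), timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status
        return False


# Usage (performs real network requests):
# for host in SPLASH_HOSTS:
#     print(host, "up" if is_alive(host) else "down")
```

The network call is kept inside is_alive so that the list of hosts and the URL building can be reused elsewhere without triggering any requests.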

2. Configure load balancing

Next, you can choose any host with a public IP to configure load balancing. First, install Nginx on this host, and then modify the Nginx configuration file to add the following:

http {
    upstream splash {
        least_conn;
        server 41.159.27.223:8050;
        server 41.159.27.221:8050;
        server 41.159.27.9:8050;
        server 41.159.117.119:8050;
    }
    server {
        listen 8050;
        location / {
            proxy_pass http://splash;
        }
    }
}

This defines a service cluster named splash through the upstream block. The least_conn directive selects least-connections load balancing: each new request goes to the server with the fewest active connections. It is suited to situations where requests take varying amounts of time to process and could otherwise overload a server.
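To make the least-connections idea concrete, here is a small Python simulation. It is only a sketch of the selection rule, not Nginx's implementation: the connection counts are made-up values, the function names are hypothetical, and server weights are ignored.

```python
# Requests currently in flight on each server (illustrative values only)
active = {
    "41.159.27.223:8050": 3,
    "41.159.27.221:8050": 1,
    "41.159.27.9:8050": 2,
    "41.159.117.119:8050": 1,
}


def pick_least_conn(active):
    """Return the server with the fewest active connections.

    Ties go to the server listed first.
    """
    return min(active, key=active.get)


def dispatch(active):
    """Assign one new request and record it as an active connection."""
    server = pick_least_conn(active)
    active[server] += 1
    return server


first = dispatch(active)   # a server that had only 1 active connection
second = dispatch(active)  # the other server that still has only 1
print(first, second)
```

Because a slow render keeps its connection open, a busy server naturally receives fewer new requests under this rule.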

Of course, we can also omit the load balancing method, configured as follows:

upstream splash {
    server 41.159.27.223:8050;
    server 41.159.27.221:8050;
    server 41.159.27.9:8050;
    server 41.159.117.119:8050;
}

This defaults to a round-robin strategy, in which requests are distributed to the servers in turn so that each server receives an equal share. This policy is suitable when the servers have comparable configurations and the service is stateless with short, fast requests.
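The default rotation can be pictured with a few lines of Python. This is a sketch of the idea only, using the server list from the configuration above; next_server is a hypothetical name.

```python
from itertools import cycle

servers = [
    "41.159.27.223:8050",
    "41.159.27.221:8050",
    "41.159.27.9:8050",
    "41.159.117.119:8050",
]

# cycle() yields the servers in order and starts over after the last one
rotation = cycle(servers)


def next_server():
    """Return the server that would receive the next request."""
    return next(rotation)


# Eight requests visit every server exactly twice, in list order
picks = [next_server() for _ in range(8)]
for server in picks:
    print(server)
```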

Alternatively, we can specify weights, configured as follows:

upstream splash {
    server 41.159.27.223:8050 weight=4;
    server 41.159.27.221:8050 weight=2;
    server 41.159.27.9:8050 weight=2;
    server 41.159.117.119:8050 weight=1;
}

The weight parameter specifies the weight of each server: the higher the weight, the more requests are assigned to it. This is useful when the servers differ greatly in configuration.
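The effect of the weights can be simulated in Python. The sketch below uses a smooth weighted round-robin scheme, which spreads the heavier server's turns out instead of sending it four requests in a row; it is an illustration under that assumption rather than a copy of Nginx's code, and the function name is hypothetical.

```python
from collections import Counter

# Weights from the configuration above
weights = {
    "41.159.27.223:8050": 4,
    "41.159.27.221:8050": 2,
    "41.159.27.9:8050": 2,
    "41.159.117.119:8050": 1,
}


def smooth_weighted_round_robin(weights, rounds):
    """Yield `rounds` server picks in proportion to the given weights."""
    total = sum(weights.values())
    current = {server: 0 for server in weights}
    for _ in range(rounds):
        # Every server gains its own weight each round
        for server, weight in weights.items():
            current[server] += weight
        # The server with the highest running score takes the request
        chosen = max(current, key=current.get)
        # and pays back the total weight, letting the others catch up
        current[chosen] -= total
        yield chosen


# One full cycle is sum(weights) = 9 requests
picks = list(smooth_weighted_round_robin(weights, 9))
counts = Counter(picks)
print(counts)
```

Over one full cycle of nine requests, each server is chosen as many times as its weight.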

Finally, there is IP hash load balancing, configured as follows:

upstream splash {
    ip_hash;
    server 41.159.27.223:8050;
    server 41.159.27.221:8050;
    server 41.159.27.9:8050;
    server 41.159.117.119:8050;
}

With this method, Nginx hashes the IP address of the requesting client so that requests from the same client are always answered by the same server. This policy is appropriate for stateful services, for example when a user logs in and then accesses pages. For Splash, there is no need to apply this setting.
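The mapping from client IP to server can be sketched in Python as follows. Two assumptions are made here: the key is the first three octets of the IPv4 address, and zlib.crc32 stands in for the hash function, which is not the one Nginx uses. The function name and the client addresses are made up for illustration.

```python
import zlib

servers = [
    "41.159.27.223:8050",
    "41.159.27.221:8050",
    "41.159.27.9:8050",
    "41.159.117.119:8050",
]


def pick_by_ip(client_ip, servers):
    """Map a client IPv4 address to one server, deterministically."""
    # Assumption: only the first three octets form the hashing key
    key = ".".join(client_ip.split(".")[:3])
    # crc32 is a stand-in hash chosen for this sketch
    index = zlib.crc32(key.encode("ascii")) % len(servers)
    return servers[index]


# The same client always lands on the same server
a = pick_by_ip("203.0.113.10", servers)
b = pick_by_ip("203.0.113.10", servers)
print(a == b)
```

A crawler usually sends all its requests from one or a few IPs, so this strategy would pin nearly all traffic to a single Splash service, which is why it is not a good fit here.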

We can choose among these configurations according to the situation, and reload the Nginx configuration once it is complete:

sudo nginx -s reload

Load balancing is now in effect: requests sent to port 8050 on the server that hosts Nginx are distributed across the Splash services.

3. Configuring authentication

Splash is now publicly accessible. If you do not want that, you can configure authentication, again with the help of Nginx. Add the auth_basic and auth_basic_user_file directives to the location block inside server, as in the following configuration:

http {
    upstream splash {
        least_conn;
        server 41.159.27.223:8050;
        server 41.159.27.221:8050;
        server 41.159.27.9:8050;
        server 41.159.117.119:8050;
    }
    server {
        listen 8050;
        location / {
            proxy_pass http://splash;
            auth_basic "Restricted";
            auth_basic_user_file /etc/nginx/.htpasswd;
        }
    }
}

The username and password file used here is placed in the /etc/nginx/ directory, and we need to create it with the htpasswd command. For example, to create the file with the username admin, run the following command from within that directory:

htpasswd -c .htpasswd admin

We are then prompted for a password. After entering it twice, a password file is generated with the following contents:

cat .htpasswd 
admin:5ZBxQr0rCqwbc

Once the configuration is complete, reload Nginx:

sudo nginx -s reload

Access authentication is now configured.
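With authentication enabled, every client request has to carry the credentials. As a rough illustration of what HTTP Basic authentication sends, the sketch below builds the Authorization header by hand; the function name is hypothetical and the password admin is only a placeholder.

```python
import base64


def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header for the given credentials."""
    # Basic auth is base64("username:password"), not encryption
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


headers = basic_auth_header("admin", "admin")
print(headers)
```

In practice an HTTP library usually builds this header for you, for example through the auth argument of requests. Because the value is only encoded and not encrypted, Basic authentication over plain HTTP exposes the password to anyone who can observe the traffic.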

4. Testing

Finally, we can test the load balancing configuration with code to see whether the IP changes from request to request. A /get endpoint that echoes the requester's IP is sufficient for the test. The implementation is as follows:

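import re
from urllib.parse import quote

import requests

# Assumption: httpbin.org/get is used as the test endpoint because it
# echoes the caller's IP in the "origin" field of its response.
lua = '''
function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("http://httpbin.org/get")
  return treat.as_string(response.body)
end
'''
# URL-encode the Lua script and pass it to Splash's execute endpoint
url = 'http://splash:8050/execute?lua_source=' + quote(lua)
# Credentials must match the entry created with htpasswd
response = requests.get(url, auth=('admin', 'admin'))
# Extract the first IPv4 address found in the response text
ip = re.search(r'(\d+\.\d+\.\d+\.\d+)', response.text).group(1)
print(ip)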

Replace the splash string in the URL with the IP of your own Nginx server. Here I modified the Hosts file so that splash resolves to the Nginx server's IP.

After running the code several times, you can see that the IP changes between requests. For example, the first result:

41.159.27.223

The second result:

41.159.27.9

This indicates that load balancing has been successfully implemented.
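Two runs only show that the IP can change. To see how requests are spread overall, the results of many runs can be tallied. The sketch below assumes a callable that returns one IP per call; tally_ips is a hypothetical helper, and a fixed sequence built from the two results above stands in for real requests so that the example runs without a network.

```python
from collections import Counter
from itertools import cycle


def tally_ips(fetch_ip, runs=20):
    """Call fetch_ip() `runs` times and count how often each IP appears."""
    return Counter(fetch_ip() for _ in range(runs))


# Stand-in for a real request: alternates between the two IPs seen above
_fake_results = cycle(["41.159.27.223", "41.159.27.9"])
counts = tally_ips(lambda: next(_fake_results), runs=4)

for ip, hits in counts.most_common():
    print(ip, hits)
```

To use it against the real setup, pass a function that performs the request from the test code and returns the extracted IP.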

In this section, we implemented load balancing for Splash. Configuring load balancing is well worth it: multiple Splash services work together, and the load on any single service is reduced.
