Trouble selecting functional proxies from a list of proxies quickly - python

I've created a scraper using the requests module that rotates proxies (taken from a free proxy site) to fetch content from Yellow Pages.
The script appears to work correctly, but it is terribly slow because it takes a long time to find a working proxy. I try to reuse the same working proxy (once found) until it dies, and for that I had to declare proxies and proxy_url as global.
Although shop_name and categories are available on the landing pages, I scrape both of them from the inner pages so that the script can demonstrate that it reuses the same working proxy (once it finds one) across multiple requests.
This is the script I'm trying with:
import random
import requests
from bs4 import BeautifulSoup

base = 'https://www.yellowpages.com{}'
link = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

def get_proxies():
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    for item in soup.select("table.table tbody tr"):
        if not item.select_one("td"):
            break
        ip = item.select_one("td").text
        port = item.select_one("td:nth-of-type(2)").text
        proxies.append(f"{ip}:{port}")
    return [{'https': f'http://{x}'} for x in proxies]

def fetch_resp(link, headers):
    global proxies, proxy_url
    while True:
        print("currently being used:", proxy_url)
        try:
            res = requests.get(link, headers=headers, proxies=proxy_url, timeout=10)
            print("status code", res.status_code)
            assert res.status_code == 200
            return res
        except Exception as e:
            proxy_url = proxies.pop(random.randrange(len(proxies)))

def fetch_links(link, headers):
    res = fetch_resp(link, headers)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".v-card > .info a.business-name"):
        yield base.format(item.get("href"))

def get_content(link, headers):
    res = fetch_resp(link, headers)
    soup = BeautifulSoup(res.text, "lxml")
    shop_name = soup.select_one(".sales-info > h1.business-name").get_text(strip=True)
    categories = ' '.join([i.text for i in soup.select(".categories > a")])
    return shop_name, categories

if __name__ == '__main__':
    proxies = get_proxies()
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    for inner_link in fetch_links(link, headers):
        print(get_content(inner_link, headers))
How can I quickly select a functional proxy from a list of proxies?

Please let me point out that using free proxy IP addresses can be highly problematic. These types of proxies are notorious for connection issues, such as latency-related timeouts. The sites serving them are also intermittent, which means they can go down at any time, and because these sites are sometimes abused they can get blocked.
With that said, below are multiple methods that can be used to accomplish your use case of scraping content from Yellow Pages.
UPDATE 07-11-2022 16:47 GMT
I tried a different proxy validation method this morning. It is slightly faster than the proxy judge method. The issue with both these methods is error handling. I have to catch all the errors below when validating a proxy IP address and passing a validated address to your function fetch_resp.
ConnectionResetError
requests.exceptions.ConnectTimeout
requests.exceptions.ProxyError
requests.exceptions.ConnectionError
requests.exceptions.HTTPError
requests.exceptions.Timeout
requests.exceptions.TooManyRedirects
urllib3.exceptions.MaxRetryError
urllib3.exceptions.ProxySchemeUnknown
urllib3.exceptions.ProtocolError
Occasionally a proxy fails when extracting from a page, which causes a delay. There is nothing you can do to prevent these failures. The only thing you can do is catch the error and reprocess the request.
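As a sketch only (not the code used for the timings above), all of those exception types can be funnelled into one handler, so a failing proxy is simply discarded and the next candidate is tried; the pool-handling helper below is an assumption, not part of the original script.
import random
import requests
from urllib3.exceptions import MaxRetryError, ProxySchemeUnknown, ProtocolError

# requests.exceptions.RequestException already covers ConnectTimeout, ProxyError,
# ConnectionError, HTTPError, Timeout and TooManyRedirects.
PROXY_ERRORS = (
    ConnectionResetError,
    requests.exceptions.RequestException,
    MaxRetryError,
    ProxySchemeUnknown,
    ProtocolError,
)

def fetch_with_rotation(link, headers, proxies):
    # Try proxies from the pool until one returns HTTP 200 or the pool runs dry.
    while proxies:
        proxy = proxies.pop(random.randrange(len(proxies)))
        try:
            res = requests.get(link, headers=headers,
                               proxies={'https': f'http://{proxy}'}, timeout=10)
            if res.status_code == 200:
                return res
        except PROXY_ERRORS:
            continue  # dead proxy, move on to the next one
    raise RuntimeError('proxy pool exhausted')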
I was able to improve the extraction time by adding threading to function get_content.
Content Extraction Runtime: 0:00:03.475362
Total Runtime: 0:01:16.617862
The only way you can increase the speed of your code further is to redesign it to query the pages concurrently; otherwise the sequential requests remain a timing bottleneck. A rough sketch of that approach is shown below.
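The sketch below is not my exact code: it reuses the fetch_resp(link, http_headers, proxy_url) signature and the business-card parsing from the code further down, and the worker count and error handling are assumptions you would need to tune.
from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup

def fetch_all_pages(http_headers, proxy_url, pages=25):
    # Submit every results page at once instead of looping over them serially.
    base = ('https://www.yellowpages.com/search?search_terms=pizza'
            '&geo_location_terms=Los%20Angeles%2C%20CA&page={}')
    results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(fetch_resp, base.format(n), http_headers, proxy_url)
                   for n in range(1, pages + 1)]
        for future in as_completed(futures):
            response = future.result()
            if response is None:
                continue  # the request ultimately failed; it could be re-queued here
            soup = BeautifulSoup(response.text, 'lxml')
            for card in soup.find_all('li', {'class': 'business-card'}):
                name = card.find('h2', {'class': 'title business-name'})
                categories = ', '.join(a.text for a in card.find_all('a', {'class': 'category'}))
                results.append((name.text if name else '', categories))
    return results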
Here is the code that I used to validate the proxy addresses.
import requests
from urllib3.exceptions import ProxySchemeUnknown, ProtocolError

def check_proxy(proxy):
    try:
        session = requests.Session()
        session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
        session.max_redirects = 300
        proxy = proxy.split('\n', 1)[0]
        # print('Checking ' + proxy)
        req = session.get("http://google.com", proxies={'http': 'http://' + proxy}, timeout=30, allow_redirects=True)
        if req.status_code == 200:
            return proxy
    except requests.exceptions.ConnectTimeout:
        return None
    except requests.exceptions.ConnectionError:
        return None
    except ConnectionResetError:
        # print('Error, ConnectionReset!')
        return None
    except requests.exceptions.HTTPError:
        return None
    except requests.exceptions.Timeout:
        return None
    except ProxySchemeUnknown:
        return None
    except ProtocolError:
        return None
    except requests.exceptions.ChunkedEncodingError:
        return None
    except requests.exceptions.TooManyRedirects:
        return None
UPDATE 07-10-2022 23:53 GMT
I did some more research into this question. I have noted that the website https://www.sslproxies.org provides a list of 100 HTTPS proxies. Fewer than 20% of those pass the proxy judge test. Even after obtaining that 20%, some will still fail when passed to your function fetch_resp. They can fail for multiple reasons, including ConnectTimeout, MaxRetryError, ProxyError, etc. When this happens you can rerun the function with the same link (URL) and headers and a new proxy. The best workaround for these errors is to use a commercial proxy service.
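For illustration, here is a minimal sketch of that "rerun with a new proxy" idea; it assumes the fetch_resp and proxies_addresses names used in the code below and a bounded number of attempts.
import random

def fetch_with_fallback(link, http_headers, proxies_addresses, max_attempts=5):
    # Pop a fresh validated address for each attempt and retry the same link.
    for _ in range(max_attempts):
        proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        response = fetch_resp(link, http_headers, {'https': f'http://{proxy}'})
        if response is not None and response.status_code == 200:
            return response
    return None  # every attempt failed; a commercial proxy service avoids this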
In my latest test I was able to obtain a list of potentially functional proxies and extract all the content for all 25 pages related to your search. Below is the timeDelta for this test:
Content Extraction Runtime: 0:00:34.176803
Total Runtime: 0:01:22.429338
I can speed this up if I use threading with the function fetch_resp.
Below is the current code that I'm using. I need to improve the error handling, but it currently works.
import time
import random
import requests
from datetime import timedelta
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib3.exceptions import MaxRetryError, ProxySchemeUnknown
from concurrent.futures import ThreadPoolExecutor, as_completed

proxies_addresses = []
current_proxy = ''


def requests_retry_session(retries=5,
                           backoff_factor=0.5,
                           status_force_list=(500, 502, 503, 504),
                           session=None,
                           ):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_force_list,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def ssl_proxy_addresses():
    global proxies_addresses
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    table = soup.find('tbody')
    table_rows = table.find_all('tr')
    for row in table_rows:
        ip_address = row.find_all('td')[0]
        port_number = row.find_all('td')[1]
        proxies.append(f'{ip_address.text}:{port_number.text}')
    proxies_addresses = proxies
    return proxies


def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    if proxy_status is True:
        return current_proxy_address
    else:
        return None


def get_proxy_address():
    global proxies_addresses
    proxy_addresses = ssl_proxy_addresses()
    processes = []
    with ThreadPoolExecutor(max_workers=40) as executor:
        for proxy_address in proxy_addresses:
            processes.append(executor.submit(proxy_verification, proxy_address))
    proxies = [task.result() for task in as_completed(processes) if task.result() is not None]
    proxies_addresses = proxies
    return proxies_addresses


def next_proxy():
    # pop a fresh address from the validated pool and wrap it in the
    # dictionary format that requests expects for its proxies argument
    address = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
    return {'https': f'http://{address}'}


def fetch_resp(link, http_headers, proxy_url):
    try:
        print(F'Current Proxy: {proxy_url}')
        response = requests_retry_session().get(link,
                                                headers=http_headers,
                                                allow_redirects=True,
                                                verify=True,
                                                proxies=proxy_url,
                                                timeout=(30, 45)
                                                )
        print("status code", response.status_code)
        if response.status_code == 200:
            return response
        else:
            return fetch_resp(link, http_headers, next_proxy())
    except requests.exceptions.ConnectTimeout:
        print('Error, Timeout!')
        return fetch_resp(link, http_headers, next_proxy())
    except requests.exceptions.ProxyError:
        print('ProxyError!')
        return fetch_resp(link, http_headers, next_proxy())
    except requests.exceptions.ConnectionError:
        print('Connection Error')
        return fetch_resp(link, http_headers, next_proxy())
    except requests.exceptions.HTTPError:
        print('HTTP ERROR!')
        return fetch_resp(link, http_headers, next_proxy())
    except requests.exceptions.Timeout:
        print('Error! Connection Timeout!')
        return fetch_resp(link, http_headers, next_proxy())
    except ProxySchemeUnknown:
        print('ERROR unknown Proxy Scheme!')
        return fetch_resp(link, http_headers, next_proxy())
    except MaxRetryError:
        print('MaxRetryError')
        return fetch_resp(link, http_headers, next_proxy())
    except requests.exceptions.TooManyRedirects:
        print('ERROR! Too many redirects!')
        return fetch_resp(link, http_headers, next_proxy())


def get_content(http_headers, proxy_url):
    start_time = time.time()
    results = []
    pages = 25
    for page_number in range(1, pages):
        print(page_number)
        next_url = f"https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los%20Angeles%2C%20CA" \
                   f"&page={page_number}"
        res = fetch_resp(next_url, http_headers, proxy_url)
        soup = BeautifulSoup(res.text, "lxml")
        info_sections = soup.find_all('li', {'class': 'business-card'})
        for info_section in info_sections:
            shop_name = info_section.find('h2', {'class': 'title business-name'})
            categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
            results.append((shop_name.text, categories))  # tuple keeps (name, categories) ordered
    end_time = time.time() - start_time
    print(f'Content Extraction Runtime: {timedelta(seconds=end_time)}')
    return results


start_time = time.time()
get_proxy_address()
if len(proxies_addresses) != 0:
    print(proxies_addresses)
    print('\n')
    current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
    print(current_proxy)
    print('\n')

    base_url = 'https://www.yellowpages.com{}'
    current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'

    headers = {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
    }

    PROXIES = {
        'https': f"http://{current_proxy}"
    }

    results = get_content(headers, PROXIES)
    end_time = time.time() - start_time
    print(f'Total Runtime: {timedelta(seconds=end_time)}')
UPDATE 07-06-2022 11:02 GMT
This seems to be your core question:
How can I quickly select a functional proxy from a list of proxies?
First, all my previous code is able to validate that a proxy is working at a given moment in time. Once validated I'm able to query and extract data from your Yellow Pages search for pizza in Los Angeles.
Using my previous method I'm able to query and extract data for all 24 pages related to your search in 0:00:45.367209 seconds.
Back to your question.
The website https://www.sslproxies.org provides a list of 100 HTTPS proxies. There is zero guarantee that all 100 are currently operational. One of the ways to identify the working ones is using a Proxy Judge service.
In my previous code I continually selected a random proxy from the list of 100 and passed this proxy to a Proxy Judge for validation. Once a proxy is validated as working, it is used to query and extract data from Yellow Pages.
The method above works, but I was wondering how many proxies out of the 100 pass the sniff test with the Proxy Judge service. I attempted to check using a basic for loop, which was deathly slow, so I decided to use concurrent.futures to speed up the validation.
The code below takes about 1 minute to obtain a list of HTTPS proxies and validate them using a Proxy Judge service.
This is the fastest way to obtain a list of free proxies that are functional at a specific moment in time.
import requests
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from concurrent.futures import ThreadPoolExecutor, as_completed


def ssl_proxy_addresses():
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    table = soup.find('tbody')
    table_rows = table.find_all('tr')
    for row in table_rows:
        ip_address = row.find_all('td')[0]
        port_number = row.find_all('td')[1]
        proxies.append(f'{ip_address.text}:{port_number.text}')
    return proxies


def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    if proxy_status is True:
        return current_proxy_address
    else:
        return None


def get_proxy_address():
    proxy_addresses = ssl_proxy_addresses()
    processes = []
    with ThreadPoolExecutor(max_workers=20) as executor:
        for proxy_address in proxy_addresses:
            processes.append(executor.submit(proxy_verification, proxy_address))
    proxies = [task.result() for task in as_completed(processes) if task.result() is not None]

    print(len(proxies))
    print(proxies)

# output
13
['34.228.74.208:8080', '198.41.67.18:8080', '139.9.64.238:443', '216.238.72.163:59394', '64.189.24.250:3129', '62.193.108.133:1976', '210.212.227.68:3128', '47.241.165.133:443', '20.26.4.251:3128', '185.76.9.123:3128', '129.41.171.244:8000', '12.231.44.251:3128', '5.161.105.105:80']
UPDATE CODE 07-05-2022 17:07 GMT
I added a snippet of code below to query the second page. I did this to see if the proxy stayed the same, which it did. You still need to add some error handling.
In my testing I was able to query all 24 pages related to your search in 0:00:45.367209 seconds. I don't consider this query and extraction speed slow by any means.
Concerning performing a different search: I would use the same method as below, but request a new proxy for that search, because free proxies have limitations, such as limited lifetime and performance degradation.
import random
import logging
import requests
import traceback
from time import sleep
from random import randint
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib3.exceptions import ProxySchemeUnknown
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy
current_proxy = ''
def requests_retry_session(retries=5,
backoff_factor=0.5,
status_force_list=(500, 502, 503, 504),
session=None,
):
session = session or requests.Session()
retry = Retry(
total=retries,
read=retries,
connect=retries,
backoff_factor=backoff_factor,
status_forcelist=status_force_list,
)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
return session
def random_ssl_proxy_address():
try:
# Obtain a list of HTTPS proxies
# Suppress the console debugging output by setting the log level
req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)
# Obtain a random single proxy from the list of proxy addresses
random_proxy = random.sample(req_proxy.get_proxy_list(), 1)
return random_proxy[0].get_address()
except AttributeError as e:
pass
def proxy_verification(current_proxy_address):
checker = ProxyChecker()
proxy_judge = checker.check_proxy(current_proxy_address)
proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
return proxy_status
def get_proxy_address():
global current_proxy
random_proxy_address = random_ssl_proxy_address()
current_proxy = random_proxy_address
proxy_status = proxy_verification(random_proxy_address)
if proxy_status is True:
return
else:
print('Looking for a valid proxy address.')
# this sleep timer is helping with some timeout issues
# that were happening when querying
sleep(randint(5, 10))
get_proxy_address()
def fetch_resp(link, http_headers, proxy_url):
try:
response = requests_retry_session().get(link,
headers=http_headers,
allow_redirects=True,
verify=True,
proxies=proxy_url,
timeout=(30, 45)
)
print(F'Current Proxy: {proxy_url}')
print("status code", response.status_code)
return response
except requests.exceptions.ConnectTimeout as e:
print('Error,Timeout!')
print(''.join(traceback.format_tb(e.__traceback__)))
except requests.exceptions.ConnectionError as e:
print('Connection Error')
print(''.join(traceback.format_tb(e.__traceback__)))
except requests.exceptions.HTTPError as e:
print('HTTP ERROR!')
print(''.join(traceback.format_tb(e.__traceback__)))
except requests.exceptions.Timeout as e:
print('Error! Connection Timeout!')
print(''.join(traceback.format_tb(e.__traceback__)))
except ProxySchemeUnknown as e:
print('ERROR unknown Proxy Scheme!')
print(''.join(traceback.format_tb(e.__traceback__)))
except requests.exceptions.TooManyRedirects as e:
print('ERROR! Too many redirects!')
print(''.join(traceback.format_tb(e.__traceback__)))
def get_next_page(raw_soup, http_headers, proxy_urls):
next_page_element = raw_soup.find('a', {'class': 'paginator-next arrow-next'})
next_url = f"https://www.yellowpages.com{next_page_element['href']}"
sub_response = fetch_resp(next_url, http_headers, proxy_urls)
new_soup = BeautifulSoup(sub_response.text, "lxml")
return new_soup
def get_content(link, http_headers, proxy_urls):
res = fetch_resp(link, http_headers, proxy_urls)
soup = BeautifulSoup(res.text, "lxml")
info_sections = soup.find_all('li', {'class': 'business-card'})
for info_section in info_sections:
shop_name = info_section.find('h2', {'class': 'title business-name'})
print(shop_name.text)
categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
print(categories)
business_website = info_section.find('a', {'class': 'website listing-cta action'})
if business_website is not None:
print(business_website['href'])
elif business_website is None:
print('no website')
# get page 2
if soup.find('a', {'class': 'paginator-next arrow-next'}) is not None:
soup_next_page = get_next_page(soup, http_headers, proxy_urls)
info_sections = soup_next_page.find_all('li', {'class': 'business-card'})
for info_section in info_sections:
shop_name = info_section.find('h2', {'class': 'title business-name'})
print(shop_name.text)
categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
print(categories)
business_website = info_section.find('a', {'class': 'website listing-cta action'})
if business_website is not None:
print(business_website['href'])
elif business_website is None:
print('no website')
get_proxy_address()
if len(current_proxy) != 0:
print(current_proxy)
base_url = 'https://www.yellowpages.com{}'
current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'
headers = {
'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
}
PROXIES = {
'https': f"http://{current_proxy}"
}
get_content(current_url, headers, PROXIES)
truncated output
Current Proxy: {'https': 'http://157.185.161.123:59394'}
status code 200
1.Casa Bianca Pizza Pie
2.Palermo Italian Restaurant
... truncated
Current Proxy: {'https': 'http://157.185.161.123:59394'}
status code 200
31.Johnnie's New York Pizzeria
32.Amalfi Restaurant and Bar
... truncated
UPDATE CODE 07-05-2022 14:07 GMT
I reworked my code posted on 07-01-2022 to output these data elements: business name, business categories, and business website.
1.Casa Bianca Pizza Pie
Pizza, Italian Restaurants, Restaurants
http://www.casabiancapizza.com
2.Palermo Italian Restaurant
Pizza, Restaurants, Italian Restaurants
no website
... truncated
UPDATE CODE 07-01-2022
I noted that errors were being thrown when using the free proxies, so I added the requests_retry_session function to handle this. I didn't rework all of your code, but I did make sure that I could query the site and produce results using a free proxy. You should be able to work my code into yours.
import random
import logging
import requests
from time import sleep
from random import randint
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy
current_proxy = ''
def requests_retry_session(retries=5,
backoff_factor=0.5,
status_force_list=(500, 502, 504),
session=None,
):
session = session or requests.Session()
retry = Retry(
total=retries,
read=retries,
connect=retries,
backoff_factor=backoff_factor,
status_forcelist=status_force_list,
)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
return session
def random_ssl_proxy_address():
try:
# Obtain a list of HTTPS proxies
# Suppress the console debugging output by setting the log level
req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)
# Obtain a random single proxy from the list of proxy addresses
random_proxy = random.sample(req_proxy.get_proxy_list(), 1)
return random_proxy[0].get_address()
except AttributeError as e:
pass
def proxy_verification(current_proxy_address):
checker = ProxyChecker()
proxy_judge = checker.check_proxy(current_proxy_address)
proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
return proxy_status
def get_proxy_address():
global current_proxy
random_proxy_address = random_ssl_proxy_address()
current_proxy = random_proxy_address
proxy_status = proxy_verification(random_proxy_address)
if proxy_status is True:
return
else:
print('Looking for a valid proxy address.')
# this sleep timer is helping with some timeout issues
# that were happening when querying
sleep(randint(5, 10))
get_proxy_address()
def fetch_resp(link, http_headers, proxy_url):
response = requests_retry_session().get(link,
headers=http_headers,
allow_redirects=True,
verify=True,
proxies=proxy_url,
timeout=(30, 45)
)
print("status code", response.status_code)
return response
def get_content(link, headers, proxy_urls):
res = fetch_resp(link, headers, proxy_urls)
soup = BeautifulSoup(res.text, "lxml")
info_sections = soup.find_all('li', {'class': 'business-card'})
for info_section in info_sections:
shop_name = info_section.find('h2', {'class': 'title business-name'})
print(shop_name.text)
categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
print(categories)
business_website = info_section.find('a', {'class': 'website listing-cta action'})
if business_website is not None:
print(business_website['href'])
elif business_website is None:
print('no website')
get_proxy_address()
if len(current_proxy) != 0:
print(current_proxy)
base_url = 'https://www.yellowpages.com{}'
current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'
headers = {
'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
}
PROXIES = {
'https': f"http://{current_proxy}"
}
get_content(current_url, headers, PROXIES)
PREVIOUS ANSWERS
06-30-2022:
During some testing I found a bug, so I updated my code to handle the bug.
06-28-2022:
You could use a proxy judge, which is used for testing the performance and the anonymity status of a proxy server.
The code below is from one of my previous answers.
import random
import logging
from time import sleep
from random import randint
from proxy_checking import ProxyChecker
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

current_proxy = ''


def random_ssl_proxy_address():
    try:
        # Obtain a list of HTTPS proxies
        # Suppress the console debugging output by setting the log level
        req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)
        # Obtain a random single proxy from the list of proxy addresses
        random_proxy = random.sample(req_proxy.get_proxy_list(), 1)
        return random_proxy[0].get_address()
    except AttributeError as e:
        pass


def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    return proxy_status


def get_proxy_address():
    global current_proxy
    random_proxy_address = random_ssl_proxy_address()
    current_proxy = random_proxy_address
    proxy_status = proxy_verification(random_proxy_address)
    if proxy_status is True:
        return
    else:
        print('Looking for a valid proxy address.')
        # this sleep timer is helping with some timeout issues
        # that were happening when querying
        sleep(randint(5, 10))
        get_proxy_address()


get_proxy_address()
if len(current_proxy) != 0:
    print(f'Valid proxy address: {current_proxy}')

# output
Valid proxy address: 157.100.12.138:999
I noted today that the Python package HTTP_Request_Randomizer has a couple of Beautiful Soup path problems that need to be modified, because they currently don't work in version 1.3.2 of HTTP_Request_Randomizer.
You need to modify line 27 in FreeProxyParser.py to this:
table = soup.find("table", attrs={"class": "table table-striped table-bordered"})
You need to modify line 27 in SslProxyParser.py to this:
table = soup.find("table", attrs={"class": "table table-striped table-bordered"})
I found another bug that needs to be fixed. This one is in proxy_checking.py: I had to add the line if url != None:
def get_info(self, url=None, proxy=None):
    info = {}
    proxy_type = []
    judges = ['http://proxyjudge.us/azenv.php', 'http://azenv.net/', 'http://httpheader.net/azenv.php', 'http://mojeip.net.pl/asdfa/azenv.php']
    if url != None:
        try:
            response = requests.get(url, headers=headers, timeout=5)
            return response
        except:
            pass
    elif proxy != None:

Related

add_done_callback() gives AttributeError: 'Future' object has no attribute 'select'

I tried making a simple web crawler.
I want to create a data frame with the parent link and the links that are found on the parent link page along with the text associated with it.
import requests
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin
import urllib
from urllib.error import HTTPError #used in the main function, to catch HTTPErrors
from requests.exceptions import InvalidURL #used in the main function, to catch invalidUrl errors
from urllib.parse import urlparse #used to parse the homepage url and get network location out of it
from urllib.parse import quote #used to correct incorrect urls
import pandas as pd
class MultiThreadScraper:
global df
df = pd.DataFrame(data=None, columns = ['parent_link','link', 'text'])
def __init__(self, base_url):
self.base_url = base_url
self.root_url = '{}://{}'.format(urlparse(self.base_url).scheme, urlparse(self.base_url).netloc)
self.pool = ThreadPoolExecutor(max_workers=20)
self.scraped_pages = set([])
self.to_crawl = Queue()
self.to_crawl.put(self.base_url)
#gets http.client.HTTPResponse from the server
def get_http_response(self, url):
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
request = urllib.request.Request(url, headers = header) #urllib.request.Request object
response = urllib.request.urlopen(request) #http.client.HTTPResponse
return response
#accepts the http.client.HTTPResponse from the server and fetches the html content using BeautifulSoup
def get_html_content(self, httpResponse):
httpResponse_content = httpResponse.read() #data in bytes
httpResponse_htmlContent = BeautifulSoup(httpResponse_content, 'html.parser')
return httpResponse_htmlContent
def parse_links(self, html,df):
all_tags_with_hrefs = html.select('[href]')
for link in all_tags_with_hrefs:
url = link['href']
if url.startswith('/') or url.startswith(self.root_url):
url = urljoin(self.root_url, url)
if (link.text).strip()!='':
df.loc[-1,'parent_link']= self.base_url
df.loc[-1,'link'] = url
df.loc[-1,'text'] = (link.text).strip()
df.drop_duplicates(subset='link', inplace=True)
df.reset_index(drop=True, inplace=True)
if url not in self.scraped_pages:
self.to_crawl.put(url)
def post_scrape_callback(self, html_content):
self.parse_links(html_content, df)
def scrape_page(self, url):
try:
url = urllib.parse.quote(url, safe='/,:,-,?,=,&')
get_response = self.get_http_response(url)
try:
html_content = self.get_html_content(get_response)
except:
print(f'{url} : ERROR READING RESPONSE !')
except(HTTPError, InvalidURL):
print(f'{url} : NO RESPONSE !')
return html_content
def run_scraper(self):
while True:
try:
target_url = self.to_crawl.get(timeout=10)
if target_url not in self.scraped_pages:
print("Scraping URL: {}".format(target_url))
self.scraped_pages.add(target_url)
job = self.pool.submit(self.scrape_page, target_url)
job.add_done_callback(self.post_scrape_callback)
except Empty:
return
except Exception as e:
print(e)
continue
if __name__ == '__main__':
s = MultiThreadScraper("https://www.nationalgrid.com/")
s.run_scraper()
It gives me the following error :
AttributeError: 'Future' object has no attribute 'select'
From documentation I understood that add_done_callback() adds a callback to be run when the Future is done.
Any help would be appreciated!
You will need a pool, and you will need to split your code logic into smaller functions that can run independently.
For example if you write a crawl function that receives a URL as the argument you can do something like:
import concurrent.futures

urls = []  # a url list you create
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for url in urls:
        executor.submit(crawl, url)
Then you can parse multiple URLs using multiple threads (here it's limited to 5)
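A minimal sketch of that idea which also collects the rows into the data frame the question asks for; crawl() here is a hypothetical placeholder for your own fetch-and-parse logic returning (parent_link, link, text) tuples.
import concurrent.futures
import pandas as pd

def crawl(url):
    # placeholder: fetch the page, parse its links,
    # and return a list of (parent_link, link, text) tuples
    return []

urls = []  # the URL list you create
rows = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(crawl, url) for url in urls]
    for future in concurrent.futures.as_completed(futures):
        rows.extend(future.result())

df = pd.DataFrame(rows, columns=['parent_link', 'link', 'text'])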

Python Error with scraping Forum for Title and URL

I want to scrape the title and the URL of each posting in the forum at the URL below, so that when a new post is created with one of the titles below I receive a mail with the link to that post.
Please don't be too harsh with me; I'm a beginner with Python and scraping.
I have multiple problems:
1: In the while(True) loop, soup is underlined in red with the error: Undefined variable 'soup'.
2: When I comment out the while(True) loop, the program will not run. I get no error.
3: When there is a new posting matching one of my criteria, how do I get the URL of that post?
Titles
def Jeti_DC_16
def Jeti_DC_16_v2
def Jeti_DS_16
def Jeti_DS16_v2
My full code:
from requests import get
from bs4 import BeautifulSoup
import re
import smtplib
import time
import lxml
import pprint
import json
URL = 'https://www.rc-network.de/forums/biete-rc-elektronik-zubeh%C3%B6r.135/'
def scrape_page_metadata(URL):
headers = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'}
pp = pprint.PrettyPrinter(indent=4)
response = get(URL, headers=headers)
soup = BeautifulSoup(response.content, "lxml")
metadata = {
'Jeti_DC_16': Jeti_DC_16(soup, URL),
'jeti_dc_16_2': Jeti_DC_16_v2(soup, URL),
'jeti_ds_16': Jeti_DS_16(soup, URL),
'jeti_ds_16_2': Jeti_DS_16_v2(soup, URL)
}
pp.pprint(metadata)
return metadata
def Jeti_DC_16(soup, URL):
jeti_dc_16 = None
if soup.name.string:
jeti_dc_16 = soup.title.string
elif soup.find_all("div", class_='structItem-title'):
jeti_dc_16 = soup.find_all(
"div", class_='structItem-title').get('text')
else:
jeti_dc_16 = URL.split('//')[1]
return jeti_dc_16.split('/')[0].rsplit('.')[1].capitalize()
return jeti_dc_16
def Jeti_DC_16_v2(soup, URL):
jeti_dc_16_v2 = None
if soup.name.string:
jeti_dc_16_v2 = soup.title.string
elif soup.find_all("div", class_='structItem-title'):
jeti_dc_16_v2 = soup.find_all(
"div", class_='structItem-title').get('text')
else:
jeti_dc_16_v2 = URL.split('//')[1]
return jeti_dc_16_v2.split('/')[0].rsplit('.')[1].capitalize()
return jeti_dc_16_v2
def Jeti_DS_16(soup, URL):
jeti_ds_16 = None
if soup.jeti_ds_16.string:
jeti_ds_16 = soup.jeti_ds_16.string
elif soup.find_all("div", class_='structItem-title'):
jeti_ds_16 = soup.find_all(
"div", class_='structItem-title').get('text')
else:
jeti_ds_16 = URL.split('//')[1]
return jeti_ds_16.split('/')[0].rsplit('.')[1].capitalize()
return jeti_ds_16
def Jeti_DS_16_v2(soup, URL):
jeti_ds_16_v2 = None
if soup.name.string:
jeti_ds_16_v2 = soup.title.string
elif soup.find_all("div", class_='structItem-title'):
jeti_ds_16_v2 = soup.find_all(
"div", class_='structItem-title').get('text')
else:
jeti_dc_16_v2 = URL.split('//')[1]
return jeti_dc_16_v2.split('/')[0].rsplit('.')[1].capitalize()
return jeti_ds_16_v2
# search_for_class = soup.find_all(
# 'div', class_='structItem-title')
# Jeti_DS_16 = soup.find_all(text="Jeti DS 16")
# Jeti_DS_16_v2 = soup.find_all(text="Jeti DS 16 2")
# Jeti_DC_16 = soup.find_all(text="Jeti DC 16")
# Jeti_DC_16_v2 = soup.find_all(text="Jeti DC 16 2")
if(Jeti_DC_16, Jeti_DC_16_v2, Jeti_DS_16, Jeti_DS_16_v2):
send_mail()
# # print('Die Nummer {0} {1} {2} {3} wurden gezogen'.format(
# # Jeti_DC_16, Jeti_DC_16_v2, Jeti_DS_16, Jeti_DS_16_v2))
# for i in soup.find_all('div', attrs={'class': 'structItem-title'}):
# print(i.a['href'])
# first_result = search_for_class[2]
# print(first_result.text)
# print(Jeti_DC_16, Jeti_DC_16_v2, Jeti_DS_16, Jeti_DS_16_v2)
def send_mail():
with open('/Users/blackbox/Desktop/SynologyDrive/Programmieren/rc-network/credentials.json', 'r') as myFile:
data = myFile.read()
obj = json.loads(data)
print("test: " + str(obj['passwd']))
server_ssl = smtplib.SMTP_SSL('smtp.gmail.com', 465)
server_ssl.ehlo()
# server.starttls()
# server.ehlo()
server_ssl.login('secure#gmail.com', 'secure')
subject = 'Es gibt ein neuer Post im RC-Network auf deine gespeicherte Anfragen. Sieh in dir an{Link to Post}'
body = 'Sieh es dir an Link: https://www.rc-network.de/forums/biete-rc-elektronik-zubeh%C3%B6r.135/'
msg = f"Subject: {subject}\n\n{body}"
emails = ["secure#gmx.de"]
server_ssl.sendmail(
'secure#gmail.com',
emails,
msg
)
print('e-Mail wurde versendet!')
# server_ssl.quit
while(True):
Jeti_DC_16(soup, URL)
Jeti_DC_16_v2(soup, URL)
Jeti_DS_16(soup, URL)
Jeti_DS_16_v2(soup, URL)
time.sleep(10)
# time.sleep(86400)
You create soup inside scrape_page_metadata, and it is a local variable which doesn't exist outside scrape_page_metadata. In the while loop you should call scrape_page_metadata() rather than the functions Jeti_DC_16(), Jeti_DC_16_v2(), Jeti_DS_16(), Jeti_DS_16_v2().
That function gives you the metadata, which you should check instead of if(Jeti_DC_16, Jeti_DC_16_v2, Jeti_DS_16, Jeti_DS_16_v2).
More or less (you have to use the correct values in place of ... because I don't know what you want to compare):
while True:
    metadata = scrape_page_metadata(URL)
    if metadata["Jeti_DC_16"] == ... and metadata["Jeti_DC_16_v2"] == ... and metadata["Jeti_DS_16"] == ... and metadata["Jeti_DS_16_v2"] == ...:
        send_mail()
    time.sleep(10)
But there are other problems
All your functions Jeti_DC_16, Jeti_DC_16_v2, Jeti_DS_16, Jeti_DS_16_v2 look the same and probably return the same element. You could use one of them and delete the others, or you should change them so that they search for different elements.
You will probably have to use more print() calls to see the values in variables and which parts of the code are executed, because I think this code still needs a lot of changes.
For example, find_all() returns a list of results, so you can't use get(), which needs a single element. You need a for loop to get the titles from all the elements.
More or less:
jeti_ds_16_v2 = soup.find_all("div", class_='structItem-title')
jeti_ds_16_v2 = [item.get_text() for item in jeti_ds_16_v2]

Python scraping data of multiple pages issue

I'm getting one issue: my code scrapes everything from only the first page, but I want to scrape data from multiple pages, the same way as from the first page. Actually, I also wrote code for multiple pages, and it moves forward to the next page but scrapes the data of the first page again. Please have a look at my code and guide me on how I can fix this issue. Thanks!
Here is my code:
import requests
from bs4 import BeautifulSoup
import csv
def get_page(url):
response = requests.get(url)
if not response.ok:
print('server responded:', response.status_code)
else:
soup = BeautifulSoup(response.text, 'html.parser') # 1. html , 2. parser
return soup
def get_detail_page(soup):
try:
title = (soup.find('h1',class_="cdm_style",id=False).text)
except:
title = 'Empty Title'
try:
collection = (soup.find('td',id="metadata_collec").find('a').text)
except:
collection = "Empty Collection"
try:
author = (soup.find('td',id="metadata_creato").text)
except:
author = "Empty Author"
try:
abstract = (soup.find('td',id="metadata_descri").text)
except:
abstract = "Empty Abstract"
try:
keywords = (soup.find('td',id="metadata_keywor").text)
except:
keywords = "Empty Keywords"
try:
publishers = (soup.find('td',id="metadata_publis").text)
except:
publishers = "Empty Publishers"
try:
date_original = (soup.find('td',id="metadata_contri").text)
except:
date_original = "Empty Date original"
try:
date_digital = (soup.find('td',id="metadata_date").text)
except:
date_digital = "Empty Date digital"
try:
formatt = (soup.find('td',id="metadata_source").text)
except:
formatt = "Empty Format"
try:
release_statement = (soup.find('td',id="metadata_rights").text)
except:
release_statement = "Empty Realease Statement"
try:
library = (soup.find('td',id="metadata_librar").text)
except:
library = "Empty Library"
try:
date_created = (soup.find('td',id="metadata_dmcreated").text)
except:
date_created = "Empty date Created"
data = {
'Title' : title.strip(),
'Collection' : collection.strip(),
'Author' : author.strip(),
'Abstract' : abstract.strip(),
'Keywords' : keywords.strip(),
'Publishers' : publishers.strip(),
'Date_original': date_original.strip(),
'Date_digital' : date_digital.strip(),
'Format' : formatt.strip(),
'Release-st' : release_statement.strip(),
'Library' : library.strip(),
'Date_created' : date_created.strip()
}
return data
def get_index_data(soup):
try:
titles_link = soup.find_all('a',class_="body_link_11")
except:
titles_link = []
else:
titles_link_output = []
for link in titles_link:
try:
item_id = link.attrs.get('item_id', None) #All titles with valid links will have an item_id
if item_id:
titles_link_output.append("{}{}".format("http://cgsc.cdmhost.com",link.attrs.get('href', None)))
except:
continue
return titles_link_output
def write_csv(data,url):
with open('1111_to_5555.csv','a') as csvfile:
writer = csv.writer(csvfile)
row = [data['Title'], data['Collection'], data['Author'],
data['Abstract'], data['Keywords'], data['Publishers'], data['Date_original'],
data['Date_digital'], data['Format'], data['Release-st'], data['Library'],
data['Date_created'], url]
writer.writerow(row)
def main():
for x in range(2,4):
mainurl = ("http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/")
print(x)
url = f"{mainurl}{x}"
products = get_index_data(get_page(url))
for product in products:
data1 = get_detail_page(get_page(product))
write_csv(data1,product)
if __name__ == '__main__':
main()
In the get_page() function, try adding headers to the request (headers must be passed as a dict, not a bare string):
def get_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    response = requests.get(url, headers=headers)

Python beautifulsoup performance, IDE, poolsize, multiprocessing requests.get for web-crawler

Fine fellows of stack overflow!
New at Python (second day) and Beautiful Soup, and sadly in a bit of a hurry.
I've built a crawler that takes street names from a file and feeds them into a search engine (merinfo_url). Companies that meet the right conditions are then scraped further and exported.
I'm in a "hurry" because, despite the code being a complete debug mess, everything is working! I'm itching to begin a long debug test on a remote computer today. I stopped at 5000 hits.
But performance is slow. I understand I could change the parser to lxml and open my local file only once.
I hope to implement that today.
Multiprocessing, however, confuses me. What's my best option: a pool, or opening several connections? Am I using two terms for the same thing?
How large a pool? Two per thread seems to be frequent advice, but I've seen a hundred on a local machine. Any general rule?
If I change nothing in my current code, where do I implement the pool, and how is it generally done for the requests object?
Finally, in terms of performance, off the top of your heads: a good-performing IDE for debugging a crawler running on a local machine?
Many thanks for any feedback offered!
Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import json
import requests
import urllib
from bs4 import BeautifulSoup
from time import sleep
from sys import exit
from multiprocessing import Pool
def main():
if __debug__:
print ("[debug] Instaellningar:")
print ("[debug] Antal anstaellda: "+(antal_anstaellda))
print ("[debug] Omsaettning: "+(omsaettning))
with open(save_file, 'wb') as filedescriptor:
filedescriptor.write('companyName,companySSN,companyAddressNo,companyZipCity,phoneNumber,phoneProvider,phoneNumberType\n')
lines = [line.rstrip('\n') for line in open(streetname_datfile)]
for adresssokparameter in lines:
searchparams = { 'emp': antal_anstaellda, 'rev': omsaettning, 'd': 'c', 'who': '', 'where': adresssokparameter, 'bf': '1' }
sokparametrar = urllib.urlencode(searchparams)
merinfo_url = merinfobaseurl+searchurl+sokparametrar
if __debug__:
print ("[debug] Antal requests gjorda till merinfo.se: "+str(numberOfRequestsCounter))
crawla_merinfo(merinfo_url)
# Crawler
def crawla_merinfo(url):
if __debug__:
print ("[debug] crawl url: "+url)
global numberOfRequestsCounter
numberOfRequestsCounter += 1
merinfosearchresponse = requests.get(url, proxies=proxies)
if merinfosearchresponse.status_code == 429:
print ("[!] For manga sokningar, avslutar")
exit(1)
merinfosoup = BeautifulSoup(merinfosearchresponse.content, 'html.parser')
notfound = merinfosoup.find(string=merinfo404text)
if notfound == u"Din sokning gav tyvaerr ingen traeff. Prova att formulera om din sokning.":
if __debug__:
print ("[debug] [!] " + merinfo404text)
return
for merinfocompanycontent in merinfosoup.find_all('div', attrs={'class': 'result-company'}):
phonelink = merinfocompanycontent.find('a', attrs={'class': 'phone'})
if phonelink == None:
# No numbers, do nothing
if __debug__:
print ("[!] Inget telefonnummer for foretaget")
return
else:
companywithphonenolink = merinfobaseurl+phonelink['href']
thiscompanyphonenodict = crawla_merinfo_telefonnummer(companywithphonenolink)
companyName = merinfocompanycontent.find('h2', attrs={'class': 'name'}).find('a').string
companySSN = merinfocompanycontent.find('p', attrs={'class': 'ssn'}).string
companyAddress = merinfocompanycontent.find('p', attrs={'class': 'address'}).text
splitAddress = companyAddress.splitlines()
addressStreetNo = splitAddress[0]
addressZipCity = splitAddress[1]
addressStreetNo.encode('utf-8')
addressZipCity.encode('utf-8')
if __debug__:
print ("[debug] [*] Foretaget '"+companyName.encode('utf-8')+("' har telefonnummer..."))
for companyPhoneNumber in thiscompanyphonenodict.iterkeys():
companyRow = companyName+","+companySSN+","+addressStreetNo+","+addressZipCity+","+thiscompanyphonenodict[companyPhoneNumber]
if __debug__:
print ("[debug] ::: "+thiscompanyphonenodict[companyPhoneNumber])
with open(save_file, 'a') as filedescriptor:
filedescriptor.write(companyRow.encode('utf-8')+'\n')
return
#telephone crawl function
def crawla_merinfo_telefonnummer(url):
global numberOfRequestsCounter
numberOfRequestsCounter += 1
if __debug__:
print ("[debug] crawl telephone url: "+url)
phonenoDict = {}
s = requests.session()
merinfophonenoresponse = s.get(url, timeout=60)
merinfophonenosoup = BeautifulSoup(merinfophonenoresponse.content, 'html.parser')
merinfotokeninfo = merinfophonenosoup.find('meta', attrs={'name': '_token'})
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063',
'X-Requested-With': 'XMLHttpRequest',
'Host': 'www.merinfo.se',
'Referer': url
}
headers['X-CSRF-TOKEN'] = merinfotokeninfo['content']
headers['Cookie'] = 'merinfo_session='+s.cookies['merinfo_session']+';'
merinfophonetable = merinfophonenosoup.find('table', id='phonetable')
i = 0
for merinfophonenoentry in merinfophonetable.find_all('tr', id=True):
i += 1
phoneNumberID = merinfophonenoentry['id']
phoneNumberPhoneNo = merinfophonenoentry['number']
for phoneNumberColumn in merinfophonenoentry.find_all('td', attrs={'class':'col-xs-2'}):
phoneNumberType = phoneNumberColumn.next_element.string.replace(",",";")
phoneNumberType = phoneNumberType.rstrip('\n').lstrip('\n')
payload = {
'id': phoneNumberID,
'phonenumber': phoneNumberPhoneNo
}
r = s.post(ajaxurl, data=payload, headers=headers)
numberOfRequestsCounter += 1
if r.status_code != 200:
print ("[!] Error, response not HTTP 200 while querying AJAX carrier info.")
exit(1)
else:
carrierResponseDict = json.loads(r.text)
# print carrierResponseDict['operator']
phoneNoString = phoneNumberPhoneNo+','+carrierResponseDict['operator']+','+phoneNumberType
phonenoDict['companyPhoneNo'+str(i)] = phoneNoString
return phonenoDict
# Start main program
main()
You should start using Scrapy.
One of the main advantages of Scrapy: requests are scheduled and processed asynchronously.
This means that Scrapy doesn't need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. It also means that other requests can keep going even if one request fails or an error happens while handling it.
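For reference, a minimal Scrapy spider sketch; the start URL and CSS selectors are illustrative placeholders, not taken from your site, and would need to be adapted to merinfo's markup.
import scrapy

class CompanySpider(scrapy.Spider):
    name = 'companies'
    # hypothetical search URL; build one per street name as in your script
    start_urls = ['https://www.example.com/search?where=streetname']

    def parse(self, response):
        # yield one item per result; Scrapy fetches and schedules pages concurrently
        for company in response.css('div.result-company'):
            yield {
                'name': company.css('h2.name a::text').get(),
                'ssn': company.css('p.ssn::text').get(),
            }
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it with something like scrapy runspider company_spider.py -o companies.csv.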

Python: how to parallelize processing

I need to divide the task across 8 processes.
I use multiprocessing to do that.
I'll try to describe my task:
I have a dataframe with a column of URLs. Some URLs have a captcha, and I try to use proxies from another file to get the page behind every URL.
It takes a lot of time and I want to split the work. I want to open the first URL with one proxy, the second URL with another proxy, etc. I can't use map or zip directly, because the list of proxies is shorter.
The URLs look like
['https://www.avito.ru/moskva/avtomobili/bmw_x5_2016_840834845', 'https://www.avito.ru/moskva/avtomobili/bmw_1_seriya_2016_855898883', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_853351780', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_856641142', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_856641140', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_853351780', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_856641134', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_856641141']
and the proxies look like
['http://203.223.143.51:8080', 'http://77.123.18.56:81', 'http://203.146.189.61:80', 'http://113.185.19.130:80', 'http://212.235.226.133:3128', 'http://5.39.89.84:8080']
My code:
def get_page(url):
m = re.search(r'avito.ru\/[a-z]+\/avtomobili\/[a-z0-9_]+$', url)
if m is not None:
url = 'https://www.' + url
print url
proxy = pd.read_excel('proxies.xlsx')
proxies = proxy.proxy.values.tolist()
for i, proxy in enumerate(proxies):
print "Trying HTTP proxy %s" % proxy
try:
result = urllib.urlopen(url, proxies={'http': proxy}).read()
if 'Мы обнаружили, что запросы, поступающие с вашего IP-адреса, похожи на автоматические' in result:
raise Exception
else:
page = page.read()
soup = BeautifulSoup(page, 'html.parser')
price = soup.find('span', itemprop="price")
print price
except:
print "Trying next proxy %s in 10 seconds" % proxy
time.sleep(10)
if __name__ == '__main__':
pool = Pool(processes=8)
pool.map(get_page, urls)
My code takes 8 URLs and tries to open each of them with one proxy. How can I change the algorithm to open 8 URLs with 8 different proxies?
Something like this might help:
def get_page(url):
m = re.search(r'avito.ru\/[a-z]+\/avtomobili\/[a-z0-9_]+$', url)
if m is not None:
url = 'https://www.' + url
print url
proxy = pd.read_excel('proxies.xlsx')
proxies = proxy.proxy.values.tolist()
for i, proxy in enumerate(proxies):
thread.start_new_thread( run, (proxy,i ) )
def run(proxy,i):
print "Trying HTTP proxy %s" % proxy
try:
result = urllib.urlopen(url, proxies={'http': proxy}).read()
if 'Мы обнаружили, что запросы, поступающие с вашего IP-адреса, похожи на автоматические' in result:
raise Exception
else:
page = page.read()
soup = BeautifulSoup(page, 'html.parser')
price = soup.find('span', itemprop="price")
print price
except:
print "Trying next proxy %s in 10 seconds" % proxy
time.sleep(10)
if __name__ == '__main__':
pool = Pool(processes=8)
pool.map(get_page, urls)
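As an alternative sketch (written in Python 3 style, using the sample data from the question): cycle the shorter proxy list against the URL list so that each worker receives its own (url, proxy) pair; the fetching itself is left as a stub.
from itertools import cycle
from multiprocessing import Pool

def get_page(args):
    url, proxy = args
    # fetch url through proxy here, as in the code above, and return the result
    return url, proxy

if __name__ == '__main__':
    urls = ['https://www.avito.ru/moskva/avtomobili/bmw_x5_2016_840834845',
            'https://www.avito.ru/moskva/avtomobili/bmw_1_seriya_2016_855898883']  # etc.
    proxies = ['http://203.223.143.51:8080', 'http://77.123.18.56:81']  # etc.
    pool = Pool(processes=8)
    # zip() stops at the end of urls; cycle() repeats proxies as needed
    print(pool.map(get_page, zip(urls, cycle(proxies))))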
