Python: how to parallelize processing

I need to divide the task across 8 processes.
I use multiprocessing to do that.
Let me describe my task:
I have a dataframe with a column of urls. Some urls return a captcha, and I try to use proxies from another file to fetch the page for every url.
It takes a lot of time and I want to split that work up. I want to open the first url with one proxy, the second url with another proxy, and so on. I can't simply use map or zip, because the list of proxies is shorter than the list of urls.
The urls look like this:
['https://www.avito.ru/moskva/avtomobili/bmw_x5_2016_840834845', 'https://www.avito.ru/moskva/avtomobili/bmw_1_seriya_2016_855898883', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_853351780', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_856641142', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_856641140', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_853351780', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_856641134', 'https://www.avito.ru/moskva/avtomobili/bmw_3_seriya_2016_856641141']
and the proxies look like this:
['http://203.223.143.51:8080', 'http://77.123.18.56:81', 'http://203.146.189.61:80', 'http://113.185.19.130:80', 'http://212.235.226.133:3128', 'http://5.39.89.84:8080']
My code:
import re
import time
import urllib
import pandas as pd
from bs4 import BeautifulSoup
from multiprocessing import Pool

def get_page(url):
    m = re.search(r'avito.ru\/[a-z]+\/avtomobili\/[a-z0-9_]+$', url)
    if m is not None:
        url = 'https://www.' + url
        print url
        proxy = pd.read_excel('proxies.xlsx')
        proxies = proxy.proxy.values.tolist()
        for i, proxy in enumerate(proxies):
            print "Trying HTTP proxy %s" % proxy
            try:
                result = urllib.urlopen(url, proxies={'http': proxy}).read()
                # The Russian string below is the site's captcha message:
                # "We have detected that requests coming from your IP address look automated"
                if 'Мы обнаружили, что запросы, поступающие с вашего IP-адреса, похожи на автоматические' in result:
                    raise Exception
                else:
                    soup = BeautifulSoup(result, 'html.parser')
                    price = soup.find('span', itemprop="price")
                    print price
            except:
                print "Trying next proxy %s in 10 seconds" % proxy
                time.sleep(10)

if __name__ == '__main__':
    pool = Pool(processes=8)
    pool.map(get_page, urls)
My code takes 8 urls and tries to open each of them with the same proxy. How can I change the algorithm so that the 8 urls are opened with 8 different proxies?

Something like this might help:
def get_page(url):
    m = re.search(r'avito.ru\/[a-z]+\/avtomobili\/[a-z0-9_]+$', url)
    if m is not None:
        url = 'https://www.' + url
        print url
        proxy = pd.read_excel('proxies.xlsx')
        proxies = proxy.proxy.values.tolist()
        for i, proxy in enumerate(proxies):
            thread.start_new_thread(run, (url, proxy, i))

def run(url, proxy, i):
    print "Trying HTTP proxy %s" % proxy
    try:
        result = urllib.urlopen(url, proxies={'http': proxy}).read()
        # Same captcha message check as above
        if 'Мы обнаружили, что запросы, поступающие с вашего IP-адреса, похожи на автоматические' in result:
            raise Exception
        else:
            soup = BeautifulSoup(result, 'html.parser')
            price = soup.find('span', itemprop="price")
            print price
    except:
        print "Trying next proxy %s in 10 seconds" % proxy
        time.sleep(10)

if __name__ == '__main__':
    pool = Pool(processes=8)
    pool.map(get_page, urls)
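As a sketch of one other way to pair every url with a proxy even though the proxy list is shorter: repeat the proxies with itertools.cycle and hand (url, proxy) pairs to the pool, so each worker opens its url through its own proxy. This is a minimal Python 3 illustration, not the original code; it assumes the urls and proxies lists shown above and uses requests instead of urllib for the proxied request, with the captcha check and parsing omitted.

import itertools
from multiprocessing import Pool

import requests

def get_page(args):
    # Each worker receives its own (url, proxy) pair.
    url, proxy = args
    try:
        # Placeholder fetch; the captcha check and price parsing would go here.
        result = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        return url, result.status_code
    except requests.RequestException as e:
        return url, str(e)

if __name__ == '__main__':
    # cycle() repeats the shorter proxy list so zip() covers every url.
    pairs = list(zip(urls, itertools.cycle(proxies)))
    with Pool(processes=8) as pool:
        for url, status in pool.map(get_page, pairs):
            print(url, status)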

Related

How do I extract only the content from this webpage

I am trying out web scraping using BeautifulSoup.
I only want to extract the content from this webpage, basically everything about Barry Kripke, without all the headers etc.
https://bigbangtheory.fandom.com/wiki/Barry_Kripke
I tried this, but it doesn't give me what I want:
import urllib3
from bs4 import BeautifulSoup

quote = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
http = urllib3.PoolManager()
r = http.request('GET', quote)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)
print(soup.prettify()[:1000])
article_tag = 'p'
article = soup.find_all(article_tag)[0]
print(f'Type of the variable "article":{article.__class__.__name__}')
article.text
The output I get is just the first paragraph.
What I want is the full article text.
Next I tried to get all the links, but that didn't work either - I got only 2 links:
for t in article.find_all('a'):
    print(t)
Can someone please help me with this?
You only grab and print out the 1st <p> tag with article = soup.find_all(article_tag)[0]
You need to go through all the <p> tags:
import requests
from bs4 import BeautifulSoup

url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
r = requests.get(url)
if r.status_code == 200:
    page = r.text
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status_code, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status_code)
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)
print(soup.prettify()[:1000])
article_tag = 'p'
articles = soup.find_all(article_tag)
for p in articles:
    print(p.text)
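If the surrounding navigation and headers still leak in, a common refinement is to scope the search to the article's main content container before collecting the <p> tags. The sketch below is hedged: the mw-parser-output class is the usual MediaWiki content wrapper, but verify it against the page's actual markup before relying on it.

import requests
from bs4 import BeautifulSoup

url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Assumed container class; inspect the page to confirm it.
content = soup.find('div', class_='mw-parser-output')
if content is not None:
    for p in content.find_all('p'):
        print(p.get_text(strip=True))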

Trouble selecting functional proxies from a list of proxies quickly

I've created a scraper using requests module implementing rotation of proxies (taken from a free proxy site) within it to fetch content from yellowpages.
The script appears to work correctly but it is terribly slow as it takes a lot of time to find a working proxy. I've tried to reuse the same working proxy (when found) until it is dead and for that I had to declare proxies and proxy_url as global.
Although shop_name and categories are available in landing pages, I scraped both of them from inner pages so that the script can demonstrate that it uses the same working proxy (when it finds one) multiple times.
This is the script I'm trying with:
import random
import requests
from bs4 import BeautifulSoup

base = 'https://www.yellowpages.com{}'
link = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

def get_proxies():
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    for item in soup.select("table.table tbody tr"):
        if not item.select_one("td"): break
        ip = item.select_one("td").text
        port = item.select_one("td:nth-of-type(2)").text
        proxies.append(f"{ip}:{port}")
    return [{'https': f'http://{x}'} for x in proxies]

def fetch_resp(link, headers):
    global proxies, proxy_url
    while True:
        print("currently being used:", proxy_url)
        try:
            res = requests.get(link, headers=headers, proxies=proxy_url, timeout=10)
            print("status code", res.status_code)
            assert res.status_code == 200
            return res
        except Exception as e:
            proxy_url = proxies.pop(random.randrange(len(proxies)))

def fetch_links(link, headers):
    res = fetch_resp(link, headers)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".v-card > .info a.business-name"):
        yield base.format(item.get("href"))

def get_content(link, headers):
    res = fetch_resp(link, headers)
    soup = BeautifulSoup(res.text, "lxml")
    shop_name = soup.select_one(".sales-info > h1.business-name").get_text(strip=True)
    categories = ' '.join([i.text for i in soup.select(".categories > a")])
    return shop_name, categories

if __name__ == '__main__':
    proxies = get_proxies()
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    for inner_link in fetch_links(link, headers):
        print(get_content(inner_link, headers))
How can I quickly select a functional proxy from a list of proxies?
Please let me point out that using free proxy IP addresses can be highly problematic. These types of proxies are notorious for having connection issues, such as timeouts related to latency. These sites can also be intermittent, which means they can go down at any time. And sometimes these sites are being abused, so they can get blocked.
With that being said, below are multiple methods that can be used to accomplish your use case related to scraping content from the Yellow Pages.
UPDATE 07-11-2022 16:47 GMT
I tried a different proxy validation method this morning. It is slightly faster than the proxy judge method. The issue with both these methods is error handling. I have to catch all the errors below when validating a proxy IP address and passing a validated address to your function fetch_resp.
ConnectionResetError
requests.exceptions.ConnectTimeout
requests.exceptions.ProxyError
requests.exceptions.ConnectionError
requests.exceptions.HTTPError
requests.exceptions.Timeout
requests.exceptions.TooManyRedirects
urllib3.exceptions.MaxRetryError
urllib3.exceptions.ProxySchemeUnknown
urllib3.exceptions.ProtocolError
Occasionally a proxy fails when extracting from a page, which causes a delay. There is nothing you can do to prevent these failures. The only thing you can do is catch the error and reprocess the request.
I was able to improve the extraction time by adding threading to function get_content.
Content Extraction Runtime: 0:00:03.475362
Total Runtime: 0:01:16.617862
The only way you can increase the speed of your code is to redesign it to query each page at the same time. If you don't, this is a timing bottleneck.
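A rough sketch of what that concurrent redesign might look like, using ThreadPoolExecutor to submit all of the page requests at once; fetch_resp, the headers, and the validated proxy dictionary are assumed to already exist as defined elsewhere in this answer.

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_many(links, http_headers, proxy_url, max_workers=8):
    # Submit every page request at once and collect responses as they finish.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_link = {
            executor.submit(fetch_resp, link, http_headers, proxy_url): link
            for link in links
        }
        for future in as_completed(future_to_link):
            link = future_to_link[future]
            results[link] = future.result()
    return results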
Here is the code that I used to validate the proxy addresses.
def check_proxy(proxy):
    try:
        session = requests.Session()
        session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
        session.max_redirects = 300
        proxy = proxy.split('\n', 1)[0]
        # print('Checking ' + proxy)
        req = session.get("http://google.com", proxies={'http': 'http://' + proxy}, timeout=30, allow_redirects=True)
        if req.status_code == 200:
            return proxy
    except requests.exceptions.ConnectTimeout as e:
        return None
    except requests.exceptions.ConnectionError as e:
        return None
    except ConnectionResetError as e:
        # print('Error,ConnectionReset!')
        return None
    except requests.exceptions.HTTPError as e:
        return None
    except requests.exceptions.Timeout as e:
        return None
    except ProxySchemeUnknown as e:
        return None
    except ProtocolError as e:
        return None
    except requests.exceptions.ChunkedEncodingError as e:
        return None
    except requests.exceptions.TooManyRedirects as e:
        return None
UPDATE 07-10-2022 23:53 GMT
I did some more research into this question. I noted that the website https://www.sslproxies.org provides a list of 100 HTTPS proxies. Of those, fewer than 20% pass the proxy judge test. Even after obtaining this 20%, some will fail when being passed to your function fetch_resp. They can fail for multiple reasons, including ConnectTimeout, MaxRetryError, ProxyError, etc. When this happens you can rerun the function with the same link (url), the same headers, and a new proxy. The best workaround for these errors is to use a commercial proxy service.
In my latest test I was able to obtain a list of potentially functional proxies and extract all the content for all 25 pages related to your search. Below is the timeDelta for this test:
Content Extraction Runtime: 0:00:34.176803
Total Runtime: 0:01:22.429338
I can speed this up if I use threading with the function fetch_resp.
Below is the current code that I'm using. I need to improve the error handling, but it currently works.
import time
import random
import requests
from datetime import timedelta
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib3.exceptions import MaxRetryError, ProxySchemeUnknown
from concurrent.futures import ThreadPoolExecutor, as_completed

proxies_addresses = []
current_proxy = ''

def requests_retry_session(retries=5,
                           backoff_factor=0.5,
                           status_force_list=(500, 502, 503, 504),
                           session=None,
                           ):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_force_list,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def ssl_proxy_addresses():
    global proxies_addresses
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    table = soup.find('tbody')
    table_rows = table.find_all('tr')
    for row in table_rows:
        ip_address = row.find_all('td')[0]
        port_number = row.find_all('td')[1]
        proxies.append(f'{ip_address.text}:{port_number.text}')
    proxies_addresses = proxies
    return proxies

def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    if proxy_status is True:
        return current_proxy_address
    else:
        return None

def get_proxy_address():
    global proxies_addresses
    proxy_addresses = ssl_proxy_addresses()
    processes = []
    with ThreadPoolExecutor(max_workers=40) as executor:
        for proxy_address in proxy_addresses:
            processes.append(executor.submit(proxy_verification, proxy_address))
    proxies = [task.result() for task in as_completed(processes) if task.result() is not None]
    proxies_addresses = proxies
    return proxies_addresses

def fetch_resp(link, http_headers, proxy_url):
    try:
        print(F'Current Proxy: {proxy_url}')
        response = requests_retry_session().get(link,
                                                headers=http_headers,
                                                allow_redirects=True,
                                                verify=True,
                                                proxies=proxy_url,
                                                timeout=(30, 45)
                                                )
        print("status code", response.status_code)
        if response.status_code == 200:
            return response
        else:
            current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
            fetch_resp(link, http_headers, current_proxy)
    except requests.exceptions.ConnectTimeout as e:
        print('Error,Timeout!')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        fetch_resp(link, http_headers, current_proxy)
        pass
    except requests.exceptions.ProxyError as e:
        print('ProxyError!')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        fetch_resp(link, http_headers, current_proxy)
        pass
    except requests.exceptions.ConnectionError as e:
        print('Connection Error')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        fetch_resp(link, http_headers, current_proxy)
        pass
    except requests.exceptions.HTTPError as e:
        print('HTTP ERROR!')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        fetch_resp(link, http_headers, current_proxy)
        pass
    except requests.exceptions.Timeout as e:
        print('Error! Connection Timeout!')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        fetch_resp(link, http_headers, current_proxy)
        pass
    except ProxySchemeUnknown as e:
        print('ERROR unknown Proxy Scheme!')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        fetch_resp(link, http_headers, current_proxy)
        pass
    except MaxRetryError as e:
        print('MaxRetryError')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        fetch_resp(link, http_headers, current_proxy)
        pass
    except requests.exceptions.TooManyRedirects as e:
        print('ERROR! Too many redirects!')
        current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
        fetch_resp(link, http_headers, current_proxy)
        pass

def get_content(http_headers, proxy_url):
    start_time = time.time()
    results = []
    pages = int(25)
    for page_number in range(1, pages):
        print(page_number)
        next_url = f"https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los%20Angeles%2C%20CA" \
                   f"&page={page_number}"
        res = fetch_resp(next_url, http_headers, proxy_url)
        soup = BeautifulSoup(res.text, "lxml")
        info_sections = soup.find_all('li', {'class': 'business-card'})
        for info_section in info_sections:
            shop_name = info_section.find('h2', {'class': 'title business-name'})
            categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
            results.append({shop_name.text, categories})
    end_time = time.time() - start_time
    print(f'Content Extraction Runtime: {timedelta(seconds=end_time)}')
    return results

start_time = time.time()
get_proxy_address()
if len(proxies_addresses) != 0:
    print(proxies_addresses)
    print('\n')
    current_proxy = proxies_addresses.pop(random.randrange(len(proxies_addresses)))
    print(current_proxy)
    print('\n')
    base_url = 'https://www.yellowpages.com{}'
    current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'
    headers = {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
    }
    PROXIES = {
        'https': f"http://{current_proxy}"
    }
    results = get_content(headers, PROXIES)
end_time = time.time() - start_time
print(f'Total Runtime: {timedelta(seconds=end_time)}')
UPDATE 07-06-2022 11:02 GMT
This seems to be your core question:
How can I quickly select a functional proxy from a list of proxies?
First, all my previous code is able to validate that a proxy is working at a given moment in time. Once validated I'm able to query and extract data from your Yellow Pages search for pizza in Los Angeles.
Using my previous method I'm able to query and extract data for all 24 pages related to your search in 0:00:45.367209 seconds.
Back to your question.
The website https://www.sslproxies.org provides a list of 100 HTTPS proxies. There is zero guarantee that all 100 are currently operational. One of the ways to identify the working ones is using a Proxy Judge service.
In my previous code I continually selected a random proxy from the list of 100 and passed this proxy to a Proxy Judge for validation. Once a proxy is validated to be working it is used to query and extract data Yellow Pages.
The method above works, but I was wondering how many proxies out of the 100 pass the sniff test for the Proxy Judge service. I attempted to check using a basic for loop, which was deathly slow. I decided to use concurrent.futures to speed up the validation.
The code below takes about 1 minute to obtain a list of HTTPS proxies and validate them using a Proxy Judge service.
This is the fastest way to obtain a list of free proxies that are functional at a specific moment in time.
import requests
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from concurrent.futures import ThreadPoolExecutor, as_completed

def ssl_proxy_addresses():
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = []
    table = soup.find('tbody')
    table_rows = table.find_all('tr')
    for row in table_rows:
        ip_address = row.find_all('td')[0]
        port_number = row.find_all('td')[1]
        proxies.append(f'{ip_address.text}:{port_number.text}')
    return proxies

def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    if proxy_status is True:
        return current_proxy_address
    else:
        return None

def get_proxy_address():
    proxy_addresses = ssl_proxy_addresses()
    processes = []
    with ThreadPoolExecutor(max_workers=20) as executor:
        for proxy_address in proxy_addresses:
            processes.append(executor.submit(proxy_verification, proxy_address))
    proxies = [task.result() for task in as_completed(processes) if task.result() is not None]
    print(len(proxies))
    print(proxies)

Output:

13
['34.228.74.208:8080', '198.41.67.18:8080', '139.9.64.238:443', '216.238.72.163:59394', '64.189.24.250:3129', '62.193.108.133:1976', '210.212.227.68:3128', '47.241.165.133:443', '20.26.4.251:3128', '185.76.9.123:3128', '129.41.171.244:8000', '12.231.44.251:3128', '5.161.105.105:80']
UPDATE CODE 07-05-2022 17:07 GMT
I added a snippet of code below to query the second page. I did this to see if the proxy stayed the same, which it did. You still need to add some error handling.
In my testing I was able to query all 24 pages related to your search in 0:00:45.367209 seconds. I don't consider this query and extraction speed slow by any means.
Concerning performing a different search: I would use the same method as below, but I would request a new proxy for that search, because free proxies do have limitations, such as lifetime and performance degradation.
import random
import logging
import requests
import traceback
from time import sleep
from random import randint
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib3.exceptions import ProxySchemeUnknown
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

current_proxy = ''

def requests_retry_session(retries=5,
                           backoff_factor=0.5,
                           status_force_list=(500, 502, 503, 504),
                           session=None,
                           ):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_force_list,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def random_ssl_proxy_address():
    try:
        # Obtain a list of HTTPS proxies
        # Suppress the console debugging output by setting the log level
        req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)
        # Obtain a random single proxy from the list of proxy addresses
        random_proxy = random.sample(req_proxy.get_proxy_list(), 1)
        return random_proxy[0].get_address()
    except AttributeError as e:
        pass

def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    return proxy_status

def get_proxy_address():
    global current_proxy
    random_proxy_address = random_ssl_proxy_address()
    current_proxy = random_proxy_address
    proxy_status = proxy_verification(random_proxy_address)
    if proxy_status is True:
        return
    else:
        print('Looking for a valid proxy address.')
        # this sleep timer is helping with some timeout issues
        # that were happening when querying
        sleep(randint(5, 10))
        get_proxy_address()

def fetch_resp(link, http_headers, proxy_url):
    try:
        response = requests_retry_session().get(link,
                                                headers=http_headers,
                                                allow_redirects=True,
                                                verify=True,
                                                proxies=proxy_url,
                                                timeout=(30, 45)
                                                )
        print(F'Current Proxy: {proxy_url}')
        print("status code", response.status_code)
        return response
    except requests.exceptions.ConnectTimeout as e:
        print('Error,Timeout!')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except requests.exceptions.ConnectionError as e:
        print('Connection Error')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except requests.exceptions.HTTPError as e:
        print('HTTP ERROR!')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except requests.exceptions.Timeout as e:
        print('Error! Connection Timeout!')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except ProxySchemeUnknown as e:
        print('ERROR unknown Proxy Scheme!')
        print(''.join(traceback.format_tb(e.__traceback__)))
    except requests.exceptions.TooManyRedirects as e:
        print('ERROR! Too many redirects!')
        print(''.join(traceback.format_tb(e.__traceback__)))

def get_next_page(raw_soup, http_headers, proxy_urls):
    next_page_element = raw_soup.find('a', {'class': 'paginator-next arrow-next'})
    next_url = f"https://www.yellowpages.com{next_page_element['href']}"
    sub_response = fetch_resp(next_url, http_headers, proxy_urls)
    new_soup = BeautifulSoup(sub_response.text, "lxml")
    return new_soup

def get_content(link, http_headers, proxy_urls):
    res = fetch_resp(link, http_headers, proxy_urls)
    soup = BeautifulSoup(res.text, "lxml")
    info_sections = soup.find_all('li', {'class': 'business-card'})
    for info_section in info_sections:
        shop_name = info_section.find('h2', {'class': 'title business-name'})
        print(shop_name.text)
        categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
        print(categories)
        business_website = info_section.find('a', {'class': 'website listing-cta action'})
        if business_website is not None:
            print(business_website['href'])
        elif business_website is None:
            print('no website')
    # get page 2
    if soup.find('a', {'class': 'paginator-next arrow-next'}) is not None:
        soup_next_page = get_next_page(soup, http_headers, proxy_urls)
        info_sections = soup_next_page.find_all('li', {'class': 'business-card'})
        for info_section in info_sections:
            shop_name = info_section.find('h2', {'class': 'title business-name'})
            print(shop_name.text)
            categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
            print(categories)
            business_website = info_section.find('a', {'class': 'website listing-cta action'})
            if business_website is not None:
                print(business_website['href'])
            elif business_website is None:
                print('no website')

get_proxy_address()
if len(current_proxy) != 0:
    print(current_proxy)
    base_url = 'https://www.yellowpages.com{}'
    current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'
    headers = {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
    }
    PROXIES = {
        'https': f"http://{current_proxy}"
    }
    get_content(current_url, headers, PROXIES)
truncated output
Current Proxy: {'https': 'http://157.185.161.123:59394'}
status code 200
1.Casa Bianca Pizza Pie
2.Palermo Italian Restaurant
... truncated
Current Proxy: {'https': 'http://157.185.161.123:59394'}
status code 200
31.Johnnie's New York Pizzeria
32.Amalfi Restaurant and Bar
... truncated
UPDATE CODE 07-05-2022 14:07 GMT
I reworked my code posted on 07-01-2022 to output these data elements: business name, business categories, and business website.
1.Casa Bianca Pizza Pie
Pizza, Italian Restaurants, Restaurants
http://www.casabiancapizza.com
2.Palermo Italian Restaurant
Pizza, Restaurants, Italian Restaurants
no website
... truncated
UPDATE CODE 07-01-2022
I noted that when using the free proxies errors were being thrown. I added the requests_retry_session function to handle this. I didn't rework all your code, but I did make sure that I could query the site and produce results using a free proxy. You should be able to work my code into yours.
import random
import logging
import requests
from time import sleep
from random import randint
from bs4 import BeautifulSoup
from proxy_checking import ProxyChecker
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

current_proxy = ''

def requests_retry_session(retries=5,
                           backoff_factor=0.5,
                           status_force_list=(500, 502, 504),
                           session=None,
                           ):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_force_list,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def random_ssl_proxy_address():
    try:
        # Obtain a list of HTTPS proxies
        # Suppress the console debugging output by setting the log level
        req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)
        # Obtain a random single proxy from the list of proxy addresses
        random_proxy = random.sample(req_proxy.get_proxy_list(), 1)
        return random_proxy[0].get_address()
    except AttributeError as e:
        pass

def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    return proxy_status

def get_proxy_address():
    global current_proxy
    random_proxy_address = random_ssl_proxy_address()
    current_proxy = random_proxy_address
    proxy_status = proxy_verification(random_proxy_address)
    if proxy_status is True:
        return
    else:
        print('Looking for a valid proxy address.')
        # this sleep timer is helping with some timeout issues
        # that were happening when querying
        sleep(randint(5, 10))
        get_proxy_address()

def fetch_resp(link, http_headers, proxy_url):
    response = requests_retry_session().get(link,
                                            headers=http_headers,
                                            allow_redirects=True,
                                            verify=True,
                                            proxies=proxy_url,
                                            timeout=(30, 45)
                                            )
    print("status code", response.status_code)
    return response

def get_content(link, headers, proxy_urls):
    res = fetch_resp(link, headers, proxy_urls)
    soup = BeautifulSoup(res.text, "lxml")
    info_sections = soup.find_all('li', {'class': 'business-card'})
    for info_section in info_sections:
        shop_name = info_section.find('h2', {'class': 'title business-name'})
        print(shop_name.text)
        categories = ', '.join([i.text for i in info_section.find_all('a', {'class': 'category'})])
        print(categories)
        business_website = info_section.find('a', {'class': 'website listing-cta action'})
        if business_website is not None:
            print(business_website['href'])
        elif business_website is None:
            print('no website')

get_proxy_address()
if len(current_proxy) != 0:
    print(current_proxy)
    base_url = 'https://www.yellowpages.com{}'
    current_url = 'https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA'
    headers = {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1',
    }
    PROXIES = {
        'https': f"http://{current_proxy}"
    }
    get_content(current_url, headers, PROXIES)
PREVIOUS ANSWERS
06-30-2022:
During some testing I found a bug, so I updated my code to handle the bug.
06-28-2022:
You could use a proxy judge, which is used for testing the performance and the anonymity status of a proxy server.
The code below is from one of my previous answers.
import random
import logging
from time import sleep
from random import randint
from proxy_checking import ProxyChecker
from http_request_randomizer.requests.proxy.ProxyObject import Protocol
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy

current_proxy = ''

def random_ssl_proxy_address():
    try:
        # Obtain a list of HTTPS proxies
        # Suppress the console debugging output by setting the log level
        req_proxy = RequestProxy(log_level=logging.ERROR, protocol=Protocol.HTTPS)
        # Obtain a random single proxy from the list of proxy addresses
        random_proxy = random.sample(req_proxy.get_proxy_list(), 1)
        return random_proxy[0].get_address()
    except AttributeError as e:
        pass

def proxy_verification(current_proxy_address):
    checker = ProxyChecker()
    proxy_judge = checker.check_proxy(current_proxy_address)
    proxy_status = bool([value for key, value in proxy_judge.items() if key == 'status' and value is True])
    return proxy_status

def get_proxy_address():
    global current_proxy
    random_proxy_address = random_ssl_proxy_address()
    current_proxy = random_proxy_address
    proxy_status = proxy_verification(random_proxy_address)
    if proxy_status is True:
        return
    else:
        print('Looking for a valid proxy address.')
        # this sleep timer is helping with some timeout issues
        # that were happening when querying
        sleep(randint(5, 10))
        get_proxy_address()

get_proxy_address()
if len(current_proxy) != 0:
    print(f'Valid proxy address: {current_proxy}')

# output
Valid proxy address: 157.100.12.138:999
I noted today that the Python package HTTP_Request_Randomizer has a couple of Beautiful Soup path problems that need to be fixed, because they currently don't work in version 1.3.2 of HTTP_Request_Randomizer.
You need to modify line 27 in FreeProxyParser.py to this:
table = soup.find("table", attrs={"class": "table table-striped table-bordered"})
You need to modify line 27 in SslProxyParser.py to this:
table = soup.find("table", attrs={"class": "table table-striped table-bordered"})
I found another bug that needs to be fixed. This one is in proxy_checking.py: I had to add the line if url != None:
def get_info(self, url=None, proxy=None):
    info = {}
    proxy_type = []
    judges = ['http://proxyjudge.us/azenv.php', 'http://azenv.net/', 'http://httpheader.net/azenv.php', 'http://mojeip.net.pl/asdfa/azenv.php']
    if url != None:
        try:
            response = requests.get(url, headers=headers, timeout=5)
            return response
        except:
            pass
    elif proxy != None:

Python ThreadPoolExecutor: future.result randomly returns None

So I have the following code for threading:
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = []
    for url in total_links:
        futures.append(executor.submit(process_url, input_url=url))
    for future in concurrent.futures.as_completed(futures):
        print('RESULT INSIDE')
        result = future.result()  # Returns None randomly
        print(result)
        records.append(result)
At times future.result() returns None. Below is the process_url function:
def process_url(input_url):
    res = None
    sleep(0.07)
    r = session.get(input_url, headers=headers, cookies=c, cert=cert, timeout=20)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'lxml')
        res = get_status(soup)
        print('Inside Process URL')
        print(res)
        print('======================')
    return res
res will always have data available, but that data is not being fetched inside the thread. I'd also add that it happens randomly: if I run the script 5 times, then at least once it returns None.
There may be several issues, but most notably, if you don't receive a status code of 200 the future will return None. For example, say one thread retrieves a URL successfully, but a following thread is unable to reach its host, its request times out, etc.; that thread will report back None.
You could validate this behavior by:
Making sure each request is actually working, e.g. making sure res = None changes to the success state in your thread. What happens if the request times out? Are you returning those details? If not, res stays None.
Adding logic to show errors, e.g. making sure res is referencing a response.
def process_url(input_url):
    res = None
    sleep(0.07)
    r = session.get(input_url, headers=headers, cookies=c, cert=cert, timeout=20)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'lxml')
        res = get_status(soup)
        print('Inside Process URL')
        print(res)
        print('======================')
    if r.status_code != 200:
        res = r.text
    return res
I would potentially use a dictionary instead of a list for the futures so you can reference each thread by its url. Right now, it's by index. This way you can know which url is the culprit.
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 20): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
            return "Something went wrong!"
        else:
            print('%r page is %d bytes' % (url, len(data)))
            return "Success!"
But the process you're running is not "randomly" returning None. The res object is referencing None during your execution. The only time you may also receive a None object is if the thread isn't collecting a result. Here are the docs for reference.
https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future.result
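As an illustration of the advice above, here is a hedged variant of process_url that never returns a bare None: every outcome comes back as a (url, status, payload) tuple, so a timeout or a non-200 response is still identifiable when the future completes. session, headers, c, cert, and get_status are assumed to exist exactly as in the question.

import requests
from bs4 import BeautifulSoup

def process_url(input_url):
    # Return a tuple so every outcome, success or failure, is explicit.
    try:
        r = session.get(input_url, headers=headers, cookies=c, cert=cert, timeout=20)
    except requests.RequestException as exc:
        return input_url, 'error', str(exc)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'lxml')
        return input_url, 'ok', get_status(soup)
    return input_url, r.status_code, None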

Appending items Multiprocessing

In the function get_links, I am fetching the links of URLs. And in the Scrape function, I am getting the content of each URL using the text_from_html function (not shown in the code). I want to append the url and visible_text into two lists containing the urls and the visible_text of each url. Here each list contains only one item, and the previous one is getting replaced. I want to keep the previous values as well.
I'm getting the output as:
['https://www.scrapinghub.com']
['https://www.goodreads.com/quotes']
I need them in a single list.
def get_links(url):
    visited_list.append(url)
    try:
        source_code = requests.get(url)
    except Exception:
        get_links(fringe.pop(0))
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll(re.compile(r'(li|a)')):
        href = link.get('href')
        if (href is None) or (href in visited_list) or (href in fringe) or (('http://' not in href) and ('https://' not in href)):
            continue
        else:
            subs = href.split('/')[2]
            fstr = repr(fringe)
            if subs in fstr:
                continue
            else:
                if 'blah' in href:
                    if 'www' not in href:
                        href = href.split(":")[0] + ':' + "//" + "www." + href.split(":")[1][2:]
                        fringe.append(href)
                    else:
                        fringe.append(href)
    return fringe

def test(url):
    try:
        res = requests.get(url)
        plain_text = res.text
        soup = BeautifulSoup(plain_text, "lxml")
        visible_text = text_from_html(plain_text)
        URL.append(url)
        paragraph.append(visible_text)
    except Exception:
        print("CHECK the URL {}".format(url))

if __name__ == "__main__":
    p = Pool(10)
    p.map(test, fringe)
    p.terminate()
    p.join()
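A likely cause here is that each worker process gets its own copy of URL and paragraph, so appends made in the workers never reach the parent process. Below is a minimal, hedged sketch of one common fix: return the values from the worker and let Pool.map collect them in the parent (text_from_html and fringe are assumed to exist as in the question).

from multiprocessing import Pool

import requests

def test(url):
    # Return the data instead of appending to globals; each worker
    # process has its own memory, so global lists are not shared.
    try:
        res = requests.get(url)
        visible_text = text_from_html(res.text)  # assumed helper from the question
        return url, visible_text
    except Exception:
        print("CHECK the URL {}".format(url))
        return url, None

if __name__ == "__main__":
    with Pool(10) as p:
        results = p.map(test, fringe)
    URL = [u for u, text in results if text is not None]
    paragraph = [text for u, text in results if text is not None]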

Get all urls from a website using python

I am learning to build web crawlers and am currently working on getting all the urls from a site. I have been playing around and don't have the same code as I did before, but I have been able to get all the links. My issue is the recursion: it keeps doing the same things over and over, and I think the recursion is doing exactly what the code I have written tells it to do, which is the problem. My code is below:
#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    page = urllib2.urlopen( url ).read()
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin('http://bobthemac.com', anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin('http://bobthemac.com', anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])
        length = len(urlList)
        for url in urlList:
            getAllUrl(url)
        return urlList
    except urllib2.HTTPError, e:
        print e

if __name__ == "__main__":
    urls = getAllUrl('http://bobthemac.com')
    for x in urls:
        print x
What I am trying to achieve is to get all the urls for a site. With the current set-up the program runs until it runs out of memory; all I want is to get the urls from a site. Does anyone have any idea how to do this? I think I have the right idea, I just need some small changes to the code.
EDIT
For those of you who are interested, below is my working code that gets all the urls for the site; someone might find it useful. It's not the best code and does need some work, but with some effort it could be quite good.
#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    urlList = []
    try:
        page = urllib2.urlopen( url ).read()
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin('http://bobthemac.com', anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin('http://bobthemac.com', anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])
        return urlList
    except urllib2.HTTPError, e:
        urlList.append( e )

if __name__ == "__main__":
    urls = getAllUrl('http://bobthemac.com')
    fullList = []
    for x in urls:
        listUrls = list
        listUrls = getAllUrl(x)
        try:
            for i in listUrls:
                if not i in fullList:
                    fullList.append(i)
        except TypeError, e:
            print 'Woops wrong content passed'
    for i in fullList:
        print i
I think this works:
#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    try:
        page = urllib2.urlopen( url ).read()
    except:
        return []
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin(url, anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin(url, anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])
        length = len(urlList)
        return urlList
    except urllib2.HTTPError, e:
        print e

def listAllUrl(urls):
    for x in urls:
        print x
        urls.remove(x)
        urls_tmp = getAllUrl(x)
        for y in urls_tmp:
            urls.append(y)

if __name__ == "__main__":
    urls = ['http://bobthemac.com']
    while len(urls) > 0:
        urls = getAllUrl('http://bobthemac.com')
        listAllUrl(urls)
In your function getAllUrl, you call getAllUrl again inside a for loop, which makes it recursive.
Elements are never removed once put into urlList, so urlList will never be empty, and the recursion will never terminate.
That's why your program never ends until it runs out of memory.
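A hedged sketch of the usual fix for this kind of crawler: keep an explicit queue and a visited set so every url is fetched at most once and the loop terminates on its own. It is written with requests and bs4 rather than the question's urllib2/BeautifulSoup 3, purely for illustration.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def get_all_urls(start_url):
    # Breadth-first crawl restricted to the start domain; the visited set
    # guarantees each url is processed once, so the loop always terminates.
    domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for anchor in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
            link = urljoin(url, anchor['href'])
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)
    return visited

if __name__ == '__main__':
    for found in get_all_urls('http://bobthemac.com'):
        print(found)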
