I have the following Python code intended for web crawling. When I try to run it, it throws the error below. Code:
import lxml.html
import requests
from bs4 import BeautifulSoup
url1='http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;'
url2 ='page='
url3 ='size=200;template=results;type=batting'
url5 = ['http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;size=200;template=results;type=batting']
for i in range(2, 3854):
    url4 = url1 + url2 + str(i) + ';' + url3
    url5.append(url4)
for page in url5:
    source_code = requests.get(page, verify=False)
    # just get the code, no headers or anything
    plain_text = source_code.text
    # BeautifulSoup objects can be sorted through easily
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': 'data-link'}):
        href = "https://www.espncricinfo.com" + link.get('href')
        title = link.string  # just the text, not the HTML
        source_code = requests.get(href)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "lxml")
        # if you want to gather information from that page
        for item_name in soup.findAll('span', {'class': 'ciPlayerinformationtxt'}):
            print(item_name.string)
The error:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 559, in urlopen
body=body, headers=headers)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 782, in _validate_conn
conn.connect()
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connection.py", line 266, in connect
match_hostname(cert, self.assert_hostname or hostname)
File "C:\Python34\lib\ssl.py", line 285, in match_hostname
% (hostname, ', '.join(map(repr, dnsnames))))
ssl.CertificateError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 369, in send
timeout=timeout
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 588, in urlopen
raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Python34/intplayername.py", line 23, in <module>
source_code = requests.get(href)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 471, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 579, in send
r = adapter.send(request, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 430, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
It's due to a misconfiguration of the HTTPS certificates on the site you want to crawl. As a workaround, you can turn off certificate checking in the requests library:
requests.get(href, verify=False)
Please be advised that this is not a recommended practice when you work with sensitive information.
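If you are making many requests, setting verify on a Session avoids repeating the flag everywhere, and the resulting InsecureRequestWarning can be silenced once. A minimal sketch (the warning class lives in urllib3; nothing here is specific to this site):

```python
import requests
import urllib3

# verify=False disables certificate validation, so urllib3 emits an
# InsecureRequestWarning on every request; silence it once if needed.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.verify = False  # applies to every request made through this session
# session.get(href) now behaves like requests.get(href, verify=False)
```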
I'm currently new to programming and I'm following along with one of Qazi's tutorials. I'm on a section about web scraping, but unfortunately I'm getting errors that I can't seem to find a solution for. Can you please help me out? Thanks.
The error output is below.
Traceback (most recent call last):
File "D:\Users\Vaughan\Qazi\Web Scrapping\webscraping.py", line 6, in <module>
page = requests.get(
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 640, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\vaugh\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 731, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk']'
[Finished in 1.171s]
My line of code is as follows
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import lxml
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=34.09979000000004&lon=-118.32721499999997#.XkzZwCgzaUk')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-forecast-body')
print(week)
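One thing worth noting: the InvalidSchema message above shows the URL wrapped in ['...'], which suggests the code that actually ran passed a one-element list rather than the string shown. requests (at least the 2.x versions in the traceback) stringifies its url argument, and the stringified list no longer starts with a scheme it recognises. A stdlib-only illustration:

```python
from urllib.parse import urlsplit

url = ('https://forecast.weather.gov/MapClick.php'
       '?lat=34.09979000000004&lon=-118.32721499999997')

# The plain string parses with a recognised scheme:
print(urlsplit(url).scheme)   # https

# A one-element list, stringified, starts with "['https" -- no connection
# adapter matches that, hence "No connection adapters were found for '[...]'":
print(str([url])[:8])         # ['https:
```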
So I have been following along with some videos to learn Python, but I can't get rid of this error. I have experience in other languages, so I'm usually fine fixing errors, but no matter what I do, I either get the same error or a different one.
I've tried switching the parser argument from 'xml' to 'lxml', but this only changes the error that I get.
from bs4 import BeautifulSoup
import urllib.request
req = urllib.request.urlopen('http://pythonprogramming.net/')
xml = BeautifulSoup(req, 'xml')
for item in xml.findAll('link'):
    url = item.text
    news = urllib.request.urlopen(url).read()
    print(news)
Ideally, this would print out some of the text within link tags, but instead, I get the following errors -
Error while using xml -
File "/Users/rodrigo/Desktop/ALL/Programming/Python/Python Web Programming/Working with HTML/scrapingParagraphData.py", line 13, in <module>
news = urllib.request.urlopen(url).read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 548, in _open
'unknown_open', req)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1387, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: #media (min-width>
Error while using lxml -
File "/Users/rodrigo/Desktop/ALL/Programming/Python/Python Web Programming/Working with HTML/scrapingParagraphData.py", line 13, in <module>
news = urllib.request.urlopen(url).read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 510, in open
req = Request(fullurl, data)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 328, in __init__
self.full_url = url
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 354, in full_url
self._parse()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''
Your current code targets link elements and extracts their text rather than their href attribute, so there is no known protocol to work with.
Even if you extracted the href, most of them are relative, so you would still have a problem with an unknown protocol.
item['href'] would give:
/static/favicon.ico
/static/css/materialize.min.css
https://fonts.googleapis.com/icon?family=Material+Icons
/static/css/bootstrap.css
I don't think you are after these types of links. If you want the tutorial links, you need a selector that targets those elements, e.g.
tutorial_links = ['https://pythonprogramming.net' + i['href'] for i in xml.select('.waves-light.btn')]
Since the page is HTML rather than XML, I would also rename the variable that holds the BeautifulSoup(req, 'lxml') result:
from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.urlopen('http://pythonprogramming.net/')
soup = BeautifulSoup(req, 'lxml')
tutorial_links = ['https://pythonprogramming.net' + i['href'] for i in soup.select('.waves-light.btn')]
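As an aside, if you do want to follow relative links like the ones listed above, urllib.parse.urljoin resolves them against a base URL and leaves absolute ones untouched, so no special-casing is needed. A stdlib-only sketch:

```python
from urllib.parse import urljoin

base = 'https://pythonprogramming.net/'

# Relative hrefs are resolved against the base:
print(urljoin(base, '/static/css/bootstrap.css'))
# -> https://pythonprogramming.net/static/css/bootstrap.css

# Absolute hrefs pass through unchanged:
print(urljoin(base, 'https://fonts.googleapis.com/icon?family=Material+Icons'))
# -> https://fonts.googleapis.com/icon?family=Material+Icons
```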
I'm trying to implement 2captcha using selenium with Python.
I just copied the example from their documentation:
https://github.com/2captcha/2captcha-api-examples/blob/master/ReCaptcha%20v2%20API%20Examples/Python%20Example/2captcha_python_api_example.py
This is my code:
from selenium import webdriver
from time import sleep
from selenium.webdriver.support.select import Select
import requests
driver = webdriver.Chrome('chromedriver.exe')
driver.get('the_url')
current_url = driver.current_url
captcha = driver.find_element_by_id("captcha-box")
captcha2 = captcha.find_element_by_xpath("//div/div/iframe").get_attribute("src")
captcha3 = captcha2.split('=')
#print(captcha3[2])
# Add these values
API_KEY = 'my_api_key' # Your 2captcha API KEY
site_key = captcha3[2] # site-key, read the 2captcha docs on how to get this
url = current_url # example url
proxy = 'Myproxy' # example proxy
proxy = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
s = requests.Session()
# here we post site key to 2captcha to get captcha ID (and we parse it here too)
captcha_id = s.post("http://2captcha.com/in.php?key={}&method=userrecaptcha&googlekey={}&pageurl={}".format(API_KEY, site_key, url), proxies=proxy).text.split('|')[1]
# then we parse gresponse from 2captcha response
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id), proxies=proxy).text
print("solving ref captcha...")
while 'CAPCHA_NOT_READY' in recaptcha_answer:
    sleep(5)
    recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id), proxies=proxy).text
recaptcha_answer = recaptcha_answer.split('|')[1]
# we make the payload for the post data here, use something like mitmproxy or fiddler to see what is needed
payload = {
    'key': 'value',
    'gresponse': recaptcha_answer  # This is the response from 2captcha, which is needed for the post request to go through.
}
# then send the post request to the url
response = s.post(url, payload, proxies=proxy)
# And that's all there is to it other than scraping data from the website, which is dynamic for every website.
This is my error:
solving ref captcha...
Traceback (most recent call last):
File "main.py", line 38, in
recaptcha_answer = recaptcha_answer.split('|')[1]
IndexError: list index out of range
The captcha is getting solved, because I can see it on the 2captcha dashboard, so what is wrong if this is the official example?
EDIT:
For some reason, without modification, I'm getting the captcha solved by 2captcha, but then I get this error:
solving ref captcha...
OK|this_is_the_2captch_answer
Traceback (most recent call last):
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 594, in urlopen
self._prepare_proxy(conn)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 805, in _prepare_proxy
conn.connect()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connection.py", line 308, in connect
self._tunnel()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 906, in _tunnel
(version, code, message) = response._read_status()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 278, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: <html>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\util\retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\packages\six.py", line 685, in reraise
raise value.with_traceback(tb)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 594, in urlopen
self._prepare_proxy(conn)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 805, in _prepare_proxy
conn.connect()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connection.py", line 308, in connect
self._tunnel()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 906, in _tunnel
(version, code, message) = response._read_status()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 278, in _read_status
raise BadStatusLine(line)
urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('<html>\r\n'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 49, in <module>
response = s.post(url, payload, proxies=proxy)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('<html>\r\n'))
Why am I getting this error?
I'm setting site_key = current_url_where_captcha_is_located
Is this correct?
Use your debugger or put a print(recaptcha_answer) before the failing line to see the value of recaptcha_answer before you call .split('|') on it. There is no | in the string, so when you try to take the second element of the resulting list with [1], it fails.
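A small guard around the split makes the failure visible instead of raising IndexError. A sketch assuming the documented "OK|&lt;token&gt;" success format and plain-text error codes:

```python
def parse_2captcha_answer(answer):
    """Return the token from an "OK|<token>" reply, or raise with the
    service's error code (e.g. ERROR_CAPTCHA_UNSOLVABLE) if there is none."""
    if '|' in answer:
        return answer.split('|', 1)[1]
    raise RuntimeError('2captcha returned an error: ' + answer)

print(parse_2captcha_answer('OK|sometoken'))  # sometoken
```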
It looks like you don't provide any valid proxy connection parameters, yet you are passing this placeholder proxy to requests when connecting to the API.
Just comment these two lines:
#proxy = 'Myproxy' # example proxy
#proxy = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
And then remove proxies=proxy from four lines:
captcha_id = s.post("http://2captcha.com/in.php?key={}&method=userrecaptcha&googlekey={}&pageurl={}".format(API_KEY, site_key, url)).text.split('|')[1]
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
response = s.post(url, payload)
I was able to gather data from a web page using this
import requests
import lxml.html
import re
url = "http://animesora.com/flying-witch-episode-7-english-subtitle/"
r = requests.get(url)
page = r.content
dom = lxml.html.fromstring(page)
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
    down = re.findall('https://.*', link)
    print(down)
when I try this to gather more data on the results of the above code I was presented with this error:
Traceback (most recent call last):
File "/home/sven/PycharmProjects/untitled1/.idea/test4.py", line 21, in <module>
r2 = requests.get(down)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 590, in send
adapter = self.get_adapter(url=request.url)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 672, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757']'
This is the code I was using:
for link2 in down:
    r2 = requests.get(down)
    page2 = r.url
    dom2 = lxml.html.fromstring(page2)
    for link2 in dom2('//div[@class="button green"]//onclick'):
        down2 = re.findall('.*', down2)
        print(down2)
You are passing in the whole list:
for link2 in down:
    r2 = requests.get(down)
Note how you passed in down, not link2. down is a list, not a single URL string.
Pass in link2:
for link2 in down:
    r2 = requests.get(link2)
I'm not sure why you are using regular expressions. In the loop
for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
each link is already a fully qualified URL:
>>> for link in dom.xpath('//div[@class="downloadarea"]//a/@href'):
...     print link
...
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9FZEk2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC95Tmg2Qg==&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC93dFBmVFg=&c=1&user=51757
https://link.safelinkconverter.com/review.php?id=aHR0cDovLygqKC5fKC9zTGZYZ0s=&c=1&user=51757
You don't need to do any further processing on that.
Your remaining code has more errors; you confused r2.url with r2.content, and forgot the .xpath part in your dom2.xpath(...) query.
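On the regex itself: '.*' matches the whole attribute value, so once the loop reaches the onclick attributes it helps to anchor the pattern to the URL. A sketch with a hypothetical onclick value (the real attribute on that site may differ):

```python
import re

# Hypothetical onclick value of the kind a "button green" div might carry:
onclick = "window.open('https://link.safelinkconverter.com/review.php?id=abc&c=1')"

# Match the URL up to the closing quote instead of using '.*',
# which would return the entire JavaScript snippet:
match = re.search(r"https://[^'\"]+", onclick)
print(match.group(0))  # https://link.safelinkconverter.com/review.php?id=abc&c=1
```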
I am trying to implement proxies into my web crawler. Without the proxies, my code has no problem connecting to the website, however when I try to add in proxies, suddenly it won't connect! It doesn't look like anybody in python-requests has made a post about this problem, so I'm hoping you all can help me!
Background info: I'm using a Mac and using Anaconda's Python 3.4 inside of a virtual environment.
Here is my code that works without proxies
proxyDict = {'http': 'http://10.10.1.10:3128'}

def pmc_spider(max_pages, pmid):
    start = 1
    titles_list = []
    url_list = []
    url_keys = []
    while start <= max_pages:
        url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/' + str(pmid) + '/citedby/?page=' + str(start)
        req = requests.get(url)  # this works
        plain_text = req.text
        soup = BeautifulSoup(plain_text, "lxml")
        for items in soup.findAll('div', {'class': 'title'}):
            title = items.get_text()
            titles_list.append(title)
            for link in items.findAll('a'):
                urlkey = link.get('href')
                url_keys.append(urlkey)  # url = base + key
                url = "http://www.ncbi.nlm.nih.gov" + str(urlkey)
                url_list.append(url)
        start += 1
    return titles_list, url_list, authors_list
Based on other posts I'm looking at, I should just be able to replace this:
req = requests.get(url)
with this:
req = requests.get(url, proxies=proxyDict, timeout=2)
But this doesn't work! :( If I run it with this line of code, the terminal gives me a timeout error:
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
chunked=chunked)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 898, in send
self.connect()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 167, in connect
conn = self._new_conn()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)')
And then I get a few of these printed in the terminal with different traces but the same error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/adapters.py", line 403, in send
timeout=timeout
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)'))
Why would the addition of proxies to my code suddenly cause me to timeout? I tried it on several random urls and had the same thing happen. So it seems to be a problem with proxies rather than a problem with my code. However, I'm at the point where I MUST use proxies now so I need to get to the root of this and fix it. I've also tried several different IP addresses for the proxy from a VPN that I use, so I know the IP addresses are valid.
I appreciate your help so much! Thank you!
It looks like you'll need to use an HTTP or HTTPS proxy that actually responds to requests.
The 10.10.1.10:3128 in your code seems to come from the examples in the requests documentation.
Taking a proxy from the list at http://proxylist.hidemyass.com/search-1291967 (which may not be the best source), your proxyDict should look like this: {'http': 'http://209.242.141.60:8080'}
Testing this on the command line seems to work fine:
>>> proxies = {'http' : 'http://209.242.141.60:8080'}
>>> requests.get('http://google.com', proxies=proxies)
<Response [200]>
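A small helper can vet each proxy before the crawl starts, so a dead one fails fast instead of stalling every request. A sketch; the proxy address and test URL below are placeholders, not working endpoints:

```python
import requests

def proxy_works(proxy_url, test_url='http://example.com', timeout=3):
    """Return True if the proxy can fetch test_url within the timeout."""
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        return requests.get(test_url, proxies=proxies, timeout=timeout).ok
    except requests.exceptions.RequestException:
        return False

# A non-routable address should come back False once the timeout expires:
print(proxy_works('http://10.255.255.1:3128', timeout=1))  # False
```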