I am trying to add proxies to my web crawler. Without proxies, my code connects to the website with no problem, but as soon as I add proxies it won't connect! I couldn't find an existing python-requests post about this problem, so I'm hoping you all can help me!
Background info: I'm using a Mac and using Anaconda's Python 3.4 inside of a virtual environment.
Here is my code, which works without proxies:
import requests
from bs4 import BeautifulSoup

proxyDict = {'http': 'http://10.10.1.10:3128'}

def pmc_spider(max_pages, pmid):
    start = 1
    titles_list = []
    url_list = []
    url_keys = []
    while start <= max_pages:
        url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/'+str(pmid)+'/citedby/?page='+str(start)
        req = requests.get(url) #this works
        plain_text = req.text
        soup = BeautifulSoup(plain_text, "lxml")
        for items in soup.findAll('div', {'class': 'title'}):
            title = items.get_text()
            titles_list.append(title)
            for link in items.findAll('a'):
                urlkey = link.get('href')
                url_keys.append(urlkey) #url = base + key
                url = "http://www.ncbi.nlm.nih.gov"+str(urlkey)
                url_list.append(url)
        start += 1
    # authors_list was never defined in this function, so only the two populated lists are returned
    return titles_list, url_list
Based on other posts I'm looking at, I should just be able to replace this:
req = requests.get(url)
with this:
req = requests.get(url, proxies=proxyDict, timeout=2)
But this doesn't work! :( If I run it with this line, the terminal gives me a timeout error:
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
chunked=chunked)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 898, in send
self.connect()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 167, in connect
conn = self._new_conn()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)')
And then I get a few of these printed in the terminal with different traces but the same error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/adapters.py", line 403, in send
timeout=timeout
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)'))
Why would the addition of proxies to my code suddenly cause me to timeout? I tried it on several random urls and had the same thing happen. So it seems to be a problem with proxies rather than a problem with my code. However, I'm at the point where I MUST use proxies now so I need to get to the root of this and fix it. I've also tried several different IP addresses for the proxy from a VPN that I use, so I know the IP addresses are valid.
I appreciate your help so much! Thank you!
It looks like you'll need to use an HTTP or HTTPS proxy that actually responds to requests.
The 10.10.1.10:3128 in your code seems to be the placeholder from the examples in the requests documentation. It is a private address, so unless a proxy really runs there on your own network, the connection attempt simply times out.
Taking a proxy from the list at http://proxylist.hidemyass.com/search-1291967 (may not be the best source), your proxyDict should look like this: {'http' : 'http://209.242.141.60:8080'}
Testing this on the command line seems to work fine:
>>> proxies = {'http' : 'http://209.242.141.60:8080'}
>>> requests.get('http://google.com', proxies=proxies)
<Response [200]>
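As a rough pre-flight check, the sketch below flags proxy URLs whose host is a private or otherwise reserved IP literal, which is what the 10.10.1.10 documentation placeholder is. The helper name looks_unroutable is made up for illustration, and the check is only a heuristic: a private address can be a perfectly valid proxy inside a company network, and a public address can still be dead.

```python
import ipaddress
from urllib.parse import urlsplit


def looks_unroutable(proxy_url):
    """Return True if the proxy host is a private/reserved IP literal.

    Hypothetical helper: it cannot tell whether a proxy is alive, only
    whether the address could be reached from the public internet at all.
    """
    host = urlsplit(proxy_url).hostname
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. a DNS name), so this check says nothing.
        return False
    return address.is_private or address.is_loopback or address.is_reserved


for candidate in ("http://10.10.1.10:3128", "http://209.242.141.60:8080"):
    print(candidate, "unroutable from outside:", looks_unroutable(candidate))
```

If the check returns True for a proxy you expected to be public, the address was most likely copied from an example rather than from a real proxy list.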
I'm trying to learn web scraping in Python with the requests-html package. First, I render the main page and pull out all the necessary links. That works just fine. Then I iterate over all the links and render the specific subpage for each one. Two iterations are successful, but on the third I get an error that I am unable to solve.
Here is my code:
# import HTMLSession from requests_html
from requests_html import HTMLSession
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
baseurl = 'http://www.möbelfreude.de/'
resp = session.get(baseurl+'alle-boxspringbetten')
# Run JavaScript code on webpage
resp.html.render()
links = resp.html.find('a.image-wrapper.text-center')
for link in links:
    print('Rendering... {}'.format(link.attrs['href']))
    r = session.get(baseurl + link.attrs['href'])
    r.html.render()
    print('Completed rendering... {}'.format(link.attrs['href']))
    # do stuff
Error:
Completed rendering... bett/boxspringbett-bea
Rendering... bett/boxspringbett-valina
Completed rendering... bett/boxspringbett-valina
Rendering... bett/boxspringbett-benno-anthrazit
Traceback (most recent call last):
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 603, in urlopen
chunked=chunked)
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1336, in getresponse
response.begin()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 306, in begin
version, status, reason = self._read_status()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 275, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
The error is caused by the server closing the connection, which may be due to some configuration on the server side.
Try scraping the site first and appending the links to a list.
Then request each link individually to locate the specific page that is causing the issue.
The developer tools in Chrome (Network tab) can help identify the headers needed by requests that require them.
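The "collect first, then request one by one" idea could look roughly like this. The function name find_failing_links and the fake_fetch stand-in are invented for the example so that it runs offline; in real use, fetch would be something like session.get.

```python
def find_failing_links(links, fetch):
    """Fetch every link on its own and split them into (ok, failed).

    `fetch` is any callable that takes a URL and raises on failure, for
    example `session.get` from requests or requests_html.
    """
    ok, failed = [], []
    for link in links:
        try:
            fetch(link)
        except OSError as exc:
            # requests' exceptions derive from IOError (an alias of OSError),
            # and so does http.client.RemoteDisconnected.
            failed.append((link, repr(exc)))
        else:
            ok.append(link)
    return ok, failed


# Stand-in for a real session.get, used here so the sketch runs offline.
def fake_fetch(url):
    if url.endswith("benno-anthrazit"):
        raise ConnectionResetError("Remote end closed connection without response")
    return "<html>...</html>"


ok, failed = find_failing_links(
    ["bett/boxspringbett-bea", "bett/boxspringbett-benno-anthrazit"], fake_fetch
)
print("ok:", ok)
print("failed:", failed)
```

Once the failing URLs are known, they can be retried with different headers or a short pause between requests to see which change makes the server answer.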
I'm trying to implement 2captcha using selenium with Python.
I just copied the example from their documentation:
https://github.com/2captcha/2captcha-api-examples/blob/master/ReCaptcha%20v2%20API%20Examples/Python%20Example/2captcha_python_api_example.py
This is my code:
from selenium import webdriver
from time import sleep
from selenium.webdriver.support.select import Select
import requests
driver = webdriver.Chrome('chromedriver.exe')
driver.get('the_url')
current_url = driver.current_url
captcha = driver.find_element_by_id("captcha-box")
captcha2 = captcha.find_element_by_xpath("//div/div/iframe").get_attribute("src")
captcha3 = captcha2.split('=')
#print(captcha3[2])
# Add these values
API_KEY = 'my_api_key' # Your 2captcha API KEY
site_key = captcha3[2] # site-key, read the 2captcha docs on how to get this
url = current_url # example url
proxy = 'Myproxy' # example proxy
proxy = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
s = requests.Session()
# here we post site key to 2captcha to get captcha ID (and we parse it here too)
captcha_id = s.post("http://2captcha.com/in.php?key={}&method=userrecaptcha&googlekey={}&pageurl={}".format(API_KEY, site_key, url), proxies=proxy).text.split('|')[1]
# then we parse gresponse from 2captcha response
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id), proxies=proxy).text
print("solving ref captcha...")
while 'CAPCHA_NOT_READY' in recaptcha_answer:
    sleep(5)
    recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id), proxies=proxy).text
recaptcha_answer = recaptcha_answer.split('|')[1]
# we make the payload for the post data here, use something like mitmproxy or fiddler to see what is needed
payload = {
    'key': 'value',
    'gresponse': recaptcha_answer # This is the response from 2captcha, which is needed for the post request to go through.
}
# then send the post request to the url
response = s.post(url, payload, proxies=proxy)
# And that's all there is to it other than scraping data from the website, which is dynamic for every website.
This is my error:
solving ref captcha...
Traceback (most recent call last):
File "main.py", line 38, in <module>
recaptcha_answer = recaptcha_answer.split('|')[1]
IndexError: list index out of range
The captcha is getting solved, because I can see it on the 2captcha dashboard, so where is the error if this is the official documentation example?
EDIT:
For some reason, without any modification, I'm now getting the captcha solved from 2captcha, but then I get this error:
solving ref captcha...
OK|this_is_the_2captch_answer
Traceback (most recent call last):
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 594, in urlopen
self._prepare_proxy(conn)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 805, in _prepare_proxy
conn.connect()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connection.py", line 308, in connect
self._tunnel()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 906, in _tunnel
(version, code, message) = response._read_status()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 278, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: <html>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\util\retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\packages\six.py", line 685, in reraise
raise value.with_traceback(tb)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 594, in urlopen
self._prepare_proxy(conn)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 805, in _prepare_proxy
conn.connect()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connection.py", line 308, in connect
self._tunnel()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 906, in _tunnel
(version, code, message) = response._read_status()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 278, in _read_status
raise BadStatusLine(line)
urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('<html>\r\n'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 49, in <module>
response = s.post(url, payload, proxies=proxy)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('<html>\r\n'))
Why am I getting this error?
I'm setting as site_key = current_url_where_captcha_is_located
Is this correct?
Use your debugger or put a print(recaptcha_answer) before the error line to see what's the value of recaptcha_answer before you try to call .split('|') on it. There is no | in the string so when you're trying to get the second element of the resulting list with [1] it fails.
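A more defensive way to read the reply is to split it into a status and a value and to treat anything other than OK or CAPCHA_NOT_READY as an error. The sketch below is an illustration only: parse_reply, wait_for_answer and the fetch callable are hypothetical names, and fetch stands in for the s.get(...).text call on res.php.

```python
import time

NOT_READY = "CAPCHA_NOT_READY"  # spelling as used in the question's code


def parse_reply(text):
    """Split 'STATUS|value' into (status, value); value is '' if absent."""
    status, _, value = text.strip().partition("|")
    return status, value


def wait_for_answer(fetch, attempts=24, delay=5, sleep=time.sleep):
    """Poll `fetch()` until the reply is no longer NOT_READY.

    `fetch` is a hypothetical zero-argument callable returning the body of
    the res.php request as text.
    """
    for _ in range(attempts):
        status, value = parse_reply(fetch())
        if status == "OK":
            return value
        if status != NOT_READY:
            raise RuntimeError("unexpected reply from solver: " + status)
        sleep(delay)
    raise TimeoutError("captcha was not solved in time")


# Offline demonstration with canned replies instead of real HTTP calls.
replies = iter([NOT_READY, NOT_READY, "OK|this_is_the_2captch_answer"])
answer = wait_for_answer(lambda: next(replies), sleep=lambda seconds: None)
print(answer)
```

With this shape, a reply without a | raises a readable RuntimeError that names the reply instead of an IndexError.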
It looks like you are not providing any valid proxy connection parameters, yet you pass this proxy to requests when connecting to the API.
Just comment out these two lines:
#proxy = 'Myproxy' # example proxy
#proxy = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
And then remove proxies=proxy from four lines:
captcha_id = s.post("http://2captcha.com/in.php?key={}&method=userrecaptcha&googlekey={}&pageurl={}".format(API_KEY, site_key, url)).text.split('|')[1]
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
response = s.post(url, payload)
I wrote my first program in Python.
#This program casts votes in online poll using different proxy servers for each request.
#It works, but some proxy servers cause errors crashing the whole thing.
#To avoid that, I would like it to skip those servers and ignore the errors.
import requests
import time
#Votes to be cast
votes = 5
#Makes proxy list
f=open('proxy2.txt')
lines=f.read().splitlines()
f.close()
#Vote counter
i = 1
#Proxy list counter
j = 0
while (i<=votes):
    #Tests and moves to next proxy if there was a problem.
    try:
        r = requests.get('http://www.google.com')
    except requests.exceptions.RequestException:
        j = j + 1
    #Headers copied from my browser. Some of them cause errors. Could you tell me why?
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        #'Accept-Encoding': 'gzip, deflate',
        #'Accept-Language': 'pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4',
        #'Cache-Control': 'max-age=0',
        #'Connection': 'keep-alive',
        #'Content-Length': '101',
        'Content-Type': 'application/x-www-form-urlencoded',
        #'Host': 'www.mylomza.pl',
        #'Origin': 'http://www.mylomza.pl',
        #'Referer': 'http://www.mylomza.pl/home/lomza/item/11780-wybierz-miss-%C5%82ks-i-portalu-mylomzapl-video-i-foto.html',
        #'Upgrade-Insecure-Requests': '1',
        #'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
    }
    proxies = {
        'http': 'http://'+lines[j] #31.207.0.99:3128
    }
    r = requests.get('http://www.mylomza.pl/home/lomza/item/11780-wybierz-miss-%C5%82ks-i-portalu-mylomzapl-video-i-foto.html', headers=headers, proxies=proxies, timeout=10)
    #The funny part - form, that I have to post, requires some kind of ID and this is my way of getting it :P Feel free to suggest an alternative way.
    userid = r.text[(22222-32):22222]
    print('Voter', userid, 'registered.')
    data = {
        'voteid': '141',
        'task_button': 'Głosuj',
        'option': 'com_poll',
        'task': 'vote',
        'id': '25',
        userid: '1'
    }
    r = requests.post('http://www.mylomza.pl/home/lomza/item/index.php', headers=headers, cookies=r.cookies, data=data, proxies=proxies, timeout=10)
    print('Vote nr', i, 'cast from', lines[i])
    i = i + 1
    j = j + 1
    time.sleep(1)
What I need is to make it handle exceptions and errors.
#Tests and moves to next proxy if there was a problem.
try:
    r = requests.get('http://www.google.com')
except requests.exceptions.RequestException:
    j = j + 1
Besides that, I could use a better way of achieving this:
#The funny part - form, that I have to post, requires some kind of ID and this is my way of getting it :P Feel free to suggest an alternative way.
userid = r.text[(22222-32):22222]
Sometimes my method doesn't work (example below). First vote went through, second didn't and then all crashed.
Voter 53bf55490ebd07d9c190787c5c6ca44c registered.
Vote nr 1 cast from 111.23.6.161:80
Voter registered.
Vote nr 2 cast from 94.141.102.203:8080
Traceback (most recent call last):
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 142, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\connection.py", line 91, in create_connection
raise err
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\connection.py", line 81, in create_connection
sock.connect(sa)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 578, in urlopen
chunked=chunked)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request
self._send_request(method, url, body, headers)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1128, in _send_request
self.endheaders(body)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1079, in endheaders
self._send_output(message_body)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 911, in _send_output
self.send(msg)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 854, in send
self.connect()
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 167, in connect
conn = self._new_conn()
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.HTTPConnection object at 0x03612730>, 'Connection to 94.141.102.203 timed out. (connect timeout=10)')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\adapters.py", line 403, in send
timeout=timeout
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='94.141.102.203', port=8080): Max retries exceeded with url: http://www.mylomza.pl/home/lomza/item/11780-wybierz-miss-%C5%82ks-i-portalu-mylomzapl-video-i-foto.html (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x03612730>, 'Connection to 94.141.102.203 timed out. (connect timeout=10)'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\PollVoter.py", line 50, in <module>
r = requests.get('http://www.mylomza.pl/home/lomza/item/11780-wybierz-miss-%C5%82ks-i-portalu-mylomzapl-video-i-foto.html', headers=headers, proxies=proxies, timeout=10)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 71, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 57, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 585, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Adrian\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\adapters.py", line 459, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='94.141.102.203', port=8080): Max retries exceeded with url: http://www.mylomza.pl/home/lomza/item/11780-wybierz-miss-%C5%82ks-i-portalu-mylomzapl-video-i-foto.html (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x03612730>, 'Connection to 94.141.102.203 timed out. (connect timeout=10)'))
It looks like you're flooding the server with too many requests. That would explain errors such as requests.packages.urllib3.exceptions.MaxRetryError, since the server likely throttles the number of connections you can make in a given amount of time. You can try handling each of the exceptions listed in your output, and you can also try making fewer attempts at the URL you're requesting.
[Edit] Or, if you want to brute-force it and swallow all errors and exceptions, try the following instead:
except:
    j = j + 1
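A bare except also swallows KeyboardInterrupt and genuine bugs, so a narrower variant may be easier to live with. The sketch below tries each proxy in turn and moves on only when the attempt raises an OSError, which is the base class of requests.exceptions.RequestException. The names first_working_proxy and attempt are invented for the example, and attempt stands in for the real requests call made through the proxy.

```python
def first_working_proxy(proxy_list, attempt, start=0):
    """Return (index, result) for the first proxy where `attempt` succeeds.

    `attempt` is a hypothetical callable taking a proxies dict; with
    requests it could wrap requests.post(..., proxies=proxies, timeout=10).
    """
    for index in range(start, len(proxy_list)):
        proxies = {"http": "http://" + proxy_list[index]}
        try:
            return index, attempt(proxies)
        except OSError as exc:
            # requests.exceptions.RequestException subclasses IOError/OSError,
            # so timeouts and refused connections land here.
            print("skipping", proxy_list[index], "->", exc)
    raise RuntimeError("no working proxy left in the list")


# Offline stand-in for the real request: the first proxy always times out.
def fake_attempt(proxies):
    if "94.141.102.203" in proxies["http"]:
        raise TimeoutError("timed out")
    return "vote accepted via " + proxies["http"]


index, result = first_working_proxy(
    ["94.141.102.203:8080", "111.23.6.161:80"], fake_attempt
)
print(index, result)
```

The returned index can be fed back in as start on the next round, so a proxy that already failed is not tried again.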
[Edit:] You could try https: as well as http:
[Edit] Found this:
If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.
r = requests.get('https://github.com', timeout=None)
PROBLEM SOLVED
It turns out that I shouldn't open more than one connection per proxy server.
But I have to make two requests. The solution was to send the first one from my own IP and then switch to the proxy for the second one:
r = requests.get(url, headers=headers, timeout=timeout)
try:
    r = requests.post(url, headers=headers, cookies=r.cookies, data=data, timeout=timeout, proxies=proxies)
except:
    j = j + 1
Works perfectly so far. :)
I had a similar problem and used
except:
    continue
which jumps to the next iteration of the loop whenever an exception occurs, so the script keeps trying.
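On the question's side issue of extracting the voter ID: slicing the page at a fixed offset breaks as soon as the page length shifts, which may be why the second run printed an empty ID. Searching for the 32-character hexadecimal token with a regular expression is less fragile. The HTML shape assumed below (a hidden input whose name is the token and whose value is 1) is a guess based on the posted form data, not something taken from the actual page.

```python
import re

# Assumed markup: <input type="hidden" name="<32 hex chars>" value="1" />
TOKEN_RE = re.compile(r'name="([0-9a-f]{32})"\s+value="1"')


def extract_token(html):
    """Return the 32-character hex token, or None if it is not on the page."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


sample = '<input type="hidden" name="53bf55490ebd07d9c190787c5c6ca44c" value="1" />'
print(extract_token(sample))
print(extract_token("<html>error page from a broken proxy</html>"))
```

A None result is a useful signal in itself: the proxy probably returned something other than the poll page, so that vote can be skipped instead of posted with an empty ID.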
The following request from a Python client to Elasticsearch fails:
2014-12-19 13:39:05,429 WARNING GET http://10.129.0.53:9200/delivery-logs-index.prod-20141218/_search?timeout=20m [status:N/A request:10.010s]
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/elasticsearch/connection/http_urllib3.py", line 46, in perform_request
response = self.pool.urlopen(method, url, body, retries=False, headers=headers, **kw)
File "/usr/lib/python2.6/site-packages/urllib3/connectionpool.py", line 559, in urlopen
_pool=self, _stacktrace=stacktrace)
File "/usr/lib/python2.6/site-packages/urllib3/util/retry.py", line 223, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python2.6/site-packages/urllib3/connectionpool.py", line 516, in urlopen
body=body, headers=headers)
File "/usr/lib/python2.6/site-packages/urllib3/connectionpool.py", line 336, in _make_request
self, url, "Read timed out. (read timeout=%s)" % read_timeout)
ReadTimeoutError: HTTPConnectionPool(host=u'10.129.0.53', port=9200): Read timed out. (read timeout=10)
Elasticsearch([es_host],
sniff_on_start=True,
max_retries=100,
retry_on_timeout=True,
sniff_on_connection_fail=True,
sniff_timeout=1000)
Is there a way to increase the request timeout? Currently it seems to be configured by default to read timeout=10
You can try passing a request_timeout value (in seconds) with your request, like:
res = client.search(index=blabla, search_type="count", timeout="20m", request_timeout="10000", body={
You can also pass timeout=60 when instantiating the client object (60 meaning 60 seconds and of course being only an example).
This parameter overrides the 10s default specified in the Connection constructor.
https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/connection/base.py#L27
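Note that the timeout=20m visible in the logged URL is the server-side search timeout, while the 10 seconds in the error is the client's socket read timeout, and the two are set separately. The sketch below only assembles keyword arguments so that it runs without a cluster. The helper name search_kwargs and the index name are placeholders, and the parameter names are the ones mentioned above for the elasticsearch-py version in question; newer client versions may differ.

```python
def search_kwargs(index, body, server_timeout="20m", client_timeout_s=1200):
    """Build keyword arguments for client.search().

    server_timeout   -> sent to Elasticsearch as the ?timeout= query parameter
    client_timeout_s -> how long the Python client waits on the socket, in seconds
    """
    return {
        "index": index,
        "body": body,
        "timeout": server_timeout,
        "request_timeout": client_timeout_s,
    }


kwargs = search_kwargs("my-index", {"query": {"match_all": {}}})
print(kwargs)
# With a real client this would be used as: client.search(**kwargs)
```

Keeping the client-side value at least as large as the server-side one avoids the client giving up while the server is still working on the query.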
I'm trying to fetch some data from http://m.finnkino.fi/events/now_showing, but at the moment I'm failing badly because I'm not even able to load the page source with python.
At the moment I'm using the following code:
req = urllib2.urlopen(URL,None,2.5)
page = req.read()
print page
Here is the traceback for timeout error:
Traceback (most recent call last):
File "user/src/finnkinoParser.py", line 26, in <module>
main()
File "user/src/finnkinoParser.py", line 13, in main
getNowPlayingMovies()
File "user/src/finnkinoParser.py", line 17, in getNowPlayingMovies
req = urllib2.urlopen(baseURL,None,2.5)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 383, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 401, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1130, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1105, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error timed out>
If I browse to the URL with my browser, it works fine. So could someone tell me what makes this site so different that urllib2 is unable to load the page? I suppose it has something to do with the site being aimed at mobile users. With "regular" sites urllib2 works fine. Are there other kinds of sites for which the basic urlopen(URL) doesn't work?
Thanks for the help.
The following snippet works fine:
import httplib
headers = {"User-Agent": "Mozilla/5.0"}
conn = httplib.HTTPConnection("m.finnkino.fi")
conn.request("GET", "/events/now_showing", "", headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
print data
conn.close()
It seems their server checks several properties of the request. After some testing, here is my conclusion:
The HTTP protocol version must be HTTP/1.1.
If the request headers include Connection, its value should be keep-alive.
The request headers must include User-Agent, whatever its value.
In urllib2, however, HTTPHandler sets the Connection header to close by default (line 1127 in urllib2.py). You can use urlgrabber or another HTTP handler that supports HTTP/1.1 and keep-alive.
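For readers on Python 3, where httplib became http.client, the working httplib snippet translates roughly as follows. This is a sketch: the fetch helper and its timeout default are my own additions, and the code cannot guarantee that the site still behaves as described. http.client speaks HTTP/1.1 and does not add Connection: close on its own, which matches the conditions listed.

```python
import http.client

# User-Agent value taken from the snippet above; keep-alive per the findings.
HEADERS = {"User-Agent": "Mozilla/5.0", "Connection": "keep-alive"}


def fetch(host, path, timeout=10):
    """GET http://host/path and return (status, reason, body_bytes)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path, headers=HEADERS)
        response = conn.getresponse()
        return response.status, response.reason, response.read()
    finally:
        conn.close()


# Example call (needs network access, so it is left commented out):
# status, reason, body = fetch("m.finnkino.fi", "/events/now_showing")
# print(status, reason, len(body))
```

The explicit timeout keeps the call from hanging indefinitely if the server silently drops the request, as it did with urllib2.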