I am using StormProxies to access Etsy data, but despite using proxies and implementing retries I am getting a 429 Too Many Requests error most of the time (~80%+). Here is my code to access data:
import time
import traceback

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_request(url, logging, headers={}, is_proxy=True):
    r = None
    try:
        proxies = {
            'http': 'http://{}'.format(PROXY_GATEWAY_IP),
            'https': 'http://{}'.format(PROXY_GATEWAY_IP),
        }
        with requests.Session() as s:
            # Retry transient failures; by default urllib3 also waits for the Retry-After header on 429s.
            retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504, 429])
            adapter = HTTPAdapter(max_retries=retries)
            # Mount the adapter for both schemes, otherwise https:// requests get no retries.
            s.mount('http://', adapter)
            s.mount('https://', adapter)
            if is_proxy:
                r = s.get(url, proxies=proxies, timeout=30, headers=headers)
            else:
                r = s.get(url, headers=headers, timeout=30)
            r.raise_for_status()
            if r.status_code != 200:
                print('Status Code = ', r.status_code)
                if logging is not None:
                    logging.info('Status Code = ' + str(r.status_code))
    except Exception as ex:
        print('Exception occurred in create_request for the url:- {url}'.format(url=url))
        crash_date = time.strftime("%Y-%m-%d %H:%M:%S")
        crash_string = "".join(traceback.format_exception(type(ex), ex, ex.__traceback__))
        exception_string = '[' + crash_date + '] - ' + crash_string + '\n'
        print('Could not connect. Proxy issue or something else')
        print('==========================================================')
        print(exception_string)
    finally:
        return r
The StormProxies support people told me to implement retries; this is how I have done it, but it is not working for me.
I am using Python multiprocessing and spawning 30+ threads at a time.
My recommendation is to get rid of the huge thread-management overhead inside a single process (30+ threads is really a lot).
It is more efficient to use more processes with only a few threads each (2-4 threads, depending on how much time is spent waiting on I/O), because all threads within one process have to contend for the GIL (Global Interpreter Lock). In that case everything comes down to how you configure your Python code.
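To illustrate that split, here is a minimal sketch (not the poster's code: fetch, worker, the URL list and the pool sizes are all placeholders) where each worker process runs its own small thread pool, so only a handful of threads share a GIL:

# Sketch: a few processes, each with a small thread pool (names and sizes are illustrative).
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import requests


def fetch(url):
    # Placeholder for the real per-URL request logic (e.g. create_request above).
    return url, requests.get(url, timeout=30).status_code


def worker(chunk):
    # Each process handles its chunk with only a few threads, limiting GIL contention.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fetch, chunk))


if __name__ == '__main__':
    urls = ['https://example.com/page{}'.format(i) for i in range(100)]  # stand-in URL list
    chunks = [urls[i::4] for i in range(4)]  # split the work across 4 processes
    with ProcessPoolExecutor(max_workers=4) as procs:
        for result in procs.map(worker, chunks):
            print(result)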
Related
I'm trying to speed up API requests using multithreading.
I don't understand why, but I often get the same API response for different calls (they should not have the same response). In the end I get a lot of duplicates in my new file and a lot of rows are missing.
Example: request.post("id=5555") --> returns the response for request.post("id=444") instead of request.post("id=5555").
It looks like the workers pick up the wrong responses.
Has anybody faced this issue?
def request_data(id, useragent):
    # - ADD ID to data and useragent to headers -
    time.sleep(0.2)
    resp = requests.post(
        -URL-,
        params=params,
        headers=headerstemp,
        cookies=cookies,
        data=datatemp,
    )
    return resp

df = pd.DataFrame(columns=["ID", "prenom", "nom", "adresse", "tel", "mail", "prem_dispo", "capac_acc", "tarif_haut", "tarif_bas", "presentation", "agenda"])
ids = pd.read_csv('ids.csv')
ids.drop_duplicates(inplace=True)
ids = list(ids['0'].to_numpy())

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_to_url = {executor.submit(request_data, id, usera): id for id in ids}
    for future in concurrent.futures.as_completed(future_to_url):
        ok = False
        while not ok:
            try:
                resp = future.result()
                ok = True
            except Exception as e:
                print(e)
        df.loc[len(df)] = parse(json.loads(resp))
I tried using asyncio, following the first response to "Multiple async requests simultaneously", but it returned the request object and not the API response...
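One common cause of this symptom, assuming headerstemp, datatemp and params are module-level dicts that each call mutates in place, is shared mutable state between threads: one worker overwrites the dict while another worker's request is still being built. A minimal sketch of the alternative, building per-request copies inside request_data (the base dicts, field names and URL below are placeholders, not the originals):

import time

import requests

BASE_HEADERS = {"Accept": "application/json"}   # placeholder base headers
BASE_DATA = {"Button1": "Search"}               # placeholder base form data
URL = "https://example.com/api"                 # placeholder endpoint


def request_data(id, useragent):
    # Build fresh dicts for this call instead of mutating shared globals,
    # so concurrent threads cannot overwrite each other's ID or user agent.
    headers = dict(BASE_HEADERS, **{"User-Agent": useragent})
    data = dict(BASE_DATA, id=id)
    time.sleep(0.2)
    resp = requests.post(URL, headers=headers, data=data, timeout=30)
    return resp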
I'm trying to learn how I can use timeout within a session while sending requests. The way I've tried below can fetch the content of a webpage but I'm not sure this is the right way as I could not find the usage of timeout in this documentation.
import requests
link = "https://stackoverflow.com/questions/tagged/web-scraping"
with requests.Session() as s:
    r = s.get(link, timeout=5)
    print(r.text)
How can I use timeout within session?
According to the documentation (Quickstart):
You can tell Requests to stop waiting for a response after a given
number of seconds with the timeout parameter. Nearly all production code should use this parameter in nearly all requests.
requests.get('https://github.com/', timeout=0.001)
Or from the Documentation Advanced Usage you can set 2 values (connect and read timeout)
The timeout value will be applied to both the connect and the read
timeouts. Specify a tuple if you would like to set the values
separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
Making a Session-Wide Timeout
I searched throughout the documentation and it seems it is not possible to set the timeout parameter session-wide.
But there is an open GitHub issue (Consider making Timeout option required or have a default) which provides a workaround in the form of an HTTPAdapter subclass you can use like this:
import requests
from requests.adapters import HTTPAdapter
class TimeoutHTTPAdapter(HTTPAdapter):
    def __init__(self, *args, **kwargs):
        if "timeout" in kwargs:
            self.timeout = kwargs["timeout"]
            del kwargs["timeout"]
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        timeout = kwargs.get("timeout")
        if timeout is None and hasattr(self, 'timeout'):
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)
And mount on a requests.Session()
s = requests.Session()
s.mount('http://', TimeoutHTTPAdapter(timeout=5)) # 5 seconds
s.mount('https://', TimeoutHTTPAdapter(timeout=5))
...
r = s.get(link)
print(r.text)
or similarly you can use the EnhancedSession proposed by @GordonAitchJay:
with EnhancedSession(5) as s:  # 5 seconds
    r = s.get(link)
    print(r.text)
I'm not sure this is the right way as I could not find the usage of timeout in this documentation.
Scroll to the bottom. It's definitely there. You can search for it in the page by pressing Ctrl+F and entering timeout.
You're using timeout correctly in your code example.
You can actually specify the timeout in a few different ways, as explained in the documentation:
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.
r = requests.get('https://github.com', timeout=None)
Try using https://httpstat.us/200?sleep=5000 to test your code.
For example, this raises an exception because 0.2 seconds is not long enough to establish a connection with the server:
import requests
link = "https://httpstat.us/200?sleep=5000"
with requests.Session() as s:
    try:
        r = s.get(link, timeout=(0.2, 10))
        print(r.text)
    except requests.exceptions.Timeout as e:
        print(e)
Output:
HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=0.2)
This raises an exception because the server waits for 5 seconds before sending the response, which is longer than the 2 second read timeout set:
import requests
link = "https://httpstat.us/200?sleep=5000"
with requests.Session() as s:
    try:
        r = s.get(link, timeout=(3.05, 2))
        print(r.text)
    except requests.exceptions.Timeout as e:
        print(e)
Output:
HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=2)
You specifically mention using a timeout within a session. So maybe you want a session object which has a default timeout. Something like this:
import requests
link = "https://httpstat.us/200?sleep=5000"
class EnhancedSession(requests.Session):
    def __init__(self, timeout=(3.05, 4)):
        self.timeout = timeout
        return super().__init__()

    def request(self, method, url, **kwargs):
        print("EnhancedSession request")
        if "timeout" not in kwargs:
            kwargs["timeout"] = self.timeout
        return super().request(method, url, **kwargs)
session = EnhancedSession()
try:
    response = session.get(link)
    print(response)
except requests.exceptions.Timeout as e:
    print(e)

try:
    response = session.get(link, timeout=1)
    print(response)
except requests.exceptions.Timeout as e:
    print(e)

try:
    response = session.get(link, timeout=10)
    print(response)
except requests.exceptions.Timeout as e:
    print(e)
Output:
EnhancedSession request
HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=4)
EnhancedSession request
HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=1)
EnhancedSession request
<Response [200]>
I have been building my own Python (version 3.2.1) trading application against a practice account with a Forex provider (OANDA), but I am having some issues receiving the streaming prices on a Debian-based Linux OS.
In particular, I have followed their "Python streaming rates" guide available here: http://developer.oanda.com/rest-live/sample-code/.
I have a thread calling the function 'connect_to_stream' which prints out all the ticks received from the server:
streaming_thread = threading.Thread(target=streaming.connect_to_stream, args=[])
streaming_thread.start()
The streaming.connect_to_stream function is defined as following:
def connect_to_stream():
    [..]  # provider-related info is passed here
    try:
        s = requests.Session()
        url = "https://" + domain + "/v1/prices"
        headers = {'Authorization' : 'Bearer ' + access_token,
                   'Connection' : 'keep-alive'
                  }
        params = {'instruments' : instruments, 'accountId' : account_id}
        req = requests.Request('GET', url, headers = headers, params = params)
        pre = req.prepare()
        resp = s.send(pre, stream = True, verify = False)
    except Exception as e:
        s.close()
        print ("Caught exception when connecting to stream\n%s" % str(e))
        return

    if resp.status_code != 200:
        print (resp.text)
        return

    for line in resp.iter_lines(1):
        if line:
            try:
                msg = json.loads(line)
                print(msg)
            except Exception as e:
                print ("Caught exception when connecting to stream\n%s" % str(e))
                return
The msg variable contains the tick received for the streaming.
The problem is that I receive ticks for three hours on average after which the connection gets dropped and the script either hangs without receiving any ticks or throws an exception with reason "Connection Reset by Peer".
Could you please share any thoughts on where I am going wrong here? Is it anything related to the requests library (iter_lines maybe)?
I would like to receive ticks indefinitely unless a Keyboard exception is raised.
Thanks
It doesn't seem too strange to me that a service would close connections that have been alive for more than 3 hours.
That's probably a safety measure on their side to free their server sockets from ghost clients.
So you should probably just reconnect when you are disconnected.
import errno
from socket import error as SocketError  # needed for the socket-level exception below

try:
    s = requests.Session()
    url = "https://" + domain + "/v1/prices"
    headers = {'Authorization' : 'Bearer ' + access_token,
               'Connection' : 'keep-alive'
              }
    params = {'instruments' : instruments, 'accountId' : account_id}
    req = requests.Request('GET', url, headers = headers, params = params)
    pre = req.prepare()
    resp = s.send(pre, stream = True, verify = False)
    return resp
except SocketError as e:
    if e.errno == errno.ECONNRESET:
        pass  # connection has been reset, reconnect.
except Exception as e:
    pass  # other exceptions, but you'll probably need to reconnect too.
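To make the "just reconnect" concrete, here is a minimal sketch of a reconnection loop. The connect and handle_lines callables are hypothetical stand-ins for the connect-and-iterate logic in the question, and the retry delay is arbitrary:

import time

import requests


def stream_forever(connect, handle_lines, retry_delay=5):
    # Re-establish the stream whenever it drops; KeyboardInterrupt is not an
    # Exception subclass, so Ctrl+C still stops the loop.
    while True:
        try:
            resp = connect()                      # e.g. connect_to_stream() from the question
            if resp is None or resp.status_code != 200:
                raise requests.exceptions.ConnectionError("bad response from stream endpoint")
            handle_lines(resp)                    # iterate resp.iter_lines() and print the ticks
        except Exception as e:
            # Covers "Connection reset by peer", read timeouts, and similar drops.
            print("Stream dropped (%s); reconnecting in %s seconds" % (e, retry_delay))
            time.sleep(retry_delay)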
I would like the following script to try every URL in url_list: if it exists, print "exist(url)"; if not, print "don't(url)"; and if the request times out, skip to the next URL. I am using the "requests" lib:
import requests

url_list = ['www.google.com', 'www.urlthatwilltimeout.com', 'www.urlthatdon\'t exist']

def exist():
    if request.status_code == 200:
        print "exist{0}".format(url)
    else:
        print "don\'t{0}".format(url)

a = 0
while a < len(url_list):
    url = url_list[a]
    try:
        request = requests.get(url, timeout=10)
    except request.timeout:  # any option that is similar?
        print "timed out"
        a += 1
        continue
    exist()
    a += 1
Based on this SO answer
below is code which will limit the total time taken by a GET request as well
as discern other exceptions that may happen.
Note that in requests 2.4.0 and later you may specify a connection timeout and read timeout
by using the syntax:
requests.get(..., timeout=(...conn timeout..., ...read timeout...))
The read timeout, however, only specifies the timeout between individual
read calls, not a timeout for the entire request.
Code:
import requests
import eventlet
eventlet.monkey_patch()
url_list = ['http://localhost:3000/delay/0',
'http://localhost:3000/delay/20',
'http://localhost:3333/', # no server listening
'http://www.google.com'
]
for url in url_list:
    try:
        with eventlet.timeout.Timeout(1):
            response = requests.get(url)
        print "OK -", url
    except requests.exceptions.ReadTimeout:
        print "READ TIMED OUT -", url
    except requests.exceptions.ConnectionError:
        print "CONNECT ERROR -", url
    except eventlet.timeout.Timeout, e:
        print "TOTAL TIMEOUT -", url
    except requests.exceptions.RequestException, e:
        print "OTHER REQUESTS EXCEPTION -", url, e
And here is an express server you can use to test it:
var express = require('express');
var sleep = require('sleep')
var app = express();
app.get('/delay/:secs', function(req, res) {
    var secs = parseInt( req.params.secs )
    sleep.sleep(secs)
    res.send('Done sleeping for ' + secs + ' seconds')
});

app.listen(3000, function () {
    console.log('Example app listening on port 3000!');
});
Hi, so I have written a multithreaded request and response handler using the requests-futures library.
However, it seems to be very slow and not asynchronous as I would imagine. The output is slow and in order, not interlaced as I would expect if it were threading properly.
My question is: why is my code slow, and what can I do to speed it up? An example would be great.
here is the code:
#!/usr/bin/python
import requests
import time
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession

session = FuturesSession(executor=ThreadPoolExecutor(max_workers=12))

def responseCallback(sess, resp):
    response = resp.text
    if "things are invalid" not in response:
        resp.data = "SUCCESS %s" % resp.headers['content-length']
    else:
        resp.data = "FAIL %s" % resp.headers['content-length']

proxies = {
    "http": "http://localhost:8080",
    "https": "https://localhost:8080"
}

url = 'https://www.examplehere.com/blah/etc/'

headers = {
    'Host': 'www.examplehere.com',
    'Connection': 'close',
    'Cache-Control': 'max-age=0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Origin': 'https://www.examplehere.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/533.32 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.123 Chrome/34.0.1847.123 Safari/337.12',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'https://www.exampleblah.etc/',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8,de;q=0.6',
    'Cookie': 'blah=123; etc=456;',
}

for n in range(0, 9999):
    #wibble = n.zfill( 4 )
    wibble = "%04d" % n
    payload = {
        'name': 'test',
        'genNum': wibble,
        'Button1': 'Push+Now'
    }
    #print payload
    #r = requests.post( url, data=payload, headers=headers, proxies=proxies, verify=False )
    future = session.post(url, data=payload, headers=headers, verify=False,
                          background_callback=responseCallback)
    response = future.result()
    print("%s : %s" % (wibble, response.data))
Ideally I'd like to fix my actual code while still using the library I have already utilised, but if it's bad for some reason I'm open to suggestions...
edit: I am currently using Python 2 with the concurrent.futures backport.
edit: slow - approx. one request a second, and not concurrent but one after the other, so request1, response1, request2, response2 - I would expect them to be interlaced as the requests go out and come in on multiple threads.
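For what it's worth, the serialisation most likely comes from calling future.result() immediately after each session.post(): that blocks until the response arrives before the next request is even submitted. Here is a minimal sketch of the alternative, submitting a batch first and only then collecting results, which is essentially the pattern the answer below uses (it reuses session, url, headers and responseCallback from the snippet above; the batch size of 100 is arbitrary):

# Sketch: submit a batch of requests first, then block on the results.
from concurrent.futures import as_completed

futures = {}
for n in range(0, 100):
    wibble = "%04d" % n
    payload = {'name': 'test', 'genNum': wibble, 'Button1': 'Push+Now'}
    f = session.post(url, data=payload, headers=headers, verify=False,
                     background_callback=responseCallback)
    futures[f] = wibble  # remember which payload each future belongs to

for f in as_completed(futures):
    resp = f.result()
    print("%s : %s" % (futures[f], resp.data))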
The following code is another way to submit multiple requests, work on several of them at a time, then print out the results. The results are printed as they are ready, not necessarily in the same order as when they were submitted.
It also uses extensive logging, to help debug issues. It captures the payload for logging. Multithreaded code is hard, so more logs is more better!
source
import logging, sys
import concurrent.futures as cf
from requests_futures.sessions import FuturesSession
URL = 'http://localhost'
NUM = 3
logging.basicConfig(
    stream=sys.stderr, level=logging.INFO,
    format='%(relativeCreated)s %(message)s',
)

session = FuturesSession()
futures = {}

logging.info('start')

for n in range(NUM):
    wibble = "%04d" % n
    payload = {
        'name':'test',
        'genNum':wibble,
        'Button1':'Push+Now'
    }
    future = session.get( URL, data=payload )
    futures[future] = payload

logging.info('requests done, waiting for responses')

for future in cf.as_completed(futures, timeout=5):
    res = future.result()
    logging.info(
        "wibble=%s, %s, %s bytes",
        futures[future]['genNum'],
        res,
        len(res.text),
    )

logging.info('done!')
output
69.3101882935 start
77.9430866241 Starting new HTTP connection (1): localhost
78.3731937408 requests done, waiting for responses
79.4050693512 Starting new HTTP connection (2): localhost
84.498167038 wibble=0000, <Response [200]>, 612 bytes
85.0481987 wibble=0001, <Response [200]>, 612 bytes
85.1981639862 wibble=0002, <Response [200]>, 612 bytes
85.2642059326 done!