Re-use HTTP connection in Python 3

I am making a bunch of requests to website X every second, as of now with the standard urllib package, like so (the request returns JSON):
import json
import threading
import time
import urllib.request

def makerequests():
    request = urllib.request.Request('http://www.X.com/Y')
    while True:
        time.sleep(0.2)
        response = urllib.request.urlopen(request)
        data = json.loads(response.read().decode('utf-8'))

for i in range(4):
    t = threading.Thread(target=makerequests)
    t.start()
However, because I'm making so many requests, after about 500 requests the website returns HTTPError 429: Too Many Requests. I was thinking it might help to re-use the initial TCP connection, but I noticed it is not possible to do this with the urllib package.
So I did some googling and discovered that the following packages might help:
Requests
http.client
socket?
So I have a question: which one is best suited for my situation, and can someone show an example of using it (for Python 3)?

requests handles keep-alive automatically if you use a Session. This might not actually help you if the server is rate-limiting requests; however, requests also handles parsing JSON, so that's a good reason to use it. Here's an example:
import time
import requests

s = requests.Session()
while True:
    time.sleep(0.2)
    response = s.get('http://www.X.com/y')
    data = response.json()
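If you would rather stay with the standard library, http.client can also re-use a single TCP connection; whether it actually stays open depends on the server's keep-alive behaviour. A minimal sketch, re-using the host and path from the question:

import http.client
import json
import time

# One HTTPConnection object holds one underlying socket; each request/response
# pair re-uses it as long as the server allows keep-alive.
conn = http.client.HTTPConnection('www.X.com')
while True:
    time.sleep(0.2)
    conn.request('GET', '/Y')
    response = conn.getresponse()
    data = json.loads(response.read().decode('utf-8'))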

Related

How do I solve HTTP Error 429: Too Many Requests? [duplicate]

I am trying to use Python to log in to a website and gather information from several webpages, and I get the following error:
Traceback (most recent call last):
  File "extract_test.py", line 43, in <module>
    response=br.open(v)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code
I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?
Here's my code:
import mechanize
import cookielib
import re

first = ("example.com/page1")
second = ("example.com/page2")
third = ("example.com/page3")
fourth = ("example.com/page4")
## I have seven URL's I want to open
urls_list = [first, second, third, fourth]

br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Log in credentials
br.open("example.com")
br.select_form(nr=0)
br["username"] = "username"
br["password"] = "password"
br.submit()

for url in urls_list:
    br.open(url)
    print re.findall("Some String")
Receiving a status 429 is not an error; it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP; you should simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a "Retry-After" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and sleep your process for that many seconds.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3
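For instance, a minimal sketch of that approach (not from the original answer; the URL is a placeholder, and it assumes Retry-After carries a delay in seconds rather than an HTTP date):

import time
import requests

response = requests.get('http://www.example.com/api')  # placeholder URL
if response.status_code == 429:
    # Wait for the number of seconds the server asked for, then try again.
    time.sleep(int(response.headers.get('Retry-After', '1')))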
Writing this piece of code when requesting fixed my problem:
requests.get(link, headers = {'User-agent': 'your bot 0.1'})
This works because sites sometimes return a Too Many Requests (429) error when there isn't a user agent provided. For example, Reddit's API only works when a user agent is applied.
As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use-case:
1) Sleep your process. The server usually includes a Retry-after header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.
2) Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built right in.
3) Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API) but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests. Celery's rate_limit feature uses a token bucket algorithm.
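As an illustration of option 3 (an assumption on my part, not part of the original answer), a minimal client-side token bucket might look like this:

import time

class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self):
        # Refill based on elapsed time, then block until a token is available.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

# Usage: bucket = TokenBucket(rate=10, capacity=10), then call bucket.acquire() before each request.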
Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:
import requests
from requests.exceptions import ConnectTimeout


class TooManyRequests(Exception):
    """Too many requests"""


# `task` here is the Celery task decorator (for example, `app.task` on your Celery app instance).
@task(
    rate_limit='10/s',
    autoretry_for=(ConnectTimeout, TooManyRequests,),
    retry_backoff=True)
def api(*args, **kwargs):
    r = requests.get('placeholder-external-api')
    if r.status_code == 429:
        raise TooManyRequests()
# Simpler variant: honor the Retry-After header directly before retrying.
if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))
Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the rate-limiting on the server is done at the IP level.
There is a brief blog post demonstrating a way to use tor along with urllib2:
http://blog.flip-edesign.com/?p=119
I've found a nice workaround to IP blocking when scraping sites. It lets you run a scraper indefinitely by running it from Google App Engine and redeploying it automatically when you get a 429.
Check out this article
In many cases, continuing to scrape data from a website even when the server is requesting you not to is unethical. However, in the cases where it isn't, you can utilize a list of public proxies in order to scrape a website with many different IP addresses.
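A rough sketch of routing a request through one such proxy with the requests library (the proxy address, port, and target URL are placeholders):

import requests

proxies = {
    'http': 'http://203.0.113.10:8080',   # placeholder proxy address
    'https': 'http://203.0.113.10:8080',
}
response = requests.get('http://example.com', proxies=proxies, timeout=5)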

Retry for python requests module hanging

I had to use the requests module to fetch a huge number of URLs. Due to network errors, I wanted to implement a retry mechanism, so my code looks like this:
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import requests

session = requests.Session()
retry = Retry(total=1, backoff_factor=0.5, status_forcelist=(500, 502, 504, 503))
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

head_response = session.get(url, timeout=2)
However, the code keeps hanging when the URL is tiffins.NET. A normal requests.get with timeout=2 gives a 503 status code but doesn't hang.
Am I doing something wrong?
I'm using Python 2.7.15rc1.
Just had the same problem and figured out that it's because of the website's header "Retry-After". By default, Retry uses that value to wait until the next request. To ignore it, set the respect_retry_after_header parameter to False when creating a Retry object.
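A sketch of the adjusted configuration, reusing the names from the question (respect_retry_after_header is available in reasonably recent urllib3 versions):

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import requests

session = requests.Session()
# Ignore the server's Retry-After header; only backoff_factor controls the wait.
retry = Retry(total=1, backoff_factor=0.5,
              status_forcelist=(500, 502, 504, 503),
              respect_retry_after_header=False)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)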

How to read JSON from URL in Python?

I am trying to use Python to get a JSON file from the Web. If I open the URL in my browser (Mozilla or Chromium) I do see the JSON. But when I do the following with Python:
response = urllib2.urlopen(url)
data = json.loads(response.read())
I get an error message that tells me the following (translated into English): Errno 10060, a connection attempt failed because the server did not respond after a certain time period, or the connection was faulty, or the host did not respond.
ADDED
It looks like there are many people who have faced the described problem. There are also some answers to similar (or the same) questions. For example, here we can see the following solution:
import requests
r = requests.get("http://www.google.com", proxies={"http": "http://61.233.25.166:80"})
print(r.text)
It is already a step forward for me (I think it is very likely that the proxy is the reason for the problem). However, I still did not get it working, since I do not know the URL of my proxy and I will probably need a user name and password. How can I find them? How is it that my browsers have them and I do not?
ADDED 2
I think I am now one step further. I have used this site to find out what my proxy is: http://www.whatismyproxy.com/
Then I have used the following code:
proxies = {'http':'my_proxy.blabla.com/'}
r = requests.get(url, proxies = proxies)
print r
As a result I get
<Response [404]>
Looks not so good, but at least I think that my proxy is correct, because when I randomly change the address of the proxy I get another error:
Cannot connect to proxy
So, I can connect to proxy but something is not found.
I think there might be something wrong when you're trying to get the JSON from the online source (URL). Just to make things clear, here is a small code snippet:
#!/usr/bin/env python

try:
    # For Python 3+
    from urllib.request import urlopen
except ImportError:
    # For Python 2
    from urllib2 import urlopen

import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    # Decode the bytes before parsing; calling str() on bytes would break under Python 3.
    data = response.read().decode('utf-8')
    return json.loads(data)
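For example (the URL here is just a placeholder):

data = get_jsonparsed_data("https://example.com/data.json")
print(data)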
If you still get a connection error, you can try a couple of steps:
Try to urlopen() a random site from the interpreter (interactive mode). If you are able to grab the source code, you're good. If not, check your internet connection or try the requests module (see the sketch after this list). Check here
Check whether the JSON at the URL is in correct syntax. For sample JSON syntax check here
Try the simplejson module.
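A minimal requests-based sketch of the same fetch (the URL and timeout are assumptions, not part of the original answer):

import requests

response = requests.get("https://example.com/data.json", timeout=10)
response.raise_for_status()   # surface HTTP errors instead of failing silently
data = response.json()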
Edit 1:
If you want to access websites using a system-wide proxy, you will have to use a proxy handler that connects to that proxy via loopback (localhost). Sample code is shown below.
proxy = urllib2.ProxyHandler({
    'http': '127.0.0.1',
    'https': '127.0.0.1'
})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# This way you can send both HTTP and HTTPS requests using proxies.
urllib2.urlopen('http://www.google.com')
urllib2.urlopen('https://www.google.com')
I have not worked a lot with ProxyHandler; I just know the theory and the code. I am sure there are better ways to access websites through proxies, ones that do not involve installing the opener every time you run the program, but hopefully this will point you in the right direction.

Python file sending correct way? (urllib2, PyCurl).

Hello!
My problem is with Python sending a file to a controller that works with a prepared HTML upload script. The problem is that the upload does not succeed (it doesn't even start uploading), although the script runs through. The file is a binary container file. The problem should be in the code, because the upload can be completed by other means.
Here the output:
09:54:40:11 ...STEP: Upload Firmware
09:54:49:63 ...Upload was successful!
09:54:49:64 ...POST resource
09:54:50:60 ...Response: {"uploadFirmwareAck":0}
So it says the upload "was done" within 9 seconds, but it should take about 5 minutes. With the debugger I monitored that it did not start; it just jumped over it and gave the "Upload was successful" message. I have no clue why. Any ideas?
The code:
import pycurl
from cStringIO import StringIO
import urllib2
import simplejson as json
url = 'http://eData/pvi?rName=FirmwareUpload'
req = urllib2.Request(url)
req.add_header('Content-Type','application/json')
c = pycurl.Curl()
c.setopt(c.POST, 1)
c.setopt(c.URL, url)
c.setopt(c.CONNECTTIMEOUT,0)
c.setopt(c.TIMEOUT, 0)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(c.HTTPPOST, [("file1", (c.FORM_FILE, "c:\\Users\\dem2bp\\Desktop\\HMI_Firmware update materials\\output_38.efc"))])
c.perform()
print "Upload was successful!"
print "Tx JSON:"
print "POST resource"
res = urllib2.urlopen(req)
print "Response:"
str_0 = res.read()
print str_0
c.close()
From the documentation:
PycURL is targeted at an advanced developer - if you need dozens of concurrent, fast and reliable connections or any of the sophisticated features listed above then PycURL is for you.
I would give http://www.python-requests.org/en/latest/ a try. For me, it's always the first choice when doing some HTTP stuff. Usually it just does what it's supposed to do in a few lines of code.
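As a hedged sketch (untested against this controller, and swapping pycurl out entirely), the same multipart upload with requests might look like this, reusing the URL and file path from the question:

import requests

url = 'http://eData/pvi?rName=FirmwareUpload'
file_path = r'c:\Users\dem2bp\Desktop\HMI_Firmware update materials\output_38.efc'

# requests builds the multipart/form-data body and streams the file for us.
with open(file_path, 'rb') as f:
    response = requests.post(url, files={'file1': f})
print(response.status_code, response.text)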
Thank you, but as I see it, the requests library does not run on an old version of Python like 2.6, and I think it would be too risky to upgrade. Do you have another idea?
When I import the requests library, the parts of it that require later versions throw syntax errors.

Python suds.Client: should it close or not?

I am using the suds package to make API requests to a website. I wrote a function which opens up a client to the website and makes a request.
I am wondering whether I should, and how I can, terminate the connection at the end of the function.
I am also wondering whether the client will behave like MySQLdb.connect and actually open up many separate API connections that never close, every time I call this function.
from suds.client import Client
import sys, re

def querysearch(reqPartNumber, reqMfg, lock):
    try:
        client = Client('http://app....')
        userInfo = {'id': .., 'password': ...}
        apiResponse = client.service.getParts(...)
        ...
        print apiResponse
    except:
        ...
SOAP is still an HTTP request, which is stateless. Each request will start a whole new connection, re-auth, etc. Browsers kind of short-circuit that with cookies, but SOAP doesn't. So you don't need to close the connection; it's already closed by the time suds returns your data to you.
Additionally, looking at the latest source, Client() doesn't define a close or __exit__ method, so there's nothing you really have to do here.
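If the repeated construction is the concern, one option (a sketch under my own assumptions, not from the answer, with hypothetical getParts arguments since the real call is elided in the question) is to build the Client once and reuse it across calls:

from suds.client import Client

# Built once at module import; the WSDL URL is the placeholder from the question.
client = Client('http://app....')

def querysearch(reqPartNumber, reqMfg, lock):
    # Hypothetical arguments; the real signature is elided in the question.
    apiResponse = client.service.getParts(reqPartNumber, reqMfg)
    print(apiResponse)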
