TooManyRequests Overpass Error - python

I'm using overpy to query the Overpass API, and the nature of the data is such that I have a lot of queries to execute. I've run into the 429 OverpassTooManyRequests exception and I'm trying to play by the rules. I've tried introducing time.sleep calls to space out the requests, but I have no basis for how long the program should wait before continuing.
I found this link which mentions a "Retry-after" header:
How to avoid HTTP error 429 (Too Many Requests) python
Is there a way to access that header in an overpy response? I've been through the docs and the source code, but nothing stood out that would allow me to access that header so I can pause querying until it's acceptable to do so again.
I'm using Python 3.6 and overpy 0.4.

Maybe this isn't quite the answer you're seeking, but I ran into the same issue and fixed it by simply hosting my own OSM database server using Docker. Just clone the repo and follow the instructions:
https://github.com/mediasuitenz/docker-overpass-api

Per http://overpass-api.de/command_line.html, do check that you don't have a single 'runaway' request that is taking up all the resources.

After verifying that I don't have runaway queries, I have taken Peter's advice and added a catch for the TooManyRequests exception that waits 30s and tries again. This seems to be working as an immediate solution.
I will also raise an issue with the originators of OverPy to suggest an enhancement that allows evaluating the /api/status endpoint, as per mmd's advice.
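For anyone looking for the concrete pattern, here is a minimal sketch of that catch-and-retry approach, assuming overpy 0.4; the query string, the 30-second wait and the retry cap are arbitrary placeholders, not values recommended by the Overpass operators:

import time
import overpy

api = overpy.Overpass()

def query_with_retry(ql, wait=30, max_tries=5):
    # Run an Overpass QL query, sleeping and retrying when rate limited.
    for attempt in range(max_tries):
        try:
            return api.query(ql)
        except overpy.exception.OverpassTooManyRequests:
            # overpy does not expose the Retry-After header here,
            # so fall back to a fixed wait before trying again.
            time.sleep(wait)
    raise RuntimeError("still rate limited after %d attempts" % max_tries)

result = query_with_retry('node["amenity"="cafe"](50.74,7.0,50.75,7.1);out;')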

Related

gtts.tts.gTTSError: 429 (Too Many Requests) from TTS API. Probable cause: Unknown

I've installed gTTS using pip with Python and the first couple of iterations seemed fine. However, now I keep getting this error:
gtts.tts.gTTSError: 429 (Too Many Requests) from TTS API. Probable cause: Unknown
I've removed it from a loop but it still won't run; here is my code:
from gtts import gTTS

audio = gTTS(text="Hello World", lang='en', slow=False)
audio.save("audio.mp3")
How do I fix this? I've uninstalled it and waited for about an hour, but it's not fixed. I've researched, and all of the solutions say it's an anti-DDoS filter, but I've waited and the error gives no indication of that.
You may be blocked for longer than an hour. I would suggest waiting longer, such as a day. If it works after that, you can try to introduce an artificial wait by using time.sleep(10) before each request, which pauses program execution for 10 seconds. This might help you avoid being rate limited.
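As a rough sketch of that suggestion (the phrase list, output file names and the 10-second pause are just placeholders):

import time
from gtts import gTTS

phrases = ["Hello World", "Goodbye World"]

for i, text in enumerate(phrases):
    # Pause before each request; 10 seconds is the figure suggested
    # above, not an official quota from the service.
    time.sleep(10)
    audio = gTTS(text=text, lang='en', slow=False)
    audio.save("audio_%d.mp3" % i)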
Use of the translate.googleapis.com site is very limited. It only allows about 100 requests per one-hour period and thereafter returns a 429 error (Too Many Requests). On the other hand, the Google Translate API has a default billable limit of 5 requests/second/user and 200,000 requests/day.
The Google Translate API has its own Google Group where many more people discuss that product, since we don't get many questions about the API here; you may find https://groups.google.com/forum/#!forum/google-translate-api very interesting to read.
The Google Translate API also comes with its own support at https://cloud.google.com/support-hub/, since Google Cloud Platform can cost money (the API is something that can incur costs).

503 error when downloading data from imdb api

I am trying to download the plot for almost 25,000 movies using the imdbpy module for Python. To speed things up, I'm using the Pool function from the multiprocessing module. However, after almost 100 requests a 503 error occurs with the following message: Service Temporarily Unavailable. After 10-15 minutes I can process again, but after approximately 20 requests the same error occurs again.
I am aware that it might be a simple block from the API to prevent too many calls; however, I can't find any info on the web about the maximum number of requests per time unit.
Do you have any idea how to process so many calls without being shut down? Moreover, do you know where I can find the documentation of the IMDb API?
Best
Please, don't do it.
Scraping is forbidden by IMDb's terms of service, and IMDbPY was never intended to be used to mass-scrape the web site: in fact it's explicitly designed to fetch a single movie at a time.
In theory IMDbPY can manage the plain text data files they distribute, but unfortunately they recently changed both the format and the content of the data.
IMDb has no APIs that I know of; if you have to manage such a huge portion of their data, you have to get a licence.
Please consider using http://www.omdbapi.com/ instead.
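For illustration, a minimal sketch of fetching a plot from OMDb with the requests library; OMDb requires a (free) API key, and "YOUR_KEY" and the title here are placeholders:

import requests

# Query OMDb by title and ask for the full plot text.
params = {"apikey": "YOUR_KEY", "t": "The Matrix", "plot": "full"}
resp = requests.get("http://www.omdbapi.com/", params=params)
data = resp.json()
print(data.get("Plot"))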

Python: SPARQLWrapper, Timeout?

I am new to Python as well as to the world of querying the semantic web.
I am using the SPARQLWrapper library to query DBpedia. I searched the library documentation but failed to find a 'timeout' option for a query fired at DBpedia from SPARQLWrapper.
Does anyone have any idea about this?
As of 2018, you can use SPARQLWrapper.setTimeout() to set the timeout for SPARQLWrapper requests.
As Karoo mentioned, you can use SPARQLWrapper.setTimeout(timeout=(int)).
If you want a timeout as a float, go to the Wrapper.py module and change self.timeout = int(timeout) to self.timeout = float(timeout) in the def setTimeout(self, timeout): function.
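A minimal sketch of how setTimeout looks in practice against the DBpedia endpoint; the query and the 30-second value are arbitrary examples:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setTimeout(30)  # client-side timeout, in seconds
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Python_(programming_language)>
          <http://dbpedia.org/ontology/abstract> ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:80])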
I don't know if this is specifically an answer to your question, but I searched for ages and here's my solution for anyone else having trouble with Virtuoso-specific timeouts in SPARQLWrapper:
You can use this line of code to set a server-side timeout for your queries (not client-side like .setTimeout):
[your SPARQLWrapper entity].addExtraURITag("timeout","[your timeout in ms]")
In my case it looks like this:
s.addExtraURITag("timeout","10000")
This should give you 10 seconds of time before your query stops searching and returns results instead of just giving you a Timeout error.
Hope this helps someone.
DBpedia uses the Virtuoso server for its endpoint, and timeout is a Virtuoso-specific option. SPARQLWrapper doesn't currently support it.
The next version will feature better modularity, and proper vendor-specific extensions might be implemented after that, but I guess you don't have time to wait.
Currently, the only way to add such a parameter is to manually hardcode it into your local version of the library.

Inexplicable urllib2 problem between virtualenvs

I have some test code (as a part of a webapp) that uses urllib2 to perform an operation I would usually perform via a browser:
Log in to a remote website
Move to another page
Perform a POST by filling in a form
I've created 4 separate, clean virtualenvs (with --no-site-packages) on 3 different machines, all with different versions of Python but the exact same packages (via a pip requirements file), and the code only works on the two virtualenvs on my local development machine (2.6.1 and 2.7.2) - it won't work on either of my production VPSs.
In the failing cases, I can log in successfully, move to the correct page but when I submit the form, the remote server replies telling me that there has been an error - it's an application server error page ('we couldn't complete your request') and not a webserver error.
because I can successfully log in and maneuver to a second page, this doesn't seem to be a session or a cookie problem - it's particular to the final POST
because I can perform the operation on a particular machine with the EXACT same headers and data, this doesn't seem to be a problem with what I am requesting/posting
because I am trying the code on two separate VPS rented from different companies, this doesn't seem to be a problem with the VPS physical environment
because the code works on 2 different Python versions, I can't imagine it being an incompatibility problem
I'm completely lost at this stage as to why this wouldn't work. I've even 'turned-it-off-and-turn-it-on-again' because I just can't see what the problem could be.
I think it has to be something to do with the final POST coming from a VPS that the remote server doesn't like, but I can't figure out what that could be. I feel like there is something going on under the hood of urllib2 that is causing the remote server to dislike the request.
EDIT
I've installed the exact same Python version (2.6.1) on the VPS as is on my working local copy and it doesn't work remotely, so it must be something to do with originating from a VPS. How could this affect the HTTP request? Is it something lower level?
You might try setting the debuglevel=1 for urllib2 and see what it comes up with:
import urllib2

# debuglevel=1 makes urllib2 print the outgoing request and the incoming
# response headers to stdout, so you can compare the two environments.
h = urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(h)
...
This is a total shot in the dark, but are your VPSs 64-bit and your home computer 32-bit, or vice versa? Maybe a difference in default sizes or accuracies of something could be freaking out the server.
Barring that, can you try to find out any information on the software stack the web server is using?
I had similar issues with urllib2 (working with Zimbra's REST API); in the end I switched to pycurl with success.
PS
for operations like login/navigate/post, I usually find mechanize useful and easier to use. Maybe you can give it a shot.
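For what it's worth, a rough sketch of the login/navigate/post flow with mechanize; the URLs, form indices and field names are placeholders for the real site:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)

# Log in.
br.open("https://example.com/login")
br.select_form(nr=0)
br["username"] = "me"
br["password"] = "secret"
br.submit()

# Move to the page with the form and submit it.
br.open("https://example.com/form-page")
br.select_form(nr=0)
br["some_field"] = "some value"
response = br.submit()
print response.read()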
Well, it looks like I know why the problem was happening, but I'm not 100% sure of the reason for it.
I simply had to make the script wait (time.sleep()) after it sent the 2nd request (move to another page) before doing the 3rd request (perform a POST by filling in a form).
I don't know if it's because of a condition on the 3rd party server, or if it's some sort of odd issue with urllib2. The reason it seemed to work on my development machine is presumably because it was slower than the server at running the code?
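In outline, the fix looked something like this; the URLs, form data and the 2-second pause are placeholders, not the actual site's values:

import time
import urllib2

# Keep cookies across requests so the session survives the login.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())

# 1. Log in.
opener.open("https://example.com/login", "username=me&password=secret")

# 2. Move to the page with the form.
opener.open("https://example.com/form-page")

# Give the remote application a moment before submitting the form;
# adding this pause was what made the POST succeed from the VPS.
time.sleep(2)

# 3. Perform the POST.
response = opener.open("https://example.com/form-submit", "field1=value1")
print response.read()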

How to improve Trac's performance

I have noticed that my particular instance of Trac is not running quickly and has big lags. This is at the very onset of a project, so not much is in Trac (except for plugins and code loaded into SVN).
Setup Info: This is via a SELinux system hosted by WebFaction. It is behind Apache, and connections are over SSL. Currently the .htpasswd file is what I use to control access.
Are there any recommended ways to improve the performance of Trac?
It's hard to say without knowing more about your setup, but one easy win is to make sure that Trac is running in something like mod_python, which keeps the Python runtime in memory. Otherwise, every HTTP request will cause Python to run, import all the modules, and then finally handle the request. Using mod_python (or FastCGI, whichever you prefer) will eliminate that loading and skip straight to the good stuff.
Also, as your Trac database grows and you get more people using the site, you'll probably outgrow the default SQLite database. At that point, you should think about migrating the database to PostgreSQL or MySQL, because they'll be able to handle concurrent requests much faster.
We've had the best luck with FastCGI. Another critical factor was to only use https for authentication but use http for all other traffic -- I was really surprised how much that made a difference.
I have noticed that if
select distinct name from wiki
takes more than 5 seconds (for example due to a million rows in this table - this is a true story; we had a script that filled it), browsing wiki pages becomes very slow and takes over 2*t*n, where t is the execution time of the quoted query (>5s of course) and n is the number of TracWiki links present on the viewed page.
This is due to Trac having a (hardcoded) 5s cache expiry for this query, which it uses to decide what colour a link should be. We re-hardcoded the value to 30s (we really do need that many pages, so every 30 seconds someone has to wait 6-7s).
This may or may not be what caused your problem. Good luck speeding up your Trac instance.
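If you want to check whether this is what's biting you, a quick sketch that times the query directly against a SQLite-backed Trac environment; the database path is a placeholder for your own environment:

import sqlite3
import time

db_path = "/path/to/trac-env/db/trac.db"  # adjust to your Trac environment

conn = sqlite3.connect(db_path)
start = time.time()
# The same query Trac runs to colour wiki links.
conn.execute("select distinct name from wiki").fetchall()
print("query took %.2f seconds" % (time.time() - start))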
Serving the chrome files statically with an expires header could help too. See the end of this page.
