503 error when downloading data from imdb api - python

I am trying to download the plot for almost 25,000 movies using the imdbpy module for Python. To speed things up, I'm using the Pool function from the multiprocessing module. However, after almost 100 requests a 503 error occurs with the following message: Service Temporarily Unavailable. After 10-15 minutes I can process again, but after approximately 20 requests the same error occurs again.
I am aware that it might be a simple block from the API to prevent too many calls, but I can't find any info on the web about the maximum number of requests per time unit.
Do you have any idea how to process so many calls without being shut down? Also, do you know where I can find the documentation for the IMDb API?
Best

Please, don't do it.
Scraping is forbidden by IMDb's terms of service, and IMDbPY was never intended to be used to mass-scrape the web site: in fact it's explicitly designed to fetch a single movie at a time.
In theory IMDbPY can manage the plain text data files they distribute, but unfortunately they recently changed both the format and the content of the data.
IMDb has no APIs that I know of; if you have to manage such a huge portion of their data, you have to get a licence.
Please consider using http://www.omdbapi.com/ instead.
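For reference, a minimal sketch of the single-movie usage IMDbPY is designed for, fetching one plot at a time; the example title and the pause length are illustrative assumptions, not something from the original answer:

import time
from imdb import IMDb  # IMDbPY

ia = IMDb()
results = ia.search_movie("The Matrix")  # illustrative title
if results:
    movie = results[0]
    ia.update(movie)               # fetch the full record for this single movie
    plots = movie.get("plot", [])  # 'plot' is typically a list of summaries
    print(plots[0] if plots else "no plot found")
time.sleep(2)  # assumed pause between lookups to stay polite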

Related

Qualtrics API, getting "[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)"

I occasionally get the above error when making requests to the Qualtrics APIs with the Python requests library.
In a nutshell, I have a Google Cloud Function that triggers when a CSV file is placed in a specific Cloud Storage bucket. The function creates a distribution list on Qualtrics, uploads the contacts and then downloads the distribution links.
Every day, three files are uploaded to Cloud Storage, each for a different survey, so three function instances are started.
My gripes with the issue are:
it doesn't happen regularly; in fact the workflow worked correctly for the past year
it doesn't seem to be tied to the processed files: when the function crashes and I manually restart it by re-uploading the same CSV into the bucket, it works smoothly
The problem started around the time we added the third daily CSV to the process, and it tends to happen when two files are being processed at the same time. For these reasons my suspects are:
I'm getting rate limited by Qualtrics (but I would expect a clearer message from Qualtrics)
the requests get in some way "crossed up" when two files are processed; I'm not sure if requests.request implicitly opens a session with the APIs, in which case the problem could be caused by multiple sessions being open at the same time from two CSVs being processed at the same time
As I said, the error seems to happen without a pattern, and it has happened in any part of the code where I make a request to the APIs, so I'm not sure if sharing extensive code is helpful, but in general the requests are performed in a pretty standard way:
requests.request("POST", requestUrl, data=requestPayload, headers=headers)
requests.request("GET", requestUrl, headers=headers)
etc.
i.e. I'm not using any particular custom options.
In the end I kind of resolved the issue with a workaround:
I separated the processing of the three CSVs so that there is no overlap in processing time between two files
I implemented a retry policy on the POST requests (a sketch of what I mean is below)
Since then, separating the processing windows has substantially reduced the number of errors (from one or more per day to around one a week), and even when they do happen the retry policy gets around the error on the first retry.
I realize this may not be the ideal solution, so I'm open to alternatives if someone comes up with something better (or even more insight into the root problem).
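A minimal sketch of the kind of retry described above, assuming the same plain requests calls shown earlier; the attempt count, wait time and function name are my own illustrative choices, not the exact code used:

import time
import requests

def post_with_retry(url, payload, headers, attempts=3, wait=5):
    # Retry the POST a few times; the SSL/decryption error surfaces as a RequestException.
    for attempt in range(attempts):
        try:
            return requests.request("POST", url, data=payload, headers=headers)
        except requests.exceptions.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(wait)  # assumed pause before retrying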

gtts.tts.gTTSError: 429 (Too Many Requests) from TTS API. Probable cause: Unknown

I've installed gTTS using pip with Python and the first couple of iterations seemed fine. However, now I keep getting this error:
gtts.tts.gTTSError: 429 (Too Many Requests) from TTS API. Probable cause: Unknown
I've removed it from a loop but it still won't run; here is my code:
from gtts import gTTS
audio = gTTS(text="Hello World", lang='en', slow=False)
audio.save("audio.mp3")
How do I fix this? I've uninstalled, and waited for about an hour, but it's not fixed. I've researched and all of the solutions say it's an anti-DDoS filter, but I've waited and the error doesn't give any indication of this.
You may be blocked for longer than an hour. I would suggest waiting longer, such as a day. If it works after that, you can try introducing an artificial wait by using time.sleep(10) before each request, which pauses program execution for 10 seconds. This might help you avoid being rate limited.
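A minimal sketch of what that could look like in a loop; the list of texts, the file names and the 10-second pause are illustrative assumptions:

import time
from gtts import gTTS

texts = ["Hello World", "Goodbye World"]  # illustrative inputs
for i, text in enumerate(texts):
    audio = gTTS(text=text, lang='en', slow=False)
    audio.save("audio_{}.mp3".format(i))
    time.sleep(10)  # pause between requests to reduce the chance of a 429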
Use of the translate.googleapis.com site is very limited. It only allows about 100 requests per one-hour period and thereafter returns a 429 error (Too Many Requests). On the other hand, the Google Translate API has a default billable limit of 5 requests/second/user and 200,000 requests/day.
The Google Translate API has a specific Google Group where many more people discuss that product, since we don't get too many questions about the API here, so you may find https://groups.google.com/forum/#!forum/google-translate-api very interesting to read.
The Google Translate API also comes with its own support at https://cloud.google.com/support-hub/, since Google Cloud Platform can cost money (the API is something that can incur costs).

Schedule web scraping jobs on Azure and store results on ADLS

I have a Python job which uses Beautiful Soup to scrape data from the web. I have tried executing the script using U-SQL, however I keep receiving a generic error message:
An unhandled exception from user code has been reported
I haven't explored the error too much as I am not sure if it is possible to scrape the web through U-SQL.
Is this possible using U-SQL, and if not, which Azure resource can I use to schedule this script and store the results in Azure Data Lake Store?
Also, it would normally be helpful if you provided the complete error message and exactly how you want to scrape the web.
I will make the assumption right now that you wrote some code that accesses web pages and tried to run it from within U-SQL. If that is correct, you will get blocked because the U-SQL container blocks all external network access. For more details on why that is done, see the previous answer here.
Hi I'm a PM from the Azure Data Lake team and I'd love to help out with this. I just need some clarification first about what you're trying to do. Could you reach out to me at mabasile(at)microsoft.com with the job ID of the failed job? (Any sensitive information can of course be scrubbed out). That'll be the best way to figure out exactly what you're trying to do and if it's possible on ADL.
Thanks, and I hope to hear from you soon!
Matt Basile
Azure Data Lake Analytics
Update: Confirming Michael Rys's answer - you cannot call external services through U-SQL, because if ADLA scales out to hundreds of vertices and each vertex makes a separate call, you could end up DDOSing the service, so ADLA blocks external calls.

TooManyRequests Overpass Error

I'm using overpy to query the Overpass API, and the nature of the data is such that I have a lot of queries to execute. I've run into the 429 OverpassTooManyRequests exception and I'm trying to play by the rules. I've tried introducing time.sleep methods to space out the requests, but I have no basis for how long the program should wait before continuing.
I found this link which mentions a "Retry-after" header:
How to avoid HTTP error 429 (Too Many Requests) python
Is there a way to access that header in an overpy response? I've been through the docs and the source code, but nothing stood out that would allow me to access that header so I can pause querying until it's acceptable to do so again.
I'm using Python 3.6 and overpy 0.4.
Maybe this isn't quite the answer you're seeking, but I ran into the same issue and fixed it by simply hosting my own OSM database server using Docker. Just clone the repo and follow the instructions:
https://github.com/mediasuitenz/docker-overpass-api
Following the advice at http://overpass-api.de/command_line.html, do check that you do not have a single 'runaway' request that is taking up all the resources.
After verifying that I don't have runaway queries, I took Peter's advice and added a catch for the TooManyRequests exception that waits 30s and tries again. This seems to be working as an immediate solution.
I will also raise an issue with the overpy maintainers to suggest an enhancement that allows evaluating /api/status, as per mmd's advice.
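A minimal sketch of that catch-and-wait approach, assuming overpy 0.4 as in the question; the 30-second wait and the query string are illustrative:

import time
import overpy

api = overpy.Overpass()

def query_with_backoff(query, wait=30):
    # Keep retrying whenever the server answers 429 (too many requests), pausing between attempts.
    while True:
        try:
            return api.query(query)
        except overpy.exception.OverpassTooManyRequests:
            time.sleep(wait)

result = query_with_backoff('node["amenity"="cafe"](50.6,7.0,50.8,7.3);out;')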

Multi-threaded web requests in python -- 'Name or service not known'

I have a large scraping job to do -- most of the script's time is spent blocking due to a lot of network latency. I'm trying to multi-thread the script so I can make multiple requests simultaneously, but about 10% of my threads die with the following error
URLError: <urlopen error [Errno -2] Name or service not known>
The other 90% complete successfully. I am requesting multiple pages from the same domain, so it seems like there may be some DNS issue. I make 25 requests at a time (25 threads). Everything works fine if I limit myself to 5 requests at a time, but once I get to around 10 requests I start seeing this error sometimes.
I have read Repeated host lookups failing in urllib2, which describes the same issue I have, and followed the suggestions therein, but to no avail.
I have also tried using the multiprocessing module instead of multi-threading and I get the same behaviour -- about 10% of the processes die with the same error -- which leads me to believe this is not an issue with urllib2 but something else.
Can someone explain what is going on and suggest how to fix it?
UPDATE
If I manually code the IP address of the site into my script everything works perfectly, so this error happens somewhere during the DNS lookup.
Suggestion: Try enabling a DNS cache in your system, such as nscd. This should eliminate DNS lookup problems if your scraper always makes requests to the same domain.
Make sure that the file objects returned by urllib2.urlopen are properly closed after being read, in order to free resources. Otherwise, you may reach the limit of max open sockets in your system.
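A minimal sketch of one way to guarantee the response objects get closed, assuming Python 2-era urllib2 as in the question; the URL is a placeholder:

import contextlib
import urllib2  # Python 2; use urllib.request on Python 3

url = "http://example.com/page"  # placeholder URL
with contextlib.closing(urllib2.urlopen(url)) as response:
    html = response.read()  # the response is closed automatically when the block exits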
Also, take into account the politeness policy web crawlers should follow to avoid overloading a server with too many requests.
