I have a large scraping job to do -- most of the script's time is spent blocking due to a lot of network latency. I'm trying to multi-thread the script so I can make multiple requests simultaneously, but about 10% of my threads die with the following error
URLError: <urlopen error [Errno -2] Name or service not known>
The other 90% complete successfully. I am requesting multiple pages from the same domain, so it seems like there may be some DNS issue. I make 25 requests at a time (25 threads). Everything works fine if i limit myself to 5 requests at a time, but once I get to around 10 requests, I start seeing this error sometimes.
I have read Repeated host lookups failing in urllib2
which describes the same issue I have and followed the suggestions therein, but to no avail.
I have also tried using the multiprocessing module instead of multi-threading, I get the same behaviour -- about 10% of the processes die with the same error -- which leads me to believe this is not an issue with urllib2 but something else.
Can someone explain what is going on and suggest how to fix?
UPDATE
If I manually code the ip address of the site into my script everything works perfectly, so this error happens sometime during the DNS lookup.
Suggestion: Try enabling a DNS cache in your system, such as nscd. This should eliminate DNS lookup problems if your scraper always makes requests to the same domain.
Make sure that the file objects returned by urllib2.urlopen are properly closed after being read, in order to free resources. Otherwise, you may reach the limit of max open sockets in your system.
Also, take into account the politeness policy web crawlers should have to avoid overloading a server with multiple requests.
Related
This is the first time I am trying to use Python for Web scraping. I have to extract some information from a website. I work in an institution, so I am using a proxy for Internet access.
I have used this code. Which works fine with URLs like e.g. https://www.google.co.in, or https://www.pythonprogramming.net
But when I use this URL: http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1 which I need for scraping data, it shows
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req
proxy = req.ProxyHandler({'http': r'http://username:password#url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please guide me on what the issue here which I am not able to understand.
Also while searching for the above error, I read something about absolute URLs. Is that related to it?
The problem is that your proxy server, and your own host, seem to use two different DNS resolvers, or two resolvers updated at different instants in time.
So when you pass www.genecards.org, the proxy does not know that address, and the attempt to get address information (getAddrInfo) fails. Hence the error.
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind (so you might use http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by those sites that, among other things - how shall I put it - take a dim view of being scraped:
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time, before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code. Which can, and will, break down whenever it's most inconvenient to you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring for a consultation license. Scraping a web site that does not care (or is unable, or hasn't yet come around) to providing its information in a easier format is one thing - stealing that information is quite different.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try to ping url:3128 from your terminal. Provide responses? Problem seems related to security from server.
Some background: I work for a corporation which uses a proxy. Ping/nslookup are blocked, and I think this may be contributing to the following problem. The operating system being used is Windows, and the version of Python I'm testing 3.4.3.
I'm trying to create an application that communicates with a webservice, and this application will run inside our network. However, all requests take over 10 seconds to complete, while in the web browser it loads in under a second. Note that these requests succeed, they just take too long to be usable.
I profiled the application using the cProfile module, and I found that the application is spending 11 seconds on gethostbyaddr, and 4 seconds on gethostbyname.
I'm not familiar enough with networks, but is this a timeout? Why does the request go through despite the timeout? How do I disable these operations? And if I can't, is there a library that does not use these operations?
I tried both the requests and urllib modules. Pip is also exceedingly slow and may be because of the same cause.
Thanks in advance for any help or information on this subject.
Edit
I just tried monkey patching socket.gethostbyaddr and socket.gethostbyname, and the speed delay was gone. This doesn't feel like a proper solution though.
import requests
import socket
def do_nothing(*args):
return None
socket.gethostbyaddr = do_nothing
socket.gethostbyname = do_nothing
r = requests.get('https://google.com')
print(r.status_code) # prints 200
Openstack-Swift is using evenlet.green.httplib for BufferedHttpconnections.
When I do performance benchmark of it for write operations, I could observer that write throughput drops even only one replica node is overloaded.
As I know write quorum is 2 out of 3 replicas, therefore overloading only one replica cannot affect for the throughput.
When I dig deeper what I observed was, the consequent requests are blocked until the responses are reached for the previous requests. Its mainly because of the BufferedHttpConnection which stops issuing new request until the previous response is read.
Why Openstack-swift use such a method?
Is this the usual behaviour of evenlet.green.httplib.HttpConnection?
This does not make sense in write quorum point of view, because its like waiting for all the responses not a quorum.
Any ideas, any workaround to stop this behaviour using the same library?
Its not a problem of the library but a limitation due to the Openstack Swift configuration where the "Workers" configuration in all Account/Container/Object config of Openstack Swift was set to 1
Regarding the library
When new connections are made using evenlet.green.httplib.HttpConnection
it does not block.
But if requests are using the same connection, subsequent requests are blocked until the response is fully read.
I have some Ring routes which I'm running one of two ways.
lein ring server, with the lein-ring plugin
using org.httpkit.server, like (hs/run-server app {:port 3000}))
It's a web app (being consumed by an Angular.js browser client).
I have some API tests written in Python using the Requests library:
my_r = requests.post(MY_ROUTE,
data=MY_DATA,
headers={"Content-Type": "application/json"},
timeout=10)
When I use lein ring server, this request works fine in the JS client and the Python tests.
When I use httpkit, this works fine in the JS client but the Python client times out with
socket.timeout: timed out
I can't figure out why the Python client is timing out. It happens with httpkit but not with lein-ring, so I can only assume that the cause is related to the difference.
I've looked at the traffic in WireShark and both look like they give the correct response. Both have the same Content-Length field (15 bytes).
I've raised the number of threads to 10 (shouldn't need to) and no change.
Any ideas what's wrong?
I found how to fix this, but no satisfactory explanation.
I was using wrap-json-response Ring middleware to take a HashMap and convert it to JSON. I switched to doing my own conversion in my handler with json/write-str, and this fixes it.
At a guess it might be something to do with the server handling output buffering, but that's speculation.
I've combed through the Wireshark dumps and I can't see any relevant differences between the two. The sent Content-Length fields are identical. The 'bytes in flight' differ, at 518 and 524.
No clue as to why the web browser was happy with this but Python Requests wasn't, and whether or this is a bug in Requests, httpkit, ring-middleware-format or my own code.
I have some test code (as a part of a webapp) that uses urllib2 to perform an operation I would usually perform via a browser:
Log in to a remote website
Move to another page
Perform a POST by filling in a form
I've created 4 separate, clean virtualenvs (with --no-site-packages) on 3 different machines, all with different versions of python but the exact same packages (via pip requirements file), and the code only works on the two virtualenvs on my local development machine(2.6.1 and 2.7.2) - it won't work on either of my production VPSs
In the failing cases, I can log in successfully, move to the correct page but when I submit the form, the remote server replies telling me that there has been an error - it's an application server error page ('we couldn't complete your request') and not a webserver error.
because I can successfully log in and maneuver to a second page, this doesn't seem to be a session or a cookie problem - it's particular to the final POST
because I can perform the operation on a particular machine with the EXACT same headers and data, this doesn't seem to be a problem with what I am requesting/posting
because I am trying the code on two separate VPS rented from different companies, this doesn't seem to be a problem with the VPS physical environment
because the code works on 2 different python versions, I can't imagine it being an incompabilty problem
I'm completely lost at this stage as to why this wouldn't work. I've even 'turned-it-off-and-turn-it-on-again' because I just can't see what the problem could be.
I think it has to be something to do with the final POST coming from a VPS that the remote server doesn't like, but I can't figure out what that could be. I feel like there is something going on under the hood of URLlib that is causing the remote server to dislike the reply.
EDIT
I've installed the exact same Python version (2.6.1) on the VPS as is on my working local copy and it doesn't work remotely, so it must be something to do with originating from a VPS. How could this effect the Http request? Is it something lower level?
You might try setting the debuglevel=1 for urllib2 and see what it comes up with:
import urllib2
h=urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(h)
...
This is a total shot in the dark, but are your VPSs 64-bit and your home computer 32-bit, or vice versa? Maybe a difference in default sizes or accuracies of something could be freaking out the server.
Barring that, can you try to find out any information on the software stack the web server is using?
I had similar issues with urllib2 (working with Zimbra's REST api), in the end switched to pycurl with success.
PS
for operations of login/navigate/post, I usually find Mechanize useful and easier to use. Maybe you can give it a show.
Well, it looks like I know why the problem was happening, but I'm not 100% the reason for it.
I simply had to make the server wait (time.sleep()) after it sent the 2nd request (Move to another page) before doing the 3rd request (Perform a POST by filling in a form).
I don't know is it because of a condition with the 3rd party server, or if it's some sort of odd issue with URLlib? The reason it seemed to work on my development machine is presumably because it was slower then the server at running the code?