I'm trying to get the rendered markup for http://www.epicurious.com/recipes/food/reviews/Breaded-Chicken-Cutlets-aka-Grandma-Jodys-Chicken-51114400; in theory the very same markup given by the 'View Page Source' menu option in Firefox.
I'm using a Python 2.7 script and the httplib library (http://docs.python.org/2/library/httplib.html). I've created an HTTPConnection object and when I try to get the markup via the HTTPResponse object's functions, I inevitably get a getaddrinfo - 11004 error. This script has been executed in Windows 7 and Ubuntu environments.
None of the other solutions for this error that I've read fit the bill: I am not behind any firewall, and I have no problem pinging www.google.com. I wonder if that website just doesn't conform to some standard I'm unaware of, as I haven't been able to successfully ping my target website.
I'm open to alternate approaches; let me know if there is a better way.
You might want to check out the requests library. It makes simple things like this much easier:
import requests
r = requests.get('http://www.epicurious.com/recipes/food/reviews/Breaded-Chicken-Cutlets-aka-Grandma-Jodys-Chicken-51114400')
print r.text
Here are the docs: http://docs.python-requests.org/en/latest/
Ran the above and verified it works.
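For anyone who wants to stay with the standard library: one common cause of getaddrinfo failures with httplib is passing the whole URL to HTTPConnection, which expects only the host name. Whether that happened here is an assumption, since the original script is not shown. A minimal sketch in Python 3 (where httplib is named http.client); the helper names split_url and fetch are hypothetical:

```python
import http.client
from urllib.parse import urlsplit


def split_url(url):
    """Return (host, path) in the form HTTPConnection expects."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.netloc, path


def fetch(url, timeout=10):
    """Fetch a plain-http URL and return the response body as bytes."""
    host, path = split_url(url)
    # Pass the host only, without the "http://" prefix or the path.
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().read()
    finally:
        conn.close()
```

Calling fetch() with the recipe URL from the question would then connect to www.epicurious.com and request only the path part.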
This is the first time I am trying to use Python for Web scraping. I have to extract some information from a website. I work in an institution, so I am using a proxy for Internet access.
I have used the code below, which works fine with URLs such as https://www.google.co.in or https://www.pythonprogramming.net
But when I use this URL: http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1 which I need for scraping data, it shows
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req
proxy = req.ProxyHandler({'http': r'http://username:password@url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please help me understand what the issue is here.
Also while searching for the above error, I read something about absolute URLs. Is that related to it?
The problem is that your proxy server, and your own host, seem to use two different DNS resolvers, or two resolvers updated at different instants in time.
So when you pass www.genecards.org, the proxy does not know that address, and the attempt to get address information (getaddrinfo) fails. Hence the error.
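To see which side fails to resolve the name, it can help to run the lookup directly on your own machine. A minimal sketch; can_resolve and its resolver parameter are hypothetical names, and a successful local lookup says nothing about what the proxy's resolver sees:

```python
import socket


def can_resolve(host, port=80, resolver=socket.getaddrinfo):
    """Return True if `host` resolves on this machine, False on a lookup failure."""
    try:
        resolver(host, port)
        return True
    except socket.gaierror:
        # socket.gaierror is what a failed getaddrinfo raises;
        # urllib wraps it in the URLError shown in the question.
        return False
```

If can_resolve('www.genecards.org') is True locally while the request through the proxy still fails, that would be consistent with the two resolvers disagreeing.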
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind it (so you might try http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by those sites that, among other things - how shall I put it - take a dim view of being scraped.
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time, before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code. Which can, and will, break down whenever it's most inconvenient to you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring about a consultation license. Scraping a web site that does not care (or is unable, or hasn't yet come around) to provide its information in an easier format is one thing - stealing that information is quite different.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try to ping url:3128 from your terminal. Does it respond? The problem seems related to security on the server side.
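Two details of the proxy setup in the question may also be worth double-checking: credentials in a proxy URL are separated from the host by @, and a ProxyHandler mapping needs one entry per scheme, so an https request only goes through the proxy if an 'https' key is present. A sketch under those assumptions; build_proxy_url, proxy.example and the credentials are placeholders, not values from the question:

```python
import urllib.request as req
from urllib.parse import quote


def build_proxy_url(user, password, host, port):
    """Build an http proxy URL, percent-encoding the credentials."""
    return "http://{}:{}@{}:{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port
    )


# Placeholder values; a password containing "@" must be percent-encoded.
proxy_url = build_proxy_url("username", "p@ssword", "proxy.example", 3128)

# One entry per scheme that should be routed through the proxy.
proxy = req.ProxyHandler({"http": proxy_url, "https": proxy_url})
opener = req.build_opener(proxy)
# opener.open('https://www.google.co.in') would now use the proxy for https as well.
```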
Some background: I work for a corporation which uses a proxy. Ping/nslookup are blocked, and I think this may be contributing to the following problem. The operating system being used is Windows, and the version of Python I'm testing is 3.4.3.
I'm trying to create an application that communicates with a webservice, and this application will run inside our network. However, all requests take over 10 seconds to complete, while in the web browser it loads in under a second. Note that these requests succeed, they just take too long to be usable.
I profiled the application using the cProfile module, and I found that the application is spending 11 seconds on gethostbyaddr, and 4 seconds on gethostbyname.
I'm not familiar enough with networks, but is this a timeout? Why does the request go through despite the timeout? How do I disable these operations? And if I can't, is there a library that does not use these operations?
I tried both the requests and urllib modules. Pip is also exceedingly slow and may be because of the same cause.
Thanks in advance for any help or information on this subject.
Edit
I just tried monkey patching socket.gethostbyaddr and socket.gethostbyname, and the speed delay was gone. This doesn't feel like a proper solution though.
import requests
import socket
def do_nothing(*args):
    return None
socket.gethostbyaddr = do_nothing
socket.gethostbyname = do_nothing
r = requests.get('https://google.com')
print(r.status_code) # prints 200
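If the monkey patch stays as a stopgap, one option is to limit how long it is in effect, so the rest of the program still sees the real socket functions. A minimal sketch using unittest.mock from the standard library; without_host_lookups is a hypothetical name, and this only narrows the workaround rather than explaining the slow lookups:

```python
import socket
from contextlib import contextmanager
from unittest import mock


def _do_nothing(*args):
    return None


@contextmanager
def without_host_lookups():
    """Temporarily stub out the two slow lookups, restoring them on exit."""
    with mock.patch.object(socket, "gethostbyaddr", _do_nothing), \
         mock.patch.object(socket, "gethostbyname", _do_nothing):
        yield


# Usage (requests is not imported here so the sketch runs on its own):
# with without_host_lookups():
#     r = requests.get('https://google.com')
```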
I have been trying, in vain, to make a program that reads text out loud using the web application found here (http://www.ispeech.org/text.to.speech.demo.php). It is a demo text-to-speech program that works very well and is relatively fast. What I am trying to do is make a Python program that would input text to the application, then output the result. The result, in this case, would be sound. Is there any way in Python to do this, such as a library? And if not, is it possible to do this through any other means?
I have looked into the iSpeech API, but the only problem with it is that there is a limited number of free uses (I believe that it is 200). While this program is only meant to be used a couple of times, I would rather it be able to use the service more than 200 times. Also, if this solution is impractical, could anyone direct me towards an alternative?
@AKX I am currently using eSpeak, and it works well. It just, well, doesn't sound too good, and it is hard to tell at times what is being said.
If using iSpeech is not required, there's a decent (it's surely not as beautifully articulated as many commercial solutions) open-source text-to-speech solution available called eSpeak.
It's usable from the command line (subprocess with Python), or as a shared library. It seems there's also a Python wrapper (python-espeak) for it.
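A minimal sketch of the subprocess route in Python 3, assuming the espeak executable is installed and on the PATH. The -v (voice) and -w (write WAV file) options are eSpeak's; the function names are hypothetical:

```python
import subprocess


def espeak_command(text, wav_path, voice="en"):
    """Build the argument list for writing `text` to a WAV file with eSpeak."""
    return ["espeak", "-v", voice, "-w", wav_path, text]


def synthesize(text, wav_path):
    """Run eSpeak; raises CalledProcessError if the command fails."""
    subprocess.run(espeak_command(text, wav_path), check=True)
```

Passing the arguments as a list avoids any shell quoting problems with the text to be spoken.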
Hope this helps.
OK. I found a way to do it, seems to work fine. Thanks to everyone who helped! Here is the code I'm using:
from urllib import quote_plus
def speak(text):
    import pydshow
    words = text.split()
    temp = []
    stuff = []
    while words:
        temp.append(words.pop(0))
        if len(temp) == 24:
            stuff.append(' '.join(temp))
            temp = []
    stuff.append(' '.join(temp))
    for i in stuff:
        pydshow.PlayFileWait('http://api.ispeech.org/api/rest?apikey=8d1e2e5d3909929860aede288d6b974e&format=mp3&action=convert&voice=ukenglishmale&text='+quote_plus(i))
if __name__ == '__main__':
    speak('Hello. This is a text-to speech test.')
I find this ideal because it DOES use the API, but it uses the API key that is used for the demo program. Therefore, it never runs out. The key is 8d1e2e5d3909929860aede288d6b974e.
You can actually test this at work without the program, by typing the following into your address bar:
http://api.ispeech.org/api/rest?apikey=8d1e2e5d3909929860aede288d6b974e&format=mp3&action=convert&voice=ukenglishmale&text=
Followed by the text you want to speak. You can also adjust the language, by changing, in this case, the ukenglishmale to something else that iSpeech offers. For example, ukenglishfemale. This will speak the same text, but in a feminine voice.
NOTE: Pydshow is my wrapper around DirectShow. You can use yours instead.
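For anyone without Pydshow, the chunking and URL-building part can be sketched on its own in Python 3. The query parameter names and the 24-word chunk size are taken from the code above; chunk_words, build_urls and the YOUR_API_KEY placeholder are hypothetical, and playing the resulting URLs is left to whatever audio library you use:

```python
from urllib.parse import urlencode

API_URL = "http://api.ispeech.org/api/rest"


def chunk_words(text, size=24):
    """Split text into pieces of at most `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_urls(text, api_key, voice="ukenglishmale"):
    """Return one request URL per chunk of the text."""
    urls = []
    for chunk in chunk_words(text):
        query = urlencode({
            "apikey": api_key,
            "format": "mp3",
            "action": "convert",
            "voice": voice,
            "text": chunk,  # urlencode applies quote_plus to each value
        })
        urls.append(API_URL + "?" + query)
    return urls
```

Unlike the while loop above, the slicing version produces no empty trailing chunk when the word count is an exact multiple of 24.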
The flow of your application would be like this:
1. Client: User inputs text into a form, and the form submits a request to the server.
2. Server: May be Python or whatever language/framework you want. Receives the HTTP request with the text.
3. Server: Runs text-to-speech, either with a pure Python library or by running a subprocess to a utility that can generate speech as a wav/mp3/aiff/etc.
4. Server: Sends the HTTP response back by streaming the file with a MIME type to the client.
5. Client: Receives the HTTP response and plays the content.
Specifically about step 3...
I don't have any particular advice on the most articulate open source speech synthesizing software available, but I can say that it does not necessarily have to be pure Python, or even Python at all for that matter. Most of these packages have some form of a command line utility to take stdin or a file and produce an audio file as output. You would simply launch this utility as a subprocess to generate the file, and then stream the file back in your HTTP response.
If you decide to make use of an existing web service that provides text-to-speech via an API (iSpeech), then step 3 would be replaced with making your own server-side HTTP request out to iSpeech, receiving the response and pretty much forwarding that response back to the original client request, like a proxy. I would say the benefit is not having to maintain your own speech synthesis solution, or getting better quality than you could from an open source one... but the downside is that you will probably have a bit more latency in your response time, since your server has to make its own external HTTP request and download the data first.
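The server side of this flow can be sketched as a minimal WSGI application in Python 3. The speech engine is passed in as a function, since the choice of synthesizer (local utility or remote API) is left open; make_app, the text query parameter, and the audio/wav content type are all assumptions for illustration:

```python
from urllib.parse import parse_qs


def make_app(synthesize):
    """Build a WSGI app; `synthesize` maps a text string to audio bytes."""
    def app(environ, start_response):
        # Read the text to speak from the query string.
        params = parse_qs(environ.get("QUERY_STRING", ""))
        text = params.get("text", [""])[0]
        if not text:
            start_response("400 Bad Request", [("Content-Type", "text/plain")])
            return [b"missing 'text' parameter"]
        # Generate the audio and send it back with a MIME type.
        audio = synthesize(text)
        start_response("200 OK", [
            ("Content-Type", "audio/wav"),
            ("Content-Length", str(len(audio))),
        ])
        return [audio]
    return app


# To try it locally with the standard library's reference server:
# from wsgiref.simple_server import make_server
# make_server("localhost", 8000, make_app(my_synthesizer)).serve_forever()
```

Here my_synthesizer stands for whatever function you write around your chosen engine; swapping it for one that forwards the text to a remote API gives the proxy-style variant described above.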
I want to make an HTTPS request to a real-time stream and keep the connection open so that I can keep reading content from it and processing it.
I want to write the script in Python. I am unsure how to keep the connection open in my script. I have tested the endpoint with curl, which keeps the connection open successfully. But how do I do it in Python? Currently, I have the following code:
import httplib
# req is the request object built earlier (not shown here)
c = httplib.HTTPSConnection('userstream.twitter.com')
c.request("GET", "/2/user.json?" + req.to_postdata())
response = c.getresponse()
Where do I go from here?
Thanks!
It looks like your real-time stream is delivered as one endless HTTP GET response, yes? If so, you could just use python's built-in urllib2.urlopen(). It returns a file-like object, from which you can read as much as you want until the server hangs up on you.
import urllib2
f = urllib2.urlopen('https://encrypted.google.com/')
while True:
    data = f.read(100)
    if not data:  # empty string: the server closed the connection
        break
    print(data)
Keep in mind that although urllib2 speaks https, it doesn't validate server certificates, so you might want to try an add-on package like pycurl or urlgrabber for better security. (I'm not sure if urlgrabber supports https.)
Connection keep-alive features are not available in any of the Python standard libraries for https. The most mature option is probably urllib3.
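A reusable form of such a read loop, sketched in Python 3, where urllib2.urlopen became urllib.request.urlopen. The generator name iter_chunks is hypothetical, and io.BytesIO stands in for the network response so the sketch runs offline:

```python
import io


def iter_chunks(stream, size=100):
    """Yield successive blocks from a file-like object until it is exhausted."""
    while True:
        data = stream.read(size)
        if not data:  # empty result: nothing more to read
            return
        yield data


# Stand-in for a long-lived HTTP response body.
fake_response = io.BytesIO(b"x" * 250)
block_sizes = [len(block) for block in iter_chunks(fake_response)]

# With a real stream, the same loop would look like:
# with urllib.request.urlopen(url) as f:
#     for block in iter_chunks(f):
#         process(block)   # process is whatever handling you need
```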
httplib2 supports this. (I'd have thought this the most mature option, didn't know urllib3 yet, so TokenMacGuy may still be right)
EDIT: while httplib2 does support persistent connections, I don't think you can really consume streams with it (i.e. one long response vs. multiple requests over the same connection), which I now realise you may need.
I am using the Facebook Python Graph API. When I call put_object to write to the news feed, the call takes about 12-14 seconds to complete. When I run the same request from the command line using curl with the same parameters, I get the response back in 1.2 seconds.
I ran the profiler on the Python code, and from what I see it is spending 99.5% of the time in socket.recv. I am not sure if the problem is with the Facebook Python SDK or something else.
I am on Python 2.6. I see from facebook.py that it is using urllib.
file = urllib.urlopen("https://graph.facebook.com/" + path + "?" +
                      urllib.urlencode(args), post_data)
Has anyone experienced a similar slowdown? Any suggestions will be highly appreciated.
Direct command-line curl is bound to be faster than urllib or urllib2. If you want speed, you could replace the call with pycurl, which is a C extension, whereas urllib is a Python module written on top of httplib.
If you're flexible enough to use a Tornado server, you could also use Tornado's async caller, which talks directly to sockets and is asynchronous as well.
Also, if none of these can be done, try replacing urllib with urllib2 and create a non-blocking caller with callback returns. This is all that I've done to improve the native third-party wrappers of Facebook/Twitter/Amazon etc.
Are you behind an http proxy server? Curl honors proxy server environment variables, while urllib doesn't do so by default, and also doesn't support calling an https url (such as https://graph.facebook.com) over a proxy server.
In any event, I expect it's more likely a network issue than a Python vs. C issue. Yes, C is faster, but this isn't a CPU-bound task, and there's no way that you're burning 12-14 seconds inside the Python interpreter to make this call.
If curl is happy but urllib is not, perhaps trying pycurl will solve your problem. http://pycurl.sourceforge.net/
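Before switching libraries, a quick way to see which proxy settings Python picks up from the environment for a given URL. This is a sketch in Python 3 using urllib.request.getproxies; proxy_for is a hypothetical helper, and it ignores no_proxy exclusions:

```python
import urllib.request
from urllib.parse import urlsplit


def proxy_for(url):
    """Return the proxy URL configured for `url`'s scheme, or None."""
    scheme = urlsplit(url).scheme
    # getproxies() reads <scheme>_proxy environment variables
    # (and system settings on some platforms).
    return urllib.request.getproxies().get(scheme)
```

Comparing proxy_for('https://graph.facebook.com/') with the proxy curl reports in verbose mode may show whether the two clients are taking different routes.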