I have the following code:
import re
from re import sub
import cookielib
from cookielib import CookieJar
import urllib2
from urllib2 import urlopen
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders=[('user-agent' , 'Safari/7.0.2')]
def check(word):
    try:
        query = "select * from geo.places where text ='"+word+"'"
        sourceCode = opener.open('http://query.yahooapis.com/v1/public/yql?q='+query+'&diagnostics=true').read()
        print sourceCode
    except Exception, e:
        print str(e)
        print 'ERROR IN MAIN TRY'

myStr = ['I','went','to','Boston']
for item in myStr:
    check(item)
I am trying to query select * from geo.places where text = 'Boston' (for example).
I keep receiving this error:
HTTP Error 505: HTTP Version Not Supported
ERROR IN MAIN TRY
What can cause this error and how can I solve it?
The URL you construct is not a valid URL. What you send is
GET /v1/public/yql?q=select * from geo.places where text ='I'&diagnostics=true HTTP/1.1
Accept-Encoding: identity
Host: query.yahooapis.com
Connection: close
User-Agent: Safari/7.0.2
There should be no spaces inside the URL, i.e. you have to do proper URL encoding (replace each space with '+', etc.). I guess requests just fixes the bad URL for you.
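As a rough sketch of that fix against the code in the question (Python 2, using urllib.quote_plus, which replaces spaces with '+' and escapes other unsafe characters):

import urllib

word = 'Boston'
query = "select * from geo.places where text ='" + word + "'"
url = ('http://query.yahooapis.com/v1/public/yql?q='
       + urllib.quote_plus(query) + '&diagnostics=true')
sourceCode = opener.open(url).read()  # 'opener' as built in the question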
Your query has blank spaces in it. The requests library takes care of white space in your URL, so with it you don't have to handle this yourself. With urllib2, just replace each " " with "%20" to make the URL work.
Not sure what is going wrong, but when I try to do the same action using the requests library, it works:
>>> import requests
>>> word = "Boston"
>>> query = "select * from geo.places where text ='"+word+"'"
>>> query
"select * from geo.places where text ='Boston'"
>>> baseurl = 'http://query.yahooapis.com/v1/public/yql?q='
>>> url = baseurl + query
>>> url
"http://query.yahooapis.com/v1/public/yql?q=select * from geo.places where text ='Boston'"
>>> req = requests.get(url)
>>> req
<Response [200]>
>>> req.text
u'<?xml version="1.0" encoding="UTF-8"?>\n<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="10" yahoo:created="2014-05-17T21:12:52Z" yahoo:lang="en-US"><results><place xmlns="http://where.yahooapis.com/v1/schema.rng" xml:lang="en-US" yahoo:uri="http://where.yahooapis.com/v1/place/2367105"><woeid>2367105</woeid><placeTypeName code="7">Town</placeTypeName><name>Boston</name><country code="US" type="Country" woeid="23424977">United States</country><admin1 code="US-MA" type="State" woeid="2347580">Massachusetts</admin1><admin2 code="" type="County" woei....
Note that there are differences: my code is much simpler, it does not work with cookies, and it does not try to pretend to be a Safari browser.
If you need to use cookies with requests, you will find very good support for it there.
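For what it's worth, you can also let requests build the query string itself via its params argument, which URL-encodes the value for you. A minimal sketch along the same lines:

import requests

word = 'Boston'
query = "select * from geo.places where text ='" + word + "'"
req = requests.get('http://query.yahooapis.com/v1/public/yql',
                   params={'q': query, 'diagnostics': 'true'})
print(req.url)  # the properly encoded URL requests actually sent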
As stated in other answers, you need to encode your URL because of the white space. The call would be urllib.quote if using Python 2, or urllib.parse.quote for Python 3. The safe parameter is used to exclude characters from encoding.
from urllib import quote
url = 'http://query.yahooapis.com/v1/public/yql?q=select * from geo.places where text =\'Boston\''
print(quote(url, safe=':/?*=\''))
# outputs "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20geo.places%20where%20text%20='Boston'"
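Under Python 3 the equivalent would look like this (a sketch, with the same safe characters as above):

from urllib.parse import quote

url = "http://query.yahooapis.com/v1/public/yql?q=select * from geo.places where text ='Boston'"
print(quote(url, safe=":/?*='"))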
Using requests is a good choice, but we should find out why the original fails.
query = "select * from geo.places where text ='"+word+"'"
There are spaces in this parameter, and they need to be URL-encoded.
You can convert each space to '%20'. (If the URL then goes through Python's %-string formatting, a literal '%' has to be escaped as '%%', i.e. '%%20'.)
Related
Python 3 requests.get().text returns unencoded string.
If I execute:
import requests
request = requests.get('https://google.com/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
Кто является презид
I've tried changing google.com to google.ru.
If I execute:
import requests
request = requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
d0%9a%d1%82%d0%be+%d1%8f%d0%b2%d0%bb%d1%8f%d0%b5%d1%82%d1%81%d1%8f+%d0%bf%d1%80%d0%b5%d0%b7%d0%b8%d0%b4%d0%b5%d0%bd%d1%82%d0%be%d0%bc+%d0%a0%d0%be%d1%81%d1%81%d0%b8%d0
I need to get back a normal, readable string.
You are getting this output because requests was not able to identify the correct encoding of the response. If you are sure what the response encoding is, you can set it explicitly and then read the content with the .text attribute:
response = requests.get(url)
print(response.encoding)        # check the encoding requests guessed
response.encoding = "utf-8"     # or any other encoding you know is right
print(response.text)            # decoded with the encoding set above
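If you are not sure which encoding is right, requests can also guess one from the body bytes. A small sketch (apparent_encoding is a charset-detection guess, and it can be slow on large responses):

import requests

response = requests.get(url)                    # 'url' stands in for your page
response.encoding = response.apparent_encoding  # guessed from the content
print(response.text)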
I fixed it with urllib.parse.unquote() method:
import requests
from urllib.parse import unquote
request = unquote(requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower())
print(request)
For some reason or another, it appears that depending on where I copy and paste a url from, urllib.request.urlopen won't work. For example when I copy http://www.google.com/ from the address bar and run the following script:
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
I get the following error at the call to urlopen:
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 5: ordinal not in range(128)
But if I copy the url string from text on a web page or type it out by hand, it works just fine. To be clear, this doesn't work (the string here was pasted from the address bar and ends with an invisible character, as the EDIT below shows):
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
#url1 = "http://www.google.com/"
#
#response1 = request.urlopen(url1)
#print(response1)
but this does:
from urllib import request
#url = "http://www.google.com/"
#
#response = request.urlopen(url)
#print(response)
url1 = "http://www.google.com/"
response1 = request.urlopen(url1)
print(response1)
My suspicion is that the encoding is different in the actual address bar, and Spyder knows how to handle it, but I don't because I can't see what is actually going on.
EDIT: As requested...
print(ascii(url))
'http://www.google.com/\ufeff'
print(ascii(url1))
'http://www.google.com/'
Indeed the strings are different.
\ufeff is a zero-width non-breaking space, so it's no wonder you can't see it. Yes, there's an invisible character in your URL. At least it's not gremlins.
You could try stripping it out:
from urllib import request
url = "http://www.google.com/\ufeff"  # what you actually pasted, BOM included
response = request.urlopen(url.replace("\ufeff", ""))
print(response)
That Unicode character is the byte-order mark, or BOM. When you are decoding raw bytes you can strip it automatically by using the utf-8-sig codec; on an already-decoded string, replacing the problematic character as above does the job.
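Note that utf-8-sig is a bytes-level codec, so it only helps before decoding. A minimal sketch of the distinction:

raw = b'\xef\xbb\xbfhttp://www.google.com/'  # bytes with a leading BOM
print(raw.decode('utf-8'))      # keeps the BOM: '\ufeffhttp://www.google.com/'
print(raw.decode('utf-8-sig'))  # strips the BOM: 'http://www.google.com/'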
Actually I am calling a 3rd-party API and the requirement is to pass the JSON dictionary as-is. Refer to the example URL below:
https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?JsonData={"MID":"MID117185435","ORDERID":"ORDR4o22310421111",
"CHECKSUMHASH":
"NU9wPEWxmbOTFL2%2FUKr3lk6fScfnLy8wORc3YRylsyEsr2MLRPn%2F3DRePtFEK55ZcfdTj7mY9vS2qh%2Bsm7oTRx%2Fx4BDlvZBj%2F8Sxw6s%2F9io%3D"}
The query param name is "JsonData" and the data should be wrapped in {} brackets.
import requests
import json
from urllib.parse import urlencode, quote_plus
import urllib.request
import urllib
data = '{"MID":"MID117185435","ORDERID":"ORDR4o22310421111","CHECKSUMHASH":"omcrIRuqDP0v%2Fa2DXTlVI4XtzvmuIW56jlXtGEp3S%2B2b1h9nU9cfJx5ZO2Hp%2FAN%2F%2FyF%2F01DxmoV1VHJk%2B0ZKHrYxqvDMJa9IOcldrfZY1VI%3D"}'
jsonData = data
uri = 'https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?jsonData='+str(quote_plus(data))
r = requests.get(uri)
print(r.url)
print(r.json)
print(r.json())
print(r.url) output on the console:
https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?jsonData=%7B%22MID%22%3A%22MEDTPA37902117185435%22%2C%22ORDERID%22%3A%22medipta1521537969o72718962111%22%2C%22CHECKSUMHASH%22%3A%22omcrIRuqDP0v%252Fa2DXTlVI4XtzvmuIW56jlXtGEp3S%252B2b1h9nU9cfJx5ZO2Hp%252FAN%252F%252FyF%252F01DxmoV1VHJk%252B0ZKHrYxqvDMJa9IOcldrfZY1VI%253D%22%7D
It converts { and } to %7B and %7D, and I want the {} kept as-is. Please help.
You need to undo quote_plus by importing and using unquote_plus.
I didn’t test against your url, just against your string.
When I print your uri string I get this as my output:
https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?jsonData=%7B%22MID%22%3A%22MID117185435%22%2C%22ORDERID%22%3A%22ORDR4o22310421111%22%2C%22CHECKSUMHASH%22%3A%22omcrIRuqDP0v%252Fa2DXTlVI4XtzvmuIW56jlXtGEp3S%252B2b1h9nU9cfJx5ZO2Hp%252FAN%252F%252FyF%252F01DxmoV1VHJk%252B0ZKHrYxqvDMJa9IOcldrfZY1VI%253D%22%7D
If I surround it like this:
print(str(unquote_plus(uri)))
I get this as output:
https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?jsonData={"MID":"MID117185435","ORDERID":"ORDR4o22310421111","CHECKSUMHASH":"omcrIRuqDP0v%2Fa2DXTlVI4XtzvmuIW56jlXtGEp3S%2B2b1h9nU9cfJx5ZO2Hp%2FAN%2F%2FyF%2F01DxmoV1VHJk%2B0ZKHrYxqvDMJa9IOcldrfZY1VI%3D"}
I am using Python + bs4 + PySide in the code; please look at the part of the code below:
#coding:gb2312
import urllib2
import sys
import urllib
import urlparse
import random
import time
from datetime import datetime, timedelta
import socket
from bs4 import BeautifulSoup
import lxml.html
from PySide.QtGui import *
from PySide.QtCore import *
from PySide.QtWebKit import *

def download(self, url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers or {})
    opener = self.opener or urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except Exception as e:
        print 'Download error:', str(e)
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                return self._get(url, headers, proxy, num_retries-1, data)
        else:
            code = None
    return {'html': html, 'code': code}

def crawling_hdf(openfile):
    filename = open(openfile, 'r')
    namelist = filename.readlines()
    app = QApplication(sys.argv)
    for name in namelist:
        url = "http://so.haodf.com/index/search?type=doctor&kw=" + urllib.quote(name)
        #get doctor's home page
        D = Downloader(delay=DEFAULT_DELAY, user_agent=DEFAULT_AGENT, proxies=None, num_retries=DEFAULT_RETRIES, cache=None)
        html = D(url)
        soup = BeautifulSoup(html)
        tr = soup.find(attrs={'class':'docInfo'})
        td = tr.find(attrs={'class':'docName font_16'}).get('href')
        print td
        #get doctor's detail information page
        loadPage_bs4(td)
    filename.close()

if __name__ == '__main__':
    crawling_hdf("name_list.txt")
After I run the program, a warning message shows up:
Warning (from warnings module):
  File "C:\Python27\lib\site-packages\bs4\dammit.py", line 231
    "Some characters could not be decoded, and were "
UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
I have used print str(html) and found that all the Chinese text in the tags is garbled.
I have tried the "decode or encode" and "gzip" solutions found on this website, but they don't work in my case.
Thank you very much for your help!
It looks like that page is encoded in gbk. The problem is that there is no direct conversion between utf-8 and gbk (that I am aware of).
I've seen this workaround used before, try:
html.encode('latin-1').decode('gbk').encode('utf-8')
GBK is one of the built-in encodings in the codecs in Python.
That means that anywhere you have a string of raw bytes, you can use the method decode and the appropriate codec name (or its alias) to convert it to a native Unicode string.
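For example (the byte values here are just an illustration, GBK-encoded Chinese):

raw = b'\xd6\xd0\xce\xc4'  # GBK bytes for the characters 中文
text = raw.decode('gbk')   # -> the native Unicode string '中文'
print(text)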
The following works (adapted from https://stackoverflow.com/q/36530578/2564301), insofar the returned text does not contain 'garbage' or 'unknown' characters, and indeed is differently encoded than the source page (as verified by saving this as a new file and comparing the values for the Chinese characters).
from urllib import request

def scratch(url, encode='utf-8'):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = request.Request(url, headers=headers)
    result = request.urlopen(req)
    page = result.read()
    u_page = page.decode(encoding="gbk")
    result.close()
    print(u_page)
    return u_page

page = scratch('http://so.haodf.com/index/search')
print(page)
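As a variant (a sketch, not tested against that site), you can trust the Content-Type header when it names a charset and only fall back to gbk otherwise:

from urllib import request

def fetch(url, fallback='gbk'):
    req = request.Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
    result = request.urlopen(req)
    charset = result.headers.get_content_charset() or fallback  # from Content-Type, if present
    page = result.read().decode(charset)
    result.close()
    return page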
I notice that sometimes audio files on the internet have a "fake" URL.
http://garagaeband.com/3252243
And this will 302 to the real URL:
http://garageband.com/michael_jackson4.mp3
My question is...when supplied with the fake URL, how can you get the REAL URL from headers?
Currently, this is my code for reading the headers of a file. I don't know if this code will get me what I want to accomplish. How do I parse out the "real" URL From the response headers?
import httplib
conn = httplib.HTTPConnection(head)
conn.request("HEAD",tail)
res = conn.getresponse()
This has a 302 redirect:
http://www.garageband.com/mp3cat/.UZCMYiqF7Kum/01_No_pierdas_la_fuente_del_gozo.mp3
Use urllib.getUrl()
edit:
Sorry, I haven't done this in a while:
import urllib
urllib.urlopen(url).geturl()
For example:
>>> f = urllib2.urlopen("http://tinyurl.com/oex2e")
>>> f.geturl()
'http://www.amazon.com/All-Creatures-Great-Small-Collection/dp/B00006G8FI'
>>>
Mark Pilgrim advises using httplib2 in "Dive Into Python 3", as it handles many things (including redirects) in a smarter way.
>>> import httplib2
>>> h = httplib2.Http()
>>> response, content = h.request("http://garagaeband.com/3252243")
>>> response["content-location"]
"http://garageband.com/michael_jackson4.mp3"
You have to read the response, realize that you got a 302 (FOUND), and parse out the real URL from the response headers, then fetch the resource using the new URI.
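Building on the httplib code in the question, a minimal sketch of that approach (head and tail as in the question; only a single redirect hop is handled):

import httplib

conn = httplib.HTTPConnection(head)
conn.request("HEAD", tail)
res = conn.getresponse()
if res.status in (301, 302):
    real_url = res.getheader('Location')  # the redirect target
    print 'the real url is', real_url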
I solved it myself:
import urllib2
req = urllib2.Request('http://' + theurl)
opener = urllib2.build_opener()
f = opener.open(req)
print 'the real url is......' + f.url