The following is the URL:
https://www.siasat.pk/forum/showthread.php?553205-قطری-ہو-یا-برطانوی-خط-کرپشن-کی-نشانی-ہے&s=be8abfc34aa0ca5ddf9b6d40b2acad4b&p=4505464#post4505464
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
If I try urlopen(req), it raises an exception.
I want to convert the characters to make it a valid URL. How do I get that substring and convert it to valid UTF-8 with quote?
If I call quote(url) on the complete URL, the result is invalid.
You need to extract the query part and quote just that part:
from urllib.parse import urlsplit, urlunsplit, quote
url_split = urlsplit(url)
query = quote(url_split.query, safe='=&')  # keep '=' and '&' so the parameters stay separate
url_quoted = urlunsplit(url_split._replace(query=query))
# 'https://www.siasat.pk/forum/showthread.php?553205-%D9%82%D8%B7%D8%B1%DB%8C-...'
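A slightly fuller sketch of the same idea in Python 3, in case the path can also contain non-ASCII characters. The helper name quote_non_ascii_url and the shortened example.com URL are made up for illustration. Treating '%' as safe assumes that any percent sequences already in the URL are valid escapes that should not be encoded twice.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def quote_non_ascii_url(url):
    """Percent-encode the path and query of a URL, leaving its structure intact."""
    parts = urlsplit(url)
    # '/' separates path segments; '=' and '&' separate query parameters.
    # '%' is kept so existing escapes are not encoded a second time (assumption).
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit(parts._replace(path=path, query=query))


# Hypothetical shortened URL in the same shape as the one in the question.
url = "https://example.com/forum/showthread.php?553205-قطری&p=4505464#post4505464"
result = quote_non_ascii_url(url)
print(result)
```

The result contains only ASCII characters, so it can be passed to Request(...) and urlopen as in the question.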
Python 3 requests.get().text returns unencoded string.
If I execute:
import requests
request = requests.get('https://google.com/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
Кто является презид
I've tried changing google.com to google.ru.
If I execute:
import requests
request = requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
d0%9a%d1%82%d0%be+%d1%8f%d0%b2%d0%bb%d1%8f%d0%b5%d1%82%d1%81%d1%8f+%d0%bf%d1%80%d0%b5%d0%b7%d0%b8%d0%b4%d0%b5%d0%bd%d1%82%d0%be%d0%bc+%d0%a0%d0%be%d1%81%d1%81%d0%b8%d0
I need to get a normal, correctly decoded string.
You were getting this error because requests was not able to identify the correct encoding of the response. So if you are sure about the response encoding then you can set it like the following:
response = requests.get(url)
print(response.encoding)  # check the encoding requests detected
response.encoding = "utf-8"  # or any other encoding
Then read the content through the .text attribute.
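To see why the encoding setting matters, here is a small standard-library sketch that decodes the same bytes two ways. It does not call requests at all: the byte string stands in for a response body, and Latin-1 stands in for a wrongly guessed encoding. Both are assumptions chosen for illustration.

```python
# Bytes as a server might send them for a UTF-8 page (stand-in for a response body).
body = "Кто является президентом России?".encode("utf-8")

# Decoding with the wrong codec produces mojibake like the output in the question.
garbled = body.decode("latin-1")

# Decoding with the right codec recovers the original text.
readable = body.decode("utf-8")

print(garbled)
print(readable)
```

Setting response.encoding before reading .text plays the role of choosing the codec in the second decode call.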
I fixed it with the urllib.parse.unquote() function:
import requests
from urllib.parse import unquote
request = unquote(requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower())
print(request)
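A minimal offline sketch of what unquote does to a percent-encoded fragment like the one in the google.ru output. The sample string is a short piece of that output with the leading '%' restored. unquote_plus is shown as well because the output uses '+' for spaces.

```python
from urllib.parse import unquote, unquote_plus

encoded = "%d0%9a%d1%82%d0%be+%d1%8f%d0%b2%d0%bb%d1%8f%d0%b5%d1%82%d1%81%d1%8f"

decoded = unquote(encoded)            # percent sequences decoded, '+' left alone
decoded_plus = unquote_plus(encoded)  # '+' is additionally turned into a space

print(decoded)
print(decoded_plus)
```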
When I append a Unicode string to the end of a str, I cannot click on the resulting URL.
Bad:
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
url = base_url + u"Ángel_Garasa"
print url
Good:
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
url = base_url + u"Toby_Maquire"
print url
It appears that you're printing the results in an IDE, perhaps PyCharm. You need to percent encode a UTF-8 encoded version of the string:
import urllib
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
name = u"Ángel_Garasa"
print base_url + urllib.quote(name.encode("utf-8"))
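This shows:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=%C3%81ngel_Garasa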
In your case, update your code so that the relevant field from the database is percent-encoded. You only need to encode that one field to UTF-8, and only for the percent-encoding step.
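For anyone on Python 3, a rough equivalent is sketched below. There, quote lives in urllib.parse and encodes str input as UTF-8 by default, so the explicit .encode("utf-8") is not needed. The name is written with a \u escape only to make sure the precomposed character is used.

```python
from urllib.parse import quote

base_url = (
    "https://en.wikipedia.org/w/api.php"
    "?action=query&prop=revisions&rvprop=content&format=xml&titles="
)
name = "\u00c1ngel_Garasa"  # "Ángel_Garasa" with a precomposed capital A-acute

url = base_url + quote(name)  # UTF-8 encode, then percent-encode
print(url)
```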
I would like to loop over a set of URLs, so I need to insert an integer at the position where the page id changes, like this.
In the middle of the URL there is %count%, but it does not seem to work. How can I insert the number?
count=2
while (count < pages):
    mech = Browser()
    url = 'http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
    url = int(raw_input(url))
    mech = Browser()
    page = mech.open(url)
    soup = BeautifulSoup(page)
    print url
    for thediv in soup.findAll('li',{'class':' ilo2'}):
        links = thediv.find('a')
        links = links['href']
        print links
    count = count+1
I am getting this error:
TypeError: not all arguments converted during string formatting
Final URL format:
http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491
The % operator does not work like that in Python.
Here is how you should use it:
url = 'http://....../ref=sr_pg_%s?rh=.............' % (count, )
As you already have % symbols in your URL pattern, you should begin by doubling them so that Python does not treat them as placeholders:
url = 'http://www.amazon.com/s/ref=sr_pg_%s?rh=n%%3A2858778011%%2Cp_drm_rights%%3APurchase%%7CRental%%2Cn%%3A2858905011%%2Cp_n_date%%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491' % (count, )
That being said, there is a Python module dedicated to parsing and building URLs. It is named urllib, and you can find its documentation here: https://docs.python.org/3.3/library/urllib.parse.html
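A small runnable sketch of the doubling rule, using a shortened version of the URL (the shortening is mine, for readability). It also shows that the undoubled form fails. Depending on the string, Python raises either TypeError or ValueError, so both are caught.

```python
count = 2

# '%%' yields a literal percent sign; '%d' is the only real placeholder.
template = "http://www.amazon.com/s/ref=sr_pg_%d?rh=n%%3A2858778011&page=3"
url = template % count
print(url)

# Without doubling, '%3A' is parsed as a conversion specification and fails.
failed = False
try:
    "http://www.amazon.com/s/ref=sr_pg_%d?rh=n%3A2858778011&page=3" % count
except (TypeError, ValueError) as exc:
    failed = True
    print("formatting failed:", exc)
```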
You have URL-encoded entities in your string (%3A etc.). You might try using {} syntax instead:
url = 'http://.....{}...{}...'.format(first_arg, second_arg)
Then you'll see any other issues in the string as well.
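A short sketch of the {} approach on a trimmed version of the question's URL (trimmed by me for readability). Since str.format() only treats braces as special, the existing %3A and %2C sequences need no escaping.

```python
# '%3A' and '%2C' stay literal; only '{}' is replaced.
template = (
    "http://www.amazon.com/s/ref=sr_pg_{}"
    "?rh=n%3A2858778011%2Cp_drm_rights%3APurchase&page=3"
)

# Build the URLs for pages 2, 3 and 4.
urls = [template.format(count) for count in range(2, 5)]
for url in urls:
    print(url)
```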
If you wanted to keep the string as is (without inserting a variable value), the problem would be that you delimit the string with single quotes ' while the string itself contains single quotes. You can use double quotes instead:
url = "http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491"
A better solution is escaping the quotes:
url = 'http://www.amazon.com/s/ref=sr_pg_%s\'% count %\'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
Instead of parsing or editing URLs as raw strings, use the dedicated module: urlparse in Python 2 (urllib.parse in Python 3).
Here is a simple example, using the OP's URL:
import urlparse  # Python 2; on Python 3: from urllib import parse as urlparse
original_url = (
    """http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2"""
    """Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date"""
    """%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491""")
parsed = urlparse.urlparse(original_url)
This returns something like this:
ParseResult(
scheme='http', netloc='www.amazon.com', path='/s/ref=sr_pg_2',
params='',
query='rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491', fragment='')
Then we edit the path part of the URL:
scheme, netloc, path, params, query, fragment = parsed
path = '/s/ref=sr_pg_%d' % (count, )
And we "unparse" the URL:
new_url = urlparse.urlunparse((scheme, netloc, path, params, query, fragment))
And we have a new URL with the path edited:
'http://www.amazon.com/s/ref=sr_pg_423?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
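The same path edit can be sketched for Python 3 with urllib.parse. The helper name with_page and the shortened query string are illustrative choices of mine. The named tuple's _replace method avoids unpacking and repacking all the fields.

```python
from urllib.parse import urlsplit, urlunsplit

# Shortened version of the OP's URL.
original_url = (
    "http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011"
    "&page=3&sort=csrank&ie=UTF8&qid=1403073491"
)


def with_page(url, count):
    """Return url with the page number in its path replaced by count."""
    parts = urlsplit(url)
    # Only the path changes; the already-encoded query is passed through untouched.
    return urlunsplit(parts._replace(path="/s/ref=sr_pg_%d" % count))


new_url = with_page(original_url, 423)
print(new_url)
```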
I have the following code:
import re
from re import sub
import cookielib
from cookielib import CookieJar
import urllib2
from urllib2 import urlopen
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders=[('user-agent' , 'Safari/7.0.2')]
def check(word):
    try:
        query = "select * from geo.places where text ='"+word+"'"
        sourceCode=opener.open('http://query.yahooapis.com/v1/public/yql?q='+query+'&diagnostics=true').read()
        print sourceCode
    except Exception, e:
        print str(e)
        print 'ERROR IN MAIN TRY'
myStr = ['I','went','to','Boston']
for item in myStr:
    check(item)
I am trying to query select * from geo.places where text = 'Boston' (for example).
I keep receiving this error:
HTTP Error 505: HTTP Version Not Supported
ERROR IN MAIN TRY
What can cause this error and how can I solve it?
The URL you construct is not a valid URL. What you send is
GET /v1/public/yql?q=select * from geo.places where text ='I'&diagnostics=true HTTP/1.1
Accept-Encoding: identity
Host: query.yahooapis.com
Connection: close
User-Agent: Safari/7.0.2
There should be no spaces inside the URL, i.e. you have to apply proper URL encoding (replace each space with '+' or '%20', etc.). I guess requests just fixes the bad URL for you.
Your query has blank spaces in it. requests takes care of the whitespace in your URL, so with requests you don't have to handle it yourself.
With urllib2, replace each " " with "%20" to make the URL work.
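A quick sketch comparing the manual space replacement with urllib.parse.quote (Python 3) on the query from the question. The replacement only handles spaces, while quote with safe="" also escapes characters such as '*', '=' and the single quotes. Nothing is sent over the network here.

```python
from urllib.parse import quote, unquote

query = "select * from geo.places where text ='Boston'"

# Manual fix: only spaces are escaped.
spaces_only = query.replace(" ", "%20")

# Library fix: every reserved character is escaped.
fully_quoted = quote(query, safe="")

print(spaces_only)
print(fully_quoted)
```

Both forms decode back to the same query with unquote; the second is the safer choice when the value may contain '&' or '#'.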
Not sure what is going wrong, but when I do the same thing with the requests library, it works:
>>> import requests
>>> word = "Boston"
>>> query = "select * from geo.places where text ='"+word+"'"
>>> query
"select * from geo.places where text ='Boston'"
>>> baseurl = 'http://query.yahooapis.com/v1/public/yql?q='
>>> url = baseurl + query
>>> url
"http://query.yahooapis.com/v1/public/yql?q=select * from geo.places where text ='Boston'"
>>> req = requests.get(url)
>>> req
<Response [200]>
>>> req.text
u'<?xml version="1.0" encoding="UTF-8"?>\n<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="10" yahoo:created="2014-05-17T21:12:52Z" yahoo:lang="en-US"><results><place xmlns="http://where.yahooapis.com/v1/schema.rng" xml:lang="en-US" yahoo:uri="http://where.yahooapis.com/v1/place/2367105"><woeid>2367105</woeid><placeTypeName code="7">Town</placeTypeName><name>Boston</name><country code="US" type="Country" woeid="23424977">United States</country><admin1 code="US-MA" type="State" woeid="2347580">Massachusetts</admin1><admin2 code="" type="County" woei....
Note that there are differences: my code is much simpler, it does not work with cookies, and it does not pretend to be the Safari browser.
If you need to use cookies with requests, it has very good support for them.
As stated in other answers, you need to encode your URL because of the whitespace. The call is urllib.quote in Python 2 or urllib.parse.quote in Python 3. The safe parameter lists the characters that are left unencoded.
from urllib import quote  # Python 2; on Python 3: from urllib.parse import quote
url = 'http://query.yahooapis.com/v1/public/yql?q=select * from geo.places where text =\'Boston\''
print(quote(url, safe=':/?*=\''))
# outputs "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20geo.places%20where%20text%20='Boston'"
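An alternative sketch for Python 3 that lets urlencode build the whole query string, so no safe list has to be maintained by hand. The helper name build_yql_url is hypothetical, and the code only constructs the URL without contacting the service.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(word):
    """Build the request URL for one word, with all parameters encoded."""
    query = "select * from geo.places where text ='%s'" % word
    # urlencode escapes each value and joins the pairs with '&'.
    return BASE_URL + "?" + urlencode({"q": query, "diagnostics": "true"})


url = build_yql_url("Boston")
print(url)

# Round trip: parsing the query string gives the original values back.
print(parse_qs(urlsplit(url).query))
```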
Using requests is a good choice, but we should find out why the original code fails.
query = "select * from geo.places where text ='"+word+"'"
There are spaces in your parameter, and they need to be URL-encoded.
You should convert the spaces to '%20'. Note that '%' is only special in Python when the string is used with the % formatting operator; in that case, escape it as '%%20'.
I am trying to join a string into a URL, but because it contains spaces, the part after the first space is not recognized as part of the URL.
Here is an example:
import urllib
import urllib2
website = "http://example.php?id=1 order by 1--"
request = urllib2.Request(website)
response = urllib2.urlopen(request)
html = response.read()
The "order by 1--" part is not recognized as part of the URL.
You should use urllib.urlencode or urllib.quote:
website = "http://example.com/?id=" + urllib.quote("1 order by 1--")
or
website = "http://example.com/?" + urllib.urlencode({"id": "1 order by 1 --"})
As for the query you're trying to run:
I think you're forgetting a ; to end the first query.
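For Python 3, the urlencode suggestion might look roughly like this. The functions moved to urllib.parse and urllib.request; example.com is a placeholder host. Constructing a Request object does not send anything, so the sketch runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode escapes the value and produces 'id=...'.
website = "http://example.com/?" + urlencode({"id": "1 order by 1--"})

request = Request(website)  # building the object performs no network I/O
print(request.full_url)
```

The whole value now travels as part of the URL, because the spaces are sent as '+'.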
Of course not. Spaces are invalid in a query string, and should be replaced by +.
http://example.com/?1+2+3
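A small Python 3 sketch of the two common ways to escape a space, since both '+' and '%20' come up in this thread. quote_plus produces the '+' form used in query strings, quote produces '%20', and a literal '+' in the data is itself escaped so that it is not mistaken for a space.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "1 2 3"

plus_form = quote_plus(text)  # spaces become '+'
percent_form = quote(text)    # spaces become '%20'
print(plus_form, percent_form)

# A real plus sign must be escaped, otherwise it would decode as a space.
escaped_plus = quote_plus("1+2")
print(escaped_plus)
```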