When I append a Unicode string to the end of a str, the printed URL is no longer clickable.
Bad:
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
url = base_url + u"Ángel_Garasa"
print url
Good:
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
url = base_url + u"Toby_Maquire"
print url
It appears that you're printing the results in an IDE, perhaps PyCharm. You need to percent encode a UTF-8 encoded version of the string:
import urllib
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
name = u"Ángel_Garasa"
print base_url + urllib.quote(name.encode("utf-8"))
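This shows:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=%C3%81ngel_Garasa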
In your case you need to update your code, so that the relevant field from the database is percent encoded. You only need to encode this one field to UTF-8 just for the percent encoding.
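For readers on Python 3, the same idea can be sketched with urllib.parse.quote, which encodes a str as UTF-8 by default before percent-encoding it. The helper name build_title_url is made up for illustration; the base URL and the title come from the question.

```python
from urllib.parse import quote

BASE_URL = ('https://en.wikipedia.org/w/api.php?action=query&prop=revisions'
            '&rvprop=content&format=xml&titles=')


def build_title_url(title):
    """Append a percent-encoded page title to BASE_URL (hypothetical helper)."""
    # quote() encodes the str as UTF-8 and escapes every byte outside the
    # unreserved set, so the resulting URL contains only ASCII characters.
    return BASE_URL + quote(title)


if __name__ == "__main__":
    print(build_title_url("Ángel_Garasa"))
```

Titles that are already plain ASCII, such as Toby_Maquire, pass through unchanged.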
For some reason or another, it appears that depending on where I copy and paste a url from, urllib.request.urlopen won't work. For example when I copy http://www.google.com/ from the address bar and run the following script:
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
I get the following error at the call to urlopen:
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 5: ordinal not in range(128)
But if I copy the url string from text on a web page or type it out by hand, it works just fine. To be clear, this doesn't work
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
#url1 = "http://www.google.com/"
#
#response1 = request.urlopen(url1)
#print(response1)
but this does:
from urllib import request
#url = "http://www.google.com/"
#
#response = request.urlopen(url)
#print(response)
url1 = "http://www.google.com/"
response1 = request.urlopen(url1)
print(response1)
My suspicion is that the encoding is different in the actual address bar, and Spyder knows how to handle it, but I don't because I can't see what is actually going on.
EDIT: As requested...
print(ascii(url))
'http://www.google.com/\ufeff'
print(ascii(url1))
'http://www.google.com/'
Indeed the strings are different.
\ufeff is a zero-width non-breaking space, so it's no wonder you can't see it. Yes, there's an invisible character in your URL. At least it's not gremlins.
You could try
from urllib import request
url = "http://www.google.com/"  # pasted from the address bar, invisible BOM included
response = request.urlopen(url.replace("\ufeff", ""))
print(response)
That Unicode character is the byte-order mark, or BOM. In Python 3 a str has no decode method, so remove the character with replace. If the URL arrives as bytes instead, decoding with utf-8-sig drops a leading BOM, but here the BOM sits at the end of the string, so replace is still needed.
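A small sketch of how one might find and remove such invisible characters before calling urlopen. The helper names describe_non_ascii and strip_bom are hypothetical, and the BOM is written out explicitly as \ufeff so that it is visible in the source.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, escaped form, Unicode name) for each non-ASCII character."""
    return [(i, ascii(ch), unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(text) if ord(ch) > 127]


def strip_bom(text):
    """Remove every U+FEFF character, wherever it appears in the string."""
    return text.replace("\ufeff", "")


if __name__ == "__main__":
    pasted = "http://www.google.com/\ufeff"  # BOM written out explicitly
    print(describe_non_ascii(pasted))
    print(ascii(strip_bom(pasted)))
```

Running the check first shows where the stray character sits, which is useful when the culprit turns out to be something other than a BOM.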
The URL is:
https://www.siasat.pk/forum/showthread.php?553205-قطری-ہو-یا-برطانوی-خط-کرپشن-کی-نشانی-ہے&s=be8abfc34aa0ca5ddf9b6d40b2acad4b&p=4505464#post4505464
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
If I try urlopen(req), it raises an exception.
I want to convert the non-ASCII characters so that the URL is valid. How do I extract that substring and percent-encode it as UTF-8 with quote?
If I call quote(url) on the complete URL, the result is invalid.
You need to extract the query part and quote just that part:
from urllib.parse import urlsplit, urlunsplit, quote
url_split = urlsplit(url)
query = quote(url_split.query, safe="=&")  # keep the key=value&key=value separators intact
url_quoted = urlunsplit(url_split._replace(query=query))
# 'https://www.siasat.pk/forum/showthread.php?553205-%D9%82%D8%B7%D8%B1%DB%8C-...'
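The same approach wrapped in a function, as a sketch. The name quote_query and the example.com URL are assumptions for illustration; safe="=&" is passed so that the separators between query parameters survive the quoting.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_query(url):
    """Percent-encode only the query string of url (hypothetical helper)."""
    parts = urlsplit(url)
    # safe="=&" leaves the key=value&key=value separators untouched;
    # every other character outside the unreserved set is encoded as UTF-8
    # and percent-escaped.
    query = quote(parts.query, safe="=&")
    return urlunsplit(parts._replace(query=query))


if __name__ == "__main__":
    print(quote_query("https://example.com/showthread.php?553205-قطری&s=abc#post1"))
```

Note that a query which already contains percent escapes would be escaped a second time by this sketch, because % is not in the safe set.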
I would like to loop over a set of URLs, so I need to insert an integer at the point where the page id changes, like this.
In the middle of the URL I put % count %, but it does not seem to work. How can I concatenate it?
count=2
while (count < pages):
    mech = Browser()
    url = 'http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
    url = int(raw_input(url))
    mech = Browser()
    page = mech.open(url)
    soup = BeautifulSoup(page)
    print url
    for thediv in soup.findAll('li',{'class':' ilo2'}):
        links = thediv.find('a')
        links = links['href']
        print links
    count = count+1
I am getting this error:
TypeError: not all arguments converted during string formatting
Final Url Format
http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491
The % operator does not work like that in Python.
Here is how you should use it:
url = 'http://....../ref=sr_pg_%s?rh=.............' % (count, )
As you already have % symbols in your URL pattern, you should begin by doubling them so that Python does not treat them as placeholders:
url = 'http://www.amazon.com/s/ref=sr_pg_%s?rh=n%%3A2858778011%%2Cp_drm_rights%%3APurchase%%7CRental%%2Cn%%3A2858905011%%2Cp_n_date%%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491' % (count, )
That being said, there is a Python module dedicated to parsing and building URLs. It is named urllib, and you can find its documentation here: https://docs.python.org/3.3/library/urllib.parse.html
You have urlencoded entities in your string (%3A etc.). You might try using {} syntax instead:
url = 'http://.....{}...{}...'.format(first_arg, second_arg)
Then you'll also see any other issues in the string.
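A minimal sketch of the {} approach applied to the URL from the question. The names URL_TEMPLATE and page_urls are hypothetical; the point is that str.format ignores % characters, so the existing %3A-style escapes need no doubling.

```python
# The only placeholder is {page}; the %3A, %2C and %7C escapes are left alone
# because str.format gives no special meaning to the percent sign.
URL_TEMPLATE = ('http://www.amazon.com/s/ref=sr_pg_{page}'
                '?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental'
                '%2Cn%3A2858905011%2Cp_n_date%3A2693527011'
                '&page=3&sort=csrank&ie=UTF8&qid=1403073491')


def page_urls(first, last):
    """Return one URL per page number in range(first, last)."""
    return [URL_TEMPLATE.format(page=n) for n in range(first, last)]


if __name__ == "__main__":
    for url in page_urls(2, 5):
        print(url)
```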
If you were looking to keep the string as is (not inserting a variable value inside), the problem would be that the string is delimited by single quotes ' and itself contains single quotes. You can use double quotes instead:
url = "http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491"
A better solution is escaping the quotes:
url = 'http://www.amazon.com/s/ref=sr_pg_%s\'% count %\'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
Instead of trying to parse or edit URLs using raw strings, one should use the dedicated module, urlparse (or urllib.parse in Python 3).
Here is a simple example, using the OP's URL:
import urlparse
original_url = (
"""http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2"""
"""Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date"""
"""%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491""")
parsed = urlparse.urlparse(original_url)
This returns something like this:
ParseResult(
scheme='http', netloc='www.amazon.com', path='/s/ref=sr_pg_2',
params='',
query='rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491', fragment='')
Then we edit the path part of the URL (count holds the page number, 423 in this run):
scheme, netloc, path, params, query, fragment = parsed
path = '/s/ref=sr_pg_%d' % (count, )
And we "unparse" the URL:
new_url = urlparse.urlunparse((scheme, netloc, path, params, query, fragment))
And we have a new URL with the path edited:
'http://www.amazon.com/s/ref=sr_pg_423?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
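On Python 3 the same parse, edit, unparse steps might look like the sketch below, using urllib.parse. The helper name with_page is an assumption, and the shortened query string is only there to keep the example readable.

```python
from urllib.parse import urlsplit, urlunsplit


def with_page(url, page):
    """Return url with the page number in its path replaced (hypothetical helper)."""
    parts = urlsplit(url)
    # Only the path is rebuilt; the query string, including its existing
    # percent escapes, is passed through unchanged.
    new_path = '/s/ref=sr_pg_%d' % page
    return urlunsplit(parts._replace(path=new_path))


if __name__ == "__main__":
    original = "http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011&page=3"
    print(with_page(original, 423))
```

Because the % formatting is applied to the short path template only, the escapes in the query never reach the % operator.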
I'm having a nightmare with data scraped with Scrapy. Currently I encode it using UTF-8, i.e. detail_content.select('p/text()[1]').extract()[0].encode('utf-8'), save it into a JSON file, and then the captured text is displayed again using Django and a mobile app.
In the JSON file, non-ASCII characters are stored as Unicode escapes: 'blah blah \u00a34,000 blah'
Now my problem is that when I try to display the text in a Django template or the mobile app, the literal escape sequence is displayed: \u00a3 instead of £
Should I not be storing escaped unicode in JSON? Would it be better to store ASCII in the JSON file using the JSON escaping? If so how do you go about doing this with scrapy?
Scrapy code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item, Field
import datetime
import unicodedata
import re
class Spider(BaseSpider):
    #spider stuff
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//ul[@class = "category3"]/li')
        for row in rows:
            item = Item()
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['header'] = str(row.select('div[2]/a/text()')
                                     .extract()[0].encode('utf-8'))
            else:
                item['header'] = ''
            if len(row.select('div[2]/a/text()').extract()) > 0:
                item['_id'] = str(row.select('div[2]/a/text()')
                                  .extract()[0].encode('utf-8'))
            else:
                item['_id'] = ''
            item['_id'] = self.slugify(item['_id'])[0:20]
            item_url = row.select('div[2]/a/@href').extract()
            today = datetime.datetime.now().isoformat()
            item['dateAdded'] = str(today)
            yield Request(item_url[0], meta={'item' : item},
                          callback=self.parse_item)
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        detail_content = hxs.select('//*[@id="content-area"]')
        item = response.request.meta['item']
        item['description'] = str(detail_content.select('p/text()[1]')
                                  .extract()[0])
        item['itemUrl'] = str(detail_content.select('//a[@title="Blah"]/@href')
                              .extract()[0])
        item['image_urls'] = detail_content.select('//img[@width="418"]/../@href').extract()
        print item
        return item
Ok this I find very odd:
item['header'] = str(row.select('div[2]/a/text()')
.extract()[0].encode('utf-8'))
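It is not correct to do str(<some_value>.encode('utf-8')). In Python 2, encode('utf-8') already returns a byte str, so the extra str() adds nothing. Worse, calling str() directly on a unicode value (as the description and itemUrl lines do) implicitly encodes it to ASCII, which raises an error as soon as a character falls outside the 128 ASCII code points.
Now, I strongly believe you're already getting the characters from Scrapy as unicode.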
I receive errors like: exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 127: ordinal not in range(128)
So, my suggestion is to change the code to this:
item['header'] = row.select('div[2]/a/text()').extract()[0].encode('utf-8')
Just remove the str() call. This takes the unicode received from Scrapy and encodes it as UTF-8. Once it is UTF-8 bytes, be careful with string operations. Normally this conversion from unicode to a specific encoding should be done just before writing to disk.
Note that you have this kind of code in two places. Modify them both.
UPDATE: Take a look at this, might be helpful: scrapy text encoding
Hope this helps!
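On the question of whether escaped Unicode belongs in the JSON file: \u00a3 is a valid JSON escape, and a JSON parser turns it back into £. If the literal backslash sequence shows up on screen, one possible cause (an assumption, since the Django and mobile code is not shown) is that the text is displayed without first being parsed as JSON. A short Python 3 sketch of both storage options, using a made-up item dictionary:

```python
import json

# Hypothetical item, modelled on the text from the question.
item = {"description": "blah blah \u00a34,000 blah"}

# Default: non-ASCII characters are written as \uXXXX escapes (ASCII-only file).
escaped = json.dumps(item)

# Alternative: keep the characters as they are; write the file as UTF-8.
literal = json.dumps(item, ensure_ascii=False)

# Either form decodes to the same Python value.
restored = json.loads(escaped)

if __name__ == "__main__":
    print(escaped)
    print(literal)
    print(restored["description"])
```

Both serialisations are equivalent to a compliant parser, so the choice mostly affects how readable the file is to a human.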
I am trying to append a string to a URL, but because the string contains spaces, everything after the first space is not recognized as part of the URL.
Here would be an example:
import urllib
import urllib2
website = "http://example.php?id=1 order by 1--"
request = urllib2.Request(website)
response = urllib2.urlopen(request)
html = response.read()
The "order by 1--" part is not recognized as part of the URL.
You had better use urllib.urlencode or urllib.quote:
website = "http://example.com/?id=" + urllib.quote("1 order by 1--")
or
website = "http://example.com/?" + urllib.urlencode({"id": "1 order by 1 --"})
and about the query you're trying to achieve:
I think you're forgetting a ; to end the first query.
Of course not. Spaces are invalid in a query string, and should be replaced by +.
http://example.com/?1+2+3
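For Python 3, where these functions live in urllib.parse, a sketch of both encodings follows. BASE and build_url are hypothetical names; urlencode replaces spaces with +, while quote uses %20.

```python
from urllib.parse import quote, urlencode

BASE = "http://example.com/"  # placeholder host


def build_url(params):
    """Return BASE plus a query string built from the params dict."""
    # urlencode() escapes each key and value and joins the pairs with '&'.
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_url({"id": "1 order by 1--"}))
    print(BASE + "?id=" + quote("1 order by 1--"))
```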