Is there a urllib bug with unquoting 'ß'? - python

I'm scraping a German website with Python and want to save files under their original names, which I get as URL-encoded strings.
e.g. what I start with:
'GA_Pottendorf_Wiener%20Stra%C3%9Fe%2035_signed.pdf'
what I want:
'GA_Pottendorf_Wiener Straße 35_signed.pdf'
import urllib.parse
urllib.parse.unquote(file_name_quoted)
The above code works fine for 99% of the names, but fails as soon as they contain an 'ß', which should be a perfectly legitimate UTF-8 character.
But I get this:
'GA_Pottendorf_Wiener Stra�e 35_signed.pdf'
>>> urllib.parse.quote('ß')
'%C3%9F'
>>> urllib.parse.unquote('%C3%9F')
'�'
Is this a bug, or a feature that I don't understand?
(and please feel free to correct my way of asking this question, it's my first one)
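For what it's worth, a quick sanity check you can run (assuming Python 3): unquote() already defaults to UTF-8, so the decode itself should be fine, and a '�' (U+FFFD) usually means either that the percent-encoded bytes were not valid UTF-8 in the first place (unquote() decodes with errors='replace'), or that the console/file you print to uses a different encoding:
import urllib.parse

name = urllib.parse.unquote('GA_Pottendorf_Wiener%20Stra%C3%9Fe%2035_signed.pdf')
print(repr(name))  # 'GA_Pottendorf_Wiener Straße 35_signed.pdf'

# A replacement character only shows up when the bytes are not valid UTF-8,
# e.g. a Latin-1-encoded 'ß':
print(urllib.parse.unquote('%C3%9F'))  # ß
print(urllib.parse.unquote('%DF'))     # � (b'\xdf' is not valid UTF-8)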

Related

How to use extended ascii with bs4 url

I've been reluctant to post a question about this, but after 3 days of googling I can't get this to work. Long story short, I'm making a raid gear tracker for WoW.
I'm using BS4 to handle the web scraping, and I'm able to pull the page and scrape the info I need from it. The problem I'm having is when there is an extended ASCII character in the player's name, ex: thermíte (the í is alt+161).
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
I'm trying to figure out how to re-encode the URL so it is more like this:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
I'm using Tkinter for the GUI; I have the user select their realm from a dropdown and then type the character name into an entry field.
namefield = Entry(window, textvariable=toonname)
I have a scraping function that performs the initial scrape of the main profile page; this is where I assign the value of namefield to a global variable. I also tried passing it directly to the scraper with this:
namefield = Entry(window, textvariable=toonname, command=firstscrape)
I thought I was close, because when it passed "thermíte", the scrape function would print out "therm\xC3\xADte", so all I needed to do was replace the '\x' with '%' and I'd be golden. But it wouldn't work. I could use mastername.find('\x') and it would find instances of it in the string, but mastername.replace('\x', '%') wouldn't actually replace anything.
I tried various combinations of r'\x', '\%', r'\x', etc. No dice.
Lastly, when I try to do things like encode into Latin-1 and then decode back into UTF-8, I get errors about how it can't handle the extended ASCII character.
urlpart1 = "http://us.battle.net/wow/en/character/garrosh/"
urlpart2 = mastername
urlpart3 = "/advanced"
url = urlpart1 + urlpart2 + urlpart3
That's what I've been using to try and rebuild the final URL (at the moment I'm leaving the realm constant until I can get the name problem fixed).
TL;DR:
I'm trying to take a URL with extended ASCII characters like:
http://us.battle.net/wow/en/character/garrosh/thermíte/advanced
And have it become a URL that a browser can easily process, like:
http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
with all of the normal extended ASCII characters.
I hope this made sense.
Here is a pastebin of the full script as it stands; there are some things in it that aren't utilized until later on: pastebin link
There shouldn't be any non-ASCII characters in the resulting URL. Make sure mastername is a Unicode string (isinstance(mastername, str) on Python 3):
#!/usr/bin/env python3
from urllib.parse import quote
mastername = "thermíte"
assert isinstance(mastername, str)
url = "http://us.battle.net/wow/en/character/garrosh/{mastername}/advanced"\
.format(mastername=quote(mastername, safe=''))
# -> http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced
You can try something like this:
>>> import urllib
>>> 'http://' + '/'.join([urllib.quote(x) for x in url.strip('http://').split('/')])
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
urllib.quote() "safe" urlencodes characters of a string. You don't want all the characters to be affected, just everything between the '/' characters and excluding the initial 'http://'. So the strip and split functions take those out of the equation, and then you concatenate them back in with the + operator and join
EDIT: This one is on me for not reading the docs... Much cleaner:
>>> url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
>>> urllib.quote(url, safe=':/')
'http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced'
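If you are on Python 3, a sketch of the same idea with urllib.parse, splitting the URL first so that only the path part gets quoted (the URL below is just the example from the question):
from urllib.parse import urlsplit, urlunsplit, quote

url = 'http://us.battle.net/wow/en/character/garrosh/thermíte/advanced'
parts = urlsplit(url)
# quote() keeps '/' by default, so the path separators survive
fixed = urlunsplit((parts.scheme, parts.netloc, quote(parts.path),
                    parts.query, parts.fragment))
print(fixed)
# http://us.battle.net/wow/en/character/garrosh/therm%C3%ADte/advanced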

Parsing xml with "not well-formed" characters in python

I am getting xml data from an application, which I want to parse in python:
#!/usr/bin/python
import xml.etree.ElementTree as ET
import re
xml_file = 'tickets_prod.xml'
xml_file_handle = open(xml_file,'r')
xml_as_string = xml_file_handle.read()
xml_file_handle.close()
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
root = ET.fromstring(xml_cleaned)
It works for smaller datasets with example data, but when I go to real live data, I get
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72
Looking at the xml file, I see this line 364658:
WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>
I guess it is the ^[ which makes Python choke - it is also highlighted in blue in vim. Now I was hoping that I could clean the data with my regex substitution, but that did not work.
The best thing would be fixing the application which generated the xml, but that is out of scope. So I need to deal with the data as it is. How can I work around this? I could live with just throwing away "illegal" characters.
You already do:
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
but the character ^[ is probably Python's \x1b (the ESC control character). If xml.parsers.expat chokes on it, you simply need to clean up more aggressively, accepting only a few whitelisted characters below 0x20 (space). For example:
xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
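For completeness, a minimal sketch of that workaround wired into the code from the question (same file name as above):
import re
import xml.etree.ElementTree as ET

with open('tickets_prod.xml', 'r') as xml_file_handle:
    xml_as_string = xml_file_handle.read()

# keep tab, newline and carriage return, drop every other control character
xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+', u'', xml_as_string)
root = ET.fromstring(xml_cleaned)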
I know this is pretty old, but I stumbled upon the following URL, which has a list of all of the primary characters and their encodings.
https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e

Python BeautifulSoup Ampersand issue Mac vs. Linux Ubuntu

I've read that BeautifulSoup has problems with ampersands (&) which are not strictly correct in HTML but are still interpreted correctly by most browsers. However, weirdly, I'm getting different behaviour on a Mac system and on an Ubuntu system, both using bs4 version 4.3.2:
html='<td>S&P500</td>'
s=bs4.BeautifulSoup(html)
On the Ubuntu system s is equal to:
<td>S&P500;</td>
Notice the added semicolon at the end, which is a real problem.
On the mac system:
<html><head></head><body>S&P500</body></html>
Never mind the html/head/body tags, I can deal with that, but notice that S&P500 is correctly interpreted this time, without the added ";".
Any idea what's going on? How to make cross-platform code without resorting to an ugly hack? Thanks a lot,
First, I can't reproduce the Mac results using Python 2.7.1 and BeautifulSoup 4.3.2; that is, I am getting the extra semicolon on all systems.
The easy fix is to a) use strictly valid HTML, or b) add a space after the ampersand. Chances are you can't change the source, and if you could parse out and replace these in Python you wouldn't be needing BeautifulSoup ;)
So the problem is that the BeautifulSoupHTMLParser first converts S&P500 to S&P500; because it assumes P500 is the character name and you just forgot the semicolon.
Then later it reparses the string and finds &P500;. Now it doesn't recognize P500 as a valid name and converts the & to &amp; without touching the rest.
Here is a stupid monkeypatch only to demonstrate my point. I don't know the inner workings of BeautifulSoup well enough to propose a proper solution.
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import BeautifulSoupHTMLParser
from bs4.dammit import EntitySubstitution

def handle_entityref(self, name):
    character = EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # Previously was
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)
html = '<td>S&P500</td>'
# Pre monkeypatching
# <td>S&P500;</td>
print(BeautifulSoup(html))
BeautifulSoupHTMLParser.handle_entityref = handle_entityref
# Post monkeypatching
# <td>S&P500</td>
print(BeautifulSoup(html))
Hopefully someone more versed in bs4 can give you a proper solution, good luck.
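Another workaround, just a sketch rather than a fix of BeautifulSoup itself, is to escape the bare ampersands before parsing, so the parser never sees S&P500 as a half-written entity (the regex below is my own assumption, not something from bs4):
import re
import bs4

html = '<td>S&P500</td>'
# escape '&' only when it is not already the start of an entity such as '&amp;' or '&#160;'
fixed = re.sub(r'&(?!#?\w+;)', '&amp;', html)
soup = bs4.BeautifulSoup(fixed, 'html.parser')
print(soup.td.get_text())  # S&P500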

How do I get a regular expression to recognize non-ASCII characters as letters?

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.
My problem is that when I print the information the öäå are gone.
I'm extracting the information using Beautiful Soup. I think that the problem is that I run a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this, I guess that Beautiful Soup encoded the strings so that the öäå became something like \x02, a hex escape.
So if I'm correct, then the regexes are removing the öäå. But then the only thing that should be left of such a hex escape after the regex is the x, and there are no x's in place of öäå on my page, so maybe this little theory isn't correct after all. Either way, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() on Google App Engine (I don't know if that helps in solving the problem).
EDIT: The encoding on the Swedish site is UTF-8 and the encoding on my site is also UTF-8.
EDIT2: You could use ISO-8859-10 for Swedish, but according to Google Chrome the encoding is Unicode (UTF-8) on this specific site.
Always work in Unicode and only convert to an encoded representation when necessary.
For this particular situation, you also need to use the re.U flag so \w matches Unicode letters:
#coding: utf-8
import re
location = "öäå".decode('utf-8')
location = re.sub(r'([^\w])+', '', location, flags=re.U)
print location # prints öäå
It would help if you could dump the strings before and after each step.
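Side note, in case Python 3 is an option: there \w matches Unicode word characters by default for str patterns, so the re.U flag is no longer needed. A tiny illustration:
import re

location = "öäå 123!"
# Python 3: \w already matches Unicode letters and digits on str patterns
print(re.sub(r'[^\w]+', '', location))  # öäå123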
Check your value of re.UNICODE first, see this

Returning a lower case ASCII string from a (possibly encoded) string fetched using urllib2 or BeautifulSoup

I am fetching data from a web page using urllib2. The content of all the pages is in the English language so there is no issue of dealing with non-English text. The pages are encoded however, and they sometimes contain HTML entities such as £ or the copyright symbol etc.
I want to check if portions of a page contains certain keywords - however, I want to do a case insensitive check (for obvious reasons).
What is the best way to convert the returned page content into all lower case letters?
def get_page_content_as_lower_case(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    temp = page.read()
    return str(temp).lower() # this doesn't work because page contains utf-8 data
Update:
I don't have to use urllib2 to get the data, in fact I may use BeautifulSoup instead, since I need to retrieve data from a specific element(s) in the page - for which BS is a much better choice. I have changed the title to reflect this.
HOWEVER, the problem still remains that the fetched data is in some non-ASCII encoding (supposedly UTF-8). I did check one of the pages and the encoding was iso-8859-1.
Since I am only concerned with the English language, I want to know how I can obtain a lower case ASCII string version of the data retrieved from the page, so that I can carry out a case-insensitive test as to whether a keyword is found in the text.
I am assuming that the fact that I have restricted myself to English only (from English-speaking websites) reduces the choice of encodings. I don't know much about encodings, but I assume that the valid choices are:
ASCII
iso-8859-1
utf-8
Is that a valid assumption, and if yes, perhaps there is a way to write a 'robust' function that accepts an encoded string containing English text and returns a lower case ASCII string version of it?
Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect to match both STRASSE as well as Straße with the search term Straße, but 'STRASSE'.lower() == 'strasse' (and you can't simply replace a double s with ß - there's no ß in Trasse). Other languages (in particular Turkish) will have similar complications as well.
If you're looking to support other languages than English, you should therefore use a library that can handle proper casefolding (such as Matthew Barnett's regex module).
That being said, the way to extract the page's content is:
import contextlib
import urllib2

def get_page_content(url):
    with contextlib.closing(urllib2.urlopen(url)) as uh:
        content = uh.read().decode('utf-8')
    return content

# You can call .lower() on the result, but that won't work in general
Or with Requests:
import requests

page_text = requests.get(url).text
lowercase_text = page_text.lower()
(Requests will automatically decode the response.)
As @tchrist says, .lower() will not do the job for Unicode text.
You could check out this alternative regex implementation, which implements case folding for Unicode case-insensitive comparison: http://code.google.com/p/mrab-regex-hg/
There are also casefolding tables available: http://unicode.org/Public/UNIDATA/CaseFolding.txt
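If Python 3.3+ is available, str.casefold() implements that kind of Unicode case folding out of the box; a small illustration matching the STRASSE/Straße example above:
# Python 3.3+: casefold() maps 'ß' to 'ss', unlike lower()
print('Straße'.casefold() == 'STRASSE'.casefold())  # True  (both fold to 'strasse')
print('Straße'.lower() == 'STRASSE'.lower())        # False ('straße' vs 'strasse')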
BeautifulSoup stores data as Unicode internally so you don't need to perform character encoding manipulations manually.
To find keywords (case-insensitive) in a text (not in attribute values, or tag names):
#!/usr/bin/env python
import urllib2
from contextlib import closing
import regex # pip install regex
from BeautifulSoup import BeautifulSoup
with closing(urllib2.urlopen(URL)) as page:
    soup = BeautifulSoup(page)

print soup(text=regex.compile(ur'(?fi)\L<keywords>',
                              keywords=['your', 'keywords', 'go', 'here']))
Example (Unicode words by @tchrist)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex
from BeautifulSoup import BeautifulSoup, Comment
html = u'''<div attr="PoSt in attribute should not be found">
<!-- it must not find post inside a comment either -->
<ol> <li> tag names must not match
<li> Post will be found
<li> the same with post
<li> and post
<li> and poſt
<li> this is ignored
</ol>
</div>'''
soup = BeautifulSoup(html)
# remove comments
comments = soup.findAll(text=lambda t: isinstance(t, Comment))
for comment in comments: comment.extract()
# find text with keywords (case-insensitive)
print ''.join(soup(text=regex.compile(ur'(?fi)\L<opts>', opts=['post', 'li'])))
# compare it with '.lower()'
print '.lower():'
print ''.join(soup(text=lambda t: any(k in t.lower() for k in ['post', 'li'])))
# or exact match
print 'exact match:'
print ''.join(soup(text=' the same with post\n'))
Output
Post will be found
the same with post
and post
and poſt
.lower():
Post will be found
the same with post
exact match:
the same with post
