How to convert u'\x96' to u'–' in Python

I'm porting content from an old Wordpress blog to Mezzanine. I was given a json dump of the database and the posts are littered with special characters that look like this: \x96 among otherwise unescaped html.
If I manually replace the backslash with &# and append a semicolon, the character renders correctly:
so \x96 becomes &#x96;
i.e. escaped hex to HTML entity (hex).
How to do this in Python?

If &#150; is also acceptable, you can use:
>>> u'\x96'.encode('ascii', 'xmlcharrefreplace')
'&#150;'
which is even called out in the documentation (although not very clearly).
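If you want the actual en dash (U+2013) rather than a character reference: \x96 is not meaningful punctuation in Unicode, but byte 0x96 is an en dash in Windows-1252, which suggests the dump was decoded with the wrong codec somewhere upstream. A minimal Python 3 sketch, assuming the stray characters really are Windows-1252 leftovers:
raw = u'\x96'  # the stray character as it appears in the JSON dump
# Round-trip through latin-1 to get the original byte back, then decode
# it as Windows-1252 to recover the real glyph.
fixed = raw.encode('latin-1').decode('cp1252')
print(fixed)  # – (U+2013 EN DASH)
# Or emit an HTML character reference, as in the answer above:
print(fixed.encode('ascii', 'xmlcharrefreplace').decode('ascii'))  # &#8211;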

Related

How to remove special characters in json data python

I am reading a set of data from a json file. Content of the json file looks like:
"Address":"4820 ALCOA AVE� ",
"City":"VERNON� "
As you can see that it contains a special character � and white space at the end. While reading this json data, it is coming like:
'address': '4820 ALCOA AVE� '
'city': 'VERNON� '
I can remove the whitespace easily, but I am not sure how I can remove the � character. I do not have direct access to the json file so I cannot edit it, and even if I had access, it would take a couple of hours to edit the file. Is there any way in Python we can remove these special characters? Please help. Thanks
You can use a regexp:
import re

# Strip everything outside printable ASCII. Note this also removes
# accented letters and other legitimate non-ASCII characters.
address = re.sub(r"[^\x20-\x7E]", "", "4820 ALCOA AVE� ")
print(address)
Looks like somewhere upstream wasn't handling character encoding properly and ended up with replacement characters... You may need to keep an eye out in case it mangled more important parts of the text (e.g. accented characters, non-English letters, emoji).
For the immediate problem, you can load the JSON data with the utf-8 encoding, then strip the character '\ufffd'.
value = value.strip().strip('\ufffd')
If the replacement characters also appear in the middle (and you want to delete them), you might want to use replace() instead.
value = value.replace('\ufffd', '').strip()
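Putting both answers together, a minimal sketch ('data.json' stands in for the file you were given):
import json

with open('data.json', encoding='utf-8') as f:
    record = json.load(f)

# Remove U+FFFD replacement characters wherever they appear, then trim:
cleaned = {key: value.replace('\ufffd', '').strip()
                if isinstance(value, str) else value
           for key, value in record.items()}
print(cleaned['Address'])  # 4820 ALCOA AVE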

How do I get a regular expression to recognize non-ASCII characters as letters?

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.
My problem is that when I print the information the öäå are gone.
I'm extracting the information using Beautiful Soup. I think the problem is that I run a bunch of regular expressions on the strings that I extract, e.g. location = re.sub(r'([^\w])+', '', location) to remove everything except for the letters. Before this, I guess that Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.
So if I'm correct, the regexes are removing the öäå. I mean, the only thing that should be left of the hex char after the regex is the x, but there are no x's in place of öäå on my page, so this little theory may not be correct. Anyway, right or wrong, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (don't know if that helps in solving the problem).
EDIT: The encoding on the Swedish site is utf-8 and the encoding on my site is also utf-8.
EDIT2: You can use ISO-8859-10 for Swedish, but according to Google Chrome the encoding is Unicode (utf-8) on this specific site.
Always work in unicode and only convert to an encoded representation when necessary.
For this particular situation, you also need to pass the re.U flag so that \w matches unicode letters (this is Python 2 code):
# coding: utf-8
import re

location = "öäå".decode('utf-8')  # byte string -> unicode
location = re.sub(r'([^\w])+', '', location, flags=re.U)
print location  # prints öäå
It would help if you could dump the strings before and after each step.
Check your value of re.UNICODE first.
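For reference, Python 3 behaves differently: with str patterns, \w matches unicode letters by default. A small sketch contrasting the two modes; re.ASCII reproduces the failure described above:
import re

location = 'öäå123'
# Python 3 default: \w matches unicode letters, so nothing is lost.
print(re.sub(r'[^\w]+', '', location))              # öäå123
# ASCII-only matching treats öäå as non-word characters and strips them.
print(re.sub(r'[^\w]+', '', location, flags=re.A))  # 123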

Working with strings in python produces strange quotation marks

currently I am working with scrapy, which is a web crawling framework based on Python. The data is extracted from html using XPath. (I am new to Python.) To wrap the data scrapy uses items, e.g.
item = MyItem()
item['id'] = obj.select('div[@class="id"]').extract()
When the id is printed like print item['id'] I get the following output:
[u'12346']
My problem is that this output is not always in the same form. Sometimes I get an output like
"[u""someText""]"
This happens only with text, but actually there is nothing special about this text compared to other text that is handled correctly, just like the ID.
Does anyone know what the quotation marks mean? Like I said the someText was crawled like all other text data, e.g. from
<a>someText</a>
Any ideas?
Edit:
My spider crawls all pages of a blog. Here is the exact output
[u'41039'];[u'title]
[u'40942'];"[u""title""]"]
...
Extracted with
item['title'] = site.select('div[@class="header"]/h2/a/@title').extract()
I noticed that it is always the same blog posts that have these quotation marks, so they don't appear randomly. But there is nothing special about the text. E.g. this title produces quotation marks:
<a title="Xtra Pac Telekom web'n'walk Stick Basic für 9,95" href="someURL">
Xtra Pac Telekom web'n'walk Stick Basic für 9,95</a>
So my first thought was that this is because of some special chars, but there aren't any.
This happens only when the items are written to csv; when I print them in cmd there are no quotation marks.
Any ideas?
Python can use both single ' and double " quotes as quotation marks. When it prints something out it normally chooses single quotes, but will switch to double quotes if the text it is printing contains single quotes (to avoid having to escape the quote in the string):
so normally it prints [u'....'], but sometimes you have text that contains a ' character, and then it prints [u"...."].
Then there is an extra complication when writing to csv. If a string that contains just a ' is written to csv, it is written as it is. So [u'....'] is written as [u'....'].
But if it contains double quotes then (1) everything is put inside double quotes and (2) any double quotes are repeated twice. So [u"..."] is written as "[u""...""]". If you read the csv data back with a csv library then this will be detected and removed, so it will not cause any problems.
So it's a combination of the text containing a single quote (making Python use double quotes) and the csv quoting rules (which apply to double quotes, but not single quotes).
If this is a problem, the csv library has various options to change the behaviour - http://docs.python.org/library/csv.html
The Wikipedia page on CSV explains the quoting rules in more detail; the behaviour here is shown by the example with "Super, ""luxurious"" truck".
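A small sketch of both effects, using a made-up title that contains single quotes (as the Telekom one above does):
import csv
import io

# What Python 2's repr produced for a scraped title containing ' quotes:
field = '[u"web\'n\'walk Stick"]'

buf = io.StringIO()
csv.writer(buf).writerow(['40942', field])
print(buf.getvalue().strip())
# 40942,"[u""web'n'walk Stick""]"  <- wrapped in quotes, inner " doubled

# Reading it back with the csv module undoes the quoting again:
print(next(csv.reader(io.StringIO(buf.getvalue()))))
# ['40942', '[u"web\'n\'walk Stick"]']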

Translating content in filesystem for a Plone product

I'm trying to get certain strings in a .py file translated, using the i18n machinery. Translating .pt files is not a problem, but whenever I try to translate using _('Something') in Python code on the filesystem, it always gives the English text (which is the default) instead of the Norwegian text that should be there. So I see output from Python code in English, while Page Template bits are correctly translated.
Is there a how-to or something similar for this?
Is the domain name used for _('Something') the same as what you use in the Norwegian .po file that has the translation? They should be the same, so do not use 'plone' in one case and 'my.domain' in the other.
Also, the call to the underscore function does not in itself translate the string; it only creates a string that can be translated. If this string ends up on its own directly in a template, you should add i18n:translate="" to that tag, probably with a matching i18n:domain.
Otherwise you should manually call the translate method, as in http://readthedocs.org/docs/collective-docs/en/latest/i18n/localization.html#manually-translated-message-ids. Read the Plone 4 migration guide for some differences between Plone 3 and 4 that might bite you here.
If you are looking for how-tos, you should probably read these docs:
http://plone.org/documentation/kb/i18n-for-developers
http://readthedocs.org/docs/collective-docs/en/latest/i18n/localization.html
Bye,
Giacomo
Be aware that _() does not translate the text when called, but returns a Message object which will be translated when rendered in a template.
That means:
do not concatenate Message objects: "text %s" % _('translation') will not work, and neither will "text" + _('translation')
if you do not send the text to the browser through a template, it may not be translated; for example, if you generate an email you need to translate the Message object manually, as sketched below
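A minimal sketch of that manual step, assuming a message factory for a hypothetical my.domain domain (it must match the domain of your Norwegian .po file) and access to the current request:
from zope.i18n import translate
from zope.i18nmessageid import MessageFactory

# 'my.domain' is a placeholder; use the domain your .po files declare.
_ = MessageFactory('my.domain')

def email_body(request):
    msg = _(u'Something')  # a Message object, not yet translated
    # Outside a template nothing translates the Message for you, so call
    # translate() explicitly; the request drives language negotiation.
    return translate(msg, context=request)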

need to selectively escape html entities (&)

I'm scraping an html page, then using xml.dom.minidom.parseString() to create a DOM object.
However, the html page has a '&'. I can use cgi.escape to convert this into &amp;, but it also converts all my html <> tags into &lt;&gt;, which makes parseString() unhappy.
How do I go about this? I would rather not just hack it and straight replace the "&"s.
thanks
For scraping, try to use a library that can handle such html "tag soup", like lxml, which has an html parser (as well as a dedicated html package in lxml.html), or BeautifulSoup. You will also find that these libraries contain other things that make scraping/working with html easier, aside from being able to handle ill-formed documents: getting information out of forms, making hyperlinks absolute, using css selectors...
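For instance, a small sketch with lxml.html, which quietly repairs a bare ampersand and an unclosed tag while building the tree:
import lxml.html

# The bare '&' and the unclosed <a> would make xml.dom.minidom choke,
# but the html parser fixes them up as it goes.
doc = lxml.html.fromstring('<p>Fish & Chips <a href="x?a=1&b=2">menu</p>')
print(doc.text_content())        # Fish & Chips menu
print(lxml.html.tostring(doc))   # serialized tree, with & escaped as &amp;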
i would rather not just hack it and straight replace the "&"s
Er, why? That's what cgi.escape is doing - effectively just a search and replace operation for certain characters that have to be escaped.
If you only want to replace a single character, just replace the single character:
yourstring.replace('&', '&amp;')
Don't beat around the bush.
If you want to make sure that you don't accidentally re-escape an already escaped & (i.e. not transform &amp; into &amp;amp; or &szlig; into &amp;szlig;), you could use
import re
newstring = re.sub(r"&(?![A-Za-z])", "&amp;", oldstring)
This will leave &s alone when they are followed by a letter.
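A quick demonstration of what that pattern does and does not catch (the input string is made up for illustration):
import re

snippet = 'Fish & Chips &amp; more &auml;'
# The bare '&' gets escaped; &amp; and &auml; are left alone because
# their & is followed by a letter.
print(re.sub(r"&(?![A-Za-z])", "&amp;", snippet))
# Fish &amp; Chips &amp; more &auml;
# Caveats: a bare & directly followed by a letter (e.g. 'AT&T') slips
# through unescaped, and numeric references like &#8211; get
# double-escaped unless you also allow '#' in the lookahead.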
You shouldn't use an XML parser to parse data that isn't XML. Find an HTML parser instead, you'll be happier in the long run. The standard library has a few (HTMLParser and htmllib), and BeautifulSoup is a well-loved third-party package.
