Decoding an encoded URL - python

I want to automatically download files that are linked from a PDF.
I already wrote a script that finds all these links, and it works great; the problem I'm facing is with the files' names.
I want to save each file under its default name so it is easy to understand what each file is, without having to rename everything manually.
The problem is that each name is percent-encoded as UTF-8. This site https://www.webatic.com/url-convertor converts the encoded strings just fine, but Python won't let me use decode to decode them.
For example, the string %D7%97%D7%95%D7%9E%D7%A8%D7%99+%D7%9C%D7%99%D7%9E%D7%95%D7%93 should become חומרי לימוד after decoding.

Python has a URL parser for exactly this:
>>> import urllib.parse
>>> urllib.parse.unquote_plus('%D7%97%D7%95%D7%9E%D7%A8%D7%99+%D7%9C%D7%99%D7%9E%D7%95%D7%93')
'חומרי לימוד'
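If the encoded name is the last path component of each link, you can decode it and use it directly as the filename when saving the download. A small sketch (the link below is a placeholder, and urllib.request.urlretrieve is just one way to fetch the file):
import urllib.parse
import urllib.request

link = 'http://example.com/files/%D7%97%D7%95%D7%9E%D7%A8%D7%99+%D7%9C%D7%99%D7%9E%D7%95%D7%93.pdf'  # placeholder link
encoded_name = link.rsplit('/', 1)[-1]              # last path component of the URL
filename = urllib.parse.unquote_plus(encoded_name)  # 'חומרי לימוד.pdf'
urllib.request.urlretrieve(link, filename)          # save under the decoded name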

Related

Python - pdfme - writing utf-8 characters to file

I would like to generate a PDF report using the pdfme library, and I need Polish characters to come out correctly. The example report ends with:
with open('document.pdf', 'wb') as f:
    build_pdf(document, f)
So I cannot add encoding="utf-8" to the open() call. Is there any way I can still use Polish characters?
I tried:
Changing 'wb' to text mode 'w' and setting encoding to utf-8. I get: "TypeError: write() argument must be str, not bytes".
Adding .encode("utf-8") to strings with Polish characters, e.g. "Paweł".encode("utf-8"). I get: "TypeError: value of . attr must be of type str, list or tuple: b'Pawe\xc5\x82'"
In this case, the part of the code responsible for dealing with the unicode characters is the PDF library. The build_pdf call there, for whatever library it is, has to be able to handle any character in document. And if it fails, it is the context around the PDF library, the owner of the build_pdf call, that has to be changed so that it will handle all the characters you need.
"utf-8" is just one way of expressing characters as bytes. A PDF file is a binary file, and it has internal headers, structures and settings to do its own character-encoding handling: your text may end up inside the PDF encoded as utf-8 or as some other, legacy encoding, but that will be transparent to you and to anyone using the PDF file.
It may be that the document, if it is text (we don't know if it is plain text or some object from your library that has already been pre-processed), can be encoded before the call, if your library says build_pdf can accept bytes instead:
build_pdf(document.encode('utf-8'), f)
But that would be a strange way of working; it is likely that either build_pdf does the encoding itself, or whatever process generated the document has already done so.
To get more meaningful help, you have to say which library you are using to generate the PDF, and include the import lines in your code, including the creation of your document, so that we have a minimal reproducible example: i.e. I can copy your code, paste it into a .py file here, install the lib, run it, and see a corrupted PDF file with the Polish characters mangled. Then I, and others, will be able to fix it. Otherwise, this answer is as far as I can get.
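A minimal reproducible example along those lines might look roughly like the sketch below. The document structure is my assumption, recalled from the pdfme README, so verify the exact keys against the library's documentation:
from pdfme import build_pdf

# Assumed document layout: a dict with a "sections" list whose entries
# hold a "content" list of paragraphs. Pass plain str values; no manual .encode().
document = {
    "sections": [
        {"content": ["Zażółć gęślą jaźń, Paweł"]}
    ]
}

with open('document.pdf', 'wb') as f:  # binary mode, exactly as in the example report
    build_pdf(document, f)
If a script like this produces a PDF with the Polish characters mangled, it is exactly the kind of self-contained snippet asked for above.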

python - How to properly encode string in utf8 which is ISO8859-1 from xml

I'm using the following code in Python 2.7 to retrieve an XML file containing some German umlauts like ä, ü, ö, ß:
.....
    def getXML(self, url):
        xmlFile = urllib2.urlopen(self.url)
        xmlResponse = xmlFile.read()
        xmlResponse = xmlResponse
        xmlFile.close()
        return xmlResponse
        pass

    def makeDict(self, xmlFile):
        data = xmltodict.parse(xmlFile)
        return data

    def saveJSON(self, dictionary):
        currentDayData = dictionary['speiseplan']['tag'][1]
        file = open('data.json', 'w')
        # Write the currentDay as JSON
        file.write(json.dumps(currentDayData))
        file.close()
        return True
        pass

# Execute
url = "path/to/xml/"
App = GetMensaJSON(url)
xml = GetMensaJSON.getXML(App, url)
dictionary = GetMensaJSON.makeDict(App, xml)
GetMensaJSON.saveJSON(App, dictionary)
The problem is that the XML file claims in its <?xml?> declaration that it is utf-8, but it isn't; by trial and error I found out that it is iso8859_1.
So I wanted to convert from utf-8 to iso8859 and back to utf-8 to resolve the conflicts.
Is there an elegant way to fix the broken umlauts? In my code, for example, I get \u00c3\u009f instead of ß and \u00c3\u00bc instead of ü.
I found this similar question but I can't get it to work: How to parse an XML file with encoding declaration in Python?
Also I should add that I can't influence the way I get the XML.
The XML file link can be found in the code.
The output from repr(xmlResponse) is:
"<?xml version='1.0' encoding='utf-8'?>\n<speiseplan>\n<tag timestamp='1453676400'>\n<item language='de'>\n<category>Stammessen</category>\n<title>Gem\xc3\x83\xc2\xbcsebr\xc3\x83\xc2\xbche mit Backerbsen (25,28,31,33,34), paniertes H\xc3\x83\xc2\xa4hnchenbrustschnitzel (25) mit Paprikasauce (33,34), Pommes frites und Gem\xc3\x83\xc2\xbcse
You are trying to encode already encoded data. urllib2.urlopen() can only return you a bytestring, not unicode, so encoding makes little sense.
What happens instead is that Python is trying to be helpful here: if you insist on encoding bytes, it'll first decode those bytes to unicode, and it'll use the default codec (ASCII) for that.
On top of that, XML documents are themselves responsible for declaring what codec should be used to decode them. The default is UTF-8; don't manually re-code the data yourself, leave that to an XML parser.
If you have Mojibake data in your XML document, the best way to fix that is to do so after parsing. I recommend the ftfy package to do this for you.
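For example, with the data from the question you could parse first and then repair the text values (the dict path is the one the question's saveJSON uses; adjust the indexing to the real structure):
import ftfy
import xmltodict

data = xmltodict.parse(xmlResponse)              # the parser honours the declared utf-8
day = data['speiseplan']['tag'][1]               # same path as in the question's saveJSON
fixed = ftfy.fix_text(day['item'][0]['title'])   # e.g. u'Gem\xc3\xbcse...' becomes u'Gemüse...'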
You could manually 'fix' the encoding by first decoding as UTF-8, then encoding to Latin-1 again:
xmlResponse = xmlFile.read().decode('utf-8').encode('latin-1')
However, this makes the assumption that your data has been badly decoded as Latin-1 to begin with; this is not always a safe assumption. If it was decoded as Windows CP 1252, for example, then the best way to recover your data is to still use ftfy.
You could try using ftfy before parsing as XML, but this relies on the document not using any non-ASCII characters outside of text and attribute content:
import ftfy

xmlResponse = ftfy.fix_text(
    xmlFile.read().decode('utf-8'),
    fix_entities=False, uncurl_quotes=False, fix_latin_ligatures=False,
).encode('utf-8')

Python's glob module and Unix's find command don't recognize non-ASCII

I am on Mac OS X 10.8.2
When I try to find files whose names contain non-ASCII characters, I get no results, although I know for sure that they exist. Take for example the console input
> find */Bärlauch*
I get no results. But if I try without the umlaut I get
> find */B*rlauch*
images/Bärlauch1.JPG
So the file is definitely existing. If I rename the file replacing 'ä' by 'ae' the file is being found.
Similarly, the Python module glob is not able to find the file:
>>> glob.glob('*/B*rlauch*')
['images/Bärlauch1.JPG']
>>> glob.glob('*/Bärlauch*')
[]
I figured out it must have something to do with the encoding, but my terminal is set to utf-8 and I am using Python 3.3.0, which uses unicode strings.
Mac OS X always stores filenames on HFS+ in decomposed (NFD) form. Use unicodedata.normalize('NFD', pattern) to decompose the glob pattern so it matches:
import unicodedata
glob.glob(unicodedata.normalize('NFD', '*/Bärlauch*'))
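The mismatch is easy to see by comparing the two normalization forms of the same string:
import unicodedata

nfc = 'Bärlauch'                            # 'ä' as one precomposed code point (NFC)
nfd = unicodedata.normalize('NFD', nfc)     # 'a' followed by a combining diaeresis, as HFS+ stores it
nfc == nfd                                  # False, which is why the literal pattern never matches
len(nfc), len(nfd)                          # (8, 9)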
Python programs are fundamentally text files. Conventionally, people write them using only characters from the ASCII character set, and thus do not have to think about the encoding they write them in: all character sets agree on how ASCII characters should be decoded.
You have written a Python program using a non-ASCII character. Your program thus comes with an implicit encoding (which you haven't mentioned): to save such a file, you have to decide how you are going to represent a-umlaut on disk. I would guess that perhaps your editor has chosen something non-Unicode for you.
Anyway, there are two ways around such a problem: either you can restrict yourself to using only ASCII characters in the source code of your program, or you can declare to Python that you want it to read the text file with a specific encoding.
To do the former, you should replace the a-umlaut with its Unicode escape sequence, which is \u00e4. To do the latter, you should add a coding declaration at the top of the file:
# -*- coding: <your encoding> -*-

Getting the correct encoding for strings and csv-files in Python

I'm using mechanize in Python to grab some data from a website and send it new data.
The thing is that the site is in French, so I get question marks in a diamond shape (�) instead of various characters such as éÉÀàùÙîû and others.
I tried looking around on Google and StackOverflow and found various answers that didn't fix my problem. I've seen answers recommending trying one of the following lines:
myString = 'éÀî'
myString.encode('latin-1')
myString.encode('iso-8859-1')
unicode(myString, 'iso-8859-1')
but none of those seem to work.
The two cases where I need this are when I read a csv file with accents and with hardcoded strings containing accents. For instance, here's what a line in the csv file looks like (actually ';' is the separator):
Adam Guérin;myemail#mail.com;555-5555;2011-02-05
The 'é' looks fine, but when I try to fill a textField on the website with mechanize and submit it, the 'é' now looks like '�' on the actual website.
Edit:
This is my code for reading the data in the csv file:
subscriberReader = csv.reader(open(path, 'rb'), delimiter=';')
subscribers = []
for row in subscriberReader:
    subscribers.append(Subscriber(row[0], row[1], row[2]))
Then I send it to the website using mechanize:
self.br.select_form('aspnetForm')
self.br.form['fldEmail'] = subscriber.email
self.br.form['fldName'] = subscriber.name
self.br.form['fldPhoneNum'] = subscriber.phoneNum
self.br.submit()
I tried various ways to encode the characters, but I guess I'm not doing it correctly. I'll be glad to try anything that gets suggested in the answers / comments.
As for the website, it doesn't specify which encoding it is using in the header.
First, you mentioned that you want to place literals into your code. To do so, you need to tell Python what encoding your script file has. You do this with a comment declaration at the beginning of the file (I'll assume that you're using latin-1).
# -*- coding: latin-1 -*-
myString = u'éÀî'
Second, you need to be able to work with the string. This isn't mechanize-specific, but covering a few basics should be useful: first, myString ends up being a unicode object (because of the way the literal was declared, with the u''). So, to use it as a Latin-1 encoding, you'll need to call .encode(), for example:
with open('test.txt', 'w') as f:
    f.write(myString.encode('latin-1'))
And finally, when reading in a string that is encoded (say, from the remote web site), you can use .decode() to decode it into a unicode object, and work with it from there.
with open('test.txt', 'r') as f:
    myString = f.read().decode('latin-1')
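Applied to the CSV-plus-mechanize flow from the question, the same idea might look roughly like this. The 'utf-8' for the CSV file and 'latin-1' for the site are assumptions you would have to confirm, and the URL is a placeholder; the form and field names are the ones from the question:
# -*- coding: utf-8 -*-
import csv
import mechanize

br = mechanize.Browser()
for row in csv.reader(open('subscribers.csv', 'rb'), delimiter=';'):
    name = row[0].decode('utf-8')                  # CSV bytes -> unicode; use the file's real encoding
    email = row[1].decode('utf-8')
    br.open('http://example.com/subscribe')        # placeholder URL; reload the form for each row
    br.select_form('aspnetForm')
    br.form['fldName'] = name.encode('latin-1')    # unicode -> whatever encoding the site expects
    br.form['fldEmail'] = email.encode('latin-1')
    br.submit()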

Translate special character ½

I am reading a source that contains the special character ½. How do I convert this to 1/2? The character is part of a sentence and I still need to be able to use this string "normally". I am reading webpage sources, so I'm not sure that I will always know the encoding??
Edit: I have tried looking at other answers, but they don't work for me. They always seem to start with something like:
s = u'£10'
but I already get an error there: "no encoding declared". But how do I know what encoding I'm getting the data in, or does that not matter? Do I just pick one?
This is really two questions.
#1. To interpret ½: Use the unicodedata module. You can ask for the numeric value of the character, or you can normalize it using a compatibility normalization form (NFKC) and parse the result yourself.
>>> import unicodedata
>>> unicodedata.numeric(u'½')
0.5
>>> unicodedata.normalize('NFKC', u'½')
'1⁄2'
#2. Encoding problems: If you're working with the terminal, make sure Python knows the terminal encoding. If you're writing source files, make sure Python knows the file encoding. You can't just "pick" an encoding to set for Python, you must inform Python about the encoding that your terminal / text editor already uses.
Python lets you set the encoding of files with Vim/Emacs style comments. Put a comment at the top of the file like this if you use Vim:
# coding=UTF-8
Or this, if you use Emacs:
# -*- coding: UTF-8 -*-
If you use neither Vim nor Emacs, then it doesn't matter which one. Obviously, if you don't use UTF-8 you should substitute the encoding you actually use. (UTF-8 is the only encoding I can recommend.)
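Putting the interpretation step to work, a small helper that rewrites vulgar-fraction characters such as ½ into a plain ASCII "n/m" form, while leaving the rest of the sentence alone, could look like this (Python 3 shown; the helper name is mine):
import unicodedata

def expand_fractions(text):
    # Replace vulgar-fraction code points (½, ¾, ...) with an ASCII "n/m" string.
    out = []
    for ch in text:
        if unicodedata.decomposition(ch).startswith('<fraction>'):
            # NFKC turns ½ into '1⁄2' (U+2044 FRACTION SLASH); swap in a plain '/'
            out.append(unicodedata.normalize('NFKC', ch).replace('\u2044', '/'))
        else:
            out.append(ch)
    return ''.join(out)

expand_fractions('Add ½ cup of sugar')   # 'Add 1/2 cup of sugar'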
Dietrich beat me to the punch, but here is some more detail about setting the encoding for your source file:
Because you want to search for a literal unicode ½, you need to be able to write it in your source file. Unfortunately, the Python interpreter chokes on any non-ASCII characters in the source unless you specify the encoding of that source file with a comment in the first couple of lines, like so:
# coding=utf8
# ... do stuff here ...
This assumes your editor is saving the file as UTF-8. If it's using a different encoding, specify that instead. See PEP-0263 for more details.
Once you've specified the encoding, you should be able to write something like this in your code:
text = text.replace(u'½', u'1/2')
Encoding of the webpage
Depending on how you are downloading the page, you probably don't need to worry about this at all; most HTTP libraries handle choosing the encoding for you automatically.
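For instance, with a library like requests (not mentioned in the question, just one possible choice) the decoding is done for you from the response headers:
import requests

resp = requests.get('http://example.com/')  # placeholder URL
resp.encoding                               # the charset requests picked from the headers
text = resp.text                            # already decoded to a unicode string for you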
Did you try using the codecs module to read your file?
import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
A good reference for the whole topic is the Python Unicode HOWTO: http://docs.python.org/howto/unicode
