Save file with any language using unicode - python

I'm creating a simple script that takes a list of images as input and outputs a PDF file, using the ReportLab PDF-generation module. The script takes the filename as shown below:
from reportlab.pdfgen import canvas
filename = raw_input("Enter pdf filename: ")
c = canvas.Canvas(filename + ".pdf")
c.save()
Everything works until the user enters a non-English filename (Hebrew, Arabic), which causes the code to throw the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf9 in position 0: invalid start byte
So, I decided to use unicode instead, but when I use unicode() it throws me another exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf9 in position 0: ordinal not in range(128)
However, when I decode the string with a specific encoding, it works like a charm (Hebrew example):
from reportlab.pdfgen import canvas
filename = raw_input("Enter pdf filename: ")
filename = filename.decode("windows-1255")
c = canvas.Canvas(filename + ".pdf")
c.save()
I kept trying other methods, and found that if I put u before the string literal, as in the example below, it works in any language:
from reportlab.pdfgen import canvas
filename = u"أ" #arabic
c = canvas.Canvas(filename + ".pdf")
c.save()
The problem is that I don't know which encoding I should use. The input string could be in any language. What can I do to fix it, or in other words: how can I add u before the string without specifying the encoding?
PS: If you have a better title, please let me know below.
Edit: The filename is actually provided by a website (I use urllib). I didn't think it mattered, and I used raw_input() to make the problem clearer. Sorry for that.

raw_input() strings are encoded by the terminal or console, so you'd ask the terminal or console for the right codec to use.
Python has already done this at startup time, and stored the codec in sys.stdin.encoding:
import sys
filename = raw_input("Enter pdf filename: ")
filename = filename.decode(sys.stdin.encoding)
From the comments you indicated that the filename is not actually sourced from raw_input(). For different sources, you'll need to use different techniques to detect the character set used.
For example, HTTP responses may include a charset parameter in the Content-Type header; a urllib or urllib2 response lets you extract that with:
encoding = response.info().getparam('charset')
This can still return None, at which point it depends on the exact mimetype returned. The default for text/ mimetypes (such as HTML) is Latin-1, but the HTML standard also allows <meta> tags in the document itself to tell you the character set used. For HTML, I'd use BeautifulSoup to parse the response; it'll detect the character set for you.
Without more information on how you actually load the filename from a URL, however, I cannot say anything more specific.
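As a rough illustration of the charset lookup described above, here is a minimal Python 3 sketch. The helper name charset_from_content_type and the sample bytes are assumptions made for this example and are not part of the original code. In Python 3 a urllib response exposes the same lookup as response.headers.get_content_charset(); the sketch parses a header string directly so it runs without a network connection, and it falls back to Latin-1 when no charset parameter is present.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header is missing or carries no
    charset parameter (hypothetical helper for illustration only).
    """
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


# Bytes as a server might send them: a Hebrew word encoded as windows-1255.
raw = b"\xf9\xec\xe5\xed"
encoding = charset_from_content_type("text/html; charset=windows-1255")
name = raw.decode(encoding)

# No charset parameter: the fallback is used instead.
fallback = charset_from_content_type("text/html")
```

Once the bytes are decoded to a text string, the result can be used as a filename without guessing the language.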

OK, I got the solution! Once I got the text from the server, I parsed it using BeautifulSoup (thank you @Martijn Pieters!), which includes a charset detection library:
import urllib2
from bs4 import BeautifulSoup
resp = urllib2.urlopen("http://example.com").read()
soup = BeautifulSoup(resp)
string = soup.find_all("span")[0].text
And then I just used string as the file name:
c = canvas.Canvas(path + "/" + string + ".pdf")
Full credit goes to @Martijn Pieters, who recommended that I use BS.
This is not the first HTML-parsing script I have written, and I always used regex before. I highly recommend that anyone use BeautifulSoup instead; trust me, it's much better than regex.

Related

Codec error while reading a file in python - 'charmap' codec can't decode byte 0x81 in position 3124: character maps to <undefined>

I am working on a Machine Learning Project which filters spam/phishing emails out of all emails. For this, I am using the SpamAssassin dataset. The dataset contains different mails in this format:
To identify phishing emails, the first thing I have to do is find out how many web links each email has. To do that, I have written the following code:
import csv
import os
import re
wordsInLine = []
tempWord = []
urlList = []
base_dir = "C:/Users/keert/Downloads/Spam_Assassin/spam"
def count():
    flag = 0
    print("Reading all file names in sorted order")
    for filename in sorted(os.listdir(base_dir)):
        # Opened with the platform default encoding; this is where the error occurs
        file = open(os.path.join(base_dir, filename))
        count1 = 0
        for line in file:
            wordsInLine = line.split(' ')
            for word in wordsInLine:
                if re.search('href="http', word, re.I):
                    count1 = count1 + 1
        file.close()
        urlList.append(count1)
        if flag != 0:
            print("File Name = " + filename)
            print("Number of links = ", count1)
        flag = flag + 1
count()
final = urlList[1:]
print("List of number of links in each email")
print(final)
# In Python 3 the csv module needs a text-mode file opened with newline=''
with open('count_links.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for val in final:
        wr.writerow([val])
print("CSV file generated")
But this code gives me an error saying: 'charmap' codec can't decode byte 0x81 in position 3124: character maps to <undefined>
I have even tried opening the file with the encoding='utf8' option. But the problem remains, and I got an error like: 'utf-8' codec can't decode byte 0x81 in position 3124: character maps to
I guess this is due to the special characters in the file. Is there any way to deal with this? I cannot skip the special characters, as they are also important. Please suggest a way of doing this. Thank you in advance.
You have to open and read the file using the same encoding that was used to write the file. In this case, that might be a bit difficult, since you are dealing with e-mails and they can be in any encoding, dependent on the sender. In the example file you showed, the message is encoded using 'iso-8859-1' encoding.
However, e-mails are a bit strange, since they consist of a header (which is in ASCII format as far as I know), followed by an empty line and the body. The body is encoded in the encoding that was specified in the header. So two different encodings could be used in the same file!
If you're sure that all the e-mails use iso-8859-1 encoding and you're looking for a quick-and-dirty solution, then you could also just open the file using 'iso-8859-1' encoding, since e-mail headers are compatible with iso-8859-1. However, be prepared that you will have to deal with other e-mail formatting/encoding/escaping issues as well, or your script might not work completely as expected.
I think the best solution would be to look for a Python module that can handle e-mails, so it will deal with all the decoding stuff and you don't have to worry about that. It will also solve other problems such as escape characters and line breaks.
I don't have experience with this myself, but Python has built-in support for parsing e-mails in the email package. I recommend taking a look at that.
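A minimal sketch of that suggestion, assuming Python 3 and the standard email package. The function name count_links and the sample message are made up for this illustration; the Latin-1 fallback is likewise an assumption, not something the dataset guarantees. The idea is to read each file as bytes, let the parser split headers from body, and decode each text part with the charset it declares.

```python
import re
from email import message_from_bytes


def count_links(raw_email):
    """Count href="http..." occurrences in the text parts of one e-mail."""
    msg = message_from_bytes(raw_email)
    total = 0
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "iso-8859-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # The declared charset is unknown to Python; fall back to Latin-1.
            text = payload.decode("iso-8859-1")
        total += len(re.findall(r'href="http', text, re.I))
    return total


# Hypothetical sample message with a Latin-1 encoded body.
sample = (
    b"From: sender@example.com\r\n"
    b"Content-Type: text/html; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b'<a href="http://example.com/">caf\xe9</a> '
    b'<a HREF="https://example.org/">more</a>\r\n'
)
link_count = count_links(sample)
```

In the original script, each file would be opened with open(path, 'rb') and its contents passed to such a function, so no decoding happens before the parser has seen the headers.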

'utf-8' codec can't decode byte - Python

My Django application works with both .txt and .doc file types. It opens a file, compares it with other files in the database and prints out a report.
The problem is that when the file type is .txt, I get a 'utf-8' codec can't decode byte error (here I'm using encoding='utf-8'). When I switch encoding='utf-8' to encoding='ISO-8859-1', the error changes to 'latin-1' codec can't decode byte.
I want to find an encoding that works with every type of file. This is a small part of my function:
views.py:
@login_required(login_url='sign_in')
def result(request):
    last_uploaded = OriginalDocument.objects.latest('id')
    original = open(str(last_uploaded.document), 'r', encoding='utf-8')
    original_words = original.read().lower().split()
    words_count = len(original_words)
    open_original = open(str(last_uploaded.document), "r")
    read_original = open_original.read()
    report_fives = open("static/report_documents/" + str(last_uploaded.student_name) +
                        "-" + str(last_uploaded.document_title) + "-5.txt", 'w')
    # Path to the documents with which original doc is comparing
    path = 'static/other_documents/doc*.txt'
    files = glob.glob(path)
    rows, found_count, fives_count, rounded_percentage_five, percentage_for_chart_five, fives_for_report, founded_docs_for_report = search_by_five(last_uploaded, 5, original_words, report_fives, files)
    context = {
        ...
    }
    return render(request, 'result.html', context)
There is no universal encoding that can decode a file regardless of the encoding it was written in.
UTF-8 is a good default, since it is compatible with ASCII and very widely used. You can, for example, simply ignore or replace characters which aren't decodable, like this:
from codecs import open
original = open(str(last_uploaded.document), encoding="utf-8", errors="ignore")
original_words = original.read().lower().split()
...
original.close()
Or, better, use a context manager (with statement), which closes the file for you:
with open(str(last_uploaded.document), encoding="utf-8", errors="ignore") as fr:
    original_words = fr.read().lower().split()
    ...
(Note: You do not need to use the codecs library if you're using Python 3, but you have tagged your question with python-2.7.)
You can see advantages and disadvantages of using different error handlers here and here. Be aware that not specifying an error handler defaults to errors="strict", which you probably do not want. The other options are nearly self-explanatory, e.g.:
using errors="replace" will replace an undecodable character with a suitable replacement marker
using errors="ignore" will simply skip the character and continue reading the file data.
What you should use depends on your needs and usecase(s).
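To make the difference between the handlers concrete, here is a small sketch that decodes the same bytes three ways. The sample bytes are invented for this illustration: they are Latin-1 encoded text, which is not valid UTF-8.

```python
# "café au lait" encoded as Latin-1; the byte 0xe9 is not valid UTF-8 here.
data = b"caf\xe9 au lait"

# errors="strict" (the default) raises on the first undecodable byte.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD, the official replacement character.
replaced = data.decode("utf-8", errors="replace")

# errors="ignore" silently drops the undecodable byte.
ignored = data.decode("utf-8", errors="ignore")

print(strict_failed, replaced, ignored)
```

Both lenient handlers lose information (the é is gone either way), which is why finding the real encoding is preferable when the exact text matters.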
You say that you have encoding problems not only with plain text files, but also with proprietary .doc files:
The .doc format is not a plain text format that you can simply read with open() or codecs.open(), since much of the information is stored in binary form; see this site for more information. So you need a special reader for .doc files to get the text out of them. Which library you use depends on your Python version and maybe also on your operating system. Maybe here is a good starting point for you.
Unfortunately, using a library does not completely protect you from encoding errors. (Maybe it does, but I'm not sure whether the encoding is saved in the file itself, as it is in a .docx file.) You may also have the chance to figure out the encoding of the file. How you can handle encoding errors likely depends on the library itself.
So my guess is that you are trying to open .doc files as simple text files. Then you will get decoding errors, because the content is not saved as human-readable text. And even if you get rid of the error, you will only see non-human-readable text (I created a simple text file with LibreOffice in .doc format, Microsoft Word 1997-2003):
In [1]: open("./test.doc", "r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
In [2]: open("./test.doc", "r", errors="replace").read() # or open("./test.doc", "rb").read()
'��\x11\u0871\x1a�\x00\x00\x00' ...

Change encoding for locally stored .html files downloaded with urllib.request.urlretrieve()

I used the following python code to save an html file to local storage:
url = "some_url.html"
urllib.request.urlretrieve(url, 'save/to/path')
This successfully saves the file with a .html extension. When I attempt to open the file with:
html_doc = open('save/to/path/some_url.html', 'r')
I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 36255: ordinal not in range(128)
I think this means I am attempting to read a UTF-8 file with an ASCII codec. I attempted the solution found at:
Convert Unicode to ASCII without errors in Python
But this, as well as other solutions I have found, only seem to work for encoding the file for immediate viewing and not saved files. I cannot find one that works for altering the encoding of a locally stored file.
The open() function has an optional encoding parameter.
Its default is platform dependent, and in your case it apparently defaults to ASCII (the error message names the 'ascii' codec).
If you know the correct codec (e.g. from an HTTP header), you can specify it:
html_doc = open('path/to/file.html', 'r', encoding='cp1252')
If you don't know it, chances are that it is written in the file.
You can open the file in binary mode:
html_doc = open('path/to/file.html', 'rb')
and then try to find an encoding declaration and decode the whole thing in memory.
However, don't do that.
There's not much use in opening and processing HTML like a text file.
You should use an HTML parser to walk through the document tree and extract whatever you need.
Python's standard library has one, but you might find Beautiful Soup easier to use.
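A minimal sketch of the parser approach using only the standard library. The class TitleParser, the helper sniff_charset and the sample document are assumptions made for this example. Note that html.parser works on text, so the bytes still have to be decoded first (here with a deliberately simple lookup of a meta charset declaration), whereas Beautiful Soup accepts the raw bytes and detects the encoding itself.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks)


def sniff_charset(raw, default="utf-8"):
    """Look for a charset declaration near the start of the document."""
    match = re.search(rb'''<meta[^>]+charset=["']?([\w-]+)''', raw[:2048], re.I)
    return match.group(1).decode("ascii") if match else default


# Stand-in for the bytes read from disk with open(path, 'rb').read().
raw = ('<html><head><meta charset="cp1252">'
       "<title>Caf\u00e9 menu</title></head><body></body></html>").encode("cp1252")

parser = TitleParser()
parser.feed(raw.decode(sniff_charset(raw), errors="replace"))
parser.close()
```

After this runs, parser.title holds the decoded title text; the same pattern extends to links or any other element you need to extract.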

How to save the output of airport -s -x to file with Python

I am learning Python, and I am having trouble saving the output of a small function to a file. My Python function is the following:
#!/usr/local/bin/python
import subprocess
import codecs
airport = '/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport'
def getAirportInfo():
    arguments = [airport, "--scan" , "--xml"]
    execute = subprocess.Popen(arguments, stdout=subprocess.PIPE)
    out, err = execute.communicate()
    print out
    return out
airportInfo = getAirportInfo()
outFile = codecs.open('wifi-data.txt', 'w')
outFile.write(airportInfo)
outFile.close()
I guess that this would only work on a Mac, as it references some PrivateFrameworks.
Anyway, the code almost works as it should. The print statement prints a huge XML file that I'd like to store in a file for future processing. And here the problems start.
In the version above, the script executes without any errors. However, when I try to open the file, I get an error message along the lines of an error with UTF-8 encoding. If I ignore it, the file opens and most things look fine, except for a couple of things:
Some SSIDs have non-ASCII characters, like ä, ö and ü. When printed on the screen, they are correctly displayed as \xc3\xa4 and so on. When I open the file, they are displayed incorrectly, as the usual random garbage.
Some of the XML values look like this when printed on screen: Data("\x00\x11WLAN-0024FE056185\x01\x08\x82\x84\x8b\x96\x0c\ … x10D\x00\x01\x02") but like this when read from the file: //8AAAAAAAAAAAAAAAAAAA==
I thought it could be an encoding error (since the umlauts have problems, the error message says something about the UTF-8 encoding being messed up, and the text contains \x-style characters), and I tried looking here for possible solutions. However, no matter what I do, there are still errors:
Adding an additional argument 'utf-8' to codecs.open yields a
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 24227: ordinal not in range(128) and the generated file is empty.
Explicitly encoding to UTF-8 with outFile.write(airportInfo.encode('utf-8')) before saving results in the same error.
Not being an expert, I tried decoding it, in case I was doing the exact opposite of what needed to be done, but I got a UnicodeDecodeError: 'utf8' codec can't decode byte 0x8a in position 8980: invalid start byte
The only thing that worked (unsurprisingly) was to write the repr() of the string to the file, but that is not what I need, and I can't make a nice .plist out of a file full of escape symbols.
So please, can somebody help me? What am I missing?
If it helps, the type that gets saved in airportInfo is str (as in type(airportInfo) == str) and not u
You don't need re-encoding when your text is already unicode. Just write the text to a file. It should just work.
In [1]: t = 'äïöú'
In [2]: with open('test.txt', 'w') as f:
   ...:     f.write(t)
   ...:
Additionally, you can make getAirportInfo simpler by using subprocess.check_output(). Also, mixed-case names should only be used for classes, not functions. See PEP 8.
import subprocess
def get_airport_info():
    args = ['/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport',
            '--scan', '--xml']
    return subprocess.check_output(args)
airport_info = get_airport_info()
with open('wifi-data.txt', 'w') as outf:
    outf.write(airport_info)
I should have read this before my original answer:
What is the difference between encode/decode?
I always get confused between string and unicode conversion. On my Mac, import sys; sys.getfilesystemencoding() suggests that subprocess returns a 'utf-8' string, so I don't know why airportInfo.encode('utf-8') fails. Is it possible to do airportInfo.encode('utf-8', 'ignore') and throw out the invalid characters?
Also, have you tried writing your file as wb: outFile = codecs.open('wifi-data.txt', 'wb')? Doesn't 'w' open an ASCII file?
Regarding your text editor: it may handle unicode characters differently. If it's reading a unicode text file as ASCII, the unicode characters may appear as a garbled mess. You might try naming the file .xml, in which case your text editor, depending on which one it is, may read it correctly as unicode.
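The binary-mode suggestion can be sketched as follows, assuming Python 3. The helper name save_command_output is hypothetical, and a small child Python process stands in for the airport tool so the example runs anywhere. The point it illustrates: the output of a subprocess is bytes, and writing those bytes in 'wb' mode stores them unchanged, with no encode or decode step that could fail on bytes such as 0x9a.

```python
import os
import subprocess
import sys
import tempfile


def save_command_output(args, path):
    """Run a command and write its raw stdout bytes to `path` unchanged."""
    out = subprocess.check_output(args)  # bytes, not text
    with open(path, "wb") as f:          # binary mode: no codec involved
        f.write(out)
    return out


# Stand-in command that emits UTF-8 text plus a byte that is not valid UTF-8.
args = [sys.executable, "-c",
        r"import sys; sys.stdout.buffer.write(b'SSID \xc3\xa4 \x9a')"]

path = os.path.join(tempfile.mkdtemp(), "wifi-data.txt")
written = save_command_output(args, path)

with open(path, "rb") as f:
    read_back = f.read()
```

Whether the saved file then opens cleanly in a text editor is a separate question, since the editor still has to guess how to interpret the non-text bytes.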

UnicodeDammit: detwingle crashes on a website

I’m scraping websites and use BeautifulSoup4 to parse them. As the websites can have really random character sets, I use UnicodeDammit.detwingle to ensure that I feed proper data to BeautifulSoup. It worked fine... until it crashed. One website causes the code to break. The code to build "soup" looks like this:
u = bs.UnicodeDammit.detwingle( html_blob )  # <--- here it crashes
u = bs.UnicodeDammit( u.decode('utf-8'),
                      smart_quotes_to='html',
                      is_html = True )
u = u.unicode_markup
soup = bs.BeautifulSoup( u )
And the error (standard Python-Unicode hell duo)
  File ".../something.py", line 92, in load_bs_from_html_blob
    u = bs.UnicodeDammit.detwingle( html_blob )
  File ".../beautifulsoup4-4.1.3-py2.7.egg/bs4/dammit.py", line 802, in detwingle
    return b''.join(byte_chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
The offending website is this one
Question: How to make a proper and bulletproof website source decoding?
This website is not a special case in terms of character encoding at all, it's perfectly valid utf-8 with even the http header set correctly. It then follows that your code would have crashed on any website encoded in utf-8 with code points beyond ASCII.
I first read the documentation as saying that UnicodeDammit.detwingle takes a unicode string, whereas the name html_blob suggests that it is not a decoded unicode string. (That was a misunderstanding on my part.)
Handling arbitrary website encodings is not trivial when the HTTP header or the markup lies about the encoding or does not declare it at all. You need to apply various heuristics, and even then you won't always get it right. But this website sends the charset header correctly and has been encoded correctly in that charset.
Interesting trivia: the only non-ASCII text on the website is in these JavaScript comments (after being decoded as UTF-8):
image = new Array(4); //¶¨ÒåimageΪͼƬÊýÁ¿µÄÊý×é
image[0] = 'sample_BG_image01.png' //±³¾°Í¼ÏóµÄ·¾¶
If you then encode those to ISO-8859-1, and decode the result as GB2312, you get:
image = new Array(4); //定义image为图片数量的数组
image[0] = 'sample_BG_image01.png' //背景图象的路径
Which Google Translate (Chinese -> English) renders as:
image = new Array(4); //Defined image of the array of the number of images
image[0] = 'sample_BG_image01.png' //The path of the background image
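The encode-then-decode repair described above can be sketched in Python 3 as follows. The variable names are invented for this illustration, and the example first creates the mojibake itself from one of the comments so that the round trip is self-contained.

```python
# One of the comments as it was meant to read.
original = "背景图象的路径"

# Simulate the damage: GB2312 bytes wrongly decoded as ISO-8859-1.
mojibake = original.encode("gb2312").decode("iso-8859-1")

# Repair: encode back to the same bytes, then decode with the right codec.
repaired = mojibake.encode("iso-8859-1").decode("gb2312")

print(repaired)
```

This only works because ISO-8859-1 maps every byte to a character, so the wrong decoding loses no information; with a codec that rejects or merges bytes, the original could not be recovered this way.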
