Python: Unable to write slanted apostrophe to file

I'm using Python 2.7 and unable to upgrade to 3.x just yet.
I need to read data from the database and write to a file.
Using the database query tool, I see the string I need to retrieve contains the following slanted apostrophe:
I’ve
When I read the string from the database in Python and simply print it to the console, I see the following. The slanted apostrophe has been converted to a normal apostrophe, which is actually my preferred behavior:
print(message)
I've
But when I try to write the string to a file, it instead writes the apostrophe as a question mark:
I?ve
My original code just does this:
file = open(path, "w")
file.write(message)
file.close()
To try to fix it, I did the following, but it did not help. The question mark is still showing up:
# -*- coding: utf-8 -*-
import codecs
file = codecs.open(path, "w", "utf-8")
file.write(message)
file.close()
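A likely culprit, assuming the database driver is handing back byte strings rather than unicode: codecs.open() expects unicode input, so writing an already-encoded (or already-mangled) byte string will not fix anything. Checking repr(message) right after the fetch shows whether the '?' is introduced by the driver or by the file write. A minimal Python 2 sketch, assuming the driver returns UTF-8 bytes:
import io
# 'utf-8' is an assumption; match your connection's character set.
if isinstance(message, str):
    message = message.decode('utf-8')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(message)  # io.open's text mode requires unicode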

Related

Insert large csv to MySQL, ignore lines with unknown characters

I have a large .csv that I'm trying to import into a MySQL database for a Django project. I'm using the django.db library to write raw sql statements such as:
LOAD DATA LOCAL INFILE 'file.csv'...
However, I keep getting the following error:
django.db.utils.OperationalError: (1300, "Invalid utf8 character string: 'Hey! Are you out tonight?'")
After grepping the .csv for the line, I realised that the error is being caused by this character: 😜; though I'm sure there will be other characters throwing that error after I fix this.
Running:
$ file --mime file.csv
from a terminal, returns:
file.csv: text/html; charset=us-ascii
Since the rest of my db is in UTF-8, I tried writing a python script to re-encode it, using .encode('utf-8', 'ignore') hoping that the 'ignore' would remove any symbols that gave it trouble, but it threw:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 825410: invalid continuation byte
The thing is, I don't actually care about inserting 100% of the file into my db. I would rather just insert only the 'safe' lines that don't contain strange characters.
So ideally, I'm looking for a way to modify my LOAD DATA LOCAL INFILE sql statement so it just skips inserting any lines that give it trouble. This is optimal, since I don't want to spend time preprocessing the data.
If that isn't feasible, the next best thing is to remove any troublesome character/lines with a Python script that I could later run from my django app whenever I update my db.
If all else fails, information on how to grep out any characters that aren't UTF-8 friendly that I could write a shell script around would be useful.
For 😜, MySQL must use CHARACTER SET utf8mb4 on the column where you will be storing it, on the LOAD DATA statement, and on the connection.
More Python notes: http://mysql.rjweb.org/doc.php/charcoll#python
E9 does not make sense. The hex for the UTF-8 encoding for 😜 is F09F989C.
The link on converting between character sets is irrelevant; only UTF-8 can be used for emoji.
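Not part of the original answer, but a rough sketch of what "utf8mb4 on the connection" can look like from Python, using the PyMySQL driver (host, credentials, and table name are placeholders):
import pymysql
# charset='utf8mb4' makes the connection emoji-safe; local_infile must
# be enabled for LOAD DATA LOCAL INFILE to be accepted.
conn = pymysql.connect(host='localhost', user='user', password='secret',
                       db='mydb', charset='utf8mb4', local_infile=True)
cur = conn.cursor()
cur.execute("LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE mytable "
            "CHARACTER SET utf8mb4 "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'")
conn.commit()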
Not 100% sure if this will help but this is what I'd try:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
That's an example gathered from the official docs. Keep in mind that you might need to replace utf-8 with the actual file encoding, as the docs say. Then you can either continue using Python to push your data into the DB, or write a new file with a new encoding.
Alternatively, filtering out the bad lines in Python first could be another approach.
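A rough sketch of that filter, under the assumption that "safe" means "decodes cleanly as UTF-8" (file names are placeholders):
with open('file.csv', 'rb') as src, open('file_clean.csv', 'wb') as dst:
    for raw_line in src:
        try:
            raw_line.decode('utf-8')  # validity check only
        except UnicodeDecodeError:
            continue                  # drop lines with bad bytes
        dst.write(raw_line)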

python openpyxl unicode error

I have a problem with writing into xlsx file. I am getting the openpyxl.utils.exceptions.IllegalCharacterError error. On the top of my code there is written:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Also the line that raises the error is:
ws[name_line] = z.text.encode('utf-8').strip()
So now I really don't know what to do.
The IllegalCharacterError exception is raised when you try to assign a special character to a cell.
You don't have to encode strings yourself; openpyxl always encodes to utf-8.
Change your code, for instance to:
ws['A1'] = z.text.strip()
Come back and flag your question as answered if this works for you, or comment why not.
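If the text genuinely contains characters that the xlsx format forbids (mostly control characters), here is a hedged sketch for stripping them first, reusing the question's ws, name_line and z. The regex is an assumption that mirrors the XML-illegal control range; recent openpyxl versions ship a similar pattern as openpyxl.cell.cell.ILLEGAL_CHARACTERS_RE:
import re
# Control characters other than \t, \n and \r are not allowed in cells.
ILLEGAL_RE = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')
ws[name_line] = ILLEGAL_RE.sub(u'', z.text).strip()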

Getting the correct encoding for strings and csv-files in Python

I'm using mechanize in Python to grab some data from a website and send it new data.
The thing is that the site is in French, so I get question marks in a diamond shape (�) instead of various characters such as éÉÀàùÙîû and others.
I tried looking around on Google and StackOverflow and found various answers that didn't fix my problem. I've seen answers recommending trying one of the following lines:
myString = 'éÀî'
myString.encode('latin-1')
myString.encode('iso-8859-1')
unicode(myString, 'iso-8859-1')
but none of those seem to work.
The two cases where I need this are when I read a csv file with accents and with hardcoded strings containing accents. For instance, here's what a line in the csv file looks like (actually ';' is the separator):
Adam Guérin;myemail@mail.com;555-5555;2011-02-05
The 'é' looks fine, but when I try to fill a textField on the website with mechanize and submit it, the 'é' now looks like '�' on the actual website.
Edit:
This is my code for reading the data in the csv file:
subscriberReader = csv.reader(open(path, 'rb'), delimiter=';')
subscribers = []
for row in subscriberReader:
    subscribers.append(Subscriber(row[0], row[1], row[2]))
Then I send it to the website using mechanize:
self.br.select_form('aspnetForm')
self.br.form['fldEmail'] = subscriber.email
self.br.form['fldName'] = subscriber.name
self.br.form['fldPhoneNum'] = subscriber.phoneNum
self.br.submit()
I tried various ways to encode the characters, but I guess I'm not doing it correctly. I'll be glad to try anything that gets suggested in the answers / comments.
As for the website, it doesn't specify which encoding it is using in the header.
First, you mentioned that you want to place literals into your code. To do so, you need to tell Python what encoding your script file has. You do this with a comment declaration at the beginning of the file (I'll assume that you're using latin-1).
# -*- coding: latin-1 -*-
myString = u'éÀî'
Second, you need to be able to work with the string. This isn't mechanize-specific, but covering a few basics should be useful: first, myString ends up being a unicode object (because of the way the literal was declared, with the u''). So, to use it as a Latin-1 encoding, you'll need to call .encode(), for example:
with open('test.txt', 'w') as f:
    f.write(myString.encode('latin-1'))
And finally, when reading in a string that is encoded (say, from the remote web site), you can use .decode() to decode it into a unicode object, and work with it from there.
with open('test.txt', 'r') as f:
    myString = f.read().decode('latin-1')
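Tying that back to the CSV case: Python 2's csv module yields byte strings, so a sketch like the following (assuming the file is Latin-1) keeps unicode inside the program and encodes only at the boundary where the site needs bytes:
import csv
with open('subscribers.csv', 'rb') as f:
    for row in csv.reader(f, delimiter=';'):
        name = row[0].decode('latin-1')  # bytes -> unicode
        # later, when filling the form (the site's charset is a guess):
        # br.form['fldName'] = name.encode('latin-1')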

Python 2.7: Setting I/O Encoding, â€™?

Attempting to write a line to a text file in Python 2.7, and have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap')  # note the strangely shaped apostrophe
However, in output.txt, I get Smithâ€™s BaseBall Cap instead. Not sure how to correct this encoding problem? Any protips with this sort of issue?
You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
In Mac OS Roman, those three bytes display as ‚Äô.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string with u'Smith’s BaseBall Cap'.
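A minimal Python 2 sketch combining both suggestions, the codecs file and the unicode literal:
# -*- coding: utf-8 -*-
import codecs
f = codecs.open('output.txt', 'w', 'utf-8')
f.write(u'Smith’s BaseBall Cap')  # codecs handles the encoding
f.close()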

Translate special character ½

I am reading a source that contains the special character ½. How do I convert this to 1/2? The character is part of a sentence and I still need to be able to use this string "normally". I am reading webpage sources, so I'm not sure that I will always know the encoding??
Edit: I have tried looking at other answers, but they don't work for me. They always seem to start with something like:
s= u'£10"
but I get an error already there: "no encoding declared". But do I know what encoding I'm getting in, or does that not matter? Do I just pick one?
This is really two questions.
#1. To interpret ½: use the unicodedata module. You can ask for the numeric value of the character, or you can normalize it using a compatibility normalization form and parse the result yourself.
>>> import unicodedata
>>> unicodedata.numeric(u'½')
0.5
>>> unicodedata.normalize('NFKC', u'½')
u'1\u20442'
Note that the normalized form uses U+2044 FRACTION SLASH rather than the ASCII slash, so if you want a plain 1/2 you may still need a manual replacement.
#2. Encoding problems: If you're working with the terminal, make sure Python knows the terminal encoding. If you're writing source files, make sure Python knows the file encoding. You can't just "pick" an encoding to set for Python, you must inform Python about the encoding that your terminal / text editor already uses.
Python lets you set the encoding of files with Vim/Emacs style comments. Put a comment at the top of the file like this if you use Vim:
# coding=UTF-8
Or this, if you use Emacs:
# -*- coding: UTF-8 -*-
If you use neither Vim nor Emacs, then it doesn't matter which one. Obviously, if you don't use UTF-8 you should substitute the encoding you actually use. (UTF-8 is the only encoding I can recommend.)
Dietrich beat me to the punch, but here is some more detail about setting the encoding for your source file:
Because you want to search for a literal unicode ½, you need to be able to write it in your source file. Unfortunately, the Python 2 interpreter chokes on any non-ASCII input unless you specify the encoding of that source file with a comment in the first couple of lines, like so:
# coding=utf8
# ... do stuff here ...
This assumes your editor is saving the file as UTF-8. If it's using a different encoding specify that instead. See PEP-0263 for more details.
Once you've specified the encoding you should be able to write something this in your code:
text = text.replace(u'½', u'1/2')
Encoding of the webpage
Depending on how you are downloading the page, you probably don't need to worry about this at all, most HTTP libraries handle choosing the encoding for you automatically.
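If yours doesn't, here is a sketch of reading the declared charset yourself with plain urllib2 (Python 2; the utf-8 fallback is an assumption):
import urllib2
resp = urllib2.urlopen('http://example.com/page')
charset = resp.headers.getparam('charset') or 'utf-8'
text = resp.read().decode(charset)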
Did you try using codecs to read your file?
import codecs
fileObj = codecs.open("someFile", "r", "utf-8")
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
You can check the whole guide in the official docs. Also a good ref: http://docs.python.org/howto/unicode
