Python - writing unicode strings to a file & Beautiful Soup

I'm using BeautifulSoup to parse some XML files. One of the fields in this file frequently uses Unicode characters. I've tried unsuccessfully to write the unicode to a file using encode.
The process so far is basically:
Get the name
gamename = items.find('name').string.strip()
Then incorporate the name into a list which is later converted into a string:
stringtoprint = userid, gamename.encode('utf-8')
newstring = "INSERT INTO collections VALUES " + str(stringtoprint) + ";" +"\n"
Then write that string to a file.
listofgamesowned.write(newstring.encode("UTF-8"))
It seems that I shouldn't have to .encode quite so often. I had tried encoding directly upon parsing out the name, e.g. gamename = items.find('name').string.strip().encode('utf-8'); however, that did not seem to work.
Currently - 'Uudet L\xc3\xb6yt\xc3\xb6retket'
is being printed and saved rather than Uudet Löytöretket.
It seems if this were a string I was generating then I'd use something.write(u'Uudet L\xc3\xb6yt\xc3\xb6retket'); however, it's one element embedded in a string.

A unicode object is the in-memory representation of text. When you write it out or read it back in, you need to encode and decode.
Uudet L\xc3\xb6yt\xc3\xb6retket is the utf-8 encoded version of Uudet Löytöretket, so it is what you want to write out. When you want to read a string back from a file you need to decode it.
>>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'
Uudet Löytöretket
>>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'.decode('utf-8')
Uudet Löytöretket
Just remember to encode immediately before you output and decode immediately after you read it back.
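A minimal sketch of that rule, using io.open (which behaves the same on Python 2 and 3 and does the encoding and decoding at the file boundary for you); the file name and SQL line are just illustrative:

```python
import io

name = u"Uudet Löytöretket"  # keep the text as unicode in memory

# Encode only at the boundary: io.open encodes on write...
with io.open("collections.sql", "w", encoding="utf-8") as f:
    f.write(u"INSERT INTO collections VALUES (1, '%s');\n" % name)

# ...and decodes on read, so you get unicode text back.
with io.open("collections.sql", "r", encoding="utf-8") as f:
    line = f.read()

assert name in line
```

Keeping the value as unicode until the write means exactly one encode happens, instead of scattering .encode calls through the code.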


How to use encoded text as a string

I want to write an encoded text to a file using Python 3.6, the issue is that I want to write it as a string and not as bytes.
text = open(file, 'r').read()
enc = text.encode(encoding) # for example: "utf-32"
f = open(new_file, 'w')
f.write(str(enc)[2:-1])
f.close()
The problem is, I still get the file content as bytes (e.g. the '\n' remains a literal backslash-n instead of becoming a new line).
I also tried to use:
enc.decode(encoding)
but it's just returning me back the old text I had in the first place.
any ideas how can I improve this piece of code?
Thanks.
The problem you have here is that you encode into a utf-32 bytes object and then cast it back into a string object without specifying an encoding. Called with a single bytes argument, str() doesn't decode at all; it just returns the printable repr of the bytes (the b'...' text you're seeing). If you pass the same encoding to str, as in str(enc, encoding), then it will decode properly.
Better yet, don't call str at all when writing out - if you already have a bytes object, open the file in binary mode ('wb') and write the bytes directly.
This concept generally trips up a lot of people. I suggest reading the explanation here to help wrap your head around how and why we do the string/bytes conversions. A good rule of thumb: keep everything as str inside your program, decode bytes to str as data comes in, and encode str to bytes as it goes out.
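Both points can be sketched in a few lines; the utf-32 codec and file name simply follow the question's example:

```python
text = "héllo\nwörld"
enc = text.encode("utf-32")            # str -> bytes at the boundary

# str(enc) would give the printable b'...' repr, not the text back.
# Decoding with the same codec does:
assert str(enc, "utf-32") == text
assert enc.decode("utf-32") == text    # equivalent spelling

# If the goal is a file, skip str() entirely and write bytes in binary mode:
with open("new_file.bin", "wb") as f:
    f.write(enc)

# ...and decode after reading back in; '\n' is a real newline again
with open("new_file.bin", "rb") as f:
    assert f.read().decode("utf-32") == text
```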

Removing non-ascii characters on utf-16 (Python)

I have some code I'm using to decrypt a string. The string was originally encrypted by .NET code, and I'm able to make it all work fine, yet the string coming into Python has some extra characters in it and has to be decoded as UTF-16.
Here is some code for the decryption portion. My original string that I encrypted was "test2", which is what is within the text variable in my code below.
import Crypto.Cipher.AES
import base64, sys
password = base64.b64decode('PSCIQGfoZidjEuWtJAdn1JGYzKDonk9YblI0uv96O8s=')
salt = base64.b64decode('ehjtnMiGhNhoxRuUzfBOXw==')
aes = Crypto.Cipher.AES.new(password, Crypto.Cipher.AES.MODE_CBC, salt)
text = base64.b64decode('TzQaUOYQYM/Nq9f/pY6yaw==')
print(aes.decrypt(text).decode('utf-16'))
text1 = aes.decrypt(text).decode('utf-16')
print(text1)
My issue is that when I decrypt and print the result of text, it is "test2ЄЄ" instead of the expected "test2".
If I save the same decrypted value into a variable, it gets decoded incorrectly as "틊첃陋ភ滑毾穬ヸ".
My goal is to find a way to:
strip off the non-ASCII characters from the end of the test2 value
be able to store that into a variable holding the correct string/text value
Any help or suggestions appreciated. Thanks.
In python 2, you can use str.decode, like this:
string.decode('ascii', 'ignore')
The target codec is ascii, and ignore specifies that anything that cannot be converted is to be dropped.
In Python 3, str objects are already Unicode, so you'll need to encode to bytes first and then decode back:
string.encode('ascii', 'ignore').decode()
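For example, with a made-up value mimicking the question's stray Є characters (Python 3 spelling):

```python
text1 = "test2\u0404\u0404"   # decrypted text with trailing junk

# encode with 'ignore' drops anything outside ASCII; decode restores str
clean = text1.encode("ascii", "ignore").decode()
assert clean == "test2"
```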

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a unicode string and it seems that this is not working. Could these be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>" + string[0] + "</b>" + string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: in Python 2, indices work over individual bytes when you are dealing with a raw byte string (str).
To work seamlessly with Unicode data, you first need to decode so that Python 2 knows you are dealing with Unicode, then do the string manipulation, and finally encode back to raw bytes to keep the behavior abstracted, i.e. the function takes a str and returns a str.
Ideally you should convert all incoming data from raw UTF-8 bytes to unicode objects at the very beginning of your code (assuming your source encoding is UTF-8, since that is the standard used by most applications these days) and convert back to raw bytes only at the tail end, e.g. when saving to a DB or responding to a client. Some frameworks handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = "<b>" + string[0] + "</b>" + string[index:]
    return string.encode('utf8')
This will also work for ASCII input, because UTF-8 is a superset of ASCII. You can get a better understanding of how Unicode works, in Python specifically, by reading http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is already a Unicode type, so you don't have to do any of this explicitly.
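As a quick check of the fixed function, run against the Hebrew example from the question (this also runs as-is on Python 3, where decode/encode move between bytes and str):

```python
def bold_letters(string, index):
    string = string.decode("utf8")
    string = u"<b>" + string[0] + u"</b>" + string[index:]
    return string.encode("utf8")

# bytes in, bytes out -- the indexing happens on unicode in between
result = bold_letters(u"הקדמה".encode("utf8"), 1)
assert result.decode("utf8") == u"<b>ה</b>קדמה"
```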
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character, while Unicode strings use one storage unit per character (at least for characters in the BMP, the first 65536, on Python 2):
#coding:utf8
import io
s = u"הקדמה"
t = u'<b>' + s[0] + u'</b>' + s[1:]
print(t)
with io.open('out.htm', 'w', encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as: (screenshot not reproduced here)

Writing unicode to a file in ascii in python (for example as u'\xa0EC)

I've written a simple script in python that scrapes a website for some data and saves it into a list called data. Some of the data has unicode characters, I want to write this list to a .csv file and keep the unicode characters in ascii.
When I print the list in the Python shell the unicode characters show up as, for example, u'\xa0EC', and I just want them saved exactly like that in the .csv so that they can be interpreted later back into unicode/utf-8.
I'm sure this can't be that difficult but I'm either getting the "ascii codec can't encode..." error or what I have at the moment replaces them with question marks -
f = codecs.open('data2.csv', mode='wb', encoding="ascii", errors='ignore')
writer = csv.writer(f)
writer.writerow([i.encode('ascii','replace') if type(i) is unicode else i for i in data])
f.close()
Apologies if this has been answered before; I have searched, but every other question seems to be from people wanting them converted.
You want to use the "unicode_escape" encoding.
For example:
s = "雥"
s.encode("unicode_escape")
yields the following bytes:
b'\\u96e5'
To get the ascii representation, you would want to decode the bytes with the ascii encoding as such:
s.encode("unicode_escape").decode('ascii')
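Here is one way that could fit the CSV use case from the question (the file name mirrors the question; Python 3):

```python
import csv
import io

data = [u"\xa0EC", u"雥"]

# Escape to pure-ASCII text before handing rows to csv
with io.open("data2.csv", "w", encoding="ascii", newline="") as f:
    csv.writer(f).writerow(
        [s.encode("unicode_escape").decode("ascii") for s in data])

# Later, read the escapes back and turn them into real characters again
with io.open("data2.csv", "r", encoding="ascii", newline="") as f:
    row = next(csv.reader(f))

restored = [cell.encode("ascii").decode("unicode_escape") for cell in row]
assert restored == data
```

The file on disk contains only ASCII (the cells read \xa0EC and \u96e5), yet the round trip recovers the original unicode values.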

Character Encoding, XML, Excel, python

I am reading a list of strings that were imported into an excel xml file from another software program. I am not sure what the encoding of the excel file is, but I am pretty sure its not windows-1252, because when I try to use that encoding, I wind up with a lot of errors.
The specific word that is causing me trouble right now is: "Zmysłowska, Magdalena" (notice the "l" is not a standard "l", but rather, has a slash through it).
I have tried a few things, Ill mention three of them here:
(1)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena
(2)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
Note: this is great, but I need to encode it back to utf-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with the garbled ZmysÅ‚owska, Magdalena again.
(3)
Do nothing (no normalization, no decode, no encode). It seems like the string is already utf-8 when it comes in. However, when I do nothing, the string ends up with the following output again:
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena
Is there a way for me to convert this string to utf-8?
Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string, and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8") just for future-proofing in case you ever go to Python 3, and to make the code a bit easier to read because the encode and decode are more obviously parallel, but you don't have to; the two are equivalent.)
Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which defaults to Windows-1252. So cmd tries to interpret the UTF-8 bytes as Windows-1252, and gets garbage.
There's a pretty easy way to test this. Make Python decode the UTF-8 string as if it were Windows-1252 and see if the resulting Unicode string looks like what you're seeing.
>>> print page.decode('windows-1252')
ZmysÅ‚owska, Magdalena
>>> print repr(page.decode('windows-1252'))
u'Zmys\xc5\u201aowska, Magdalena'
There are two ways around this:
Print Unicode strings and let Python take care of it.
Print strings converted to the appropriate encoding.
For option 1:
print page.decode("utf-8")  # or unicode(page, "utf-8")
For option 2, it's going to be one of the following:
print page.decode("utf-8").encode("windows-1252")
print page.decode("utf-8").encode(sys.getdefaultencoding())
Of course if you keep the intermediate Unicode string around, you don't need all those decode calls:
upage = page.decode("utf-8")
upage = unicodedata.normalize("NFKD", upage)
page = upage.encode("utf-8", "ignore")
print upage
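The Windows-1252 diagnosis above can be verified in a couple of lines (Python 3 spelling; the name is the question's example):

```python
page = u"Zmysłowska, Magdalena".encode("utf-8")   # what the program holds

# Viewing those UTF-8 bytes through a Windows-1252 lens garbles the ł
# (bytes C5 82 become the two characters Å and a low quote mark):
garbled = page.decode("windows-1252")
assert garbled == u"Zmys\xc5\u201aowska, Magdalena"

# Decoding with the codec the bytes were actually written in fixes it:
assert page.decode("utf-8") == u"Zmysłowska, Magdalena"
```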
