python 2.7 encoding decoding - python

I have a problem involving encoding/decoding.
I read text from a file and compare it with text from a database (Postgres).
The comparison is done between two lists.
From the file I get "jo\x9a" for "još", and from the database I get "jo\xc5\xa1" for the same value.
# Items present in both lists
common = [a for a in codes_from_file if a in kode_prfoksov]
# Items only in the file list
only1 = [a for a in codes_from_file if a not in kode_prfoksov]
# Items only in the database list
only2 = [a for a in kode_prfoksov if a not in codes_from_file]
How can I solve this? Which encoding should I use when comparing these two strings so that they match?
thank you

The first one seems to be windows-1250, and the second is utf-8.
>>> print 'jo\x9a'.decode('windows-1250')
još
>>> print 'jo\xc5\xa1'.decode('utf-8')
još
>>> 'jo\x9a'.decode('windows-1250') == 'jo\xc5\xa1'.decode('utf-8')
True

Your file strings seem to be Windows-1250 encoded. Your database seems to contain UTF-8 strings.
So you can either convert first all strings to unicode:
codes_from_file = [a.decode("windows-1250") for a in codes_from_file]
kode_prfoksov] = [a.decode("utf-8") for a in codes_from_file]
or if you do not want unicode strings, just convert the file string to UTF-8:
codes_from_file = [a.decode("windows-1250").encode("utf-8") for a in codes_from_file]
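For example, a minimal sketch of the unicode route, using the sample values from the question:
codes_from_file = ['jo\x9a']    # windows-1250 bytes read from the file
kode_prfoksov = ['jo\xc5\xa1']  # utf-8 bytes from the database

codes_from_file = [a.decode('windows-1250') for a in codes_from_file]
kode_prfoksov = [a.decode('utf-8') for a in kode_prfoksov]

common = [a for a in codes_from_file if a in kode_prfoksov]
print common  # [u'jo\u0161']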

Related

How to convert utf-8 characters to "normal" characters in string in python3.10?

I have raw data that looks like this:
25023,Zwerg+M%C3%BCtze,0,1,986,3780
25871,red+earth,0,1,38,8349
25931,K4m%21k4z3,90,1,1539,2530
It is saved as a .txt file: https://de205.die-staemme.de/map/player.txt
The "characters" starting with % are unicode, as far as I can tell.
I found the following table about it: https://www.i18nqa.com/debug/utf8-debug.html
Here is my code so far:
import urllib.request

urllib.request.urlretrieve(url, pfad + "player.txt")
f = open(pfad + "player.txt", "r", encoding="utf-8")
raw = f.read()
raw = raw.split("\n")
f.close()
Python does not convert the %-characters; they are read as if they were separate characters.
Is there a way to convert these characters without calling .replace like 200 times?
Thank you very much in advance for help and/or useful hints!
The % sequences are URL percent-encoding, not unicode escapes; use urllib.parse.unquote to decode the string.
>>> raw = """25023,Zwerg+M%C3%BCtze,0,1,986,3780
... 25871,red+earth,0,1,38,8349
... 25931,K4m%21k4z3,90,1,1539,2530"""
>>> import urllib.parse
>>> print(urllib.parse.unquote(raw))
25023,Zwerg+Mütze,0,1,986,3780
25871,red+earth,0,1,38,8349
25931,K4m!k4z3,90,1,1539,2530
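If the + signs in the data also stand for spaces (they usually do in form-encoded data, but that is an assumption worth checking against a few rows), urllib.parse.unquote_plus decodes both in one step:
>>> urllib.parse.unquote_plus("Zwerg+M%C3%BCtze")
'Zwerg Mütze'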

How to find null byte in a string in Python?

I'm having an issue parsing data after reading a file. I'm reading a binary file in and need to create a list of attributes from it; every piece of data in the file is terminated with a null byte. What I'm trying to do is find every instance of a null-byte-terminated attribute.
Essentially taking a string like
Health\x00experience\x00charactername\x00
and storing it in a list.
The real issue is that I need to keep the null bytes intact; I just need to be able to find each instance of a null byte and store the data that precedes it.
Python doesn't treat NUL bytes as anything special; they're no different from spaces or commas. So, split() works fine:
>>> my_string = "Health\x00experience\x00charactername\x00"
>>> my_string.split('\x00')
['Health', 'experience', 'charactername', '']
Note that split is treating \x00 as a separator, not a terminator, so we get an extra empty string at the end. If that's a problem, you can just slice it off:
>>> my_string.split('\x00')[:-1]
['Health', 'experience', 'charactername']
While it boils down to using split('\x00') a convenience wrapper might be nice.
def readlines(f, bufsize):
    buf = ""
    data = True
    while data:
        data = f.read(bufsize)
        buf += data
        lines = buf.split('\x00')
        buf = lines.pop()
        for line in lines:
            yield line + '\x00'
    yield buf + '\x00'
then you can do something like
with open('myfile', 'rb') as f:
    mylist = [item for item in readlines(f, 524288)]
This has the added benefit of not needing to load the entire contents into memory before splitting the text.
To check if string has NULL byte, simply use in operator, for example:
if b'\x00' in data:
To find the position of one, use find(), which returns the lowest index in the string where the substring is found; it also takes optional start and end arguments to restrict the search to a slice of the string.
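A quick sketch of both:
>>> data = b'Health\x00experience\x00charactername\x00'
>>> b'\x00' in data
True
>>> data.find(b'\x00')      # index of the first NUL
6
>>> data.find(b'\x00', 7)   # search again, starting after it
17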
Split on null bytes; .split() returns a list:
>> print("Health\x00experience\x00charactername\x00".split("\x00"))
['Health', 'experience', 'charactername', '']
If you know the data always ends with a null byte, you can slice the list to chop off the last empty string (like result_list[:-1]).

Convert a unicode string to its original format [duplicate]

Possible Duplicate:
Converting a latin string to unicode in python
I have a list that, after being stored in a file, ends up in the following format:
list_example = [
    u"\u00cdndia, Tail\u00e2ndia & Cingapura",
    u"Lines through the days 1 (Arabic) \u0633\u0637\u0648\u0631 \u0639\u0628\u0631 \u0627\u0644\u0623\u064a\u0627\u0645 1",
]
But the actual format of the strings in the list is
actual_format = [
    "Índia, Tailândia & Cingapura ",
    "Lines through the days 1 (Arabic) سطور عبر الأيام 1 | شمس الدين خ ",
]
How can I convert the strings in list_example to strings present in the actual_format list?
Your question is a bit unclear to me. In any case, the following guidelines should help you solve your problem.
If you define those strings in Python source code, then you should
know in which character encoding your editor saves the source code file (e.g. utf-8)
declare that encoding in the first line of your source file, via e.g. # -*- coding: utf-8 -*-
define those strings as unicode objects:
strings = [u"Índia, Tailândia & Cingapura ", u"Lines through the days 1 (Arabic) سطور عبر الأيام 1 | شمس الدين خ "]
(Note: in Python 3, literal strings are unicode objects by default, i.e. you don't need the u. In Python 2, unicode strings are of type unicode, in Python 3, unicode strings are of type string.)
When you then want to save those strings to a file, you should explicitly define the character encoding:
with open('filename', 'w') as f:
    s = '\n'.join(strings)
    f.write(s.encode('utf-8'))
When you then want to read those strings again from that file, you again have to explicitly define the character encoding in order to decode the file contents properly:
with open('filename') as f:
    strings = [line.decode('utf-8') for line in f]
Alternatively, if the strings were read back as plain byte strings that still contain literal \uXXXX escapes, the unicode-escape codec recovers the unicode objects:
actual_format = [x.decode('unicode-escape') for x in list_example]
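For instance (note the non-u string literal, in which \u00cd is just six ordinary characters):
>>> print '\u00cdndia, Tail\u00e2ndia & Cingapura'.decode('unicode-escape')
Índia, Tailândia & Cingapura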

Python to show special characters

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.
I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get:
u'Von D\xc3\xbc' and u'\xc3\x96berg'
Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore").
EDIT
This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>) are put into the variable name. This is the variable which contains the special characters that I cannot print.
web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over <td> to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0:  # td contains the time
            time = remove_whitespace(td.get_text())
        else:  # td contains the name
            name = remove_whitespace(td.get_text())  # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:
your_unicode_string = original_utf8_encoded_bytestring.decode('latin1')
The cure is simply to reverse the process, then decode properly:
correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
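For example, with one of the strings from the question:
>>> rubbish = u'Von D\xc3\xbc'
>>> print rubbish.encode('latin1').decode('utf8')
Von Dü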
Update Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url.
If you can't show it, read the BS docs; it looks like you'll need to use:
BeautifulSoup(web, from_encoding='utf8')
Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:
>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg
For lots more information:
http://www.joelonsoftware.com/articles/Unicode.html
http://docs.python.org/howto/unicode.html
You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:
def convert_fake_unicode_to_real_unicode(string):
    return ''.join(map(chr, map(ord, string))).decode('utf-8')
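For example, applied to one of the strings from the question:
>>> convert_fake_unicode_to_real_unicode(u'Von D\xc3\xbc')
u'Von D\xfc'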
The contents of the strings are not unicode, they are UTF-8 encoded.
>>> print u'Von D\xc3\xbc'
Von Dü
>>> print 'Von D\xc3\xbc'
Von Dü
>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>>
Edit:
>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's an UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg
# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>
>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported

iterate through unicode strings and compare with unicode in python dictionary

I have two python dictionaries containing information about japanese words and characters:
vocabDic : contains vocabulary, key: word, value: dictionary with information about it
kanjiDic : contains kanji ( single japanese character ), key: kanji, value: dictionary with information about it
Now I would like to iterate through each character of each word in the vocabDic and look up this character in the kanji dictionary. My goal is to create a csv file which I can then import into a database as join table for vocabulary and kanji.
My Python version is 2.6
My code is as following:
kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'),
                                  delimiter=',', quotechar='|',
                                  quoting=csv.QUOTE_MINIMAL)
kanjiVocabJoinCount = 1

# loop through dictionary
for key, val in vocabDic.iteritems():
    if val['lang'] == 'jpn':  # only check japanese words
        vocab = val['text']
        print vocab
        # loop through vocab string
        for v in vocab:
            test = kanjiDic.get(v)
            print v
            print test
            if test is not None:
                print str(kanjiVocabJoinCount) + ',' + str(test['id']) + ',' + str(val['id'])
                kanjiVocabJoinWriter.writerow([str(kanjiVocabJoinCount), str(test['id']), str(val['id'])])
                kanjiVocabJoinCount = kanjiVocabJoinCount + 1
If I print the variables to the command line, I get:
vocab : works, prints in japanese
v ( one character of the vocab in the for loop ) : �
test ( character looked up in the kanjiDic ) : None
To me it seems like the for loop messes the encoding up.
I tried various functions ( decode, encode.. ) but no luck so far.
Any ideas on how I could get this working?
Help would be very much appreciated.
From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.
For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:
In [42]: v = u'債務の天井'

In [43]: vocab = v.encode('utf-8')  # val['text']

In [44]: vocab
Out[44]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'
If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.
However if you loop over the unicode object, you'd get one unicode character at a time:
In [45]: for v in u'債務の天井':
....: print(v)
債
務
の
天
井
Note that the first unicode character, encoded in utf-8, is 3 bytes:
In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'
That's why looping over the bytes, printing one byte at a time, (e.g. print \xe5) fails to print a recognizable character.
So it looks like you need to decode your str objects and work with unicode objects. You didn't mention what encoding you are using for your str objects. If it is utf-8, then you'd decode it like this:
vocab=val['text'].decode('utf-8')
If you are not sure what encoding val['text'] is in, post the output of
print(repr(vocab))
and maybe we can guess the encoding.
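Putting it together, a minimal sketch of the fixed loop (assuming utf-8 and the variable names from the question; note that the keys of kanjiDic must themselves be unicode objects for the lookup to match):
for key, val in vocabDic.iteritems():
    if val['lang'] == 'jpn':  # only check japanese words
        vocab = val['text'].decode('utf-8')  # a unicode object iterates per character, not per byte
        for v in vocab:
            test = kanjiDic.get(v)  # assumes kanjiDic keys are unicode as well
            if test is not None:
                print v, test['id']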
