Python file input string: how to handle escaped unicode characters? - python

In a text file (test.txt), my string looks like this:
Gro\u00DFbritannien
Reading it, python escapes the backslash:
>>> file = open('test.txt', 'r')
>>> input = file.readline()
>>> input
'Gro\\u00DFbritannien'
How can I have this interpreted as unicode? decode() and unicode() won't do the job.
The following code writes Gro\u00DFbritannien back to the file, but I want it to be Großbritannien
>>> input.decode('latin-1')
u'Gro\\u00DFbritannien'
>>> out = codecs.open('out.txt', 'w', 'utf-8')
>>> out.write(input)

You want to use the unicode_escape codec:
>>> x = 'Gro\\u00DFbritannien'
>>> y = unicode(x, 'unicode_escape')
>>> print y
Großbritannien
See the docs for the vast number of standard encodings that come as part of the Python standard library.

Use the built-in 'unicode_escape' codec:
>>> file = open('test.txt', 'r')
>>> input = file.readline()
>>> input
'Gro\\u00DFbritannien\n'
>>> input.decode('unicode_escape')
u'Gro\xdfbritannien\n'
You may also use codecs.open():
>>> import codecs
>>> file = codecs.open('test.txt', 'r', 'unicode_escape')
>>> input = file.readline()
>>> input
u'Gro\xdfbritannien\n'
The list of standard encodings is available in the Python documentation: http://docs.python.org/library/codecs.html#standard-encodings
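For readers on Python 3, where unicode() and str.decode() no longer exist, the same idea can be sketched roughly as follows. This is only an illustration: the helper name read_escaped_line and the temporary file standing in for test.txt are assumptions, not part of the answers above.

```python
import os
import tempfile

def read_escaped_line(path):
    """Read the first line of `path` and interpret backslash-u escapes."""
    # Open in binary mode so we get bytes, which still have a decode() method.
    with open(path, "rb") as f:
        raw = f.readline()
    # Caveat: unicode_escape treats any unescaped byte as Latin-1.
    return raw.decode("unicode_escape").rstrip("\n")

# Demo: a temporary file stands in for test.txt (hypothetical content).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Gro\\u00DFbritannien\n")
decoded = read_escaped_line(tmp.name)
os.remove(tmp.name)
print(decoded)
```

Because unicode_escape assumes Latin-1 for everything that is not an escape, this is only safe when the rest of the file is plain ASCII or Latin-1.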

Related

Convert file to base64 string on Python 3

I need to convert an image (or any file) to a base64 string. I have tried several approaches, but the result is always bytes, not a string. Example:
import base64
file = open('test.png', 'rb')
file_content = file.read()
base64_one = base64.encodestring(file_content)
base64_two = base64.b64encode(file_content)
print(type(base64_one))
print(type(base64_two))
Returned
<class 'bytes'>
<class 'bytes'>
How do I get a string, not byte? Python 3.4.2.
Base64 output contains only ASCII characters, so you can simply decode the result with 'ascii':
>>> import base64
>>> example = b'\x01'*10
>>> example
b'\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01'
>>> result = base64.b64encode(example).decode('ascii')
>>> print(repr(result))
'AQEBAQEBAQEBAQ=='
I need to write base64 text in file ...
So then stop worrying about strings and just do that instead:
with open('output.b64', 'wb') as f:
    f.write(base64_one)
The following code worked for me:
import base64
# 'file' is the path of the file to encode
with open(file, 'rb') as file_text:
    file_read = file_text.read()
file_encode = base64.encodebytes(file_read)
I initially tried base64.encodestring(), but that function has been deprecated (and was removed in Python 3.9) as per this issue.
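To tie the answers together, here is a minimal sketch of going from raw bytes to a real str and back. The helper name bytes_to_base64_str and the sample bytes are made up for illustration; in practice the bytes would come from reading the file in 'rb' mode.

```python
import base64

def bytes_to_base64_str(data: bytes) -> str:
    # b64encode returns bytes; the alphabet is pure ASCII, so decoding is safe.
    return base64.b64encode(data).decode("ascii")

# Hypothetical stand-in for file.read(): the 8-byte PNG signature.
sample = b"\x89PNG\r\n\x1a\n"
encoded = bytes_to_base64_str(sample)
print(type(encoded), encoded)

# Decoding accepts the str directly and returns the original bytes.
restored = base64.b64decode(encoded)
```

Unlike encodebytes(), b64encode() does not insert newlines every 76 characters, which is usually what you want for embedding the result in JSON or a data URI.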

Unicode in python 3

I want to convert a string that contains Unicode escape sequences to ordinary text. For example, the file "input.txt" contains the string '\u0057\u0068\u0061\u0074', and I want to know what it means. If the string is entered directly in the code like this:
s = '\u0057\u0068\u0061\u0074'
b = s.encode('utf-8')
print(b)
it works perfectly, but if I do the same with the contents of the file I get b'\\u0057\\u0068\\u0061\\u0074'.
How can I fix this? I am on Windows 8, and the files are encoded as 'windows-1251'.
If your file contains those unicode escape sequences, then you can use the unicode_escape “codec” to interpret them after you read the file contents as a string.
>>> s = r'\u0057\u0068\u0061\u0074'
>>> print(s)
\u0057\u0068\u0061\u0074
>>> s.encode('utf-8').decode('unicode_escape')
'What'
Or, you can just read a bytes string directly and decode that:
with open('file.txt', 'br') as f:
    print(f.read().decode('unicode_escape'))
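One caveat worth sketching: unicode_escape treats unescaped bytes as Latin-1, so it can mangle a file that mixes escapes with real non-ASCII text (for instance Cyrillic from a windows-1251 file). A possible alternative, swapped in here and not taken from the answer above, is to decode the file with its real encoding first and then expand only the \uXXXX sequences with a regular expression. The function name expand_u_escapes is hypothetical.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def expand_u_escapes(text):
    """Replace each \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# The raw string mimics what you get after reading the file as text.
sample = r"\u0057\u0068\u0061\u0074" + " привет"
expanded = expand_u_escapes(sample)
print(expanded)
```

This sketch leaves other escapes (such as \n or \U0001F600) untouched and does not combine surrogate pairs, so it only covers the simple four-digit case from the question.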

Storing a random byte string in Python

For my project, I need to be able to store random byte strings in a file and read the byte string again later. For example, I want to store randomByteString from the following code:
>>> from os import urandom
>>> randomByteString=urandom(8)
>>> randomByteString
b'zOZ\x84\xfb\xceM~'
What would be the proper way to do this?
Edit: Forgot to mention that I also want to store 'normal' string alongside the byte strings.
Code like:
>>> fh = open("e:\\test","wb")
>>> fh.write(randomByteString)
8
>>> fh.close()
Open the file in binary mode. You can also do this more cleanly with a with block, which closes the file for you, if the file operations are all in one place (thanks to @Blender):
>>> with open("e:\\test","wb") as fh:
...     fh.write(randomByteString)
Update: if you want to store normal strings as well, you can encode them and then write them like this:
>>> "test".encode()
b'test'
>>> fh.write("test".encode())
Here fh is the file handle opened previously.
Works just fine. You can't expect the output to make much sense though.
>>> import os
>>> with open("foo.txt", "wb") as fh:
...     fh.write(os.urandom(8))
...
>>> fh.close()
>>> with open("foo.txt", "r") as fh:
...     for line in fh.read():
...         print line
...
^J^JM-/
^O
R
M-9
J
~G
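Since the question also asks about keeping 'normal' strings next to the byte strings, one possible approach (an assumption on my part, not something the answers above propose) is to base64-encode the bytes and store everything in a small JSON file, so that both kinds of value can be told apart when reading back. The names save_record and load_record and the field names are hypothetical.

```python
import base64
import json
import os
import tempfile

def save_record(path, token, label):
    # Bytes are not valid JSON, so store them as base64 text.
    record = {"token": base64.b64encode(token).decode("ascii"), "label": label}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)

def load_record(path):
    with open(path, "r", encoding="utf-8") as f:
        record = json.load(f)
    # Reverse the base64 step to recover the original bytes.
    return base64.b64decode(record["token"]), record["label"]

path = os.path.join(tempfile.mkdtemp(), "record.json")
random_bytes = os.urandom(8)
save_record(path, random_bytes, "test")
restored_bytes, restored_label = load_record(path)
print(restored_bytes == random_bytes, restored_label)
```

The trade-off is file size (base64 adds about a third) in exchange for a file that stays readable as text and needs no custom framing of the binary data.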

Python JSON preserve encoding

I have a file like this:
aarónico
aaronita
ababol
abacá
abacería
abacero
ábaco
#more words, with no ascii chars
When I read and print that file to the console, it prints exactly the same, as expected, but when I do:
f.write(json.dumps({word: Lookup(line)}))
This is saved instead:
{"aar\u00f3nico": ["Stuff"]}
When I expected:
{"aarónico": ["Stuff"]}
I need to get the same keys back when I json.loads() it, but I don't know where or how to do the encoding, or whether it's needed at all to get it to work.
EDIT
This is the code that saves the data to a file:
with open(LEMARIO_FILE, "r") as flemario:
    with open(DATA_FILE, "w") as f:
        while True:
            word = flemario.readline().strip()
            if word == "":
                break
            print word #this is correct
            f.write(json.dumps({word: RAELookup(word)}))
            f.write("\n")
And this one loads the data and returns the dictionary object:
with open(DATA_FILE, "r") as f:
    while True:
        new = f.readline().strip()
        if new == "":
            break
        print json.loads(new) #this is not
I cannot lookup the dictionaries if the keys are not the same as the saved ones.
EDIT 2
>>> import json
>>> f = open("test", "w")
>>> f.write(json.dumps({"héllö": ["stuff"]}))
>>> f.close()
>>> f = open("test", "r")
>>> print json.loads(f.read())
{u'h\xe9ll\xf6': [u'stuff']}
>>> "héllö" in {u'h\xe9ll\xf6': [u'stuff']}
False
This is normal and valid JSON behaviour. The \uxxxx escape is also used by Python, so make sure you don't confuse python literal representations with the contents of the string.
Demo in Python 3.3:
>>> import json
>>> print('aar\u00f3nico')
aarónico
>>> print(json.dumps('aar\u00f3nico'))
"aar\u00f3nico"
>>> print(json.loads(json.dumps('aar\u00f3nico')))
aarónico
In python 2.7:
>>> import json
>>> print u'aar\u00f3nico'
aarónico
>>> print(json.dumps(u'aar\u00f3nico'))
"aar\u00f3nico"
>>> print(json.loads(json.dumps(u'aar\u00f3nico')))
aarónico
When reading and writing from and to files, and when specifying just raw byte strings (and "héllö" is a raw byte string) then you are not dealing with Unicode data. You need to learn about the differences between encoded and Unicode data first. I strongly recommend you read at least 2 of the following 3 articles:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
You were lucky with your "héllö" Python raw byte string representation: Python managed to decode it automatically for you. The value read back from the file is perfectly normal and correct:
>>> print u'h\xe9ll\xf6'
héllö
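If the goal is to have the accented characters appear literally in the saved file, a rough Python 3 sketch is to pass ensure_ascii=False to json.dumps and open the file with an explicit encoding. The UTF-8 choice and the temporary file path are assumptions for the demo; the question's own DATA_FILE and lookup function are not reproduced here.

```python
import json
import os
import tempfile

# Hypothetical stand-in for DATA_FILE.
path = os.path.join(tempfile.mkdtemp(), "data.json")
data = {"aarónico": ["Stuff"]}

# ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes.
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(data, ensure_ascii=False))

# Read back with the same encoding; json.loads returns str keys.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
loaded = json.loads(text)

print(text)
print("aarónico" in loaded)
```

Either form of the file loads to the same dictionary; the escaped and unescaped versions differ only in how the text looks on disk.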

Remove character between two characters in Python

My input string looks like this:
"1,724,741","24,527,465",14.00,14.35,14.00,14.25
I want the output to look like this:
1724741,24527465,14.00,14.35,14.00,14.25
I played with re.sub but still couldn't figure it out.
Any help would be appreciated.
The csv module handles the quoting nicely:
>>> s = '"1,724,741","24,527,465",14.00,14.35,14.00,14.25'
>>> import csv
>>> r = csv.reader([s])
>>> for row in r:
...     print ','.join(x.replace(",", "") for x in row)
...
1724741,24527465,14.00,14.35,14.00,14.25
A quite hacky solution is to use ast.literal_eval():
>>> from ast import literal_eval
>>> s = '"1,724,741","24,527,465",14.00,14.35,14.00,14.25'
>>> print ",".join(x.replace(",", "") if isinstance(x, str) else str(x)
... for x in literal_eval(s))
1724741,24527465,14.0,14.35,14.0,14.25
Note that this also reformats the floating point numbers.
Edit: Since you are apparently dealing with a CSV file and integers with thousands separators, a cleaner solution might be
import csv
import locale
locale.setlocale(locale.LC_ALL, 'en_GB.UTF8')
converters = [locale.atoi] * 2 + [locale.atof] * 4
with open("input.csv", "rb") as f, open("output.csv", "wb") as g:
    out = csv.writer(g)
    for row in csv.reader(f):
        out.writerow([conv(x) for conv, x in zip(converters, row)])
You will need to substitute en_GB.UTF8 by a locale supported by your machine (and having comma as a thousands separator).
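The answers above are written for Python 2. As a small Python 3 sketch of the csv-based idea that avoids any locale dependency, something like the following should work; the function name strip_thousands is made up for the example.

```python
import csv

def strip_thousands(line):
    """Parse one CSV line and drop the commas inside quoted fields."""
    # csv.reader accepts any iterable of lines, so wrap the string in a list.
    row = next(csv.reader([line]))
    # After parsing, commas that remain in a field are thousands separators.
    return ",".join(field.replace(",", "") for field in row)

cleaned = strip_thousands('"1,724,741","24,527,465",14.00,14.35,14.00,14.25')
print(cleaned)
```

Because the fields are kept as strings, values such as 14.00 are not reformatted the way they are with ast.literal_eval().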
