I'm having real problems encoding/decoding strings to a specific charset (UTF-8).
My Unicode Object is:
>> u'Valor Econ\xf4mico - Opini\xe3o'
When I call print from python it returns:
>> Valor Econômico - Opinião
When I call .encode("utf-8") on my unicode object to write it to JSON, it returns:
>> 'Valor Econ\xc3\xb4mico - Opini\xc3\xa3o'
What am I doing wrong? What exactly is print() doing that I'm not?
Note: I'm creating this unicode object from a line of a file.
import codecs

with codecs.open(path, 'r') as local_file:
    for line in local_file:
        obj = unicode(line.replace(codecs.BOM_UTF8, '').replace('\n', ''), 'utf-8')
'Valor Econ\xc3\xb4mico - Opini\xc3\xa3o' is the repr of the UTF-8 bytes, shown that way for a non-UTF-8 terminal, probably in the interactive shell. If you were to write this to a file (open("myfile", "wb").write('Valor Econ\xc3\xb4mico - Opini\xc3\xa3o')), you'd have a valid UTF-8 file.
To create Unicode strings from a file, you can use automatic decoding in the io module (codecs.open() is effectively superseded by io.open()). With the "utf-8-sig" codec, a leading BOM is removed automatically:
import io

with io.open(path, "r", encoding="utf-8-sig") as local_file:
    for line in local_file:
        unicode_obj = line.strip()
When it comes to creating a JSON response, use the result of json.dumps(my_object). By default it returns a str with all non-ASCII characters escaped as \uXXXX sequences.
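A quick demonstration of that default (Python 2, reusing the string from the question):

import json

s = u'Valor Econ\xf4mico - Opini\xe3o'
print json.dumps(s)                      # "Valor Econ\u00f4mico - Opini\u00e3o" (ASCII-safe str)
print json.dumps(s, ensure_ascii=False)  # "Valor Econômico - Opinião" (a unicode object; encode it yourself when writing)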
Related
I have a file in cp866 encoding, and I open it:
input_file = open(file_name, 'r', encoding = 'cp866')
How can I print() lines from this file as UTF-8?
I need to decode this file to UTF-8 and print it.
Well, the characters read from the file will be decoded and stored in memory as Python strings. You can print them on screen and they should display correctly. You can then save the data as UTF-8.
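A minimal sketch of that flow (Python 3; file_name and the output path are placeholders):

with open(file_name, 'r', encoding='cp866') as input_file:
    text = input_file.read()          # bytes are decoded from cp866 into a str here

print(text)                            # prints the decoded text

with open('out_utf8.txt', 'w', encoding='utf-8') as output_file:
    output_file.write(text)            # the str is re-encoded as UTF-8 on write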
You can try a simple conversion; since the text read from the file is already a decoded str, encoding it is enough:
result = text.encode('utf8')
Or re-interpret mis-decoded text by encoding back to bytes with one codec and decoding with another (usually from cp866 you want to convert to cp1251):
data = myfile.read()
b = bytes(data, "KOI8-R")         # encode the str back to bytes as KOI8-R
data_encoding = str(b, "cp1251")  # decode those bytes as cp1251
If you only need to decode once, try a web-based converter.
So I'm trying to create a very simple program that opens a file, reads it, and converts its contents from hex to base64 using Python 3.
I tried this:
file = open("test.txt", "r")
contenu = file.read()
encoded = contenu.decode("hex").encode("base64")
print (encoded)
but I get the error:
AttributeError: 'str' object has no attribute 'decode'
I tried multiple other things but always get the same error.
inside the test.txt is :
4B
If you can explain what I'm doing wrong, that would be awesome.
Thank you.
EDIT:
I should get Sw== as output.
This should do the trick. Your code works for Python <= 2.7 but needs updating in later versions.
import base64

file = open("test.txt", "r")
contenu = file.read()
raw = bytearray.fromhex(contenu)                 # hex string -> raw bytes
encoded = base64.b64encode(raw).decode('ascii')  # b64encode returns bytes; decode to str
print(encoded)
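Note that base64.b64encode returns bytes, which is why .decode('ascii') is applied: it converts the result back into a str for printing.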
You need to convert the hex string from test.txt to a bytes-like object using bytes.fromhex() before encoding it to base64.
import base64

with open("test.txt", "r") as file:
    content = file.read()
    encoded = base64.b64encode(bytes.fromhex(content))

print(encoded)
You should always use a with statement when opening your file, so the I/O is closed automatically when you're finished.
In IDLE:
>>> import base64
>>>
>>> with open('test.txt', 'r') as file:
...     content = file.read()
...     encoded = base64.b64encode(bytes.fromhex(content))
...
>>> encoded
b'Sw=='
I am working on Twitter. I got data from Twitter with the Stream API, and the result is a JSON file. I wrote the tweet data to a text file, and now I see Unicode escape sequences instead of Turkish characters. I don't want to do find/replace by hand in Notepad++. Is there any automatic way to replace the escape sequences with Turkish characters by opening the txt file, reading all the data, and doing the replacement in Python?
Here are the Turkish characters and their escape sequences:
ğ - \u011f
Ğ - \u011e
ı - \u0131
İ - \u0130
ö - \u00f6
Ö - \u00d6
ü - \u00fc
Ü - \u00dc
ş - \u015f
Ş - \u015e
ç - \u00e7
Ç - \u00c7
I tried two different approaches:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

dosya = open('veri.txt', 'r')
for line in dosya:
    match = re.search(line, "\u011f")
    if (match):
        replace("\u011f", "ğ")
dosya.close()
and:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
f1 = open('veri.txt', 'r')
f2 = open('veri2.txt', 'w')
for line in f1:
    f2.write=(line.replace('\u011f', 'ğ'))
    f2.write=(line.replace('\u011e', 'Ğ'))
    f2.write=(line.replace('\u0131', 'ı'))
    f2.write=(line.replace('\u0130', 'İ'))
    f2.write=(line.replace('\u00f6', 'ö'))
    f2.write=(line.replace('\u00d6', 'Ö'))
    f2.write=(line.replace('\u00fc', 'ü'))
    f2.write=(line.replace('\u00dc', 'Ü'))
    f2.write=(line.replace('\u015f', 'ş'))
    f2.write=(line.replace('\u015e', 'Ş'))
    f2.write=(line.replace('\u00e7', 'ç'))
    f2.write=(line.replace('\u00c7', 'Ç'))
f1.close()
f2.close()
Neither of these worked.
How can I make it work?
JSON allows both "escaped" and "unescaped" characters. The reason the Twitter API returns only escaped characters is that it can then use the ASCII encoding, which increases interoperability. For Turkish characters you need another encoding. Opening a file with the open function assumes your current locale encoding, which is probably what your editor expects. If you want the output file to have e.g. the ISO-8859-9 encoding, you can pass encoding='ISO-8859-9' as an additional parameter to the open function.
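For example (the output file name here is hypothetical):

out = open('output.txt', 'w', encoding='ISO-8859-9')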
You can read a file containing a JSON object with the json.load function. This returns a Python object with the escaped characters decoded. Writing it again with json.dump and passing ensure_ascii=False as an argument writes the object back to a file without encoding Turkish characters as escape sequences. An example:
import json
inp = open('input.txt', 'r')
out = open('output.txt', 'w')
in_as_obj = json.load(inp)
json.dump(in_as_obj, out, ensure_ascii=False)
Your file isn't really a JSON file, but instead a file containing multiple JSON objects. If each JSON object is on its own line, you can try the following:
import json

inp = open('input.txt', 'r')
out = open('output.txt', 'w')
for line in inp:
    if not line.strip():
        out.write(line)
        continue
    in_as_obj = json.loads(line)
    json.dump(in_as_obj, out, ensure_ascii=False)
    out.write('\n')
But in your case it's probably better to write unescaped JSON to the file in the first place. Try replacing your on_data method by (untested):
def on_data(self, raw_data):
    data = json.loads(raw_data)
    print(json.dumps(data, ensure_ascii=False))
You can use this method:
# For Turkish Character
translationTable = str.maketrans("ğĞıİöÖüÜşŞçÇ", "gGiIoOuUsScC")
yourText = "Pijamalı Hasta Yağız Şoföre Çabucak Güvendi"
yourText = yourText.translate(translationTable)
print(yourText)
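This prints Pijamali Hasta Yagiz Sofore Cabucak Guvendi; note that it strips the diacritics down to plain ASCII rather than restoring the Turkish characters.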
I'm just trying to load this JSON file (with non-ASCII characters) as a Python dictionary, but I keep getting this error:
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)
JSON file content:
"tooltip": {
    "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",
}
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    for line in f:
        data.append(json.loads(line.encode('utf-8','replace')))
You have several problems, as near as I can tell. First is the file encoding. When you open a file without specifying an encoding, it is opened with whatever locale.getpreferredencoding(False) returns. Since that may vary (especially on Windows machines), it's a good idea to explicitly use encoding="utf-8" for most JSON files. Because of your error message, I suspect that the file was opened with an ASCII encoding.
Next, the file is decoded from UTF-8 into Python strings as it is read by the file object. Each line has already been decoded to a string and is ready for json to read. When you do line.encode('utf-8','replace'), you encode the line back into a bytes object, which json.loads (that is, "load string") can't handle.
Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid JSON, but it does look like one line of a pretty-printed JSON file containing a single JSON object. My guess is that you should read the entire file as one JSON object.
Putting it all together you get:
import json

with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
    data = json.load(f)
From its name, it's possible that this file is encoded with a Windows Portuguese code page. If so, the "cp860" encoding may work better.
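A variant of the snippet above, if the file does turn out to be in that code page:

import json

with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="cp860") as f:
    data = json.load(f)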
I had the same problem; what worked for me was creating a regular expression and filtering every line of the JSON file:

import re

REGEXP = r"[^A-Za-z0-9':.;\-?!]+"
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()
Having a file with content similar to yours, I can read it in one simple shot:
>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
... data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}
You don't need to read each line. You have two options:
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.load(f))
Or, you can load all lines and pass them to the json module:
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.loads(''.join(f.readlines())))
Obviously, the first suggestion is the best.
So I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why it worked.
This is related to a problem I had the other week: python check if utf-8 string is uppercase
Basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') # cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
          u'R\xc9SUM\xc9',   # RESUME with accents
          u'R\xe9sum\xe9',   # Resume with accents
          u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): # .isupper won't function on utf8
        title = etree.SubElement(sect, 'title')
        title.text = word
    else:
        item = etree.SubElement(sect, 'item')
        item.text = word
print>>outFile, etree.tostring(root, pretty_print=True, xml_declaration=True, encoding='utf-8')
It fails with the following:
Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile, etree.tostring(root, pretty_print=True, xml_declaration=True, encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
But if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w'), it works perfectly.
So what's happening?
Since encoding='utf-8' is specified in etree.tostring(), is it encoding the output again?
If I keep codecs.open() and remove encoding='utf-8', the file then becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
But etree.tostring() is simply being written via print, and is then redirected to a file that was created as a utf-8 file??
Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and want to figure out why my solution works.
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
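A minimal sketch of that advice applied to the question's code (Python 2 with lxml, reusing root from above); either hand UTF-8 bytes to a binary file, or hand real unicode to the codecs file:

# Option 1: lxml produces UTF-8 bytes; write them to a file opened in binary mode.
with open('test.xml', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True,
                           xml_declaration=True, encoding='utf-8'))

# Option 2: keep the UTF-8 text file, but give it unicode, not bytes.
# (encoding=unicode makes tostring return a unicode string; no xml_declaration then.)
outFile = codecs.open('test.xml', 'w', 'utf-8')
outFile.write(etree.tostring(root, pretty_print=True, encoding=unicode))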
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root, u'sect')
words = [u'МОСКВА', u'RÉSUMÉ', u'Résumé', u'RéSUMé']
for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect, u'title')
        title.text = word
    else:
        item = etree.SubElement(sect, u'item')
        item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml', xml_declaration=True, encoding='utf-8')
In addition to MRAB's answer, some lines of code:
import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))