How to encode data loaded with open() into UTF-8? - python

I have the following script to import a text-based file.
with open("test.tsv") as import_lines:
    for line in import_lines:
        line_parsed = line.strip().split("\t")
        print(line_parsed[0])
The output of this script is something like this:
2006\u5E74\u5B66\u672F\u6587\u7AE0
2006\u5E74\u5B78\u8853\u6587\u7AE0
2006\u5E74\u5B78\u8853\u6587\u7AE0
I assumed that decoding was as simple as:
print(line_parsed[0].encode().decode("utf-8"))
But this produces exactly the same output.
I did notice that:
print(line_parsed[0].encode())
results in:
b'2006\\u5E74\\u5B66\\u672F\\u6587\\u7AE0'
b'2006\\u5E74\\u5B78\\u8853\\u6587\\u7AE0'
b'2006\\u5E74\\u5B78\\u8853\\u6587\\u7AE0'

You don't need encode().decode(). Open your file in binary mode and decode each line with the unicode-escape codec:
with open("test.tsv", "rb") as import_lines:
    for line in import_lines:
        line_parsed = line.strip().decode('unicode-escape').split("\t")
        print(line_parsed[0])
Output:
2006年学术文章
2006年學術文章
2006年學術文章
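As a self-contained sketch of the same idea, the snippet below decodes one in-memory line instead of reading test.tsv. The sample bytes and the second column ("other") are made up for illustration. One caveat worth keeping in mind: unicode-escape treats any raw non-ASCII byte as Latin-1, so this approach only suits files whose non-ASCII text is stored entirely as \uXXXX escapes.

```python
# Hypothetical sample line: literal backslash-u escapes, a tab, then a second column.
raw_line = b"2006\\u5E74\\u5B66\\u672F\\u6587\\u7AE0\tother\n"

# Decode the escape sequences into real characters, then split on tabs.
line_parsed = raw_line.strip().decode("unicode-escape").split("\t")
first = line_parsed[0]

# The escapes are gone: the first column is now 9 characters long.
print(len(first))
print(first == "2006年学术文章")
```

The same call works on a str by going through bytes first, as in text.encode("ascii").decode("unicode-escape"), provided the text is pure ASCII.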

print(line_obj["label"].encode().decode("unicode-escape"))

Related

get data from .dat file with python

I need to read a .dat file in Python and extract two values from it:
[Registration information]
Name=nikam
Key=KDWOE
I need to get nilkam from Name and KDWOE from Key.
datContent = [i.strip().split() for i in open("./license.dat").readlines()]
print (datContent)
I got this result:
[['[Registration', 'information]'], ['Name=nilkam'], ['Key=KZOiX=BFcjLKqJr6HwYxYU+NHN8+MP7VO0YA5+O1PwX0C3twCmum=BLfBI95NQw']]
And from my second attempt:
with open("./license.dat", 'r') as f:
    f = (f.read())
    print (f)
I got this:
[Registration information]
Name=nikam
Key=KDWOE
The result I need is nilkam from Name and KDWOE from Key.
I'm not sure what a .dat file is, and you don't specify, but given your example it looks like the configparser library might work for you.
import configparser
config = configparser.ConfigParser()
config.read('./license.dat')
print(config['Registration information']['Name'])
print(config['Registration information']['Key'])
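To check the configparser suggestion without the real license.dat, here is a minimal sketch that parses an in-memory string instead. The sample text and the shortened key value are made up. Passing interpolation=None is an assumption on my part: it keeps configparser from treating a % inside a license key as interpolation syntax. Values are split on the first = only, so a key that itself contains = survives intact.

```python
import configparser

# Hypothetical stand-in for the contents of license.dat.
sample = """[Registration information]
Name=nikam
Key=KZOiX=BFcj+NHN8
"""

# interpolation=None leaves any '%' in the key untouched.
config = configparser.ConfigParser(interpolation=None)
config.read_string(sample)

section = config["Registration information"]
name = section["Name"]
key = section["Key"]

print(name)
print(key)
```

Option names are case-insensitive by default, so section["name"] would return the same value.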

GZip and output file

I'm having difficulty with the following code (which is simplified from a larger application I'm working on in Python).
from io import StringIO
import gzip
jsonString = 'JSON encoded string here created by a previous process in the application'
out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(str.encode(jsonString))
# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "a", encoding="utf-8") as f:
    f.write(out.getvalue())
When this runs I get the following error:
  File "d:\Development\AWS\TwitterCompetitionsStreaming.py", line 61, in on_status
    with gzip.GzipFile(fileobj=out, mode="w") as f:
  File "C:\Python38\lib\gzip.py", line 204, in __init__
    self._write_gzip_header(compresslevel)
  File "C:\Python38\lib\gzip.py", line 232, in _write_gzip_header
    self.fileobj.write(b'\037\213') # magic header
TypeError: string argument expected, got 'bytes'
PS ignore the rubbish indenting here...I know it doesn't look right.
What I'm wanting to do is to create a json file and gzip it in place in memory before saving the gzipped file to the filesystem (windows). I know I've gone about this the wrong way and could do with a pointer. Many thanks in advance.
When working with gzip you have to use bytes everywhere instead of strings and text. First, use BytesIO instead of StringIO. Second, open the output file with mode 'wb' and no encoding argument: in open(), plain 'w' and 'a' are text modes, while the 'b' suffix means binary (use 'ab' if you need to append). For gzip.GzipFile itself, 'w' and 'wb' are equivalent, because it always writes bytes. Full corrected code below:
from io import BytesIO
import gzip
jsonString = 'JSON encoded string here created by a previous process in the application'
out = BytesIO()
with gzip.GzipFile(fileobj=out, mode='wb') as f:
    f.write(str.encode(jsonString))
currenttimestamp = '2021-01-29'
# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "wb") as f:
    f.write(out.getvalue())
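A quick way to convince yourself that the in-memory buffer really holds valid gzip data is to decompress it again. The sketch below does that round trip without touching the filesystem. The payload dictionary is a made-up example, not data from the question.

```python
import gzip
import io
import json

# Hypothetical payload standing in for the JSON built earlier in the application.
payload = {"user": "example", "text": "Operações de grupo"}
json_string = json.dumps(payload, ensure_ascii=False)

# Compress into an in-memory bytes buffer.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(json_string.encode("utf-8"))

# Read the buffer only after the with block, so the gzip trailer has been written.
compressed = buffer.getvalue()

# Round trip: decompress and parse again.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
print(restored == payload)
```

If streaming is not needed, gzip.compress(json_string.encode("utf-8")) produces the compressed bytes in a single call.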

Python 3: JSON File Load with Non-ASCII Characters

I'm trying to load this JSON file (with non-ASCII characters) as a Python dictionary with Unicode encoding, but I keep getting this error:
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)
JSON file content = "tooltip":{
"dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",}
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    for line in f:
        data.append(json.loads(line.encode('utf-8','replace')))
You have several problems, as near as I can tell. The first is the file encoding. When you open a file in text mode without specifying an encoding, Python by default uses the locale's preferred encoding (locale.getpreferredencoding(False)). Since that varies between machines (especially on Windows), it's a good idea to pass encoding="utf-8" explicitly for most JSON files. Given your error message, I suspect that the file was opened with an ASCII encoding.
Next, the file is decoded into Python strings as it is read by the file object. Each line has already been decoded to a str and is ready for json to read. When you call line.encode('utf-8','replace'), you encode the line back into a bytes object, which json.loads (that is, "load string") could not handle before Python 3.6 and which is unnecessary in any case.
Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid JSON on its own, but it does look like part of a pretty-printed JSON file containing a single JSON object. My guess is that you should read the entire file as one JSON object.
Putting it all together you get:
import json
with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
    data = json.load(f)
From its name, it's possible that this file is encoded in a legacy Portuguese code page instead. If so, "cp860" (the DOS Portuguese code page) or "cp1252" (the Windows Western European code page) may work better.
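To try the explicit-encoding approach without the original pt-PT.json, the sketch below writes a small UTF-8 JSON file into a temporary directory and reads it back in one call. The file contents are a made-up stand-in based on the snippet quoted in the answer.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the contents of pt-PT.json.
expected = {"tooltip": {"navbar": "Operações de grupo"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pt-PT.json")

    # Write pretty-printed JSON with the non-ASCII characters stored as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(expected, f, ensure_ascii=False, indent=2)

    # Read the whole file as one JSON object, naming the encoding explicitly.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

print(data == expected)
```

Because the encoding is named on both the write and the read, the result does not depend on the machine's locale settings.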
I had the same problem. What worked for me was building a regular expression and cleaning every line of the JSON file with it:
import re
# Every run of characters outside this set (including non-ASCII letters) becomes one space.
REGEXP = r"[^A-Za-z0-9':.;\-?!]+"
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()
Having a file with content similar to yours I can read the file in one simple shot:
>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
...     data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}
You don't need to read each line. You have two options:
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.load(f))
Or, you can load all lines and pass them to the json module:
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.loads(''.join(f.readlines())))
Obviously, the first suggestion is the best.

Python : store data in file

I am trying to store JSON as a text file. I am able to print the data but not able to store it in the file, and the output also comes out with unicode characters.
Please find the code below.
import json
from pprint import pprint
with open('20150827_abc_json') as data_file:
    f=open("file.txt","wb")
    f.write(data=json.load(data_file))
    print (data)>f
    f.close()
When I execute it, the file gets created but it is zero bytes. How can I store the output, and how can I get rid of the unicode characters?
Output:
u'Louisiana', u'city': u'New Olreans'
To serialize JSON to a file, use the json.dump function. Try the following code:
import json
from pprint import pprint
with open('20150827_abc_json') as data_file, open('file.txt','w') as f:
    data=json.load(data_file)
    print(data)
    json.dump(data,f)
The print syntax is wrong: you wrote a single > where Python 2 requires two of them (print >>f, data).
In Python 3 (or Python 2 with from __future__ import print_function) you can also write it more explicitly:
print("blah blah", file=yourfile)
I would also suggest to use a context manager for both files:
with open('20150827_abc_json') as data_file:
    with open("file.txt","wb") as file:
        ...
Otherwise you risk an error leaving your destination file open.
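For anyone on Python 3, here is a minimal sketch of the same load-then-dump flow with both files managed by one with statement. The input record is made up ("state" is a hypothetical key name), and the temporary file names are placeholders. The u'...' prefixes in the question are only how Python 2 displays unicode strings; json.dump never writes them to the file.

```python
import json
import os
import tempfile

# Hypothetical record standing in for the contents of the input JSON file.
record = {"state": "Louisiana", "city": "New Orleans"}

with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "input.json")
    dst = os.path.join(tmp_dir, "file.txt")

    # Create the input file so the example is self-contained.
    with open(src, "w", encoding="utf-8") as f:
        json.dump(record, f)

    # Load the JSON and serialize it to a text file in one with statement.
    with open(src, encoding="utf-8") as data_file, \
            open(dst, "w", encoding="utf-8") as out_file:
        data = json.load(data_file)
        json.dump(data, out_file, ensure_ascii=False, indent=2)

    with open(dst, encoding="utf-8") as f:
        written = f.read()

print(written)
```

The output file is opened in text mode ("w") here, because json.dump writes str, not bytes.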

Python django writing ascii characters in coded format in file

I am using Django to generate the abc.tex file.
I am displaying the data in the browser and writing the same data to a tex file like this:
with open("sample.tex") as f:
    t = Template(f.read())
head = ['name','class']
c = Context({"head":headers, "table": rowlist})
# Render template
output = t.render(c)
with open("mytable.tex", 'w') as out_f:
    out_f.write(output)
Now in the browser I can see the text as speaker-hearer's, but in the file it comes out as speaker-hearer&#39;s.
How can I fix that?
As far as I know, the browser decodes this data automatically, but the text within the file will be raw; so you are seeing the data "as it is".
Maybe you can use the HTMLParser library to decode the data generated by Django (output) before writing to the abc.tex file.
For your sample string (this is Python 2 code; in Python 3 the module is called html.parser):
import HTMLParser
h = HTMLParser.HTMLParser()
s = "speaker-hearer&#39;s"
s = h.unescape(s)
So then it would be just a matter of unescaping your output when you write it to a file, and probably handling the parsing exception.
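On Python 3 the standard library offers html.unescape for the same job, so a minimal sketch could look like the following. The sample string assumes the rendered output contains the &#39; entity, which is what Django's autoescaping emits for an apostrophe.

```python
import html

# Assumed sample of what the rendered template contains.
rendered = "speaker-hearer&#39;s"

# Convert HTML character references back into plain characters.
plain = html.unescape(rendered)
print(plain)
```

In the question's code this would amount to calling html.unescape(output) before out_f.write(...).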
