Loading json file and decoding in UTF-16 - python

This is a better way of wording my question:
I'm trying to read a utf-16 characters (English and Arabic) from a .json.gz file in python 2.7.
The code lines that I have written read utf-8 characters:
import glob
import json
import gzip
print("Reading input JSON files")
for filename in glob.glob("*api*.json.gz"):
with gzip.open(filename,'r') as f:
data = json.loads(f.read().decode('utf-8'))
I tried a simple replacement of utf-8 to utf-16, but I got this error:
ValueError: No JSON object could be decoded
Any help would be appreciated.

Specify the encoding as a part of open(). Here is a "round-trip demo":
>>> import json
>>> data = {
... "title": "قالت وزارة الداخلية المصرية إن كمية من المتفجرات في سيارة كانت معدة لتنفيذ عملية إرهابية أدت إلى الانفجار الذي وقع وسط القاهرة وأودى بحياة نحو 20 شخصا."
... }
>>> with open("/tmp/utf16demo.json", "w", encoding="utf-16") as f:
... json.dump(data, f)
>>> with open("/tmp/utf16demo.json", encoding="utf-16") as f:
... newdata = json.load(f)
>>> next(iter(newdata.values())) == next(iter(data.values()))
True
As mentioned in the comments, just because the data is originally UTF-16 encoded does not need you mean to write it back to CSV in the same encoding. You are perfectly free to load and decode using UTF-16, but then write out using UTF-8.

import json
{"intents": [
{"tag": "greeting",
"patterns": ["هاي","عامل إيه","ايه اخبارك","ازيك"],
"responses": ["هاي!","كويس","حمدالله","ماشي الحال وإنت ??"],
"context_set": ""
}
]
}
with open("intents.json", encoding="utf-8") as f:
intents = json.load(f)

Related

Handling a Txt file content as a String Variable

I have a json file, and I need to read all of that json file content as String data. How can I read all the data and set a variable as a String for all of that content? Json file has blanks, new lines, special characters etc if it's neccesarry.
Thanks for your help!
import json
from ast import literal_eval
with open('<path_to_json_data>/json_data.txt') as f:
json_data = json.load(f) # dict object
print(json_data, type(json_data))
json_data_as_str = str(json_data) # dict-->str object
print(json_data_as_str, type(json_data_as_str))
data = literal_eval(json_data_as_str) # str-->dict object again
print(data, type(data))
Hope it helps
Simple as this example
import json
with open("path/to/json/filename.json", "r") as json_file:
data = json.load(json_file)
print(data)
dataStr = json.dumps(data)
print(dataStr)
use json.loads
import json
with open(file_name, "r") as fp:
as_string = str(json.loads(fp.read()))

How to write human-readable data to to a JSON file

When I export a file from python to json file it contains charecters like,
{"-": "text", "menu": {"-": "node", "id": 2244676, "prev": "[2/40] \u0d2a\u0d4d\u0d30\u0d2f\u0d4b\u0d1c\u0d15 \u0d15\u0d4d\u0d30\u0d3f\u0d2f
I used
with open('messages.json', 'w') as outfile:
json.dump(all_messages, outfile, cls=DateTimeEncoder)
in python. How to convert it to normal unicode text?
If you want the output JSON to be human-readable, use UTF-8 encoding and the ensure_ascii=False parameter:
with open('messages.json', 'w', encoding='utf8') as outfile:
json.dump(all_messages, outfile, cls=DateTimeEncoder,ensure_ascii=False)
If you just want to read the data back in again, json.load will convert it back to Unicode:
with open('messages.json', encoding='utf8') as infile:
data = json.load(infile)
Examples with simple strings:
>>> s = '[2/40] പ്രയോജക ക്രിയ'
>>> print(json.dumps(s))
"[2/40] \u0d2a\u0d4d\u0d30\u0d2f\u0d4b\u0d1c\u0d15 \u0d15\u0d4d\u0d30\u0d3f\u0d2f"
>>> print(json.dumps(s,ensure_ascii=False))
"[2/40] പ്രയോജക ക്രിയ"
>>> out = json.dumps(s)
>>> out
'"[2/40] \\u0d2a\\u0d4d\\u0d30\\u0d2f\\u0d4b\\u0d1c\\u0d15 \\u0d15\\u0d4d\\u0d30\\u0d3f\\u0d2f"'
>>> json.loads(out)
'[2/40] പ്രയോജക ക്രിയ'

How to write Arabic to a CSV file

I am trying to extract tweets with Python and store them in a CSV file, but I can't seem to include all languages. Arabic appears as special characters.
def recup_all_tweets(screen_name,api):
all_tweets = []
new_tweets = api.user_timeline(screen_name,count=300)
all_tweets.extend(new_tweets)
#outtweets = [[tweet.id_str, tweet.created_at, tweet.text,tweet.retweet_count,get_hashtagslist(tweet.text)] for tweet in all_tweets]
outtweets = [[tweet.text,tweet.entities['hashtags']] for tweet in all_tweets]
# with open('recup_all_tweets.json', 'w', encoding='utf-8') as f:
# f.write(json.dumps(outtweets, indent=4, sort_keys=True))
with open('recup_all_tweets.csv', 'w',encoding='utf-8') as f:
writer = csv.writer(f,delimiter=',')
writer.writerow(["text","tag"])
writer.writerows(outtweets)
# pass
return(outtweets)
Example of writing both CSV and JSON:
#coding:utf8
import csv
import json
s = ['عربى','عربى','عربى']
with open('output.csv','w',encoding='utf-8-sig',newline='') as f:
r = csv.writer(f)
r.writerow(['header1','header2','header3'])
r.writerow(s)
with open('output.json','w',encoding='utf8') as f:
json.dump(s,f,ensure_ascii=False)
output.csv:
header1,header2,header3
عربى,عربى,عربى
output.csv viewed in Excel:
output.json:
["عربى", "عربى", "عربى"]
Note Microsoft Excel needs utf-8-sig to read a UTF-8 file properly. Other applications may or may not need it to view properly. Many Windows applications required a UTF-8 "BOM" signature at the start of a text file or will assume an ANSI encoding instead. The ANSI encoding varies depending on the localized version of Windows used.
Maybe try with
f.write(json.dumps(outtweets, indent=4, sort_keys=True, ensure_ascii=False))
I searched a lot and finally wrote the following piece of code:
import arabic_reshaper
from bidi.algorithm import get_display
import numpy as np
itemsX = webdriver.find_elements(By.CLASS_NAME,"x1i10hfl")
item_linksX = [itemX.get_attribute("href") for itemX in itemsX]
item_linksX = filter(lambda k: '/p/' in k, item_linksX)
counter = 0
for item_linkX in item_linksX:
AllComments2 = []
counter = counter + 1
webdriver.get(item_linkX)
print(item_linkX)
sleep(11)
comments = webdriver.find_elements(By.CLASS_NAME,"_aacl")
for comment in comments:
try:
reshaped_text = arabic_reshaper.reshape(comment.text)
bidi_text = get_display(reshaped_text)
AllComments2.append(reshaped_text)
except:
pass
df = pd.DataFrame({'col':AllComments2})
df.to_csv('C:\Crawler\Comments' + str(counter) + '.csv', sep='\t', encoding='utf-16')
This code worked perfectly for me. I hope it helps those who haven't used the code from the previous post

how to prevent python json.loads() to decoding characters unnecessarily

I have a file that has:
{
"name": "HOSTNAME_HTTP",
"description": "Custom hostname for http service route. Leave blank for default hostname, e.g.: \u003capplication-name\u003e-\u003cproject\u003e.\u003cdefault-domain-suffix\u003e"
}
when I open the file using:
with open('data.txt', 'r') as file:
data = file.read()
I pass this to json.loads and the content in data is replaced with:
<application>...</application>
How can I prevent python json.loads from messing with the encoding in the content?
You could use a workaround like this to escape the unicode sequences:
>>> obj = json.loads(data.replace('\\', '\\\\'))
>>> obj
{'name': 'HOSTNAME_HTTP',
'description': 'Custom hostname for http service route. Leave blank for default hostname, e.g.: \\u003capplication-name\\u003e-\\u003cproject\\u003e.\\u003cdefault-domain-suffix\\u003e'}
And then when you're done modifying:
>>> print(json.dumps(obj).replace('\\\\', '\\'))
{"name": "HOSTNAME_HTTP", "description": "Custom hostname for http service route. Leave blank for default hostname, e.g.: \u003capplication-name\u003e-\u003cproject\u003e.\u003cdefault-domain-suffix\u003e"}
If you expect other backslashes in the file, it would be safer to use regular expressions:
import re
from_pattern = re.compile(r'(\\u[0-9a-fA-F]{4})')
to_pattern = re.compile(r'\\(\\u[0-9a-fA-F]{4})')
def from_json_escaped(path):
with open(path, 'r') as f:
return json.loads(from_pattern.sub(r'\\\1', f.read()))
def to_json_escaped(path, obj):
with open(path, 'w') as f:
f.write(to_pattern.sub(r'\1', json.dumps(obj)))
I found a solution:
import json
from json.decoder import JSONDecoder
with open('data.txt', 'r') as file:
data = file.read()
data_without_dump = '{"data":\"' + data + '\"}'
datum_dump = json.dumps(data)
datum = '{"data": ' + datum_dump + '}'
datum_load = json.loads(datum)
datum_load_without_dump = json.loads(data_without_dump)
print(datum_dump)
print(datum)
print(datum_load["data"])
print(datum_load_without_dump["data"])
print(type(datum_dump), type(datum), type(datum_load))
Output:
"\\u003capplication\\u003e.....\\u003c/application\\u003e"
{"data": "\\u003capplication\\u003e.....\\u003c/application\\u003e"}
\u003capplication\u003e.....\u003c/application\u003e
<application>.....</application>
<class 'str'> <class 'str'> <class 'dict'>
My reasoning:
json.loads : Deserialize a str or unicode instance containing a JSON document to a Python object.
json.dumps : Serialize obj to a JSON formatted str.
So, using them in cascading gets the desired result.

Writing to a JSON file and updating said file

I have the following code that will write to a JSON file:
import json
def write_data_to_table(word, hash):
data = {word: hash}
with open("rainbow_table\\rainbow.json", "a+") as table:
table.write(json.dumps(data))
What I want to do is open the JSON file, add another line to it, and close it. How can I do this without messing with the file?
As of right now when I run the code I get the following:
write_data_to_table("test1", "0123456789")
write_data_to_table("test2", "00123456789")
write_data_to_table("test3", "000123456789")
#<= {"test1": "0123456789"}{"test2": "00123456789"}{"test3": "000123456789"}
How can I update the file without completely screwing with it?
My expected output would probably be something along the lines of:
{
"test1": "0123456789",
"test2": "00123456789",
"test3": "000123456789",
}
You may read the JSON data with :
parsed_json = json.loads(json_string)
You now manipulate a classic dictionary. You can add data with :
parsed_json.update({'test4': 0000123456789})
Then you can write data to a file using :
with open('data.txt', 'w') as outfile:
json.dump(parsed_json, outfile)
If you are sure the closing "}" is the last byte in the file you can do this:
>>> f = open('test.json', 'a+')
>>> json.dump({"foo": "bar"}, f) # create the file
>>> f.seek(0)
>>> f.read()
'{"foo": "bar"}'
>>> f.seek(-1, 2)
>>> f.write(',\n', f.write(',\n' + json.dumps({"spam": "bacon"})[1:]))
>>> f.seek(0)
>>> print(f.read())
{"foo": "bar",
"spam": "bacon"}
Since your data is not hierarchical, you should consider a flat format like "TSV".

Categories