gzip a list of nested dictionaries - python

I have a group of .jsonl.gz files.
I can read them using the script:
import json
import gzip

with gzip.open(filepath, "r") as read_file:  # file path ends with .jsonl.gz
    try:
        # read gzip file which contains a list of json files (json lines)
        # each json file is a dictionary of nested dictionaries
        json_list = list(read_file)
    except:
        print("failed to read the zip")
Then I do some processing and get some .json files and store them in a list.
for num, json_file in enumerate(json_list):
    try:
        j_file = json.loads(json_file)
        (...some code...)
    except:
        print("fail")
My question is: what is the right way to write them back into a .jsonl.gz file?
This is my attempt:
jsonfilename = 'valid_' + str(num) + '.jsonl.gz'
with gzip.open(jsonfilename, 'wb') as f:
    for dict in list_of_nested_dictionaries:
        content.append(json.dumps(dict).encode('utf-8'))
    f.write(content)
But I got this error:
TypeError: memoryview: a bytes-like object is required, not 'list'
Then I tried just to gzip the list of dictionaries as is:
jsonfilename = 'valid_' + str(num) + '.jsonl.gz'
with gzip.open(jsonfilename, 'wb') as f:
    f.write(json.dumps(list_of_nested_dictionaries).encode('utf-8'))
But the problem here is that it gzips the whole list as one block, so when I read it back I get a single element containing the whole stored list, not a list of json lines as in the first step.
This is the code that I use for reading:
with gzip.open('valid_3.jsonl.gz', "r") as read_file:
    try:
        json_list = list(read_file)  # read zip file
        print(len(json_list))  # I got 1 here
    except:
        print("fail")
json_list[0].decode('utf-8')

f.write(content) takes a byte string, but you're passing it a list of byte strings.
f.writelines(content) will iterate over the list and write each byte string.
Edit: by the way, gzip is meant for compressing a single file. If you need to compress multiple files into one archive, I suggest packing them together in a tarball first and then gzipping that.
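For illustration, a minimal sketch of that tarball approach (the archive and file names are placeholders):

import tarfile

# Hypothetical example: bundle several JSON files into one gzip-compressed tarball.
# Mode 'w:gz' opens the archive for writing with gzip compression.
with tarfile.open('bundle.tar.gz', 'w:gz') as tar:
    for name in ('a.json', 'b.json'):  # placeholder file names
        tar.add(name)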

The solution is simply this:
content = []
with gzip.open(jsonfilename, 'wb') as f:
    for d in list_of_nested_dictionaries:
        content.append((json.dumps(d) + '\n').encode('utf-8'))
    f.writelines(content)
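If you don't need the intermediate content list, a minimal alternative sketch is to write each line as you go, opening the gzip file in text mode:

import gzip
import json

# Write one JSON document per line (JSON Lines) straight into the gzip stream.
with gzip.open(jsonfilename, 'wt', encoding='utf-8') as f:
    for d in list_of_nested_dictionaries:
        f.write(json.dumps(d) + '\n')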

Related

Pandas to JSON file formatting issue, adding \ to strings

I am using the pandas.DataFrame.to_json to convert a data frame to JSON data.
data = df.to_json(orient="records")
print(data)
This works fine and the output when printing is as expected in the console.
[{"n":"f89be390-5706-4ef5-a110-23f1657f4aec:voltage","bt":1610040655,"u":"V","v":237.3},
{"n":"f89be390-5706-4ef5-a110-23f1657f4aec:power","bt":1610040836,"u":"W","v":512.3},
{"n":"f89be390-5706-4ef5-a110-23f1657f4aec:voltage","bt":1610040840,"u":"V","v":238.4}]
The problem comes when uploading it to an external API, which converts it to a file format, or when writing it to a file locally. The output has \ characters added around the strings.
def dataToFile(processedData):
    with open('data.json', 'w') as outfile:
        json.dump(processedData, outfile)
The result is shown below:
[{\"n\":\"f1097ac5-0ee4-48a4-8af5-bf2b58f3268c:power\",\"bt\":1610024746,\"u\":\"W\",\"v\":40.3},
{\"n\":\"f1097ac5-0ee4-48a4-8af5-bf2b58f3268c:voltage\",\"bt\":1610024751,\"u\":\"V\",\"v\":238.5},
{\"n\":\"f1097ac5-0ee4-48a4-8af5-bf2b58f3268c:power\",\"bt\":1610024764,\"u\":\"W\",\"v\":39.7}]
Is there any formatting specifically I should be including/excluding when converting the data to a file format?
Your data variable is a string of json data and not an actual dictionary. You can do a few things:
Use DataFrame.to_json() to write the file; the first argument of to_json() is the file path:
df.to_json('./data.json', orient='records')
Write the json string directly as text:
def write_text(text: str, path: str):
    with open(path, 'w') as file:
        file.write(text)

data = df.to_json(orient="records")
write_text(data, './data.json')
If you want to play around with the dictionary data:
def write_json(data, path, indent=4):
    with open(path, 'w') as file:
        json.dump(data, file, indent=indent)

df_data = df.to_dict(orient='records')
# ...some operations here...
write_json(df_data, './data.json')
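For context, a minimal sketch of why the escaping happens: passing a string that already contains JSON to json.dump serializes it as a JSON string, escaping the inner quotes.

import json

already_json = '[{"n": "voltage", "v": 237.3}]'  # a str, not a list of dicts
with open('data.json', 'w') as outfile:
    json.dump(already_json, outfile)  # writes "[{\"n\": \"voltage\", \"v\": 237.3}]"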

Read JSON file correctly

I am trying to read a JSON file (BioRelEx dataset: https://github.com/YerevaNN/BioRelEx/releases/tag/1.0alpha7) in Python. The JSON file is a list of objects, one per sentence.
This is how I try to do it:
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        for line in data_file.readlines():
            if not line:
                continue
            items = json.loads(line)
            text = items["text"]
            label = items.get("label")
My code is failing on items = json.loads(line). It looks like the data is not formatted as the code expects it to be, but how can I change it?
Thanks in advance for your time!
Best,
Julia
With json.load() you don't need to read each line; you can do either of these:
import json

def open_json(path):
    with open(path, 'r') as file:
        return json.load(file)

data = open_json('./1.0alpha7.dev.json')
Or, even cooler, you can GET request the JSON from GitHub:
import json
import requests
url = 'https://github.com/YerevaNN/BioRelEx/releases/download/1.0alpha7/1.0alpha7.dev.json'
response = requests.get(url)
data = response.json()
These will both give the same output. The data variable will be a list of dictionaries that you can iterate over in a for loop to do your further processing.
Your code is reading one line at a time and parsing each line individually as JSON. Unless the creator of the file wrote it in that format (which, given the .json extension, is unlikely), that won't work, because JSON does not use line breaks to mark the end of an object.
Load the whole file content as JSON instead, then process the resulting items in the array.
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        data = json.load(data_file)
        for item in data:
            text = item["text"]
label appears to be buried in item["interaction"]

No JSON object could be decoded, even when valid JSON is present in the file

I am converting a dictionary element to JSON and writing it to a file:
with open(filename, 'w') as f:
    if os.stat(f).st_size == 0:
        json.dump(new_data, f)
    else:
        data = json.load(f)
        data.update(new_data)  # adding a new dictionary
        json.dump(data, f)
I am able to write only one JSON object to the file. When I want to read the existing file and then append another dictionary, I am unable to do so.
I get ValueError: No JSON object could be decoded. I tried json.loads(f) and json.load(f).
You should simply read from the file first, if it exists. If it is already empty, or contains invalid JSON, or doesn't exist, initialize data to the empty dict (which is the identity element for the update method).
try:
    with open(filename) as f:
        data = json.load(f)
except (IOError, ValueError):
    data = {}
Then open the file in write mode to write the updated data.
data.update(new_data)
with open(filename, 'w') as f:
    json.dump(data, f)
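Putting the two steps together, a minimal sketch (the function name is only for illustration):

import json

def update_json_file(filename, new_data):
    # Read the existing data, falling back to an empty dict if the file is
    # missing, empty, or does not contain valid JSON.
    try:
        with open(filename) as f:
            data = json.load(f)
    except (IOError, ValueError):
        data = {}
    # Merge in the new dictionary and rewrite the whole file.
    data.update(new_data)
    with open(filename, 'w') as f:
        json.dump(data, f)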

pickle.dump dumps nothing when appending to file

The user may give a bunch of URLs as command line args. All URLs given in the past are serialized with pickle. The script checks all given URLs; if they are unique, they are serialized and appended to a file. At least that's what should be happening. Nothing is being appended. However, when I open the file in write mode, the new, unique URL is written. So what gives? Code is:
def get_new_urls():
    if len(urls.URLs) != 0:  # check if empty
        with open(urlFile, 'rb') as f:
            try:
                cereal = pickle.load(f)
                print(cereal)
                toDump = []
                for arg in urls.URLs:
                    if arg in cereal:
                        print("Duplicate URL {0} given, ignoring it.".format(arg))
                    else:
                        toDump.append(arg)
            except Exception as e:
                print("Holy bleep something went wrong: {0}".format(e))
    return toDump


urlsToDump = get_new_urls()
print(urlsToDump)

# TODO: append new URLs
if urlsToDump:
    with open(urlFile, 'ab') as f:
        pickle.dump(urlsToDump, f)

# TODO check HTML of each page against the serialized copy
with open(urlFile, 'rb') as f:
    try:
        cereal = pickle.load(f)
        print(cereal)
    except EOFError:  # your URL file is empty, bruh
        pass
Pickle writes out the data you give it in a special format, e.g. it writes some header/metadata/etc. to the file you give it.
It is not intended to work this way; concatenating two pickle streams and reading them back with a single load doesn't really make sense. To achieve a concatenation of your data, you'd need to first read whatever is in the file into your urlsToDump, then update urlsToDump with any new data, and finally dump it out again (overwriting the whole file, not appending).
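A minimal sketch of that read-update-overwrite pattern, assuming urlFile holds a single pickled list of URLs and urlsToDump is the list built above:

import pickle

# Read the previously stored URLs (if any), merge in the new ones,
# then rewrite the whole file instead of appending to it.
try:
    with open(urlFile, 'rb') as f:
        stored = pickle.load(f)
except (FileNotFoundError, EOFError):
    stored = []

stored.extend(urlsToDump)

with open(urlFile, 'wb') as f:  # 'wb', not 'ab': overwrite the whole file
    pickle.dump(stored, f)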
After
with open(urlFile, 'rb') as f:
you need a while loop to repeatedly unpickle (repeatedly read) from the file until you hit EOF.
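A minimal sketch of that loop, assuming each append wrote one pickled list as in the question:

import pickle

all_urls = []
with open(urlFile, 'rb') as f:
    while True:
        try:
            all_urls.extend(pickle.load(f))  # each load reads one appended chunk
        except EOFError:
            break
print(all_urls)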

Write Python dictionary obtained from JSON in a file

I have this script which extracts the JSON objects from the webpage. The JSON objects are converted into dictionaries. Now I need to write those dictionaries to a file. Here's my code:
#!/usr/bin/python
import requests

r = requests.get('https://github.com/timeline.json')
for item in r.json or []:
    print item['repository']['name']
There are ten lines. I need to write the dictionaries into a file consisting of those ten lines. How do I do that? Thanks.
To address the original question, something like:
with open("pathtomyfile", "w") as f:
for item in r.json or []:
try:
f.write(item['repository']['name'] + "\n")
except KeyError: # you might have to adjust what you are writing accordingly
pass # or sth ..
Note that not every item will be a repository; there are also gist events (etc.).
Better would be to just save the JSON to a file.
#!/usr/bin/python
import json
import requests

r = requests.get('https://github.com/timeline.json')
with open("yourfilepath.json", "w") as f:
    f.write(json.dumps(r.json()))
Then, you can open it:
with open("yourfilepath.json", "r") as f:
obj = json.loads(f.read())
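Equivalently, a small sketch using json.dump and json.load directly on the file objects, so there is no intermediate string:

import json
import requests

r = requests.get('https://github.com/timeline.json')

with open("yourfilepath.json", "w") as f:
    json.dump(r.json(), f)  # serialize straight to the file

with open("yourfilepath.json", "r") as f:
    obj = json.load(f)  # parse straight from the file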
