Read only the first line of a gzipped JSON file - Python

Although a lot of code has been posted here about how to read the first line of a file, I cannot figure out how to read only the first line of a gzipped JSON file in Python.
Here is my current working example. However, it contains a nasty break statement, and the loop seems completely unnecessary:
import gzip
import json

for line in gzip.open(file, 'rb'):
    one_line = json.loads(line)
    print(one_line)
    break
Is there a solution that keeps the json.loads() command (or a similar one that reads in the JSON file correctly), while only reading the first line of the gzipped JSON file?

Call readline() instead of a for loop.
with gzip.open(file, 'rb') as f:
    line = f.readline()

one_line = json.loads(line)
print(one_line)
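Alternatively, since the file object is an iterator, next() pulls exactly one line without a named intermediate variable (a small variant; note that next() raises StopIteration on an empty file):

import gzip
import json

with gzip.open(file, 'rb') as f:
    one_line = json.loads(next(f))  # next() reads exactly one line from the file iterator
print(one_line)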

Related

Reading large compressed files

This might be a simple question, but I can't seem to find the answer or figure out why it is not working in this specific case.
I want to read large files, which may or may not be compressed. I used contextlib to write a context manager function to handle this, and then read the files in the main script using the with statement.
My problem is that the script uses a lot of memory and then gets killed (tested with a compressed file). What am I doing wrong? Should I approach this differently?
import gzip
import logging
from contextlib import contextmanager

def process_vcf(location):
    logging.info('Processing vcf')
    logging.debug(location)
    with read_compressed_or_not(location) as vcf:
        for line in vcf.readlines():
            if line.startswith('#'):
                logging.debug(line)

@contextmanager
def read_compressed_or_not(location):
    if location.endswith('.gz'):
        try:
            file = gzip.open(location)
            yield file
        finally:
            file.close()
    else:
        try:
            file = open(location, 'r')
            yield file
        finally:
            file.close()
The lowest impact solution is just to skip the use of the readlines function. readlines returns a list containing every line in the file, so it holds the entire file in memory. Iterating over the file object itself instead reads one line at a time lazily, so the whole file never has to be in memory at once.
with read_compressed_or_not(location) as vcf:
    for line in vcf:
        if line.startswith('#'):
            logging.debug(line)
Instead of using for line in vcf.readlines(), you can do:
line = vcf.readline()
while line:
    # Do stuff
    line = vcf.readline()
This loads only a single line into memory at a time.
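On Python 3.8+, an assignment expression collapses this read-and-test pattern into a single loop header (a purely stylistic variant of the above):

with read_compressed_or_not(location) as vcf:
    while (line := vcf.readline()):  # walrus operator: assign and test in one step
        if line.startswith('#'):
            logging.debug(line)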
The opening function is the main difference between reading a gzipped file and a plain one, so you can dynamically assign the opener and then read the file the same way in both cases. There is then no need for a custom context manager.
import gzip

open_fn = gzip.open if location.endswith(".gz") else open

with open_fn(location, mode="rt") as vcf:
    for line in vcf:
        ...
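One detail worth noting: mode="rt" does real work here, because gzip.open defaults to binary mode. Forcing text mode makes both openers yield str lines, so checks like line.startswith('#') behave identically whether or not the file is compressed.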

How to read the lines from a mmap into a dictionary?

My mmap opens a JSON file from disk. Currently I am able to read the file line by line, but I want to store this information in a dictionary of key/value pairs, similar to the structure of the JSON file's contents.
Currently, I use the following code to read line by line on Windows.
import mmap

filename = r'C:\Workspace\tempfile.json'  # raw string, so backslashes are not treated as escapes
resultsDictionary = {}
with open(filename, "r+b") as f:
    map_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(map_file.readline, b""):
        print(line)
        # I want to store this in resultsDictionary so I can use it
        # in a later method in my Python code. I am not sure how to do this.
Any help would be appreciated.
You should use the json module from the Python standard library.
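For example, assuming each line of the file holds one complete JSON object (JSON Lines style; the filename below is just the one from the question), a minimal sketch could look like this:

import json
import mmap

filename = r'C:\Workspace\tempfile.json'
resultsDictionary = {}
with open(filename, "r+b") as f:
    map_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(map_file.readline, b""):
        record = json.loads(line)         # parse one JSON object per line
        resultsDictionary.update(record)  # merge its key/value pairs into the dictionary

If the file is instead a single JSON document, you can skip the line loop and parse the whole buffer at once with json.loads(map_file[:]), or simply call json.load(f) on the file object.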

I am reading line by line from a JSON file and appending the data to a list, but it is only appending the last line to the list

I'm reading from a very large JSON file, using the json.loads method on each line, yet only the data from the last line ends up in the list.
Loading the whole JSON file at once and then accessing the data worked fine, but working line by line does not.
import json

lst = []
with open('tinyTwitter(3).json', 'r', encoding='utf-8', errors='ignore') as f:
    next(f)
    for line in f:
        try:
            data = json.loads(line)
        except:
            continue
    lst.append(data)  # note: this line is at loop level, so it runs only once
It should store all the values in the lst list.
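Judging by the symptom, the likely culprit is that lst.append(data) sits outside the for loop, so it runs exactly once after the loop finishes, with data still bound to the last successfully parsed line. Moving the append inside the loop (a minimal sketch of the fix) stores every line:

import json

lst = []
with open('tinyTwitter(3).json', 'r', encoding='utf-8', errors='ignore') as f:
    next(f)
    for line in f:
        try:
            data = json.loads(line)
        except ValueError:
            continue
        lst.append(data)  # inside the loop: runs once per successfully parsed line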

Why does Python produce a MemoryError when I open this file

I am trying to remove empty lines from a file. My method is reading the file line by line and writing any lines which are not just newlines to a new file. It works great for small files, but for reasons I don't understand, I'm getting a MemoryError on larger ones. The problem file is over 1 GB, but since I'm reading it line by line, I don't think I'm storing more than one line in memory. Or am I?
with open(output_path, "ab+") as out_file:
    with open(input_path, "rb") as in_file:
        line = in_file.readline()
        while line:
            if line != b"\n":  # bytes literal needed: the file is opened in binary mode
                out_file.write(line)
            line = in_file.readline()
When I split the file into chunks, it works fine, but that's a step I'd rather not do. I want to understand what is happening here. Thanks!
It turns out that the problem was elsewhere in the code. I wasn't explicitly closing a file, which led to this issue. Thanks all for your help.

How to read a JSON file after skipping a few lines in Python?

I have a JSON file whose contents are as follows:
[
    {"time":"56990","device_id":"1","kwh":"279.4"},
    {"time":"60590","device_id":"1","kwh":"289.4"},
    {"time":"64190","device_id":"1","kwh":"299.4"},
    {"time":"67790","device_id":"1","kwh":"319.4"},
]
Now I want to read this file one line at a time using the seek and tell methods in Python. I tried this, but it shows an error saying it is not able to decode. I actually want to read the JSON file every 15 minutes or so, starting from the position where it was last read.
This is what I have tried.
last_pointer = 0
with open(FILENAME) as f:
    f.seek(last_pointer)
    raw_data = json.load(f)  # this raw_data should load the JSON starting from the last pointer
    # ... process something ...
    last_position = f.tell()
If your data is arranged in lines exactly as shown, you can construct an ad-hoc solution by reading lines from the file one by one, trimming the trailing comma, and feeding the result to json.loads. But perhaps the better variant would be to use a streaming parser like ijson.
import json
import time

with open('dat') as f:
    line = f.readline()
    while line:
        try:
            raw_data = json.loads(line.strip().strip(','))
            print(raw_data)
            time.sleep(15 * 60)
        except ValueError:
            pass
        line = f.readline()
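For comparison, here is a minimal ijson sketch (assuming the file is valid JSON, i.e. without the trailing comma shown above) that streams the array elements one at a time instead of loading the whole document:

import ijson

with open('dat', 'rb') as f:
    # ijson.items lazily yields each element of the top-level array
    for item in ijson.items(f, 'item'):
        print(item)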
