I have a piece of code that processes thousands of files in a directory. For each file, it generates an object (a dictionary) that includes this key-value pair:
{
........
'result': [...a very long list...]
}
If I process all the files, collect the results in a list and then use the jsonlines library to write them all at once, my laptop (a Mac) runs out of memory.
So my plan is to process the files one by one: get the result, append it to the JSON Lines file, then delete the object to release the memory.
After checking the official documentation:
https://jsonlines.readthedocs.io/en/latest/
I couldn't find a method that writes without overwriting the JSON Lines file.
So how can I handle such a large output?
Besides, I'm using parallel threads to process the results:
from multiprocessing.dummy import Pool
Pool(4).map(get_result, file_lst)
I do hope to open the json_file, write each result and then release the memory.
If I understand your question correctly, I think this will solve it:
with jsonlines.open('yourTextFile', mode='a') as writer:
    writer.write(...)
You mentioned that the file gets overwritten. I think this is because you use mode='w' (w = writing) instead of mode='a' (a = appending).
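A minimal sketch of how append mode could be combined with the thread pool from the question. It uses the standard-library json module as a stand-in for jsonlines (one JSON document per line gives the same file format), and get_result, process_and_append, run and the results.jsonl path are hypothetical names chosen for illustration. A lock is assumed to be needed so that lines from different threads do not interleave.

```python
import json
import threading
from multiprocessing.dummy import Pool  # thread-based pool, as in the question

# One lock shared by all worker threads.
write_lock = threading.Lock()


def get_result(path):
    # Hypothetical stand-in for the real per-file processing.
    return {"file": path, "result": [len(path)]}


def process_and_append(path, out_path="results.jsonl"):
    record = get_result(path)
    line = json.dumps(record)
    # Only one thread appends at a time, so lines are never interleaved.
    with write_lock:
        with open(out_path, "a", encoding="utf-8") as out:
            out.write(line + "\n")
    # 'record' goes out of scope here, so its memory can be reclaimed.


def run(file_lst, out_path="results.jsonl"):
    with Pool(4) as pool:
        pool.map(lambda p: process_and_append(p, out_path), file_lst)
```

Because each result is written as soon as it is ready, only the results currently being processed by the four workers are held in memory at once.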
Here is my code for accessing and editing the file:
def edit_default_settings(self, setting_type, value):
    with open("cam_settings.json", "r") as f:
        cam_settings = json.load(f)
    cam_settings[setting_type] = value
    with open("cam_settings.json", 'w') as f:
        json.dump(cam_settings, f, indent=4)
I use it in a program that runs for several hours a day, and about once a week I notice that the cam_settings.json file has become empty (literally empty, the file explorer shows 0 bytes). I can't imagine how that is possible.
I would be glad to hear some comments on what could be going wrong.
I can't see any issues with the code itself, but there can be an issue with the execution environment. Are you running the code in a multi-threaded environment or running multiple instances of the same program at once?
This situation can arise if this code is executed in parallel and multiple threads/processes try to access the file at the same time. Try logging each time the function is called and whether it completed successfully. Add exception handlers and error logging.
If this is the problem, using buffers or a singleton pattern can solve the issue.
As #Chels said, the file is truncated when it's opened with 'w'. That doesn't explain why it stays that way; I can only imagine that happening if your code crashed. Maybe you need to check logs for code crashes (or change how your code is run so that crash reasons get logged, if they aren't).
But there's a way to make this process safer in case of crashes. Write to a separate file and then replace the old file with the new file, only after the new file is fully written. You can use os.replace() for this. You could do this simply with a differently-named file:
with open(".cam_settings.json.tmp", 'w') as f:
    json.dump(cam_settings, f, indent=4)
os.replace(".cam_settings.json.tmp", "cam_settings.json")
Or you could use a temporary file from the tempfile module.
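A minimal sketch of the tempfile variant, assuming the temporary file should live in the same directory as the target so that os.replace() stays on one filesystem. The function name write_json_atomically is hypothetical, and the flush/fsync step is an extra precaution not mentioned above.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=4)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the swap
        # Swap the fully written file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # The original file is untouched; remove the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the program crashes while writing, the old cam_settings.json is still intact, since it is only replaced after the new content has been written completely.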
When you open a file with the "w" mode, its existing content is erased: the file is truncated as soon as it is opened, and whatever you write replaces what was there before.
Not sure if this is what you are looking for, but it could be one of the reasons why "cam_settings.json" becomes empty after the call to open("cam_settings.json", 'w')!
In that case, to append text instead, use the "a" mode:
open("cam_settings.json", 'a')
I am using Python 3.9 and trying:
with open(file, 'r') as fl:
    val = ijson.items(fl, '<my_key>.item', use_float=True)
    for i in val:
        print(i)
After some time the print statement stops printing anything to the Jupyter console, but the Jupyter cell keeps running for a very long time.
Does ijson scan the complete file from start to end even if I only parse specific elements? If yes, how can I restrict this behaviour (if that is possible)?
Note: when I write the content to a file instead of printing it, I can see that the file contents stop changing after some time, but the process keeps running.
I have tried all sorts of ways of closing the file, etc. Nothing has worked so far.
Thanks in advance.
The only way to stop the ijson iteration early is to break out of it yourself (i.e., actually break from the for loop). This is because, as you suggest, ijson reads input files fully. This in turn is because your path (<my_key>.item) could appear again in your file after the initial set of results you are seeing (keys are not required to be unique in JSON).
I use multiple python scripts that collect data and write it into one single json data file.
It is not possible to combine the scripts.
The writes are fast and frequent, and errors often occur (e.g. some characters at the end get duplicated), which is fatal, especially since I am using the JSON format.
Is there a way to prevent a Python script from writing to a file while another script is trying to write to it? (It would be absolutely fine if the data that the script tries to write gets lost, but it is important that the file syntax does not get corrupted.)
Code snippet:
This opens the file and retrieves the data:
data = json.loads(open("data.json").read())
This appends a new dictionary:
data.append(new_dict)
And the old file is overwritten:
open("data.json","w").write( json.dumps(data) )
Info: data is a list which contains dicts.
Operating System: The whole process takes place on a Linux server.
On Windows, you could try to create the file, and bail out if an exception occurs (because file is locked by another script). But on Linux, your approach is bound to fail.
Instead, I would
write one file per new dictionary, suffixing the filename with the process ID and a counter
have the consuming process(es) read not a single file but all the files, sorted by modification time, and build the data from them
So in each script:
filename = "data_{}_{}.json".format(os.getpid(),counter)
counter+=1
open(filename ,"w").write( json.dumps(new_dict) )
and in the consumers (reading each dict of sorted files in a protected loop):
# pass the function itself as the sort key (do not call it)
files = sorted(glob.glob("*.json"), key=os.path.getmtime)
data = []
for f in files:
    try:
        with open(f) as fh:
            data.append(json.load(fh))
    except Exception:
        # IO error, malformed json file: ignore
        pass
I will post my own solution, since it works for me:
Before opening and writing the data file, every Python script checks whether a file called data_check exists. If it does, the script does not try to read or write the data file and discards the data that was supposed to be written. If it does not, the script creates the file data_check and then reads and writes the data file. After the writing is done, the file data_check is removed.
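A minimal sketch of this lock-file idea. One variation is swapped in here: instead of checking for data_check and then creating it as two separate steps, the sketch creates it with os.O_CREAT | os.O_EXCL, so that the check and the creation happen in a single call and two scripts cannot both pass the check at the same moment. The function name append_if_unlocked is hypothetical.

```python
import json
import os


def append_if_unlocked(new_dict, data_path="data.json", lock_path="data_check"):
    try:
        # Fails with FileExistsError if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another script is writing; discard this record
    try:
        os.close(fd)
        with open(data_path) as f:
            data = json.load(f)
        data.append(new_dict)
        with open(data_path, "w") as f:
            json.dump(data, f)
        return True
    finally:
        # Always release the lock, even if reading or writing failed.
        os.remove(lock_path)
```

Note that if a script is killed while it holds the lock, data_check stays behind and has to be removed by hand before any script can write again.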
I have some files (part-00000.gz, part-00001.gz, part-00002.gz, ...) and each part is rather large. I need the filename of each part because it contains timestamp information. As far as I know, in PySpark only wholeTextFiles can read input as (filename, content). However, I get an out-of-memory error when using wholeTextFiles. My guess is that wholeTextFiles reads a whole part as content in the mapper without any partitioning. I also found this answer (How does the number of partitions affect `wholeTextFiles` and `textFiles`?). If so, how can I get the filename of a rather large part file? Thanks
You get the error because wholeTextFiles tries to read each entire file into a single record. You're better off reading the file line by line, which you can do simply by writing your own generator and using the flatMap function. Here's an example of doing that to read a gzip file:
import glob
import gzip

def read_fun_generator(filename):
    # yield the lines of one gzip file lazily, one at a time
    with gzip.open(filename, 'rb') as f:
        for line in f:
            yield line.strip()

gz_filelist = glob.glob("/path/to/files/*.gz")
# sc is the existing SparkContext
rdd_from_gz = sc.parallelize(gz_filelist).flatMap(read_fun_generator)
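Since the question asks for the filename as well, here is a minimal sketch of a generator that yields (filename, line) pairs instead of bare lines. The name read_with_filename is hypothetical, and the files are assumed to be UTF-8 text that is readable under the same path wherever the function runs.

```python
import gzip


def read_with_filename(filename):
    # Yield (filename, line) pairs lazily, so a large part file is never
    # held in memory as a whole.
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        for line in f:
            yield filename, line.rstrip("\n")
```

It could be passed to flatMap in the same way as the generator above, e.g. sc.parallelize(gz_filelist).flatMap(read_with_filename), so that each record carries the name of the part file it came from.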
I'm writing a script that gets the most recently modified file from a unix directory.
I'm certain it works, but I have to create a unittest to prove it.
The problem is the setUp function. I want to be able to predict the order the files are created in.
self.filenames = ["test1.txt", "test2.txt", "test3.txt", "filename.txt", "test4"]
newest = ''
for fn in self.filenames:
    if pattern.match(fn):
        newest = fn
    with open(fn, "w") as f:
        f.write("some text")
The pattern is "test.*.txt" so it just matches the first three in the list. In multiple tests, newest sometimes returns 'test3.txt' and sometimes 'test1.txt'.
Use os.utime to explicitly set modified time on the files that you have created. That way your test will run faster.
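A minimal sketch of how this could look in a setUp helper, assuming it is enough to space the modification times one second apart. The function name create_files_with_mtimes and the start timestamp are hypothetical.

```python
import os


def create_files_with_mtimes(directory, filenames, start=1_000_000_000):
    # Each file gets a modification time one second later than the
    # previous one, so the "newest" file is predictable.
    paths = []
    for offset, name in enumerate(filenames):
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            f.write("some text")
        stamp = start + offset
        os.utime(path, (stamp, stamp))  # (access time, modification time)
        paths.append(path)
    return paths
```

The test can then assert against the last matching filename in the list without any sleeping between file creations.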
I doubt that the filesystem you are using supports fractional seconds on file create time.
I suggest you insert a call to time.sleep(1) in your loop so that the filesystem actually has a different timestamp on each created file.
It could be due to syncing. Just because you call write() on files in a certain order, it doesn't mean the data will be updated by the OS in that order.
Try calling f.flush() followed by os.fsync(f.fileno()) on your file object before going to the next file. Leaving some time between calls (using sleep()) might also help.