I use multiple Python scripts that collect data and write it into a single JSON data file.
It is not possible to combine the scripts.
The writes happen frequently, and errors often occur (e.g. some characters at the end get duplicated), which is fatal, especially since I am using the JSON format.
Is there a way to prevent a Python script from writing to a file while other scripts are currently writing to it? (It would be absolutely fine if the data that the script tries to write gets lost, but it is important that the file's syntax does not get corrupted.)
Code snippet:
This opens the file and retrieves the data:
data = json.loads(open("data.json").read())
This appends a new dictionary:
data.append(new_dict)
And the old file is overwritten:
open("data.json","w").write( json.dumps(data) )
Info: data is a list which contains dicts.
Operating system: the whole process takes place on a Linux server.
On Windows, you could try to create the file and bail out if an exception occurs (because the file is locked by another script). But on Linux, your approach is bound to fail.
Instead, I would:
write one file per new dictionary, suffixing the filename with the process ID and a counter
have the consuming process(es) read not a single file but all the files, sorted by modification time, and build the data from them
So in each script:
filename = "data_{}_{}.json".format(os.getpid(),counter)
counter+=1
open(filename ,"w").write( json.dumps(new_dict) )
and in the consumers (reading each dict from the sorted files in a protected loop):
import glob, json, os

files = sorted(glob.glob("*.json"), key=os.path.getmtime)
data = []
for f in files:
    try:
        with open(f) as fh:
            data.append(json.load(fh))
    except Exception:
        # IO error or malformed JSON file: ignore
        pass
I will post my own solution, since it works for me:
Every single Python script checks (before opening and writing the data file) whether a file called data_check exists. If so, the script does not try to read and write the data file and dismisses the data that was supposed to be written. If not, the script creates the file data_check and then starts to read and write the data file. After the writing process is done, the file data_check is removed.
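A minimal sketch of that check, assuming the lock file is called data_check and the data file is data.json as in the snippet above (write_data and new_dict are illustrative names, not from the original scripts):

import json
import os

LOCK = "data_check"  # lock-file name from the description above

def write_data(new_dict):
    # Another script is writing right now: drop this dict and do nothing.
    if os.path.exists(LOCK):
        return
    # Signal to the other scripts that we are writing.
    open(LOCK, "w").close()
    try:
        data = json.loads(open("data.json").read())
        data.append(new_dict)
        open("data.json", "w").write(json.dumps(data))
    finally:
        # Always remove the lock, even if reading or writing failed.
        os.remove(LOCK)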
I've just managed to run my Python code on Ubuntu, and all seems to be going well. My Python script writes out a .csv file every hour, but I can't seem to find the .csv file.
Having the .csv file is important, as I need to do research on the data. I am also using FileZilla; I would have thought the .csv would have shown up there.
import csv
import time
from datetime import datetime

collectionTime = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
mylist = [d['Spaces'] for d in data]
mylist.append(collectionTime)
print(mylist)
with open("CarparkData.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(mylist)
In short, your code outputs to wherever the file you open in this line is located:
with open("CarparkData.csv","a",newline="") as f:
You can change this filename to the location of wherever you'd like the file to be read/written from/to. For example, data/CarparkData.csv if you had a folder named data/ within your application dedicated to holding data files.
As written in your code, writer.writerow writes each row through Python's in-memory file object (created by open("CarparkData.csv", ...)) into the file itself (in this case, CarparkData.csv).
The way your code is structured, it won't create a new .csv every hour because it uses a static filename. If a file with this name does not exist at the time of opening, it will be created; if it does, new lines will be appended to the existing file.
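If the goal really is one file per hour, a hedged variation (the data/ folder and the hourly timestamp in the name are assumptions, not part of the original code) would be:

import csv
from datetime import datetime

# Assumption: a data/ folder exists and one CSV per hour is wanted.
filename = datetime.now().strftime('data/CarparkData_%Y-%m-%d_%H.csv')

with open(filename, "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(mylist)  # mylist built exactly as in the question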
I have a piece of code that processes thousands of files in a directory; for each file, it generates an object (dictionary) with part of its key-value pairs as:
{
........
'result': [...a very long list...]
}
If I process all the files, save the results in a list and then use the jsonlines library to write them all at once, my laptop (Mac) runs out of memory.
So my solution is to process the files one by one, get each result, append it to the jsonlines file, then delete the object to release the memory.
After checking the official documentation:
https://jsonlines.readthedocs.io/en/latest/
I couldn't find a method that can write without overwriting the jsonlines file.
So how can I handle such a big output?
Besides, I'm using parallel threads to process the results:
from multiprocessing.dummy import Pool
Pool(4).map(get_result, file_lst)
I hope to open the jsonlines file, write each result, and then release the memory.
If I understand your question correctly, I think this will solve it:
import jsonlines

with jsonlines.open('yourTextFile', mode='a') as writer:
    writer.write(...)
As you mentioned you are overwriting the file, I think this is because you use mode='w' (w = write) instead of mode='a' (a = append).
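Combined with the thread pool from the question, one possible sketch (output.jsonl is an assumed file name; get_result and file_lst come from the question) writes each result as soon as it is ready, so only a few dicts are held in memory at a time:

import jsonlines
from multiprocessing.dummy import Pool

# get_result(path) -> dict and file_lst (a list of paths) are from the question;
# 'output.jsonl' is an assumed output file name.
with jsonlines.open('output.jsonl', mode='a') as writer:
    with Pool(4) as pool:
        # imap yields results one at a time as the workers finish them
        for result in pool.imap(get_result, file_lst):
            writer.write(result)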
Trying to loop through some license ids to get data from a website. Example: when I enter id "E09OS0018" in the search box, I get a list of one school/daycare. But when I type the following code in my python script (website link and arguments obtained from developer tools), I get no data in the file. What's wrong with this requests.get() command? If I should use requests.post() instead, what arguments would I use with the requests.post() command (not very familiar with this approach).
flLicenseData = requests.get('https://cares.myflfamilies.com/PublicSearch/SuggestionSearch?text=E09OS0018&filter%5Bfilters%5D%5B0%5D%5Bvalue%5D=e09os0018&filter%5Bfilters%5D%5B0%5D%5Boperator%5D=contains&filter%5Bfilters%5D%5B0%5D%5Bfield%5D=&filter%5Bfilters%5D%5B0%5D%5BignoreCase%5D=true&filter%5Blogic%5D=and')
openFile = open('fldata', 'wb')
for chunk in flLicenseData.iter_content(100000):
    openFile.write(chunk)
Do openFile.flush() before checking the file's content.
Most likely, you are reading the file immediately before the contents are actually written to it.
There can be a lag between the contents being written to the file handle and the contents actually being transferred to the physical file, due to the levels of buffering between the programming-language API, the OS, and the physical file.
Use openFile.flush() to ensure that the data is written to the file.
An excellent explanation of flush can be found here.
Alternatively, close the file with openFile.close() or use a context manager:
with open('fldata', 'wb') as open_file:
    for chunk in flLicenseData.iter_content(100000):
        open_file.write(chunk)
I have a Python program which does the following:
It takes a list of files as input
It iterates through the list several times, each time opening the files and then closing them
What I would like is some way to open each file at the beginning, and then when iterating through the files make a copy of each file handle. Essentially this would take the form of a copy operation on file handles that allows a file to be traversed independently by multiple handles. The reason for wanting to do this is because on Unix systems, if a program obtains a file handle and the corresponding file is then deleted, the program is still able to read the file. If I try reopening the files by name on each iteration, the files might have been renamed or deleted so it wouldn't work. If I try using f.seek(0), then that might affect another thread/generator/iterator.
I hope my question makes sense, and I would like to know if there is a way to do this.
If you really want to get a copy of a file handle, you need the POSIX dup system call. In Python, that is accessed via os.dup (see the docs). If you have a file object (e.g. from calling open()), you need to call its fileno() method to get the file descriptor.
So the entire code will look like this:
with open("myfile") as f:
fd = f.fileno() # get descriptor
fd2 = os.dup(fd) # duplicate descriptor
f2 = os.fdopen(fd2) # get corresponding file object
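As a small, hypothetical demonstration of the deletion case mentioned in the question (example.txt is a made-up file name): the duplicated descriptor keeps the underlying file readable even after the name has been removed.

import os

# Create a throwaway file for the demonstration.
with open("example.txt", "w") as tmp:
    tmp.write("hello\n")

f = open("example.txt")              # original handle
f2 = os.fdopen(os.dup(f.fileno()))   # duplicated handle
os.remove("example.txt")             # the name is gone, the data is not

print(f2.read())                     # still prints "hello"
f.close()
f2.close()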
I am trying to use the "requests" package to retrieve info from GitHub, like the Requests doc page explains:
import requests
r = requests.get('https://api.github.com/events')
And this:
with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
I have to say I don't understand the second code block.
filename - in what form do I provide the path to the file if created? where will it be saved if not?
'wb' - what is this variable? (shouldn't second parameter be 'mode'?)
the following two lines probably iterate over the data retrieved with the request and write it to the file
The Python docs explanation also isn't helping much.
EDIT: What I am trying to do:
use Requests to connect to an API (Github and later Facebook GraphAPI)
retrieve data into a variable
write this into a file (later, as I get more familiar with Python, into my local MySQL database)
Filename
When using open, the path is relative to your current working directory. So if you said open('file.txt','w') it would create a new file named file.txt in whatever folder you ran your Python script from. You can also specify an absolute path, for example /home/user/file.txt on Linux. If a file named 'file.txt' already exists, its contents will be completely overwritten.
Mode
The 'wb' option is indeed the mode. The 'w' means write and the 'b' means bytes. You use 'w' when you want to write to (rather than read from) a file, and you use 'b' for binary files (rather than text files). It is actually a little odd to use 'b' in this case, as the content you are writing is text. Specifying 'w' alone would work just as well here. Read more about the modes in the docs for open.
The Loop
This part is using the iter_content method from requests, which is intended for use with large files that you may not want in memory all at once. This is unnecessary in this case, since the page in question is only 89 KB. See the requests library docs for more info.
Conclusion
The example you are looking at is meant to handle the most general case, in which the remote file might be binary and too big to fit in memory. However, we can make your code more readable and easier to understand if you are only accessing small webpages containing text:
import requests
r = requests.get('https://api.github.com/events')
with open('events.txt','w') as fd:
    fd.write(r.text)
filename is a string of the path you want to save it at. It accepts either a relative or an absolute path, so you can just have filename = 'example.html'
wb stands for WRITE & BYTES, learn more here
The for loop goes over the entire returned content (in chunks, in case it is too large for proper memory handling) and writes them until there are no more. Useful for large files, but for a single webpage you could just do:
# just 'w' because we are not writing bytes anymore, just text
with open(filename, 'w') as fd:
    fd.write(r.text)