I have a CSV file from which I read URLs line by line, making a request to each endpoint. Each response is parsed and the data is written to output.csv. The process is parallelized.
The issue is with the written data. Some portions of the data are partially missing, or missing entirely (blank lines). I suppose this happens because of collisions or conflicts between the async processes. Can you please advise how to fix that?
import csv
import re
import requests
from multiprocessing import Pool

def parse_data(url, line_num):
    print line_num, url
    r = requests.get(url)
    htmltext = r.text.encode("utf-8")
    pois = re.findall(re.compile('<pois>(.+?)</pois>'), htmltext)
    for poi in pois:
        write_data(poi)

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        writer.writerow([poi])
        resfile.close()

def main():
    pool = Pool(processes=4)
    with open("input.csv", "rb") as f:
        reader = csv.reader(f)
        for line_num, line in enumerate(reader):
            url = line[0]
            pool.apply_async(parse_data, args=(url, line_num))
    pool.close()
    pool.join()
Try to add file locking:
import fcntl

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        fcntl.flock(resfile, fcntl.LOCK_EX)
        writer.writerow([poi])
        fcntl.flock(resfile, fcntl.LOCK_UN)
    # Note that you don't have to close the file. The 'with' will take care of it.
Concurrent writes to the same file are indeed a known cause of data loss and file corruption. The safe solution here is the "map/reduce" pattern: each process writes to its own result file (map), then you concatenate those files together (reduce).
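For example, a minimal Python 3 sketch of that pattern, reusing the regex parsing from the question; the part_*.csv naming and the helper name are just placeholders:

import csv
import glob
import re
import requests
from multiprocessing import Pool

def parse_to_own_file(args):
    # map step: each worker writes only to its own part file, so nothing is shared
    url, line_num = args
    r = requests.get(url)
    pois = re.findall('<pois>(.+?)</pois>', r.text)
    with open('part_%05d.csv' % line_num, 'w', newline='') as part:
        writer = csv.writer(part)
        for poi in pois:
            writer.writerow([poi])

def main():
    with open('input.csv', newline='') as f:
        tasks = [(row[0], i) for i, row in enumerate(csv.reader(f))]
    with Pool(processes=4) as pool:
        pool.map(parse_to_own_file, tasks)
    # reduce step: concatenate the part files into the final output
    with open('output.csv', 'w', newline='') as out:
        for part_name in sorted(glob.glob('part_*.csv')):
            with open(part_name) as part:
                out.write(part.read())

if __name__ == '__main__':
    main()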
Related
So, I need to review some pages, and I made a janky queue for efficiency. I have one CSV that needs to be opened for reading, and one to be written to. For each page I open from the read CSV, I call input() and write some notes, so that they can be saved to the CSV being written to. Code below.
with open("readfile.csv") as r:
csv_file = csv.DictReader(r)
with open("writefile.csv", 'w') as w:
headers = {'URL': None, 'JUDGEMENT': None}
writer = csv.DictWriter(w, fieldnames=headers)
writer.writeheader()
for row in csv_file:
url = row.get("Profile URL")
browser.get(url) //Selenium opening URL
judgement = input("What say you?")
writer.writerow({"Profile URL": url, "JUDGEMENT": judgement})
This works just fine when I do the entire CSV, but sometimes I only want to do half. When I do CTRL+Z to escape the script, none of the write file gets saved. I tried adding an exception handler around the input, like
try:
    judgement = input("What say you?")
except Exception as e:
    # but can't find what to put here
That doesn't work, since I can't seem to find what to put there.
Maybe try w.close() in the exception handler - this should flush the buffer to the file, write the data, and then exit.
with open("readfile.csv") as r:
csv_file = csv.DictReader(r)
with open("writefile.csv", 'w') as w:
try:
headers = {'URL': None, 'JUDGEMENT': None}
writer = csv.DictWriter(w, fieldnames=headers)
writer.writeheader()
for row in csv_file:
url = row.get("Profile URL")
browser.get(url) //Selenium opening URL
judgement = input("What say you?")
writer.writerow({"Profile URL": url, "JUDGEMENT": judgement})
except KeyboardInterupt:
if not w.closed:
w.close() # Flushes buffer, and closes file
Alternatively, you could open the file for writing without the default buffering - 0 for unbuffered (binary mode only), 1 for line buffering (I suggest using 1):
with open("writefile.csv", 'w', buffering=1) as w
This post may help you understand further.
EDIT:
It seems as though both of these approaches are needed to solve this: opening with a line buffer and catching the keyboard interrupt, rather than just one of the two.
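A combined sketch of the two (line buffering plus a KeyboardInterrupt handler); browser and the column names are carried over from the question:

import csv
# assumes a Selenium webdriver has already been created as `browser`, as in the question

with open("readfile.csv") as r:
    csv_file = csv.DictReader(r)
    # buffering=1 -> line buffering, so each completed row reaches the file right away
    with open("writefile.csv", 'w', buffering=1, newline='') as w:
        writer = csv.DictWriter(w, fieldnames=['URL', 'JUDGEMENT'])
        writer.writeheader()
        try:
            for row in csv_file:
                url = row.get("Profile URL")
                browser.get(url)  # Selenium opening URL
                judgement = input("What say you?")
                writer.writerow({"URL": url, "JUDGEMENT": judgement})
        except KeyboardInterrupt:
            # rows written so far are already flushed; the with blocks close both files
            pass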
I am looking for some assistance with writing API results to a .CSV file using Python.
My source is a CSV file. It contains the below URLs in a column, as separate rows.
https://webapi.nhtsa.gov/api/SafetyRatings/modelyear/2013/make/Acura/model/rdx?format=csv
https://webapi.nhtsa.gov/api/SafetyRatings/modelyear/2017/make/Chevrolet/model/Corvette?format=csv
I can call the Web API and get the printed results. Please find the attached 'Web API results' snapshot.
When I try to export these results into a CSV, I am getting them as per the attached 'API results csv' snapshot. It is not transferring all the records; right now it only writes the last record to the CSV.
My final output should be as per the attached 'My final output should be' snapshot, for all the given inputs.
Please find below the Python code that I have used. I appreciate your help on this.
import csv, requests

with open('C:/Desktop/iva.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        urls = row[0]
        print(urls)
        r = requests.get(urls)
        print(r.text)
        with open('C:/Desktop/ivan.csv', 'w') as csvfile:
            csvfile.write(r.text)
You'll have to create a writer object for the output csvfile (to be created) and use its writerow() method to write to it.
import csv, requests

with open('C:/Desktop/iva.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        urls = row[0]
        print(urls)
        r = requests.get(urls)
        print(r.text)
        # append so each response is added instead of overwriting the file
        with open('C:/Desktop/ivan.csv', 'a', newline='') as csvfile:
            writerobj = csv.writer(csvfile)
            for line in csv.reader(r.text.splitlines()):
                writerobj.writerow(line)
One problem in your code is that every time you open a file with mode 'w', any existing content in that file is lost. You could prevent that by using append mode, open(filename, 'a'), instead.
But even better: just open the output file once, outside the for loop.
import csv, requests

with open('iva.csv') as infile, open('ivan.csv', 'w') as outfile:
    reader = csv.reader(infile)
    for row in reader:
        r = requests.get(row[0])
        outfile.write(r.text)
I've got a directory of files such as:
input_0.data
input_1.data
and so forth. I want to parse these files with a function that has been shown to output 47 lines for input_0.data when run by itself. However, when I bring a ThreadPoolExecutor into the mix and actually run more than one thread, the output from input_0.data becomes huge, quickly exceeding the known good 47 lines.
The code I'm trying to use is as follows, with needless details cut fairly obviously:
from sys import argv
from glob import glob
from os import path
from concurrent.futures import ThreadPoolExecutor

def find_moves(param_list):
    input_filename = param_list[0]
    output_filename = param_list[1]
    print(input_filename+" "+output_filename, flush=True)
    input_file = open(input_filename, "r")
    output_file = open(output_filename, "w")
    for line in input_file:
        if do_log_line(line):  # do_log_line and format_log are among the details cut
            log = format_log(line)
            print(log, file=output_file, flush=True)
    input_file.close()
    output_file.close()

if len(argv) != 3:
    print("Usage:\n\tmoves.py [input_dir] [output_dir]")
    quit()

input_files = list()
for file in glob(path.join(argv[1], "input_*.data")):
    input_files.append(file)
input_files = sorted(input_files)

with ThreadPoolExecutor(max_workers=8) as executor:
    for file_number, input_filename in enumerate(input_files):
        output_filename = "moves_"+str(file_number)+".csv"
        output_filename = path.join(argv[2], output_filename)
        executor.submit(find_moves, (input_filename, output_filename))
It's obvious I'm using this tool incorrectly, but it's not obvious to me where my mistake is. I'd appreciate some guidance in the matter.
It seems like the threads are writing to each other's files, even though they explicitly state they're working on the right file.
I have the following situation:
-different users (all on Windows OS) that run a Python script that can either read or write to a pickle file located in a shared folder
-the "system" is designed in a way that only one user at a time will be writing to the file (so there is no race condition of multiple processes trying to WRITE to the file at the same time)
-the basic code to write would be this:
with open(path + r'\final_db.p', 'wb') as f:
    pickle.dump((x, y), f)
-while code to read would be:
with open(path + r'\final_db.p', 'rb') as f:
    x, y = pickle.load(f)
-x is a list of 5K+ elements, where each element is a class instance containing many attributes and functions; y is a date
QUESTION:
Am I correct in assuming that there is a race condition when a reading and a writing process overlap, and that the reading one can end up with a corrupt file?
PROPOSED SOLUTIONS:
1. A possible solution I thought of is using filelock:
code to write:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'wb') as f:
        pickle.dump((x, y), f)
code to read:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'rb') as f:
        x, y = pickle.load(f)
This solution should work (??), but if a process crashes, the file remains blocked until the "file_path + '.lock'" file is removed.
2. Another solution could be to use portalocker.
code to write:
with open(path + r'\final_db.p', 'wb') as f:
    portalocker.lock(f, portalocker.LOCK_EX)
    pickle.dump((x, y), f)
code to read:
segnale = True
while segnale:
    try:
        with open(path + r'\final_db.p', 'rb') as f:
            x, y = pickle.load(f)
        segnale = False
    except:
        pass
The reading process, if another process started writing before it, will keep looping until the file is unlocked (the bare except swallows the PermissionError).
If the writing process started after the reading process, the read should loop again if the file turns out to be corrupt.
What I am not sure about is whether the reading process could end up reading a partially written file.
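For example, the read side could take a shared lock instead of looping on exceptions - a rough sketch, assuming portalocker's shared-lock flag LOCK_SH and the same shared-folder path as above:

import pickle
import portalocker

# the reader asks for a shared lock; it waits while a writer holds LOCK_EX,
# so it should not load the file mid-write
with open(path + r'\final_db.p', 'rb') as f:
    portalocker.lock(f, portalocker.LOCK_SH)
    x, y = pickle.load(f)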
Any advice? Better solutions?
Hi, I am facing an I/O error while looping file execution. The code raises 'ValueError: I/O operation on closed file.' while running. Does anyone have any idea why it says the operation is on a closed file, when I am opening a new one inside the loop? Many thanks.
code below:
import csv
import datetime
import numpy as np
import pandas as pd

with open('inputlist.csv', 'r') as f:  # input list reading
    reader = csv.reader(f)
    queries2Google = reader
    print(queries2Google)

def QGN(query2Google):
    s = '"'+query2Google+'"'  # Keywords for query, to solve the + for space
    s = s.replace(" ", "+")
    date = str(datetime.datetime.now().date())  # timestamp
    filename = query2Google+"_"+date+"_"+'SearchNews.csv'  # csv filename
    f = open(filename, "wb")  # open output file
    pass

    # ... search/scraping code elided in the question; it builds df ...
    df = np.reshape(df, (-1, 3))
    itemnum, col = df.shape
    itemnum = str(itemnum)
    df1 = pd.DataFrame(df, columns=['Title', 'URL', 'Brief'])
    print("Done! "+itemnum+" pieces found.")
    df1.to_csv(filename, index=False, encoding='utf-8')
    f.close()
    return

for query2Google in queries2Google:
    QGN(query2Google)  # output should be multiple files
with closes the file that you are trying to read from once it is done. So you are opening the file, making a csv reader, then closing the underlying file, and then trying to read from it. See more about file I/O here.
Solution is to do all of your work on your queries2Google reader INSIDE the with statement:
with open('inputlist.csv', 'r') as f:  # input list reading
    reader = csv.reader(f)
    for q2g in reader:
        QGN(q2g)
Some additional stuff:
That pass isn't doing anything, and you should probably be using with again inside the QGN function, since a file is opened and closed in there. Python doesn't need empty returns. You also don't seem to even be using f in the QGN function.
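Since f isn't actually used, the simplest cleanup is to drop the manual open()/close() and let to_csv handle the file itself. A sketch of just the saving step (save_results is a hypothetical helper name; it assumes the elided search code has already built df):

import datetime
import numpy as np
import pandas as pd

def save_results(query2Google, df):
    # df: the results built by the (elided) search code, reshaped to n rows of 3 columns
    date = str(datetime.datetime.now().date())
    filename = query2Google + "_" + date + "_" + 'SearchNews.csv'
    df1 = pd.DataFrame(np.reshape(df, (-1, 3)), columns=['Title', 'URL', 'Brief'])
    print("Done! " + str(df1.shape[0]) + " pieces found.")
    df1.to_csv(filename, index=False, encoding='utf-8')  # to_csv opens and closes the file itself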