I have a program that uses Python threading to read different lines from a file; if it reads a duplicate line it reads another one, and once a line has been read it is removed from the file. The problem is that the file doesn't seem to get updated (or I'm not quite sure what's happening): a thread can sometimes read the same line as before, which breaks the program. I'm also not sure whether my code is the most effective way to do this.
import random
import threading
import time

def read_tokens_list():
    tokens = []
    with open('inputTokens.txt', 'r', encoding='UTF-8') as file:
        lines = file.readlines()
        for line in lines:
            tokens.append(line.replace('\n', ''))
    return tokens

def worker(token_list):
    while True:
        token = random.choice(token_list)
        print(token)
        ver = open("Fullyverified.txt", "a+")
        ver.write(token + "\n")
        with open("inputTokens.txt", "r") as f:
            lines = f.readlines()
        with open("inputTokens.txt", "w") as f:
            for line in lines:
                if line.strip("\n") != token:
                    f.write(line)
        time.sleep(1)

def main():
    threads = []
    num_thread = input('Number of Threads: ')
    num_thread = int(num_thread)
    token_list = read_tokens_list() # read in the inputTokens.txt file
    random.shuffle(token_list) # shuffle the list into random order
    tokens_per_worker = len(token_list) // num_thread # how many tokens from the list each worker will get (roughly)
    for i in range(num_thread):
        if ((i+1) < num_thread):
            num_tokens_for_this_worker = tokens_per_worker # give each worker an even share of the list
        else:
            num_tokens_for_this_worker = len(token_list) # except the last worker gets whatever is left
        # we'll give the first (num_tokens_for_this_worker) tokens in the list to this worker
        tokens_for_this_worker = token_list[0:num_tokens_for_this_worker]
        # and remove those tokens from the list so that they won't get used by anyone else
        token_list = token_list[num_tokens_for_this_worker:]
        t = threading.Thread(target=worker, args=(tokens_for_this_worker, ))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
Use a lock.
Something like this:
from threading import Lock

# ...

lock = Lock()

# ...

def worker(token_list, lock=lock):
    # ...
    with lock:
        with open("inputTokens.txt", "r") as f:
            lines = f.readlines()
        with open("inputTokens.txt", "w") as f:
            for line in lines:
                if line.strip("\n") != token:
                    f.write(line)
    # ...
The idea of the lock is to protect a resource from being accessed by several threads at once: while one thread is working with the file, the others wait.
The next question is whether this approach still makes sense, because depending on the size of your file, threads might spend most of their time waiting for the lock.
What about a database instead of a file? Then you don't have to rewrite the whole file; you just delete or update an entry.
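For illustration, here is a rough, untested sketch of the database idea using the standard-library sqlite3 module. It assumes a tokens table with a single token column has already been created and populated from inputTokens.txt; each worker claims one token and deletes just that row, so nothing ever rewrites a whole file:

import sqlite3
import threading

db_lock = threading.Lock()

def claim_token(db_path="tokens.db"):
    """Atomically pop one token from the table, or return None when it is empty."""
    with db_lock:  # serialize access among the threads of this process
        conn = sqlite3.connect(db_path)
        try:
            row = conn.execute("SELECT token FROM tokens LIMIT 1").fetchone()
            if row is None:
                return None
            conn.execute("DELETE FROM tokens WHERE token = ?", (row[0],))
            conn.commit()
            return row[0]
        finally:
            conn.close()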
I have a continuously growing CSV file that I want to read periodically. I am also only interested in the new values.
I was hoping to do something like:
file_chunks = pd.read_csv('file.csv', chunksize=1)
while True:
    do_something(next(file_chunks))
    time.sleep(0.1)
at a frequency that is faster than the .csv file grows.
However, as soon as the iterator fails to return a value once, it "breaks" and stops returning values, even if the .csv file has grown in the meantime.
Is there a way to read a continuously growing .csv file line by line?
You could wrap it in a try/except, or add an if statement that first checks that file_chunks is not None.
That way it shouldn't break anymore, and it only sleeps when there are no more chunks left.
import time
import pandas as pd

while True:
    file_chunks = pd.read_csv('file.csv', chunksize=1)
    while True:
        try:
            do_something(next(file_chunks))
        except StopIteration:
            # no more chunks for now; sleep, then re-create the reader
            time.sleep(0.1)
            break
This is easier to do with the standard csv module, where you can write your own line iterator that knows how to read an updating file. The generator below reads in binary mode so that it can track the file position, closes the file at EOF, and polls its size for appended data. It can fail if the reader sees a partial update because the other side hasn't flushed yet, or if a CSV cell contains an embedded newline, which invalidates the reader's assumption that a binary-mode newline always terminates a row.
import csv
import time
import os
import threading
import random

def rolling_reader(filename, poll_period=.1, encoding="utf-8"):
    pos = 0
    while True:
        # wait until the file exists and has grown past our last position
        while True:
            try:
                if os.stat(filename).st_size > pos:
                    break
            except FileNotFoundError:
                pass
            time.sleep(poll_period)
        with open(filename, "rb") as fp:
            fp.seek(pos)
            for line in fp:
                if line.strip():
                    yield line.decode(encoding)
            pos = fp.tell()

# ---- TEST - thread updates test.csv periodically

class GenCSVThread(threading.Thread):
    def __init__(self, csv_name):
        super().__init__(daemon=True)
        self.csv_name = csv_name
        self.start()

    def run(self):
        val = 1
        while True:
            with open(self.csv_name, "a") as fp:
                for _ in range(random.randrange(4)):
                    fp.write(",".join(str(val) for _ in range(4)) + "\n")
                    val += 1
            time.sleep(random.random())

if os.path.exists("test.csv"):
    os.remove("test.csv")

test_gen = GenCSVThread("test.csv")
reader = csv.reader(rolling_reader("test.csv"))
for row in reader:
    print(row)
A platform-dependent improvement would be to use a facility such as inotify to trigger reads off a file close operation, reducing the risk of reading partial data.
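Purely as an illustration of that idea (not tested), the third-party watchdog package, which wraps inotify on Linux, could replace the polling loop; the file name below is just the test file from above, and the partial-write caveat still applies, since an event only tells you the file changed, not that a full row was written:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class TailHandler(FileSystemEventHandler):
    """Print any complete lines appended to the watched file since the last event."""
    def __init__(self, filename):
        self.filename = filename
        self.pos = 0

    def on_modified(self, event):
        if event.src_path.endswith(self.filename):
            with open(self.filename, "rb") as fp:
                fp.seek(self.pos)
                for line in fp:
                    print(line.decode("utf-8").rstrip())
                self.pos = fp.tell()

observer = Observer()
observer.schedule(TailHandler("test.csv"), path=".", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()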
The idea is to write N files using N processes.
The data for each output file come from multiple input files, whose names are stored in a dictionary that has a list as each value; it looks like this:
dic = {'file1': ['data11.txt', 'data12.txt', ..., 'data1M.txt'],
       'file2': ['data21.txt', 'data22.txt', ..., 'data2M.txt'],
       ...
       'fileN': ['dataN1.txt', 'dataN2.txt', ..., 'dataNM.txt']}
so file1 is data11 + data12 + ... + data1M, and so on.
So my code looks like this:
jobs = []
for d in dic:
    outfile = str(d)+"_merged.txt"
    with open(outfile, 'w') as out:
        p = multiprocessing.Process(target = merger.merger, args=(dic[d], name, out))
        jobs.append(p)
        p.start()
    out.close()
and the merger.py looks like this:
def merger(files, name, outfile):
    time.sleep(2)
    sys.stdout.write("Merging %s...\n" % name)
    # the reason for this step is that all the different files have a header
    # but I only need the header from the first file.
    with open(files[0], 'r') as infile:
        for line in infile:
            print "writing to outfile: ", name, line
            outfile.write(line)
    for f in files[1:]:
        with open(f, 'r') as infile:
            next(infile)  # skip first line
            for line in infile:
                outfile.write(line)
    sys.stdout.write("Done with: %s\n" % name)
I do see the file in the folder it should go to, but it's empty: no header, nothing. I put prints in there to see if everything is correct, but nothing works.
Help!
Since the worker processes run in parallel with the main process that creates them, the file named out gets closed before the workers can write to it. This happens even if you remove out.close(), because of the with statement. Instead, pass each process the filename and let the process open and close the file itself.
The problem is that you don't close the file in the child, so internally buffered data is lost. You could move the file open into the child, or wrap the whole thing in a try/finally block to make sure the file gets closed. A potential advantage of opening in the parent is that you can handle file errors there. I'm not saying it's compelling, just an option.
def merger(files, name, outfile):
    try:
        time.sleep(2)
        sys.stdout.write("Merging %s...\n" % name)
        # the reason for this step is that all the different files have a header
        # but I only need the header from the first file.
        with open(files[0], 'r') as infile:
            for line in infile:
                print "writing to outfile: ", name, line
                outfile.write(line)
        for f in files[1:]:
            with open(f, 'r') as infile:
                next(infile)  # skip first line
                for line in infile:
                    outfile.write(line)
        sys.stdout.write("Done with: %s\n" % name)
    finally:
        outfile.close()
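For completeness, here is a sketch of the other option mentioned above: pass each process the output filename and let the child open and close the file itself (the out_name argument is just illustrative):

import sys

def merger(files, name, out_name):
    # The child owns the output file; the with block flushes and closes it
    # even if an exception is raised part-way through.
    with open(out_name, 'w') as outfile:
        with open(files[0], 'r') as infile:
            outfile.writelines(infile)   # keep the header from the first file
        for f in files[1:]:
            with open(f, 'r') as infile:
                next(infile)             # skip the duplicate header
                outfile.writelines(infile)
    sys.stdout.write("Done with: %s\n" % name)

# and in the parent, something like:
#     p = multiprocessing.Process(target=merger, args=(dic[d], d, str(d) + "_merged.txt"))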
UPDATE
There has been some confusion about parent/child file descriptors and what happens to files in the child. The underlying C library does not flush data to disk if a file is still open when the program exits; the theory is that a properly running program closes things before exiting. Here is an example where the child loses data because it does not close the file.
import multiprocessing as mp
import os
import time

if os.path.exists('mytestfile.txt'):
    os.remove('mytestfile.txt')

def worker(f, do_close=False):
    time.sleep(2)
    print('writing')
    f.write("this is data")
    if do_close:
        print("closing")
        f.close()

print('without close')
f = open('mytestfile.txt', 'w')
p = mp.Process(target=worker, args=(f, False))
p.start()
f.close()
p.join()
print('file data:', open('mytestfile.txt').read())

print('with close')
os.remove('mytestfile.txt')
f = open('mytestfile.txt', 'w')
p = mp.Process(target=worker, args=(f, True))
p.start()
f.close()
p.join()
print('file data:', open('mytestfile.txt').read())
I ran it on Linux and got:
without close
writing
file data:
with close
writing
closing
file data: this is data
I want to delete some specific lines in a file.
The part I want to delete is enclosed between two lines (that will be deleted too), named STARTING_LINE and CLOSING_LINE. If there is no closing line before the end of the file, then the operation should stop.
Example:
...blabla...
[Start] <-- # STARTING_LINE
This is the body that I want to delete
[End] <-- # CLOSING_LINE
...blabla...
I came up with three different ways to achieve the same thing (plus one provided by tdelaney's answer below), but I am wondering which one is the best. Please note that I am not looking for subjective opinions: I would like to know if there are real reasons why I should choose one method over another.
1. A lot of if conditions (just one for loop):
def delete_lines(filename):
    with open(filename, 'r+') as my_file:
        text = ''
        found_start = False
        found_end = False
        for line in my_file:
            if not found_start and line.strip() == STARTING_LINE.strip():
                found_start = True
            elif found_start and not found_end:
                if line.strip() == CLOSING_LINE.strip():
                    found_end = True
                continue
            else:
                print(line)
                text += line
        # Go to the top and write the new text
        my_file.seek(0)
        my_file.truncate()
        my_file.write(text)
2. Nested for loops on the open file:
def delete_lines(filename):
    with open(filename, 'r+') as my_file:
        text = ''
        for line in my_file:
            if line.strip() == STARTING_LINE.strip():
                # Skip lines until we reach the end of the function
                # Note: the next `for` loop iterates on the following lines, not
                # on the entire my_file (i.e. it is not starting from the first
                # line). This will allow us to avoid manually handling the
                # StopIteration exception.
                found_end = False
                for function_line in my_file:
                    if function_line.strip() == CLOSING_LINE.strip():
                        print("stop")
                        found_end = True
                        break
                if not found_end:
                    print("There is no closing line. Stopping")
                    return False
            else:
                text += line
        # Go to the top and write the new text
        my_file.seek(0)
        my_file.truncate()
        my_file.write(text)
3. while True and next() (with StopIteration exception)
def delete_lines(filename):
    with open(filename, 'r+') as my_file:
        text = ''
        for line in my_file:
            if line.strip() == STARTING_LINE.strip():
                # Skip lines until we reach the end of the function
                while True:
                    try:
                        line = next(my_file)
                        if line.strip() == CLOSING_LINE.strip():
                            print("stop")
                            break
                    except StopIteration:
                        print("There is no closing line.")
                        break
            else:
                text += line
        # Go to the top and write the new text
        my_file.seek(0)
        my_file.truncate()
        my_file.write(text)
4. itertools (from tdelaney's answer):
def delete_lines_iter(filename):
    with open(filename, 'r+') as wrfile:
        with open(filename, 'r') as rdfile:
            # write everything before startline
            wrfile.writelines(itertools.takewhile(lambda l: l.strip() != STARTING_LINE.strip(), rdfile))
            # drop everything before stopline.. and the stopline itself
            try:
                next(itertools.dropwhile(lambda l: l.strip() != CLOSING_LINE.strip(), rdfile))
            except StopIteration:
                pass
            # include everything after
            wrfile.writelines(rdfile)
            wrfile.truncate()
It seems that these four implementations achieve the same result. So...
Question: which one should I use? Which one is the most Pythonic? Which one is the most efficient?
Is there a better solution instead?
Edit: I tried to evaluate the methods on a big file using timeit. In order to have the same file on each iteration, I removed the writing parts of each version; this means the evaluation mostly concerns the reading (and file-opening) task.
t_if = timeit.Timer("delete_lines_if('test.txt')", "from __main__ import delete_lines_if")
t_for = timeit.Timer("delete_lines_for('test.txt')", "from __main__ import delete_lines_for")
t_while = timeit.Timer("delete_lines_while('test.txt')", "from __main__ import delete_lines_while")
t_iter = timeit.Timer("delete_lines_iter('test.txt')", "from __main__ import delete_lines_iter")
print(t_if.repeat(3, 4000))
print(t_for.repeat(3, 4000))
print(t_while.repeat(3, 4000))
print(t_iter.repeat(3, 4000))
Result:
# Using IF statements:
[13.85873354100022, 13.858520206999856, 13.851908310999988]
# Using nested FOR:
[13.22578497800032, 13.178281234999758, 13.155530822999935]
# Using while:
[13.254994718000034, 13.193942980999964, 13.20395484699975]
# Using itertools:
[10.547019549000197, 10.506679693000024, 10.512742852999963]
You can make it fancy with itertools. I'd be interested in how timing compares.
import itertools

def delete_lines(filename):
    with open(filename, 'r+') as wrfile:
        with open(filename, 'r') as rdfile:
            # write everything before startline
            wrfile.writelines(itertools.takewhile(lambda l: l.strip() != STARTING_LINE.strip(), rdfile))
            # drop everything before stopline.. and the stopline itself
            next(itertools.dropwhile(lambda l: l.strip() != CLOSING_LINE.strip(), rdfile))
            # include everything after
            wrfile.writelines(rdfile)
            wrfile.truncate()
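For anyone trying these out, here is a small usage sketch; the marker strings and file contents are just the example from the question:

STARTING_LINE = "[Start]"
CLOSING_LINE = "[End]"

with open('test.txt', 'w') as f:
    f.write("...blabla...\n"
            "[Start]\n"
            "This is the body that I want to delete\n"
            "[End]\n"
            "...blabla...\n")

delete_lines('test.txt')       # any of the four variants above

with open('test.txt') as f:
    print(f.read())            # only the two "...blabla..." lines remain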
I want to create a text file and add data to it, line by line. If a data line already exists in the file, it should be ignored. Otherwise, it should be appended to the file.
You are almost certainly better off reading the file and writing a new, changed version. In most circumstances it will be quicker, easier, less error-prone and more extensible.
If your file isn't that big, you could just do something like this:
added = set()

def add_line(line):
    if line not in added:
        f = open('myfile.txt', 'a')
        f.write(line + '\n')
        added.add(line)
        f.close()
But this isn't a great idea if you have to worry about concurrency, large amounts of data being stored in the file, or basically anything other than something quick and one-off.
I did it like this:
def retrieveFileData():
    """Retrieve Location/Upstream data from files"""
    lines = set()
    for line in open(LOCATION_FILE):
        lines.add(line.strip())
    return lines

def add_line(line):
    """Add new entry to file"""
    f = open(LOCATION_FILE, 'a')
    lines = retrieveFileData()
    print lines
    if line not in lines:
        f.write(line + '\n')
        lines.add(line)
        f.close()
    else:
        print "entry already exists"

if __name__ == "__main__":
    while True:
        line = raw_input("Enter line manually: ")
        add_line(line)
        if line == 'quit':
            break
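A slightly tidier variant of the same idea, written for Python 3 as a sketch: load the existing lines into a set once, keep the file handle open in append mode, and let a with block close it:

def add_lines_interactively(filename):
    # Load what is already in the file once, so duplicate checks are O(1).
    try:
        with open(filename) as f:
            seen = {line.strip() for line in f}
    except FileNotFoundError:
        seen = set()
    with open(filename, 'a') as out:
        while True:
            line = input("Enter line manually: ")
            if line == 'quit':
                break
            if line in seen:
                print("entry already exists")
            else:
                out.write(line + '\n')
                seen.add(line)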
I'm having the following problem in Python.
I need to do some calculations in parallel, and their results need to be written sequentially to a file. So I created a function that receives a multiprocessing.Queue and a file handle, does the calculation, and prints the result to the file:
import multiprocessing
from multiprocessing import Process, Queue
from mySimulation import doCalculation

# doCalculation(pars) is a function I must run for many different sets of
# parameters and collect the results in a file

def work(queue, fh):
    while True:
        try:
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, result
        except:
            break

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fh = open("foo", "w")
    workQueue = Queue()
    parList = # list of conditions for which I want to run doCalculation()
    for x in parList:
        workQueue.put(x)
    processes = [Process(target = work, args = (workQueue, fh)) for i in range(nthreads)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    fh.close()
But the file ends up empty after the script runs. I tried changing the worker function to:
def work(queue, filename):
    while True:
        try:
            fh = open(filename, "a")
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, result
            fh.close()
        except:
            break
and pass the filename as a parameter. Then it works as I intended. When I try to do the same thing sequentially, without multiprocessing, it also works normally.
Why didn't it work in the first version? I can't see the problem.
Also: can I guarantee that two processes won't try to write to the file simultaneously?
EDIT:
Thanks. I got it now. This is the working version:
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep
from random import uniform

def doCalculation(par):
    t = uniform(0, 2)
    sleep(t)
    return par * par  # just to simulate some calculation

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block = False)
            print "dealing with ", par, ""
            res = doCalculation(par)
            queueOut.put((par, res))
        except:
            break

def write(queue, fname):
    fhandle = open(fname, "w")
    while True:
        try:
            par, res = queue.get(block = False)
            print >>fhandle, par, res
        except:
            break
    fhandle.close()

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fname = "foo"
    workerQueue = Queue()
    writerQueue = Queue()
    parlist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    feedProc = Process(target = feed, args = (workerQueue, parlist))
    calcProc = [Process(target = calc, args = (workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target = write, args = (writerQueue, fname))

    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()

    feedProc.join()
    for p in calcProc:
        p.join()
    writProc.join()
You really should use two queues and three separate kinds of processing.
Put stuff into Queue #1.
Get stuff out of Queue #1 and do calculations, putting stuff in Queue #2. You can have many of these, since they get from one queue and put into another queue safely.
Get stuff out of Queue #2 and write it to a file. You must have exactly 1 of these and no more. It "owns" the file, guarantees atomic access, and absolutely assures that the file is written cleanly and consistently.
If anyone is looking for a simple way to do the same, this can help you.
I don't think there are any disadvantages to doing it in this way. If there are, please let me know.
import multiprocessing
import re

def mp_worker(item):
    # Do something
    return item, count

def mp_handler():
    cpus = multiprocessing.cpu_count()
    p = multiprocessing.Pool(cpus)
    # The two lines below populate listX, which is later processed in parallel.
    # This can be replaced with anything, as long as listX is passed on to the next step.
    with open('ExampleFile.txt') as f:
        listX = [line for line in (l.strip() for l in f) if line]
    with open('results.txt', 'w') as f:
        for result in p.imap(mp_worker, listX):
            # (item, count) tuples from worker
            f.write('%s: %d\n' % result)

if __name__ == '__main__':
    mp_handler()
Source: Python: Writing to a single file with queue while using multiprocessing Pool
There is a mistake in the write worker code: with block=False the writer can give up before any data has arrived. It should be as follows:
par, res = queue.get(block = True)
You can check it by adding the line
print "QSize",queueOut.qsize()
after the
queueOut.put((par,res))
With block=False you would see an ever-increasing queue length until it fills up, unlike with block=True where you always get "1".
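An alternative to polling with block=False (or blocking forever with block=True) is to push a sentinel value through the queue so the writer knows when to stop; a rough sketch, reusing the names from the working version above:

SENTINEL = None

def write(queue, fname):
    with open(fname, "w") as fhandle:
        while True:
            item = queue.get()        # block until something arrives
            if item is SENTINEL:      # the parent signalled that all work is done
                break
            par, res = item
            fhandle.write("%s %s\n" % (par, res))

# after all calc processes have been joined, the parent does:
#     writerQueue.put(SENTINEL)
#     writProc.join()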