I'm creating a Python script that parses a large (but simple) CSV.
It'll take some time to process. I would like the ability to interrupt the parsing of the CSV so I can continue at a later stage.
Currently I have this (unfinished), which lives in a larger class:
Edit:
I have some changed code. But the system will parse over 3 million rows.
def parseData(self):
    reader = csv.reader(open(self.file))
    for id, title, disc in reader:
        print "%-5s %-50s %s" % (id, title, disc)
        l = LegacyData()
        l.old_id = int(id)
        l.name = title
        l.disc_number = disc
        l.parsed = False
        l.save()
This is the old code.
def parseData(self):
    #first line start
    fields = self.data.next()
    for row in self.data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        self.save(item)
Thanks guys.
If you're on Linux, hit Ctrl-Z to stop the running process. Type "fg" to bring it back and resume where you stopped it.
You can use signal to catch the event. This is a mockup of a parser that can catch Ctrl-C on Windows and stop parsing:
import signal, time, sys

def onInterrupt(signum, frame):
    raise Interrupted()

try:
    #windows
    signal.signal(signal.CTRL_C_EVENT, onInterrupt)
except:
    pass

class Interrupted(Exception): pass

class InterruptableParser(object):
    def __init__(self, previous_parsed_lines=0):
        self.parsed_lines = previous_parsed_lines

    def _parse(self, line):
        # do stuff
        time.sleep(1)  # mock up
        self.parsed_lines += 1
        print 'parsed %d' % self.parsed_lines

    def parse(self, filelike):
        for line in filelike:
            try:
                self._parse(line)
            except Interrupted:
                print 'caught interrupt'
                self.save()
                print 'exiting ...'
                sys.exit(0)

    def save(self):
        # do what you need to save state
        # like write the parsed_lines to a file maybe
        pass

parser = InterruptableParser()
parser.parse([1, 2, 3])
Can't test it though as I'm on Linux at the moment.
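For reference, here is a minimal sketch of the same idea (my addition, not the answerer's tested code) using signal.SIGINT, which Python maps to Ctrl-C on both Linux and Windows:

import signal
import sys
import time

class Interrupted(Exception):
    pass

def on_interrupt(signum, frame):
    # replace the default KeyboardInterrupt with our own exception
    raise Interrupted()

signal.signal(signal.SIGINT, on_interrupt)

parsed_lines = 0  # in the real script this would be loaded from saved state
try:
    for line in ['row1', 'row2', 'row3']:  # stand-in for the CSV rows
        time.sleep(1)                      # mock work
        parsed_lines += 1
except Interrupted:
    print('caught interrupt, saving state ...')
    # write parsed_lines somewhere so the next run can skip ahead
    sys.exit(0)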
The way I'd do it:
Put the actual processing code in a class, and on that class I'd implement the pickle protocol (http://docs.python.org/library/pickle.html) (basically, write proper __getstate__ and __setstate__ methods).
This class would accept the filename, keep the open file, and the CSV reader instance as instance members. The __getstate__ method would save the current file position, and __setstate__ would reopen the file, seek to the proper position, and create a new reader.
I'd perform the actual work in an __iter__ method that would yield to an external function after each line was processed.
This external function would run a "main loop" monitoring input for interrupts (sockets, keyboard, the state of a specific file on the filesystem, etc.) - if everything is quiet, it would just call for the next iteration of the processor. If an interrupt happens, it would pickle the processor state to a specific file on disk.
When starting, the program just has to check whether there is a saved execution; if so, it uses pickle to retrieve the executor object and resumes the main loop.
Here goes some (untested) code - the idea is simple enough:
from cPickle import load, dump
import csv
import os, sys

SAVEFILE = "running.pkl"
STOPNOWFILE = "stop.now"

class Processor(object):
    def __init__(self, filename):
        self.file = open(filename, "rt")
        self.reader = csv.reader(self.file)

    def __iter__(self):
        for line in self.reader:
            # do stuff
            yield None

    def __getstate__(self):
        return (self.file.name, self.file.tell())

    def __setstate__(self, state):
        self.file = open(state[0], "rt")
        self.file.seek(state[1])
        self.reader = csv.reader(self.file)

def check_for_interrupts():
    # Use your imagination here!
    # One simple thing would be to check for the existence of a specific file
    # on disk.
    # But you could go all the way up to instantiating a TCP server and
    # listening for interruptions on the network.
    if os.path.exists(STOPNOWFILE):
        return True
    return False

def main():
    if os.path.exists(SAVEFILE):
        with open(SAVEFILE, "rb") as savefile:
            processor = load(savefile)
        os.unlink(SAVEFILE)
    else:
        # Assumes the name of the .csv file to be passed on the command line
        processor = Processor(sys.argv[1])
    for line in processor:
        if check_for_interrupts():
            with open(SAVEFILE, "wb") as savefile:
                dump(processor, savefile)
            break

if __name__ == "__main__":
    main()
My Complete Code
I followed the advice of @jsbueno with a flag - but instead of another file, I kept it within the class as a variable:
I create a class - when I call it, it asks for ANY input and then begins another process doing my work. As it's looped, if I press a key the flag is set, and it's only checked when the loop comes around for my next parse. Thus I don't kill the current action.
Adding a process flag in the database for each object from the data I'm calling means I can start this at any time and resume where I left off.
from multiprocessing import Process
from time import sleep

class MultithreadParsing(object):

    process = None
    process_flag = True

    def f(self):
        print "\nMultithreadParsing has started\n"
        while self.process_flag:
            ''' get my object from database '''
            legacy = LegacyData.objects.filter(parsed=False)[0:1]
            if legacy:
                print "Processing: %s %s" % (legacy[0].name, legacy[0].disc_number)
                for l in legacy:
                    ''' ... Do what I want it to do ...'''
                sleep(1)
            else:
                self.process_flag = False
                print "Nothing to parse"

    def __init__(self):
        self.process = Process(target=self.f)
        self.process.start()
        print self.process
        a = raw_input("Press any key to stop \n")
        print "\nKILL FLAG HAS BEEN SENT\n"
        if a:
            print "\nKILL\n"
            self.process_flag = False
Thanks for all your help guys (especially you, @jsbueno) - if it wasn't for you I wouldn't have got this class idea.
Related
I want to write good tests to make sure my concurrent data structure works. But the tests are passing even on a class that is obviously not thread-safe.
class NotThreadSafe:
    def __init__(self):
        self.set1 = set()
        self.set2 = set()

    def add_to_sets(self, item):
        self._add_to_set1(item)
        self._add_to_set2(item)

    def _add_to_set1(self, item):
        self.set1.add(item)

    def _add_to_set2(self, item):
        self.set2.add(item)

    def are_sets_equal_length(self):
        return len(self.set1) == len(self.set2)
My tests have a reader thread and a writer thread running concurrently. The writer thread calls add_to_sets and the reader thread calls are_sets_equal_length.
But the reader thread always observes are_sets_equal_length to be True, even though the writer thread should theoretically cause inequalities.
How can I add some time delay on add_to_set2 so that it forces the race condition to surface?
The test:
import threading
import time

def writer_fn(nts: NotThreadSafe):
    for i in range(1000):
        nts.add_to_sets(i)

def reader_fn(nts: NotThreadSafe, stop: list, results: list):
    while not len(stop):
        if not nts.are_sets_equal_length():
            results.append(False)
            return
    results.append(True)

def test_nts():
    nts = NotThreadSafe()
    stop = []
    results = []
    reader = threading.Thread(target=reader_fn, args=[nts, stop, results])
    writer = threading.Thread(target=writer_fn, args=[nts])
    reader.start()
    writer.start()
    writer.join()
    stop.append(True)
    reader.join()
    assert not results[0]
Step 1: write a wrapper that creates a new function containing a time delay.
def slow_wrapper(method):
    """Adds a tiny delay to a method. Good for triggering race conditions that would otherwise be very rare."""
    def wrapped_method(*args):
        time.sleep(0.001)
        return method(*args)
    return wrapped_method
Step 2: In the test function, after creating the object, change add_to_set2 into a slow version:
nts = NotThreadSafe()
# change _add_to_set2 into a time-delayed version
nts._add_to_set2 = slow_wrapper(nts._add_to_set2)
Step 3: Run the tests. The failure should be triggered properly.
I want to answer an input() from another thread of the same process in Python, from within the code.
This is the code:
import sys
import time
import threading

def threaded(fn):
    def wrapper(*args, **kwargs):
        thread = threading.Thread(target=fn, args=args, kwargs=kwargs, daemon=True)
        thread.start()
        return thread
    return wrapper

@threaded
def answer():
    time.sleep(2)
    sys.stdin.write('to be inputed')

answer()
x = input('insert a value: ')
print(f'value inserted: {x}')  # expected print: 'value inserted: to be inputed'
But I think it's not possible because I receive this error:
Exception in thread Thread-1:
Traceback (most recent call last):
File "teste.py", line 80, in answer
sys.stdin.write('to be inputed')
io.UnsupportedOperation: not writable
It's hard to explain why I want that, but sometimes the user will input the value and sometimes it will come from another input source (Telegram). So this second thread should be able to provide the input value and release the code execution.
I also can't change the input() part of the code because it's from inside a library, so it needs to stay this way: input('insert a value: ')
Is there a way to achieve that?
The simple answer is that if you replace sys.stdin with your own variable, then input uses that instead.
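As a minimal sketch of that claim (my own illustration, not the full solution): anything bound to sys.stdin that has a readline() method will be used by input():

import io
import sys

real_stdin = sys.stdin
sys.stdin = io.StringIO('to be inputed\n')  # canned input instead of the keyboard
try:
    x = input('insert a value: ')
    print(f'value inserted: {x}')           # prints: value inserted: to be inputed
finally:
    sys.stdin = real_stdin                  # always restore the real stdin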
However, then you've lost your original stdin, so you need to start a new process to listen for user input, since you said:
but sometimes the user will input the value
This needs to be another process rather than a thread since it needs to be killed when you want to restore the original stdin, and killing the process interrupts it mid-readline.
Here is a working version of the code with the mock object implemented. The region inside the with block is where stdin has been replaced.
import sys
import time
import multiprocessing
import threading

class MockStdin:
    def __init__(self):
        self.queue = None
        self.real_stdin = sys.stdin
        self.relay_process = None

    def readline(self):
        # when input() is called, it calls this function
        return self.queue.get()

    def writeline(self, s):
        # for input from elsewhere in the program
        self.queue.put(s)

    def relay_stdin(self):
        # for input from the user
        my_stdin = open(0)  # this is a new process so it needs its own stdin
        try:
            while True:
                inp = my_stdin.readline()
                self.queue.put(inp)
        except KeyboardInterrupt:
            # when killed, exit silently
            pass

    def __enter__(self):
        # when entering the `with` block, replace stdin with self and relay the real stdin
        self.queue = multiprocessing.Queue()
        self.relay_process = multiprocessing.Process(target=self.relay_stdin)
        self.relay_process.start()
        sys.stdin = self

    def __exit__(self, exc_type=None, exc_val=None, exc_tb=None):
        # when exiting the `with` block, put stdin back and stop relaying
        sys.stdin = self.real_stdin
        self.relay_process.terminate()
        self.relay_process.join()

    def __getstate__(self):
        # this is needed for Windows - credit to Leonardo Rick for this fix
        self_dict = self.__dict__.copy()
        del self_dict['real_stdin']
        return self_dict

def threaded(fn):
    def wrapper(*args, **kwargs):
        thread = threading.Thread(target=fn, args=args, kwargs=kwargs, daemon=True)
        thread.start()
        return thread
    return wrapper

if __name__ == '__main__':
    mock = MockStdin()

    @threaded
    def answer():
        time.sleep(2)
        # use mock to write to stdin
        mock.writeline('to be inputed')

    answer()

    with mock:
        # inside `with` block, stdin is replaced
        x = input('insert a value: ')
        print(f'\nvalue inserted: {x}')

    answer()

    # __enter__ and __exit__ can also be used directly instead of a `with` block
    mock.__enter__()
    x = input('insert a value: ')
    print(f'\nvalue inserted: {x}')
    mock.__exit__()

    # now outside the `with` block, stdin is back to normal
    x = input('insert another (stdin should be back to normal now): ')
    print(f'value inserted: {x}')
I am trying to control a 3-axis printer using an Xbox controller. To get inputs from the Xbox controller I have borrowed code from martinohanlon: https://github.com/martinohanlon/XboxController/blob/master/XboxController.py
I have also created code that reads a text file line by line (G-code) to move the printer.
I would like to be able to use the Xbox controller to select a G-code file and run it, then, as the printer is running, continue to listen for a cancel button just in case the print goes wrong. The controller is a threaded class, and my readGcode is a threaded class.
The problem I'm having is that when I use the controller to start the readGcode class, I can't communicate with the controller until that thread has finished.
My temporary solution is to use the controller to select a file, then pass that file's path to the readGcode class. The readGcode class keeps trying to open the file in a try block, failing until the filepath is acceptable. Then it sets a bool which makes it skip further reading until it's done.
Code:
import V2_Controller as Controller
import V2_ReadFile as Read
import time
import sys

# file dialogue
import tkinter as tk
from tkinter import filedialog

# when X is selected on the Xbox controller
def X(xValue):
    if not bool(xValue):
        try:
            f = selectfile()
            rf.setfilename(f)
        except:
            print("failed to select file")

def selectfile():
    try:
        root = tk.Tk()  # opens tkinter
        root.withdraw()  # closes the tkinter window
        return filedialog.askopenfilename()
    except Exception:
        print("no file")

# setup xbox controller
xboxCont = Controller.XboxController(controlCallBack, deadzone=30,
                                     scale=100, invertYAxis=True)

# init the readfile class
rf = Read.Readfile()

# set the custom function for pressing X
xboxCont.setupControlCallback(xboxCont.XboxControls.X, X)

try:
    # start the controller and readfile threads
    xboxCont.start()
    rf.start()
    xboxCont.join()
    rf.join()
    while True:
        time.sleep(1)

# Ctrl C
except KeyboardInterrupt:
    print("User cancelled")

# error
except:
    print("Unexpected error:", sys.exc_info()[0])
    raise

finally:
    # stop the controller
    xboxCont.stop()
    rf.stop()
V2_Readfile
import threading
import time

# Main class for reading the script
class Readfile(threading.Thread):

    # supports all variables needed to read a script
    class readfile:
        fileselect = True
        linecount = 0
        currentline = 0
        commands = []

    # setup readfile class
    def __init__(self):
        # setup threading
        threading.Thread.__init__(self)
        # persist values
        self.running = False
        self.reading = False

    def setfilename(self, filename):
        self.filename = filename

    # called by the thread
    def run(self):
        self._start()

    # start reading
    def _start(self):
        self.running = True
        while self.running:
            time.sleep(1)
            if not self.reading:
                try:
                    self.startread()
                except:
                    pass

    def startread(self):
        try:
            with open(self.filename, "r") as f:  # read a local file
                f1 = f.readlines()

            # run through each line and extract the command from each line
            linecount = 0
            line = []
            for x in f1:
                # read each line into an array
                line.append(x.split(";")[0])
                linecount += 1

            # Store the variables for later use
            self.readfile.linecount = linecount
            self.readfile.commands = line
            self.reading = True
        except Exception:
            pass

        i = 0
        while i < self.readfile.linecount and self.reading:
            self.readfile.currentline = i + 1
            self.readline(i)
            i += 1

        # the following stops the code from reading again
        self.reading = False
        self.filename = ""

    def readline(self, line):
        Sort.sortline(self.readfile.commands[line])

    # stops the controller
    def stop(self):
        self.running = False
You could use a synchronization primitive like threading.Event.
To do so, you need to modify your Readfile class like this:
from threading import Event

class Readfile(threading.Thread):

    # setup readfile class
    def __init__(self):
        # setup threading
        threading.Thread.__init__(self)
        # persist values
        self.running = False
        self.reading = False
        self.reading_file = Event()  # Initialize your event here

    def setfilename(self, filename):
        self.filename = filename
        self.reading_file.set()  # This awakens the reader

    def _start(self):
        self.running = True
        self.reading_file.wait()  # Thread will be blocked until setfilename is called
        self.startread()
Another synchronization primitive worth exploring is queue.Queue. It could be useful if you want to process more than one filename.
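For example, here is a minimal sketch (assuming a hypothetical worker loop, not your actual Readfile code) of handing filenames to the reader thread through a queue.Queue:

import queue
import threading

filenames = queue.Queue()

def reader():
    while True:
        name = filenames.get()       # blocks until a filename is available
        if name is None:             # sentinel used here to shut the thread down
            break
        print(f'processing {name}')  # stand-in for startread()

t = threading.Thread(target=reader, daemon=True)
t.start()
filenames.put('part1.gcode')         # hypothetical filenames from the controller
filenames.put('part2.gcode')
filenames.put(None)                  # tell the reader to stop
t.join()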
The pattern you describe in your question is called Busy Waiting, and should be avoided when possible.
I have implemented multithreaded code in two ways, but in both ways I got an error. Could someone explain what causes the problem?
In version 1, I got an exception saying two arguments were passed to the writekey function instead of one.
In version 2, one of the threads reads an empty line, so an exception is raised while processing the empty string.
I am using locks; shouldn't they prevent multiple threads from accessing the function or file at the same time?
Version 1:
class SomeThread(threading.Thread):
    def __init__(self, somequeue, lockfile):
        threading.Thread.__init__(self)
        self.myqueue = somequeue
        self.myfilelock = lockfile

    def writekey(key):
        if os.path.exists(os.path.join('.', outfile)):
            with open(outfile, 'r') as fc:
                readkey = int(fc.readline().rstrip())
            os.remove(os.path.join('.', outfile))
        with open(outfile, 'w') as fw:
            if readkey > key:
                fw.write(str(readkey))
            else:
                fw.write(str(key))

    def run(self):
        while(True):
            dict = self.myqueue.get()
            self.myfilelock.acquire()
            try:
                self.writekey(dict.get("key"))
            finally:
                self.myfilelock.release()
            self.myqueue.task_done()

populateQueue()  # populate queue with objects
filelock = threading.Lock()
for i in range(threadnum):
    thread = SomeThread(somequeue, filelock)
    thread.setDaemon(True)
    thread.start()
somequeue.join()
Version 2:
def writekey(key):
    if os.path.exists(os.path.join('.', outfile)):
        with open(outfile, 'r') as fc:
            # do something...
        os.remove(os.path.join('.', outfile))
    with open(outfile, 'w') as fw:
        # do something...

class SomeThread(threading.Thread):
    def __init__(self, somequeue, lockfile):
        threading.Thread.__init__(self)
        self.myqueue = somequeue
        self.myfilelock = lockfile

    def run(self):
        while(True):
            dict = self.myqueue.get()
            self.myfilelock.acquire()
            try:
                writekey(dict.get("key"))
            finally:
                myfilelock.release()
            self.myqueue.task_done()

# Same as above ....
In version 1, def writekey(key) should be declared with "self" as the first parameter, i.e.
def writekey(self, key):
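To see why the error mentions two arguments (a small illustration of my own, not your actual class): calling the method on an instance passes the instance itself as an implicit first argument, which the original signature has no room for:

class Demo:
    def writekey(key):   # missing `self`
        pass

Demo().writekey(42)      # TypeError: writekey() takes 1 positional argument but 2 were given
                         # (exact wording varies by Python version)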
The problem in version 2 is less clear. I assume that an empty line is being read while reading outfile. This is normal and it indicates that the end-of-file has been reached. Normally you would just break out of your read loop. Usually it is preferable to read your file line-by-line in a for loop, e.g.
with open(outfile, 'r') as fc:
    for line in fc:
        # process the line
The for loop will terminate naturally upon reaching end-of-file.
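If you do keep an explicit readline() loop instead, here is a minimal sketch (my own illustration) of the break-on-empty-string pattern mentioned above; readline() returns '' only at end-of-file:

with open(outfile, 'r') as fc:
    while True:
        line = fc.readline()
        if not line:                  # '' means end-of-file was reached
            break
        readkey = int(line.rstrip())  # safe to process: the line is not empty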
I'm trying to share an existing object across multiple processes using the proxy methods described here. My multiprocessing idiom is the worker/queue setup, modeled after the 4th example here.
The code needs to do some calculations on data that are stored in rather large files on disk. I have a class that encapsulates all the I/O interactions, and once it has read a file from disk, it saves the data in memory for the next time a task needs to use the same data (which happens often).
I thought I had everything working from reading the examples linked to above. Here is a mock up of the code that just uses numpy random arrays to model the disk I/O:
import numpy
from multiprocessing import Process, Queue, current_process, Lock
from multiprocessing.managers import BaseManager

nfiles = 200
njobs = 1000

class BigFiles:

    def __init__(self, nfiles):
        # Start out with nothing read in.
        self.data = [ None for i in range(nfiles) ]
        # Use a lock to make sure only one process is reading from disk at a time.
        self.lock = Lock()

    def access(self, i):
        # Get the data for a particular file
        # In my real application, this function reads in files from disk.
        # Here I mock it up with random numpy arrays.
        if self.data[i] is None:
            with self.lock:
                self.data[i] = numpy.random.rand(1024,1024)
        return self.data[i]

    def summary(self):
        return 'BigFiles: %d, %d Storing %d of %d files in memory'%(
            id(self), id(self.data),
            (len(self.data) - self.data.count(None)),
            len(self.data) )

# I'm using a worker/queue setup for the multiprocessing:
def worker(input, output):
    proc = current_process().name
    for job in iter(input.get, 'STOP'):
        (big_files, i, ifile) = job
        data = big_files.access(ifile)
        # Do some calculations on the data
        answer = numpy.var(data)
        msg = '%s, job %d'%(proc, i)
        msg += '\n Answer for file %d = %f'%(ifile, answer)
        msg += '\n ' + big_files.summary()
        output.put(msg)

# A class that returns an existing file when called.
# This is my attempted workaround for the fact that Manager.register needs a callable.
class ObjectGetter:

    def __init__(self, obj):
        self.obj = obj

    def __call__(self):
        return self.obj

def main():
    # Prior to the place where I want to do the multiprocessing,
    # I already have a BigFiles object, which might have some data already read in.
    # (Here I start it out empty.)
    big_files = BigFiles(nfiles)
    print 'Initial big_files.summary = ',big_files.summary()

    # My attempt at making a proxy class to pass big_files to the workers
    class BigFileManager(BaseManager):
        pass
    getter = ObjectGetter(big_files)
    BigFileManager.register('big_files', callable = getter)
    manager = BigFileManager()
    manager.start()

    # Set up the jobs:
    task_queue = Queue()
    for i in range(njobs):
        ifile = numpy.random.randint(0, nfiles)
        big_files_proxy = manager.big_files()
        task_queue.put( (big_files_proxy, i, ifile) )

    # Set up the workers
    nproc = 12
    done_queue = Queue()
    process_list = []
    for j in range(nproc):
        p = Process(target=worker, args=(task_queue, done_queue))
        p.start()
        process_list.append(p)
        task_queue.put('STOP')

    # Log the results
    for i in range(njobs):
        msg = done_queue.get()
        print msg

    print 'Finished all jobs'
    print 'big_files.summary = ',big_files.summary()

    # Shut down the workers
    for j in range(nproc):
        process_list[j].join()
    task_queue.close()
    done_queue.close()

main()
This works in the sense that it calculates everything correctly, and it is caching the data that is read along the way. The only problem I'm having is that at the end, the big_files object doesn't have any of the files loaded. The final msg returned is:
Process-2, job 999. Answer for file 198 = 0.083406
BigFiles: 4303246400, 4314056248 Storing 198 of 200 files in memory
But then after it's all done, we have:
Finished all jobs
big_files.summary = BigFiles: 4303246400, 4314056248 Storing 0 of 200 files in memory
So my question is: What happened to all the stored data? It's claiming to be using the same self.data according to the id(self.data). But it's empty now.
I want the end state of big_files to have all the saved data that it accumulated along the way, since I actually have to repeat this entire process many times, so I don't want to have to redo all the (slow) I/O each time.
I'm assuming it must have something to do with my ObjectGetter class. The examples for using BaseManager only show how to make a new object that will be shared, not how to share an existing one. So am I doing something wrong with the way I get the existing big_files object? Can anyone suggest a better way to do this step?
Thanks much!