Is there some way to "capture" all attempted writes to a particular file /my/special/file and instead write that output to a BytesIO or StringIO object, or some other way to get that output without actually writing to disk?
The use case is: there's a 'handler' function whose contract is that it should write its output to /my/special/file. I don't have any control over this handler function: I didn't write it, I don't know its contents, I can't change its contents, and the contract cannot change. I'd like to be able to do something like this:
# 'output' has whatever 'handler' has written to `/my/special/file`
output = handler.run(data)
Even if this is an odd request, I'd still like to be able to do this, even with a 'hackier' answer.
EDIT: my code (and handler) will be invoked many times on a lot of chunks of data, so performance (both latency and throughput) is important.
Thanks.
If you're talking about code in your own Python program, you could monkey-patch the built in open function before that code gets called. Here's a really stupid example, but it shows that you can do this. This causes code that thinks it's writing to a file to instead write into an in-memory buffer. The calling code then prints what the foreign code wrote to the file:
import io

# The function you don't have access to that writes to a file
def foo():
    f = open("/tmp/foo", "w")
    f.write("blahblahblah\n")
    f.close()

# The buffer to contain the captured text
capture_buffer = ""

# My silly file-like object that only handles write(str) and close()
class MyFileClass:
    def write(self, str):
        global capture_buffer
        capture_buffer += str

    def close(self):
        pass

# patch open to return a MyFileClass instance
def my_open2(*args, **kwargs):
    return MyFileClass()

open = my_open2

# Call the target function
foo()

# Print what the function wrote to "the file"
print(capture_buffer)
Result:
blahblahblah
Sorry for not spending more time on this; I'm just showing you it's possible. As others say, a mocking module might be the way to go so you don't have to roll your own thing here. I don't know whether they give you access to what was written, but I'd guess they must, and such a module is just going to do a better job of what I've shown here.
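For what it's worth, a minimal sketch of that with the standard library's mocking support (this assumes Python 3, where it ships as unittest.mock; the foo function is just a stand-in for the code you don't control):

from unittest.mock import patch, mock_open

def foo():
    # stand-in for the code you don't control
    with open("/my/special/file", "w") as f:
        f.write("blahblahblah\n")

m = mock_open()
with patch("builtins.open", m):   # "__builtin__.open" on Python 2 with the mock package
    foo()

# every write() call is recorded on the mock file handle
captured = "".join(call[0][0] for call in m().write.call_args_list)
print(captured)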
If your program does other file IO with open, or whichever method the mystery code uses to open the file, you'd check the incoming path and only return your special object if it was the one path you're interested in. Otherwise, you could just call the original open, which you could stash away under another name.
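A rough sketch of that dispatching idea, reusing the module-level trick from the example above (the path and the buffer/class names are placeholders; code that reaches open through another module won't be affected):

import io

_original_open = open        # stash the real open under another name
_capture = io.StringIO()     # in-memory destination for the special file

class _CaptureFile(object):
    # minimal file-like wrapper so close() doesn't discard the captured text
    def write(self, text):
        return _capture.write(text)
    def close(self):
        pass
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

def _patched_open(path, *args, **kwargs):
    if path == "/my/special/file":
        return _CaptureFile()
    return _original_open(path, *args, **kwargs)

open = _patched_open         # same module-level trick as above

# ... invoke the handler here ...
output = _capture.getvalue()
open = _original_open        # restore when done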
I am writing a little Python script that parses the input from a QR reader (which is seen as a keyboard by the system).
At the moment I am using raw_input() but this function waits for an EOF/end-of-line symbol in order to submit the received string to the program.
I am wondering if there is a way to continuously parse the input string and not just in chunks limited by a line end.
In practice:
- is there a way in Python to asynchronously and continuously parse console input?
- is there a way to change raw_input() (or an equivalent function) to look for a character other than end-of-line before submitting the string read into the program?
It seems like you're generally trying to solve two problems:
Read input in chunks
Parse that input asynchronously
For the first part, it will vary greatly based on the specifics of the input function you're calling, but for standard input you could use something like
sys.stdin.read(1)
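For example, a minimal sketch that accumulates characters until some terminator other than a newline (the ';' terminator here is only an assumption; use whatever your QR reader appends):

import sys

TERMINATOR = ";"   # hypothetical end-of-scan character

def read_code(terminator=TERMINATOR):
    chars = []
    while True:
        ch = sys.stdin.read(1)            # returns one character at a time
        if ch == "" or ch == terminator:  # EOF or terminator ends the scan
            return "".join(chars)
        chars.append(ch)

while True:
    code = read_code()
    if not code:
        break
    print("scanned:", code)

Note that when stdin is an interactive terminal, the TTY driver usually buffers by line, so single characters may not arrive until you also put the terminal into cbreak or raw mode (for example with the standard tty module).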
As for parsing asynchronously, there are a number of approaches you could take. Plain reads from standard input block, so you will need some form of concurrency. Manually spawning a separate worker process with the subprocess module is one option. You could also use something like Redis or some lightweight job queue to push input chunks onto and have them read and processed by another background script. Finally, gevent is a very popular coroutine-based library for running tasks asynchronously. Using gevent, the whole setup would look something like this:
import sys
import gevent

class QRLoader(object):
    def __init__(self):
        self.data = []

    def add_data(self, data):
        self.data.append(data)
        # if self.data constitutes a full QR code,
        # hand it off for asynchronous processing
        gevent.spawn(parse_async)

def parse_async():
    # do something with qr_loader.data
    pass

qr_loader = QRLoader()

while True:
    data = sys.stdin.read(1)
    if data:
        qr_loader.add_data(data)
I have a Python HTTP server; on a certain GET request a file is created, which is then returned as the response. Creating the file might take a second, and so might modifying (updating) it.
Hence, I cannot return the file as the response immediately. How do I approach such a problem? Currently I have a solution like this:
while not os.path.isfile('myfile'):
    time.sleep(0.1)
return myfile
This seems very inconvenient, but is there a possibly better way?
A simple notification would do, but I don't have control over the process which creates/updates the files.
You could use Watchdog for a nicer way to watch the file system.
Something like this will remove the os call, assuming your own code sets an updating flag:
updating = True

while updating:
    time.sleep(0.1)
return myfile

...

def updateFile():
    global updating
    # update the file here
    updating = False
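If you go the Watchdog route mentioned above, a minimal sketch might look like this (the watched directory and file name are placeholders, and the watchdog package must be installed):

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class CreatedHandler(FileSystemEventHandler):
    def __init__(self):
        self.created = False

    def on_created(self, event):
        if event.src_path.endswith("myfile"):
            self.created = True

handler = CreatedHandler()
observer = Observer()
observer.schedule(handler, path="/watched/dir", recursive=False)
observer.start()
try:
    while not handler.created:
        time.sleep(0.1)
finally:
    observer.stop()
    observer.join()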
Implementing blocking IO operations inside synchronous HTTP requests is a bad approach. If many people run the same procedure simultaneously you may soon run out of threads (if there is a limited thread pool). I'd do the following:
A client requests the file-creation URI. A file-generating procedure is started in a background process (some asynchronous task system), and the client gets a file id/name in the HTTP response. Next, the client makes AJAX calls every once in a while (polling) to check whether the file has been created/modified (a separate file-serve/check-if-exists URI). When the file is finally created, the client is redirected (js window.location) to the file-serving URI.
This approach requires a bit more work, but it will eventually pay off.
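A rough sketch of that flow with plain Django views (the helper build_report, the URL wiring, and the in-memory STATUS dict are all placeholder assumptions; a real deployment would use a task queue and a persistent store):

import threading, uuid
from django.http import JsonResponse, FileResponse

STATUS = {}   # hypothetical in-memory status store

def generate_file(file_id):
    # placeholder for the slow file-building work
    build_report("/tmp/%s.csv" % file_id)   # assumed helper
    STATUS[file_id] = "done"

def start_generation(request):
    file_id = uuid.uuid4().hex
    STATUS[file_id] = "pending"
    threading.Thread(target=generate_file, args=(file_id,)).start()
    return JsonResponse({"file_id": file_id})

def check_status(request, file_id):
    return JsonResponse({"status": STATUS.get(file_id, "unknown")})

def serve_file(request, file_id):
    return FileResponse(open("/tmp/%s.csv" % file_id, "rb"))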
You can try using os.path.getmtime: check the modification time of the file and return it only if it was modified less than 1 second ago. Also, I suggest you only make a limited number of tries, or you will be stuck in an infinite loop if the file never gets created/modified. And as @Krzysztof Rosiński pointed out, you should probably think about doing it in a non-blocking way.
import os
import time
from datetime import datetime

def wait_for_recent_file(file_path, tries=10):
    for i in range(tries):
        try:
            dif = datetime.now() - datetime.fromtimestamp(os.path.getmtime(file_path))
            if dif.total_seconds() < 1:
                return file_path   # modified less than a second ago
        except OSError:
            pass                   # the file doesn't exist yet
        time.sleep(0.1)
In my Django app I have a report (a csv download) that can take some time to run. When a user runs the report they are redirected to a 'processing' page where a javascript function checks the server every second to see if the csv has been created (the file name is included in the HttpResponse object).
What I'm looking for is a way of identifying the thread that's creating the csv. That way I can add an estimated_time_to_completion attribute to the thread and include this info in the holding page. In fact, I could then stop checking for the existence of the (unlocked) csv; I could just ask the thread if it's finished.
My csv building thread looks something like -
class CsvBuilder(threading.Thread):
    def __init__(self, file_name, parameters):
        self.file_name = file_name
        self.parameters = parameters
        threading.Thread.__init__(self)

    def run(self):
        # ...
        file = open(self.file_name, 'wb')
        writer = csv.writer(file)
        for patient in patients:
            writer.writerow('some data')
            self.time_remaining = # a timedelta object
        file.close()
And then my django requests will look something like -
def create_csv(request):
    '''
    Standard django view to create a csv
    '''
    # get filename and parameters from request
    thread = CsvBuilder(file_name, parameters)
    thread.start()
    return render_to_response('processing.html', {"thread_id": thread.thread_id})

def check_progress(request):
    '''
    An ajax call to check the progress on a report
    '''
    thread_id = request.GET['thread_id']
    # find the thread
    return HttpResponse(thread.time_remaining)
Is this possible? Or should I be going about this a different way?
It's probably easiest and safest to use a dedicated background task library; they are designed for use cases like this. The most common one for Python is Celery. It has good Django support and is very easy to use.
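A minimal sketch of how that could look (the task body and the URL wiring are placeholders; Celery also needs a broker such as Redis or RabbitMQ configured):

# tasks.py
from celery import shared_task

@shared_task(bind=True)
def build_csv(self, file_name, parameters):
    # build the csv here, reporting progress as you go
    self.update_state(state="PROGRESS", meta={"rows_done": 0})
    # ... write rows to file_name ...

# views.py
from celery.result import AsyncResult
from django.http import HttpResponse
from .tasks import build_csv

def create_csv(request):
    # file_name and parameters come from the request, as in the question
    result = build_csv.delay(file_name, parameters)
    return HttpResponse(result.id)

def check_progress(request):
    result = AsyncResult(request.GET['task_id'])
    return HttpResponse(result.state)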
I'd suggest you have your writer function update a memcached key/value pair for the time_remaining calculations.
If it were me, I'd probably have used Celery for the long-running job; starting a thread from Django seems like it could have pitfalls, but nothing specific springs to mind.
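For the memcached suggestion above, a rough sketch using Django's cache framework (the cache key scheme here is just an assumption):

from django.core.cache import cache
from django.http import HttpResponse

# inside CsvBuilder.run(), after each chunk of work:
cache.set("csv_progress_%s" % self.file_name, str(self.time_remaining), timeout=600)

# in the ajax view:
def check_progress(request):
    remaining = cache.get("csv_progress_%s" % request.GET['file_name'], "unknown")
    return HttpResponse(remaining)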
I currently have a Python application where newline-terminated ASCII strings are being transmitted to me via a TCP/IP socket. I have a high data rate of these strings and I need to parse them as quickly as possible. Currently, the strings are being transmitted as CSV and if the data rate is high enough, my Python application starts to lag behind the input data rate (probably not all that surprising).
The strings look something like this:
chan,2007-07-13T23:24:40.143,0,0188878425-079,0,0,True,S-4001,UNSIGNED_INT,name1,module1,...
I have a corresponding object that will parse these strings and store all of the data into an object. Currently the object looks something like this:
class ChanVal(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])

    def parseFromCsv(self, csvString):
        lst = csvString.split(',')
        self.eventTime = lst[1]
        self.eventTimeExact = long(lst[2])
        self.other_clock = lst[3]
        ...
To read the data in from the socket, I'm using a basic "socket.socket(socket.AF_INET,socket.SOCK_STREAM)" (my app is the server socket) and then I'm using the "select.poll()" object from the "select" module to constantly poll the socket for new input using its "poll(...)" method.
I have some control over the process sending the data (meaning I can get the sender to change the format), but it would be really convenient if we could speed up the ASCII processing enough to not have to use fixed-width or binary formats for the data.
So up until now, here are the things I've tried that haven't really made much of a difference:
Using the string "split" method and then indexing the list of results directly (see above), but "split" seems to be really slow.
Using the "reader" object in the "csv" module to parse the strings
Changing the strings being sent to a string format that I can use to directly instantiate an object via "eval" (e.g. sending something like "ChanVal(eventTime='2007-07-13T23:24:40.143',eventTimeExact=0,...)")
I'm trying to avoid going to a fixed-width or binary format, though I realize those would probably ultimately be much faster.
Ultimately, I'm open to suggestions on better ways to poll the socket, better ways to format/parse the data (though hopefully we can stick with ASCII) or anything else you can think of.
Thanks!
You can't make Python faster. But you can make your Python application faster.
Principle 1: Do Less.
You can't do less input parsing overall, but you can do less input parsing in the process that's also reading the socket and doing everything else with the data.
Generally, do this.
Break your application into a pipeline of discrete steps.
Read the socket, break into fields, create a named tuple, write the tuple to a pipe with something like pickle.
Read a pipe (with pickle) to construct the named tuple, do some processing, write to another pipe.
Read a pipe, do some processing, write to a file or something.
Each of these three processes, connected with OS pipes, runs concurrently. That means the first process is reading the socket and making tuples while the second process is consuming tuples and doing calculations, and the third process is doing its calculations and writing a file.
This kind of pipeline maximizes what your CPU can do. Without too many painful tricks.
Reading and writing to pipes is trivial, since Linux assures you that sys.stdin and sys.stdout will be pipes when the shell creates the pipeline.
Before doing anything else, break your program into pipeline stages.
proc1.py

import sys
import cPickle
from collections import namedtuple

ChanVal = namedtuple('ChanVal', ['eventTime', 'eventTimeExact', 'other_clock', ... ])

# 'lines' stands for a file-like wrapper around the socket, e.g. connection.makefile()
for line in lines:
    c = ChanVal(*line.strip().split(','))
    cPickle.dump(c, sys.stdout)
proc2.py

import sys
import cPickle
from collections import namedtuple

ChanVal = namedtuple('ChanVal', ['eventTime', 'eventTimeExact', 'other_clock', ... ])

while True:
    item = cPickle.load(sys.stdin)   # raises EOFError when proc1 exits
    # processing
    cPickle.dump(item, sys.stdout)
This idea of processing namedtuples through a pipeline is very scalable.
python proc1.py | python proc2.py
You need to profile your code to find out where the time is being spent.
That doesn't necessarily mean using Python's profiler.
For example, you can just try parsing the same csv string 1,000,000 times with different methods. Choose the fastest method and divide by 1,000,000; now you know how much CPU time it takes to parse one string.
Try to break the program into parts and work out what resources are really required by each part.
The parts that need the most CPU per input line are your bottlenecks.
On my computer, the program below outputs this
ChanVal0 took 0.210402965546 seconds
ChanVal1 took 0.350302934647 seconds
ChanVal2 took 0.558166980743 seconds
ChanVal3 took 0.691503047943 seconds
So you see that about half the time there is taken up by parseFromCsv. But also that quite a lot of time is taken extracting the values and storing them in the class.
If the class isn't used right away it might be faster to store the raw data and use properties to parse the csvString on demand.
from time import time
import re

class ChanVal0(object):
    def __init__(self, csvString=None, **kwargs):
        self.csvString = csvString
        for key in kwargs:
            setattr(self, key, kwargs[key])

class ChanVal1(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])
    def parseFromCsv(self, csvString):
        self.lst = csvString.split(',')

class ChanVal2(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])
    def parseFromCsv(self, csvString):
        lst = csvString.split(',')
        self.eventTime = lst[1]
        self.eventTimeExact = long(lst[2])
        self.other_clock = lst[3]

class ChanVal3(object):
    splitter = re.compile("[^,]*,(?P<eventTime>[^,]*),(?P<eventTimeExact>[^,]*),(?P<other_clock>[^,]*)")
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        self.__dict__.update(kwargs)
    def parseFromCsv(self, csvString):
        self.__dict__.update(self.splitter.match(csvString).groupdict())

s = "chan,2007-07-13T23:24:40.143,0,0188878425-079,0,0,True,S-4001,UNSIGNED_INT,name1,module1"

RUNS = 100000

for cls in ChanVal0, ChanVal1, ChanVal2, ChanVal3:
    start_time = time()
    for i in xrange(RUNS):
        cls(s)
    print "%s took %s seconds" % (cls.__name__, time() - start_time)