Why should asyncio.StreamWriter.drain be explicitly called?

Why should asyncio.StreamWriter.drain be explicitly called? - python

From doc:
write(data)
Write data to the stream.
This method is not subject to flow control. Calls to write() should be followed by drain().
coroutine drain()
Wait until it is appropriate to resume writing to the stream. Example:
writer.write(data)
await writer.drain()
From what I understand,
You need to call drain every time write is called.
If not I guess, write will block the loop thread
Then why is write not a coroutine that calls it automatically? Why would one call write without having to drain? I can think of two cases
You want to write and close immediately
You have to buffer some data before the message it is complete.
First one is a special case, I think we can have a different API. Buffering should be handled inside write function and application should not care.
Let me put the question differently. What is the drawback of doing this? Does the python3.8 version effectively do this?
async def awrite(writer, data):
writer.write(data)
await writer.drain()
Note: drain doc explicitly states the below:
When there is nothing to wait for, the drain() returns immediately.
Reading the answer and links again, I think the functions work like this. Note: Check accepted answer for more accurate version.
def write(data):
remaining = socket.try_write(data)
if remaining:
_pendingbuffer.append(remaining) # Buffer will keep growing if other side is slow and we have a lot of data
async def drain():
if len(_pendingbuffer) < BUF_LIMIT:
return
await wait_until_other_side_is_up_to_speed()
assert len(_pendingbuffer) < BUF_LIMIT
async def awrite(writer, data):
writer.write(data)
await writer.drain()
So when to use what:
When the data is not continuous, Like responding to an HTTP request. We just need to send some data and don't care about when it is reached and memory is not a concern - Just use write
Same as above but memory is a concern, use awrite
When streaming data to a large number of clients (e.g. some live stream or a huge file). If the data is duplicated in each of the connection's buffers, it will definitely overflow RAM. In this case, write a loop that takes a chunk of data each iteration and call awrite. In case of a huge file, loop.sendfile is better if available.

From what I understand, (1) You need to call drain every time write is called. (2) If not I guess, write will block the loop thread
Neither is correct, but the confusion is quite understandable. The way write() works is as follows:
A call to write() just stashes the data to a buffer, leaving it to the event loop to actually write it out at a later time, and without further intervention by the program. As far as the application is concerned, the data is written in the background as fast as the other side is capable of receiving it. In other words, each write() will schedule its data to be transferred using as many OS-level writes as it takes, with those writes issued when the corresponding file descriptor is actually writable. All this happens automatically, even without ever awaiting drain().
write() is not a coroutine, and it absolutely never blocks the event loop.
The second property sounds convenient - you can call write() wherever you need to, even from a function that's not async def - but it's actually a major flaw of write(). Writing as exposed by the streams API is completely decoupled from the OS accepting the data, so if you write data faster than your network peer can read it, the internal buffer will keep growing and you'll have a memory leak on your hands. drain() fixes that problem: awaiting it pauses the coroutine if the write buffer has grown too large, and resumes it again once the os.write()'s performed in the background are successful and the buffer shrinks.
You don't need to await drain() after every write, but you do need to await it occasionally, typically between iterations of a loop in which write() is invoked. For example:
while True:
response = await peer1.readline()
peer2.write(b'<response>')
peer2.write(response)
peer2.write(b'</response>')
await peer2.drain()
drain() returns immediately if the amount of pending unwritten data is small. If the data exceeds a high threshold, drain() will suspend the calling coroutine until the amount of pending unwritten data drops beneath a low threshold. The pause will cause the coroutine to stop reading from peer1, which will in turn cause the peer to slow down the rate at which it sends us data. This kind of feedback is referred to as back-pressure.
Buffering should be handled inside write function and application should not care.
That is pretty much how write() works now - it does handle buffering and it lets the application not care, for better or worse. Also see this answer for additional info.
Addressing the edited part of the question:
Reading the answer and links again, I think the the functions work like this.
write() is still a bit smarter than that. It won't try to write only once, it will actually arrange for data to continue to be written until there is no data left to write. This will happen even if you never await drain() - the only thing the application must do is let the event loop run its course for long enough to write everything out.
A more correct pseudo code of write and drain might look like this:
class ToyWriter:
def __init__(self):
self._buf = bytearray()
self._empty = asyncio.Event(True)
def write(self, data):
self._buf.extend(data)
loop.add_writer(self._fd, self._do_write)
self._empty.clear()
def _do_write(self):
# Automatically invoked by the event loop when the
# file descriptor is writable, regardless of whether
# anyone calls drain()
while self._buf:
try:
nwritten = os.write(self._fd, self._buf)
except OSError as e:
if e.errno == errno.EWOULDBLOCK:
return # continue once we're writable again
raise
self._buf = self._buf[nwritten:]
self._empty.set()
loop.remove_writer(self._fd, self._do_write)
async def drain(self):
if len(self._buf) > 64*1024:
await self._empty.wait()
The actual implementation is more complicated because:
it's written on top of a Twisted-style transport/protocol layer with its own sophisticated flow control, not on top of os.write;
drain() doesn't really wait until the buffer is empty, but until it reaches a low watermark;
exceptions other than EWOULDBLOCK raised in _do_write are stored and re-raised in drain().
The last point is another good reason to call drain() - to actually notice that the peer is gone by the fact that writing to it is failing.

Related

Segmentation fault when initializing array

I am getting a segmentation fault when initializing an array.
I have a callback function from when an RFID tag gets read
IDS = []
def readTag(e):
epc = str(e.epc, 'utf-8')
if not epc in IDS:
now = datetime.datetime.now().strftime('%m/%d/%Y %H:%M:%S')
IDS.append([epc, now, "name.instrument"])
and a main function from which it's called
def main():
for x in vals:
IDS.append([vals[0], vals[1], vals[2]])
for x in IDS:
print(x[0])
r = mercury.Reader("tmr:///dev/ttyUSB0", baudrate=9600)
r.set_region("NA")
r.start_reading(readTag, on_time=1500)
input("press any key to stop reading: ")
r.stop_reading()
The error occurs because of the line IDS.append([epc, now, "name.instrument"]). I know because when I replace it with a print call instead the program will run just fine. I've tried using different types for the array objects (integers), creating an array of the same objects outside of the append function, etc. For some reason just creating an array inside the "readTag" function causes the segmentation fault like row = [1,2,3]
Does anyone know what causes this error and how I can fix it? Also just to be a little more specific, the readTag function will work fine for the first two (only ever two) calls, but then it crashes and the Reader object that has the start_reading() function is from the mercury-api

This looks like a scoping issue to me; the mercury library doesn't have permission to access your list's memory address, so when it invokes your callback function readTag(e) a segfault occurs. I don't think that the behavior that you want is supported by that library

To extend Michael's answer, this appears to be an issue with scoping and the API you're using. In general pure-Python doesn't seg-fault. Or at least, it shouldn't seg-fault unless there's a bug in the interpreter, or some extension that you're using. That's not to say pure-Python won't break, it's just that a genuine seg-fault indicates the problem is probably the result of something messy outside of your code.
I'm assuming you're using this Python API.
In that case, the README.md mentions that the Reader.start_reader() method you're using is "asynchronous". Meaning it invokes a new thread or process and returns immediately and then the background thread continues to call your callback each time something is scanned.
I don't really know enough about the nitty gritty of CPython to say exactly what going on, but you've declared IDS = [] as a global variable and it seems like the background thread is running the callback with a different context to the main program. So when it attempts to access IDS it's reading memory it doesn't own, hence the seg-fault.
Because of how restrictive the callback is and the apparent lack of a buffer, this might be an oversight on the behalf of the developer. If you really need asynchronous reads it's worth sending them an issue report.
Otherwise, considering you're just waiting for input you probably don't need the asynchronous reads, and you could use the synchronous Reader.read() method inside your own busy loop instead with something like:
try:
while True:
readTags(r.read(timeout=10))
except KeyboardInterrupt: ## break loop on SIGINT (Ctrl-C)
pass
Note that r.read() returns a list of tags rather than just one, so you'd need to modify your callback slightly, and if you're writing more than just a quick script you probably want to use threads to interrupt the loop properly as SIGINT is pretty hacky.

Python Memory leak using Yocto

I'm running a python script on a raspberry pi that constantly checks on a Yocto button and when it gets pressed it puts data from a different sensor in a database.
a code snippet of what constantly runs is:
#when all set and done run the program
Active = True
while Active:
if ResponseType == "b":
while Active:
try:
if GetButtonPressed(ResponseValue):
DoAllSensors()
time.sleep(5)
else:
time.sleep(0.5)
except KeyboardInterrupt:
Active = False
except Exception, e:
print str(e)
print "exeption raised continueing after 10seconds"
time.sleep(10)
the GetButtonPressed(ResponseValue) looks like the following:
def GetButtonPressed(number):
global buttons
if ModuleCheck():
if buttons[number - 1].get_calibratedValue() < 300:
return True
else:
print "module not online"
return False
def ModuleCheck():
global moduleb
return moduleb.isOnline()
I'm not quite sure about what might be going wrong. But it takes about an hour before the RPI runs out of memory.
The memory increases in size constantly and the button is only pressed once every 15 minutes or so.
That already tells me that the problem must be in the code displayed above.

The problem is that the yocto_api.YAPI object will continue to accumulate _Event objects in its _DataEvents dict (a class-wide attribute) until you call YAPI.YHandleEvents. If you're not using the API's callbacks, it's easy to think (I did, for hours) that you don't need to ever call this. The API docs aren't at all clear on the point:
If your program includes significant loops, you may want to include a call to this function to make sure that the library takes care of the information pushed by the modules on the communication channels. This is not strictly necessary, but it may improve the reactivity of the library for the following commands.
I did some playing around with API-level callbacks before I decided to periodically poll the sensors in my own code, and it's possible that some setting got left enabled in them that is causing these events to accumulate. If that's not the case, I can't imagine why they would say calling YHandleEvents is "not strictly necessary," unless they make ARM devices with unlimited RAM in Switzerland.
Here's the magic static method that thou shalt call periodically, no matter what. I'm doing so once every five seconds and that is taking care of the problem without loading down the system at all. API code that would accumulate unwanted events still smells to me, but it's time to move on.
#noinspection PyUnresolvedReferences
#staticmethod
def HandleEvents(errmsgRef=None):
"""
Maintains the device-to-library communication channel.
If your program includes significant loops, you may want to include
a call to this function to make sure that the library takes care of
the information pushed by the modules on the communication channels.
This is not strictly necessary, but it may improve the reactivity
of the library for the following commands.
This function may signal an error in case there is a communication problem
while contacting a module.
#param errmsg : a string passed by reference to receive any error message.
#return YAPI.SUCCESS when the call succeeds.
On failure, throws an exception or returns a negative error code.
"""
errBuffer = ctypes.create_string_buffer(YAPI.YOCTO_ERRMSG_LEN)
#noinspection PyUnresolvedReferences
res = YAPI._yapiHandleEvents(errBuffer)
if YAPI.YISERR(res):
if errmsgRef is not None:
#noinspection PyAttributeOutsideInit
errmsgRef.value = YByte2String(errBuffer.value)
return res
while len(YAPI._DataEvents) > 0:
YAPI.yapiLockFunctionCallBack(errmsgRef)
if not (len(YAPI._DataEvents)):
YAPI.yapiUnlockFunctionCallBack(errmsgRef)
break
ev = YAPI._DataEvents.pop(0)
YAPI.yapiUnlockFunctionCallBack(errmsgRef)
ev.invokeData()
return YAPI.SUCCESS

Misuse of yield

I'm making a SocketServer that will need to be able to handle a lot of commands. So to keep my RequestHandler from becoming too long it will call different functions depening on the command. My dilemma is how to make it send info back to the client.
Currently I'm making the functions "yield" everything it wants to send back to the client. But I'm thinking it's probably not the pythonic way.
# RequestHandler
func = __commands__.get(command, unkown_command)
for message in func():
self.send(message)
# example_func
def example():
yield 'ip: {}'.format(ip)
yield 'count: {}'.format(count)
. . .
for ping in pinger(ip,count):
yield ping
Is this an ugly use of yield? The only alterative I can think of is if when the RequestHandler calls the function is passes itself as an argument
func(self)
and then in the function
def example(handler):
. . .
handler.send('ip: {}'.format(ip))
But this way doesn't feel much better.

def example():
yield 'ip: {}'.format(ip)
yield 'count: {}'.format(count)
What strikes me as strange in this solution is not the use of yield itself (which can be perfectly valid) but the fact that you're losing a lot of information by turning your data into strings prematurely.
In particular, for this kind of data, simply returning a dictionary and handling the sending in the caller seems more readable:
def example():
return {'ip': ip, 'count': count}
This also helps you separate content and presentation, which might be useful if you want, for example, to return data encoded in XML but later switch to JSON.
If you want to yield intermediate data, another possibility is using tuples: yield ('ip', ip). In this way you keep the original data and can start processing the values immediately outside the function

I do the same as you with yield. The reason for this is simple:
With yield the main loop can easily handle the case that sending data to one socket will block. Each socket gets a buffer for outgoing data that you fill with the yield. The main loop tries to send as much of that as possible to the socket and when it blocks it records how far it got in the buffer and waits for the socket to be ready for more. When the buffer is empty is runs next(func) to get the next chunk of data.
I don't see how you would do that with handler.send('ip: {}'.format(ip)). When that socket blocks you are stuck. You can't pause that send and handle other sockets easily.
Now for this to be useful there are some assumptions:
the data each yield sends is considerable and you don't want to generate all of it into one massive buffer ahead of time
generating the data for each yield takes time and you want to already send the finished parts
you want to use reply = yield data waiting for the peer to respond to the data in some way. Yes, you can make this a back and forth. next(func) becomes func.send(reply).
Any of these is a good reason to go the yield way or coroutines in general. The alternative seems to be to use one thread per socket.
Note: the func can also call other generators using yield from. Makes it easy to split a large problem into smaller handlers and to share common parts.

how to properly close a tweepy stream

I'm trying to figure out how to properly close an asynchronous tweepy stream.
The tweepy streaming module can be found here.
I start the stream like this:
stream = Stream(auth, listener)
stream.filter(track=['keyword'], async=True)
When closing the application, I try to close the stream as simple as:
stream.disconnect()
This method seems to work as intended but it seems to have one problem:
the stream thread is still in the middle of the loop (waiting/handling tweets) and is not killed until the next loop, so when the stream receives a tweet even after the app has closed, it still tries to call the listener object (this can be seen with a simple print syntax on the listener object). I'm not sure if this is a bad thing or if it can simply be ignored.
I have 2 questions:
Is this the best way to close the stream or should I take a different approach?
Shouldn't the async thread be created as a daemon thread?

I had the same problem. I fixed it with restarting the script. Tweepy Stream doesn't stop until the next incoming tweet.
Example:
import sys
import os
python=sys.executable
time.sleep(10)
print "restart"
os.execl(python,python,*sys.argv)
I didn't find another solution.

I am not positive that it applies to your situation, but in general you can have applicable entities clean up after themselves by putting them in a with block:
with stream = Stream(auth, listener):
stream.filter(track=['keyword'], async=True)
# ...
# Outside the with-block; stream is automatically disposed of.
What "disposed of" actually means, it that the entities __exit__ function is called.
Presumably tweepy will have overridden that to Do The Right Thing.
As #VooDooNOFX suggests, you can check the source to be sure.

This is by design. Looking at the source, you will notice that disconnect has no immediate termination option.
def disconnect(self):
if self.running is False:
return
self.running = False
When calling disconnect(), it simply sets self.running = False, which is then checked on the next loop of the _run method
You can ignore this side effect.

Instead of restarting the script, as #burkay suggests, I finally deleted the Stream object and started a new one. In my example, someone wants to add a new user to be followed, so I update the track list this way:
stream.disconnect() # that should wait until next tweet, so let's delete it
del stream
# now, create a new object
stream = tweepy.Stream( auth=api.auth, listener=listener )
stream.userstream( track=all_users(), async=True )

Overriding basic signals (SIGINT, SIGQUIT, SIGKILL??) in Python

I'm writing a program that adds normal UNIX accounts (i.e. modifying /etc/passwd, /etc/group, and /etc/shadow) according to our corp's policy. It also does some slightly fancy stuff like sending an email to the user.
I've got all the code working, but there are three pieces of code that are very critical, which update the three files above. The code is already fairly robust because it locks those files (ex. /etc/passwd.lock), writes to to a temporary files (ex. /etc/passwd.tmp), and then, overwrites the original file with the temporary. I'm fairly pleased that it won't interefere with other running versions of my program or the system useradd, usermod, passwd, etc. programs.
The thing that I'm most worried about is a stray ctrl+c, ctrl+d, or kill command in the middle of these sections. This has led me to the signal module, which seems to do precisely what I want: ignore certain signals during the "critical" region.
I'm using an older version of Python, which doesn't have signal.SIG_IGN, so I have an awesome "pass" function:
def passer(*a):
pass
The problem that I'm seeing is that signal handlers don't work the way that I expect.
Given the following test code:
def passer(a=None, b=None):
pass
def signalhander(enable):
signallist = (signal.SIGINT, signal.SIGQUIT, signal.SIGABRT, signal.SIGPIPE, signal.SIGALRM, signal.SIGTERM, signal.SIGKILL)
if enable:
for i in signallist:
signal.signal(i, passer)
else:
for i in signallist:
signal.signal(i, abort)
return
def abort(a=None, b=None):
sys.exit('\nAccount was not created.\n')
return
signalhander(True)
print('Enabled')
time.sleep(10) # ^C during this sleep
The problem with this code is that a ^C (SIGINT) during the time.sleep(10) call causes that function to stop, and then, my signal handler takes over as desired. However, that doesn't solve my "critical" region problem above because I can't tolerate whatever statement encounters the signal to fail.
I need some sort of signal handler that will just completely ignore SIGINT and SIGQUIT.
The Fedora/RH command "yum" is written is Python and does basically exactly what I want. If you do a ^C while it's installing anything, it will print a message like "Press ^C within two seconds to force kill." Otherwise, the ^C is ignored. I don't really care about the two second warning since my program completes in a fraction of a second.
Could someone help me implement a signal handler for CPython 2.3 that doesn't cause the current statement/function to cancel before the signal is ignored?
As always, thanks in advance.
Edit: After S.Lott's answer, I've decided to abandon the signal module.
I'm just going to go back to try: except: blocks. Looking at my code there are two things that happen for each critical region that cannot be aborted: overwriting file with file.tmp and removing the lock once finished (or other tools will be unable to modify the file, until it is manually removed). I've put each of those in their own function inside a try: block, and the except: simply calls the function again. That way the function will just re-call itself in the event of KeyBoardInterrupt or EOFError, until the critical code is completed.
I don't think that I can get into too much trouble since I'm only catching user provided exit commands, and even then, only for two to three lines of code. Theoretically, if those exceptions could be raised fast enough, I suppose I could get the "maximum reccurrsion depth exceded" error, but that would seem far out.
Any other concerns?
Pesudo-code:
def criticalRemoveLock(file):
try:
if os.path.isFile(file):
os.remove(file)
else:
return True
except (KeyboardInterrupt, EOFError):
return criticalRemoveLock(file)
def criticalOverwrite(tmp, file):
try:
if os.path.isFile(tmp):
shutil.copy2(tmp, file)
os.remove(tmp)
else:
return True
except (KeyboardInterrupt, EOFError):
return criticalOverwrite(tmp, file)

There is no real way to make your script really save. Of course you can ignore signals and catch a keyboard interrupt using try: except: but it is up to your application to be idempotent against such interrupts and it must be able to resume operations after dealing with an interrupt at some kind of savepoint.
The only thing that you can really to is to work on temporary files (and not original files) and move them after doing the work into the final destination. I think such file operations are supposed to be "atomic" from the filesystem prospective. Otherwise in case of an interrupt: restart your processing from start with clean data.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.