I was looking at some Python code the other day and came across this:
s.sendall(req % (len(body), body))
in which len(body) resolved to over 500K bytes. This was being sent to an Apache server, which I expected to cap the request at 8190 bytes (it did when I tried to issue a similar request using C's write() function). So what is so special about sendall() in Python?
It doesn't matter if you're sending data to Apache or anything else. The software on the remote end of the socket we're talking about has essentially no direct impact on the difference in behavior between write(2) and socket.sendall.
The difference is that write(2) writes as many bytes as it can, then returns an integer indicating how many it wrote. It can't write more than you pass it, of course. But it might write fewer. There may not be room in the kernel send buffer for all of the bytes passed to it.
Contrast this with socket.sendall which writes all the bytes you pass to it. It does this by calling write(2) multiple times, if necessary.
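Conceptually, socket.sendall is just a loop around socket.send, something like this (a simplified sketch; CPython's actual implementation is in C and also deals with timeouts and interrupted system calls):

    def sendall_sketch(sock, data):
        # Keep calling send() until every byte has been handed to the kernel
        total = 0
        while total < len(data):
            sent = sock.send(data[total:])   # may accept fewer bytes than offered
            if sent == 0:
                raise RuntimeError("socket connection broken")
            total += sent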
Related
I have been trying to do some coding with sockets recently and found out that I often get broken pipe errors, especially when working with bad connections.
In an attempt to solve this problem I made sure to sleep after every socket operation. It works but it is slow and ugly.
Is there any other way of making a socket connection more stable?
...server and client getting out of sync
Basically you say that your application is buggy. And the way to make the connection more stable is therefore to fix these bugs, not to work around them with some explicit sleep.
While you don't show any code, a common cause of "getting out of sync" is the assumption that a send on one side is matched exactly by a recv on the other side. Another common assumption is that send will actually send all data given and recv(n) will receive exactly n bytes of data.
All of these assumptions are wrong. TCP is not a message-based protocol but a byte stream. Any message semantics need to be explicitly added on top of this byte stream, for example by prefixing messages with a length, by having a unique message separator, or by having a fixed message size. And the results of send and recv need to be checked to be sure that all data have been sent or all expected data have been received; if not, more send or recv calls need to be made until all data are processed.
Adding some sleep often seems to "fix" some of these problems by basically adding "time" as a message separator. But it is not a real fix: it hurts performance, and it is not 100% reliable either.
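For example, message framing with a length prefix takes only a few helpers (a minimal sketch; the function names send_msg, recv_exact and recv_msg are mine, not part of the socket module):

    import struct

    def send_msg(sock, payload):
        # Prefix each message with a 4-byte big-endian length
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_exact(sock, n):
        # Loop until exactly n bytes have arrived
        chunks = []
        while n > 0:
            chunk = sock.recv(n)
            if not chunk:
                raise ConnectionError("connection closed mid-message")
            chunks.append(chunk)
            n -= len(chunk)
        return b"".join(chunks)

    def recv_msg(sock):
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        return recv_exact(sock, length)

With this in place, one send_msg on one side is matched by exactly one recv_msg on the other, regardless of how TCP splits the bytes in transit.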
I've been using Python's sockets for a long time and I can tell that as long as your code (which you unfortunately didn't provide) is clean and synchronized in itself, you shouldn't get any problems. I use sockets for small applications where I don't necessarily want/need to write/use an API, and it works like a dream.
As @Steffen already mentioned in his answer, TCP is not a message-based protocol. It is a "stream oriented protocol", which means that it sends data byte-by-byte and not message-by-message.
Take a look at this thread and this paper to get a better understanding of the differences.
I would also suggest taking a look at this great answer to learn how to sync your messages between your server and your client(s).
I have a C++ program which reads frames from a high-speed camera and writes each frame to a socket (a Unix socket). Each write is 4096 bytes. Each frame is roughly 5 MB. (There is no guarantee that the frame size will be constant, but it is always a multiple of 4096 bytes.)
There is a Python script which reads from the socket: 10 * 4096 bytes at each call of recv. I often get unexpected behavior which I think boils down to understanding the following about sockets. I believe both of my programs are writing/receiving in blocking mode.
Can I write a whole frame in one go (a write call with 5 MB of data)? Is it recommended? Speed is a major concern here.
If the Python client fails to read, or reads more slowly than the writer writes, does it mean that after some time write operations on the socket will no longer add to the buffer? Or will they overwrite the buffer? If no one is reading the socket, I wouldn't mind overwriting the buffer.
Ideally, I'd like my application to write to the socket as fast as possible. If no one is reading the data, then overwriting is fine. If someone is reading from the socket but not reading fast enough, I'd like to store all the data in the buffer. Then how can I force my socket to increase the buffer size when reading is slow?
Can I write a whole frame in one go (a write call with 5 MB of data)? Is it recommended? Speed is a major concern here.
Well, you can certainly try, but don't be too surprised if the call to socket.send() only sends a portion of the bytes you've asked it to send. In particular, you should always check the return value of socket.send() to see how many bytes it actually accepted from you, because that value may be greater than zero but less than the number of bytes you passed in. If it is less, you'll want to call socket.send() again to send the remaining bytes from your buffer, and repeat as necessary. Alternatively, you can call socket.sendall() instead of socket.send(), which does the necessary looping and re-calling of socket.send() for you. The tradeoff is that socket.sendall() might not return for a long time, depending on the speed of your network connection and how much data you told socket.sendall() to send.
Note that when sending datagrams it is common for maximum packet sizes to be enforced; packets larger than that will either be fragmented into smaller packets for transmission (and hopefully re-assembled on the receiving side) or they might simply be discarded. For example, when sending UDP packets over Ethernet, it is common to have an MTU of 1500 bytes. When sending over a Unix socket the MTU will likely be larger than that, but there will likely still be a limit.
If the Python client fails to read, or reads more slowly than the writer writes, does it mean that after some time write operations on the socket will no longer add to the buffer? Or will they overwrite the buffer? If no one is reading the socket, I wouldn't mind overwriting the buffer.
If you are sending on a stream-style socket (SOCK_STREAM), then a slow client will cause your server's send() calls to block if/when the buffer fills up. If you are sending on a datagram-style socket (SOCK_DGRAM) and the buffer fills up, the "overflow" datagrams will simply be discarded.
Then how can I force my socket to increase the buffer size when reading is slow?
You can set the socket's send-buffer size via socket.setsockopt(SOL_SOCKET, SO_SNDBUF, xxx). Note that this is typically done in advance (e.g. right after the socket is created) rather than trying to do it "on the fly" in response to a slow reader.
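For example (a sketch; the 4 MB figure is arbitrary, and the kernel may clamp the request: on Linux the ceiling is net.core.wmem_max, and getsockopt reports roughly double the requested value because the kernel reserves extra space for bookkeeping):

    import socket

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # Request a 4 MB kernel send buffer before connecting
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)
    print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))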
It sounds like a design flaw that you need to send this much data over the socket to begin with, and that there is this risk of the reader not keeping up with the writer. As an alternative, you may want to consider using a delta encoding, where you alternate between "key frames" (whole frames) and frames encoded as deltas from the prior frame. You may also want to consider writing the data to a local buffer and then, on your UNIX domain socket, implementing a custom protocol that allows reading a sequence of frames starting at a given timestamp, or a single frame given a timestamp. If all reads go through such a buffer rather than directly from the source, you could also add encoding/compression options in that protocol. And if the application that exports the data to the UNIX socket is separate from the one reading the camera and writing to the buffer, you won't need to worry about your data ingestion being blocked by slow readers.
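To illustrate the delta idea (a rough sketch, not a production codec: XOR plus zlib is just one simple scheme, it assumes consecutive frames have equal length, and a pure-Python byte loop would be slow for 5 MB frames):

    import zlib

    def encode_frame(frame, prev):
        # Key frame: nothing to diff against (or the frame size changed)
        if prev is None or len(prev) != len(frame):
            return b"K" + zlib.compress(frame)
        # Delta frame: XOR against the previous frame; unchanged bytes
        # become zeros, which compress very well
        delta = bytes(a ^ b for a, b in zip(frame, prev))
        return b"D" + zlib.compress(delta)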
When using the recv() method, sometimes we can't receive as much data as we want, just as with send(). But we can use sendall() to solve the problem when sending data; what about receiving? Why is there no such recvall() method?
There is no fundamental reason why such a function could not be provided as part of the standard library. In fact, there has been at least one attempt to add recvall().
Given that it can be easily implemented as a loop around recv(), I don't think that this is a major omission.
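Such a loop might look like this (a sketch; the exactly-n semantics chosen here are just one possible interpretation of "recvall", which is precisely the ambiguity the answers below point out):

    def recvall(sock, n):
        # Receive exactly n bytes, or raise if the peer closes first
        data = bytearray()
        while len(data) < n:
            chunk = sock.recv(n - len(data))
            if not chunk:
                raise ConnectionError("connection closed before %d bytes arrived" % n)
            data.extend(chunk)
        return bytes(data)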
send has extra information that recv doesn't: how much data there is to send. If you have 100 bytes of data to send, sendall can objectively determine if fewer than 100 bytes were sent by the first call to send, and continually send data until all 100 bytes are sent.
When you try to read 1024 bytes, but only get 512 back, you have no way of knowing if that is because the other 512 bytes are delayed and you should try to read more, or if there were only 512 bytes to read in the first place. You can never say for a fact that there will be more data to read, rendering recvall meaningless. The most you can do is decide how long you are willing to wait (timeout) and how many times you are willing to retry before giving up.
You might wonder why there is an apparent difference between reading from a file and reading from a socket. With a file, you have extra information from the file system about how much data is in the file, so you can reliably distinguish between EOF and some other condition that may have prevented you from reading the available data. There is no such source of metadata for sockets.
Because recvall is fundamentally confusing: your assumption was that it would read exactly N bytes, but I would have thought it would completely exhaust the stream, based on the name.
An operation that completely exhausts the stream is a dangerous API for a bunch of reasons, and the ambiguity in naming makes this a pretty unproductive API.
I'm writing some python code that splices together large files at various points. I've done something similar in C where I allocated a 1MB char array and used that as the read/write buffer. And it was very simple: read 1MB into the char array then write it out.
But with Python I'm assuming it is different: each time I call read() with size = 1M, it will allocate a 1M-long character string. And hopefully when the buffer goes out of scope it will be freed in the next gc pass.
Would Python handle the allocation this way? If so, would the constant allocation/deallocation cycle be computationally expensive?
Can I tell Python to use the same block of memory, just like in C? Or is the Python VM smart enough to do it itself?
I guess what I'm essentially aiming for is kinda like an implementation of dd in Python.
Search docs.python.org for readinto to find the docs appropriate for the version of Python you're using. readinto is a low-level feature. The docs will look a lot like this:
readinto(b)
Read up to len(b) bytes into bytearray b and return the number of bytes read.
Like read(), multiple reads may be issued to the underlying raw stream, unless the latter is interactive.
A BlockingIOError is raised if the underlying raw stream is in non-blocking mode and has no data available at the moment.
But don't worry about it prematurely. Python allocates and deallocates dynamic memory at a ferocious rate, and it's likely that the cost of repeatedly getting & free'ing a measly megabyte will be lost in the noise. And note that CPython is primarily reference-counted, so your buffer will get reclaimed "immediately" when it goes out of scope. As to whether Python will reuse the same memory space each time, the odds are decent but it's not assured. Python does nothing to try to force that, but depending on the entire allocation/deallocation pattern and the details of the system C's malloc()/free() implementation, it's not impossible it will get reused ;-)
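For what it's worth, a dd-style copy loop that reuses one preallocated buffer via readinto might look like this (a sketch; the file names and the 1 MB block size are placeholders):

    # Copy src.bin to dst.bin one megabyte at a time, reusing a single buffer
    buf = bytearray(1024 * 1024)
    view = memoryview(buf)
    with open("src.bin", "rb") as src, open("dst.bin", "wb") as dst:
        while True:
            n = src.readinto(buf)
            if n == 0:                 # end of file
                break
            dst.write(view[:n])        # write only the bytes actually read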
So I'm working on a Python IRC framework, and I'm using Python's socket module. Do I feel like using Twisted? No, not really.
Anyway, I have an infinite loop reading and processing data from socket.recv(xxxx), where xxxx is really irrelevant in this situation. I split the received data into messages using str.split("\r\n") and process them one by one.
My problem is that I have to set a specific 'read size' in socket.recv() to define how much data to read from the socket. When I receive a burst of data (for example, when I connect to the IRC server and receive the MOTD, etc.), there's always a message that spans two 'reads' of the socket (i.e. part of the line is read in one socket.recv() and the rest is read in the next iteration of the infinite loop).
I can't process half-received messages, and I'm not sure if there's even a way of detecting them. In an ideal situation I'd receive everything that's in the buffer, but it doesn't look like socket provides a method for doing that.
Any help?
You should really be using select or poll, e.g. via asyncore or the select module, or Twisted (which you'd prefer not to use).
Reading from a socket you never know how much you'll receive in each read. You could receive several messages in one go, or have one message split into many reads. You should always collect the data in a buffer until you can make use of it, then remove the data you've used from the buffer (but leave data you haven't used yet).
Since you know your input makes sense line by line, your receive loop can look something like this (sock and handle_line are placeholders; a real loop would also handle recv() returning b"", which signals that the connection closed):

    buffer = b""
    while True:
        buffer += sock.recv(4096)                # append new data to the buffer
        while b"\r\n" in buffer:                 # process and remove all complete lines
            line, buffer = buffer.split(b"\r\n", 1)
            handle_line(line)
Stream-mode sockets (e.g., TCP) never guarantee that you'll receive messages in any sort of neatly framed format. If you receive partial lines of input -- which will inevitably happen sometimes -- you need to hold onto the partial line until the rest of the line shows up.
Using Twisted will save you a lot of time. Better yet, you may want to look into using an existing IRC framework -- there are a number of them already available.