I am trying to build a proxy to buffer some packets according to some schedules.
I have two TCP connections, one from host A to the proxy and the other one from the proxy to host B.
The proxy forwards packets between A and B, and buffers them according to scheduled instructions: at certain times it buffers the packets instead of forwarding them, and once the buffering period is over it forwards the buffered packets and resumes its normal forwarding.
I am using Python. Which module would be best in this situation? I tried pickle, but it is difficult to remove and append elements in the file. Any suggestions? Thanks!
I recommend you join the two scripts into one and just use memory.
If you can't join the scripts for some reason, create a Unix-domain socket to pass the raw, binary data directly from one to the other. The socket's in-kernel buffer has a limited size, so you'll still have to do in-memory buffering on one side or the other, probably the B side.
If the data is too big for memory, you can write it out to a temporary file and re-read it when it's time to pass it on. It'll be easiest if the same script both writes and reads the file, since then you won't have to guess when to write, when to advance to a new file, or how to coordinate separate readers and writers.
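For the in-memory approach, a minimal sketch of the A-to-B direction could look something like this (the schedule, the socket objects, and names like BUFFER_WINDOWS are placeholders for whatever your scheduler actually provides):

import select
import time
from collections import deque

BUFFER_WINDOWS = [(10.0, 15.0)]   # hypothetical schedule: (start, end) seconds since launch

def in_buffer_window(start, now):
    elapsed = now - start
    return any(lo <= elapsed < hi for lo, hi in BUFFER_WINDOWS)

def proxy_a_to_b(conn_a, conn_b):
    """Forward bytes from A to B, holding them in memory during buffer windows."""
    pending = deque()              # buffered chunks waiting to be forwarded
    start = time.monotonic()
    while True:
        readable, _, _ = select.select([conn_a], [], [], 0.1)
        if conn_a in readable:
            chunk = conn_a.recv(4096)
            if not chunk:
                break              # A closed the connection
            pending.append(chunk)
        # Flush everything we have unless we are inside a buffering window.
        if not in_buffer_window(start, time.monotonic()):
            while pending:
                conn_b.sendall(pending.popleft())

The B-to-A direction would be handled symmetrically, e.g. with a second thread or by adding conn_b to the select call.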
I have a C++ program which reads frames from a high-speed camera and writes each frame to a socket (a Unix socket). Each write is 4096 bytes. Each frame is roughly 5MB. (There is no guarantee that the frame size will be constant, but it is always a multiple of 4096 bytes.)
There is a Python script which reads from the socket: 10 * 4096 bytes at each call of recv. I often get unexpected behavior, which I think boils down to understanding the following about sockets. I believe both of my programs are writing/receiving in blocking mode.
Can I write a whole frame in one go (a write call with 5MB of data)? Is it recommended? Speed is a major concern here.
If the Python client fails to read, or reads more slowly than the writer writes, does it mean that after some time a write operation on the socket would stop adding to the buffer? Or would it overwrite the buffer? If no one is reading the socket, I wouldn't mind overwriting the buffer.
Ideally, I'd like my application to write to the socket as fast as possible. If no one is reading the data, then overwriting is fine. If someone is reading the data from the socket but not reading fast enough, I'd like to store all the data in a buffer. How can I force my socket to increase the buffer size when reading is slow?
Can I write a whole frame in one go (a write call with 5MB of data)? Is it recommended? Speed is a major concern here.
Well, you can certainly try, but don't be too surprised if the call to socket.send() only sends a portion of the bytes you asked it to send. In particular, you should always check the return value of socket.send() to see how many bytes it actually accepted from you, because that value may be greater than zero but less than the number of bytes you passed in. If it is less, you'll want to call socket.send() again to send the remaining bytes, and repeat as necessary. Alternatively, you can call socket.sendall() instead of socket.send(); it does the necessary looping and re-calling of socket.send() for you, so you don't have to worry about it. The trade-off is that socket.sendall() might not return for a long time, depending on the speed of your network connection and how much data you told it to send.
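For example, a manual send loop looks roughly like this (socket.sendall() does the equivalent internally):

def send_all(sock, data):
    """Keep calling send() until the kernel has accepted every byte."""
    total = 0
    while total < len(data):
        sent = sock.send(data[total:])
        if sent == 0:
            raise ConnectionError("socket connection broken")
        total += sent

# or simply:
# sock.sendall(data)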
Note that when sending datagrams it is common for maximum packet sizes to be enforced; packets larger than that will either be fragmented into smaller packets for transmission (and hopefully reassembled on the receiving side) or simply discarded. For example, when sending UDP packets over Ethernet, the MTU is commonly 1500 bytes. For a Unix-domain socket the limit will likely be larger, but there will still be a limit.
If the Python client fails to read, or reads more slowly than the writer writes, does it mean that after some time a write operation on the socket would stop adding to the buffer? Or would it overwrite the buffer? If no one is reading the socket, I wouldn't mind overwriting the buffer.
If you are sending on a stream-style socket (SOCK_STREAM), then a slow client will cause your server's send() calls to block if/when the buffer fills up. If you are sending on a datagram-style socket (SOCK_DGRAM) and the buffer fills up, the "overflow" datagrams will simply be discarded.
Then how can I force my socket to increase the buffer size when
reading is slow?
You can set the socket's send-buffer size via socket.setsockopt(SOL_SOCKET, SO_SNDBUF, xxx). Note that this is typically done in advance (e.g. right after the socket is created) rather than trying to do it "on the fly" in response to a slow reader.
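For instance (the 4 MB figure is just an illustration; the kernel may cap or round the value you request, so it is worth reading it back):

import socket

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
# Ask for a 4 MB send buffer right after creating the socket; system limits
# (e.g. wmem_max on Linux) may reduce what is actually granted.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))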
It sounds like a design flaw that you need to send this much data over the socket to begin with, and that there is this risk of the reader not keeping up with the writer. As an alternative, you may want to consider using delta encoding, where you alternate between "key frames" (whole frames) and frames encoded as deltas from the prior frame. You may also want to consider writing the data to a local buffer and then, on your UNIX domain socket, implementing a custom protocol that allows reading a sequence of frames starting at a given timestamp, or a single frame for a given timestamp. If all reads go through such a buffer rather than directly from the source, you could also add encoding/compression options to that protocol. And if the application that exports the data to the UNIX socket is separate from the one that reads the data and writes it to the buffer, you won't need to worry about your data ingestion being blocked by slow readers.
I am sending coordinates of points to a visualizational client-side script via TCP over the internet. I wonder which option I should use:
concat the coordinates into a large string and send them together, or
send them one by one
I don't know which one is faster. I have some other questions too:
Which one should I use?
Is there a maximum size of packet of TCP? (python: maximum size of string for client.send(string))
As it is a visualization project should I use UDP instead of TCP?
Could you please tell me a bit about lost packets? When do they occur? How should I deal with them?
Sorry for the many questions, but I really struggle with this issue...
When you send a string, that might be sent in multiple TCP packets. If you send multiple strings, they might all be sent in one TCP packet. You are not exposed to the packets, TCP sockets are just a constant stream of data. Do not expect that every call to recv() is paired with a single call to send(), because that isn't true. You might send "abcd" and "efg", and might read in "a", "bcde", and "fg" from recv().
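If you do need message boundaries on top of a TCP stream, a common approach is to prefix each message with its length. A sketch (the helper names here are made up):

import struct

def send_msg(sock, payload):
    # Prefix each message with a 4-byte big-endian length.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exactly(sock, n):
    # recv() may return fewer bytes than requested, so loop until we have n.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)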
It is probably best to send data as soon as you get it, so that the networking stack has as much information as possible, as early as possible. It will decide exactly what to do. You can send as big a string as you like, and if necessary it will be broken up for transmission over the wire, all automatically.
Since in TCP you don't deal with packets, things like lost packets also aren't your concern. That's all handled automatically -- either the data gets through eventually, or the connection closes.
As for UDP - you probably don't want UDP. :)
UDP has a theoretical maximum packet size of 65535 bytes; in reality the usable size is much lower and depends on operating systems and routing hardware. UDP does not have any mechanism to ensure data delivery, which makes it much faster but unreliable without some hand-crafted data-protection mechanism.
With TCP you don't have to worry about packet size: if you try to send a chunk bigger than can fit in one packet, it will be automatically and transparently split up by the operating system for you. Sending small chunks of data can be inefficient due to the overhead of TCP communication, but even here, by default you have Nagle's algorithm enabled on most OSes (see Nagle's algorithm). The algorithm basically tries to coalesce small chunks of data; for most scenarios this is fast and efficient.
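If the batching done by Nagle's algorithm ever hurts latency for small, frequent messages, it can be disabled per socket:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm so small writes go out immediately
# (trades some bandwidth efficiency for lower latency).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)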
Consider the following scenario: A process on the server is used to handle data from a network connection. Twisted makes this very easy with spawnProcess and you can easily connect the ProcessTransport with your protocol on the network side.
However, I was unable to determine how Twisted handles a situation where the data from the network is available faster than the process performs reads on its standard input. As far as I can see, Twisted code mostly uses an internal buffer (self._buffer or similar) to store unconsumed data. Doesn't this mean that concurrent requests from a fast connection (eg. over local gigabit LAN) could fill up main memory and induce heavy swapping, making the situation even worse? How can this be prevented?
Ideally, the internal buffer would have an upper bound. As I understand it, the OS's networking code would automatically stall the connection/start dropping packets if the OS's buffers are full, which would slow down the client. (Yes I know, DoS on the network level is still possible, but this is a different problem). This is also the approach I would take if implementing it myself: just don't read from the socket if the internal buffer is full.
Restricting the maximum request size is also not an option in my case, as the service should be able to process files of arbitrary size.
The solution has two parts.
One part is called producers. Producers are objects that data comes out of. A TCP transport is a producer. Producers have a couple useful methods: pauseProducing and resumeProducing. pauseProducing causes the transport to stop reading data from the network. resumeProducing causes it to start reading again. This gives you a way to avoid building up an unbounded amount of data in memory that you haven't processed yet. When you start to fall behind, just pause the transport. When you catch up, resume it.
The other part is called consumers. Consumers are objects that data goes in to. A TCP transport is also a consumer. More importantly for your case, though, a child process transport is also a consumer. Consumers have a few methods; one in particular is useful to you: registerProducer. This tells the consumer which producer data is coming to it from. The consumer can then call pauseProducing and resumeProducing according to its ability to process the data. When a transport (TCP or process) cannot send data as fast as a producer is asking it to send data, it will pause the producer. When it catches up, it will resume it again.
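A rough sketch of that pattern (the executable path and class name are placeholders, error handling is omitted, and the reactor is assumed to be running):

from twisted.internet import protocol, reactor

class ForwardToProcess(protocol.Protocol):
    """Pipe incoming TCP data into a child process's stdin with flow control."""

    def connectionMade(self):
        self.proc = reactor.spawnProcess(
            protocol.ProcessProtocol(), "/usr/bin/some-consumer", ["some-consumer"])
        # The TCP transport is a push producer; registering it with the process
        # transport (a consumer) lets the latter pause/resume it when the
        # child's stdin backs up.
        self.proc.registerProducer(self.transport, True)

    def dataReceived(self, data):
        self.proc.write(data)

    def connectionLost(self, reason):
        self.proc.unregisterProducer()
        self.proc.loseConnection()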
You can read more about producers and consumers in the Twisted documentation.
Let's consider a big file (~100MB). Let's consider that the file is line-based (a text file, with relatively short line ~80 chars).
If I use the built-in open()/file(), the file will be loaded in a lazy manner.
I.e. if I do aFile.readline(), only a chunk of the file will reside in memory. Does urllib.urlopen() do something similar (with the use of a cache on disk)?
How big is the difference in performance between urllib.urlopen().readline() and file().readline()? Let's consider that the file is located on localhost. Once I open it with urllib.urlopen(), and then with file(). How big will the difference in performance/memory consumption be when I loop over the file with readline()?
What is best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load bunch of lines(~50) into a list and then process the list?
open (or file) and urllib.urlopen look like they're more or less doing the same thing here. urllib.urlopen is (basically) creating a socket._socketobject and then invoking the makefile method (the contents of that method are included below):
def makefile(self, mode='r', bufsize=-1):
    """makefile([mode[, bufsize]]) -> file object

    Return a regular file object corresponding to the socket. The mode
    and bufsize arguments are as for the built-in open() function."""
    return _fileobject(self._sock, mode, bufsize)
Does the urllib.urlopen() do something similar (with usage of a cache on disk)?
The operating system does. When you use a networking API such as urllib, the operating system and the network card will do the low-level work of splitting data into small packets that are sent over the network, and to receive incoming packets. Those are stored in a cache, so that the application can abstract away the packet concept and pretend it would send and receive continuous streams of data.
How big is the difference in performance between urllib.urlopen().readline() and file().readline()?
It is hard to compare these two. For urllib, this depends on the speed of the network, as well as the speed of the server. Even for local servers, there is some abstraction overhead, so that, usually, it is slower to read from the networking API than from a file directly.
For actual performance comparisons, you will have to write a test script and do the measurement. However, why do you even bother? You cannot replace one with another since they serve different purposes.
What is best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load bunch of lines(~50) into a list and then process the list?
Since the bottleneck is the networking speed, it might be a good idea to process the data as soon as you get it. This way, the operating system can cache more incoming data "in the background".
It makes no sense to cache lines in a list before processing them. Your program will just sit there waiting for enough data to arrive while it could be doing something useful already.
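For example, a line-by-line loop looks like this (Python 3 spelling; the question's urllib.urlopen is the Python 2 name, and the URL is a placeholder):

from urllib.request import urlopen

count = 0
with urlopen("http://localhost/bigfile.txt") as response:
    for line in response:      # handle each line as it arrives; no need to batch 50 lines
        count += 1             # stand-in for real per-line processing
print(count)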
I plan to use Unix named pipes (mkfifo) for simple multi-process messaging.
A message would be just a single line of text.
Would you discourage me from that? What obstacles should I expect?
I have noticed these limitations:
A sender cannot continue until the message is received.
A receiver is blocked until there is some data. Non-blocking I/O would be needed when we need to stop reading, for example when another thread asks for that.
The receiver could obtain many messages in a single read. These have to be processed before quitting.
The maximum length of an atomic message is limited to 4096 bytes. That is the PIPE_BUF limit on Linux (see man 7 pipe).
I will implement the messaging in Python. But the obstacles hold in general.
Lack of portability - they are mainly a Unix thing. Sockets are more portable.
Harder to scale out to multiple systems (another plus for sockets).
On the other hand, I believe pipes are faster than sockets for processes on the same machine (less communication overhead).
As to your limitations,
You can "select" on pipes, to do a non-blocking read.
I normally (in perl) print out my messages on pipes seperated by "\n", and read a line from them to get one message at a time.
Do be careful with the atomic length.
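A minimal non-blocking reader along those lines, in Python (the FIFO path is a placeholder):

import os
import select

FIFO_PATH = "/tmp/messages.fifo"
if not os.path.exists(FIFO_PATH):
    os.mkfifo(FIFO_PATH)

# O_NONBLOCK makes open() and read() return immediately instead of blocking.
fd = os.open(FIFO_PATH, os.O_RDONLY | os.O_NONBLOCK)
try:
    while True:
        readable, _, _ = select.select([fd], [], [], 1.0)   # 1-second timeout
        if not readable:
            continue                 # a chance to check for a shutdown request
        data = os.read(fd, 4096)     # may contain several "\n"-separated messages
        if not data:
            break                    # the writer closed its end (or none is connected yet)
        for message in data.splitlines():
            print(message)
finally:
    os.close(fd)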
I find perlipc to be a good discussion of the various options, though it has Perl-specific code.
The blocking, both on the sender side and the receiver side, can be worked around via non-blocking I/O.
Further limitations of FIFOs:
Only one client at a time.
After the client closes the FIFO, the server needs to reopen its endpoint.
Unidirectional.
I would use UNIX domain sockets instead, which have none of the above limitations.
As an added benefit, if you want to scale it up to communicate between multiple machines, it's barely any change at all. For example, just take the Python documentation page on socket, replace socket.AF_INET with socket.AF_UNIX and (HOST, PORT) with a filename, and it just works.
SOCK_STREAM will give you stream-like behavior; that is, two sends may be merged into one receive or vice versa. AF_UNIX also supports SOCK_DGRAM: datagrams are guaranteed to be sent and read all as one unit or not at all. (Analogously, AF_INET+SOCK_STREAM=TCP, AF_INET+SOCK_DGRAM=UDP.)
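A minimal sketch of both ends (the socket path is a placeholder):

import os
import socket

SOCK_PATH = "/tmp/demo.sock"

def serve():
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCK_PATH)           # a filesystem path instead of (HOST, PORT)
    server.listen(1)
    conn, _ = server.accept()
    print(conn.recv(1024))           # read one chunk and print it
    conn.close()
    server.close()

def send_line(text):
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(SOCK_PATH)
    client.sendall(text.encode() + b"\n")
    client.close()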