How to incrementally parse JSON from a websocket message in Python?

How to incrementally parse JSON from a websocket message in Python? - python

I have JSON messages (JSON hash table) represented as strings coming in through a websocket. Each read from the socket may return a string that does not end on a message boundary. What's the easiest way to parse the JSON messages in Python? How do I find where in the string a message terminates without writing a parser (or brace/paren matcher) myself?
Do other languages provide tools to make this easier?

Each read from the socket may return a string that does not end on a message boundary.
This is incorrect - Websockets is a message oriented protocol as opposed to the traditional stream oriented TCP socket protocol where you needed to worry about and handle message chunking.
It is built on top of TCP, however it automatically handles piecing the individual fragments into one complete message before delivering it at the application layer (your code).
So websockets are message-oriented like UDP without the maximum length constraints but with TCP’s delivery guarantees and congestion control. It turns out that TCP’s stream orientation isn’t all that useful (think about how many protocols build some sort of “message” concept on top of TCP). In fact SCTP (RFC 4960) provides many of the same benefits of messages-on-top-of-TCP but removes the TCP part to reduce the overhead. Unfortunately, SCTP is yet to gain widespread adoption.
Also from the official RFC:
layers a framing mechanism on top of TCP to get back to the IP packet mechanism that TCP is built on, but without length limits

Related

Weird behavior of send() and recv()

SORRY FOR BAD ENGLISH
Why if I have two send()-s on the server, and two recv()-s on the client, sometimes the first recv() will get the content of the 2nd send() from the server, without taking just the content of the first one and let the other recv() to take the "due and proper" content of the other send()?
How can I get this work in an other way?

This is by design.
A TCP stream is a channel on which you can send bytes between two endpoints but the transmission is stream-based, not message based.
If you want to send messages then you need to encode them... for example by prepending a "size" field that will inform the receiver how many bytes to expect for the body.
If you send 100 bytes and then other 100 bytes it's well possible that the receiver will instead see 200 at once, or even 50 + 150 in two different read commands. If you want message boundaries then you have to put them in the data yourself.
There is a lower layer (datagrams) that allows to send messages, however they are limited in size and delivery is not guaranteed (i.e. it's possible that a message will get lost, that will be duplicated or that two messages you send will arrive in different order).
TCP stream is built on top of this datagram service and implements all the logic needed to transfer data reliably between the two endpoints.
As an alternative there are libraries designed to provide reliable message-passing between endpoints, like ZeroMQ.

Most probably you use SOCK_STREAM type socket. This is a TCP socket and that means that you push data to one side and it gets from the other side in the same order and without missing chunks, but there are no delimiters. So send() just sends data and recv() receives all the data available to the current moment.
You can use SOCK_DGRAM and then UDP will be used. But in such case every send() will send a datagram and recv() will receive it. But you are not guaranteed that your datagrams will not be shuffled or lost, so you will have to deal with such problems yourself. There is also a limit on maximal datagram size.
Or you can stick to TCP connection but then you have to send delimiters yourself.

Socket in python and socket.recv()

I am sending some request from the server to my client but I have some problem.
When I'm sending messages to the client, if I send many messages, I'll receive all with socket.recv()
Is there a way to get the messages one by one ?
Thanks

You need to use some kind of protocol over otherwise bare sockets.
See python twisted or use something like nanomsg or ZeroMQ if you want a simple drop-in replacement which is message-oriented.
It is not transport-agnostic though, meaning they will only work if they are used on both ends.

No. TCP is a byte stream. There are no messages larger than one byte.

I assume that you are using TCP. TCP is a streaming protocol, not a datagram protocol. This means that the data are not a series of messages, but instead a single data stream without any message boundaries. If you need something like this either switch the protocol (UDP is datagram, but has other problems) or make your own protocol on top of TCP which knows about messages.
Typical message based protocols on top of TCP either use some message delimiter (often newline) or prefix each message with its size.

Send messages to ZeroMQ server using conventional TCP, possible?

I'm not sure if I'm doing this right, but I would like to be able to send messages to my server running ZMQ from normal TCP connections. The server is running Python ZMQ on port 5555 using a TCP transport. I would like to be able to send messages to it using different clients (Python, Java, PHP) that use conventional TCP. This is what I have so far:
SERVER
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")
while True:
message = socket.recv()
print message
socket.send('{"name":"someone"}')
CLIENT
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 5555))
s.send('Hello, World!')
data = s.recv(1024)
print data
Printing data on the client does not give me the message I am expecting. I get this: �. I tried doing bytes(data).decode('utf8') thinking what I'm getting was an array of bytes, but I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
I am just wondering: Is this possible at all? Or am I doing something wrong? Also, is it recommended?
My reason for not using ZMQ on the clients is that I want to reduce the number of dependencies (ZeroMQ being one).
Thank you for your help.

ZeroMQ is a protocol that sits on top of TCP (well, technically, ZMTP is a protocol; 0MQ is the library that implements the ZMTP protocol, and ZeroMQ is the Python implementation of 0MQ…). All the signaling it does, and all the framing (dividing the stream of bytes into separate messages), is done by sending bytes over the socket. A client that doesn't know how to do ZeroMQ signaling and framing is just going to see a bunch of garbage.
What you're trying to do is exactly like trying to write a web client that just reads off a socket without knowing anything about HTTP. You're going to get a bunch of framing stuff that you don't know what to do with. The only difference is that in the case of HTTP, the framing is sometimes (but not always—you can have MIME envelopes, gzip transport-encoding, chunked transport, …) just a bunch of human-readable ASCII that comes before the data, while with ZMTP it's never human-readable…
If you want to send data over a plain TCP socket, you have to do that by creating a plain TCP socket on the server side and calling its send (or sendall, etc.) method. If you want to send data over a ZMTP channel, you have to do that by parsing ZMTP, or using a library that does it for you (like ZeroMQ), on the client side.
One more thing to keep in mind is that, unlike ZMTP, TCP is not a message-oriented protocol; all TCP sends is a stream of bytes. From the receiving side, there's no way to know when one send ends and the next one begins. So, for almost anything but a "send a request, get a response, hang up" protocol, you need to write some kind of framing of your own. This can be as simple as "messages are strings that have no newlines in them, and each message is separated by a newline" (in which case you can just use socket.makefile), but often the message format has to be more complicated, or you have to send "commands" rather than just data, etc.
Since tdelaney's answer has been deleted (which I think means it's invisible to anyone under 10K rep), and had a useful suggestion, I'll repeat it here, with credit as due: You can (using the ZeroMQ library) write a piece of middleware that talks ZMTP to the server but your own simple TCP-based protocol to the clients. ZeroMQ has been specifically designed to make this reasonably easy; as tdelaney put it, it's "kinda like Lego, you build a robust communication infrastructure by building different communicating parts. Not all of the parts need to be zeromq."

Google protocol buffer for parsing Text and Binary protocol messages in network trace (PCAP)

I want to parse application layer protocols from network trace using Google protocol buffer and replay the trace (I am using python). I need suggestions to automatically generate protocol message description (in .proto file) from a network trace.

So you want to reconstruct what .proto messages were being passed over the application-layer protocol?
This isn't as easy as it sounds. First, .proto messages can't be sent raw over the wire, as the receiver needs to know how long they are. They need to be encapsulated somehow, maybe in an HTTP POST or with a raw 4-byte size prepended. I don't know what it would be for your application, but you'll need to deal with that.
Second, you can't reconstruct the full .proto from the messages alone. You only get tag numbers and types, not names. In addition, you will lose information about submessages - submessages and plain strings are encoded identically (you could probably tell which is which by eyeballing them, but I don't think you could do it automatically). You also will never know about optional items that never got sent. But you could parse the buffer without the proto and get some reasonable data (ints, repeated strings, and such).
Third, you need to reconstruct the application byte stream from the pcap log. I'm not sure how to do that, but I suspect there are tools that would do that for you.

Python/Twisted - TCP packet fragmentation?

In Twisted when implementing the dataReceived method, there doesn't seem to be any examples which refer to packets being fragmented. In every other language this is something you manually implement, so I was just wondering if this is done for you in twisted already or what? If so, do I need to prefix my packets with a length header? Or do I have to do this manually? If so, what way would that be?

In the dataReceived method you get back the data as a string of indeterminate length meaning that it may be a whole message in your protocol or it may only be part of the message that some 'client' sent to you. You will have to inspect the data to see if it comprises a whole message in your protocol.
I'm currently using Twisted on one of my projects to implement a protocol and decided to use the struct module to pack/unpack my data. The protocol I am implementing has a fixed header size so I don't construct any messages until I've read at least HEADER_SIZE amount of bytes. The total message size is declared in this header data portion.
I guess you don't really need to define a message length as part of your protocol but it helps. If you didn't define one you would have to have a special delimiter that determines when a message begins/ends. Sort of how the FIX protocol uses the SOH byte to delimit fields. Though it does have a required field that tells you how long a message is (just not how many fields are in a message).

When dealing with TCP, you should really forget all notion of 'packets'. TCP is a stream protocol - you stream data in and data streams out the other side. Once the data is sent, it is allowed to arrive in as many or as few blocks as it wants, as long as the data all arrives in the right order. You'll have to manually do the delimitation as with other languages, with a length field, or a message type field, or a special delimiter character, etc.

You can also use a LineReceiver protocol

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.