Storing and replaying binary network data with Python

I have a Python application which sends 556 bytes of data across the network at a rate of 50 Hz. The binary data is generated using struct.pack() which returns a string, which is subsequently written to a UDP socket.
As well as transmitting this data, I would like to save this data to file as space-efficiently as possible, including a timestamp for each message, so that I can replay the data at a later time. What would be the best way of doing this using Python?
I have mulled over using a logging object, but have not yet found out whether Python can read in log files so that I can replay the data. Also, I don't know whether the logging object can handle binary data.
Any tips would be much appreciated! Although Wireshark would be an option, I'd rather store the data using my application so that I can automatically start new data files each time I run the program.

Python's logging system is intended to process human-readable strings, and it's intended to be easy to enable or disable depending on whether it's you (the developer) or someone else running your program. Don't use it for something that your application always needs to output.
The simplest way to store the data is to just write the same 556-byte string that you send over the socket out to a file. If you want to have timestamps, you could precede each 556-byte message with the time of sending, converted to an integer, and packed into 4 or 8 bytes using struct.pack(). The exact method would depend on your specific requirements, e.g. how precise you need the time to be, and whether you need absolute time or just relative to some reference point.
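The simplest record format might look something like this; a minimal sketch assuming an 8-byte microsecond timestamp per message (the file layout and helper names are illustrative, not a fixed recommendation):

import struct
import time

RECORD_HEADER = struct.Struct("<Q")  # 8-byte unsigned timestamp: microseconds since the epoch

def log_message(log_file, payload):
    # write one timestamped 556-byte message to an already-open binary file
    timestamp_us = int(time.time() * 1000000)
    log_file.write(RECORD_HEADER.pack(timestamp_us))
    log_file.write(payload)

def replay(path, payload_size=556):
    # yield (timestamp_us, payload) tuples read back from the log file
    with open(path, "rb") as log_file:
        while True:
            header = log_file.read(RECORD_HEADER.size)
            if len(header) < RECORD_HEADER.size:
                break
            (timestamp_us,) = RECORD_HEADER.unpack(header)
            yield timestamp_us, log_file.read(payload_size)

On replay you can sleep for the difference between consecutive timestamps before re-sending each payload over the socket.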

One possibility for a compact timestamp for replay purposes: take the time as a floating-point number of seconds since the epoch with time.time(), multiply by 50 (since you said you're sending 50 times a second; the resulting unit, one fiftieth of a second, is sometimes called a "jiffy"), truncate to int, and subtract the similar int count of jiffies since the epoch that you measured at the start of your program. Then struct.pack the result into an unsigned int with the number of bytes you need for the intended duration -- for example, with 2 bytes for this timestamp you could represent runs of about 1300 seconds (a bit over 20 minutes), but if you plan longer runs you'd need 4 bytes (3 bytes is just too unwieldy IMHO;-).
Not all operating systems have time.time() returning decent precision, so you may need more devious means if you need to run on such unfortunately limited OSs (that's VERY OS-dependent, of course). What OSs do you need to support...?
Anyway: for even more compactness, use a somewhat higher multiplier than 50 (say 10000) for more accuracy, and store, each time, the difference with respect to the previous timestamp -- since that difference should not be much different from a jiffy (if I understand your spec correctly), it should be about 200 of these ten-thousandths of a second, so you can store it in a single unsigned byte (and have no limit on the duration of runs you're storing for future replay). This depends even more on accurate returns from time.time(), of course.
If your 556-byte binary data is highly compressible, it will be worth your while to use gzip to store the stream of timestamp-then-data in compressed form; this is best assessed empirically on your actual data, though.
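A rough sketch of that delta-plus-gzip scheme, assuming the 0.1 ms tick and one-byte delta described above (the class name is illustrative, and a real format would need an escape code for gaps longer than 255 ticks):

import gzip
import struct
import time

TICKS_PER_SECOND = 10000  # 0.1 ms resolution, as suggested above

class DeltaLogger(object):
    # store each message preceded by a one-byte tick delta, gzip-compressed
    def __init__(self, path):
        self.out = gzip.open(path, "wb")
        self.last_ticks = int(time.time() * TICKS_PER_SECOND)

    def write(self, payload):
        now_ticks = int(time.time() * TICKS_PER_SECOND)
        delta = min(now_ticks - self.last_ticks, 255)  # clamped; a real format needs an escape for larger gaps
        self.last_ticks = now_ticks
        self.out.write(struct.pack("B", delta) + payload)

    def close(self):
        self.out.close()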

Related

What is the max value of pygame.time.get_ticks()?

I'm wondering, if I use pygame.time.get_ticks(), can I get a stack overflow?
I wrote a game, and the score depends on the time the player spends with it before dying. This way I can end up with quite a large number, especially when measuring it in milliseconds. What's the upper limit pygame can handle? I know that Python is only limited by my computer's memory, but it looks inefficient, even compared to itself.
What are the alternatives if I don't want this type of memory leak? Or is it a problem at all?
This is 100000% not an issue. When dealing with an int Python will never throw an OverflowError.
get_ticks returns a Python int, which on a 32 bit system starts out as a 4-byte number, so it can store 2^32 values. 2^32 milliseconds is equal to about 50 days, so you are not going to run into any issues whatsoever. Even if it did somehow run longer than that, Python automatically increases the size of its integers, so there is no hard limit. Even a simple 8 byte value (2^64) can store milliseconds totaling up to 584,554,531 years.
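A quick sketch of the arithmetic behind those figures, which also shows that Python ints simply keep growing rather than overflowing:

MS_PER_DAY = 1000 * 60 * 60 * 24

print(2 ** 32 / MS_PER_DAY)           # about 49.7 days before a 4-byte counter would wrap
print(2 ** 64 / MS_PER_DAY / 365.25)  # roughly 584 million years for an 8-byte counter

ticks = 2 ** 64 + 1                   # Python ints never overflow; they just keep growing
print(ticks)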
See also How does Python manage int and long? and the OverflowError doc.

How to publish multiple numbers via MQTT?

I'm working on remote-controlling a watering system from a smartphone, using MQTT to control the valves connected to a Raspberry Pi.
Up till now, I had the raspberry pi subscribe to multiple topics ('watering/frontLawn', 'watering/backLawn') and interpret the payload as the watering duration.
Now I want to add a way to schedule waterings, which additionally requires sending the time when the watering is supposed to take place. So startTime and duration need to be sent on a specific topic.
Something like wateringInfo = [1563532789, 300] # in the form [startTime, duration]
Is there a recommended way of transferring information like this?
So far my only idea was to combine the two numbers:
startTime*1000+duration # assuming duration is maxed at 999
send them
and retrieve using:
retrievedStartTime = int(msg.payload) // 1000
retrievedDuration = int(msg.payload) % 1000
This seems like an error prone way of doing things. Is there a different way, maybe even directly transfer the array?
I suggest using a "watering/frontLawn/info" topic and sending the message as a string, which is easily parsed on the other side.
Publisher:
client.publish(topic="watering/frontLawn/info",
               payload=str(start) + "," + str(end), qos=1, retain=False)
Subscriber:
client.subscribe(topic="watering/frontLawn/info", qos=1)
Parsing part (the payload arrives as bytes, so decode it first):
info = message.payload.decode().split(",")
Now you have a list as follows:
info = ["start", "end"]
To get the values:
start = int(info[0])
end = int(info[1])
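For completeness, a self-contained sketch of that pattern with the paho-mqtt client (the broker address is a placeholder and the 1.x-style callback API is assumed):

import time
import paho.mqtt.client as mqtt
import paho.mqtt.publish as publish

TOPIC = "watering/frontLawn/info"
BROKER = "broker.example.local"  # placeholder broker address

def on_message(client, userdata, message):
    # payload arrives as bytes; decode before splitting
    start_str, duration_str = message.payload.decode().split(",")
    print("water at", int(start_str), "for", int(duration_str), "seconds")

subscriber = mqtt.Client()
subscriber.on_message = on_message
subscriber.connect(BROKER)
subscriber.subscribe(TOPIC, qos=1)
subscriber.loop_start()

publish.single(TOPIC, payload="1563532789,300", qos=1, retain=False, hostname=BROKER)
time.sleep(1)  # give the subscriber a moment to receive the message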
How you pack/serialize data is entirely up to you and will depend on multiple factors including (but not exclusively):
What type of system is producing the data (is it a very limited-capability sensor)?
What is consuming the data (see above)?
Does the data need to be human readable?
Are you concerned with how big the message is (are you paying by the byte)?
Some sample options include:
JSON/XML These are both structured text formats that allow you to include a huge amount of structure and context along with the values. They are also human readable.
CSV (comma separated values) This is human readable, but you need to know which position in the list each variable maps to.
Protobuf Again, this allows you to include structure, but at a binary level, so it is not human readable.
Raw fixed-width fields You just say that each value is going to fit into the range of what can be represented by a fixed number of bytes, e.g. -128 to +127 (or 0 to 255) can be represented by a single byte, and you predefine the number of bytes for each field and their order. This is basically CSV, but without the separator chars and using the smallest amount of space to pass the information (a sketch of this appears below).
There are loads of other options, but they are mainly variations on those described above. Which you pick will depend on at least the factors listed and how low level you want to get when implementing it.
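As a sketch of the raw fixed-width option, the two numbers from the question fit into six bytes with struct (the field widths chosen here are an assumption):

import struct

# 4-byte unsigned start time (seconds since epoch) + 2-byte unsigned duration (seconds)
SCHEDULE = struct.Struct("!IH")

payload = SCHEDULE.pack(1563532789, 300)         # 6 bytes, could be published as the MQTT payload
start_time, duration = SCHEDULE.unpack(payload)  # back to Python ints on the subscriber side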

PyAudio - Synchronizing playback and record

I'm currently using PyAudio to work on a lightweight recording utility that fits a specific need of an application I'm planning. I am working with an ASIO audio interface. What I'm writing the program to do is play a wav file through the interface, while simultaneously recording the output from the interface. The interface is processing the signal onboard in realtime and altering the audio. As I'm intending to import this rendered output into a DAW, I need the output to be perfectly synced with the input audio. Using a DAW I can simultaneously play audio into my interface and record the output. It is perfectly synced in the DAW when I do this. The purpose of my utility is to be able to trigger this from a python script.
Through a brute-force approach I've come up with a solution that works, but I'm now stuck with a magic number and I'm unsure of whether this is some sort of constant or something I can calculate. If it is a number I can calculate that would be ideal, but I still would like to understand where it is coming from either way.
My callback is as follows:
def testCallback(in_data, frame_count, time_info, status):
    global lastCt, firstCt  # counters kept as module-level state
    # read data from wave file
    data = wave_file.readframes(frame_count)
    # calculate number of latency frames for playback and recording
    # 1060 is my magic number
    latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
    # no more data in playback file
    if data == "":
        # this is the number of times we must keep the loop alive to capture all playback
        recordEndBuffer = latencyCalc / frame_count
        if lastCt < recordEndBuffer:
            # return 0-byte data to keep callback alive
            data = b"0" * wave_file.getsampwidth() * frame_count
            lastCt += 1
    # we start recording before playback, so this accounts for the initial "pre-playback" data in the output file
    if firstCt > (latencyCalc / frame_count):
        wave_out.writeframes(in_data)
    else:
        firstCt += 1
    return (data, pyaudio.paContinue)
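For context, a minimal sketch of how a callback like this might be wired into a PyAudio stream (the file names, stream parameters, and module-level counters here are assumptions, not the asker's actual setup):

import math
import time
import wave
import pyaudio

# module-level state assumed by testCallback above (file names are placeholders)
wave_file = wave.open("playback.wav", "rb")
wave_out = wave.open("recorded.wav", "wb")
wave_out.setnchannels(wave_file.getnchannels())
wave_out.setsampwidth(wave_file.getsampwidth())
wave_out.setframerate(wave_file.getframerate())
lastCt = 0
firstCt = 0

pa = pyaudio.PyAudio()
stream = pa.open(format=pa.get_format_from_width(wave_file.getsampwidth()),
                 channels=wave_file.getnchannels(),
                 rate=wave_file.getframerate(),
                 input=True,
                 output=True,
                 start=False,                 # don't fire the callback until everything is set up
                 stream_callback=testCallback)

stream.start_stream()
# the callback always returns paContinue, so just run for the playback length plus a margin
time.sleep(wave_file.getnframes() / float(wave_file.getframerate()) + 2.0)
stream.stop_stream()
stream.close()
wave_out.close()
pa.terminate()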
My concern is in the function:
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
I put this calculation together by observing the offset of my output file in comparison to the original playback file. Two things were occurring: my output file was starting later than the original file when played simultaneously, and it would also end early. Through trial and error I determined there was a specific number of extra frames at the beginning and missing frames at the end, and this calculation accounts for them. I do understand the first piece: it is the input/output latencies (provided with second/subsecond accuracy) converted to frames using the sample rate. But I'm not quite sure how to account for the 1060 value, as I'm not sure where it comes from.
I've found that by playing with the latency settings on my ASIO driver, my application continues to properly sync the recorded file even as the output/input latencies above change due to the adjustment (input/output latencies are always the same value), so the 1060 appears to be consistent on my machine. However, I simply don't know whether this is a value that can be calculated. Or if it is a specific constant, I'm unsure what exactly it represents.
Any help in better understanding these values would be appreciated. I'm happy my utility is now working properly, but I would like to fully understand what is happening here, as I suspect that using a different interface would likely no longer work correctly (I would like to support this down the road for a few reasons).
EDIT 4/8/2014 in response to Roberto:
The value I receive for
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
is 8576, with the extra 1060 bringing the total latency to 9636 frames. You are correct in your assumption of why I added the 1060 frames. I am playing the file through the external ASIO interface, and the processing I'm hoping to capture in my recorded file is the result of the processing that occurs on the interface (not something I have coded). To compare the outputs, I simply played the test file and recorded the interface's output without any of the processing effects engaged on the interface. I then examined the two tracks in Audacity, and by trial and error determined that 1060 was the closest I could get the two to align. I have since realized it is still not exactly perfect, but it is incredibly close and audibly undetectable when played simultaneously (which is not true when the 1060 offset is removed; there is a noticeable delay). Adding or removing an additional frame is too much compensation in comparison to 1060 as well.
I do believe you are correct that the additional latency is from the external interface. I was initially wondering if it was something I could calculate with the numerical info I had at hand, but I am concluding it's just a constant of the interface. I feel this is true because I have determined that if I remove the 1060, the offset of the file is exactly the same as performing the same test manually in Reaper (this is exactly the process I'm automating). I am getting much better latency than I would in Reaper with my new brute-force offset, so I'm going to call this a win. In my application, the goal is to completely replace the original file with the newly processed file, so the absolute minimum latency between the two is desired.
In response to your question about ASIO in PyAudio, the answer is fortunately yes. You must compile PortAudio against the ASIO SDK for PortAudio to function with ASIO, and then update the PyAudio setup to compile that way. Fortunately I'm working on Windows, and the prebuilt packages at http://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio have ASIO support built in, so the devices are then accessible through ASIO.
Since I'm not allowed to comment, I'll ask you here: what is the value of (stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()? And how did you get the number 1060 in the first place?
With the line of code you marked off:
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
you simply add an extra 1060 frames to your total latency. It's not clear to me from your description why you do this, but I assume that you have measured the total latency in your resulting file, and there is always a constant number of extra frames besides the sum of input latency + output latency. So, did you consider that this extra delay might be due to processing? You said that you do some processing of the input audio signal, and processing certainly takes some time. Try the same thing with an unaltered input signal and see if the extra delay is reduced or removed. Even the other parts of your application can slow the recording down, e.g. if the application has a GUI. You didn't describe your app completely, but I'm guessing that the extra latency is caused by your code and the operations that the code does. And why is the 'magic number' always the same? Because your code is always the same.
To summarize: what does the 'magic number' represent? Obviously, it represents some extra latency in addition to your total round-trip latency.
What is causing this extra latency? The cause is most likely somewhere in your code. Your application is doing something that takes some additional time and thus introduces some additional delay. The only other possibility that comes to mind is that you have added some additional 'silence period' somewhere in your settings, so you can check that out, too.

How to make BufferedReader's peek more reliable?

There's a convenient peek function in io.BufferedReader. But
peek([n])
Return 1 (or n if specified) bytes from a buffer without advancing
the position. Only a single read on the raw stream is done to satisfy
the call. The number of bytes returned may be less than requested since
at most all the buffer’s bytes from the current position to the end are
returned.
it is returning too few bytes.
Where can I get a reliable multi-byte peek (without using read and disrupting other code that nibbles the stream byte by byte and interprets data from it)?
It depends on what you mean by reliable. The buffered classes are specifically tailored to avoid I/O as much as possible (as that is the whole point of a buffer), so the only guarantee is that at most one read of the underlying raw stream will be done. The amount of data returned depends exclusively on the amount of data that the buffer already has in it.
If you need an exact amount of data, you will need to alter the underlying structures. In particular, you will probably need to re-open the stream with a bigger buffer.
If that is not an option, you could provide a wrapper class so that you can intercept the reads that you need and provide the data transparently to other code that actually wants to consume the data.
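A rough sketch of such a wrapper (the class and method names are illustrative); it keeps its own look-ahead buffer, so other code that reads through the wrapper byte by byte still sees every byte:

class PeekableReader(object):
    # wrap a binary stream and offer an exact-size peek via a private look-ahead buffer
    def __init__(self, raw):
        self.raw = raw
        self.lookahead = b""

    def peek(self, n):
        # top up the look-ahead buffer until we have n bytes or hit EOF
        while len(self.lookahead) < n:
            chunk = self.raw.read(n - len(self.lookahead))
            if not chunk:
                break
            self.lookahead += chunk
        return self.lookahead[:n]

    def read(self, n):
        # serve buffered bytes first, then fall through to the raw stream
        data = self.lookahead[:n]
        self.lookahead = self.lookahead[n:]
        if len(data) < n:
            data += self.raw.read(n - len(data))
        return data

# usage: reader = PeekableReader(open("some_file.bin", "rb"))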

Python - Generate a 32 bit random int with arguments

I need to generate a 32 bit random int, but depending on some arguments. The idea is to generate a unique ID for each message sent through my own P2P network. To generate it, I would like to use my IP and a timestamp as arguments. My question is: how can I generate this 32 bit random int from these arguments?
Thanks again!
here's a list of options with their associated problems:
use a random number. you will get a collision (non-unique value) in about half the bits (this is the "birthday collision"). so for 32 bits you get a collision after about 2**16 messages. if you are sending less than 65,000 messages this is not a problem, but 65,000 is not such a big number.
use a sequential counter from some service. this is what twitter's snowflake does (see another answer here). the trouble is supplying these across the net. typically with distributed systems you give each agent a block of numbers (so A might get 0-9, B gets 10-19, etc) and they use those numbers then request a new block. that reduces network traffic and load on the service providing numbers. but this is complex.
generate a hash from some values that will be unique. this sounds useful but is really no better than (1), because your hashes are going to collide (i explain why below). so you can hash IP address and timestamp, but all you're doing is generating 32 bit random numbers, in effect (the difference is that you can reproduce these values, but it doesn't seem like you need that functionality anyway), and so again you'll have collisions after 65,000 messages or so, which is not much.
be smarter about generating ids to guarantee uniqueness. the problem in (3) is that you are hashing more than 32 bits, so you are compressing information and getting overlaps. instead, you could explicitly manage the bits to avoid collisions. for example, number each client with 16 bits (allows up to 65,000 clients) and then have each client use a 16 bit counter (allows up to 65,000 messages per client, which is a big improvement on (3)). those won't collide because each is guaranteed unique, but you have a lot of limits in your system and things are starting to get complex (you need to number clients and store counter state per client). a sketch of this approach appears below.
use a bigger field. if you used 64 bit ids then you could just use random numbers because collisions would be once every 2**32 messages, which is practically never (1 in 4,000,000,000). or you could join ip address (32 bits) with a 32 bit timestamp (but be careful - that probably means no more than 1 message per second from a client). the only drawback is slightly larger bandwidth, but in most cases ids are much smaller than payloads.
personally, i would use a larger field and random numbers - it's simple and works (although good random numbers are an issue in, say, embedded systems).
finally, if you need the value to be "really" random (because, for example, ids are used to decide priority and you want things to be fair) then you can take one of the solutions above with deterministic values and re-arrange the bits to be pseudo-random. for example, reversing the bits in a counter may well be good enough (compare lsb first).
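A minimal sketch of option (4), packing a 16 bit client number and a 16 bit per-client counter into one 32 bit id (how clients get their numbers is left out, and the class name is illustrative):

class MessageIdGenerator(object):
    # 32 bit ids: the high 16 bits identify the client, the low 16 bits count its messages
    def __init__(self, client_id):
        assert 0 <= client_id < 2 ** 16
        self.client_id = client_id
        self.counter = 0

    def next_id(self):
        message_id = (self.client_id << 16) | self.counter
        self.counter = (self.counter + 1) & 0xFFFF  # wraps after 65,536 messages per client
        return message_id

# example: MessageIdGenerator(7).next_id() == 0x00070000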
I would suggest using some sort of hash. There are many possible hashes; the FNV hash comes in a variety of sizes and is fast. If you want something cryptographically secure it will be a lot slower. You may need to add a counter: 1, 2, 3, 4... to ensure that you do not get duplicate hashes within the same timestamp.
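For example, a sketch of a 32 bit FNV-1a hash over the IP, timestamp and counter (the field layout and example IP are illustrative, and collisions remain possible, as discussed in the other answer):

import time

FNV_OFFSET_BASIS = 2166136261
FNV_PRIME = 16777619

def fnv1a_32(data):
    # 32 bit FNV-1a hash of a bytes object
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h

counter = 0

def message_id(ip="192.0.2.1"):
    # hash IP + timestamp + counter into a 32 bit id
    global counter
    counter += 1
    key = "%s|%d|%d" % (ip, int(time.time()), counter)
    return fnv1a_32(key.encode())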
Did you try looking into Twitter's Snowflake? There is a Python wrapper for it.
