How to publish multiple numbers via MQTT? - python

I'm working on remote-controlling a watering system from a smartphone, using MQTT to control the valves connected to a Raspberry Pi.
Until now, I have had the Raspberry Pi subscribe to multiple topics ('watering/frontLawn', 'watering/backLawn') and interpret the payload as the watering duration.
Now I want to add a way to schedule waterings, which additionally requires sending the time at which the watering is supposed to take place. So startTime and duration need to be sent on a specific topic.
Something like wateringInfo = [1563532789, 300] # in the form [startTime, duration]
Is there a recommended way of transferring information like this?
So far my only idea was to combine the two numbers:
startTime*1000+duration # assuming duration is maxed at 999
send them
and retrieve using:
retrievedStartTime = int(msg.payload) // 1000
retrievedDuration = int(msg.payload) % 1000
This seems like an error-prone way of doing things. Is there a different way, maybe even a way to transfer the array directly?

I suggest using a "watering/frontLawn/info" topic and sending the message as a string, which can easily be parsed on the other side.
Publisher:
client.publish(topic="watering/frontLawn/info",
               payload=str(start) + "," + str(end), qos=1, retain=False)
Subscriber:
client.subscribe(topic="watering/frontLawn/info", qos=1)
Parsing Part:
info = message.payload.decode().split(",")  # payload arrives as bytes in Python 3
Now you have a list of strings in the form:
info = ["<start>", "<end>"]
To get info:
start = int(info[0])
end = int(info[1])

How you pack/serialize data is entirely up to you and will depend on multiple factors including (but not exclusively):
What type of system is producing the data (is it a very limited-capability sensor)?
What is consuming the data (see above)?
Does the data need to be human readable?
Are you concerned with how big the message is (are you paying by the byte)?
Some sample options include:
JSON/XML: these both allow you to include a large amount of structure and context along with the values, and they are also human readable (see the JSON sketch after this list).
CSV (comma-separated values): this is human readable, but you need to know what the position of each variable in the list maps to.
Protobuf: again, this allows you to include structure, but at a binary level, so it is not human readable.
Raw fixed-width fields: you just say that each value is going to fit into the range of what can be represented by a fixed number of bytes, e.g. -128 to +127 (or 0 to 255) can be represented by a single byte, and you predefine the number of bytes for each field and their order. This is basically CSV, but without the separator characters and using the smallest amount of space to pass the information.
There are plenty of other options, but they are mainly variations on the ones described above. Which you pick will depend on at least the factors listed and on how low-level you want to get when implementing it.
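For example, here is a quick sketch of the JSON option using the paho-mqtt client and the "watering/frontLawn/info" topic from the first answer; the field names are just illustrative, not a fixed schema.
import json
import paho.mqtt.client as mqtt

def publish_schedule(client, start_time, duration):
    payload = json.dumps({"startTime": start_time, "duration": duration})
    client.publish(topic="watering/frontLawn/info", payload=payload, qos=1, retain=False)

def on_message(client, userdata, msg):
    info = json.loads(msg.payload.decode())     # payload arrives as bytes
    print(info["startTime"], info["duration"])  # e.g. 1563532789 and 300

client = mqtt.Client()
client.on_message = on_message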

Related

Slow python loop to search data in another data frame

I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the different stations where each observation starts and ends (called 'info'). I am trying to get a data frame where I'll have the latitude and longitude next to each station in each observation. My code in python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
but since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this)
edit:
my info file looks like this (about 500 observations, one for each station)
my data file looks like this (there are other variables not shown here) (about 15 million observations, one for each trip)
and what I am looking to get is that, when the station numbers match, the resulting data would look like this:
This is one solution. You can also use pandas.merge to add two new columns to data and perform the equivalent mapping (see the sketch after the code below).
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings; if no mapping is found, use fillna to keep the original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
                                 .fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
                                  .fillna(data.loc[mask, 'longitude'])
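For completeness, here is a sketch of the pandas.merge alternative mentioned above; the suffixes and the intermediate '_info' column names are just illustrative.
merged = data.merge(info[['station', 'latitude', 'longitude']],
                    on='station', how='left', suffixes=('', '_info'))
# only overwrite coordinates on 2018 rows, mirroring the loop in the question
mask = merged['year'] == '2018'
merged.loc[mask, 'latitude'] = merged.loc[mask, 'latitude_info']\
                                     .fillna(merged.loc[mask, 'latitude'])
merged.loc[mask, 'longitude'] = merged.loc[mask, 'longitude_info']\
                                      .fillna(merged.loc[mask, 'longitude'])
merged = merged.drop(columns=['latitude_info', 'longitude_info'])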
This is a very common and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of data, making them optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
As a general rule, you should avoid for loops that you break out of early. Whenever you don't know in advance precisely how many iterations you will need, a while (or do...while) loop is usually the better fit.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data serially is fine for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.
The aim is to do everything in parallel, relying on a concept named MapReduce.
The best-known distributed data storage framework is Hadoop (e.g. the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to run MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, engine such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
Again this subject is a whole different world but might very well be adapted to your situation. Feel free to document yourself about all these concepts and frameworks, and leave your questions if needed in the comments.

PyAudio - Synchronizing playback and record

I'm currently using PyAudio to work on a lightweight recording utility that fits a specific need of an application I'm planning. I am working with an ASIO audio interface. What I'm writing the program to do is play a wav file through the interface, while simultaneously recording the output from the interface. The interface is processing the signal onboard in realtime and altering the audio. As I'm intending to import this rendered output into a DAW, I need the output to be perfectly synced with the input audio. Using a DAW I can simultaneously play audio into my interface and record the output. It is perfectly synced in the DAW when I do this. The purpose of my utility is to be able to trigger this from a python script.
Through a brute-force approach I've come up with a solution that works, but I'm now stuck with a magic number and I'm unsure of whether this is some sort of constant or something I can calculate. If it is a number I can calculate that would be ideal, but I still would like to understand where it is coming from either way.
My callback is as follows:
def testCallback(in_data, frame_count, time_info, status):
    global lastCt, firstCt  # counters kept at module level in this snippet
    # read data from wave file
    data = wave_file.readframes(frame_count)
    # calculate number of latency frames for playback and recording
    # 1060 is my magic number
    latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
    # no more data in playback file
    if data == b"":
        # this is the number of times we must keep the loop alive to capture all playback
        recordEndBuffer = latencyCalc / frame_count
        if lastCt < recordEndBuffer:
            # return 0-byte data to keep callback alive
            data = b"0" * wave_file.getsampwidth() * frame_count
            lastCt += 1
    # we start recording before playback, so this accounts for the initial "pre-playback" data in the output file
    if firstCt > (latencyCalc / frame_count):
        wave_out.writeframes(in_data)
    else:
        firstCt += 1
    return (data, pyaudio.paContinue)
My concern is with this calculation:
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
I put this calculation together by observing the offset of my output file in comparison to the original playback file. Two things were occurring: my output file was starting later than the original file when played simultaneously, and it would also end early. Through trial and error I determined there was a specific number of extra frames at the beginning and of missing frames at the end. This calculates that number of frames. I do understand the first piece: it is the input/output latencies (provided with sub-second accuracy) converted to frames using the sample rate. But I'm not quite sure how to account for the 1060 value, as I'm not sure where it comes from.
I've found that by playing with the latency settings on my ASIO driver, my application continues to properly sync the recorded file even as the output/input latencies above change due to the adjustment (input/output latencies are always the same value), so the 1060 appears to be consistent on my machine. However, I simply don't know whether this is a value that can be calculated. Or if it is a specific constant, I'm unsure what exactly it represents.
Any help in better understanding these values would be appreciated. I'm happy my utility is now working properly, but would like to fully understand what is happening here, as I suspect potentially using a different interface would likely no longer work correctly (I would like to support this down the road for a few reasons).
EDIT 4/8/2014 in response to Roberto:
The value I receive for
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
is 8576, with the extra 1060 bringing the total latency to 9636 frames. You are correct in your assumption of why I added the 1060 frames. I am playing the file through the external ASIO interface, and the processing I'm hoping to capture in my recorded file is the result of the processing that occurs on the interface (not something I have coded). To compare the outputs, I simply played the test file and recorded the interface's output without any of the processing effects engaged on the interface. I then examined the two tracks in Audacity, and by trial and error determined that 1060 was the closest I could get the two to align. I have since realized it is still not exactly perfect, but it is incredibly close and audibly undetectable when played simultaneously (which is not true when the 1060 offset is removed; there is a noticeable delay). Adding or removing an additional frame is too much compensation compared to 1060 as well.
I do believe you are correct that the additional latency is from the external interface. I was initially wondering if it was something I could calculate with the numerical info I had at hand, but I am concluding it's just a constant of the interface. I feel this is true because I have determined that if I remove the 1060, the offset of the file is exactly the same as when performing the same test manually in Reaper (this is exactly the process I'm automating). I am getting much better latency than I would in Reaper with my new brute-force offset, so I'm going to call this a win. In my application, the goal is to completely replace the original file with the newly processed file, so the absolute minimum latency between the two is desired.
In response to your question about ASIO in PyAudio, the answer is fortunately yes. You must compile PortAudio using the ASIO SDK for PortAudio to function with ASIO, and then update the PyAudio setup to compile this way. Fortunately I'm working on Windows, and the prebuilt PyAudio binaries at http://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio have ASIO support built in, so the devices are then accessible through ASIO.
Since I'm not allowed to comment, I'll ask you here: what is the value of (stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()? And how did you get the number 1060 in the first place?
With the line of code you marked off:
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060, you simply add an extra 1060 frames to your total latency. It's not clear to me from your description why you do this, but I assume that you have measured the total latency in your resulting file, and there is always a constant number of extra frames beyond the sum of the input and output latencies. So, did you consider that this extra delay might be due to processing? You said that you do some processing of the input audio signal, and processing certainly takes some time. Try to do the same with the unaltered input signal and see whether the extra delay is reduced or removed. Even the other parts of your application (e.g. a GUI, if it has one) can slow the recording down. You didn't describe your app completely, but I'm guessing that the extra latency is caused by your code and the operations it performs. And why is the 'magic number' always the same? Because your code is always the same.
To summarize: what does the 'magic number' represent? Obviously, it represents some extra latency in addition to your total round-trip latency.
What is causing this extra latency? The cause is most likely somewhere in your code: your application is doing something that takes additional time and thus introduces additional delay. The only other possibility that comes to mind is that you have added some extra 'silence period' somewhere in your settings, so you can check that, too.

Python - Generate a 32 bit random int with arguments

I need to generate a 32-bit random int, but one that depends on some arguments. The idea is to generate a unique ID for each message sent through my own P2P network. To generate it, I would like to use as arguments my IP address and the timestamp. My question is: how can I generate this 32-bit random int from these arguments?
Thanks again!
here's a list of options with their associated problems:
use a random number. you will get a collision (non-unique value) in about half the bits (this is the "birthday collision"), so for 32 bits you get a collision after about 2**16 messages. if you are sending fewer than 65,000 messages this is not a problem, but 65,000 is not such a big number.
use a sequential counter from some service. this is what twitter's snowflake does (see another answer here). the trouble is supplying these across the net. typically with distributed systems you give each agent a set of numbers (so A might get 0-9, B get's 10-19, etc) and they use those numbers then request a new block. that reduces network traffic and load on the service providing numbers. but this is complex.
generate a hash from some values that will be unique. this sounds useful but is really no better than (1), because your hashes are going to collide (i explain why below). so you can hash IP address and timestamp, but all you're doing is generating 32 bit random numbers, in effect (the difference is that you can reproduce these values, but it doesn't seem like you need that functionality anyway), and so again you'll have a collisions after 65,000 messages or so, which is not much.
be smarter about generating ids to guarantee uniqueness. the problem in (3) is that you are hashing more than 32 bits, so you are compressing information and getting overlaps. instead, you could explicitly manage the bits to avoid collisions. for example, number each client with 16 bits (allows up to 65,000 clients) and then have each client use a 16 bit counter (allows up to 65,000 messages per client, which is a big improvement on (3)). those won't collide because each is guaranteed unique, but you have a lot of limits in your system and things are starting to get complex (you need to number clients and store counter state per client).
use a bigger field. if you used 64 bit ids then you could just use random numbers because collisions would be once every 2**32 messages, which is practically never (1 in 4,000,000,000). or you could join ip address (32 bits) with a 32 bit timestamp (but be careful - that probably means no more than 1 message per second from a client). the only drawback is slightly larger bandwidth, but in most cases ids are much smaller than payloads.
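for illustration, a rough sketch of that last option, assuming ipv4 addresses and just the standard library (the helper name is made up):
import socket
import struct
import time

def make_id(ip_string):
    ip_bits = struct.unpack("!I", socket.inet_aton(ip_string))[0]  # ipv4 -> 32-bit int
    ts_bits = int(time.time()) & 0xFFFFFFFF                        # low 32 bits of unix time
    return (ip_bits << 32) | ts_bits                               # 64-bit id

print(hex(make_id("192.168.1.17")))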
personally, i would use a larger field and random numbers - it's simple and works (although good random numbers are an issue in, say, embedded systems).
finally, if you need the value to be "really" random (because, for example, ids are used to decide priority and you want things to be fair) then you can take one of the solutions above with deterministic values and re-arrange the bits to be pseudo-random. for example, reversing the bits in a counter may well be good enough (compare lsb first).
I would suggest using some sort of hash. There are many possible hashes; the FNV hash comes in a variety of sizes and is fast. If you want something cryptographically secure, it will be a lot slower. You may need to add a counter (1, 2, 3, 4, ...) to ensure that you do not get duplicate hashes within the same timestamp.
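As a rough sketch of that idea, here is a 32-bit FNV-1a hash over the IP, timestamp and counter; the "ip|timestamp|counter" layout is just an illustrative assumption.
import time

def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5                         # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h

message_id = fnv1a_32(f"192.168.1.17|{int(time.time())}|1".encode())
print(hex(message_id))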
Did you try looking into Twitter's Snowflake? There is a Python wrapper for it.

How many times string appears in another string

I have a large static binary (10GB) that doesn't change.
I want to be able to take as input small strings (15 bytes or fewer each) and then determine which string is the least frequent.
I understand that without actually searching the whole binary I won't be able to determine this exactly, so I know it will be an approximation.
Building a tree/hash table isn't feasible since it would require about 256^15 bytes, which is a LOT.
I have about 100GB of disk space and 8GB RAM which will be dedicated into this task, but I can't seem to find any way to accomplish this task without actually going over the file.
I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many many times.
Any ideas?
Thanks!
Daniel.
(BTW: if it matters, I'm using Python)
Maybe build a hashtable with the counts for as many n-tuples as you can afford storage for? You can prune the entries that don't appear. I wouldn't call it an "approximation", but rather "upper bounds", with the assurance of detecting strings that don't appear at all.
So, say you can build all 4-tuples.
Then to count occurrences of "ABCDEF" you'd take the minimum of count(ABCD), count(BCDE) and count(CDEF). If that is zero for any of those, the string is guaranteed not to appear. If it is one, the string will appear at most once (but maybe not at all).
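A minimal sketch of that idea with 4-tuples, assuming the counts fit in memory (for the real 10GB file you would build the table in a one-time pass and persist it):
from collections import Counter

N = 4

def build_counts(data: bytes) -> Counter:
    return Counter(data[i:i + N] for i in range(len(data) - N + 1))

def upper_bound(counts: Counter, query: bytes) -> int:
    # the true count can never exceed that of the rarest 4-tuple inside the query
    return min(counts[query[i:i + N]] for i in range(len(query) - N + 1))

counts = build_counts(b"ABCDEFABCDXYABCD")
print(upper_bound(counts, b"ABCDEF"))   # 1 -> appears at most once
print(upper_bound(counts, b"ABCDZZ"))   # 0 -> guaranteed absent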
Because you have a large static string that does not change, you can separate the one-time work of preprocessing this string, which never has to be repeated, from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.
If you can find a machine with an order of magnitude or so more internal storage, you could build a suffix array - an array of offsets into the stream, in sorted order of the suffixes starting at each offset. This could be stored in external storage for queries, and you could use it with binary search to find the first and last positions in sorted order where your query string appears. Obviously the distance between the two will give you the number of occurrences, and a binary search will need about 34 binary chops to cover 16 GB, assuming 16 GB is 2^34 bytes, so each query should cost about 68 disk seeks.
It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for the one-time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by a 5-byte offset number, then sort these 20-byte records with an external sort. Keeping just the sorted 5-byte offsets then gives you 50GB of indexes into the string, in sorted order, for you to put into external storage to answer queries with.
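Here is a simplified, in-memory sketch of the suffix-array idea; the real version would keep the sorted 5-byte offsets in external storage and seek into the 10GB file during each binary chop.
import bisect

data = b"ABCDEFABCDXYABCD"
MAX_QUERY = 15

# offsets ordered by the (up to 15-byte) suffix starting at each offset
suffix_array = sorted(range(len(data)), key=lambda i: data[i:i + MAX_QUERY])

def count_occurrences(query: bytes) -> int:
    # materialized here only for brevity; on disk you would compare during each seek
    keys = [data[i:i + len(query)] for i in suffix_array]
    lo = bisect.bisect_left(keys, query)
    hi = bisect.bisect_right(keys, query)
    return hi - lo

print(count_occurrences(b"ABCD"))   # 3
print(count_occurrences(b"XYZ"))    # 0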
If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an Aho-Corasick tree (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) from them. This takes time linear in the total size of the queries. Then you can stream the 10GB of data past it in time proportional to the sum of the size of that data and the number of times any string finds a match.
Since you are looking for the least frequent string and are willing to accept an approximate solution, you could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.
The idea would be to go through all of the possible query sizes and make substrings out of them. For example, if the queries will be between 3 and 100 bytes, then it would cost N * (sum of i from i = 3 to i = 100). Then add the substrings one by one to one of the Bloom filters, choosing a filter that does not already contain the substring, and creating a new Bloom filter with the same hash functions if needed. You obtain the count for a query by going through each filter and checking whether the query exists within it; each filter that contains it adds 1 to the count.
You'll need to try to balance the false positive rate as well as the number of filters. If the false positive rate gets too high on one of the filters it isn't useful; likewise, it's bad if you have trillions of Bloom filters (quite possible if you have one filter per substring). There are a couple of ways these issues can be dealt with.
To reduce the number of filters:
Randomly delete filters until there are only so many left. This will likely increase the false negative rate, which probably means it's better to simply delete the filters with the highest expected false positive rates.
Randomly merge filters until there are only so many left. Ideally avoid merging a filter too often, as merging increases the false positive rate. Practically speaking, you probably have too many to do this without making use of the scalable version (see below), as it'll probably be hard enough to manage the false positive rate.
It also may not be a bad idea to avoid a greedy approach when adding to a Bloom filter; be rather selective about which filter something is added to.
You might end up having to implement scalable bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so should work well.
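For reference, a bare-bones Bloom filter sketch to make the idea concrete; the bit-array size, hash count and SHA-256-based hashing are illustrative assumptions, not tuned for this dataset.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"ABCD")
print(b"ABCD" in bf)   # True
print(b"WXYZ" in bf)   # almost certainly False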

Storing and replaying binary network data with python

I have a Python application which sends 556 bytes of data across the network at a rate of 50 Hz. The binary data is generated using struct.pack() which returns a string, which is subsequently written to a UDP socket.
As well as transmitting this data, I would like to save this data to file as space-efficiently as possible, including a timestamp for each message, so that I can replay the data at a later time. What would be the best way of doing this using Python?
I have mulled over using a logging object, but have not yet found out whether Python can read in log files so that I can replay the data. Also, I don't know whether the logging object can handle binary data.
Any tips would be much appreciated! Although Wireshark would be an option, I'd rather store the data using my application so that I can automatically start new data files each time I run the program.
Python's logging system is intended to process human-readable strings, and it's intended to be easy to enable or disable depending on whether it's you (the developer) or someone else running your program. Don't use it for something that your application always needs to output.
The simplest way to store the data is to just write the same 556-byte string that you send over the socket out to a file. If you want to have timestamps, you could precede each 556-byte message with the time of sending, converted to an integer, and packed into 4 or 8 bytes using struct.pack(). The exact method would depend on your specific requirements, e.g. how precise you need the time to be, and whether you need absolute time or just relative to some reference point.
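Here is a minimal sketch of that approach, assuming an 8-byte double timestamp in front of each 556-byte message; the file name and helper names are just for illustration.
import struct
import time

RECORD_FORMAT = "!d556s"                  # 8-byte timestamp + 556-byte payload
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

def log_message(log_file, payload: bytes):
    log_file.write(struct.pack(RECORD_FORMAT, time.time(), payload))

def replay(path):
    with open(path, "rb") as log_file:
        while True:
            record = log_file.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                break
            yield struct.unpack(RECORD_FORMAT, record)  # (timestamp, payload)

with open("capture.bin", "wb") as f:
    log_message(f, b"\x00" * 556)

for timestamp, payload in replay("capture.bin"):
    print(timestamp, len(payload))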
One possibility for a compact timestamp for replay purposes: get the time as a floating-point number of seconds since the epoch with time.time(), multiply by 50 since you said you're sending 50 times a second (the resulting unit, one fiftieth of a second, is sometimes called "a jiffy"), truncate to an int, subtract the similar int count of jiffies since the epoch that you measured at the start of your program, and struct.pack the result into an unsigned int with the number of bytes you need to represent the intended duration. For example, with 2 bytes for this timestamp you could represent runs of about 1300 seconds (a bit over 20 minutes), but if you plan longer runs you'd need 4 bytes (3 bytes is just too unwieldy IMHO ;-).
Not all operating systems have time.time() returning decent precision, so you may need more devious means if you need to run on such unfortunately limited OSs. (That's VERY os-dependent, of course). What OSs do you need to support...?
Anyway: for even more compactness, use a multiplier higher than 50 (say 10000) for more accuracy, and store, each time, the difference with respect to the previous timestamp. Since that difference should stay close to one jiffy (if I understand your spec correctly), it should be about 200 or so of these ten-thousandths of a second, so you can store it in a single unsigned byte (and have no limit on the duration of the runs you're storing for future replay). This depends even more on accurate returns from time.time(), of course.
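A hedged sketch of that delta encoding, storing one unsigned byte per message in units of 1/10000 of a second; clamping large gaps to 255 is a simplification that real code would need to handle properly.
import struct
import time

TICKS_PER_SECOND = 10000

class DeltaLogger:
    def __init__(self, log_file):
        self.log_file = log_file
        self.last_tick = int(time.time() * TICKS_PER_SECOND)

    def log(self, payload: bytes):
        now = int(time.time() * TICKS_PER_SECOND)
        delta = min(now - self.last_tick, 255)   # one byte per message
        self.last_tick = now
        self.log_file.write(struct.pack("!B556s", delta, payload))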
If your 556-byte binary data is highly compressible, it will be worth your while to use gzip to store the stream of timestamp-then-data in compressed form; this is best assessed empirically on your actual data, though.
