Question Description
We run a lot of time-series queries, usually through a Python API. Some of them cause problems, and occasionally they fail completely because data is missing.
We are not sure where to look to learn how to deal with missing data in our time-series (InfluxDB) database.
Example
To illustrate the problem with an example:
We have some time-series data: say we measure the temperature of a room. We have many rooms, and sometimes a sensor dies or stops working for a week or two before we replace it. For that period the data is missing.
When we then try to run certain calculations, they fail. For example, calculating the average temperature per day fails because on some days the sensors delivered no measurements.
One approach we thought of is to interpolate the data for those days: take the last value available before the gap and the first available after it, and use them to fill in the days with no data.
This has many downsides, the major one being that the data is fake and can't be trusted. For our more serious processes we would prefer not to store fake (or interpolated) data.
We are wondering what the possible alternatives are and where we can find resources to educate ourselves on this topic.
Answer
The idea is to fill the missing values (the gaps) with null or None. That way we can use InfluxDB's built-in fill.
https://docs.influxdata.com/influxdb/cloud/query-data/flux/fill/
As the example on that page shows, we can fill the null values and then run any additional queries and analysis on the data.
The page linked above covers the methods we can use to fill in missing data values.
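To make the idea concrete on the Python side, here is a minimal sketch (not InfluxDB code) of a daily average that keeps gap days as None instead of failing, plus an optional fill that is applied only at query time so nothing fake gets stored. The function names and the sample readings are made up for illustration.

```python
from datetime import date, timedelta
from statistics import mean


def daily_means(readings, start, end):
    """Return {day: mean or None} for every day from start to end inclusive."""
    by_day = {}
    for day, value in readings:
        by_day.setdefault(day, []).append(value)

    result = {}
    day = start
    while day <= end:
        values = by_day.get(day)
        # Days without any reading stay None instead of raising an error.
        result[day] = mean(values) if values else None
        day += timedelta(days=1)
    return result


def fill_previous(series):
    """Copy of series where each None is replaced by the last known value."""
    filled = {}
    last = None
    for day in sorted(series):
        if series[day] is not None:
            last = series[day]
        filled[day] = last
    return filled


# Illustrative readings: the sensor delivered nothing on 2 January.
readings = [
    (date(2023, 1, 1), 20.0),
    (date(2023, 1, 1), 22.0),
    (date(2023, 1, 3), 19.0),
]
means = daily_means(readings, date(2023, 1, 1), date(2023, 1, 3))
```

Because the fill is a separate step, downstream code can choose per calculation whether gaps should stay None or be filled.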
I’m using a Raspberry Pi 4 to collect sensor data with a Python script.
Like:
val=mcp.read_adc(0)
This can read about ten thousand samples per second.
Now I want to save this data to InfluxDB for real-time analysis.
I have tried saving the readings to a log file and then using Telegraf to collect them, as this blog did:
But that doesn't work for my streaming data because it is too slow.
I have also tried using Python's influxdb module to write directly, like:
client.write(['interface,path=address,elementss=link value=3.14'],{'db':'db'},204,'line')
That is even worse.
So how can I write this data into InfluxDB in time? Are there any solutions?
Thank you, much appreciated!
Btw, I'm a beginner and can only use simple python, so sad.
InfluxDB OSS will process writes faster if you batch them. The Python client has a batch_size parameter that you can use to do this. If you are reading ~10k points/s I would try a batch size of about 10k too. The batches should be compressed to speed transfer.
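If you want to see what batching means independent of any client library, here is a minimal sketch in plain Python. `send_batch` is a hypothetical placeholder for whatever single write call your client offers; it is not a real InfluxDB function.

```python
def batches(points, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    batch = []
    for point in points:
        batch.append(point)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Do not lose the last, partially filled batch.
        yield batch


def send_batch(lines):
    # Hypothetical placeholder: replace with ONE write call of your client.
    return len(lines)


# 25,000 fake readings are sent in three writes instead of 25,000.
readings = range(25_000)
sent = [send_batch(batch) for batch in batches(readings, 10_000)]
```

The point is only that the number of write requests drops from one per reading to one per batch.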
The write method also allows sending the tags path=address,elementss=link as a dictionary. Doing this should decrease parsing effort.
Are you also running InfluxDB on the raspberry pi or do you send the data off the Pi over a network connection?
I noticed that you said in the comments that nanosecond precision is very important but you did not include a timestamp in your line protocol point example. You should provide a timestamp yourself if the timing is this critical. Without an explicit timestamp in the data, InfluxDB will insert a timestamp at "when the data arrives" which is unpredictable.
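A minimal sketch of attaching your own timestamp at read time, assuming Python 3.7+ for `time.time_ns()` and nanosecond write precision. The measurement and tag names are copied from your example; `to_line` and `read_sample` are names I made up, and `read_adc` stands for any function that returns one reading.

```python
import time


def to_line(value, timestamp_ns):
    """Format one line-protocol point that ends with an explicit timestamp."""
    return f"interface,path=address,elementss=link value={value} {timestamp_ns}"


def read_sample(read_adc):
    """Take the timestamp right before reading, not when the data is sent."""
    timestamp_ns = time.time_ns()
    return to_line(read_adc(), timestamp_ns)
```

In your script the call would look like `read_sample(lambda: mcp.read_adc(0))`. Note that the clock on the Pi will not really resolve single nanoseconds, so the timestamp is only as good as the system clock.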
As noted in the comments, you may want to consider preprocessing this data somewhat before sending it to InfluxDB. We can't make a suggestion without knowing how you are processing the piezo data to detect footsteps. Usually ADC values are averaged in small batches (10 - 100 reads, depending) to reduce noise. Assuming your footstep detector runs continuously, you'll have over 750 million points per day from a single sensor. This is a lot of data to store and postprocess.
Please edit your question to include more information, if you are willing.
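As a rough sketch of what such averaging could look like, and what it does to the data volume: the sample values and the block size of 100 below are assumptions for illustration, and whether averaging is acceptable depends on how you detect footsteps.

```python
from statistics import mean


def block_means(samples, block_size):
    """Average consecutive blocks of block_size samples.

    A short block left over at the end is dropped.
    """
    return [
        mean(samples[i:i + block_size])
        for i in range(0, len(samples) - block_size + 1, block_size)
    ]


# Illustrative ADC values, averaged in blocks of 4.
samples = [510, 514, 512, 512, 700, 704, 702, 702]
averaged = block_means(samples, 4)

# Data volume at roughly 10,000 reads per second.
SECONDS_PER_DAY = 24 * 60 * 60
raw_points_per_day = 10_000 * SECONDS_PER_DAY
averaged_points_per_day = raw_points_per_day // 100  # blocks of 100 reads
```

With blocks of 100 reads the daily point count shrinks by a factor of 100.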
I noticed a lack of good soundfont-compatible synthesizers written in Python. So, a month or so ago, I started some work on my own (for reference, it's here). Making this was also a challenge that I set for myself.
I keep coming up against the same problem again and again and again, summarized by this:
To play sound, a stream of data with a more-or-less constant rate of flow must be sent to the audio device
To synthesize sound in real time based on user input, little-to-no buffering can be used
Thus, there is a cap on the amount of time one 'buffer generation loop' can take
Python, as a language, simply cannot run fast enough to synthesize sound within this time limit
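To put a number on that cap, here is a small sketch of the time budget per buffer. The buffer size of 256 frames and the 44100 Hz sample rate are assumptions for illustration, not values from my synth.

```python
def buffer_budget_ms(frames_per_buffer, sample_rate):
    """Time in milliseconds available to generate one buffer."""
    return 1000.0 * frames_per_buffer / sample_rate


def per_voice_budget_us(frames_per_buffer, sample_rate, voices):
    """Time in microseconds per voice per buffer if voices render one by one."""
    return 1_000_000.0 * frames_per_buffer / sample_rate / voices


budget = buffer_budget_ms(256, 44100)          # roughly 5.8 ms
per_voice = per_voice_budget_us(256, 44100, 128)  # roughly 45 microseconds
```

Under these assumptions, every one of 128 voices has to produce its 256 samples in about 45 microseconds, every single buffer, without a late one.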
The problem is not my code, or at least, I've tried to optimize it to extreme levels - using local variables in time-sensitive parts of the code, avoiding using dots to access variables in loops, using itertools for iteration, using built-in functions like max, changing thread switching parameters, doing as few calculations as possible, making approximations, the list goes on.
Using PyPy helps, but even that starts to struggle after not too long.
It's worth noting that (at best) my synth at the moment can play about 25 notes simultaneously. But this isn't enough. Fluidsynth, a synth written in C, has a cap on the number of notes per instrument at 128 notes. It also supports multiple instruments at a time.
Is my assertion that Python simply cannot be used to write a synthesizer correct? Or am I missing something very important?
I built a device based on a microcontroller with some sensors attached to it, one of them is an orientation sensor that currently delivers information about pitch, yaw, roll and acceleration for x,y,z. I would like to be able to detect movement "events" when the device is well... moved around.
For example I would like to detect a "repositioned" event which basically would consist of series of other events - "up" (picked up), "move" (moved in air to some other point), "down" (device put back down).
Since I am just starting to figure out how to make it possible I would like to ask if I am getting the right ideas or wasting my time.
My idea is currently that I use the data I probed to create a dataset and try to use machine learning to detect if each element belongs to one of the events I am trying to detect. So basically I took the device and first rotated it on the table a few times, then picked it up several times, then moved it in the air and finally put it down several times. This generated a set of data that has a structure like this:
yaw,pitch,roll,accelx,accely,accelz,state
-140,178,178,17,-163,-495,stand
110,-176,-166,-212,-97,-389,down
118,-177,178,123,16,-146,up
166,-174,-171,-375,-145,-929,up
157,-178,178,4,-61,-259,down
108,177,-177,-55,76,-516,move
152,178,-179,35,98,-479,stand
175,177,-178,-30,-168,-668,move
100,177,178,-42,26,-447,stand
-14,177,179,42,-57,-491,stand
-155,177,179,28,-57,-469,stand
92,-173,-169,347,-373,-305,down
[...]
The last "state" column was added by me: I added it after each type of test movement and then shuffled the rows.
I got about 450 records this way. The idea is to use machine learning to predict the "state" column for each record coming from the running device. I could then queue up the outcomes, and if "up" events are the majority within some short period, I can take it that the device is being picked up.
Maybe instead of using each reading as a data row, I should take the last 10 readings (let's say) and try to predict what happens per column. For example, if I know the last 10 yaw readings were recorded while I was moving the device up, I should use that data. So 10 readings from each of the 6 columns are processed as a row, and then I have 6 results. Again, the ratio of result types may make it possible to detect the "movement" event that happened during these 10 readings.
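The queueing part could be sketched like this, assuming the classifier already returns one label per reading. `MajorityVote` and the window size are names and values I made up for illustration.

```python
from collections import Counter, deque


class MajorityVote:
    """Keep the most recent predictions and report a strict majority."""

    def __init__(self, window=10):
        # deque with maxlen drops the oldest prediction automatically.
        self.recent = deque(maxlen=window)

    def update(self, label):
        """Add one prediction; return the majority label or None."""
        self.recent.append(label)
        top, count = Counter(self.recent).most_common(1)[0]
        # Require more than half of the window to agree.
        return top if count > len(self.recent) / 2 else None


vote = MajorityVote(window=5)
predictions = ["stand", "up", "up", "up", "stand", "up"]
results = [vote.update(p) for p in predictions]
```

A single stray "stand" in the middle of a run of "up" predictions then no longer flips the detected event.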
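A minimal sketch of how such windows could be built, assuming the rows are still in the order they were recorded (shuffled rows would mix unrelated movements into one window). `column_windows` is a made-up helper name and the two-column rows are toy data.

```python
def column_windows(rows, size=10):
    """For each run of `size` consecutive rows, return one list per column."""
    windows = []
    for start in range(len(rows) - size + 1):
        chunk = rows[start:start + size]
        # zip(*chunk) turns a list of rows into a list of columns.
        windows.append([list(column) for column in zip(*chunk)])
    return windows


# Toy data with 2 columns and a window of 3 readings.
rows = [(0, 0), (1, 10), (2, 20), (3, 30)]
windows = column_windows(rows, size=3)
```

With the real data each window would hold 6 lists of 10 values, one list per sensor column.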
I am currently about 30% into an online ML course and enjoying it but I'd really like to hear some comments from more experienced people.
Are my ideas a reasonable solution or am I totally failing to understand how I can use ML? If so, what resources shall I use to get myself started?
Your idea to group the readings seems interesting. But it all depends on how often you get a record and how you plan on grouping them.
If you get a record every 10-100ms, it could be a good idea to group them, since reducing the noise gives you more accurate data. You could take the mean of each column to get rid of that noise and help your classifier separate your different states.
Otherwise, if you get a record every second, I think it's a bad idea to group the records, since you will most certainly mix several actions together.
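Taking the mean of each column over a group could look like this sketch. The group size and the two-column toy rows are assumptions, and the rows must be in recording order.

```python
from statistics import mean


def group_means(rows, group_size):
    """Replace each full group of consecutive rows by its per-column mean."""
    groups = [
        rows[i:i + group_size]
        for i in range(0, len(rows) - group_size + 1, group_size)
    ]
    return [tuple(mean(column) for column in zip(*group)) for group in groups]


# Toy rows with 2 columns, averaged in groups of 2.
rows = [(10, 100), (20, 200), (30, 300), (40, 400)]
smoothed = group_means(rows, 2)
```

The classifier is then trained on the smoothed rows instead of the raw ones.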
The best way would be to try out both ways if you have the time ^^
I'm measuring multiple processes in a component to see where bottlenecks are. These processes take anything from 1-1000us to complete.
I'm logging this in an InfluxDB database, set to microsecond (us) resolution, using Python 3.
My problem is visualising this. I tried Grafana, thinking it would suit me. However, when graphing this microsecond data, it shows multiple data points on the same 1 ms mark (the finest resolution Grafana supports), which makes it impossible to see the increments, zoom in, or do anything similar.
Judging by some google results, 1, 2, 3, I'm not alone.
Is there any way to make this data more readable, either by having the graphing tool display it in microseconds or by changing the X-axis to something other than a timestamp? (Ideally in something similar to Grafana or Chronograf.)
Thanks.
According to this Grafana feature request post (from 2016):
https://github.com/grafana/grafana/issues/6252
Quote by Torkel Ödegaard, Co-founder of Grafana:
No there is no way to do that. It would be quite tricky as all time formats in javascript (and time libs) only go down to millisecond resolution.
So it seems this is currently not possible (and not likely in the mid-term future either), because JavaScript time handling only supports millisecond resolution.
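As for the other option mentioned in the question, an X-axis that is not a timestamp: outside of Grafana, one possible workaround is to convert the timestamps to microsecond offsets from the first sample and plot those as plain numbers in any tool that accepts a numeric X-axis. A minimal sketch, assuming the timestamps are available as integer nanoseconds (`to_offsets_us` is a made-up helper name):

```python
def to_offsets_us(timestamps_ns):
    """Convert integer nanosecond timestamps to microsecond offsets."""
    if not timestamps_ns:
        return []
    start = timestamps_ns[0]
    # Subtract as integers first so no precision is lost on large timestamps.
    return [(t - start) / 1000 for t in timestamps_ns]


# Three illustrative samples, 1.5 us and 250 us after the first one.
offsets = to_offsets_us([1_000_000_000, 1_000_001_500, 1_000_250_000])
```

This does not change what Grafana itself can display; it only sidesteps the timestamp axis.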