I'm using a Raspberry Pi 4 to collect sensor data with a Python script, like:
    val = mcp.read_adc(0)
This reads about ten thousand samples per second.
Now I want to save this data to InfluxDB for real-time analysis.
I have tried saving the readings to a log file while reading and then collecting them with Telegraf, the way a blog post I followed did, but that approach is too slow for my streaming data.
I have also tried using Python's influxdb module to write directly, like:
    client.write(['interface,path=address,elementss=link value=3.14'], {'db': 'db'}, 204, 'line')
It was even slower.
So how can I write this data into InfluxDB in time? Are there any solutions?
Thank you, much appreciated!
Btw, I'm a beginner and can only use simple Python, so sad.
InfluxDB OSS will process writes faster if you batch them. The Python client's write_points method has a batch_size parameter that you can use to do this. If you are reading ~10k points/s, I would try a batch size of about 10k too. The batches should also be compressed to speed up the transfer.
write_points also allows sending the tags path=address,elementss=link as a dictionary, which should reduce parsing effort.
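As a minimal sketch of both ideas together (assuming the influxdb v1 Python client; host, port, and database are placeholders, and gzip=True needs a reasonably recent client version), you could buffer readings yourself and flush them in batches:

    from influxdb import InfluxDBClient

    # Placeholder connection details; gzip=True compresses each request body.
    client = InfluxDBClient(host='localhost', port=8086, database='db', gzip=True)

    BATCH_SIZE = 10000
    batch = []

    def record(value, ts_ns):
        # Buffer one reading; flush a full batch as a single HTTP request.
        batch.append({
            'measurement': 'interface',
            'tags': {'path': 'address', 'elementss': 'link'},  # tags as a dict
            'fields': {'value': value},
            'time': ts_ns,  # integer epoch nanoseconds
        })
        if len(batch) >= BATCH_SIZE:
            client.write_points(batch, time_precision='n')
            batch.clear()

Alternatively, pass one big list straight to write_points and let its batch_size parameter split it for you.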
Are you also running InfluxDB on the Raspberry Pi, or do you send the data off the Pi over a network connection?
I noticed you said in the comments that nanosecond precision is very important, but you did not include a timestamp in your line protocol point example. If timing is this critical, you should provide the timestamp yourself. Without an explicit timestamp in the data, InfluxDB assigns one when the data arrives, which is unpredictable.
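For example, with Python 3.7+ you can capture an integer epoch-nanosecond timestamp right at the read and hand it to the record() helper sketched above:

    import time

    val = mcp.read_adc(0)   # mcp as in your script
    ts_ns = time.time_ns()  # integer epoch nanoseconds (Python 3.7+)
    record(val, ts_ns)      # buffer for the batched write above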
As noted in the comments, you may want to consider preprocessing this data before sending it to InfluxDB. We can't make a concrete suggestion without knowing how you are processing the piezo data to detect footsteps, but ADC values are usually averaged in small batches (10 to 100 reads, depending) to reduce noise. Assuming your footstep detector runs continuously, 10,000 points/s works out to 864 million points per day from a single sensor. That is a lot of data to store and postprocess.
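For illustration only (the window size is a guess; the right smoothing depends on your footstep detection), averaging every 50 raw reads would cut 10,000 samples/s down to 200 stored points/s:

    WINDOW = 50
    buf = []

    while True:
        buf.append(mcp.read_adc(0))
        if len(buf) == WINDOW:
            record(sum(buf) / WINDOW, time.time_ns())  # reuses record() from above
            buf.clear()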
Please edit your question to include more information, if you are willing.
Related
I'm using ArangoDB 3.9.2 for a search task. The dataset contains 100,000 items. When I pass the entire dataset to the engine as one input list, the execution time is around ~10 sec, which is pretty quick. But if I pass the dataset in small batches of 100 items, one by one, the execution time grows rapidly: processing the full dataset takes about ~2 min. Could you please explain why this happens? The dataset is the same.
I'm using the Python driver ArangoClient from the python-arango lib, ver 0.2.1.
PS: I had a similar problem with Neo4j, but it was solved by committing transactions via the HTTP API. Does ArangoDB have something similar?
Every time you make a call to a remote system (Neo4j, ArangoDB, or any database) there is overhead in making the connection, sending the data, and then, after executing your command, tearing down the connection.
What you're doing is trying to find the 'sweet spot' for your implementation as to the most efficient batch size for the type of data you are sending, the complexity of your query, the performance of your hardware, etc.
What I recommend doing is writing a test script that sends the data in varying batch sizes to help you determine the optimal settings for your use case.
I have taken this approach with many systems that I've designed and the optimal batch sizes are unique to each implementation. It totally depends on what you are doing.
See what results you get for the overall load time if you use batch sizes of 100, 1000, 2000, 5000, and 10000.
This way you'll work out the best answer for you.
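As a starting point, a rough timing harness might look like this (connection details, database, and collection names are placeholders, and the API shown is from recent python-arango releases rather than 0.2.1):

    import time
    from arango import ArangoClient

    client = ArangoClient(hosts='http://localhost:8529')
    db = client.db('mydb', username='root', password='secret')
    col = db.collection('items')

    docs = [{'value': i} for i in range(100_000)]  # stand-in for your dataset

    for batch_size in (100, 1000, 2000, 5000, 10000):
        col.truncate()  # start each run from an empty collection
        start = time.perf_counter()
        for i in range(0, len(docs), batch_size):
            col.insert_many(docs[i:i + batch_size])
        print(batch_size, time.perf_counter() - start)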
I am working on a project to display sensor data in a Python GUI. The sensor data arrives at 2 kHz on a serial port from an Arduino. I was using pyserial (readline()) to read the sensor data on my laptop. After hours of debugging, I found that Python was only reading around 400 samples/sec, i.e., a reading frequency of about 400 Hz.
Is there any way to read the serial data at a higher rate with Python?
Thanks in advance.
Assuming the sensor is designed to transmit 2 kHz data and is doing so properly, my guess is that the time your Python code takes to read a sample, process it, update a plot, etc. is the limiting factor. Are you reading and processing samples one at a time? Is there a smarter way to read all of the available data in a big chunk, reducing the number of individual read/process steps?
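For instance, with pyserial you can drain everything the OS has buffered in a single read() call instead of paying readline() overhead per sample; a sketch (port, baud rate, and handle_sample() are placeholders):

    import serial

    ser = serial.Serial('/dev/ttyUSB0', 115200, timeout=0.1)

    buf = b''
    while True:
        # Read everything that has accumulated, in one call.
        buf += ser.read(max(1, ser.in_waiting))
        *complete, buf = buf.split(b'\n')  # keep any trailing partial line
        for line in complete:
            handle_sample(line)  # process one newline-terminated sample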
Are you plotting the data in "real time"? If so, the plot updates are likely the slow part.
I'm trying to get timestamp data to match accelerometer and gyroscope readings.
I'm using a Raspberry Pi 3 B+ with Python to pull accelerometer (ADXL345) and gyroscope (ITG-3200) readings. I'm currently reading them over I2C as fast as I can, and I write a timestamp from the system time (time.time()) immediately before each read.
I thought this would be sufficiently accurate, but when I look at the resulting data the time is often not monotonic and/or just looks wrong. In fact, the results often seem to better match the motion I was tracking if I throw out all but the first timestamp and then synthetically create times based on the sampling frequency of the device I'm collecting from!
That said, this approach clearly gives me wrong data, and I'm at a loss as to where to get correct data. The accelerometer and gyro don't seem to include times in anything they report, and if I pull data as fast as I can I'm still bound to miss some readings at their highest rates, meaning the times I use will always be somewhat wrong.
The accelerometer also has a FIFO cache it can store some values in, but again, if I pull from that cache, how do I know which timestamp goes with each value? The documentation mentions the cache storing values but says nothing about timestamps.
All of which leads me to believe I'm missing something. Is there a secret here I don't know? Or a standard practice I'm unaware of? Any thoughts or suggestions would be most welcome.
I have a 500+ MB CSV data file. My question is: which would be faster for data manipulation (e.g., reading, processing)? The Python MySQL client might be faster, since all the work is mapped into SQL queries and optimization is left to the optimizer. But, at the same time, Pandas works on the file directly, which should be faster than communicating with a server.
I have already checked "Large data" work flows using pandas, Best practices for importing large CSV files, Fastest way to write large CSV with Python, and Most efficient way to parse a large .csv in python?. However, I haven't really found any comparison regarding Pandas and MySQL.
Use Case:
I am working on a text dataset that consists of 1,737,123 rows and 8 columns. I am feeding this dataset into an RNN/LSTM network. Before feeding it, I do some preprocessing: encoding using a customized encoding algorithm.
More details:
I have 250+ experiments to do and 12 architectures (different model designs) to try.
I am confused; I feel I'm missing something.
There's no comparison online because these two scenarios give different results:
With Pandas, you end up with a DataFrame in memory (a NumPy ndarray under the hood), accessible as native Python objects
With MySQL client, you end up with data in a MySQL database on disk (unless you're using an in-memory database), accessible via IPC/sockets
So, the performance will depend on:
how much data needs to be transferred by lower-speed channels (IPC, disk, network)
how comparatively fast is transferring vs processing (which of them is the bottleneck)
which data format your processing facilities prefer (i.e. what additional conversions will be involved)
E.g.:
If your processing facility can reside in the same (Python) process that reads the data, reading it directly into Python types is preferable, since you won't need to transfer it all to the MySQL process and then back again (converting formats each time).
OTOH, if your processing facility is implemented in some other process and/or language, or resides within a computing cluster, hooking it to MySQL directly may be faster, since it eliminates the comparatively slow Python from the equation, and you'll need to transfer the data again and convert it into the processing app's native objects anyway.
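If you want a concrete number for your own data and hardware, a minimal timing sketch might look like this (the file name, table, and connection string are placeholders; the MySQL path assumes SQLAlchemy plus a MySQL driver such as PyMySQL):

    import time
    import pandas as pd
    from sqlalchemy import create_engine

    def timed(label, fn):
        start = time.perf_counter()
        result = fn()
        print(label, time.perf_counter() - start, 'seconds')
        return result

    # Reading the CSV directly into the Python process...
    df_csv = timed('pandas read_csv', lambda: pd.read_csv('data.csv'))

    # ...vs pulling the same rows out of MySQL over a connection.
    engine = create_engine('mysql+pymysql://user:pass@localhost/mydb')
    df_sql = timed('pandas read_sql', lambda: pd.read_sql('SELECT * FROM data', engine))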
I am transferring files with rsync from Python. I have a basic UI where the user can select files and initiate the transfer. I want to show the expected duration of transferring all the files they selected. I know the total size of all the files in bytes. What's a smart way to show the expected transfer duration? It doesn't have to be precisely accurate.
To calculate an estimated time to completion for anything, you simply need to keep track of the time taken to transfer the data completed so far, and base your estimate for the rest of the data on that past speed. Once you have that basic method, there are all sorts of ways to adjust the estimate to account for acceleration, congestion, and other effects: for example, taking the amount of data transferred in the last 100 seconds, breaking it down into 20-second increments, and calculating a weighted mean speed.
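A rough sketch of that weighted-mean idea (the bucket size, count, and weights are illustrative, not tuned):

    import time
    from collections import deque

    class EtaEstimator:
        """ETA from a weighted mean of recent transfer speed."""

        def __init__(self, total_bytes, buckets=5, bucket_seconds=20):
            self.total = total_bytes
            self.done = 0
            # e.g. five 20-second buckets covering the last ~100 seconds
            self.buckets = deque([0] * buckets, maxlen=buckets)
            self.bucket_seconds = bucket_seconds
            self.bucket_start = time.monotonic()

        def update(self, n_bytes):
            # Call whenever some bytes finish transferring.
            self.done += n_bytes
            if time.monotonic() - self.bucket_start >= self.bucket_seconds:
                self.buckets.append(0)  # start a new bucket; the oldest falls off
                self.bucket_start = time.monotonic()
            self.buckets[-1] += n_bytes

        def eta_seconds(self):
            # Newer buckets get higher weights.
            weights = range(1, len(self.buckets) + 1)
            weighted = sum(w * b for w, b in zip(weights, self.buckets))
            speed = weighted / (sum(weights) * self.bucket_seconds)  # bytes/sec
            return (self.total - self.done) / speed if speed else None

Call update() from whatever reports transfer progress, and eta_seconds() whenever you refresh the UI.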
I'm not familiar with using rsync from Python. Are you just calling it via os.exec*(), or are you using something like pysync (http://freecode.com/projects/pysync)? If you are spawning rsync processes, you'll struggle to get granular progress data (especially when transferring large files). I suppose you could spawn rsync --progress and parse the progress lines in some sneaky way, but that seems horribly awkward.