I am collecting data from two pieces of equipment using serial ports (a scale and a conductivity probe). I need to collect data continuously from the scale and average it between readings of the conductivity probe (roughly a minute apart).
Thus I need to run two processes at the same time: one that collects data from the scale, and another that waits for data from the conductivity probe. Once the probe data arrives, the second process would send a command to the first to fetch the collected scale data, which is then timestamped and saved to a .csv file.
I looked into subprocess, but I can't figure out how to reset a running script. Any suggestions on what to look into?
Instead of using threads, you could also implement your data sources as generators and just loop over them to consume the incoming data and do something with it. Perhaps using two different generators and zipping them together would work; it would actually be a nice experiment, though I'm not entirely sure it can be done.
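A minimal single-threaded sketch of that idea, assuming pyserial, hypothetical port names, and that both instruments emit one numeric reading per line (a real version should also guard against partial lines):

import csv
from datetime import datetime

import serial  # pyserial

# Short timeouts keep readline() from blocking, so zip() can keep
# polling both sources in lockstep.
scale_port = serial.Serial("/dev/ttyUSB0", 9600, timeout=0.05)
probe_port = serial.Serial("/dev/ttyUSB1", 9600, timeout=0.05)

def readings(port):
    """Yield one float per complete line, or None when nothing arrived."""
    while True:
        raw = port.readline().decode(errors="ignore").strip()
        yield float(raw) if raw else None

scale_buffer = []  # scale readings since the last probe value
with open("log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for weight, conductivity in zip(readings(scale_port), readings(probe_port)):
        if weight is not None:
            scale_buffer.append(weight)
        if conductivity is not None and scale_buffer:
            avg = sum(scale_buffer) / len(scale_buffer)
            writer.writerow([datetime.now().isoformat(), conductivity, avg])
            scale_buffer.clear()
            f.flush()  # keep the .csv current if the script is interrupted

Zipping works here because each generator yields None on quiet polls instead of blocking, so neither port starves the other.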
I want to be able to throw a lot of data-collection tasks at some Ray workers while a trainer works concurrently and asynchronously on another CPU, training on the collected data. The idea resembles this example from the docs: https://docs.ray.io/en/master/auto_examples/plot_parameter_server.html#asynchronous-parameter-server-training
The difference is that I don't want to hang waiting for the next sample to arrive (the ray.wait in the attached example), which blocks me from assigning a new task. Instead, I want to throw a lot of samples at the pool and have the trainer start training only once at least N samples have been collected by the data-collection tasks.
How can I do that using Ray?
Can you take a look at e.g. DQN's or SAC's execution plan in RLlib?
ray/rllib/agents/dqn/dqn.py::execution_plan().
E.g. DQN samples via the remote workers and puts the collected samples into the replay buffer while, at the same time, sampling from that buffer and doing learning updates on the buffer-sampled data. You can also set the "training intensity": the ratio between time steps sampled and time steps trained.
SAC works the same way. APEX-DQN, on the other hand, uses distributed replay buffers to allow for even faster sample storage and retrieval.
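As a minimal sketch (assuming a Ray 1.x style API, matching the linked docs), both the sampling/training ratio and the "wait for at least N samples" condition are just config keys on the trainer:

import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

trainer = DQNTrainer(
    env="CartPole-v0",
    config={
        "num_workers": 4,          # remote workers collect samples in parallel
        "learning_starts": 1000,   # no learning updates until 1000 samples are stored
        "training_intensity": 16,  # time steps trained per time step sampled
    },
)

for _ in range(10):
    result = trainer.train()  # one iteration of concurrent sampling + training
    print(result["episode_reward_mean"])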
I'm using a Raspberry Pi 4 to collect sensor data with a Python script.
Like:
val = mcp.read_adc(0)  # read channel 0 of the ADC
This can read about ten thousand samples per second.
And now I want to save this data to InfluxDB for real-time analysis.
I have tried saving it to a log file while reading and then using Telegraf to collect it, as this blog did:
But that is too slow for my streaming data.
I have also tried using Python's influxdb module to write directly, like:
client.write(['interface,path=address,elementss=link value=3.14'],{'db':'db'},204,'line')
It's worse.
So how can I write this data into InfluxDB in time? Are there any solutions?
Thank you, much appreciated!
Btw, I'm a beginner and can only use simple Python, so sad.
InfluxDB OSS will process writes faster if you batch them. The Python client's write method has a batch_size parameter you can use to do this. If you are reading ~10k points/s, I would try a batch size of about 10k too. The batches should also be compressed to speed up transfer.
The write method also accepts the tags (path=address,elementss=link) as a dictionary. Doing this should decrease parsing effort.
Are you also running InfluxDB on the Raspberry Pi, or do you send the data off the Pi over a network connection?
I noticed you said in the comments that nanosecond precision is very important, but you did not include a timestamp in your line-protocol point example. You should provide a timestamp yourself if timing is this critical. Without an explicit timestamp in the data, InfluxDB inserts a timestamp of "when the data arrives", which is unpredictable.
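Putting the batching, the dictionary tags, and the explicit timestamps together, a minimal sketch with the influxdb 1.x Python client (reusing your mcp.read_adc(0) call) might look like:

import time

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="db")

points = []
for _ in range(10000):
    val = mcp.read_adc(0)  # your existing ADC read
    points.append({
        "measurement": "interface",
        "tags": {"path": "address", "elementss": "link"},  # tags as a dict
        "fields": {"value": float(val)},
        "time": time.time_ns(),  # explicit nanosecond timestamp
    })

# One batched call; the client splits the list into requests of batch_size points.
client.write_points(points, time_precision="n", batch_size=10000)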
As noted in the comments, you may want to consider preprocessing this data before sending it to InfluxDB. We can't make a concrete suggestion without knowing how you are processing the piezo data to detect footsteps, but ADC values are usually averaged in small batches (10-100 reads, depending) to reduce noise. Assuming your footstep detector runs continuously, 10,000 reads/s works out to about 864 million points per day from a single sensor. That is a lot of data to store and postprocess.
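A hedged sketch of that pre-averaging step (the batch size is something you would tune against your noise level and the footstep timescale):

import time

BATCH = 100  # raw reads per stored point: 10 kHz in -> 100 points/s out

def averaged_points():
    """Yield (timestamp_ns, mean_adc) pairs, one per BATCH raw reads."""
    while True:
        t = time.time_ns()
        total = sum(mcp.read_adc(0) for _ in range(BATCH))
        yield t, total / BATCH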
Please edit your question to include more information, if you are willing.
I am totally new to Ray and have a question about whether it might be a solution here.
I am optimising an image modelling code and have successfully optimised it to run on a single machine, using multi-threaded numpy operations.
Each image generation is a serial operation, which scales across a single node.
What I’d like to do is scale each of these locally parallel jobs across multiple nodes.
Before refactoring, the code was parallelised at a high level, calculating single images in parallel. I would like to replicate this parallel behaviour again, across multiple nodes. Essentially, this would be batch-running a number of independent jobs, each computing a single image in parallel on its own node. The computations themselves are independent of each other; the only communication required is sending parameters at the beginning (small) and image arrays at the end (large).
As mentioned, the original parallel implementation used joblib to parallelise the serial image computation over local CPUs, with each image calculation on a separate CPU. Now I want to replicate this, except with one image-calculation process per node, which will then multithread across that compute node.
So my idea is to try the Ray backend for joblib to control this process. This is the previous high-level joblib call for running multiple serial image computations in parallel:
[screenshot of the joblib call omitted]
I believe I can just encapsulate the above call with:
with joblib.parallel_backend('ray'):  # full wiring sketched below
The above loop is actually being called inside a method of a class, and the image computation uses the class self construct to pass around variables and arrays. Is there anything I have to do with actors to preserve this state?
Any thoughts or pointers would be greatly appreciated.
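For the backend part, a sketch of how the Ray joblib backend is typically wired up (compute_image and param_list are hypothetical stand-ins for the call in the screenshot):

import joblib
from joblib import Parallel, delayed

import ray
from ray.util.joblib import register_ray

ray.init(address="auto")  # connect to an already-running Ray cluster
register_ray()            # registers the "ray" joblib backend

with joblib.parallel_backend("ray"):
    # Each delayed call becomes a Ray task that can be scheduled on any node.
    images = Parallel(n_jobs=-1)(
        delayed(compute_image)(params) for params in param_list
    )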
I need to send an array of jobs to a SLURM cluster, and I need them to aggregate part of their results into one combined file. However, I can't have multiple independent SLURM array drones writing to the same file, so currently I'm trying to make it so that only the last drone aggregates all the data.
At the moment, each array drone checks whether all the other results have been written when it finishes, and if they have, it does the file writing. However, multiple drones currently finish at almost the same time and still end up trying to write to the same file.
I would like only the last drone in the array to do this. However, the numerically last drone (i.e. by checking the JOBID) may not be the last to finish, as the jobs take slightly variable lengths of time.
So is there a way for each drone to check whether it's the last one still running in the array? Or is there a better way to do this that I'm overlooking?
Also, I would prefer an answer in Python, since that's what I'm using, if possible.
The easiest way would be to create an additional job for the aggregation and add a dependency on the job array:
#SBATCH --dependency=afterany:<jobid of the job array>
See https://slurm.schedmd.com/job_array.html and https://slurm.schedmd.com/sbatch.html
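Since you'd prefer Python, here is a sketch of submitting both jobs from a driver script (drone.sh and aggregate.sh are hypothetical batch scripts):

import re
import subprocess

# Submit the job array and grab its id from sbatch's
# "Submitted batch job <id>" output line.
out = subprocess.run(
    ["sbatch", "--array=0-99", "drone.sh"],
    check=True, capture_output=True, text=True,
).stdout
job_id = re.search(r"Submitted batch job (\d+)", out).group(1)

# The aggregation job only starts after every array task has ended.
subprocess.run(
    ["sbatch", f"--dependency=afterany:{job_id}", "aggregate.sh"],
    check=True,
)

With this split, the aggregator is the only process that ever opens the combined file, so the write race disappears.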
I am transferring files with rsync in Python. I have a basic UI where the user selects the files and initiates the transfer. I want to show the expected duration of transferring all the files they selected. I know the total size of all the files in bytes. What's a smart way to show the expected transfer duration? It doesn't have to be precise.
To calculate an estimated time to completion for anything, you simply need to keep track of the time taken to transfer the data completed so far, and base your estimate for the remaining data on that past speed. Once you have that basic method, there are all sorts of ways to adjust the estimate to account for acceleration, congestion, and other effects; for example, take the amount of data transferred in the last 100 seconds, break it into 20 s increments, and calculate a weighted mean speed.
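A minimal sketch of that weighted-window idea (assuming you can observe byte counts as the transfer proceeds; the newest window may be partial, which biases the estimate low, but that is fine for a rough ETA):

import collections
import time

class EtaEstimator:
    def __init__(self, total_bytes, window=20.0, keep=5):
        self.total = total_bytes
        self.done = 0
        self.window = window
        self.buckets = collections.deque([0.0], maxlen=keep)  # bytes per window
        self.bucket_start = time.monotonic()

    def update(self, new_bytes):
        """Record newly transferred bytes, rolling to a new window as needed."""
        now = time.monotonic()
        if now - self.bucket_start >= self.window:
            self.buckets.append(0.0)
            self.bucket_start = now
        self.buckets[-1] += new_bytes
        self.done += new_bytes

    def eta_seconds(self):
        """Weighted mean speed over recent windows; newer windows count more."""
        weights = range(1, len(self.buckets) + 1)
        speed = sum(w * b for w, b in zip(weights, self.buckets)) / (
            sum(weights) * self.window
        )
        return (self.total - self.done) / speed if speed > 0 else float("inf")

Call update() from whatever reports transfer progress and eta_seconds() whenever the UI refreshes.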
I'm not familiar with using rsync from Python. Are you just calling it via os.exec*(), or are you using something like pysync (http://freecode.com/projects/pysync)? If you are spawning rsync processes, you'll struggle to get granular data (especially if transferring large files). I suppose you could spawn rsync --progress and parse the progress lines in some sneaky way, but that seems horridly awkward.