How to find the expected file transfer duration in Python?

I am using rsync to transfer files in Python. I have a basic UI where the user selects files and initiates the transfer. I want to show the expected time remaining to transfer all the files they selected. I know the total size of all the files in bytes. What's a smart way to show them the expected transfer duration? It doesn't have to be precise.

To calculate an estimated time to completion for anything, you simply need to keep track of how long the data transferred so far has taken and base your estimate for the rest of the data on the speed so far. Once you have that basic method, there are all sorts of ways you can adjust your estimate to take account of acceleration, congestion and other effects - for example, taking the amount of data transferred in the last 100 seconds, breaking it down into 20-second increments and calculating a weighted mean speed.
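For illustration, here is a minimal sketch of that rolling-window idea. The 100-second window and 20-second buckets are just the example numbers above, and TransferEstimator is a made-up name, not anything rsync-specific:

    import time
    from collections import deque

    class TransferEstimator:
        """Estimate time remaining from a weighted mean of recent transfer speed."""
        def __init__(self, total_bytes, window_seconds=100, bucket_seconds=20):
            self.total_bytes = total_bytes
            self.transferred = 0
            self.window_seconds = window_seconds
            self.bucket_seconds = bucket_seconds
            self.samples = deque()  # (timestamp, byte_count) pairs

        def update(self, byte_count):
            """Call whenever more bytes finish transferring."""
            now = time.time()
            self.transferred += byte_count
            self.samples.append((now, byte_count))
            # discard samples that have fallen out of the window
            while self.samples and self.samples[0][0] < now - self.window_seconds:
                self.samples.popleft()

        def eta_seconds(self):
            """Return estimated seconds remaining, or None if there is no data yet."""
            if not self.samples:
                return None
            now = time.time()
            n_buckets = self.window_seconds // self.bucket_seconds
            bucket_bytes = [0.0] * n_buckets
            for ts, byte_count in self.samples:
                i = int((now - ts) // self.bucket_seconds)  # 0 = newest bucket
                if i < n_buckets:
                    bucket_bytes[i] += byte_count
            # weight recent buckets more heavily than older ones
            weights = [n_buckets - i for i in range(n_buckets)]
            speed = (sum(w * b for w, b in zip(weights, bucket_bytes))
                     / (sum(weights) * self.bucket_seconds))  # bytes per second
            if speed <= 0:
                return None
            return (self.total_bytes - self.transferred) / speed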
I'm not familiar with using rsync from Python. Are you just calling it with os.exec*(), or are you using something like pysync (http://freecode.com/projects/pysync)? If you are spawning rsync processes, you'll struggle to get granular data (especially when transferring large files). I suppose you could spawn rsync --progress and parse the progress lines in some sneaky way, but that seems horribly awkward.
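If you do end up going the sneaky route, a rough sketch might look like the following. Note that the progress-line format varies between rsync versions, so treat the regex (and the src/dest paths) as assumptions to verify against your rsync's actual output:

    import re
    import subprocess

    proc = subprocess.Popen(
        ["rsync", "--progress", "src/", "dest/"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # also converts rsync's \r progress updates to \n
        bufsize=1,
    )
    # matches lines like "  1,234,567  45%  1.23MB/s    0:00:12"
    progress_re = re.compile(r"([\d,]+)\s+(\d+)%\s+(\S+/s)")
    for line in proc.stdout:
        m = progress_re.search(line)
        if m:
            bytes_done, percent, speed = m.groups()
            print("%s bytes transferred (%s%%) at %s" % (bytes_done, percent, speed))
    proc.wait()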

Related

Query execution time with small batches vs entire input set

I'm using ArangoDB 3.9.2 for a search task. The dataset contains 100,000 items. When I pass the entire dataset as an input list to the engine, the execution time is around ~10 s, which is pretty quick. But if I pass the dataset in small batches of 100 items, one by one, the execution time grows rapidly; in that case, processing the full dataset takes about ~2 min. Could you please explain why this happens? The dataset is the same.
I'm using the Python driver "ArangoClient" from the python-arango lib, version 0.2.1.
PS: I had a similar problem with Neo4j, but there it was solved by committing transactions through the HTTP API. Does ArangoDB have something similar?
Every time you make a call to a remote system (Neo4j, ArangoDB, or any other database) there is overhead in making the connection, sending the data, and then, after executing your command, tearing down the connection.
What you're doing is trying to find the 'sweet spot' for your implementation as to the most efficient batch size for the type of data you are sending, the complexity of your query, the performance of your hardware, etc.
What I recommend doing is writing a test script that sends the data in varying batch sizes to help you determine the optimal settings for your use case.
I have taken this approach with many systems that I've designed and the optimal batch sizes are unique to each implementation. It totally depends on what you are doing.
See what results you get for the overall load time if you use batch sizes of 100, 1000, 2000, 5000, and 10000.
This way you'll work out the best answer for you.
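As a starting point, here is a sketch of such a test script. It uses the current python-arango API, which differs from the 0.2.1 version in the question, and the AQL query, collection name, and credentials are placeholders for your actual search:

    import time
    from arango import ArangoClient

    client = ArangoClient()
    db = client.db("db", username="root", password="")  # adjust to your setup
    items = list(range(100_000))  # stand-in for the real input dataset

    for batch_size in (100, 1000, 2000, 5000, 10000):
        start = time.time()
        for i in range(0, len(items), batch_size):
            cursor = db.aql.execute(
                "FOR doc IN collection FILTER doc.value IN @batch RETURN doc",
                bind_vars={"batch": items[i:i + batch_size]},
            )
            list(cursor)  # drain the cursor so the query fully executes
        print("batch_size=%5d -> %.1f s total" % (batch_size, time.time() - start))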

How to write ten thousand data points per second to InfluxDB

I'm using a Raspberry Pi 4 to collect sensor data with a Python script.
Like:
val=mcp.read_adc(0)
This can read ten thousand values per second.
Now I want to save this data to InfluxDB for real-time analysis.
I have tried saving the readings to a log file while reading and then using Telegraf to collect them, as suggested in a blog post, but that approach is too slow for my streaming data.
I have also tried using Python's influxdb module to write directly, like:
client.write(['interface,path=address,elementss=link value=3.14'],{'db':'db'},204,'line')
It's worse.
So how can I write this data into InfluxDB in time? Are there any solutions?
Thank you, much appreciated!
Btw, I'm a beginner and can only use simple python, so sad.
InfluxDB OSS will process writes faster if you batch them. The Python client has a batch_size parameter you can use to do this. If you are reading ~10k points/s, I would try a batch size of about 10k too. The batches should also be compressed to speed up the transfer.
The write method also allows sending the tags (path=address,elementss=link) as a dictionary. Doing this should reduce parsing effort.
Are you also running InfluxDB on the raspberry pi or do you send the data off the Pi over a network connection?
I noticed that you said in the comments that nanosecond precision is very important but you did not include a timestamp in your line protocol point example. You should provide a timestamp yourself if the timing is this critical. Without an explicit timestamp in the data, InfluxDB will insert a timestamp at "when the data arrives" which is unpredictable.
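Putting the batching, dictionary tags, and explicit timestamps together, a sketch could look like this. It assumes the influxdb 1.x Python client, whose write_points method accepts a batch_size argument; the measurement and tag names are copied from your example, and the sensor read is simulated:

    import time
    from influxdb import InfluxDBClient

    # gzip=True compresses request bodies (available in recent client versions)
    client = InfluxDBClient(host="localhost", port=8086, database="db", gzip=True)

    def read_sensor():
        return 512  # stand-in for mcp.read_adc(0) from the question

    points = []
    for _ in range(10000):
        points.append({
            "measurement": "interface",
            "tags": {"path": "address", "elementss": "link"},
            "time": time.time_ns(),  # explicit nanosecond timestamp
            "fields": {"value": float(read_sensor())},
        })

    # one call; the client splits it into HTTP batches of 10k points
    client.write_points(points, time_precision="n", batch_size=10000)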
As noted in the comments, you may want to consider preprocessing this data some before sending it to InfluxDB. We can't make a suggestion without knowing how you are processing the piezo data to detect footsteps. Usually ADC values are averaged in small batches (10 - 100 reads, depending) to reduce noise. Assuming your footstep detector runs continuously, you'll have over 750 million points per day from a single sensor. This is a lot of data to store and postprocess.
Please edit your question to include more information, if you are willing.

Save data to disk to avoid re-computation

I'm a beginner with Python, but I'm at the final stages of a project I've been working on for the past year and I need help at the final step.
If needed I'll post my code, though it's not really relevant.
Here is my problem:
I have a database of images, say for example 100 images. On each of those images I run an algorithm called ICA. This algorithm is computationally heavy and each picture usually takes 7-10 seconds, so 100 pictures can take 700-1000 seconds, and that's way too long to wait.
The thing is, my database of images never changes. I never add or delete pictures, so the output of the ICA algorithm is always the same. So in reality, every time I run my code, I wait forever and get the same output every time.
Is there a way to save the data to the hard disk, and extract it at a later time?
Say I compute the ICA of the 100 images, which takes forever, and then I save the result and shut down my computer. The next time I run the program, I don't want it to recompute the ICA; I want it to use the values I stored previously.
Would such a thing be possible in Python? If so, how?
Since you're running computation-heavy algorithms, I'm going to assume you're using Numpy. If not, you should be.
Numpy has a numpy.save() function that lets you save arrays in binary format. You can then load them using numpy.load().
EDIT: Docs for the aforementioned functions are in NumPy's routines documentation, under the "NPZ Files" section.
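For example, a minimal caching pattern with those two functions (run_ica, the filename, and the stand-in image data here are placeholders for your own code):

    import os
    import numpy as np

    CACHE_FILE = "ica_results.npy"

    def run_ica(images):
        # placeholder for the expensive ICA computation
        return np.vstack([img.mean(axis=0) for img in images])

    images = [np.random.rand(64, 64) for _ in range(100)]  # stand-in image database

    if os.path.exists(CACHE_FILE):
        results = np.load(CACHE_FILE)   # fast path: reuse the stored output
    else:
        results = run_ica(images)       # slow path: compute once...
        np.save(CACHE_FILE, results)    # ...then persist for future runs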

PyAudio - Synchronizing playback and recording

I'm currently using PyAudio to work on a lightweight recording utility that fits a specific need of an application I'm planning. I am working with an ASIO audio interface. What I'm writing the program to do is play a wav file through the interface, while simultaneously recording the output from the interface. The interface is processing the signal onboard in realtime and altering the audio. As I'm intending to import this rendered output into a DAW, I need the output to be perfectly synced with the input audio. Using a DAW I can simultaneously play audio into my interface and record the output. It is perfectly synced in the DAW when I do this. The purpose of my utility is to be able to trigger this from a python script.
Through a brute-force approach I've come up with a solution that works, but I'm now stuck with a magic number and I'm unsure of whether this is some sort of constant or something I can calculate. If it is a number I can calculate that would be ideal, but I still would like to understand where it is coming from either way.
My callback is as follows:
def testCallback(in_data, frame_count, time_info, status):
    global firstCt, lastCt  # frame counters, initialized to 0 before the stream opens
    # read data from the wave file
    data = wave_file.readframes(frame_count)
    # calculate the number of latency frames for playback and recording
    # (1060 is my magic number)
    latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
    # no more data in the playback file
    if data == "":
        # the number of times we must keep the callback alive to capture all playback
        recordEndBuffer = latencyCalc / frame_count
        if lastCt < recordEndBuffer:
            # return 0-byte data to keep the callback alive
            data = b"0" * wave_file.getsampwidth() * frame_count
            lastCt += 1
    # we start recording before playback, so this accounts for the initial
    # "pre-playback" data in the output file
    if firstCt > (latencyCalc / frame_count):
        wave_out.writeframes(in_data)
    else:
        firstCt += 1
    return (data, pyaudio.paContinue)
My concern is with this line:
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
I put this calculation together by observing the offset of my output file compared to the original playback file. Two things were occurring: my output file started later than the original file when the two were played simultaneously, and it also ended early. Through trial and error I determined there was a specific number of frames extra at the beginning and missing at the end, and this calculates that number of frames. I understand the first piece: it is the input/output latencies (provided with sub-second accuracy) converted to frames using the sample rate. But I'm not quite sure how to account for the 1060, as I don't know where it comes from.
I've found that by playing with the latency settings on my ASIO driver, my application continues to sync the recorded file properly even as the output/input latencies change with the adjustment (the input and output latencies are always the same value), so the 1060 appears to be consistent on my machine. However, I simply don't know whether this is a value that can be calculated, or, if it is a specific constant, what exactly it represents.
Any help in better understanding these values would be appreciated. I'm happy my utility now works properly, but I would like to fully understand what is happening here, as I suspect that using a different interface would likely no longer work correctly (I would like to support this down the road for a few reasons).
EDIT 4/8/2014 in response to Roberto:
The value I receive for
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060
is 8576, with the extra 1060 bringing the total latency to 9636 frames. You are correct in your assumption of why I added the 1060 frames. I am playing the file through the external ASIO interface, and the processing I hope to capture in my recorded file is the result of the processing that occurs on the interface (not something I have coded). To compare the outputs, I simply played the test file and recorded the interface's output without any of the processing effects engaged on the interface. I then examined the two tracks in Audacity, and by trial and error determined that 1060 was the closest I could get the two to align. I have since realized it is still not exactly perfect, but it is incredibly close and audibly undetectable when the tracks are played simultaneously (which is not true when the 1060 offset is removed; there is a noticeable delay). Adding or removing even one more frame overcompensates compared to 1060 as well.
I do believe you are correct that the additional latency comes from the external interface. I initially wondered whether it was something I could calculate from the numerical information at hand, but I am concluding it is simply a constant of the interface. I feel this is true because I have determined that if I remove the 1060, the offset of the file is exactly the same as when performing the same test manually in Reaper (which is exactly the process I am automating). I am getting much better latency than I would in Reaper with my new brute-force offset, so I'm going to call this a win. In my application, the goal is to completely replace the original file with the newly processed file, so the absolute minimum latency between the two is desired.
In response to your question about ASIO in PyAudio: fortunately the answer is yes. You must compile PortAudio against the ASIO SDK for PortAudio to function with ASIO, and then update the PyAudio setup to compile that way. Fortunately I am working on Windows, where the prebuilt binaries at http://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio have ASIO support built in, and the devices are then accessible through ASIO.
Since I'm not allowed to comment, I'll ask you here: what is the value of (stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()? And how did you get the number 1060 in the first place?
With the line of code you marked off:
latencyCalc = math.ceil((stream.get_output_latency() + stream.get_input_latency()) * wave_file.getframerate()) + 1060, you simply add an extra 1060 frames to your total latency. It's not clear to me from your description why you do this, but I assume you measured the total latency in your resulting file and there was always a constant number of extra frames beyond the sum of the input and output latencies. So, did you consider that this extra delay might be due to processing? You said that you do some processing of the input audio signal, and processing certainly takes some time. Try doing the same with an unaltered input signal and see if the extra delay is reduced or removed. Even the other parts of your application, e.g. a GUI, can slow the recording down. You didn't describe your app completely, but I'm guessing that the extra latency is caused by your code and the operations it performs. And why is the 'magic number' always the same? Because your code is always the same.
To summarize: what does the 'magic number' represent? Obviously, it represents some extra latency in addition to your total round-trip latency.
What is causing this extra latency? The cause is most likely somewhere in your code: your application is doing something that takes additional time and thus introduces additional delay. The only other possibility that comes to mind is that you have added some extra 'silence period' somewhere in your settings, so you could check that too.

Running processes in parallel for data collection

I am collecting data from two pieces of equipment over serial ports (a scale and a conductivity probe). I need to continuously collect data from the scale and average the readings between collection points of the conductivity probe (roughly a minute apart).
Thus I need to run two processes at the same time: one that collects data from the scale, and another that waits for data from the conductivity probe. Once the probe produces a reading, the second process would signal the first to hand over the collected scale data, which is then timestamped and saved to a .csv file.
I looked into subprocess, but I can't figure out how to reset a running script. Any suggestions on what to look into?
Instead of using threads, you could also implement your data sources as generators and just loop over them to consume the incoming data and do something with it. Perhaps using two different generators and zipping them together would work; that would actually be a nice experiment, though I'm not entirely sure it can be done. For comparison, a rough sketch of the threaded approach follows below.
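Here is a minimal sketch of that threaded alternative, with both serial reads simulated by random values; in practice you would replace them with pyserial reads from your scale and probe, and the loop bounds and sleep times are placeholders:

    import csv
    import queue
    import random
    import statistics
    import threading
    import time

    scale_queue = queue.Queue()

    def scale_reader():
        """Continuously push scale readings onto a queue."""
        while True:
            scale_queue.put(random.gauss(100.0, 0.5))  # stand-in for a serial read
            time.sleep(0.1)

    threading.Thread(target=scale_reader, daemon=True).start()

    with open("log.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(3):  # one iteration per conductivity reading
            time.sleep(2)   # stand-in for blocking on the probe (~a minute in practice)
            probe_value = random.gauss(1.2, 0.01)  # stand-in for the probe read
            # drain everything the scale produced since the last probe reading
            readings = []
            while not scale_queue.empty():
                readings.append(scale_queue.get())
            writer.writerow([time.time(), probe_value, statistics.mean(readings)])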
