Efficiency when printing progress updates, print x vs if x%y==0: print x - python

I am running an algorithm which reads an excel document by rows, and pushes the rows to a SQL Server, using Python. I would like to print some sort of progression through the loop. I can think of two very simple options and I would like to know which is more lightweight and why.
Option A:
for x in xrange(1, sheet.nrows):
    print x
    cur.execute()  # pushes to SQL
Option B:
for x in xrange(1, sheet.nrows):
    if x % some_check_progress_value == 0:
        print x
    cur.execute()  # pushes to SQL
I have a feeling that the second one would be more efficient but only for larger scale programs. Is there any way to calculate/determine this?
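One way to answer this empirically is with the timeit module. Here is a rough Python 3 sketch (the question's code is Python 2): the SQL push is replaced with nothing, and output goes to an in-memory buffer so the console is not flooded while timing; real console I/O would make the gap even larger.

```python
import io
import timeit

def option_a(n, out):
    # Print on every iteration.
    for x in range(1, n):
        print(x, file=out)

def option_b(n, out, step=100):
    # Check the modulo every iteration, but print only every `step` iterations.
    for x in range(1, n):
        if x % step == 0:
            print(x, file=out)

t_a = timeit.timeit(lambda: option_a(10000, io.StringIO()), number=20)
t_b = timeit.timeit(lambda: option_b(10000, io.StringIO()), number=20)
print(t_b < t_a)  # option B performs ~1% of the prints
```

The modulo check is far cheaper than a print, so Option B wins whenever the print frequency drops meaningfully.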

I'm a newbie, so I can't comment. An "answer" might be overkill, but it's all I can do for now.
My favorite thing for this is tqdm. It's minimally invasive, both code-wise and output-wise, and it gets the job done.

I am one of the developers of tqdm, a Python progress bar that tries to be as efficient as possible while providing as many automated features as possible.
The biggest performance sink we had was indeed I/O: printing to the console/file/whatever.
But if your loop is tight (more than 100 iterations per second), printing every update is useless: you could print only 1/10 of the updates and the user would see no difference, while your bar would have 10 times less overhead.
To fix that, we first added a mininterval parameter which updates the display only every x seconds (by default 0.1 seconds; the human eye cannot really perceive anything faster than that). Something like this:
import time

def my_bar(iterator, mininterval=0.1):
    counter = 0
    last_print_t = 0
    for item in iterator:
        if (time.time() - last_print_t) >= mininterval:
            last_print_t = time.time()
            print_your_bar_update(counter)
        counter += 1
This will mostly fix your issue as your bar will always have a constant display overhead which will be more and more negligible as you have bigger iterators.
If you want to go further in the optimization, note that time.time() is itself a system call and thus costs more than simple Python statements. To avoid that, you want to minimize the calls to time.time() by introducing another variable: miniters, the minimum number of iterations to skip before even checking the time:
import time

def my_bar(iterator, mininterval=0.1, miniters=10):
    counter = 0
    last_print_t = 0
    last_print_counter = 0
    for item in iterator:
        if (counter - last_print_counter) >= miniters:
            if (time.time() - last_print_t) >= mininterval:
                last_print_t = time.time()
                last_print_counter = counter
                print_your_bar_update(counter)
        counter += 1
You can see that miniters is similar to your Option B modulus solution, but it works better as a layer on top of the time check, because a time interval is easier to configure than an iteration count.
With these two parameters, you can manually finetune your progress bar to make it the most efficient possible for your loop.
However, miniters (or a modulus) is tricky to make work for everyone without manual finetuning: you need good assumptions and clever tricks to automate that tuning. This is one of the major ongoing pieces of work on tqdm. Basically, we try to calculate miniters so that it matches mininterval, so that the time check isn't even needed anymore. This automagic setting kicks in after mininterval first gets triggered, something like this:
from __future__ import division
import time

def my_bar(iterator, mininterval=0.1, miniters=10, dynamic_miniters=True):
    counter = 0
    last_print_t = 0
    last_print_counter = 0
    for item in iterator:
        if (counter - last_print_counter) >= miniters:
            cur_time = time.time()
            if (cur_time - last_print_t) >= mininterval:
                if dynamic_miniters:
                    # Simple rule of three: scale miniters so the next
                    # check lands about mininterval seconds from now.
                    delta_it = counter - last_print_counter
                    delta_t = cur_time - last_print_t
                    miniters = delta_it * mininterval / delta_t
                last_print_t = cur_time
                last_print_counter = counter
                print_your_bar_update(counter)
        counter += 1
There are various ways to compute miniters automatically, but usually you want to update it to match mininterval.
If you are interested in digging deeper, you can check the dynamic_miniters internal parameter, maxinterval, and the experimental monitoring thread of the tqdm project.

Using a modulus check (counter % N == 0) is almost free compared to print, and it is a great solution if you run a high-frequency loop that logs a lot, especially if you do not need to print on every iteration but still want some feedback along the way.

Related

Why does my for loop with if else clause run so slow?

TL;DR:
I'm trying to understand why the below for loop is incredibly slow, taking hours to run on a dataset of 160K entries.
I have a working solution using a function and .apply(), but I want to understand why my homegrown solution is so bad. I'm obviously a huge beginner with Python:
popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1
df['popular_or_not'] = popular_or_not
df
In more detail:
I'm currently learning Python for data science, and I'm looking at this dataset on Kaggle: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks
I'm interested in predicting/modelling the popularity score. It is not normally distributed:
plt.bar(df['popularity'].value_counts().index, df['popularity'].value_counts().values)
I would like to add a column, to say whether a track is popular or not, with popular tracks being those that get a score of 5 and above and unpopular being the others.
I have tried the following solution, but it runs incredibly slowly, and I'm not sure why. It runs fine on a very small subset, but would take a few hours to run on the full dataset:
popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1
df['popular_or_not'] = popular_or_not
df
This alternative solution works fine:
def check_popularity(score):
    # "popular" means a score of 5 and above
    if score >= 5:
        return 1
    else:
        #pdb.set_trace()
        return 0
df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)
I think understanding why my first solution doesn't work might be an important part of my Python learning.
Thanks everyone for your comments. I'm going to summarize them below as an answer to my question, but please feel free to jump in if anything is incorrect:
The reason my initial for loop was so slow is that I was evaluating df['id'] == id 160k times; each such comparison scans the entire column, so the loop is effectively O(n²), which is typically very slow.
For this type of operation, instead of iterating over a pandas dataframe thousands of times, it's always a good idea to think of applying vectorization - a bunch of tools and methods to process a whole column in a single instruction at C speed. This is what I did with the following code:
def check_popularity(score):
    # "popular" means a score of 5 and above
    if score >= 5:
        return 1
    else:
        #pdb.set_trace()
        return 0
df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)
By using .apply with a predefined function, I get the same result, but in seconds instead of hours.
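The same idea can go one step further: the comparison itself is vectorizable, so no Python-level function is needed at all. A minimal sketch, using a tiny hypothetical frame standing in for the 160k-row Kaggle dataset:

```python
import pandas as pd

# Tiny stand-in for the real dataset; only the relevant columns.
df = pd.DataFrame({"id": ["a", "b", "c"], "popularity": [0, 3, 42]})

# Fully vectorized: one comparison over the whole column at C speed,
# no per-row Python function call.
df["popular_or_not"] = (df["popularity"] >= 5).astype(int)
print(df["popular_or_not"].tolist())  # [0, 0, 1]
```

This is usually faster still than .apply, because .apply invokes the Python function once per row while the boolean comparison runs in a single C-level pass.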

Zero return in measuring time of function

import time

def find(a):
    count = 0
    for item in a:
        count = count + 1
        if item == 2:
            return count

a = [7,4,5,10,3,5,88,5,5,5,5,5,5,5,5,5,5,55,
     5,5,5,5,5,5,5,5,5,5,5,5,55,5,5,5,5,5,
     5,5,5,5,5,2,5,5,5,55,5,55,5,5,5,6]

print(len(a))
sTime = time.time()
print(find(a))
eTime = time.time()
ave = eTime - sTime
print(ave)
I want to measure the execution time of this function.
My print(ave) returns 0; why?
To accurately time code execution you should use the timeit module rather than time. timeit easily repeats a code block many times for timing, which avoids the near-zero results that cause your question.
import timeit

s = """
def find(a):
    count = 0
    for item in a:
        count = count + 1
        if item == 2:
            return count

a = [7,4,5,10,3,5,88,5,5,5,5,5,5,5,5,5,5,55,5,5,5,5,5,5,5,5,5,5,5,5,55,5,5,5,5,5,5,5,5,5,5,2,5,5,5,55,5,55,5,5,5,6]
find(a)
"""

print(timeit.timeit(stmt=s, number=100000))
This will measure the amount of time it takes to run the code in multiline string s 100,000 times. Note that I replaced print(find(a)) with just find(a) to avoid having the result printed 100,000 times.
Running many times is advantageous for several reasons:
In general, code runs very quickly. Summing many quick runs results in a number which is actually meaningful and useful
Run time depends on many variable and uncontrollable factors (such as other processes using computing power). Running many times helps average this out
If you are using timeit to compare two methodologies to see which is faster, multiple runs will make it easier to see the conclusive result
I'm not sure either; I get a time of about 1.4E-5.
Try putting the call into a loop to measure more iterations:
for i in range(10000):
    result = find(a)
print(result)

Using time.time() to time a function often return 0 seconds

I have to time the implementation I did of an algorithm in one of my classes, and I am using the time.time() function to do so. After implementing it, I have to run that algorithm on a number of data files which contains small and bigger data sets in order to formally analyse its complexity.
Unfortunately, on the small data sets I get a runtime of 0 seconds, even though that function shows a precision of 0.000000000000000001 when timing the bigger data sets, and I cannot believe the smaller data sets really take less than that.
My question is: Is there a problem using this function (and if so, is there another function I can use that has a better precision)? Or am I doing something wrong?
Here is my code if ever you need it:
import sys, time
import random
from utility import parseSystemArguments, printResults
...

def main(ville):
    start = time.time()
    solution = dynamique(ville)  # Algorithm implementation
    end = time.time()
    return (end - start, solution)

if __name__ == "__main__":
    sys.argv.insert(1, "-a")
    sys.argv.insert(2, "3")
    (algoNumber, ville, printList) = parseSystemArguments()
    (algoTime, solution) = main(ville)
    printResults(algoTime, solution, printList)
The printResults function:
def printResults(time, solution, printList=True):
    print("Temps d'execution = " + str(time) + "s")
    if printList:
        print(solution)
The solution to my problem was to use the timeit module instead of the time module.
import timeit
...

def main(ville):
    start = timeit.default_timer()
    solution = dynamique(ville)
    end = timeit.default_timer()
    return (end - start, solution)
Don't confuse the resolution of the system time with the resolution of a floating point number. The time resolution on a computer is only as frequent as the system clock is updated. How often the system clock is updated varies from machine to machine, so to ensure that you will see a difference with time, you will need to make sure it executes for a millisecond or more. Try putting it into a loop like this:
start = time.time()
k = 100000
for i in range(k):
    solution = dynamique(ville)
end = time.time()
return ((end - start) / k, solution)
In the final tally, you then need to divide by the number of loop iterations to know how long your code actually runs once through. You may need to increase k to get a good measure of the execution time, or you may need to decrease it if your computer is running in the loop for a very long time.
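On top of the loop-averaging trick, Python 3.3+ offers time.perf_counter(), a monotonic clock specifically intended for interval timing with the highest available resolution. A sketch of the same idea, using a trivial stand-in for the question's dynamique(ville) call:

```python
import time

def dummy_work():
    # Hypothetical stand-in for the real dynamique(ville) call.
    return sum(range(1000))

k = 10000
start = time.perf_counter()  # high-resolution, monotonic clock
for _ in range(k):
    dummy_work()
# Divide by the iteration count to get the per-call time.
per_call = (time.perf_counter() - start) / k
print(per_call > 0.0)
```

Unlike time.time(), perf_counter() is not tied to the wall clock's update granularity, so even a single fast call usually yields a nonzero interval.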

How to improve a simple caching mechanism in Python?

just registered so I could ask this question.
Right now I have this code that prevents a class from updating more than once every five minutes:
now = datetime.now()
delta = now - myClass.last_updated_date
seconds = delta.seconds
if seconds > 300:
    update(myClass)
else:
    retrieveFromCache(myClass)
I'd like to modify it by allowing myClass to update twice per 5 minutes, instead of just once.
I was thinking of creating a list to store the last two times myClass was updated, and comparing against those in the if statement, but I fear my code will get convoluted and harder to read if I go that route.
Is there a simpler way to do this?
You could do it with a simple counter. The idea is that get_update_count tracks how often the class has been updated since the cache was last refreshed.
if seconds > 300 or get_update_count(myClass) < 2:
    # and increment the update count
    update(myClass)
else:
    # reset the update count
    retrieveFromCache(myClass)
I'm not sure how you uniquely identify myClass.
update_map = {}

def update(instance):
    # do the update, then bump the counter
    update_map[instance] = update_map.get(instance, 0) + 1

def get_update_count(instance):
    return update_map.get(instance, 0)
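A counter needs extra bookkeeping to know when to reset it. An alternative that stays readable is to remember only the last few update times in a deque and drop the ones that have aged out of the window. This is a sketch with hypothetical names, not code from the question:

```python
import time
from collections import deque

class RateLimitedUpdater:
    """Allow at most `max_updates` updates per `window` seconds."""

    def __init__(self, max_updates=2, window=300):
        self.max_updates = max_updates
        self.window = window
        self.timestamps = deque()

    def should_update(self, now=None):
        now = time.time() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_updates:
            self.timestamps.append(now)
            return True
        return False

limiter = RateLimitedUpdater()
print(limiter.should_update(now=0.0))    # first update: allowed
print(limiter.should_update(now=1.0))    # second update: allowed
print(limiter.should_update(now=2.0))    # third within window: denied
print(limiter.should_update(now=302.0))  # oldest has aged out: allowed
```

The deque never holds more than max_updates entries, so both the check and the cleanup are effectively constant time.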

Python Beginner: Selective Printing in loops

I'm a very new python user (had only a little prior experience with html/javascript as far as programming goes), and was trying to find some ways to output only intermittent numbers in my loop for a basic bicycle racing simulation (10,000 lines of biker positions would be pretty excessive :P).
I tried in this loop several 'reasonable' ways to communicate a condition where a floating point number equals its integer floor (int, floor division) to print out every 100 iterations or so:
for i in range(0, 10000):
    i = i + 1
    t = t + t_step  # t is initialized at 0 while t_step is set at .01
    acceleration_rider1 = (power_rider1 / (70 * velocity_rider1)) - (force_drag1 / 70)
    velocity_rider1 = velocity_rider1 + (acceleration_rider1 * t_step)
    position_rider1 = position_rider1 + (velocity_rider1 * t_step)
    force_drag1 = area_rider1 * (velocity_rider1 ** 2)
    acceleration_rider2 = (power_rider2 / (70 * velocity_rider1)) - (force_drag2 / 70)
    velocity_rider2 = velocity_rider2 + (acceleration_rider2 * t_step)
    position_rider2 = position_rider2 + (velocity_rider2 * t_step)
    force_drag2 = area_rider1 * (velocity_rider2 ** 2)
    if t == int(t):  # TRIED t == t // 1 AND OTHER VARIANTS THAT DON'T WORK HERE :(
        print t, "biker 1", position_rider1, "m", "\t", "biker 2", position_rider2, "m"
The for loop auto increments for you, so you don't need to use i = i + 1.
You don't need t; just use the % (modulo) operator to find multiples of a number.
# Log every 1000 lines.
LOG_EVERY_N = 1000
for i in range(1000):
    ...  # calculations with i
    if (i % LOG_EVERY_N) == 0:
        print "logging: ..."
To print out every 100 iterations, I'd suggest
if i % 100 == 0: ...
If you'd rather not print the very first time, then maybe
if i and i % 100 == 0: ...
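The difference between the two conditions is easy to verify with a quick Python 3 check (not part of the original answer):

```python
# Every 100th index, including 0 (the very first iteration).
printed_all = [i for i in range(301) if i % 100 == 0]
# Same, but i == 0 is falsy, so the first iteration is skipped.
printed_skip_first = [i for i in range(301) if i and i % 100 == 0]
print(printed_all)         # [0, 100, 200, 300]
print(printed_skip_first)  # [100, 200, 300]
```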
(as another answer noted, the i = i + 1 is supererogatory given that i is the control variable of the for loop anyway -- it's not particularly damaging though, just somewhat superfluous, and is not really relevant to the issue of why your if doesn't trigger).
While basing the condition on t may seem appealing, t == int(t) is unlikely to work unless the t_step is a multiple of 1.0 / 2**N for some integer N -- fractions cannot be represented exactly in a float unless this condition holds, because floats use a binary base. (You could use decimal.Decimal, but that would seriously impact the speed of your computation, since float computation are directly supported by your machine's hardware, while decimal computations are not).
The other answers suggest that you use the integer variable i instead. That also works, and is the solution I would recommend. This answer is mostly for educational value.
I think it's a roundoff error that is biting you. Floating point numbers can often not be represented exactly, so adding .01 to t for 100 times is not guaranteed to result in t == 1:
>>> sum([.01]*100)
1.0000000000000007
So when you compare to an actual integer number, you need to build in a small tolerance margin. Something like this should work:
if abs(t - int(t)) < 1e-6:
    print t, "biker 1", position_rider1, "m", "\t", "biker 2", position_rider2, "m"
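On Python 3.5+, math.isclose offers a ready-made tolerance check for the same purpose (a Python 3 sketch, whereas the question's code is Python 2):

```python
import math

t = sum([.01] * 100)   # accumulates rounding error: 1.0000000000000007
print(t == 1.0)        # exact comparison fails
print(math.isclose(t, round(t), abs_tol=1e-6))  # tolerant comparison succeeds
```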
You can use the Python library tqdm (the name derives from the Arabic word taqaddum (تقدّم), which can mean "progress") to show progress, and use its write() method to print intermittent log statements, as in the answer by @Stephen.
Why is tqdm useful in your case?
Shows a compact and fancy progress bar with very minimal code change.
Does not fill your console with thousands of log statements, yet shows accurate iteration progress of your for loop.
Caveats:
tqdm writes its output to stdout only, so it does not integrate with the logging library directly. You can, however, redirect its output to a logfile very easily.
Adds a little performance overhead.
Code
from tqdm import tqdm
from time import sleep

# Log every 100 lines.
LOG_EVERY_N = 100
for i in tqdm(range(1, 1000)):
    if i % LOG_EVERY_N == 0:
        tqdm.write(f"logging: {i}")
    sleep(0.5)
How to install?
pip install tqdm
