How to improve a simple caching mechanism in Python?

How to improve a simple caching mechanism in Python? - python

just registered so I could ask this question.
Right now I have this code that prevents a class from updating more than once every five minutes:
now = datetime.now()
delta = now - myClass.last_updated_date
seconds = delta.seconds
if seconds > 300
update(myClass)
else
retrieveFromCache(myClass)
I'd like to modify it by allowing myClass to update twice per 5 minutes, instead of just once.
I was thinking of creating a list to store the last two times myClass was updated, and comparing against those in the if statement, but I fear my code will get convoluted and harder to read if I go that route.
Is there a simpler way to do this?

You could do it with a simple counter. Concept is get_update_count tracks how often the class is updated.
if seconds > 300 or get_update_count(myClass) < 2:
#and update updatecount
update(myClass)
else:
#reset update count
retrieveFromCache(myClass)
Im not sure how you uniquely identify myClass.
update_map = {}
def update(instance):
#do the update
update_map[instance] = update_map.get(instance,0)+1
def get_update_count(instance):
return update_map[instance] or 0

Related

Functions that stop after x time Python

I am working at a Python project, and I reached a point where I need a function to stop and return after a x time, that is passed as a parameter. A simple example:
def timedfunc(time_to_stop):
result = None
while (time_has_not_passed):
do()
return result
I explain:
When time has passed, timedfunc stops and interrupts everything in it, and jumps right to return result. So, what I need is a way to make this function work as long as possible (time_to_stop), and then to return the result variable, which is as accurate as possible (More time, more calculations, more accuracy). Of course, when time is out, also do() stops. To better understand, I say that the function is continuosly changing the value of result, and once the time has passed it returns the current value. (do() stands for all the calculations that change result)
I just made a simple example to better explain what I want:
def multiply(time):
result = 10
while time_has_not_passed:
temporary = result*10 #Actually much more time-consuming, also like 3 minutes.
temporary /= 11
result = temporary
return result
This explains what kind of calculations do() makes, and I need as many *10/11 as python can do in, for example, 0.5 sec.
I know that this pretty complicated, but any help would be great.

import time
start_time = time.time()
program()
if (time.time()-start_time)>0.5: #you can change 0.5 to any other value you want
exit()
It is something like this. you can put this if statement right inside your program function too.

maybe you can use:
time.sleep(x)
do()
# OR
now = time()
cooldown = x
if now + cooldown < time():
do()
if you want it to do something for a while
now = time()
needed_time = x
while now + needed_time > time():
do()

Efficiency when printing progress updates, print x vs if x%y==0: print x

I am running an algorithm which reads an excel document by rows, and pushes the rows to a SQL Server, using Python. I would like to print some sort of progression through the loop. I can think of two very simple options and I would like to know which is more lightweight and why.
Option A:
for x in xrange(1, sheet.nrows):
print x
cur.execute() # pushes to sql
Option B:
for x in xrange(1, sheet.nrows):
if x % some_check_progress_value == 0:
print x
cur.execute() # pushes to sql
I have a feeling that the second one would be more efficient but only for larger scale programs. Is there any way to calculate/determine this?

I'm a newbie, so I can't comment. An "answer" might be overkill, but it's all I can do for now.
My favorite thing for this is tqdm. It's minimally invasive, both code-wise and output-wise, and it gets the job done.

I am one of the developers of tqdm, a Python progress bar that tries to be as efficient as possible while providing as many automated features as possible.
The biggest performance sink we had was indeed I/O: printing to the console/file/whatever.
But if your loop is tight (more than 100 iterations/second), then it's useless to print every update, you'd just as well print just 1/10 of the updates and the user would see no difference, while your bar would be 10 times less overhead (faster).
To fix that, at first we added a mininterval parameter which updated the display only every x seconds (which is by default 0.1 seconds, the human eye cannot really see anything faster than that). Something like that:
import time
def my_bar(iterator, mininterval=0.1)
counter = 0
last_print_t = 0
for item in iterator:
if (time.time() - last_print_t) >= mininterval:
last_print_t = time.time()
print_your_bar_update(counter)
counter += 1
This will mostly fix your issue as your bar will always have a constant display overhead which will be more and more negligible as you have bigger iterators.
If you want to go further in the optimization, time.time() is also an I/O operation and thus has a cost greater than simple Python statements. To avoid that, you want to minimize the calls you do to time.time() by introducing another variable: miniters, which is the minimum number of iterations you want to skip before even checking the time:
import time
def my_bar(iterator, mininterval=0.1, miniters=10)
counter = 0
last_print_t = 0
last_print_counter = 0
for item in iterator:
if (counter - last_print_counter) >= miniters:
if (time.time() - last_print_t) >= mininterval:
last_print_t = time.time()
last_print_counter = counter
print_your_bar_update(counter)
counter += 1
You can see that miniters is similar to your Option B modulus solution, but it's better fitted as an added layer over time because time is more easily configured.
With these two parameters, you can manually finetune your progress bar to make it the most efficient possible for your loop.
However, miniters (or modulus) is tricky to get to work generally for everyone without manual finetuning, you need to make good assumptions and clever tricks to automate this finetuning. This is one of the major ongoing work we are doing on tqdm. Basically, what we do is that we try to calculate miniters to equal mininterval, so that time checking isn't even needed anymore. This automagic setting kicks in after mininterval gets triggered, something like that:
from __future__ import division
import time
def my_bar(iterator, mininterval=0.1, miniters=10, dynamic_miniters=True)
counter = 0
last_print_t = 0
last_print_counter = 0
for item in iterator:
if (counter - last_print_counter) >= miniters:
cur_time = time.time()
if (cur_time - last_print_t) >= mininterval:
if dynamic_miniters:
# Simple rule of three
delta_it = counter - last_print_counter
delta_t = cur_time - last_print_t
miniters = delta_it * mininterval / delta_t
last_print_t = cur_time
last_print_counter = counter
print_your_bar_update(counter)
counter += 1
There are various ways to compute miniters automatically, but usually you want to update it to match mininterval.
If you are interested in digging more, you can check the dynamic_miniters internal parameters, maxinterval and an experimental monitoring thread of the tqdm project.

Using the modulus check (counter % N == 0) is almost free compared print and a great solution if you run a high frequency iteration (log a lot).
Specially if you does not need to print for each iteration but want some feedback along the way.

Best way to limit incoming messages to avoid duplicates

I have a system that accepts messages that contain urls, if certain keywords are in the messages, an api call is made with the url as a parameter.
In order to conserve processing and keep my end presentation efficient.
I don't want duplicate urls being submitted within a certain time range.
so if this url ---> http://instagram.com/p/gHVMxltq_8/ comes in and it's submitted to the api
url = incoming.msg['urls']
url = urlparse(url)
if url.netloc == "instagram.com":
r = requests.get("http://api.some.url/show?url=%s"% url)
and then 3 secs later the same url comes in, I don't want it submitted to the api.
What programming method might I deploy to eliminate/limit duplicate messages from being submitted to the api based on time?
UPDATE USING TIM PETERS METHOD:
limit = DecayingSet(86400)
l = limit.add(longUrl)
if l == False:
pass
else:
r = requests.get("http://api.some.url/show?url=%s"% url)
this snippet is inside a long running process, that is accepting streaming messages via tcp.
every time I pass the same url in, l returns True every time.
But when I try it in the interpreter everything is good, it returns False when the set time hasn't expired.
Does it have to do with the fact that the script is running, while the set is being added to?
Instance issues?

Maybe overkill, but I like creating a new class for this kind of thing. You never know when requirements will get fancier ;-) For example,
from time import time
class DecayingSet:
def __init__(self, timeout): # timeout in seconds
from collections import deque
self.timeout = timeout
self.d = deque()
self.present = set()
def add(self, thing):
# Return True if `thing` not already in set,
# else return False.
result = thing not in self.present
if result:
self.present.add(thing)
self.d.append((time(), thing))
self.clean()
return result
def clean(self):
# forget stuff added >= `timeout` seconds ago
now = time()
d = self.d
while d and now - d[0][0] >= self.timeout:
_, thing = d.popleft()
self.present.remove(thing)
As written, it checks for expirations whenever an attempt is made to add a new thing. Maybe that's not what you want, but it should be a cheap check since the deque holds items in order of addition, so gets out at once if no items are expiring. Lots of possibilities.
Why a deque? Because deque.popleft() is a lot faster than list.pop(0) when the number of items becomes non-trivial.

suppose your desired interval is 1 hour, keep 2 counters that increment every hour but they are offset 30 minutes from each other. i. e. counter A goes 1, 2, 3, 4 at 11:17, 12:17, 13:17, 14:17 and counter B goes 1, 2, 3, 4 at 11:47, 12:47, 13:47, 14:47.
now if a link comes in and has either of the two counters same as an earlier link, then consider it to be duplicate.
the benefit of this scheme over explicit timestamps is that you can hash the url+counterA and url+counterB to quickly check whether the url exists
Update: You need two data stores: one, a regular database table (slow) with columns: (url, counterA, counterB) and two, a chunk of n bits of memory (fast). given a url so.com, counterA 17 and counterB 18, first hash "17,so.com" into a range 0 to n - 1 and see if the bit at that address is turned on. similarly, hash "18,so.com" and see if the bit is turned on.
If the bit is not turned on in either case you are sure it is a fresh URL within an hour, so we are done (quickly).
If the bit is turned on in either case then look up the url in the database table to check if it was that url indeed or some other URL that hashed to the same bit.
Further update: Bloom filters are an extension of this scheme.

I'd recommend keeping an in-memory cache of the most-recently-used URLs. Something like a dictionary:
urls = {}
and then for each URL:
if url in urls and (time.time() - urls[url]) < SOME_TIMEOUT:
# Don't submit the data
else:
urls[url] = time.time()
# Submit the data

Implementing an Anti Spamming thing?

I have an IRC bot that I made for automating stuff.
Here's a snippet of it:
def analyseIRCText(connection, event):
global adminList, userList, commandPat, flood
userName = extractUserName(event.source())
userCommand = event.arguments()[0]
escapedChannel = cleanUserCommand(config.channel).replace('\\.', '\\\\.')
escapedUserCommand = cleanUserCommand(event.arguments()[0])
#print userName, userCommand, escapedChannel, escapedUserCommand
if flood.has_key(userName):
flood[userName] += 1
else:
flood[userName] = 1
... (if flood[userName] > certain number do...)
So the idea is that flood thing is a dictionary where a list of users who have entered in a command to the bot in the recent... some time is kept, and how many times they've said so and so within that time period.
Here's where I run into trouble. There has to be SOMETHING that resets this dictionary so that the users can say stuff every once in awhile, no? I think that a little thing like this would do the trick.
def floodClear():
global flood
while 1:
flood = {} # Clear the list
time.sleep(4)
But what would be the best way to do this?
At the end of the program, I have a little line called:
thread.start_new_thread(floodClear,())
so that this thing doesn't get called at gets stuck in an infinite loop that halts everything else. Would this be a good solution or is there something better that I could do?

Your logic should be enough. If you have say:
if flood.has_key(userName):
flood[userName] += 1
else:
flood[userName] = 1
if flood[userName] > say 8:
return 0
That should make your bot ignore the user if he has spammed too many times within your given time period. What you have there should also work to clear up your flood dictionary.

Separating Progress Tracking and Loop Logic

Suppose i want to track the progress of a loop using the progress bar printer ProgressMeter (as described in this recipe).
def bigIteration(collection):
for element in collection:
doWork(element)
I would like to be able to switch the progress bar on and off. I also want to update it only every x steps for performance reasons. My naive way to do this is
def bigIteration(collection, progressbar=True):
if progressBar:
pm = progress.ProgressMeter(total=len(collection))
pc = 0
for element in collection:
if progressBar:
pc += 1
if pc % 100 = 0:
pm.update(pc)
doWork(element)
However, I am not satisfied. From an "aesthetic" point of view, the functional code of the loop is now "contaminated" with generic progress-tracking code.
Can you think of a way to cleanly separate progress-tracking code and functional code? (Can there be a progress-tracking decorator or something?)

It seems like this code would benefit from the null object pattern.
# a progress bar that uses ProgressMeter
class RealProgressBar:
pm = Nothing
def setMaximum(self, max):
pm = progress.ProgressMeter(total=max)
pc = 0
def progress(self):
pc += 1
if pc % 100 = 0:
pm.update(pc)
# a fake progress bar that does nothing
class NoProgressBar:
def setMaximum(self, max):
pass
def progress(self):
pass
# Iterate with a given progress bar
def bigIteration(collection, progressBar=NoProgressBar()):
progressBar.setMaximum(len(collection))
for element in collection:
progressBar.progress()
doWork(element)
bigIteration(collection, RealProgressBar())
(Pardon my French, er, Python, it's not my native language ;) Hope you get the idea, though.)
This lets you move the progress update logic from the loop, but you still have some progress related calls in there.
You can remove this part if you create a generator from the collection that automatically tracks progress as you iterate it.
# turn a collection into one that shows progress when iterated
def withProgress(collection, progressBar=NoProgressBar()):
progressBar.setMaximum(len(collection))
for element in collection:
progressBar.progress();
yield element
# simple iteration function
def bigIteration(collection):
for element in collection:
doWork(element)
# let's iterate with progress reports
bigIteration(withProgress(collection, RealProgressBar()))
This approach leaves your bigIteration function as is and is highly composable. For example, let's say you also want to add cancellation this big iteration of yours. Just create another generator that happens to be cancellable.
# highly simplified cancellation token
# probably needs synchronization
class CancellationToken:
cancelled = False
def isCancelled(self):
return cancelled
def cancel(self):
cancelled = True
# iterates a collection with cancellation support
def withCancellation(collection, cancelToken):
for element in collection:
if cancelToken.isCancelled():
break
yield element
progressCollection = withProgress(collection, RealProgressBar())
cancellableCollection = withCancellation(progressCollection, cancelToken)
bigIteration(cancellableCollection)
# meanwhile, on another thread...
cancelToken.cancel()

You could rewrite bigIteration as a generator function as follows:
def bigIteration(collection):
for element in collection:
doWork(element)
yield element
Then, you could do a great deal outside of this:
def mycollection = [1,2,3]
if progressBar:
pm = progress.ProgressMeter(total=len(collection))
pc = 0
for item in bigIteration(mycollection):
pc += 1
if pc % 100 = 0:
pm.update(pc)
else:
for item in bigIteration(mycollection):
pass

My approach would be like that:
The looping code yields the progress percentage whenever it changes (or whenever it wants to report it). The progress-tracking code then reads from the generator until it's empty; updating the progress bar after every read.
However, this also has some disadvantages:
You need a function to call it without a progress bar as you still need to read from the generator until it's empty.
You cannot easily return a value at the end. A solution would be wrapping the return value though so the progress method can determine if the function yielded a progress update or a return value. Actually, it might be nicer to wrap the progress update so the regular return value can be yielded unwrapped - but that'd require much more wrapping since it would need to be done for every progress update instead just once.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to improve a simple caching mechanism in Python? - python

Related

Functions that stop after x time Python

Efficiency when printing progress updates, print x vs if x%y==0: print x

Best way to limit incoming messages to avoid duplicates

Implementing an Anti Spamming thing?

Separating Progress Tracking and Loop Logic

Categories

Resources