I am looking to reset a counter every day using Redis. I am new to Redis, so I want to make sure I have correctly understood how transactions and pipelines work.
Does the following code guarantee that I will always get a unique (date, number) couple in a multi-process environment, or do I need to use a Redis lock?
import datetime
import redis

r = redis.Redis(...)

def get_utc_date_now():
    return datetime.datetime.utcnow().date()

def get_daily_counter(r, dt_key='dt', counter_key='counter'):
    def incr_daily_number(pipe):
        dt_now = get_utc_date_now().isoformat()  # e.g.: "2014-10-18"
        dt = pipe.get(dt_key)
        pipe.multi()
        if dt != dt_now:
            pipe.set(dt_key, dt_now)
            pipe.set(counter_key, 0)
        pipe.get(dt_key)
        pipe.incr(counter_key)
    result = r.transaction(incr_daily_number, dt_key)
    return result[-2:]
# Get the (dt, number) couple
# 2014-10-18, 1
# 2014-10-18, 2
# etc.
dt, number = get_daily_counter(r)
UPDATE
Try with a Lua script:
r = redis.Redis(...)

incr_with_reset_on_change_lua_script = """
local dt = redis.call('GET', KEYS[2])
if dt ~= ARGV[2] then
    redis.call('MSET', KEYS[1], ARGV[1], KEYS[2], ARGV[2])
end
return redis.call('INCR', KEYS[1])
"""

# Incr KEYS[1], but reset it first if KEYS[2] has changed.
incr_with_reset_on_change = r.register_script(incr_with_reset_on_change_lua_script)

counter_key = 'dcounterA'
watch_key = 'dcounterA_dt'
watch_value = get_utc_date_now().isoformat()
reset_value = 0  # value the counter is reset to when the date rolls over
number = incr_with_reset_on_change(keys=[counter_key, watch_key], args=[reset_value, watch_value])
Consider two concurrent transactions occurring at midnight. Both can execute get(dt_key), but one will execute its MULTI/EXEC block first: it will reset the counter, set the new date, and increment the counter. The second one will also enter its MULTI/EXEC block, but because the value of 'dt' (the key WATCHed by r.transaction) has changed, the execution will fail and incr_daily_number will be called again. This time get(dt_key) will return the new date, so when the MULTI/EXEC block is executed, the counter will be incremented without any reset. The two transactions will return the new date with different counter values.
So I believe there is no race condition here, and the (date, number) couples will be unique.
You could also have implemented this using a server-side Lua script (whose execution is always atomic). It is usually more convenient.
Note that, actually, there is no such thing as a Redis lock. The locking mechanism available in the API is provided by the Python client, not by the Redis server. If you look at its implementation, you will realize it is also based on SETNX, WATCH/MULTI/EXEC blocks, or Lua scripting.
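For completeness, if you did want that client-side locking anyway, redis-py exposes it directly as a context manager. A minimal sketch; the lock key name and timeout are illustrative assumptions, not from the original post:

import redis

r = redis.Redis()

# Acquire a client-implemented lock around the read-modify-write.
with r.lock('daily-counter-lock', timeout=5):
    # critical section: read the date key, reset and increment the counter, etc.
    ...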
I want to loop over tasks, again and again, until reaching a certain condition before continuing the rest of the workflow.
What I have so far is this:
# Loop task
import prefect
from prefect import Task
from prefect.engine.signals import LOOP

class MyLoop(Task):
    def run(self):
        loop_res = prefect.context.get("task_loop_result", 1)
        print(loop_res)
        if loop_res >= 10:
            return loop_res
        raise LOOP(result=loop_res + 1)
But as far as I understand, this does not work for multiple tasks.
Is there a way to go back further and loop over several tasks at a time?
The solution is simply to create a single task that itself creates a new flow with one or more parameters and calls flow.run(). For example:
# Prefect 1.x imports (the exact module paths are an assumption about the version in use)
import prefect
from prefect import Flow, Parameter, Task
from prefect.engine.signals import LOOP
from prefect.executors import LocalDaskExecutor

class MultipleTaskLoop(Task):
    def run(self):
        # Get previous value
        loop_res = prefect.context.get("task_loop_result", 1)

        # Create subflow
        with Flow('Subflow', executor=LocalDaskExecutor()) as flow:
            x = Parameter('x', default=1)
            loop1 = print_loop()
            add = add_value(x)
            loop2 = print_loop()

            loop1.set_downstream(add)
            add.set_downstream(loop2)

        # Run subflow and extract result
        subflow_res = flow.run(parameters={'x': loop_res})
        new_res = subflow_res.result[add]._result.value

        # Loop
        if new_res >= 10:
            return new_res
        raise LOOP(result=new_res)
where print_loop simply prints "loop" in the output and add_value adds one to the value it receives.
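The two helpers are not shown in the original answer; a minimal sketch of what they might look like with Prefect's task decorator, based only on the description above:

from prefect import task

@task
def print_loop():
    # assumed body: just prints "loop"
    print("loop")

@task
def add_value(x):
    # adds one to the value it receives
    return x + 1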
Unless I'm missing something, the answer is no.
Prefect flows are DAGs, and what you are describing (looping over multiple tasks in order again and again until some condition is met) would make a cycle, so you can't do it.
This may or may not be helpful, but you could try combining all of the tasks you want to loop over into one task, and loop within that task until your exit condition has been met.
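A minimal sketch of that workaround, assuming the same three steps as the subflow above; the step bodies are illustrative:

import prefect
from prefect import Task
from prefect.engine.signals import LOOP

class CombinedLoop(Task):
    def run(self):
        value = prefect.context.get("task_loop_result", 1)
        # run the "several tasks" as plain steps inside one task
        print("loop")        # step 1
        value = value + 1    # step 2
        print("loop")        # step 3
        if value >= 10:
            return value
        raise LOOP(result=value)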
This code simulates loading a CSV, parsing it, and loading it into a pandas DataFrame. I would like to parallelize the work so that it runs faster, but my pool.map implementation is actually slower than the serial implementation.
The CSV is read as one big string, first split into lines and then split into values. It is an irregularly formatted CSV with recurring headers, so I cannot use pandas read_csv. At least, not that I know how.
My idea was to simply read the file in as a string, split the long string into four parts (one for each core), and then process each chunk separately in parallel. This, it turns out, is slower than the serial version.
from multiprocessing import Pool
import datetime
import pandas as pd

def data_proc(raw):
    pre_df_list = list()
    for item in (i for i in raw.split('\n') if i and not i.startswith(',')):
        if ' ' in item and ',' in item:
            key, freq, date_observation = item.split(' ')
            date, observation = date_observation.split(',')
            pre_df_list.append([key, freq, date, observation])
    return pre_df_list

if __name__ == '__main__':
    raw = '\n'.join([f'KEY FREQ DATE,{i}' for i in range(15059071)])  # instead of loading csv

    start = datetime.datetime.now()
    pre_df_list = data_proc(raw)
    df = pd.DataFrame(pre_df_list, columns=['KEY', 'FREQ', 'DATE', 'VAL'])
    end = datetime.datetime.now()
    print(end - start)

    pool = Pool(processes=4)
    start = datetime.datetime.now()
    len(raw.split('\n'))
    number_of_tasks = 4
    chunk_size = int((len(raw) / number_of_tasks))
    beginning = 0
    multi_list = list()
    for i in range(1, number_of_tasks + 1):
        multi_list.append(raw[beginning:chunk_size * i])
        beginning = chunk_size * i

    results = pool.imap(data_proc, multi_list)
    # d = results[0]
    pool.close()
    pool.join()
    # I haven't finished conversion to dataframe since the previous part is not working yet
    # df = pd.DataFrame(d, columns=['SERIES_KEY','Frequency','OBS_DATE','val'])
    end = datetime.datetime.now()
    print(end - start)
EDIT: the serial version finishes in 34 seconds and the parallel version in 53 seconds on my laptop. When I started working on this, my initial assumption was that I would be able to get it down to around 10 seconds on a 4-core machine.
It looks like the parallel version I posted never finishes. I changed the pool.map call to pool.imap and now it works again. Note that it has to be run from the command line, not Spyder.
In General:
Multiprocessing isn't always the best way to do things. It takes overhead to create and manage the new processes and to synchronize their output. In a relatively simple case, like parsing roughly 15 million lines of small text, you may or may not get a big time saving from using multiprocessing.
There are a lot of other confounding variables - process load on the machine, the number of processors, I/O access being spread across the processes (not a problem in your specific case), the potential to fill up memory and then deal with page swaps... The list can keep growing. There are times when multiprocessing is ideal, and times when it makes matters worse. (In my production code, I have one place where I left a comment: "Using multiprocessing here took 3x longer than not. Just using regular map...")
Your specific case
However, without knowing your exact system specifications, I think you should see a performance improvement from properly done multiprocessing. You may not - this task may be small enough that it's not worth the overhead. Still, there are several issues with your code that will make your multiprocessing path take longer. I'll call out the ones that caught my attention.
len(raw.split('\n'))
This line is very expensive, and accomplishes nothing. It goes through every line of your raw data, splits it, takes the length of the result, then throws out the split data and the len. You likely want to do something like:
splitted = raw.split('\n')
splitted_len = len(splitted) # but I'm not sure where you need this.
This would save the split data, so you could make use of it later, in your for loop. As it is right now, your for loop operates on raw, which has not been split. So instead of, e.g., running on [first_part, second_part, third_part, fourth_part], you're running on [all_of_it, all_of_it, all_of_it, all_of_it]. This, of course, is a HUGE part of your performance degradation - you're doing the same work x4!
I expect, if you take care of splitting on \n outside of your processing, that's all you'll need to get an improvement from multiprocessing. (Note, you actually don't need any special processing for 'serial' vs. 'parallel' - you can test it decently by using map instead of pool.map.)
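As a quick illustration of that note, one way to A/B the two paths with the same worker - assuming data_proc has been modified to accept a pre-split list of lines, as in the rewrite below, and chunks is an assumed name for that list split into four pieces:

from multiprocessing import Pool

# Same worker function, two execution strategies.
serial_results = list(map(data_proc, chunks))

with Pool(processes=4) as pool:
    parallel_results = pool.map(data_proc, chunks)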
Here's my take at re-doing your code. It moves the line splitting out of the data_proc function, so you can focus on whether splitting the array into 4 chunks gives you any improvement. (Other than that, it makes each task into a well-defined function - that's just style, to help clarify what's testing where.)
from multiprocessing import Pool
import datetime
import pandas as pd

def serial(raw):
    pre_df_list = data_proc(raw)
    return pre_df_list

def parallel(raw):
    pool = Pool(processes=4)
    number_of_tasks = 4
    chunk_size = int(len(raw) / number_of_tasks)
    beginning = 0
    multi_list = list()
    for i in range(1, number_of_tasks + 1):
        multi_list.append(raw[beginning:chunk_size * i])
        beginning = chunk_size * i

    results = pool.map(data_proc, multi_list)
    pool.close()
    pool.join()

    pre_df_list = []
    for r in results:
        # each worker returns a list of rows, so extend (not append) keeps the result flat
        pre_df_list.extend(r)
    return pre_df_list

def data_proc(raw):
    # assume raw is pre-split (a list of lines) by the time you're here
    pre_df_list = list()
    for item in (i for i in raw if i and not i.startswith(',')):
        if ' ' in item and ',' in item:
            key, freq, date_observation = item.split(' ')
            date, observation = date_observation.split(',')
            pre_df_list.append([key, freq, date, observation])
    return pre_df_list

if __name__ == '__main__':
    # don't bother with the join, since it would need to be split again in either case
    raw = [f'KEY FREQ DATE,{i}' for i in range(15059071)]  # instead of loading csv

    start = datetime.datetime.now()
    pre_df_list = serial(raw)
    end = datetime.datetime.now()
    print("serial time: {}".format(end - start))

    start = datetime.datetime.now()
    pre_df_list = parallel(raw)
    end = datetime.datetime.now()
    print("parallel time: {}".format(end - start))

    # make the dataframe. This would happen in either case
    df = pd.DataFrame(pre_df_list, columns=['KEY', 'FREQ', 'DATE', 'VAL'])
I have a problem in which I process documents from files using Python generators. The number of files I need to process is not known in advance. Each file contains records that consume a considerable amount of memory; because of that, generators are used to process the records. Here is a summary of the code I am working on:
def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs = read_records(fd)
        recs_p = (process_records(r) for r in recs)
        write_records(recs_p)
My process_records function checks the content of each record and only returns the records that have a specific sender. My problem is the following: I want to have a count of the number of elements being returned by read_records. I have been keeping track of the number of records in the process_records function using a list:
def process_records(r):
    if r.sender('sender_of_interest'):
        records_list.append(1)
    else:
        records_list.append(0)
    ...
The problem with this approach is that records_list could grow without bound depending on the input. I want to be able to consume the contents of records_list once it grows to a certain point and then restart the process. For example, after 20 records have been processed, I want to find out how many records are from 'sender_of_interest' and how many are from other sources, and then empty the list. Can I do this without using a lock?
You could make your generator a class with an attribute that contains a count of the number of records it has processed. Something like this:
class RecordProcessor(object):
    def __init__(self, recs):
        self.recs = recs
        self.processed_rec_count = 0

    def __iter__(self):  # make the instance itself iterable
        for r in self.recs:
            if r.sender('sender_of_interest'):
                self.processed_rec_count += 1
                # process record r...
                yield r  # processed record

def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs_p = RecordProcessor(read_records(fd))
        write_records(recs_p)
        print('records processed:', recs_p.processed_rec_count)
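If you also want the periodic tally the question asks about (e.g. every 20 records), one hedged option is to poll the counter from the consuming side; write_one here is a hypothetical stand-in for however a single record actually gets written:

def write_records_with_report(recs_p, every=20):
    for n, rec in enumerate(recs_p, 1):
        write_one(rec)  # hypothetical per-record write
        if n % every == 0:
            print('matched so far:', recs_p.processed_rec_count)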
Here's the straightforward approach. Is there some reason why something this simple won't work for you?
seen = 0
matched = 0

def process_records(r):
    global seen, matched  # module-level counters, updated in place
    seen = seen + 1
    if r.sender('sender_of_interest'):
        matched = matched + 1
        records_list.append(1)
    else:
        records_list.append(0)

    if seen > 1000 or someOtherTimeBasedCriteria:
        print("%d of %d total records had the sender of interest" % (matched, seen))
        seen = 0
        matched = 0
If you have the ability to close your stream of messages and re-open them, you might want one more running total of records seen, so that if you had to close the stream and re-open it later, you could go back to the last record you processed and pick up from there.
In this code, someOtherTimeBasedCriteria might be a timestamp check: record the current time in milliseconds when you begin processing, and if the current time is more than 20,000 ms (20 seconds) later, reset the seen/matched counters.
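A minimal sketch of that time-based criterion; the function name is illustrative, and the thresholds are the values mentioned above:

import time

window_start = time.monotonic()  # set when you begin processing

def should_report(seen, max_records=1000, window_seconds=20):
    # report and reset either after enough records or once the time window elapses
    return seen > max_records or (time.monotonic() - window_start) > window_seconds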
I have a system that accepts messages that contain URLs. If certain keywords are in a message, an API call is made with the URL as a parameter.
In order to conserve processing and keep my end presentation efficient, I don't want duplicate URLs being submitted within a certain time range.
So if this URL ---> http://instagram.com/p/gHVMxltq_8/ comes in and it's submitted to the API:
url = incoming.msg['urls']
url = urlparse(url)
if url.netloc == "instagram.com":
    r = requests.get("http://api.some.url/show?url=%s" % url)
and then 3 seconds later the same URL comes in, I don't want it submitted to the API.
What programming method might I deploy to eliminate/limit duplicate messages from being submitted to the API based on time?
UPDATE USING TIM PETERS' METHOD:
limit = DecayingSet(86400)
l = limit.add(longUrl)
if l == False:
    pass
else:
    r = requests.get("http://api.some.url/show?url=%s" % url)
This snippet is inside a long-running process that accepts streaming messages via TCP.
Every time I pass the same URL in, l returns True.
But when I try it in the interpreter everything is good: it returns False when the set's timeout hasn't expired.
Does it have to do with the fact that the script is running while the set is being added to?
Instance issues?
Maybe overkill, but I like creating a new class for this kind of thing. You never know when requirements will get fancier ;-) For example,
from time import time
from collections import deque

class DecayingSet:
    def __init__(self, timeout):  # timeout in seconds
        self.timeout = timeout
        self.d = deque()
        self.present = set()

    def add(self, thing):
        # Return True if `thing` not already in set,
        # else return False.
        result = thing not in self.present
        if result:
            self.present.add(thing)
            self.d.append((time(), thing))
        self.clean()
        return result

    def clean(self):
        # forget stuff added >= `timeout` seconds ago
        now = time()
        d = self.d
        while d and now - d[0][0] >= self.timeout:
            _, thing = d.popleft()
            self.present.remove(thing)
As written, it checks for expirations whenever an attempt is made to add a new thing. Maybe that's not what you want, but it should be a cheap check since the deque holds items in order of addition, so gets out at once if no items are expiring. Lots of possibilities.
Why a deque? Because deque.popleft() is a lot faster than list.pop(0) when the number of items becomes non-trivial.
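A minimal usage sketch, under one plausible reading of the "instance issue" in the question's update: if a fresh DecayingSet is built for every incoming message, add() always sees an empty set and returns True. The set has to be created once and shared by the handler (handle is an illustrative name):

import requests

# created once, at process start, and reused for every incoming message
limit = DecayingSet(86400)

def handle(long_url):
    if limit.add(long_url):
        # first sighting within the window: submit to the API
        requests.get("http://api.some.url/show?url=%s" % long_url)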
Suppose your desired interval is 1 hour. Keep two counters that increment every hour but are offset 30 minutes from each other, i.e. counter A goes 1, 2, 3, 4 at 11:17, 12:17, 13:17, 14:17 and counter B goes 1, 2, 3, 4 at 11:47, 12:47, 13:47, 14:47.
Now if a link comes in and either of its two counters is the same as an earlier link's, consider it a duplicate.
The benefit of this scheme over explicit timestamps is that you can hash url+counterA and url+counterB to quickly check whether the URL exists.
Update: You need two data stores: one, a regular database table (slow) with columns (url, counterA, counterB); and two, a chunk of n bits of memory (fast). Given a URL so.com, counterA 17 and counterB 18, first hash "17,so.com" into the range 0 to n - 1 and see if the bit at that address is turned on. Similarly, hash "18,so.com" and see if that bit is turned on.
If the bit is not turned on in either case, you can be sure it is a fresh URL within the hour, so we are done (quickly).
If the bit is turned on in either case, then look up the URL in the database table to check whether it was that URL indeed or some other URL that hashed to the same bit.
Further update: Bloom filters are an extension of this scheme.
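A rough sketch of that scheme in Python; the bit-array size, hash choice, and counter period are illustrative assumptions, and the slow database lookup/insert is left as comments:

import hashlib
import time

N_BITS = 1 << 20                        # size of the fast in-memory bit array
bits = bytearray(N_BITS // 8)

def counters(now=None):
    now = time.time() if now is None else now
    counter_a = int(now // 3600)               # ticks over every hour
    counter_b = int((now + 1800) // 3600)      # same period, offset 30 minutes
    return counter_a, counter_b

def bit_index(key):
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], 'big') % N_BITS

def seen_recently(url):
    a, b = counters()
    hit = False
    for key in (f"{a},{url}", f"{b},{url}"):
        idx = bit_index(key)
        if bits[idx // 8] & (1 << (idx % 8)):
            # the fast check only says "maybe"; confirm against the slow table here
            hit = True
        bits[idx // 8] |= 1 << (idx % 8)
    # on a miss you would also insert (url, a, b) into the database table
    return hit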
I'd recommend keeping an in-memory cache of the most-recently-used URLs. Something like a dictionary:
urls = {}
and then for each URL:
if url in urls and (time.time() - urls[url]) < SOME_TIMEOUT:
    pass  # Don't submit the data
else:
    urls[url] = time.time()
    # Submit the data
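One thing the snippet above leaves out is eviction; a hedged sketch of pruning stale entries so the dictionary does not grow without bound (the function name is illustrative):

import time

def prune(urls, timeout):
    # drop entries older than `timeout` seconds; call this periodically
    cutoff = time.time() - timeout
    for u in [u for u, t in urls.items() if t < cutoff]:
        del urls[u]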
I just registered so I could ask this question.
Right now I have this code that prevents a class from updating more than once every five minutes:
from datetime import datetime

now = datetime.now()
delta = now - myClass.last_updated_date
seconds = delta.seconds
if seconds > 300:
    update(myClass)
else:
    retrieveFromCache(myClass)
I'd like to modify it by allowing myClass to update twice per 5 minutes, instead of just once.
I was thinking of creating a list to store the last two times myClass was updated, and comparing against those in the if statement, but I fear my code will get convoluted and harder to read if I go that route.
Is there a simpler way to do this?
You could do it with a simple counter. The idea is that get_update_count tracks how often the class has been updated.
if seconds > 300 or get_update_count(myClass) < 2:
    # and update the update count
    update(myClass)
else:
    # reset the update count
    retrieveFromCache(myClass)
I'm not sure how you uniquely identify myClass.
update_map = {}

def update(instance):
    # do the update ...
    update_map[instance] = update_map.get(instance, 0) + 1

def get_update_count(instance):
    # .get avoids a KeyError for instances that have never been updated
    return update_map.get(instance, 0)
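A hedged sketch of how these pieces might be wired back into the original check, assuming the counter should reset whenever a full 5-minute window has passed; the reset rule is an interpretation of the comments above, not spelled out in the answer:

from datetime import datetime

def maybe_update(myClass):
    seconds = (datetime.now() - myClass.last_updated_date).seconds
    if seconds > 300:
        update_map[myClass] = 0   # a full window has passed: allow two fresh updates
    if update_map.get(myClass, 0) < 2:
        # update() bumps the count; it is assumed to also refresh last_updated_date
        update(myClass)
    else:
        retrieveFromCache(myClass)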