Sorry if the title is misleading.
I am trying to write a program that calculates the frequency of emails sent from different email IDs. We need to trigger alerts based on the number and frequency of mails sent. For example, if more than 25 mails were sent from a particular email ID in the past 60 minutes, an alert needs to be triggered.
A different trigger applies for another directory based on another rule. The fundamental rules are all about how many mails were sent over the past 60 minutes, 180 minutes, 12 hours and 24 hours. How do we come up with a strategy to calculate the frequency and store it without too much system/CPU/database overhead?
The actual application is a legacy CRM system. We have no access to the mail server, so we cannot hook into Postfix or the MTA. Moreover, multiple domains are involved, so any suggestion to do something on the mail server may not help.
We do, of course, have access to every attempt to send a mail, and can look at recording them. My challenge is that on a large campaign the database writes would be frequent, and doing real-time number crunching on top of that would be resource intensive. I would like to avoid that and come up with an optimal solution.
The language would be Python, because the CRM is also written in it.
Try hacking something in on the client side to record each email attempt to a log file. Then you can read that file to count the frequency of emails sent.
I think you can keep the data in memory in a dict for some time, say 5 or 10 minutes, and then flush it to the DB, thus avoiding the load of frequent writes. If you also put a check in your code for a sudden surge in email from a particular domain, that might solve your problem.
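A minimal sketch of that idea (the flush interval and surge threshold are assumptions, and flush_counts_to_db stands in for whatever batched DB write you use):

import time
from collections import defaultdict

FLUSH_INTERVAL = 300       # flush to the DB every 5 minutes (assumption)
SURGE_THRESHOLD = 25       # alert when one sender exceeds this within a single interval (assumption)

counts = defaultdict(int)  # sender -> number of send attempts in the current interval
last_flush = time.time()

def record_send(sender):
    """Call this on every send attempt instead of writing to the DB each time."""
    global last_flush
    counts[sender] += 1
    if counts[sender] > SURGE_THRESHOLD:
        print("Surge alert for {}: {} mails this interval".format(sender, counts[sender]))
    if time.time() - last_flush >= FLUSH_INTERVAL:
        flush_counts_to_db(dict(counts))   # one batched DB write (hypothetical helper)
        counts.clear()
        last_flush = time.time()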
Let m be the largest threshold (maximum number of emails sent) of any of your tests. E.g., if you have a limit of 150 emails per 24 hours, then m = 150 (presumably the thresholds for shorter-period checks are all lower than this -- it wouldn't make sense to also have a limit of 200 emails per 12 hours). Your longest period is 24 hours, which means m can't be more than 25*24 = 600 -- since if it were, then some checks would be redundant with the 25-emails-per-hour limit.
Since this is quite a small number of emails per user, I'd suggest just keeping a per-user ring buffer that simply stores the dates and times of the last m <= 600 messages sent by that user. Every time the user sends a new email, check the send time of the 25th-most-recent message sent; if it's less than 1 hour ago, sound the alarm. Likewise check the other limits (e.g. if there is also a limit of 50 emails per 180 minutes, check that the 50th-most-recent message was sent more than 180 minutes ago). If all tests pass, then (over)write the next entry in the ring buffer with the current date/time. A date/time can be stored in 4 bytes, so this is at most about 2.4 KB per user, with O(1) queries and updates. (Of course you could use a DB instead, but a ring buffer is trivial to implement in-memory, and the total size needed is probably small enough if you don't have millions of users.)
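A rough sketch of that per-user ring buffer (the LIMITS values are just examples; this version records the send either way and reports which rules were breached):

import time

# Example limits as (window in seconds, max mails allowed in that window) -- these numbers are assumptions
LIMITS = [(60 * 60, 25), (180 * 60, 50), (24 * 60 * 60, 150)]
M = max(count for _, count in LIMITS)   # ring buffer size: the largest threshold, m

class SendTracker:
    """Per-user ring buffer holding the timestamps of the last M sends."""
    def __init__(self):
        self.times = [0.0] * M   # 0.0 means "no send recorded in this slot yet"
        self.pos = 0             # index of the next slot to overwrite

    def record_and_check(self, now=None):
        """Record one send and return the (window, limit) rules it breaches, if any."""
        now = time.time() if now is None else now
        breached = []
        for window_seconds, limit in LIMITS:
            # The limit-th most recent send sits `limit` slots behind the write position.
            nth_most_recent = self.times[(self.pos - limit) % M]
            if nth_most_recent and now - nth_most_recent < window_seconds:
                breached.append((window_seconds, limit))
        self.times[self.pos] = now
        self.pos = (self.pos + 1) % M
        return breached

Keeping one SendTracker per email ID (e.g. in a dict keyed by sender) gives the O(1) per-send behaviour described above.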
I'm very new to python and have very little experience.
I'm wondering if you can help me with the below:
I would like to create a script that would run in the background and, by consuming W32client and Outlook events, fire an alert when a specific number of emails land in a specific subfolder within a specific amount of time.
Let's say 5 emails in 15 minutes. Once that condition is met I would like to pause the script for the next 15 minutes (even with time.sleep).
I've been thinking and came up with a rough idea, described below:
Maybe a dict with a key (email subject or anything) and a value of received_time? Then the email count would be the len of that dict, and the condition could fire when, after looping through the items, the oldest received date is less than 15 minutes old?
Can you suggest anything else?
I hooked up some events to check how that works, but haven't come up with any working code yet.
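A minimal sketch of the sliding-window counting described above, using a deque of received times (hooking the actual Outlook events via win32com is not shown, and the numbers are the ones from the question):

import time
from collections import deque

WINDOW_SECONDS = 15 * 60    # 15-minute window
THRESHOLD = 5               # alert after 5 emails in the window
COOLDOWN_SECONDS = 15 * 60  # pause after an alert, as suggested in the question

received = deque()          # received times of recent emails, oldest first

def on_new_email(received_time=None):
    """Call this from the Outlook new-mail event handler."""
    now = time.time() if received_time is None else received_time
    received.append(now)
    # Drop anything older than the window so len(received) is the in-window count.
    while received and now - received[0] > WINDOW_SECONDS:
        received.popleft()
    if len(received) >= THRESHOLD:
        print("Alert: {} emails in the last 15 minutes".format(len(received)))
        received.clear()
        time.sleep(COOLDOWN_SECONDS)   # crude pause, per the question's suggestion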
(Just wanted to mention that this is my first question and I apologize if I did something wrong.) I am making a Python program that parses a CSV file and saves it as a list. However, the program takes user input as to how fast they want to send that data to a server. How would I regulate how fast the data is sent (i.e. 100 items/second, etc.)? I am using PyQt5 for the GUI front end and the csv module to parse the file. For testing purposes, I am sending the data to another CSV file that the Python script writes.
I have tried sleep and date and time, but since reading/writing the data is not instant, it won't be exactly x items per second. I wasn't really able to find any documentation, but I feel like date and time could still be viable, although I really don't know how since I am a beginner.
I would like the program to read the CSV file and write/send it to another file at a certain rate/second. I have only had the program write it at normal speed.
Thank you in advance.
As @KlausD says, you can do something in a thread and use a queue if you want to do processing in between sends. But you might want to just do your sending in a loop in the main thread. How you loop over the items and delay so that you're sending them at the right rate should be pretty independent of how your code will actually be structured.
Rather than worry about what the delays are in advance that will contribute to your send rate, what you want to do is adaptively delay. So you figure out how long it actually took to do the send, and then you delay for whatever the remaining time is that you want to wait before doing another send. If your primary goal is your average send rate rather than the actual delay between two sends, which I would think would be the case, then you just want to be looking at how long it's taken you to send items so far in relation to how many things you've sent. From this you can adaptively delay to pretty much exactly adjust the overall send time to what you want. Over hundreds or thousands of sends, you can guarantee a rate that is pretty much exactly what your user has asked for. Here's an example of how to do that, abstracting away any sending of data to just a print() statement and a random delay:
import time
import random
# Send this many items per second
sends_per_second = 10
# Simulate send time by introducing a random delay of at most this many seconds
max_item_delay_seconds = .06
# How many items to send
item_count = 100
# Do something representing a send, introducing a random delay
def do_one_item(i):
    time.sleep(random.random() * max_item_delay_seconds)
    print("Sent item {}".format(i))

# Record the starting time
start_time = time.time()

# For each item to send...
for i in range(item_count):
    # Send the item
    do_one_item(i)
    # Compute how much time we've spent so far
    time_spent = time.time() - start_time
    # Compute how much time we want to have spent so far based on the desired send rate
    should_time = (i + 1) / sends_per_second
    # If we're going too fast, wait just long enough to get us back on track
    if should_time > time_spent:
        print("Delaying {} seconds".format(should_time - time_spent))
        time.sleep(should_time - time_spent)

time_spent = time.time() - start_time
print("Sent {} items in {} seconds ({} items per second)".format(item_count, time_spent, item_count / time_spent))
Abbreviated output:
Sent item 0
Delaying 0.06184182167053223 seconds
Sent item 1
Delaying 0.0555738925933838 seconds
...
Sent item 98
Delaying 0.036808872222900746 seconds
Sent item 99
Delaying 0.03746294975280762 seconds
Sent 100 items in 10.000335931777954 seconds (9.999664079506683 items per second)
As you can see, despite the code introducing a random delay for each send, and the delay logic therefore having to compute delays that are all over the place, the actual send rate achieved is exactly what was asked for to 5 or so decimal places.
You can play with the numbers in this example. You should be able to convince yourself that unless each send takes too long to keep up with your requested rate, you can dial in any send rate you want with this sort of logic. You can also see that if you add so much simulated delay to represent the send time that the code can't keep up with the desired rate, there will be no delay calls and the code will just send items as fast as it can.
The Amazon API limit is apparently 1 req per second or 3600 per hour. So I implemented it like so:
while True:
    #sql stuff
    time.sleep(1)
    result = api.item_lookup(row[0], ResponseGroup='Images,ItemAttributes,Offers,OfferSummary', IdType='EAN', SearchIndex='All')
    #sql stuff
Error:
amazonproduct.errors.TooManyRequests: RequestThrottled: AWS Access Key ID: ACCESS_KEY_REDACTED. You are submitting requests too quickly. Please retry your requests at a slower rate.
Any ideas why?
This code looks correct, and it looks like the 1 request/second limit is still in effect:
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/TroubleshootingApplications.html#efficiency-guidelines
You want to make sure that no other process is using the same associate account. Depending on where and how you run the code, there may be an old version of the VM, or another instance of your application running, or maybe there is a version on the cloud and other one on your laptop, or if you are using a threaded web server, there may be multiple threads all running the same code.
If you still hit the query limit, you just want to retry, possibly with TCP-like "additive increase/multiplicative decrease" back-off. You start by setting extra_delay = 0. When a request fails, you set extra_delay += 1 and sleep(1 + extra_delay), then retry. When it finally succeeds, set extra_delay = extra_delay * 0.9.
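A minimal sketch of that back-off wrapped around the lookup call from the question (the exception path comes from the traceback above; everything else is illustrative):

import time
from amazonproduct.errors import TooManyRequests

extra_delay = 0.0

def lookup_with_backoff(api, ean):
    """Additive-increase/multiplicative-decrease back-off around item_lookup."""
    global extra_delay
    while True:
        time.sleep(1 + extra_delay)    # base 1-second spacing plus the current penalty
        try:
            result = api.item_lookup(ean,
                                     ResponseGroup='Images,ItemAttributes,Offers,OfferSummary',
                                     IdType='EAN', SearchIndex='All')
        except TooManyRequests:
            extra_delay += 1           # additive increase on failure
            continue
        extra_delay *= 0.9             # multiplicative decrease on success
        return result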
Computer time is funny
This post is correct in saying "it varies in a non-deterministic manner" (https://stackoverflow.com/a/1133888/5044893). Depending on a whole host of factors, the time measured by a processor can be quite unreliable.
This is compounded by the fact that Amazon's API has a different clock than your program does. They are certainly not in sync, and there's likely some overlap between their "1 second" time measurement and your program's. Amazon likely tries to average out this inconsistency, and they probably also allow a small bit of error, maybe +/- 5%. Even so, the discrepancy between your clock and theirs is probably what triggers the RequestThrottled error.
Give yourself some buffer
Here are some thoughts to consider.
Do you really need to hit the Amazon API every single second? Would your program work with a 5-second interval? Even a 2-second interval halves your request rate and makes a lockout far less likely. Also, Amazon may be charging you for every service call, so spacing them out could save you money.
This is really a question of "optimization" now. If you use a constant variable to control your API call rate (say, SLEEP = 2), then you can adjust that rate easily. Fiddle with it, increase and decrease it, and see how your program performs.
Push, not pull
Sometimes, hitting an API every second means that you're polling for new data. Polling is notoriously wasteful, which is why the Amazon API has a rate limit.
Instead, could you switch to a queue-based approach? Amazon SQS can fire off events to your programs. This is especially easy if you host them with Amazon Lambda.
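For reference, an SQS-triggered Lambda handler is only a few lines; this is just a sketch of the pattern, not something from the original setup:

# Minimal AWS Lambda handler invoked by SQS events (sketch).
def handler(event, context):
    for record in event.get('Records', []):
        body = record['body']          # the message that was sent to the queue
        print('Processing event: {}'.format(body))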
I have my employees stored in App Engine ndb and I'm running a cron job via the task queue to generate a list of dictionaries containing the email address of each employee. The resulting list looks something like this:
[{"text":"john#mycompany.com"},{"text":"mary#mycompany.com"},{"text":"paul#mycompany.com"}]
The list is used as source data for various Angular components such as ngTags, ngAutocomplete, etc. I want to store the list in memcache so the Angular http calls will run faster.
The problem I'm having is that the values stored in memcache never last for more than a few minutes, even though I've set them to last 26 hours. I'm aware that the actual value stored cannot be over 1 MB, so as an experiment I hardcoded the list of employees to contain only three values, and the problem still persists.
The App Engine console is telling me the job ran successfully, and if I run the job manually it will load the values into memcache, but they'll only stay there for a few minutes. I've done this many times before with a far greater amount of data, so I can't understand what's going wrong. I have billing enabled and I'm not over quota.
Here is an example of the function used to load the data into memcache:
def update_employee_list():
    try:
        # Get all 3000+ employees and generate a list of dictionaries
        fresh_emp_list = [{"text":"john#mycompany.com"},{"text":"mary#mycompany.com"},{"text":"paul#mycompany.com"}]
        the_cache_key = 'my_emp_list'
        emp_data = memcache.get(the_cache_key)
        # Kill the memcache packet so we can rebuild it.
        if emp_data is not None:
            memcache.delete(the_cache_key)
        # Rebuild the memcache packet
        memcache.add(the_cache_key, fresh_emp_list, 93600) # this should last for 26 hours
    except Exception as e:
        logging.info('ERROR!!!...A failure occured while trying to setup the memcache packet: %s'%e.message)
        raise deferred.PermanentTaskFailure()
Here is an example of the function the angular components use to get the data from memcache:
@route
def get_emails(self):
    self.meta.change_view('json')
    emp_emails = memcache.get('my_emp_list')
    if emp_emails is not None:
        self.context['data'] = emp_emails
    else:
        self.context['data'] = []
Here is an example of the cron setting in cron.yaml:
- url: /cron/lookups/update_employee_list
  description: Daily rebuild of Employee Data
  schedule: every day 06:00
  timezone: America/New_York
Why can't appengine memcache hold on to a list of three dictionaries for more than a few minutes?
Any ideas are appreciated. Thanks
Unless you are using dedicated memcache (a paid service), cached values can and will be evicted at any time.
The lifetime you specify tells memcache when your value becomes invalid and may therefore be removed; it does not guarantee that the value will stay in memcache that long. It is just a cap on a cache value's maximum lifetime.
Note: the more you put in memcache, the more likely it is that other values will get dropped. Therefore you should carefully consider what data you put in your cache. You should definitely not put every value you come across into memcache.
On a side note: in the projects I recently worked on, we saw a de facto maximum cache lifetime of about a day. No cache value ever lasted longer than that, even if the desired lifetime was much higher. Interestingly enough, the cache got cleared out at about the same time every day, even for very new values.
Thus: never rely on memcache. Always use persistent storage, and use memcache for performance boosts under high-volume traffic.
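In practice that means treating memcache as a read-through cache that can always be rebuilt from persistent storage on a miss, roughly like this (the Employee query is a placeholder for however the list is actually built):

from google.appengine.api import memcache

EMP_CACHE_KEY = 'my_emp_list'

def get_employee_emails():
    """Return the employee list, rebuilding it from the datastore on a cache miss."""
    emails = memcache.get(EMP_CACHE_KEY)
    if emails is None:
        # Cache miss (expired or evicted): rebuild from persistent storage.
        emails = [{"text": e.email} for e in Employee.query()]   # placeholder ndb query
        memcache.set(EMP_CACHE_KEY, emails, time=93600)          # best effort; may still be evicted
    return emails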
I've been playing around with Python's imaplib and email module recently. I tried sending and receiving large emails (with most of the data in the body of the email rather than attachments) using the imaplib/email modules.
However, I've noticed a problem when I download large emails (greater than 8 MB or so) from the email server and parse them with the email.message_from_string() method. That method takes a really long time (an average of around 300-310 seconds for a 16 MB email). Note: sending such a large email doesn't take too much time, about 40 seconds. Again, all the data is in the body of the email -- not in the attachments. If I download the same email with all the data as attachments, the entire operation finishes in 30-40 seconds.
This is what I'm doing:
buf = []
t, d = mailacct.search(None, 'SUBJECT', subj)
for num in d:
    t, msg = mailacct.fetch(num, '(RFC822)')
    for resp in msg:
        if isinstance(resp, tuple):
            buf.append(email.message_from_string(resp[1]))
I've timed each part of the code separately. mailacct.search and mailacct.fetch both finish in about 30-40 seconds for a 16 MB email. The line with email.message_from_string(resp[1]) takes around 280-300 seconds.
I'm a python noob. So am I doing something really inefficient in the above code? Or does the problem lie with the email.message_from_string() method, perhaps an inefficient implementation? Or could it be that email bodies were never meant to contain large amounts of data, and hence the poor performance?
* EDIT *:
Additional info: I used imaplib.IMAP4_SSL for creating IMAP connections. I used imaplib.append() to upload messages to the email account first. I used randomly generated binary data for the payload.
Okay, I did some digging on my own by examining the source code of the email module. The parsing function (parse()) in email/parser.py is the function that actually processes the message when email.message_from_string() is called. It parses the string in blocks of 8192 bytes, which is why it takes so long for large messages. I changed the code so that it reads and processes the whole string at once, and there was a tremendous improvement in the time taken to process the large email message.
I'm assuming it was originally written to process strings in 8192-byte blocks to handle really, really large strings? Is there a better way to do this than changing the email module's source code?
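One way to get the same effect without patching the module is to use the public FeedParser API and feed it the whole string in a single call (a sketch, assuming the raw message text is already in memory):

from email.feedparser import FeedParser

def parse_whole_message(raw_text):
    """Parse an email from a string, feeding the parser in one shot instead of 8192-byte blocks."""
    parser = FeedParser()
    parser.feed(raw_text)   # one big feed() call
    return parser.close()   # returns an email.message.Message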