MySQL SQLAlchemy Python: Getting Max Count for Timestamp

I have data recorded for several timestamps ... I want to get the maximum count across all timestamps.
This is my code:
for timestamp in timestamps:
    count = db.query(models.Appointment.id).filter(models.Appointment.place == place) \
        .filter(models.Appointment.date == date) \
        .filter(models.Appointment.timestamp == timestamp).count()
    data.append(count)
return max(data)
Sadly, this takes about 1.5 seconds per timestamp, i.e. len(timestamps) * 1.5 seconds in total, to calculate the requested value.
Is there a way (a single query, perhaps) to handle this in around 3-10 seconds?
Regards,
Martin

If using MySQL 8 or later, you could give the following a go:
return db.query(func.max(func.count()).over()).\
    filter(models.Appointment.place == place).\
    filter(models.Appointment.date == date).\
    filter(models.Appointment.timestamp.in_(timestamps)).\
    group_by(models.Appointment.timestamp).\
    limit(1).\
    scalar()
This uses the (slightly non-obvious) fact that window functions are evaluated after group rows are formed, and that without a partition and ordering the window spans all of the group rows.
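If you want to sanity-check what that evaluates to, build the query first and print it before fetching the result; str() on a Query renders the SELECT it will emit. A minimal sketch (the exact SQL varies by SQLAlchemy version and dialect):
# Same query as above, assigned to a variable so it can be inspected.
query = db.query(func.max(func.count()).over()).\
    filter(models.Appointment.place == place).\
    filter(models.Appointment.date == date).\
    filter(models.Appointment.timestamp.in_(timestamps)).\
    group_by(models.Appointment.timestamp).\
    limit(1)
print(query)  # roughly: SELECT max(count(*)) OVER () ... GROUP BY appointment.timestamp LIMIT 1
return query.scalar()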
If using a version of MySQL that does not yet support window functions, use a subquery instead:
counts = db.query(func.count().label('count')).\
    filter(models.Appointment.place == place).\
    filter(models.Appointment.date == date).\
    filter(models.Appointment.timestamp.in_(timestamps)).\
    group_by(models.Appointment.timestamp).\
    subquery()

return db.query(func.max(counts.c.count)).scalar()
The difference between these and the original approach is that both make only a single round trip to the database. That is usually desirable, but may require thinking a bit differently about the problem, since SQL is a (more or less) declarative language – you mostly describe the answer you want, not how to produce it✝.
✝ "I want coffee" vs. "Start by pouring some water in the..."

Related

Speed up python w/ sqlalchemy function

I have a function that populates a database table using python and sqlalchemy. The function is running fairly slowly right now, taking around 17 minutes. I think the main problem is I am looping through two large sets of data to build the new table. I have included the record count in the code below.
How can I speed this up? Should I try to convert the nested for loop into one big sqlalchemy query? I profiled this function with pycharm but am not sure I fully understand the results.
def populate(self):
    """Core function to populate positions."""
    # get raw annotations with tag Org
    # returns 11,659 records
    organizations = model.session.query(model.Annotation) \
        .filter(model.Annotation.tag == 'Org') \
        .filter(model.Annotation.organization_id.isnot(None)).all()
    # get raw annotations with tags Support or Oppose
    # returns 2,947 records
    annotations = model.session.query(model.Annotation) \
        .filter((model.Annotation.tag == 'Support') | (model.Annotation.tag == 'Oppose')).all()
    for org in organizations:
        for anno in annotations:
            # Org overlaps with Support or Oppose tag
            # start and end columns are integers
            if org.start >= anno.start and org.end <= anno.end:
                position = model.Position()
                # set to de-duplicated organization
                position.organization_id = org.organization_id
                position.disposition = anno.tag
                # look up bill_id from document_bill table
                document = model.session.query(model.document_bill) \
                    .filter_by(document_id=anno.document_id).first()
                position.bill_id = document.bill_id
                position.document_id = anno.document_id
                model.session.add(position)
                logging.info('org: {}, disposition: {}, bill: {}'.format(
                    position.organization_id, position.disposition, position.bill_id))
                continue
    logging.info('committing to database')
    model.session.commit()
My bets, in order of descending probability:
Autocommit is ON, so you're waiting for disk.
The query inside the loop "document = model.session.query(model.document_bill)...." is slow (use EXPLAIN ANALYZE).
Most of the time is actually spent printing logs to the terminal in the inner loop (you should profile to confirm).
model.session.add(position) is slow (no idea what that does)
(and this one should really be first) Could a single SQL query like INSERT INTO ... SELECT do this in a couple tens of milliseconds? If so, why make a loop in the application?...
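To illustrate that last bet, here is a rough sketch of the INSERT INTO ... SELECT idea. The table and column names (annotation, position, document_bill) are assumptions inferred from the ORM models in the question, and start/end may need quoting in some dialects:
from sqlalchemy import text

# Hypothetical single-statement replacement for the nested loop; table and
# column names are guesses based on the models used above.
insert_positions = text("""
    INSERT INTO position (organization_id, disposition, bill_id, document_id)
    SELECT org.organization_id, anno.tag, doc.bill_id, anno.document_id
    FROM annotation AS org
    JOIN annotation AS anno
      ON org.start >= anno.start AND org.end <= anno.end
    JOIN document_bill AS doc
      ON doc.document_id = anno.document_id
    WHERE org.tag = 'Org'
      AND org.organization_id IS NOT NULL
      AND anno.tag IN ('Support', 'Oppose')
""")
model.session.execute(insert_positions)
model.session.commit()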

Efficiently writing string comparison functions in Pandas

Let's say I work for a company that hands out different types of loans. We get our loan information from a big data mart, from which I need to calculate some additional things, e.g. whether someone is in arrears or not. Right now, for clarity's sake, I have done this with a rather dumb function that iterates over all rows (where all information about a loan is stored) using pd.DataFrame.apply(myFunc, axis=1), which is horribly slow of course.
Now that we are growing and I get more and more data to process, I am starting to get concerned about performance. Below is an example of a function that I call a lot and would like to optimize (some ideas I have are below). These functions are applied to a DataFrame which has (among others) the following fields:
Loan_Type : a field containing a string that determines the type of loan; we have many different names, but for this example it comes down to 4 types: Type 1 and Type 2, each either held by staff or not.
Activity_Date : The date the activity on the loan was logged (it's a daily loan activity table, if that tells you anything)
Product_Account_Status : The status given by the table to these loans on the Activity_Date (are they active, or in some other status?); this needs to be recalculated because it is not always calculated in the table (don't ask why it is like this, complete headache).
Activation_Date : The date the loan was activated
Sum_Paid_To_Date : The amount of money paid into the loan at the Activity_Date
Deposit_Amount : The deposit amount for the loan
Last_Paid_Date : The last date a payment was made into the loan.
So two example functions:
def productType(x):
    # Determines the type of the product, for later aggregation purposes,
    # and to determine the amount to be payable per day
    if ('Loan Type 1' in x['Loan_Type']) & (not ('Staff' in x['Loan_Type'])):
        return 'Loan1'
    elif ('Loan Type 2' in x['Loan_Type']) & (not ('Staff' in x['Loan_Type'])):
        return 'Loan2'
    elif ('Loan Type 1' in x['Loan_Type']) & ('Staff' in x['Loan_Type']):
        return 'Loan1Staff'
    elif ('Loan Type 2' in x['Loan_Type']) & ('Staff' in x['Loan_Type']):
        return 'Loan2Staff'
    elif ('Mobile' in x['Loan_Type']) | ('MM' in x['Loan_Type']):
        return 'Other'
    else:
        raise ValueError(
            'A payment plan is not captured in the code, please check it!')
This function is then applied to the DataFrame AllLoans which contains all loans I want to analyze at that moment, by using:
AllLoans['productType'] = AllLoans.apply(lambda x: productType(x), axis = 1)
Then I want to apply some other functions; one example is given below. This function determines whether the loan is blocked or not, depending on how long someone hasn't paid and on some other statuses that are important but currently stored as strings in the loan table. Examples are customers who are cancelled (for being blocked for too long) or carry some other status; we treat customers differently based on these tags.
def customerStatus(x):
    # Sets the customer status based on the column Product_Account_Status or
    # the days of inactivity
    if x['productType'] == 'Loan1':
        dailyAmount = 2
    elif x['productType'] == 'Loan2':
        dailyAmount = 2.5
    elif x['productType'] == 'Loan1Staff':
        dailyAmount = 1
    elif x['productType'] == 'Loan2Staff':
        dailyAmount = 1.5
    else:
        raise ValueError(
            'Daily amount to be paid could not be calculated, check if productType is defined.')
    if x['Product_Account_Status'] == 'Cancelled':
        return 'Cancelled'
    elif x['Product_Account_Status'] == 'Suspended':
        return 'Suspended'
    elif x['Product_Account_Status'] == 'Pending Deposit':
        return 'Pending Deposit'
    elif x['Product_Account_Status'] == 'Pending Allocation':
        return 'Pending Allocation'
    elif x['Outstanding_Balance'] == 0:
        return 'Finished Payment'
    # If this check returns True it means that Last_Paid_Date is zero/null, as
    # far as I can see this means that the customer has only paid the deposit
    # and is thus an FPD
    elif type(x['Date_Of_Activity'] - x['Last_Paid_Date']) != (pd.tslib.NaTType):
        if (((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) > 30) | ((((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) > 14) & ((x['Sum_Paid_To_Date'] - x['Deposit_Amount']) <= dailyAmount)):
            return 'Blocked'
        elif ((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) <= 30:
            return 'Active'
    # If this is True, the customer has not paid more than the deposit, so it
    # will fall on the age of the customer whether they are blocked or not
    elif type(x['Date_Of_Activity'] - x['Last_Paid_Date']) == (pd.tslib.NaTType):
        # The date is changed here to 14 because of FPD definition
        if ((x['Date_Of_Activity'] - x['Activation_Date']).days + 1) <= 14:
            return 'Active'
        elif ((x['Date_Of_Activity'] - x['Activation_Date']).days + 1) > 14:
            return 'Blocked'
    # If we have reached the end and still haven't found the status, it will
    # get the following status
    return 'Other Status'
This is again applied using AllLoans['customerStatus'] = AllLoans.apply(lambda x: customerStatus(x), axis = 1). As you can see there are many string and date comparisons, and it is confusing to me how to 'properly' vectorize these functions.
Apologies if this is Optimization 101, but I have tried to search for answers and strategies on how to do this and couldn't find really comprehensive ones. I was hoping to get some tips here; thanks in advance for your time.
Some thoughts on making this faster/getting towards a more vectorized approach:
Make the customerStatus function slightly more modular by making a function that determines the daily amounts, and store this in the dataframe for quicker access (I need to access them later anyway, and determine this variable in multiple functions).
Make the input column for the productType function into integers by using some sort of dict, so that fewer string functions need to be called (but I feel like this won't be my biggest speed-up).
Some things that I would like to do but don't really know where to start on:
How to properly vectorize these functions that contain many if statements based on string/date comparisons (the business rules can be a bit complex here) across different columns in the dataframe; a sketch of what I mean is below. The code might become a bit more complex, but I need to apply these functions multiple times to slightly different (but importantly different) dataframes, and these are growing larger and larger, so the functions need to live in some sort of library for ease of access, and the code needs to be sped up because it simply takes too much time.
I have tried to look at solutions like Numba and Cython, but I don't understand enough of the inner workings of C to use them properly (yet; I would like to learn). Any suggestions on how to improve performance would be greatly appreciated.
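For concreteness, here is a minimal sketch of what I imagine a vectorized productType could look like with numpy.select; it mirrors the if/elif chain above and is only meant to show the shape of solution I am after (untested):
import numpy as np

# Evaluate each business rule on whole columns at once instead of per row.
loan_type = AllLoans['Loan_Type']
is_staff = loan_type.str.contains('Staff')
conditions = [
    loan_type.str.contains('Loan Type 1') & ~is_staff,
    loan_type.str.contains('Loan Type 2') & ~is_staff,
    loan_type.str.contains('Loan Type 1') & is_staff,
    loan_type.str.contains('Loan Type 2') & is_staff,
    loan_type.str.contains('Mobile') | loan_type.str.contains('MM'),
]
choices = ['Loan1', 'Loan2', 'Loan1Staff', 'Loan2Staff', 'Other']
# 'Unknown' flags rows no rule captured, replacing the ValueError above
AllLoans['productType'] = np.select(conditions, choices, default='Unknown')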
Kind regards,
Tim.

how to compare date (in query criteria) in pyral

I have a requirement to find the tasks which are not updated.
The criteria looks like :
'iteration.Name = \"iterationName\" and State!=Completed and LastUpdateDate<'+str(datetime.datetime.now())+"'"
This would result in:
iteration.Name = "iterationName" and State!=Completed and LastUpdateDate<'2015-12-27 20:17:08.769000'
I didn't get any results.
The Rally task object has the LastUpdateDate as 2015-12-16T09:54:30.600Z.
How do I compare the LastUpdateDate in the query criterion?
I had the same problem with multiple arguments; in the end I had to add brackets to get it to work.
'((iteration.Name = "iterationName") AND (State != Completed)) AND (LastUpdateDate < "' + str(datetime.datetime.now()) + '")'
There are probably too many brackets in there; the key thing, I think, was to get a bracket around the first two conditions. If adding another condition it would look something like this:
((((condition1) AND (condition2)) AND (condition3)) AND (condition4))
The question was asked ages ago, but this might be useful for people looking to use queries with dates in pyral.
There are several issues with the code in the question:
datetime.datetime.now() will probably not return a date in the format needed by Rally, so it is better to use strftime to get the proper format
Multiple conditions need to have correct set of parentheses; for example: ((((condition1) AND (condition2)) AND (condition3)) AND (condition4))
LastUpdateDate will necessarily be previous to current date (unless the users have been able to jump to the future).
It is better to wrap dates in double quotes (") instead of single quotes (').
Here is the code I came up with for identifying tasks that have not been updated in the last 5 days and are not completed:
iter_name = "2018-Iteration-4"
five_days_ago = datetime.datetime.now() - datetime.timedelta(days=5)
str_date = five_days_ago.strftime("%Y-%m-%dT%H:%M:%S.%fZ")

tasks_not_updated = rally.get(
    'Task',
    query='(((iteration.Name = "%s")'
          ' and (State != Completed))'
          ' and (LastUpdateDate < "%s"))' % (iter_name, str_date)
)

for task in tasks_not_updated:
    print("%s (%s)" % (task.Name, task.State))

invalid filter: Only one property per query may have inequality filters (>=, <=, >, <)

I have a number of items which are bookable in certain timeslots, e.g. a tennis court. Each item has a number of associated availability slots, each defined by a begintime and an endtime. Begintime and endtime are datetime objects, so an availability slot from 09:00 - 11:30 is stored as e.g. 2013-12-13 09:00 (begintime) to 2013-12-13 11:30 (endtime).
When a booking request comes in, I need to find out whether the tennis court is available for the desired timeslot.
So I am trying to filter availability slots based on start-time and end-time, and my query looks like this:
desired_availability_start = datetime(2013, 12, 13, 9,0,0)
desired_availability_end = datetime(2013, 12, 13, 10,0,0)
availability_slots = self.availability_slots.filter("begin <= ", desired_availability_start).filter("end >= ", desired_availability_end).fetch(limit=10)
but I get the following error
invalid filter: Only one property per query may have inequality filters (>=, <=, >, <)
This is because I am trying to filter on both the begin and end properties.
Based on the input and on some of the other posts on the topic (Inequality Filter in AppEngine Datastore and BadFilterError: invalid filter: Only one property per query may have inequality filters (<=, >=, <, >)), my current solution is to first filter on begin:
filtered_availability_slots = self.availability_slots.filter("begin <= ", desired_availability_start).fetch(limit=10)
and then filter on end and append the filtered items to a list:
final_availability_slots = []
for availability in filtered_availability_slots:
    if availability.end >= desired_availability_end:
        final_availability_slots.append(availability)
But is this the best way to achieve what I want?
I am using Google App Engine and Python
Any help is appreciated
thanks
Thomas
As I guess you already know, you can't use inequality filters on more than one property in the Datastore. Unless you really need to, you can filter using the 'begin' time only and still get pretty accurate results.
calitem = self.appointments.filter("begin >= ", start).filter("begin <= ", end).fetch(limit=10)
If you really need to, you can use your application logic to show only the items that don't go beyond the 'end' value. I don't see any other way around it.
It's a bit unclear what you're asking. It's not clear whether you understand the problem: You're trying to use two inequality filters, and it's simply not allowed. Can't do it.
You must work around this datastore limitation.
The most basic option is to brute force it yourself. Use one filter, and manually filter out the results yourself. It may help to filter on begin, and sort on end, but you'll have to go through the results and pick the actual entities you want.
calitem = [x for x in self.appointments.filter("begin >= ", start).filter("begin <= ", end) if x.end <= end]
In most cases, you'd want to restructure your data so that you don't need two inequality filters. This may or may not be possible.
I'm trying to guess at what you're doing, but if you're trying to see if someone is busy at 11am based on their calendar, a way to do this (sketched below) is:
Break the day down into time periods instead of using arbitrary times, i.e. 15-minute blocks.
Store an event as a list of the time blocks that it uses.
Query for events that contain the time block for 11am.
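For illustration, a minimal sketch of the time-block idea, assuming 15-minute blocks indexed from midnight (names made up for the example):
from datetime import datetime

BLOCK_MINUTES = 15

def time_blocks(begin, end):
    # Map a begin/end pair to the 15-minute block indices it covers;
    # block 0 is 00:00-00:15. The end is rounded up so a partially
    # used block still counts as occupied.
    start_idx = (begin.hour * 60 + begin.minute) // BLOCK_MINUTES
    end_idx = -(-(end.hour * 60 + end.minute) // BLOCK_MINUTES)  # ceiling division
    return list(range(start_idx, end_idx))

# A 09:00-11:30 slot covers blocks 36..45, so "is the court busy at 11:00?"
# becomes an equality check against block 44 - no inequality filter needed.
blocks = time_blocks(datetime(2013, 12, 13, 9, 0), datetime(2013, 12, 13, 11, 30))
print(44 in blocks)  # True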
I have a similar requirement: pick entities out of Datastore that should be rendered/deliverd now. Since Datastore cannot handle this, application logic is required. I make two separate queries for keys that satisfy both ends of the constraint, and then take the intersection of them:
satisfies "begin" criteria: k1, k3, |k4, k5|, k6
--------+------+----
satisfies "end" criteria: k2, |k4, k5|, k7, k8
The intersection of "begin" and "end" are the keys k4, k5.
now = datetime.now()
start_dt_q = FooBar.all()
start_dt_q.filter('start_datetime <', now)
start_dt_q.filter('omit_from_feed =', False)
start_dt_keys = start_dt_q.fetch(None, keys_only=True)
end_dt_q = FooBar.all()
end_dt_q.filter('end_datetime >', now)
end_dt_q.filter('omit_from_feed =', False)
end_dt_keys = end_dt_q.fetch(None, keys_only=True)
# Get "intersection" of two queries; equivalent to
# single query with multiple criteria
valid_dt_keys = list(set(start_dt_keys) & set(end_dt_keys))
I then iterate over those keys getting the entities I need:
for key in valid_dt_keys:
    foobar = FooBar.all().filter('__key__ =', key).get()
    ...
OR:
foobars = FooBar.all().filter('__key__ IN', valid_dt_keys)
for foobar in foobars:
    ...
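A possible shortcut for that last step, assuming the old google.appengine.ext.db API (which supports batch gets by key), is to fetch all the entities in one call instead of running a query per key:
from google.appengine.ext import db

# Batch-get the entities for the intersected keys in a single RPC.
foobars = db.get(valid_dt_keys)
for foobar in foobars:
    print(foobar.start_datetime, foobar.end_datetime)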

How can I make my code more efficient?

I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
Here is how I am currently doing this:
# record the time of each message matching the specified message, per tool
messageTimes = {}
for row in cdata:  # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message,
# as well as the matched messages themselves
def determine(tup):
    if self.message in tup[2]:
        return True  # matched message
    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time - tup[1]) <= tdiff:
                return True
    return False

cdata[:] = [tup for tup in cdata if determine(tup)]
This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
This data came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python, thinking that would be faster. Maybe I was wrong.
Can anyone give me some suggestions on speeding this up?
Updating my post to show what I did in SQL as was suggested.
What I did in SQL was pretty straightforward. The first query was something like:
SELECT tool, date_time, message
FROM event_log
WHERE message LIKE '%foo%'
AND other selection criteria
That was fast enough, but it may return 20 or 30 thousand rows. So then I looped through the result set, and for each row ran a query like this (where dt and t are the date_time and tool from a row from the above select):
SELECT date_time, message
FROM event_log
WHERE tool = t
AND ABS(TIMESTAMPDIFF(SECOND, date_time, dt)) <= timediff
That was taking about an hour.
I also tried doing in one nested query where the inner query selected the rows from my first query, and the outer query selected the time diff rows. That took even longer.
So now I am selecting without the message LIKE '%foo%' clause and I am getting back 600,000 rows and trying to pull out the rows I want from python.
The way to optimize the SQL is to do it all in one query, instead of iterating over 20K rows and doing another query for each one.
Usually this means you need to add a JOIN, or occasionally a sub-query. And yes, you can JOIN a table to itself, as long as you rename one or both copies. So, something like this:
SELECT el2.date_time, el2.message
FROM event_log AS el1
JOIN event_log AS el2 ON el2.tool = el1.tool
WHERE el1.message LIKE '%foo%'
AND other selection criteria
AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= timediff
Now, this probably won't be fast enough out of the box, so there are two steps to improve it.
First, look for any columns that obviously need to be indexed. Clearly tool and date_time need simple indices. message may benefit from either a simple index or, if your database has something fancier, maybe something fancier; but given that the initial query was fast enough, you probably don't need to worry about it.
Occasionally, that's sufficient. But usually, you can't guess everything correctly. And there may also be a need to rearrange the order of the queries, etc. So you're going to want to EXPLAIN the query, and look through the steps the DB engine is taking, and see where it's doing a slow iterative lookup when it could be doing a fast index lookup, or where it's iterating over a large collection before a small collection.
For tabular data, you can't go past the Python pandas library, which contains highly optimised code for queries like this.
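That claim deserves a sketch. Here is one way the filtering could look in pandas (a rough illustration only, with made-up column names; it assumes the time column is datetime-like, and the pairwise merge can use a lot of memory on 600,000 rows):
import pandas as pd

# Build a frame from the raw tuples, then find the matching rows.
df = pd.DataFrame(cdata, columns=['tool', 'time', 'message'])
tdiff_seconds = 60  # assumed window size
hits = df.loc[df['message'].str.contains('foo', regex=False), ['tool', 'time']]

# Pair every row with every matching row for the same tool, then keep
# rows whose time falls within the window of any match.
pairs = df.reset_index().merge(hits, on='tool', suffixes=('', '_hit'))
near = pairs[(pairs['time'] - pairs['time_hit']).abs()
             <= pd.Timedelta(seconds=tdiff_seconds)]

keep = set(near['index']) | set(hits.index)
result = df.loc[sorted(keep)]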
I fixed this by changing my code as follows:
First, I made messageTimes a dict of lists keyed by the tool:
messageTimes = defaultdict(list)  # a dict of sorted lists, keyed by tool
for row in cdata:  # tool, time, module, message
    if self.message in row[3]:
        messageTimes[row[0]].append(row[1])
Then, in the determine function, I used bisect:
def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    # times[le-1] is the nearest recorded time <= tup[1]; times[ge] the nearest >=
    le = bisect.bisect_right(times, tup[1])
    ge = bisect.bisect_left(times, tup[1])
    return (le and tup[1] - times[le-1] <= tdiff) or \
           (ge != len(times) and times[ge] - tup[1] <= tdiff)
With these changes the code that was taking over 2 hours took under 20 minutes, and even better, a query that was taking 40 minutes took 8 seconds!
I made 2 more changes and now that 20 minute query is taking 3 minutes:
found = defaultdict(int)  # remembers the last bisect position per tool

def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    idx = found[tup[0]]
    le = bisect.bisect_right(times, tup[1], idx)  # search from the last position on
    found[tup[0]] = le  # remember the position for the next call
    return (le and tup[1] - times[le-1] <= tdiff) or \
           (le != len(times) and times[le] - tup[1] <= tdiff)
