SQLAlchemy performance when iterating queries millions of times - python

I'm writing a disease simulation in Python, using SQLAlchemy, but I'm hitting some performance issues when running queries on a SQLite file I create earlier in the simulation.
The code is below. There are more queries in the outer for loop, but what I've posted is what slowed it down to a crawl. There are 365 days, about 76,200 mosquitos, and each mosquito makes 5 contacts per day, bringing it to about 381,000 queries per simulated day, and about 139,065,000 through the entire simulation (and that's just for the mosquitos). It goes along at about 2 days / hour which, if I'm calculating correctly, is about 212 queries per second.
Do you see any issues that could be fixed that could speed things up? I've experimented with indexing the fields which are used in selection but that didn't seem to change anything. If you need to see the full code, it's available here on GitHub. The function begins on line 399.
Thanks so much, in advance.
# Run mosquito-human interactions
for d in range(days_to_run):
    # ... much more code before this, but it ran reasonably fast
    vectors = session.query(Vectors).yield_per(1000)  # grab each vector..
    for m in vectors:
        i = 0
        while i < biting_rate:
            pid = random.randint(1, number_humans)  # Pick a human to bite
            contact = session.query(Humans).filter(Humans.id == pid).first()  # Select the randomly-chosen human from SQLite table
            if contact:  # If the random id equals an ID in the table
                if contact.susceptible == 'True' and m.infected == 'True' and random.uniform(0, 1) < beta:  # if the human is susceptible and mosquito is infected, infect the human
                    contact.susceptible = 'False'
                    contact.exposed = 'True'
                elif contact.infected == 'True' and m.susceptible == 'True':  # otherwise, if the mosquito is susceptible and the human is infected, infect the mosquito
                    m.susceptible = 'False'
                    m.infected = 'True'
                    nInfectedVectors += 1
                    nSuscVectors += 1
            i += 1
    session.commit()
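As a sanity check on the workload, the figures can be re-derived from the inputs stated in the question (365 days, 76,200 mosquitos, 5 contacts per mosquito per day, roughly 2 simulated days per hour). The variable names below are illustrative and do not come from the simulation code.

```python
# Inputs as stated in the question.
days = 365
mosquitos = 76_200
contacts_per_mosquito_per_day = 5
simulated_days_per_hour = 2

# One Humans lookup is issued per contact.
queries_per_day = mosquitos * contacts_per_mosquito_per_day
queries_total = queries_per_day * days
queries_per_second = queries_per_day * simulated_days_per_hour / 3600

print(queries_per_day)            # 381000
print(queries_total)              # 139065000
print(round(queries_per_second))  # 212
```

At roughly 212 single-row lookups per second, the per-query round trip through the ORM is the cost being paid on every contact.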

Related

Event Rate Timing - Events/time period and real-time

I need to create various rate timers of events across a timed period. For example, 5 events/16 seconds, or real-time. Akin to bandwidth calculations.
From what I can tell, this is something that would need to be written from scratch, as the timing functions I've seen are for performing an event every X seconds.
Questions:
Are there libraries for this type of thing, or would they need to be written from scratch?
I have a fair stab at some manual functions, for example for events over time_len:
def time( self,quantity ):
    self.this_ts = c.time.time_ns()
    self.this_diff_ts = self.this_ts - self.last_ts
    if( self.this_diff_ts < self.time_len ):
        self.this_count += quantity
    else:
        self.this_rate = self.this_count
        self.this_count = quantity
        self.ticks = 0
        self.last_ts = self.this_ts
Is that a reasonable approach?
How is a real-time rate actually calculated? Would it be a count of events within a second averaged over seconds?
Thank you,
John
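One common way to express an "events per time period" rate is a sliding window: keep the timestamps of recent events, drop those older than the window, and divide what remains by the window length. The sketch below is only an illustration of that idea, not a known library API. The class name SlidingWindowRate and its parameters are hypothetical, and the clock is injectable so the behaviour can be checked without waiting in real time.

```python
import time
from collections import deque


class SlidingWindowRate:
    """Count events seen during the last `window_s` seconds (hypothetical helper)."""

    def __init__(self, window_s, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock      # injectable so the class can be tested
        self.events = deque()   # (timestamp, quantity) pairs, oldest first
        self.total = 0          # sum of quantities currently inside the window

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] >= self.window_s:
            _, quantity = self.events.popleft()
            self.total -= quantity

    def add(self, quantity=1):
        now = self.clock()
        self._expire(now)
        self.events.append((now, quantity))
        self.total += quantity

    def rate(self):
        """Events per second, averaged over the window."""
        self._expire(self.clock())
        return self.total / self.window_s
```

With a 16-second window, five calls to add() followed by rate() give 5/16 events per second. Unlike the fixed-bucket approach in the question, the value decays smoothly as old events leave the window instead of resetting at bucket boundaries.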

How to optimize my code for the Kattis Accounting Question?

I am doing this Kattis accounting question but at test case 10, it has the error Time limit exceeded.
How can I optimize my code to make it run faster?
Here's the question!
Erika the economist studies economic inequality. Her model starts in a
situation where everybody has the same amount of money. After that,
people’s wealth changes in various complicated ways.
Erika needs to run a simulation a large number of times to check if
her model works. The simulation consists of n people, each of whom
begins with 0 kroners. Then q events happen, of three different types:
An event of type “SET i x” means that the i-th person’s wealth is set to x.
An event of type “RESTART x” means that the simulation is restarted,
and everybody’s wealth is set to x.
An event of type “PRINT i” reports the current wealth of the i-th person.
Unfortunately, Erika’s current implementation is very slow; it takes
far too much time to keep track of how much money everybody has. She
decides to use her algorithmic insights to speed up the simulation.
Input The first line includes two integers n and q, the number of
people and the number of events. The following q lines each start with
a string that is either “SET”, “RESTART”, or “PRINT”. There is
guaranteed to be at least one event of type “PRINT”.
If the string is “SET” then it is followed by two integers i and x.
If the string is “RESTART” then it is followed by an integer x.
If the string is “PRINT” then it is followed by an integer i.
Output For each event of type “PRINT”, write the i-th person’s capital.
Sample Input 1:
3 5
SET 1 7
PRINT 1
PRINT 2
RESTART 33
PRINT 1
Sample Output 1:
7
0
33
Sample Input 2:
5 7
RESTART 5
SET 3 7
PRINT 1
PRINT 2
PRINT 3
PRINT 4
PRINT 5
Sample Output 2:
5
5
7
5
5
# print("Enter 2 numbers")
n, q = map(int, input().split())
# print(n , q)
people = {}
def createPeople(n):
    for i in range(n):
        number = i+1
        people[number] = 0
    return people
def restart(n,new):
    for i in range(n):
        number = i+1
        people[number] = new
    return people
def setPeople(d ,id , number):
    d[id] = number
    return d
    # return d.update({id: number})
def logic(n,dict,q):
    for i in range(q):
        # print("enter Command")
        r = input()
        r = r.split()
        # print("r" ,r)
        if r[0] == "SET":
            # print(people , "People list")
            abc = setPeople(dict, int(r[1]), int(r[2]))
            # print(list)
        elif r[0] == "RESTART":
            abc = restart(n, int(r[1]))
        elif r[0] == "PRINT":
            print(dict[int(r[1])])
    # return abc
people = createPeople(n)
# print(people)
test = logic(n,people,q)
The input is too big to be doing anything linear per event, like looping over all of the people and setting their values by hand. If we have 10^5 queries and 10^6 people, the worst case scenario is resetting over and over again, 10^11 operations.
Easier is to keep a variable to track the baseline value after resets. Whenever a reset occurs, dump all entries in the dictionary and set the baseline to the specified value. Assume any further lookups for people that aren't in the dictionary to have the most recent baseline value. Now, all operations are amortized O(1) (clearing the dictionary only removes entries that earlier SET events already paid for), and we can handle 10^5 queries linearly.
people = {}
baseline = 0
n, q = map(int, input().split())
for _ in range(q):
    command, *args = input().split()
    if command == "SET":
        people[int(args[0])] = int(args[1])
    elif command == "RESTART":
        people.clear()
        baseline = int(args[0])
    elif command == "PRINT":
        print(people.get(int(args[0]), baseline))
As an aside, writing abstractions is great in a real program, but for these tiny code challenges I'd just focus on directly solving the problem. This reduces the potential for confusion with return values like abc that seem to have no clear purpose.
Per PEP-8, use snake_case rather than camelCase in Python.
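To check the baseline idea against the two samples from the problem statement without typing input by hand, the same logic can be wrapped in a function that takes the input lines and returns the values PRINT would output. This is only a test harness sketch. The name simulate is hypothetical, and an actual submission would still read from standard input.

```python
def simulate(lines):
    """Process the events and return the values that PRINT would output."""
    n, q = map(int, lines[0].split())
    people = {}    # only people explicitly SET since the last RESTART
    baseline = 0   # wealth of everyone not present in `people`
    printed = []
    for line in lines[1:q + 1]:
        command, *args = line.split()
        if command == "SET":
            people[int(args[0])] = int(args[1])
        elif command == "RESTART":
            people.clear()
            baseline = int(args[0])
        elif command == "PRINT":
            printed.append(people.get(int(args[0]), baseline))
    return printed


sample_1 = ["3 5", "SET 1 7", "PRINT 1", "PRINT 2", "RESTART 33", "PRINT 1"]
sample_2 = ["5 7", "RESTART 5", "SET 3 7", "PRINT 1", "PRINT 2",
            "PRINT 3", "PRINT 4", "PRINT 5"]
print(simulate(sample_1))  # [7, 0, 33]
print(simulate(sample_2))  # [5, 5, 7, 5, 5]
```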

Why does my for loop with if else clause run so slow?

TL;DR:
I'm trying to understand why the below for loop is incredibly slow, taking hours to run on a dataset of 160K entries.
I have a working solution using a function and .apply(), but I want to understand why my homegrown solution is so bad. I'm obviously a huge beginner with Python:
popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1
df['popular_or_not'] = popular_or_not
df
In more detail:
I'm currently learning Python for data science, and I'm looking at this dataset on Kaggle: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks
I'm interested in predicting/modelling the popularity score. It is not normally distributed:
plt.bar(df['popularity'].value_counts().index, df['popularity'].value_counts().values)
I would like to add a column, to say whether a track is popular or not, with popular tracks being those that get a score of 5 and above and unpopular being the others.
I have tried the following solution, but it runs incredibly slowly, and I'm not sure why. It runs fine on a very small subset, but would take a few hours to run on the full dataset:
popular_or_not = []
counter = 0
for id in df['id']:
    if df['popularity'][df['id'] == id].values == 0:
        popular_or_not.append(0)
    else:
        popular_or_not.append(1)
    counter += 1
df['popular_or_not'] = popular_or_not
df
This alternative solution works fine:
def check_popularity(score):
    if score > 5:
        return 1
    else:
        #pdb.set_trace()
        return 0
df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)
I think understanding why my first solution doesn't work might be an important part of my Python learning.
Thanks everyone for your comments. I'm going to summarize them below as an answer to my question, but please feel free to jump in if anything is incorrect:
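The reason my initial for loop was so slow is that I was evaluating df['id'] == id 160k times. Each evaluation compares id against every row of the column, so the loop performs roughly 160k × 160k ≈ 2.6 × 10^10 comparisons in total.
For this type of operation, instead of looping over a pandas dataframe row by row, it's usually better to process the whole column at once. Vectorization refers to the tools and methods that process a whole column in a single instruction at C speed. Strictly speaking, .apply on a column still calls the Python function once per value rather than being vectorized, but it makes a single pass over the data, and that is what I did with the following code: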
def check_popularity(score):
    if score > 5:
        return 1
    else:
        #pdb.set_trace()
        return 0
df['popularity'].apply(check_popularity).value_counts()
df['popular_or_not'] = df['popularity'].apply(check_popularity)
By using .apply with a pre-defined function, I get the new column in seconds instead of hours.
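For comparison, a fully vectorized version of the same rule needs no Python-level function at all: the comparison is applied to the whole column and the resulting booleans are cast to integers. This is a minimal sketch on a made-up four-row frame. The threshold mirrors check_popularity (score > 5), and the sample values are invented for illustration.

```python
import pandas as pd

# Tiny stand-in for the Spotify data; the values are made up for illustration.
df = pd.DataFrame({"id": ["a", "b", "c", "d"], "popularity": [0, 5, 6, 42]})

# One comparison over the whole column; True/False is then cast to 1/0.
df["popular_or_not"] = (df["popularity"] > 5).astype(int)

print(df["popular_or_not"].tolist())  # [0, 0, 1, 1]
```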

Efficiently writing string comparison functions in Pandas

Let's say I work for a company that hands out different types of loans. We are getting our loan information from a big data mart, from which I need to calculate some additional things, such as whether someone is in arrears or not. Right now, for clarity's sake, I have done this with a rather dumb function that iterates over all rows (where all information about a loan is stored) by using pd.DataFrame.apply(myFunc, axis=1), which is horribly slow of course.
Now that we are growing and that I get more and more data to process, I am starting to get concerned over performance. Below is an example of a function that I call a lot, and would like to optimize (some ideas that I have below). These functions are applied to a DataFrame which has (a.o.) the following fields:
Loan_Type : a field containing a string that determines the type of loan. We have many different names, but it comes down to one of 4 types (for this example): Type 1 or Type 2, and whether or not the loan is held by staff.
Activity_Date : The date the activity on the loan was logged (it's a daily loan activity table, if that tells you anything)
Product_Account_Status : The status given by the table to these loans (are they active, or some other status?) on the Activity_Date, this needs to be recalculated because it is not always calculated in the table (don't ask why it is like this, complete headache).
Activation_Date : The date the loan was activated
Sum_Paid_To_Date : The amount of money paid into the loan at the Activity_Date
Deposit_Amount : The deposit amount for the loan
Last_Paid_Date : The last date a payment was made into the loan.
So two example functions:
def productType(x):
    # Determines the type of the product, for later aggregation purposes, and to determine the amount to be payable per day
    if ('Loan Type 1' in x['Loan_Type']) & (not ('Staff' in x['Loan_Type'])):
        return 'Loan1'
    elif ('Loan Type 2' in x['Loan_Type']) & (not ('Staff' in x['Loan_Type'])):
        return 'Loan2'
    elif ('Loan Type 1' in x['Loan_Type']) & ('Staff' in x['Loan_Type']):
        return 'Loan1Staff'
    elif ('Loan Type 2' in x['Loan_Type']) & ('Staff' in x['Loan_Type']):
        return 'Loan2Staff'
    elif ('Mobile' in x['Loan_Type']) | ('MM' in x['Loan_Type']):
        return 'Other'
    else:
        raise ValueError(
            'A payment plan is not captured in the code, please check it!')
This function is then applied to the DataFrame AllLoans which contains all loans I want to analyze at that moment, by using:
AllLoans['productType'] = AllLoans.apply(lambda x: productType(x), axis = 1)
Then I want to apply some other functions, one example of such a function is given below. This function determines whether the loan is blocked or not, depending on how long someone hasn't paid, and some other statuses that are important, but are currently stored in strings in the loan table. Examples of this are whether people are cancelled (for being blocked for too long), or some other statuses, we treat customers differently based on these tags.
def customerStatus(x):
    # Sets the customer status based on the column Product_Account_Status or
    # the days of inactivity
    if x['productType'] == 'Loan1':
        dailyAmount = 2
    elif x['productType'] == 'Loan2':
        dailyAmount = 2.5
    elif x['productType'] == 'Loan1Staff':
        dailyAmount = 1
    elif x['productType'] == 'Loan2Staff':
        dailyAmount = 1.5
    else:
        raise ValueError(
            'Daily amount to be paid could not be calculated, check if productType is defined.')
    if x['Product_Account_Status'] == 'Cancelled':
        return 'Cancelled'
    elif x['Product_Account_Status'] == 'Suspended':
        return 'Suspended'
    elif x['Product_Account_Status'] == 'Pending Deposit':
        return 'Pending Deposit'
    elif x['Product_Account_Status'] == 'Pending Allocation':
        return 'Pending Allocation'
    elif x['Outstanding_Balance'] == 0:
        return 'Finished Payment'
    # If this check returns True it means that Last_Paid_Date is zero/null, as
    # far as I can see this means that the customer has only paid the deposit
    # and is thus an FPD
    elif type(x['Date_Of_Activity'] - x['Last_Paid_Date']) != (pd.tslib.NaTType):
        if (((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) > 30) | ((((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) > 14) & ((x['Sum_Paid_To_Date'] - x['Deposit_Amount']) <= dailyAmount)):
            return 'Blocked'
        elif ((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) <= 30:
            return 'Active'
    # If this is True, the customer has not paid more than the deposit, so it
    # will fall on the age of the customer whether they are blocked or not
    elif type(x['Date_Of_Activity'] - x['Last_Paid_Date']) == (pd.tslib.NaTType):
        # The date is changed here to 14 because of FPD definition
        if ((x['Date_Of_Activity'] - x['Activation_Date']).days + 1) <= 14:
            return 'Active'
        elif ((x['Date_Of_Activity'] - x['Activation_Date']).days + 1) > 14:
            return 'Blocked'
    # If we have reached the end and still haven't found the status, it will
    # get the following status
    return 'Other Status'
This is again applied by using AllLoans['customerStatus'] = AllLoans.apply(lambda x: customerStatus(x), axis = 1). As you can see there are many string comparisons and date comparisons, which are a bit confusing for me on how I can 'properly' vectorize these functions.
Apologies if this is Optimization 101. I have tried to search for answers and strategies on how to do this, but couldn't find really comprehensive ones. I was hoping to get some tips here, thanks in advance for your time.
Some thoughts on making this faster/getting towards a more vectorized approach:
Make the customerStatus function slightly more modular by making a function that determines the daily amounts, and stores this in the dataframe for quicker access (I need to access them later anyway, and determine this variable in multiple functions).
Make the input column for the productType function into integers by using some sort of dict, so that fewer string functions need to be called on it (but I feel like this won't be my biggest speed-up).
Some things that I would like to do but don't really know where to start on:
How to properly vectorize these functions that contain many if statements based on string/date comparisons (business rules can be a bit complex here) across different columns in the dataframe. The code might become a bit more complex, but I need to apply these functions multiple times to slightly different (but importantly different) dataframes, and these are growing larger and larger, so these functions need to be in some sort of library for ease of access, and the code needs to be sped up because it simply takes too much time.
Have tried to search for some solutions like Numba or Cython but I don't understand enough of the inner workings of C to properly use this (or just yet, would like to learn). Any suggestions on how to improve performance would be greatly appreciated.
Kind regards,
Tim.
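One possible direction for the if/elif chains, sketched under assumptions rather than offered as a definitive answer: build one boolean column per substring test with Series.str.contains, then let numpy.select pick the first matching rule per row, which preserves the order of the original chain. The sample rows, the lowercase name all_loans, and the "UNMAPPED" sentinel are made up for illustration. The daily amounts are taken from customerStatus and stored once via Series.map, as the first idea in the question suggests.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the Loan_Type column from the question is used.
all_loans = pd.DataFrame({"Loan_Type": [
    "Loan Type 1 Standard", "Loan Type 2 Staff", "Staff Loan Type 1", "MM top-up",
]})

loan_type = all_loans["Loan_Type"]
is_type1 = loan_type.str.contains("Loan Type 1", regex=False)
is_type2 = loan_type.str.contains("Loan Type 2", regex=False)
is_staff = loan_type.str.contains("Staff", regex=False)
is_other = (loan_type.str.contains("Mobile", regex=False)
            | loan_type.str.contains("MM", regex=False))

# np.select takes the first condition that is True for each row,
# which mirrors the order of the if/elif chain in productType.
conditions = [is_type1 & ~is_staff, is_type2 & ~is_staff,
              is_type1 & is_staff, is_type2 & is_staff, is_other]
choices = ["Loan1", "Loan2", "Loan1Staff", "Loan2Staff", "Other"]
all_loans["productType"] = np.select(conditions, choices, default="UNMAPPED")

# Keep the original's fail-loudly behaviour for unrecognised loan types.
if (all_loans["productType"] == "UNMAPPED").any():
    raise ValueError("A payment plan is not captured in the code, please check it!")

# Store the daily amount once; types without an amount (here 'Other') become NaN.
daily_amounts = {"Loan1": 2, "Loan2": 2.5, "Loan1Staff": 1, "Loan2Staff": 1.5}
all_loans["dailyAmount"] = all_loans["productType"].map(daily_amounts)
```

The date rules in customerStatus could be handled the same way: compute the day differences as whole columns first, then feed the resulting boolean columns into a second np.select call.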

Django query / Iteration issue

I have a fairly noob question regarding iteration that I can't seem to get correct.
I have a table that houses a record for every monthly test a user completes, if they miss a month then there is no record in the table.
I want to pull the user's history from the table, then for each of the 12 months set a Y or N as to their completed status.
Here is my code:
def getSafetyHistory(self, id):
    results = []
    safety_courses = UserMonthlySafetyCurriculums.objects.filter(users_id=id).order_by('month_assigned')
    for i in range(1, 13):
        for s in safety_courses:
            if s.month_assigned == i:
                results.append('Y')
            else:
                results.append('N')
    return results
So my ideal result would be a list with 12 entries, either Y or N
i.e. results = [N,N,Y,N,N,Y,Y,Y,N,N,N,Y]
The query above returns 2 records for the user, which is correct, but in my iteration I keep getting 24 entries, obviously due to the outer and inner loops (12 months × 2 records), and I am not sure of the "pythonic" way I should be doing this without a ton of nested loops.
There are probably lots of ways to do this. Here is one idea.
It looks like you are only going to get records for courses that have been completed. So you could pre-build a list of 12 results, all set to no. Then after you query the database, you flip the ones to yes that correspond to the results you got.
results = ['N'] * 12  # prebuild results to all no
safety_courses = UserMonthlySafetyCurriculums.objects.filter(
    users_id=id).order_by('month_assigned')
for course in safety_courses:
    results[course.month_assigned - 1] = 'Y'
This assumes month_assigned is an integer between 1 and 12, as your code hints at.
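The same idea can be tried without a database by separating the month logic from the query. In this sketch the function name safety_history is hypothetical, and it takes any iterable of completed month numbers. In Django, such an iterable could come from something like safety_courses.values_list('month_assigned', flat=True), assuming the model from the question.

```python
def safety_history(completed_months):
    """Return 12 'Y'/'N' flags, one per month (index 0 is month 1)."""
    completed = set(completed_months)  # set membership checks are O(1)
    return ['Y' if month in completed else 'N' for month in range(1, 13)]


# Two completed months, as in the question's example of 2 records.
print(safety_history([3, 6]))
# ['N', 'N', 'Y', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N']
```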
