Efficiently writing string comparison functions in Pandas - python

Let's say I work for a company that hands out different types of loans. We get our loan information from a big data mart, from which I need to calculate some additional things to determine whether someone is in arrears, etc. Right now, for clarity's sake, I have done this with a rather dumb function that iterates over all rows (where all information about a loan is stored) using pd.DataFrame.apply(myFunc, axis=1), which is horribly slow of course.
Now that we are growing and I get more and more data to process, I am starting to get concerned about performance. Below is an example of a function that I call a lot and would like to optimize (some of my own ideas are further down). These functions are applied to a DataFrame which has (among others) the following fields:
Loan_Type : a field containing a string that determines the type of loan. We have many different names, but for this example it comes down to 4 types: Type 1 and Type 2, each either as a staff loan or a regular one.
Activity_Date : The date the activity on the loan was logged (it's a daily loan activity table, if that tells you anything)
Product_Account_Status : The status given by the table to these loans (are they active, or some other status?) on the Activity_Date, this needs to be recalculated because it is not always calculated in the table (don't ask why it is like this, complete headache).
Activation_Date : The date the loan was activated
Sum_Paid_To_Date : The amount of money paid into the loan at the Activity_Date
Deposit_Amount : The deposit amount for the loan
Last_Paid_Date : The last date a payment was made into the loan.
So two example functions:
def productType(x):
    # Determines the type of the product, for later aggregation purposes,
    # and to determine the amount payable per day
    if 'Loan Type 1' in x['Loan_Type'] and 'Staff' not in x['Loan_Type']:
        return 'Loan1'
    elif 'Loan Type 2' in x['Loan_Type'] and 'Staff' not in x['Loan_Type']:
        return 'Loan2'
    elif 'Loan Type 1' in x['Loan_Type'] and 'Staff' in x['Loan_Type']:
        return 'Loan1Staff'
    elif 'Loan Type 2' in x['Loan_Type'] and 'Staff' in x['Loan_Type']:
        return 'Loan2Staff'
    elif 'Mobile' in x['Loan_Type'] or 'MM' in x['Loan_Type']:
        return 'Other'
    else:
        raise ValueError(
            'A payment plan is not captured in the code, please check it!')
This function is then applied to the DataFrame AllLoans, which contains all loans I want to analyze at that moment:
AllLoans['productType'] = AllLoans.apply(productType, axis=1)
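For reference, a vectorized sketch of the same mapping using numpy.select and pandas' vectorized string methods; the sample frame here is made up, the real AllLoans would come from the data mart:

```python
import numpy as np
import pandas as pd

# Made-up sample frame standing in for AllLoans
AllLoans = pd.DataFrame({'Loan_Type': [
    'Loan Type 1', 'Loan Type 2 Staff', 'Mobile MM Loan', 'Loan Type 1 Staff']})

is_type1 = AllLoans['Loan_Type'].str.contains('Loan Type 1')
is_type2 = AllLoans['Loan_Type'].str.contains('Loan Type 2')
is_staff = AllLoans['Loan_Type'].str.contains('Staff')

conditions = [
    is_type1 & ~is_staff,
    is_type2 & ~is_staff,
    is_type1 & is_staff,
    is_type2 & is_staff,
    AllLoans['Loan_Type'].str.contains('Mobile|MM'),  # regex alternation
]
choices = ['Loan1', 'Loan2', 'Loan1Staff', 'Loan2Staff', 'Other']

# np.select picks the first matching condition per row, all at once
AllLoans['productType'] = np.select(conditions, choices, default='Unknown')
```

Rows matching none of the conditions get the default label instead of raising, so a follow-up check like `(AllLoans['productType'] == 'Unknown').any()` would replace the ValueError branch.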
Then I want to apply some other functions; one example is given below. This function determines whether the loan is blocked or not, depending on how long someone hasn't paid, and on some other statuses that are important but are currently stored as strings in the loan table. Examples are whether customers are cancelled (for being blocked too long) or have some other status; we treat customers differently based on these tags.
def customerStatus(x):
    # Sets the customer status based on the column Product_Account_Status or
    # the days of inactivity
    if x['productType'] == 'Loan1':
        dailyAmount = 2
    elif x['productType'] == 'Loan2':
        dailyAmount = 2.5
    elif x['productType'] == 'Loan1Staff':
        dailyAmount = 1
    elif x['productType'] == 'Loan2Staff':
        dailyAmount = 1.5
    else:
        raise ValueError(
            'Daily amount to be paid could not be calculated, '
            'check if productType is defined.')

    if x['Product_Account_Status'] == 'Cancelled':
        return 'Cancelled'
    elif x['Product_Account_Status'] == 'Suspended':
        return 'Suspended'
    elif x['Product_Account_Status'] == 'Pending Deposit':
        return 'Pending Deposit'
    elif x['Product_Account_Status'] == 'Pending Allocation':
        return 'Pending Allocation'
    elif x['Outstanding_Balance'] == 0:
        return 'Finished Payment'
    elif pd.notnull(x['Last_Paid_Date']):
        # Last_Paid_Date is known, so age the loan from the last payment
        days_inactive = (x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1
        if (days_inactive > 30) or ((days_inactive > 14) and
                (x['Sum_Paid_To_Date'] - x['Deposit_Amount'] <= dailyAmount)):
            return 'Blocked'
        elif days_inactive <= 30:
            return 'Active'
    else:
        # Last_Paid_Date is null: as far as I can see this means the customer
        # has only paid the deposit and is thus an FPD, so it falls on the age
        # of the loan whether they are blocked or not. The cutoff is 14 here
        # because of the FPD definition.
        if ((x['Date_Of_Activity'] - x['Activation_Date']).days + 1) <= 14:
            return 'Active'
        else:
            return 'Blocked'
    # If we have reached the end and still haven't found the status, it will
    # get the following status
    return 'Other Status'
This is again applied using AllLoans['customerStatus'] = AllLoans.apply(customerStatus, axis=1). As you can see there are many string and date comparisons, and it is a bit confusing to me how to 'properly' vectorize these functions.
Apologies if this is Optimization 101, but I have tried to search for answers and strategies on how to do this and couldn't find really comprehensive ones. I was hoping to get some tips here; thanks in advance for your time.
Some thoughts on making this faster/getting towards a more vectorized approach:
Make the customerStatus function slightly more modular by writing a function that determines the daily amounts and stores them in the DataFrame for quicker access (I need to access them later anyway, and I determine this variable in multiple functions).
Map the input column for the productType function to integers using some sort of dict, so that fewer string functions need to be called (but I feel this won't be my biggest speed-up).
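The daily-amount idea in the first bullet can be sketched with Series.map; the amounts are copied from the customerStatus function above, while the sample frame is made up:

```python
import pandas as pd

# Made-up sample; productType would normally come from the productType function
AllLoans = pd.DataFrame({'productType': ['Loan1', 'Loan2Staff', 'Loan1Staff']})

daily_amounts = {'Loan1': 2, 'Loan2': 2.5, 'Loan1Staff': 1, 'Loan2Staff': 1.5}
AllLoans['dailyAmount'] = AllLoans['productType'].map(daily_amounts)

# Unmapped product types become NaN, which can be checked explicitly
assert not AllLoans['dailyAmount'].isna().any()
```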
Some things that I would like to do but don't really know where to start on;
How to properly vectorize functions that contain many if statements based on string/date comparisons across different columns of the DataFrame (the business rules can be a bit complex here). The code might become a bit more complex, but I need to apply these functions multiple times to slightly (but importantly) different DataFrames, and these are growing larger and larger. So the functions need to live in some sort of library for ease of access, and the code needs to be sped up because it simply takes too much time.
I have looked at solutions like Numba and Cython, but I don't understand enough of the inner workings of C to use them properly (yet; I would like to learn). Any suggestions on how to improve performance would be greatly appreciated.
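For the date-based rules specifically, a simplified vectorized sketch with numpy.select (inactivity rules only, made-up data, column names as in the post) could look like:

```python
import numpy as np
import pandas as pd

# Made-up sample rows: recently paid, long inactive, and never paid (NaT)
AllLoans = pd.DataFrame({
    'Date_Of_Activity': pd.to_datetime(['2019-06-01', '2019-06-01', '2019-06-01']),
    'Last_Paid_Date': pd.to_datetime(['2019-05-30', '2019-04-01', pd.NaT]),
    'Activation_Date': pd.to_datetime(['2019-05-01', '2019-03-01', '2019-05-25']),
})

# Whole-column date arithmetic instead of per-row subtraction
days_inactive = (AllLoans['Date_Of_Activity'] - AllLoans['Last_Paid_Date']).dt.days + 1
loan_age = (AllLoans['Date_Of_Activity'] - AllLoans['Activation_Date']).dt.days + 1
never_paid = AllLoans['Last_Paid_Date'].isna()

conditions = [
    ~never_paid & (days_inactive > 30),
    ~never_paid & (days_inactive <= 30),
    never_paid & (loan_age <= 14),   # FPD cutoff from the post
    never_paid & (loan_age > 14),
]
choices = ['Blocked', 'Active', 'Active', 'Blocked']
AllLoans['customerStatus'] = np.select(conditions, choices, default='Other Status')
```

The Product_Account_Status string checks and the dailyAmount condition would become additional entries in `conditions`, evaluated in priority order just like the if/elif chain.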
Kind regards,
Tim.

Related

Function returning different result despite the same inputs in Python

Here is my function that uses the Poloniex Exchange API. It gets a list of asks (pairs of price and amount) and then calculates the total amount of BTC that would be obtained using a given spend.
But running the function several times returns different amounts despite the asks and the spend remaining the same. This problem should be replicable by printing "asks" (defined below) and the function result several times.
def findBuyAmount(spend):
    # getOrderBook
    URL = "https://poloniex.com/public?command=returnOrderBook&currencyPair=USDT_BTC&depth=20"
    # request the bids and asks (returns nested dict)
    r_ab = requests.get(url=URL)
    # extracting data in json format -> returns a dict in this case!
    ab_data = r_ab.json()
    asks = ab_data.get('asks', [])
    # convert strings into decimals
    asks = [[float(elem[0]), elem[1]] for elem in asks]
    amount = 0
    for elem in asks:  # each elem is a pair of price and amount
        if spend > 0:
            if elem[1]*elem[0] > spend:  # check if the ask exceeds volume of our spend
                amount = amount + ((elem[1]/elem[0])*spend)  # BTC that would be obtained using our spend at this price
                spend = 0  # spend has been used entirely, leading to a loop break
            if elem[1]*elem[0] < spend:  # check if the spend exceeds the current ask
                amount = amount + elem[1]  # BTC that would be obtained using some of our spend at this price
                spend = spend - elem[1]*elem[0]  # remainder
        else:
            break
    return amount
If the first ask in the asks list was [51508.93591717, 0.62723766] and spend was 1000, I would expect amount to equal (0.62723766/51508.93591717) * 1000, but I get all kinds of varied outputs instead. How can I fix this?
You get all kinds of varied outputs because you're fetching new data every time you run the function. Split the fetch and the calculation into separate functions so you can test them independently. You can also make the logic much clearer by naming your variables properly:
import requests

def get_asks(url="https://poloniex.com/public?command=returnOrderBook&currencyPair=USDT_BTC&depth=20"):
    response = requests.get(url=url)
    ab_data = response.json()
    asks = ab_data.get('asks', [])
    # convert strings into decimals
    return [(float(price), qty) for price, qty in asks]

def find_buy_amount(spend, asks):
    amount = 0
    for price, qty in asks:
        if spend > 0:
            ask_value = price * qty
            if ask_value >= spend:
                amount += spend / price
                spend = 0
            else:
                amount += qty
                spend -= ask_value
        else:
            break
    return amount

asks = get_asks()
print("Asks:", asks)
print("Buy: ", find_buy_amount(1000, asks))
Your math was wrong for when the ask value exceeds remaining spend; the quantity on the order book doesn't matter at that point, so the amount you can buy is just spend / price.
With the functions split up, you can also run find_buy_amount any number of times with the same order book and see that the result is, in fact, always the same.
The problem is in your "we don't have enough money" path. In that case, the amount you can buy does not depend on the amount that was offered.
if elem[1]*elem[0] > spend:
    amount += spend/elem[0]

MySQL SQLALCHEMY Python Getting Max Count for Timestamp

I have data recorded for several timestamps, and I want to get the maximum count across all timestamps.
This is my code:
for timestamp in timestamps:
    count = db.query(models.Appointment.id).filter(models.Appointment.place == place) \
        .filter(models.Appointment.date == date) \
        .filter(models.Appointment.timestamp == timestamp).count()
    data.append(count)
return max(data)
Sadly, it takes about 1.5 seconds per timestamp to calculate the requested value.
Is there any possibility (a query) which can handle this in around 3-10 seconds?
Regards,
Martin
If using MySQL 8 and later, you could give the following a go:
return db.query(func.max(func.count()).over()).\
    filter(models.Appointment.place == place).\
    filter(models.Appointment.date == date).\
    filter(models.Appointment.timestamp.in_(timestamps)).\
    group_by(models.Appointment.timestamp).\
    limit(1).\
    scalar()
This uses the (slightly non obvious) fact that window functions are evaluated after forming group rows, and without a partition and order the window is over all the group rows.
If using a version of MySQL that does not yet support window functions, use a subquery instead:
counts = db.query(func.count().label('count')).\
    filter(models.Appointment.place == place).\
    filter(models.Appointment.date == date).\
    filter(models.Appointment.timestamp.in_(timestamps)).\
    group_by(models.Appointment.timestamp).\
    subquery()
return db.query(func.max(counts.c.count)).scalar()
The difference between these and the original approach is that both make only a single trip to the database. That is usually desirable, but may require thinking a bit differently about the problem, due to SQL being a (more or less) declarative language – you mostly describe the answer you want, not how you want it✝.
✝ "I want coffee" vs. "Start by pouring some water in the..."
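The subquery variant can be tried end-to-end against an in-memory SQLite database; the model and data below are made up for illustration:

```python
from sqlalchemy import create_engine, Column, Integer, String, func
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Appointment(Base):
    __tablename__ = 'appointments'
    id = Column(Integer, primary_key=True)
    place = Column(String)
    date = Column(String)
    timestamp = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
db = Session(engine)
# 't1' occurs three times, 't2' once
db.add_all([Appointment(place='A', date='d1', timestamp=t)
            for t in ['t1', 't1', 't1', 't2']])
db.commit()

# one trip: count per timestamp in a subquery, then take the max
counts = db.query(func.count().label('count')).\
    filter(Appointment.place == 'A').\
    filter(Appointment.date == 'd1').\
    filter(Appointment.timestamp.in_(['t1', 't2'])).\
    group_by(Appointment.timestamp).\
    subquery()
max_count = db.query(func.max(counts.c.count)).scalar()
print(max_count)
```

This assumes SQLAlchemy 1.4+ (for `declarative_base` living in `sqlalchemy.orm`).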

Fixing a meeting room function schedule with double and triple bookings to determine space usage

I need to calculate the total amount of time each group uses a meeting space, but the data set has double and triple bookings, so I think I need to fix the data first. Disclosure: my coding experience consists solely of working through a few Dataquest courses, and this is my first Stack Overflow posting, so I apologize for errors and transgressions.
Each line of the data set contains the group ID and a start and end time. It also includes the booking type, i.e. reserved, meeting, etc. Generally, the staff reserve a space for the entire period, which creates a single line, and then add multiple lines for each individual function when the details are known. They should segment the original reserved line so it only holds space in between functions, but instead they double-book the space, so I need to add multiple lines for these interim RES holds, based on the actual holds.
Here's what the data basically looks like:
Existing data:
functions = [['Function', 'Group', 'FunctionType', 'StartTime', 'EndTime'],
             [1, 1, 'RES', '2019/10/04 07:00', '2019/10/06 17:00'],
             [2, 1, 'MTG', '2019/10/05 09:00', '2019/10/05 12:00'],
             [3, 1, 'LUN', '2019/10/05 12:30', '2019/10/05 13:30'],
             [4, 1, 'MTG', '2019/10/05 14:00', '2019/10/05 17:00'],
             [5, 1, 'MTG', '2019/10/06 09:00', '2019/10/06 12:00']]
I've tried to iterate using a for loop:
for index, row in enumerate(functions):
    last_row_index = len(functions) - 1
    if index == last_row_index:
        pass
    else:
        current_index = index
        next_index = index + 1
        if row[3] <= functions[next_index][2]:
            next
        elif row[4] == 'RES' or row[6] < functions[next_index][6]:
            copied_current_row = row.copy()
            row[3] = functions[next_index][2]
            copied_current_row[2] = functions[next_index][3]
            functions.append(copied_current_row)
There seems to be a logical problem in here, because that last append line seems to put the program into some kind of loop and I have to manually interrupt it. So I'm sure it's obvious to someone experienced, but I'm pretty new.
The reason I've done the comparison to see if a function is RES is that reserved should be subordinate to actual functions. But sometimes there are overlaps between actual functions, so I'll need to create another comparison to decide which one takes precedence, but this is where I'm starting.
How I (think) I want it to end up:
[['Function', 'Group', 'FunctionType', 'StartTime', 'EndTime'],
 [1, 1, 'RES', '2019/10/04 07:00', '2019/10/05 09:00'],
 [2, 1, 'MTG', '2019/10/05 09:00', '2019/10/05 12:00'],
 [1, 1, 'RES', '2019/10/05 12:00', '2019/10/05 12:30'],
 [3, 1, 'LUN', '2019/10/05 12:30', '2019/10/05 13:30'],
 [1, 1, 'RES', '2019/10/05 13:30', '2019/10/05 14:00'],
 [4, 1, 'MTG', '2019/10/05 14:00', '2019/10/05 17:00'],
 [1, 1, 'RES', '2019/10/05 17:00', '2019/10/06 09:00'],
 [5, 1, 'MTG', '2019/10/06 09:00', '2019/10/06 12:00'],
 [1, 1, 'RES', '2019/10/06 12:00', '2019/10/06 17:00']]
This way, I could do a simple calculation of elapsed time for each function line and add it up to see how much time they had the space booked for.
What I'm looking for here is just some direction I should pursue, and I'm definitely not expecting anyone to do the work for me. For example, am I on the right path here, or would it be better to use pandas and vectorized functions? If I can get the basic direction right, I think I can muddle through the specifics.
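One possible direction (a plain-Python sketch with a hypothetical helper name): sort the actual functions, then walk through them while tracking a cursor, emitting the leftover RES segments between bookings. Overlapping functions are absorbed by the `max` on the cursor:

```python
def split_reservation(res_start, res_end, bookings):
    """Return the RES segments left after subtracting the booked
    (start, end) intervals from the reservation window."""
    segments = []
    cursor = res_start
    for start, end in sorted(bookings):
        if start > cursor:            # gap before this booking -> RES segment
            segments.append((cursor, start))
        cursor = max(cursor, end)     # skip past the booking (handles overlaps)
    if cursor < res_end:              # trailing RES segment after last booking
        segments.append((cursor, res_end))
    return segments

# Times as zero-padded strings sort and compare correctly
bookings = [('2019/10/05 09:00', '2019/10/05 12:00'),
            ('2019/10/05 12:30', '2019/10/05 13:30')]
segments = split_reservation('2019/10/04 07:00', '2019/10/05 17:00', bookings)
print(segments)
```

The same walk works on datetime objects; once every line is a non-overlapping segment, summing end minus start per group gives the usage totals.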
Thank-you very much,
AF

SQLalchemy performance when iterating queries millions of time

I'm writing a disease simulation in Python, using SQLalchemy, but I'm hitting some performance issues when running queries on a SQLite file I create earlier in the simulation.
The code is below. There are more queries in the outer for loop, but what I've posted is what slowed it down to a crawl. There are 365 days, about 76,200 mosquitos, and each mosquito makes 5 contacts per day, bringing it to about 381,000 queries per simulated day, and 27,813,000 through the entire simulation (and that's just for the mosquitos). It goes along at about 2 days / hour which, if I'm calculating correctly, is about 212 queries per second.
Do you see any issues that could be fixed that could speed things up? I've experimented with indexing the fields which are used in selection but that didn't seem to change anything. If you need to see the full code, it's available here on GitHub. The function begins on line 399.
Thanks so much, in advance.
Run mosquito-human interactions:
for d in range(days_to_run):
    # ... much more code before this, but it ran reasonably fast
    vectors = session.query(Vectors).yield_per(1000)  # grab each vector..
    for m in vectors:
        i = 0
        while i < biting_rate:
            pid = random.randint(1, number_humans)  # pick a human to bite
            # select the randomly-chosen human from the SQLite table
            contact = session.query(Humans).filter(Humans.id == pid).first()
            if contact:  # if the random id equals an ID in the table
                # if the human is susceptible and the mosquito is infected, infect the human
                if contact.susceptible == 'True' and m.infected == 'True' and random.uniform(0, 1) < beta:
                    contact.susceptible = 'False'
                    contact.exposed = 'True'
                # otherwise, if the mosquito is susceptible and the human is infected, infect the mosquito
                elif contact.infected == 'True' and m.susceptible == 'True':
                    m.susceptible = 'False'
                    m.infected = 'True'
                    nInfectedVectors += 1
                    nSuscVectors += 1
            i += 1
    session.commit()
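A common fix for this pattern, assuming the humans fit in memory, is to query them once and index them in a dict keyed by id, so each bite becomes an in-memory lookup instead of a database round trip. The Human class below is a made-up stand-in for the ORM rows:

```python
import random

# Hypothetical stand-in for the Humans rows; in the real code these would be
# the ORM objects loaded once with session.query(Humans).all()
class Human:
    def __init__(self, id):
        self.id = id
        self.susceptible = 'True'
        self.exposed = 'False'

humans_by_id = {h.id: h for h in (Human(i) for i in range(1, 1001))}

random.seed(0)
for _ in range(5000):                      # simulated bites
    contact = humans_by_id.get(random.randint(1, 1000))
    if contact is not None and contact.susceptible == 'True':
        contact.susceptible = 'False'      # O(1) in-memory update
        contact.exposed = 'True'
# a single bulk write-back (session.commit()) would follow, once per day

exposed_count = sum(h.exposed == 'True' for h in humans_by_id.values())
print(exposed_count)
```

That turns ~381,000 queries per simulated day into one read and one commit; the random-bite logic is untouched.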

Testing Equality of boto Price object

I am using the Python package boto to connect Python to MTurk. I need to award bonus payments, which are of the Price type, and I want to test whether a Price object equals a certain value. Specifically, when I award bonus payments, I need to check that the bonus is not 0 (because a bonus payment in MTurk must be positive). But when I go to check values, I can't do this. For example,
from boto.mturk.connection import MTurkConnection
from boto.mturk.price import Price
a = Price(0)
a == 0
a == Price(0)
a == Price(0.0)
a > Price(0)
a < Price(0)
c = Price(.05)
c < Price(0)
c < Price(0.0)
These yield unexpected answers.
I am not sure of how to test if a has a Price equal to 0. Any suggestions?
I think you'll want the Price.amount attribute to compare these values. Otherwise, Python falls back to comparing the objects themselves, or some other goofiness. It would be smart for the library to override the standard equality test to make this more developer-friendly.
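A sketch of that suggestion; the Price class here is a minimal stand-in (it stores an amount and, like the boto class, does not define __eq__), since the comparison behaviour is the point, not the MTurk connection:

```python
# Minimal stand-in for boto.mturk.price.Price: without __eq__, two
# equal-valued instances compare by identity and are thus unequal
class Price:
    def __init__(self, amount):
        self.amount = amount

a = Price(0)
c = Price(.05)

# Comparing objects directly gives the surprising results from the question
assert a != Price(0)

# Comparing the amount attribute behaves as expected
assert a.amount == 0
assert c.amount > 0
```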