I am using the Python package boto to connect Python to MTurk. I need to award bonus payments, which are of the Price type, and I want to test whether a Price object equals a certain value. Specifically, before awarding a bonus I need to check that the bonus amount is not 0 (MTurk requires bonus payments to be positive). But I can't get these comparisons to work. For example:
from boto.mturk.connection import MTurkConnection
from boto.mturk.price import Price
a = Price(0)
a == 0
a == Price(0)
a == Price(0.0)
a > Price(0)
a < Price(0)
c = Price(.05)
c < Price(0)
c < Price(0.0)
These yield unexpected answers.
I am not sure how to test whether a is a Price equal to 0. Any suggestions?
I think you'll want to compare the Price.amount attribute rather than the Price objects themselves. Otherwise Python falls back to comparing the objects, which doesn't do what you want here. It would be smart for the library to override the standard equality test to make this more developer-friendly.
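For example, something along these lines should work (a small sketch assuming Price keeps its value in the amount attribute as a plain number, which is what boto stores; the grant_bonus call is only indicated as a placeholder):
from boto.mturk.price import Price

a = Price(0)
c = Price(.05)

# Compare the numeric amounts rather than the Price objects themselves
print(a.amount == 0)                    # True
print(c.amount > 0)                     # True
print(c.amount == Price(0.05).amount)   # True

# Only award a bonus when the amount is positive
if c.amount > 0:
    pass  # e.g. call the connection's grant_bonus(...) here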
I want to create a new variable that incorporates all drugs DS59 - DS71 (values currently coded: 1 = never used, 2 = rarely use, 3 = occasionally use, and 4 = regularly use). I want one of three classes to be assigned to each subject, as laid out below:
no user: no use on any of the drugs (all 1's)
experimenter/light user: low overall score on drug use across all classes (total summed score less than 20) and no "regularly use (4)" answers to any drug classes
regular user: high overall score on drug use across all classes (total summed score above 20) and at least one "occasionally use (3)" or "regularly use (4)" answer to any drug class
This is my current code - I am unsure how to most appropriately write the conditionals.
druglist = [(df['DS59']), (df['DS60']), (df['DS61']), (df['DS62']), (df['DS63']),
            (df['DS64']), (df['DS65']), (df['DS66']), (df['DS67']), (df['DS68']),
            (df['DS69']), (df['DS70']), (df['DS71'])]
conditions = [
    (druglist == ),
    (druglist == ),
    (druglist == ),
]
values = ['no user', 'experimenter/light user', 'regular user']
df['drugs'] = np.select(conditions, values)
Thank you so much for any help/advice.
If I understood correctly, this should be what you're looking for. Let me know if not:
drug_sum = sum(druglist)                                    # element-wise across the Series: total score per subject
any_regular = sum(s.eq(4) for s in druglist) > 0            # at least one "regularly use" answer
any_occasional = sum(s.isin([3, 4]) for s in druglist) > 0  # at least one "occasionally" or "regularly" answer
conditions = [
    (drug_sum == len(druglist)),      # if the total equals the number of drugs, every answer is 1
    (drug_sum <= 20) & ~any_regular,
    (drug_sum > 20) & any_occasional,
]
One thing I'm not sure about: don't these conditions leave some cases that fit none of the options? For example, a person who answers 1 for everything except one drug, for which they answer 4.
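If that turns out to be a problem, np.select accepts a default that is used for rows matching none of the conditions, so those cases can at least be flagged (the 'unclassified' label is just an example):
df['drugs'] = np.select(conditions, values, default='unclassified')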
I've looked through much of the documentation but can't figure this one out. I'm using or-tools in Python for constraint programming. I'm trying to find a way to set the value of a BoolVar to 1 if the sum of some IntVars is greater than 0, and to 0 otherwise.
I've tried AddImplication() and OnlyEnforceIf(), which looked promising, but neither worked. I tried a few other ideas too, but mostly out of desperation.
The piece of code looks like this:
for warehouse in WAREHOUSES:
    model.Add(y[warehouse] == 1).OnlyEnforceIf(sum(x[(warehouse, customer)] for customer in CUSTOMERS) > 0)
This currently returns:
AttributeError: 'BoundedLinearExpression' object has no attribute 'Index'
I guess the error is because I need to pass a boolean to OnlyEnforceIf instead of an expression. I've tried doing the calculation in a separate function and passing only the return value (True/False). The program runs, but it sets all BoolVars to True, which is not very helpful.
I've already solved this well-known exercise with linear programming, so I know some of the BoolVars should be false.
# x is a dict with (warehouse, customer) tuples as keys and an IntVar as each value
# y is a dict with warehouses as keys and a BoolVar as each value
Do it the other way around:
for warehouse in WAREHOUSES:
    model.Add(sum(x[(warehouse, customer)] for customer in CUSTOMERS) > 0).OnlyEnforceIf(y[warehouse])
    model.Add(sum(x[(warehouse, customer)] for customer in CUSTOMERS) == 0).OnlyEnforceIf(y[warehouse].Not())
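For reference, here is a minimal self-contained example of that channeling pattern (the warehouse/customer names and variable bounds are made up for illustration):
from ortools.sat.python import cp_model

model = cp_model.CpModel()
WAREHOUSES = ['w1', 'w2']
CUSTOMERS = ['c1', 'c2']

# x[(w, c)]: amount shipped from warehouse w to customer c
x = {(w, c): model.NewIntVar(0, 10, 'x_%s_%s' % (w, c)) for w in WAREHOUSES for c in CUSTOMERS}
# y[w]: 1 if warehouse w ships anything at all
y = {w: model.NewBoolVar('y_%s' % w) for w in WAREHOUSES}

for w in WAREHOUSES:
    total = sum(x[(w, c)] for c in CUSTOMERS)
    model.Add(total > 0).OnlyEnforceIf(y[w])
    model.Add(total == 0).OnlyEnforceIf(y[w].Not())

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for w in WAREHOUSES:
        print(w, solver.Value(y[w]))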
I have data recorded for several timestamps ... I want to get the maximum count across all timestamps.
This is my code:
for timestamp in timestamps:
    count = db.query(models.Appointment.id).filter(models.Appointment.place == place) \
        .filter(models.Appointment.date == date) \
        .filter(models.Appointment.timestamp == timestamp).count()
    data.append(count)
return max(data)
Sadly, this takes roughly 1.5 seconds per timestamp to calculate the requested value.
Is there a query that can handle this in around 3-10 seconds?
Regards,
Martin
If using MySQL 8 or later, you could give the following a go:
return db.query(func.max(func.count()).over()).\
    filter(models.Appointment.place == place).\
    filter(models.Appointment.date == date).\
    filter(models.Appointment.timestamp.in_(timestamps)).\
    group_by(models.Appointment.timestamp).\
    limit(1).\
    scalar()
This uses the (slightly non-obvious) fact that window functions are evaluated after the group rows are formed, and that without a partition or ordering the window spans all of the group rows.
If using a version of MySQL that does not yet support window functions, use a subquery instead:
counts = db.query(func.count().label('count')).\
    filter(models.Appointment.place == place).\
    filter(models.Appointment.date == date).\
    filter(models.Appointment.timestamp.in_(timestamps)).\
    group_by(models.Appointment.timestamp).\
    subquery()
return db.query(func.max(counts.c.count)).scalar()
The difference between these and the original approach is that both make only a single trip to the database. That is usually desirable, but it may require thinking a bit differently about the problem, since SQL is a (more or less) declarative language: you mostly describe the answer you want, not how you want it¹.
¹ "I want coffee" vs. "Start by pouring some water in the..."
Let's say I work for a company that hands out different types of loans. We get our loan information from a big data mart, from which I need to calculate some additional things to determine whether someone is in arrears, etc. Right now, for clarity's sake, I have done this with a rather dumb function that iterates over all rows (each of which stores all information about a loan) using pd.DataFrame.apply(myFunc, axis=1), which is of course horribly slow.
Now that we are growing and I get more and more data to process, I am starting to get concerned about performance. Below is an example of a function that I call a lot and would like to optimize (some ideas that I have are further down). These functions are applied to a DataFrame which has (among others) the following fields:
Loan_Type : a field containing a string that determines the type of loan. We have many different names, but for this example it comes down to 4 types: Type 1 and Type 2, each either a staff loan or not.
Activity_Date : The date the activity on the loan was logged (it's a daily loan activity table, if that tells you anything)
Product_Account_Status : The status given by the table to these loans on the Activity_Date (are they active, or in some other status?). This needs to be recalculated because it is not always filled in in the table (don't ask why it is like this, it's a complete headache).
Activation_Date : The date the loan was activated
Sum_Paid_To_Date : The amount of money paid into the loan at the Activity_Date
Deposit_Amount : The deposit amount for the loan
Last_Paid_Date : The last date a payment was made into the loan.
So two example functions:
def productType(x):
    # Determines the type of the product, for later aggregation purposes, and to determine the amount to be payable per day
    if ('Loan Type 1' in x['Loan_Type']) & (not ('Staff' in x['Loan_Type'])):
        return 'Loan1'
    elif ('Loan Type 2' in x['Loan_Type']) & (not ('Staff' in x['Loan_Type'])):
        return 'Loan2'
    elif ('Loan Type 1' in x['Loan_Type']) & ('Staff' in x['Loan_Type']):
        return 'Loan1Staff'
    elif ('Loan Type 2' in x['Loan_Type']) & ('Staff' in x['Loan_Type']):
        return 'Loan2Staff'
    elif ('Mobile' in x['Loan_Type']) | ('MM' in x['Loan_Type']):
        return 'Other'
    else:
        raise ValueError(
            'A payment plan is not captured in the code, please check it!')
This function is then applied to the DataFrame AllLoans which contains all loans I want to analyze at that moment, by using:
AllLoans['productType'] = AllLoans.apply(lambda x: productType(x), axis = 1)
Then I want to apply some other functions; one example is given below. This function determines whether a loan is blocked, based on how long someone hasn't paid and on some other important statuses that are currently stored as strings in the loan table. Examples are whether people are cancelled (for being blocked for too long) or have some other status; we treat customers differently based on these tags.
def customerStatus(x):
    # Sets the customer status based on the column Product_Account_Status or
    # the days of inactivity
    if x['productType'] == 'Loan1':
        dailyAmount = 2
    elif x['productType'] == 'Loan2':
        dailyAmount = 2.5
    elif x['productType'] == 'Loan1Staff':
        dailyAmount = 1
    elif x['productType'] == 'Loan2Staff':
        dailyAmount = 1.5
    else:
        raise ValueError(
            'Daily amount to be paid could not be calculated, check if productType is defined.')

    if x['Product_Account_Status'] == 'Cancelled':
        return 'Cancelled'
    elif x['Product_Account_Status'] == 'Suspended':
        return 'Suspended'
    elif x['Product_Account_Status'] == 'Pending Deposit':
        return 'Pending Deposit'
    elif x['Product_Account_Status'] == 'Pending Allocation':
        return 'Pending Allocation'
    elif x['Outstanding_Balance'] == 0:
        return 'Finished Payment'
    # If this check is True, Last_Paid_Date is an actual date (not NaT), so as
    # far as I can see the customer has paid something beyond the deposit
    elif type(x['Date_Of_Activity'] - x['Last_Paid_Date']) != (pd.tslib.NaTType):
        if (((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) > 30) | ((((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) > 14) & ((x['Sum_Paid_To_Date'] - x['Deposit_Amount']) <= dailyAmount)):
            return 'Blocked'
        elif ((x['Date_Of_Activity'] - x['Last_Paid_Date']).days + 1) <= 30:
            return 'Active'
    # If this is True, the customer has not paid more than the deposit (an FPD),
    # so the age of the loan decides whether they are blocked or not
    elif type(x['Date_Of_Activity'] - x['Last_Paid_Date']) == (pd.tslib.NaTType):
        # The cutoff is 14 days here because of the FPD definition
        if ((x['Date_Of_Activity'] - x['Activation_Date']).days + 1) <= 14:
            return 'Active'
        elif ((x['Date_Of_Activity'] - x['Activation_Date']).days + 1) > 14:
            return 'Blocked'
    # If we have reached the end and still haven't found the status, it will
    # get the following catch-all status
    return 'Other Status'
This is again applied using AllLoans['customerStatus'] = AllLoans.apply(lambda x: customerStatus(x), axis = 1). As you can see there are many string comparisons and date comparisons, and it is not clear to me how to 'properly' vectorize these functions.
Apologies if this is Optimization 101, but I have tried to search for answers and strategies on how to do this and couldn't find really comprehensive answers. I was hoping to get some tips here; thanks in advance for your time.
Some thoughts on making this faster/getting towards a more vectorized approach:
Make the customerStatus function slightly more modular by splitting out a function that determines the daily amounts and stores them in the DataFrame for quicker access (I need to access them later anyway, and determine this variable in multiple functions).
Map the input column for the productType function to integers using some sort of dict, so that fewer string functions need to be called on it (but I feel like this won't be my biggest speed-up).
Some things that I would like to do but don't really know where to start on:
How to properly vectorize functions like these, which contain many if statements based on string/date comparisons (the business rules can be a bit complex) over different columns of the DataFrame. The code might become a bit more complex, but I need to apply these functions multiple times to slightly (but importantly) different DataFrames, and those are growing larger and larger, so the functions need to live in some sort of library for ease of access and need to be sped up because they simply take too much time. (A rough sketch of the kind of thing I imagine is below this list.)
I have tried to look at solutions like Numba or Cython, but I don't understand enough of the inner workings of C to use them properly (yet; I would like to learn). Any suggestions on how to improve performance would be greatly appreciated.
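For what it's worth, this is roughly the kind of vectorized rewrite of productType I have in mind, just as a sketch based on np.select and the pandas string methods (the 'Unknown' default replaces the ValueError, and I am not sure this is the right direction, especially for the date logic in customerStatus):
import numpy as np

loan_type = AllLoans['Loan_Type']
is_staff = loan_type.str.contains('Staff')

conditions = [
    loan_type.str.contains('Loan Type 1') & ~is_staff,
    loan_type.str.contains('Loan Type 2') & ~is_staff,
    loan_type.str.contains('Loan Type 1') & is_staff,
    loan_type.str.contains('Loan Type 2') & is_staff,
    loan_type.str.contains('Mobile') | loan_type.str.contains('MM'),
]
choices = ['Loan1', 'Loan2', 'Loan1Staff', 'Loan2Staff', 'Other']

# Rows that match none of the conditions get a sentinel value instead of raising
AllLoans['productType'] = np.select(conditions, choices, default='Unknown')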
Kind regards,
Tim.
I have data consisting of about 10,000 entries. Each row is a price for a product in a specific currency. For example:
- Purchase 1 = 10.25 USD
- Purchase 2 = 11.76 SEK
I have ten different database columns that total sales for each currency (this is a requirement): earnings_in_usd, earnings_in_sek, earnings_in_eur, etc. In the function that builds the insert statement for the database, I need to set the matching variable; all other entries default to 0.00. This is basically the code that would accomplish what I need:
if currency == 'USD':
    earnings_in_usd = value
elif currency == 'SEK':
    earnings_in_sek = value
elif ...
Is there a more straightforward way to do this (a way to do something like earnings_in_$ = value)?
Use a defaultdict indexed by the currency.
from collections import defaultdict
earnings = defaultdict(float) # float has a default value of 0.
Instead of your long if-then-else, use this single line:
earnings[currency] = value
and retrieve the earnings in, say, US$, with
earnings["USD"]
Perhaps use a dictionary?
earnings = {}
earnings[currency] = value
One way to do it, which may very well have someone confounded when it breaks, is to use a list comprehension:
earnings_in_usd, earnings_in_sek, ... = [(value if currency == c else 0) for c in CURRENCIES]
The drawback is that the left hand side would have to include all your variables, and CURRENCIES would have to be a list of string constants with exactly the same order as the variables on the left hand side. Like I said, this may very well break if you tamper with other parts of the program...
If earnings is a dict, then
earnings[currency] = value
or, if it is an object, you can set an attribute named after the currency with
setattr(earnings, currency, value)