Creating new column by looping through another column...not working - python

I'm trying to create a new "category" based on the value in another column (looping through that column). Here is my code.
def minority_category(minority_col):
    for i in minority_col:
        if i <= 25:
            return '<= 25%'
        if i > 25 and i <= 50:
            return '<= 50%'
        if i > 50 and i <= 75:
            return '<= 75%'
        if i > 75 and i <= 100:
            return '<= 100%'
        return 'Unknown'
However, the result is '<=75%' in the entire new column. Based on examples I've seen, my code looks right. Can anyone point out something wrong with the code? Thank you.
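A likely cause, assuming the function is called once with the whole column (the call itself is not shown in the question): the return inside the for loop exits on the very first element, so a single string comes back and is broadcast to every row. A minimal sketch of a per-value version follows; the sample values are made up for illustration.

```python
def minority_category(value):
    """Classify one percentage value; there is no loop inside the function."""
    if value <= 25:
        return '<= 25%'
    if value <= 50:
        return '<= 50%'
    if value <= 75:
        return '<= 75%'
    if value <= 100:
        return '<= 100%'
    # Anything above 100 (or NaN, for which every comparison is False)
    return 'Unknown'


# Plain-list demonstration with made-up values.
sample = [10, 30, 60, 90, 120]
categories = [minority_category(v) for v in sample]
print(categories)
```

With pandas, the same function could be applied element-wise, for example df['category'] = df['minority'].apply(minority_category), where both column names are hypothetical.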

Related

df.apply returning errors

I am trying to do a Vlookup / true test to populate a value based upon a range. I am using df.apply as a means to this end. The Width column is float64.
Definition:
def f(x):
    if ['Width'] < 66: return 'DblStk'
    elif ['Width'] >= 66 & ['Width'] <= 77: return 'DblStkHC'
    elif ['Width'] >= 77 & ['Width'] <= 92: return 'RbBase'
    elif ['Width'] >= 92 & ['Width'] <= 94: return 'RBBildge'
    elif ['Width'] >= 94: return 'StdOnly'
    else: return 0
df_filtered['RollCat'] = df_filtered.apply(f,axis=1)
I am receiving a type error:
TypeError                                 Traceback (most recent call last)
Input In [53], in <cell line: 1>()
----> 1 df_filtered['RollCat'] = df_filtered.apply(f,axis=1)
File ~\Anaconda3\lib\site-packages\pandas\core\frame.py:8839, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   8828 from pandas.core.apply import frame_apply
   8830 op = frame_apply(
   8831     self,
   8832     func=func,
   (...)
   8837     kwargs=kwargs,
   8838 )
-> 8839 return op.apply().finalize(self, method="apply")
Appreciate any guidance or help.
When you use df.apply() with the argument axis=1, the function you pass is applied to each row of the dataframe. That means that within the function, x is a pandas Series object holding one row.
In your function f(x), you are comparing a list (['Width']) to a number, which doesn't make sense. If there is a column of your dataframe called "Width" containing float values, then you should be able to do:
def f(x):
    # Now we are inside the function so 'x' is a Pandas Series object
    width = x['Width']  # Now, the variable called 'width' is a float
    if width < 66:
        ...
Note that we have to actually retrieve the float value of the "Width" column for the row of interest. If you simply write ['Width'], that is just a plain Python list containing one string, not a float, and therefore can't be compared to a numeric value.
Also, in general, use and instead of & to join Python conditions. & is the bitwise operator and binds more tightly than comparisons, so x >= 66 & x <= 77 is parsed as x >= (66 & x) <= 77. And when you provide a traceback, make sure you include the full error message. My guess is that yours says something like TypeError: '<' not supported between instances of 'list' and 'int', but it is always helpful to include this rather than just a traceback that points to where the error occurs without saying what it is.
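Putting both points together, a complete row-wise version might look like the sketch below. The name roll_category is made up for illustration, and the boundaries follow the question's elif order, where the first matching branch wins.

```python
def roll_category(row):
    """Map one row (anything indexable by 'Width') to a roll category."""
    width = row['Width']  # a plain float, not the list ['Width']
    if width < 66:
        return 'DblStk'
    elif width <= 77:
        return 'DblStkHC'
    elif width <= 92:
        return 'RbBase'
    elif width <= 94:
        return 'RBBildge'
    elif width > 94:
        return 'StdOnly'
    else:
        # Only reached when width is NaN, since every comparison is then False
        return 0


# A dict stands in for a DataFrame row here; the widths are made up.
print(roll_category({'Width': 70.0}))
print(roll_category({'Width': 95.5}))
```

On the dataframe itself this would be used as df_filtered['RollCat'] = df_filtered.apply(roll_category, axis=1).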
The argument passed to the function is x, so if you apply it to the Width column directly, the function should be:
def f(x):
    if x < 66: return 'DblStk'
    elif x >= 66 and x <= 77: return 'DblStkHC'
    elif x >= 77 and x <= 92: return 'RbBase'
    elif x >= 92 and x <= 94: return 'RBBildge'
    elif x >= 94: return 'StdOnly'
    else: return 0
and you can apply it to the column like this:
df_filtered['RollCat'] = df_filtered['Width'].apply(f)

Added second value to return and I don't know what's missing in the function's parameter to pass?

I need help with this assignment. I'm a Python newbie, there are 30 minutes until the deadline, and I can't figure out what's missing. My teacher said it won't work like this.
updated,
if 0 < amount <= 20:
    return 0.005, 0
if 20 < amount <= 50:
    return 0.893, 0
elif 50 <= amount <= 100:
    return 1.000, 0
elif 100 <= amount <= 200:
    return 1.900, 0
elif 200 <= amount <= 300:
    return 2.800, 0
elif 300 <= amount <= 500:
    return 3.500, 0
elif 500 <= amount <= 700:
    return 4.000, 0
elif 700 <= amount <= 1000:
    return 4.750, 0
else:
    return 5.500, 0
I think your problem is here:
def has_enough_balance(balance, amount, transaction_type):
    return balance >= calculate_fee(amount, transaction_type) + amount
calculate_fee() returns a tuple, and in your has_enough_balance() function you add amount to it. You can't add a number and a tuple.
Index the result of calculate_fee(amount, transaction_type) to choose which value you want to use there, and your code should be fine.
"depends on how and where the function should be used later, it's an abstract function"
You could edit your has_enough_balance() function to:
def has_enough_balance(balance, amount, transaction_type, fee_type):
    index_dict = {"sender": 0, "receiver": 1}
    return balance >= (calculate_fee(amount, transaction_type)[index_dict[fee_type]]) + amount
Basically, since calculate_fee() returns a tuple, you only want to use one of those values at a time.
You can follow the call with [index of the value you want] to grab the correct value. So I added fee_type as an argument to has_enough_balance(), plus a dictionary inside it that lets you say whether you want the sender fee or the receiver fee. The dictionary maps that choice to the corresponding index, and [index_dict[fee_type]] picks the correct element out of the tuple returned by calculate_fee().
Your calculate_fee method returns two values.
In Python, returning multiple values produces a tuple (for example, (0, 0)).
But in this block of code:
# Checking if client has enough balance before request is sent
def has_enough_balance(balance, amount, transaction_type):
    return balance >= calculate_fee(amount, transaction_type) + amount
you try to add the tuple to a single value, amount, and then compare the result to a single value, balance. These operations are not supported between a tuple and an int.
From your comment I saw that calculate_fee returns the fee for the sender and the fee for the receiver.
You could edit your has_enough_balance function to take a boolean parameter forSender that says whether you're calculating the amount for the sender or for the receiver.
You could then use the value of this parameter to decide which of the two fees to add in your calculation.
# Checking if client has enough balance before request is sent
def has_enough_balance(balance, amount, transaction_type, forSender):
    senderFee, receiverFee = calculate_fee(amount, transaction_type)
    fee = senderFee if forSender else receiverFee
    return balance >= fee + amount
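To see the tuple unpacking run end to end, here is a self-contained sketch. The calculate_fee below is a hypothetical stand-in with placeholder rates, not the assignment's real fee table, and the snake_case names are my own choice.

```python
def calculate_fee(amount, transaction_type):
    """Hypothetical stand-in that returns (sender_fee, receiver_fee)."""
    # Placeholder rates, NOT the fee table from the assignment.
    return (1.0, 0.0) if amount <= 100 else (2.0, 0.0)


def has_enough_balance(balance, amount, transaction_type, for_sender=True):
    # Unpack the tuple so each fee is a plain number again
    sender_fee, receiver_fee = calculate_fee(amount, transaction_type)
    fee = sender_fee if for_sender else receiver_fee
    return balance >= fee + amount


print(has_enough_balance(101.0, 100, "transfer"))                    # sender pays 1.0
print(has_enough_balance(100.5, 100, "transfer"))                    # 0.5 short
print(has_enough_balance(100.5, 100, "transfer", for_sender=False))  # receiver pays 0.0
```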

Iterate through Python for loop more quickly

I have a Pandas data frame (called "ud_flex" below) that looks like the one below:
The data frame has over 27 million observations in it that I'm trying to iterate through to do a calculation for each row. Below is the calculation that I'm using:
def set_fpts(pos, rank, curr_fpts):
    if pos == "RB" and rank >= 3.0:
        return 0
    elif pos == "WR" and rank >= 4.0:
        return 0
    elif (pos == "TE" or pos == "QB") and rank >= 2.0:
        return 0
    else:
        return curr_fpts
Here is the for loop that I've created:
players = ud_flex.shape[0]
for i in range(0,players):
    new_fpts = set_fpts(ud_flex.iloc[i]['position_name'], ud_flex.iloc[i]['wk_rank_orig'], ud_flex.iloc[i]['fpts'])
    ud_flex.at[i, 'fpts_orig'] = new_fpts
Does anyone have any suggestions for how to speed up this loop? It's currently taking nearly an hour! Thanks!
You could start by restructuring the function so that it exits earlier (this assumes the only positions are RB, WR, TE and QB):
def set_fpts(pos, rank, curr_fpts):
    if rank >= 4:
        return 0
    if rank < 2:
        return curr_fpts
    if pos in ["TE", "QB"]:
        return 0
    if rank >= 3:
        if pos == "RB":
            return 0
    return curr_fpts
In general, iterating through pandas data frames is slow, so it's not surprising that your for-loop-based approach is taking a while.
I suspect that the following vectorized alternative should work more quickly for a data frame of your size.
mask = (((ud_flex['position_name']=="RB") & (ud_flex['wk_rank_orig']>=3))
        |((ud_flex['position_name']=="WR") & (ud_flex['wk_rank_orig']>=4))
        |((ud_flex['position_name'].isin(["TE","QB"])) & (ud_flex['wk_rank_orig']>=2)))
ud_flex['fpts_orig'] = ud_flex['fpts']
ud_flex.loc[mask, 'fpts_orig'] = 0
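As a sanity check, the mask approach can be tried on a tiny frame first. The sketch below uses made-up rows and numpy.where, which is one of several equivalent ways to write the final assignment.

```python
import numpy as np
import pandas as pd

# Made-up sample rows using the column names from the question
ud_flex = pd.DataFrame({
    "position_name": ["RB", "RB", "WR", "WR", "TE", "QB"],
    "wk_rank_orig": [1.0, 3.0, 3.0, 4.0, 2.0, 1.0],
    "fpts": [10.0, 8.0, 7.0, 6.0, 5.0, 20.0],
})

pos = ud_flex["position_name"]
rank = ud_flex["wk_rank_orig"]

# True wherever the original set_fpts would have returned 0
mask = (
    ((pos == "RB") & (rank >= 3))
    | ((pos == "WR") & (rank >= 4))
    | (pos.isin(["TE", "QB"]) & (rank >= 2))
)

# Zero where the mask is True, otherwise keep the current points
ud_flex["fpts_orig"] = np.where(mask, 0, ud_flex["fpts"])
print(ud_flex)
```

Each comparison runs once over the whole column instead of once per row, which is where the speedup over the explicit loop comes from.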

How to bin a column and keep null values in a separate group

I have a column containing a continuous variable and want to bin it for plotting. However, this column also contains null values.
I used the following code to bin it:
def a(b):
    if b<= 20: return "<= 20"
    elif b<= 40: return "20 < <= 40"
    elif b<= 45: return "40 < <= 45"
    else: return "> 45"
audf = udf(a, StringType())
data= data.withColumn("a_bucket", audf("b"))
I am running Python 3 and it throws the following error:
TypeError: '<=' not supported between instances of 'NoneType' and 'int'
I looked up some documentation saying Python 3 won't allow comparisons between numbers and null values. But is there a way to put those null values into a separate group so I don't throw away data? They are not bad data.
Thanks.
You can do this without a udf. Here is one way to rewrite your code, and have a special bucket for null values:
from pyspark.sql.functions import col, when
def a(b):
    return when(b.isNull(), "Other")\
        .when(b <= 20, "<= 20")\
        .when(b <= 40, "20 < <= 40")\
        .when(b <= 45, "40 < <= 45")\
        .otherwise("> 45")
data = data.withColumn("a_bucket", a(col("b")))
However, a more general solution would allow you to pass in a list of buckets and dynamically return the bin output (untested):
from functools import reduce
from pyspark.sql.functions import col, when
def aNew(b, buckets):
    """assumes buckets are sorted"""
    if not buckets:
        raise ValueError("buckets can not be empty")
    return reduce(
        lambda w, i: w.when(
            b.between(buckets[i-1], buckets[i]),
            "{low} < <= {high}".format(low=buckets[i-1], high=buckets[i])
        ),
        range(1, len(buckets)),
        when(
            b.isNull(),
            "Other"
        ).when(
            b <= buckets[0],
            "<= {first}".format(first=buckets[0])
        )
    ).otherwise("> {last}".format(last=buckets[-1]))
data = data.withColumn("a_bucket", aNew(col("b"), buckets=[20, 40, 45]))
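If you would rather keep the udf from the question, a smaller change is to check for None before any comparison, since Spark passes nulls to a Python udf as None. A minimal sketch of just the Python function, reusing the "Other" label from above:

```python
def a(b):
    """Bucket one value; None (a Spark null) gets its own group."""
    if b is None:
        return "Other"
    if b <= 20:
        return "<= 20"
    elif b <= 40:
        return "20 < <= 40"
    elif b <= 45:
        return "40 < <= 45"
    else:
        return "> 45"


# Plain-Python check with made-up values, no Spark session needed
print([a(v) for v in [None, 5, 20, 33, 45, 80]])
```

It would then be wrapped exactly as in the question, with audf = udf(a, StringType()). The when/otherwise version avoids the Python udf overhead, so it is usually the better choice on large data.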

How to group a list of numbers into certain categories

I am trying to figure out how to take in a list of numbers and sort them into categories such as 0-10, 10-20, 20-30 and so on up to 90-100. I have the code started, but it isn't reading in all the inputs; it only reads the last one and keeps repeating it. I am stumped. Can anyone help, please?
def eScores(Scores):
    count0 = 0
    count10 = 0
    count20 = 0
    count30 = 0
    count40 = 0
    count50 = 0
    count60 = 0
    count70 = 0
    count80 = 0
    count90 = 0
    if Scores > 90:
        count90 = count90 + 1
    if Scores > 80:
        count80 = count80 + 1
    if Scores > 70:
        count70 = count70 + 1
    if Scores > 60:
        count60 = count60 + 1
    if Scores > 50:
        count50 = count50 + 1
    if Scores > 40:
        count40 = count40 + 1
    if Scores > 30:
        count30 = count30 + 1
    if Scores > 20:
        count20 = count20 + 1
    if Scores > 10:
        count10 = count10 + 1
    if Scores <= 10:
        count0 = count0 + 1
    print count90,'had a score of (90 - 100]'
    print count80,'had a score of (80 - 90]'
    print count70,'had a score of (70 - 80]'
    print count60,'had a score of (60 - 70]'
    print count50,'had a score of (50 - 60]'
    print count40,'had a score of (40 - 50]'
    print count30,'had a score of (30 - 40]'
    print count20,'had a score of (20 - 30]'
    print count10,'had a score of (10 - 20]'
    print count0,'had a score of (0 - 10]'
    return eScores(Scores)
Each time eScores is called, it sets all the counters (count10, count20, ...) back to zero, so only the final call has any effect.
You should either declare the counters as global variables, or put the function into a class and make the counters member variables of the class.
Another problem is that the function calls itself in the return statement:
    return eScores(Scores)
Since this function is (as I understand it) supposed to update the counter variables only, it does not need to return anything, let alone call itself recursively. You should remove the return statement.
One mistake is that you're not breaking out of the chain of ifs once a condition matches. For example, if your number is 93, it sets count90 to 1, then goes on to count80 and sets that to 1 as well, and so on down to count10. Using elif for every branch after the first avoids this.
Your code is repeating because the function is infinitely recursive (it has no stop condition). Here are the relevant bits:
def eScores(Scores):
    # ...
    return eScores(Scores)
I think what you'd want is more like:
def eScores(Scores):
    # same as before, but change the last line:
    return
Since you're printing the results, I assume you don't want to return the values of count10, count20, etc.
Also, the function won't accumulate results, since you're creating new local counters each time the function is called.
Why don't you just use each number as a key (after processing) and return a dictionary of values?
def eScores(Scores):
    return_dict = {}
    for score in Scores:
        keyval = int(score/10)*10 # py3k automatically does float division
        if keyval not in return_dict:
            return_dict[keyval] = 1
        else:
            return_dict[keyval] += 1
    return return_dict
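A variant of the same idea, sketched with collections.Counter, keeps the question's half-open (low, high] intervals, so a score of 90 lands in (80 - 90] rather than in the 90 bucket. The name e_scores and the sample scores are made up for illustration.

```python
import math
from collections import Counter


def e_scores(scores):
    """Count scores per (low, high] bucket of width 10, keyed by low."""
    counts = Counter()
    for score in scores:
        # ceil(score / 10) - 1 maps 10 -> 0, 11 -> 1, 90 -> 8, 93 -> 9;
        # max(..., 0) folds a score of exactly 0 into the first bucket
        low = max(math.ceil(score / 10) - 1, 0) * 10
        counts[low] += 1
    return counts


result = e_scores([93, 90, 10, 0, 100, 55])
for low in sorted(result, reverse=True):
    print(result[low], 'had a score of (%d - %d]' % (low, low + 10))
```

Because the whole list is processed in one call, the counters no longer reset between scores.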
