df.apply returning errors - python

I am trying to do a Vlookup / true test to populate a value based upon a range. I am using df.apply as a means to this end. The Width column is float64.
Definition:
def f(x):
    if ['Width'] < 66: return 'DblStk'
    elif ['Width'] >= 66 & ['Width'] <= 77: return 'DblStkHC'
    elif ['Width'] >= 77 & ['Width'] <= 92: return 'RbBase'
    elif ['Width'] >= 92 & ['Width'] <= 94: return 'RBBildge'
    elif ['Width'] >= 94: return 'StdOnly'
    else: return 0
df_filtered['RollCat'] = df_filtered.apply(f,axis=1)
I am receiving a type error:
TypeError                                 Traceback (most recent call last)
Input In [53], in <cell line: 1>()
----> 1 df_filtered['RollCat'] = df_filtered.apply(f,axis=1)
File ~\Anaconda3\lib\site-packages\pandas\core\frame.py:8839, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   8828 from pandas.core.apply import frame_apply
   8830 op = frame_apply(
   8831     self,
   8832     func=func,
   (...)
   8837     kwargs=kwargs,
   8838 )
-> 8839 return op.apply().finalize(self, method="apply")
Appreciate any guidance or help.

When you use df.apply() with the argument axis=1, the function you pass is applied to each row of the dataframe. That means that within the function, the argument x is a pandas Series holding one row.
In your function f(x), you are comparing a list (['Width']) to a number, which doesn't make sense. If there is a column of your dataframe called "Width" containing float values, then you should be able to do:
def f(x):
    # Now we are inside the function so 'x' is a Pandas Series object
    width = x['Width']  # Now, the variable called 'width' is a float
    if width < 66:
        ...
Note that we have to actually retrieve the float value of the "Width" column for the row of interest. If you simply do ['Width'] then this is just a plain Python list and not a float, and therefore can't be compared to a numeric value.
Also, in general, join Python conditions with and instead of &. The & operator is a bitwise operator that binds more tightly than comparisons, so 66 & x is evaluated before the >= and <= tests. And when you provide a traceback, make sure you include the full error message. My guess is your error message says something like TypeError: '<' not supported between instances of 'list' and 'int', but it is always helpful to provide this information rather than just a traceback that points to where the error occurs but does not say what it is.
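Putting both points together, a minimal sketch of the full function might look like this. The sample values in df_filtered are made up for illustration; the thresholds and labels are the ones from the question, and because the branches are tested in order, a boundary value such as 77 takes the first branch that matches.

```python
import pandas as pd

def f(row):
    # row is one row of the frame (a pandas Series), so look up the value first
    width = row['Width']
    if width < 66:
        return 'DblStk'
    elif 66 <= width <= 77:
        return 'DblStkHC'
    elif 77 <= width <= 92:
        return 'RbBase'
    elif 92 <= width <= 94:
        return 'RBBildge'
    elif width >= 94:
        return 'StdOnly'
    else:
        return 0  # only reached when width is NaN

# hypothetical sample data, not from the question
df_filtered = pd.DataFrame({'Width': [60.0, 66.0, 77.0, 92.5, 100.0]})
df_filtered['RollCat'] = df_filtered.apply(f, axis=1)
print(df_filtered)
```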

If you apply the function to the Width column rather than to the whole dataframe, the argument x is the width value itself, so the function should be:
def f(x):
    if x < 66: return 'DblStk'
    elif x >= 66 and x <= 77: return 'DblStkHC'
    elif x >= 77 and x <= 92: return 'RbBase'
    elif x >= 92 and x <= 94: return 'RBBildge'
    elif x >= 94: return 'StdOnly'
    else: return 0
and you can apply it to the column like this:
df_filtered['RollCat'] = df_filtered['Width'].apply(f)

Related

Question about IF ELSE and mathematic in python

Data
Time,PM2.5,
1/1/2014,9
2/1/2014,10
import pandas as pd
df = pd.read_csv('xx.csv')
data = pd.DataFrame(df)
def calculation(y):
    if 0 < y and y < 12:
        bello=data.assign(API=(50/12)*y)
    elif 12.1 <= y and y <= 50.4:
        bello=data.assign(API=(((100-51)/(50.4-12.1))*(y-12.1))+51)
    elif 50.5 <= y and y <= 55.4:
        bello=data.assign(API=(((150-101)/(55.4-50.5))*(y-50.5))+101)
    elif 55.5 <= y and y <= 150.4:
        bello=data.assign(API=(((200-151)/(150.4-55.5))*(y-55.5))+151)
    elif 150.5 <= y and y <= 250.4:
        bello=data.assign(API=(((300-201)/(250.4-150.5))*(y-150.5))+201)
    elif 250.5 <= y and y <= 350.4:
        bello=data.assign(API=(((400-301)/(350.4-250.5))*(y-250.5))+301)
    else:
        bello=data.assign(API=(((500-401)/(500.4-350.5))*(y-350.5))+401)
    return bello
y=data['PM2.5']
print(calculation(y))
Hi everyone,
I want to convert the PM2.5 air quality data into an API value using the conditions and equations in the code above.
I received the error "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().".
I hope someone can tell me what the problem is.
Thanks in advance.
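The error occurs because y is the whole 'PM2.5' column, and a condition such as 0 < y and y < 12 asks for the truth value of an entire Series, which pandas refuses to guess. One option is to rewrite the function so that it handles a single value, and then apply it to each element with Series.apply: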
import pandas as pd
data = pd.DataFrame({'PM2.5': [15, 50, 1000]})
def calculation(y):
    if 0<y<12:
        return (50/12)*y
    elif 12.1 <= y <= 50.4:
        return (((100-51)/(50.4-12.1))*(y-12.1))+51
    ## ....
    else:
        return (((500-401)/(500.4-350.5))*(y-350.5))+401
y=data['PM2.5']
print(y.apply(calculation))
This stays close to your code; a vectorized solution would probably be faster.
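As one possible vectorized variant, here is a sketch using numpy.select, which is a different technique from the apply-based answer. The breakpoints and formulas are copied from the branches in the question; the sample values and the segments table are made up for illustration. Values that fall into the gaps between ranges (for example 12.05) go to the final fallback formula, exactly as they reach the else branch in the question.

```python
import numpy as np
import pandas as pd

# hypothetical sample data
data = pd.DataFrame({'PM2.5': [9.0, 10.0, 30.0, 1000.0]})
y = data['PM2.5'].to_numpy(dtype=float)

# (low, high, api_low, api_high) for each middle branch of the question
segments = [
    (12.1, 50.4, 51, 100),
    (50.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
]

# first branch: 0 < y < 12
conditions = [(y > 0) & (y < 12)]
choices = [(50 / 12) * y]

for low, high, api_low, api_high in segments:
    conditions.append((y >= low) & (y <= high))
    choices.append((api_high - api_low) / (high - low) * (y - low) + api_low)

# final else branch: an always-true condition placed last
conditions.append(np.full(len(y), True))
choices.append((500 - 401) / (500.4 - 350.5) * (y - 350.5) + 401)

# np.select picks, per element, the choice of the first condition that holds
data['API'] = np.select(conditions, choices)
print(data)
```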

How to apply math functions while filtering (using loc) in pandas

I have retrieved a value x = 10.64589195722904, which I need to match against my existing dataframe using loc. Because the subtraction can give a negative result, whose sign I must ignore, I am using math.fabs.
fdf = df.loc[(math.fabs(df['x'] - x) <= 0.01 & math.fabs(df['x'] - x) >= 0.001)]
But this is throwing error:
TypeError Traceback (most recent call last)
<ipython-input-256-b8a71a5bd17c> in <module>
10 # fdf = df.loc[math.fabs((df['x'] - k) <= 0.001) & (math.fabs(df['x'] - k) >= 0.0001) ]
11
---> 12 df.loc[(math.fabs(df['x'] - x) <= 0.01 & math.fabs(df['x'] - x) >= 0.001)]
13 fdf.head()
~\.conda\envs\pyenv\lib\site-packages\pandas\core\series.py in wrapper(self)
110 if len(self) == 1:
111 return converter(self.iloc[0])
--> 112 raise TypeError(f"cannot convert the series to {converter}")
113
114 wrapper.__name__ = f"__{converter.__name__}__"
TypeError: cannot convert the series to <class 'float'>
Use numpy.fabs to process the values in a vectorized way, and add parentheses around each mask, because & has higher precedence than the comparison operators:
s = np.fabs(df['x'] - x)
fdf = df[(s <= 0.01) & (s >= 0.001)]
An alternative is Series.between, with the lower bound first:
fdf = df[np.fabs(df['x'] - x).between(0.001, 0.01)]
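A small self-contained sketch on made-up data, comparing the parenthesised mask with Series.between called with the lower bound first. The frame contents are invented for illustration; only x comes from the question.

```python
import numpy as np
import pandas as pd

x = 10.64589195722904

# hypothetical rows: two within the wanted band, three outside it
df = pd.DataFrame({'x': [x + 0.005, x - 0.005, x + 0.5, x + 0.0001, x]})

# absolute distance from x, computed for the whole column at once
s = np.fabs(df['x'] - x)

# each comparison is wrapped in parentheses before combining with &
fdf = df[(s <= 0.01) & (s >= 0.001)]

# between is inclusive on both ends by default and expects (lower, upper)
fdf_between = df[s.between(0.001, 0.01)]

print(fdf)
```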
math.fabs only takes a single value, so .apply can be used to create a new column, and the Boolean selection can then be performed on that column.
As shown by jezrael, np.fabs can be used for a vectorized approach.
The benefit is that numpy is faster.
# apply math.fabs to each difference and create a column
df['fabs'] = df['x'].apply(lambda value: math.fabs(value - x))
# filter on the new column
fdf = df[(df['fabs'] <= 0.01) & (df['fabs'] >= 0.001)]

How to bin a column and keep null values in a separate group

I have a column with continuous variable and wanted to bin it for plotting. However, this column also contains null values.
I used the following code to bin it:
def a(b):
    if b<= 20: return "<= 20"
    elif b<= 40: return "20 < <= 40"
    elif b<= 45: return "40 < <= 45"
    else: return "> 45"
audf = udf(a, StringType())
data= data.withColumn("a_bucket", audf("b"))
I am running Python 3, and it throws the following error:
TypeError: '<=' not supported between instances of 'NoneType' and 'int'
I looked up some documentation saying that Python 3 won't allow comparisons between numbers and null values. Is there a way to put those null values into a separate group so that I don't throw away data? They are not bad data.
Thanks.
You can do this without a udf. Here is one way to rewrite your code, and have a special bucket for null values:
from pyspark.sql.functions import col, when
def a(b):
    return when(b.isNull(), "Other")\
        .when(b <= 20, "<= 20")\
        .when(b <= 40, "20 < <= 40")\
        .when(b <= 45, "40 < <= 45")\
        .otherwise("> 45")
data = data.withColumn("a_bucket", a(col("b")))
However, a more general solution would allow you to pass in a list of buckets and dynamically return the bin output (untested):
from functools import reduce
from pyspark.sql.functions import col, when
def aNew(b, buckets):
    """assumes buckets are sorted"""
    if not buckets:
        raise ValueError("buckets can not be empty")
    return reduce(
        # add one .when() per pair of neighbouring bucket edges
        lambda w, i: w.when(
            b.between(buckets[i-1], buckets[i]),
            "{low} < <= {high}".format(low=buckets[i-1], high=buckets[i])
        ),
        range(1, len(buckets)),
        # starting expression: nulls first, then everything up to the first edge
        when(
            b.isNull(),
            "Other"
        ).when(
            b <= buckets[0],
            "<= {first}".format(first=buckets[0])
        )
    ).otherwise("> {last}".format(last=buckets[-1]))
data = data.withColumn("a_bucket", aNew(col("b"), buckets=[20, 40, 45]))

Creating new column by looping through another column...not working

I'm trying to create a new "category" based on the value in another column (looping through that column). Here is my code.
def minority_category(minority_col):
    for i in minority_col:
        if i <= 25:
            return '<= 25%'
        if i > 25 and i <= 50:
            return '<= 50%'
        if i > 50 and i <= 75:
            return '<= 75%'
        if i > 75 and i <= 100:
            return '<= 100%'
        return 'Unknown'
However, the result is '<=75%' in the entire new column. Based on examples I've seen, my code looks right. Can anyone point out something wrong with the code? Thank you.
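One likely cause, assuming the function is called once with the whole column (the call site is not shown): return exits the function during the first loop iteration, so a single label is produced and then assigned to every row. A minimal sketch that classifies one value at a time and lets Series.apply do the looping follows; the frame df and the column name minority_pct are hypothetical.

```python
import pandas as pd

def minority_category(value):
    # classify a single number; the branches are tested in order
    if value <= 25:
        return '<= 25%'
    if value <= 50:
        return '<= 50%'
    if value <= 75:
        return '<= 75%'
    if value <= 100:
        return '<= 100%'
    return 'Unknown'  # NaN or anything above 100

# hypothetical sample data
df = pd.DataFrame({'minority_pct': [10.0, 30.0, 60.0, 90.0, float('nan')]})

# apply calls the function once per element and collects the results
df['category'] = df['minority_pct'].apply(minority_category)
print(df)
```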

Write a loop using python

I need to print value with some size using condition.
size, url
1 https://api-glb-ams.smoot.apple.com/user_guid?
3257 https://init.itunes.apple.com/WebObjects/MZInit.woa/wa/signSapSetupCert
0 http://engine.rbc.medialand.ru/code?
35 http://www.google-analytics.com/collect?
0 http://engine.rbc.medialand.ru/test?
0 http://engine.rbc.medialand.ru/code?
I read these rows in a loop and want to print every url whose size is more than 43.
if not size:
    continue
elif size[0] < 43:
    continue
else:
    print size[0], url
The if condition works, but the elif doesn't: it prints every size and url.
In Python 2, which you are using, strings can be compared to integers. Strings always compare as being larger than integers.
>>> '35' < 43
False
To solve this, wrap the string in an int() call:
>>> int('35') < 43
True
For your program:
elif int(size[0]) < 43:
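For readers on Python 3, the same comparison no longer returns False silently: comparing a str with an int using < raises a TypeError, so the int() conversion is still required. A short sketch, assuming each row is a (size, url) pair of strings; the rows list and the large list are illustrative names, with values taken from the sample data in the question.

```python
# assumed row layout: (size as text, url)
rows = [
    ('1', 'https://api-glb-ams.smoot.apple.com/user_guid?'),
    ('3257', 'https://init.itunes.apple.com/WebObjects/MZInit.woa/wa/signSapSetupCert'),
    ('0', 'http://engine.rbc.medialand.ru/code?'),
    ('35', 'http://www.google-analytics.com/collect?'),
]

large = []
for size, url in rows:
    if not size:
        continue  # skip rows with an empty size field
    if int(size) < 43:
        continue  # compare as a number, not as text
    large.append((int(size), url))
    print(int(size), url)
```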
