I have a dataframe with product names and volumes. I also have two variables holding the per-unit cost:
LVP_Cost=xxxx
HVP_Cost=xxxx
However, I would like to apply the per-unit cost only to selected product types. To achieve this I am using isin() within a user-defined function, but I am getting an error message:
AttributeError: 'str' object has no attribute 'isin'
Here is my code:
LVP_list=['BACS','FP','SEPA']
HVP_list=['HVP','CLS']

def calclate_cost(row):
    if row['prod_final'].isin(LVP_list):
        return row['volume']*LVP_per_unit_cost
    elif row['prod_final']==(HVP_list):
        return row['volume']*HVP_per_unit_cost
    else:
        return 0

mguk['cost_usd']=mguk.apply(calclate_cost,axis=1)
Could you please help?
row['prod_final'] is a string (the value of that column in the current row), not a pandas Series, so use the regular in operator instead:
if row['prod_final'] in LVP_list:
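A complete sketch with the membership tests fixed (the per-unit cost values and sample rows below are made up; note the elif branch needs the same in fix, since == against a list would always be False for a string):

```python
import pandas as pd

LVP_per_unit_cost = 0.05  # placeholder values standing in for the real costs
HVP_per_unit_cost = 0.25

LVP_list = ['BACS', 'FP', 'SEPA']
HVP_list = ['HVP', 'CLS']

def calclate_cost(row):
    # row['prod_final'] is a plain string, so test membership with `in`
    if row['prod_final'] in LVP_list:
        return row['volume'] * LVP_per_unit_cost
    elif row['prod_final'] in HVP_list:
        return row['volume'] * HVP_per_unit_cost
    else:
        return 0

# made-up sample frame in the shape described in the question
mguk = pd.DataFrame({'prod_final': ['BACS', 'CLS', 'OTHER'],
                     'volume': [100, 10, 5]})
mguk['cost_usd'] = mguk.apply(calclate_cost, axis=1)
```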
I have a hypothetical dataframe 'country_sales_df' :
country_sales_list_of_lists =
[
['Australia',21421324,342343,'Pacific','Y'],
['England',124233431,43543464,'Europe','Y'],
['Japan',12431241341,34267545,'Asia','N'],
['India',214343421,342343,'Asia','Y']
]
country_sales_df = pd.DataFrame(country_sales_list_of_lists, columns = ['Country','Sales','Profit','Region','Otac_Group'])
I then define a series of dataframes from the original country sales dataframe:
otac_df= country_sales_df.query('Otac_Group == "Y"')
asia_df= country_sales_df.query('Region == "Asia"')
europe_df= country_sales_df.query('Region == "Europe"')
pacific_df= country_sales_df.query('Region == "Pacific"')
For each of these dataframes I want to aggregate all the numeric fields and create an additional dataframe with the aggregated information. I don't want to repeat the agg code for each dataframe, because the actual project I'm working on has significantly more lines of code; this is just a smaller example.
How would I create a function to do this? I tried the below, but it returns the error 'TypeError: 'DataFrameGroupBy' object is not callable':
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')(['Sales','Profit']).agg([np.sum])
country_report_func('pacific_df_agg',pacific_df)
country_report_func('europe_df_agg',europe_df)
country_report_func('asia_df_agg',asia_df)
country_report_func('otac_df_agg',otac_df)
I'm basically just trying to run a piece of code for each of the dataframes I have defined and produce an additional dataframe for each. Does anyone have any recommendations on the best way to do this, e.g. looping through a list of dataframes?
Update:
I have now updated the function so it applies the agg function to a dataframe object and returns the dataframe from within the function. It now produces pacific_df_agg, however I'm unable to print it, and the Europe, Asia and otac dataframes are also not created.
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')[['Sales','Profit']].agg([np.sum])
    return df_name
country_report_func('pacific_df_agg',pacific_df)
country_report_func('europe_df_agg',europe_df)
country_report_func('asia_df_agg',asia_df)
country_report_func('otac_df_agg',otac_df)
Update 2:
I think I have solved it, as I am now returning multiple dataframes from the function using the code below. I'm unsure whether this is the easiest way to do it, so any further suggestions are welcome:
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')[['Sales','Profit']].agg([np.sum])
    return df_name
pacific_df_agg = country_report_func('pacific_df_agg',pacific_df)
europe_df_agg = country_report_func('europe_df_agg',europe_df)
asia_df_agg = country_report_func('asia_df_agg',asia_df)
otac_df_agg = country_report_func('otac_df_agg',otac_df)
print(pacific_df_agg)
print(europe_df_agg)
print(asia_df_agg)
print(otac_df_agg)
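One way to avoid repeating the four calls entirely is to keep the filtered frames in a dict and build the aggregated frames in a single comprehension. A sketch, using made-up data in the shape of country_sales_df:

```python
import pandas as pd

country_sales_df = pd.DataFrame(
    [['Australia', 21421324, 342343, 'Pacific', 'Y'],
     ['England', 124233431, 43543464, 'Europe', 'Y'],
     ['Japan', 12431241341, 34267545, 'Asia', 'N'],
     ['India', 214343421, 342343, 'Asia', 'Y']],
    columns=['Country', 'Sales', 'Profit', 'Region', 'Otac_Group'])

# name -> filtered frame; add or remove regions in one place
region_dfs = {
    'pacific': country_sales_df.query('Region == "Pacific"'),
    'europe': country_sales_df.query('Region == "Europe"'),
    'asia': country_sales_df.query('Region == "Asia"'),
    'otac': country_sales_df.query('Otac_Group == "Y"'),
}

# one aggregated frame per region, built in a single comprehension
agg_dfs = {name: df.groupby('Country')[['Sales', 'Profit']].sum()
           for name, df in region_dfs.items()}
```

You can then reach any result as agg_dfs['asia'], and adding a new region is a one-line change to the dict rather than two more repeated statements.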
I keep getting AttributeError: 'DataFrame' object has no attribute 'column' when I run the function below on a column in a dataframe:
def reform(column, dataframe):
    if dataframe.column.nunique() > 2 and dataframe.column.dtypes == object:
        enc.fit(dataframe[['column']])
        enc.categories_
        onehot = enc.transform(dataframe[[column]]).toarray()
        dataframe[enc.categories_] = onehot
    elif dataframe.column.nunique() == 2 and dataframe.column.dtypes == object:
        le.fit_transform(dataframe[['column']])
    else:
        print('Column cannot be reformed')
    return dataframe
Try changing:
dataframe.column to dataframe.loc[:, column]
dataframe[['column']] to dataframe.loc[:, [column]]
For more help, please provide more information. Such as: What is enc (show your imports)? What does dataframe look like (show a small example, perhaps with dataframe.head(5))?
Details:
Since column is an input parameter (probably a string), you need to use it correctly when asking for that column from the dataframe object. If you just use dataframe.column, pandas will try to find a column actually named 'column'; but if you ask for it with dataframe.loc[:,column], it will use the string held by the input parameter named column.
With dataframe.loc[:,column], you get a Pandas Series, and with dataframe.loc[:,[column]] you get a Pandas DataFrame.
The pandas attribute 'columns', used as dataframe.columns (note the 's' at the end), just returns the names of all columns in your dataframe, which is probably not what you want here.
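To see the difference concretely, here is a toy frame with a hypothetical column name held in a variable:

```python
import pandas as pd

df = pd.DataFrame({'colour': ['red', 'blue', 'red']})
column = 'colour'

# Attribute access never looks at the variable's value: df.column searches
# for something literally named "column" and raises AttributeError.
try:
    df.column
except AttributeError as err:
    message = str(err)

series = df.loc[:, column]    # uses the variable's value -> a pandas Series
frame = df.loc[:, [column]]   # a list of labels -> a one-column DataFrame
```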
TIPS:
Try to name input parameters so that you know what they are.
When developing a function, try setting the input to something static, and iterate the code until you get desired output. E.g.
input_df = my_df
column_name = 'some_test_column'
if input_df.loc[:,column_name].nunique() > 2 and input_df.loc[:,column_name].dtypes == object:
enc.fit(input_df.loc[:,[column_name]])
onehot = enc.transform(input_df.loc[:,[column_name]]).toarray()
input_df.loc[:, enc.categories_] = onehot
elif input_df.loc[:,column_name].nunique() == 2 and input_df.loc[:,column_name].dtypes == object :
le.fit_transform(input_df.loc[:,[column_name]])
else:
print('Column cannot be transformed')
Look up how to use SciKit Learn Pipelines with ColumnTransformer; it will make the workflow easier (https://scikit-learn.org/stable/modules/compose.html).
Dataset
I'm trying to check for a win from the WINorLOSS column, but I'm getting the following error:
Code and Error Message
The variable combined.WINorLOSS is a pandas Series, and you can't test an iterable (like a list, dict, Series, etc.) against a string value in an if statement. I think you meant to do:
for i in combined.WINorLOSS:
    if i == 'W':
        hteamw += 1
    else:
        ateamw += 1
You can't compare a Series of values (like your WINorLOSS dataframe column) to a single string value inside an if. However, you can use the following to count the 'L' and 'W' values in your column:
hteamw = combined['WINorLOSS'].value_counts()['W']
hteaml = combined['WINorLOSS'].value_counts()['L']
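For example, with a made-up WINorLOSS column the counts come straight out of value_counts, with no loop at all:

```python
import pandas as pd

# invented sample data in the shape of the question's column
combined = pd.DataFrame({'WINorLOSS': ['W', 'L', 'W', 'W']})

counts = combined['WINorLOSS'].value_counts()
hteamw = counts['W']
hteaml = counts['L']
```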
I have a dataframe which looks like:
Now I am comparing whether two columns (i.e. Complaint and Compliment) have equal values or not. I have written a function:
def col_comp(x):
    return x['Complaint'].isin(x['Compliment'])
When I apply this function to the dataframe, i.e.
df.apply(col_comp, axis=1)
I get an error message:
AttributeError: ("'float' object has no attribute 'isin'", 'occurred at index 0')
Any suggestions on where I am making the mistake?
isin requires an iterable. You are providing individual data points (floats) with apply and col_comp. What you should use is == in your function col_comp, instead of isin. Even better, you can compare the columns in one call:
df['Complaint'] == df['Compliment']
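As a quick sketch with invented data, the single vectorized comparison returns a boolean Series that you can assign straight back to the frame:

```python
import pandas as pd

# made-up values for the two columns being compared
df = pd.DataFrame({'Complaint': [1.0, 2.0, 3.0],
                   'Compliment': [1.0, 5.0, 3.0]})

# element-wise equality across the columns, no apply needed
df['match'] = df['Complaint'] == df['Compliment']
```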
I have gone through all the posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns. Some of them contain None or NaN values, and I need help writing the if-else condition.
If the value in the column is neither None nor NaN, I need to format the datetime column. My current code is below:
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index,'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']):
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The error that I am getting is
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
Also, note that isna is invoked with no argument and returns a dataframe of boolean values (see link). You can then iterate through both dataframes to determine whether each value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where that value is invalid. You tried to specify the individual value you're checking as a second input argument. isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()
for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
Please check my use of row & column indices; the index, row handling doesn't look right. Also, you should be able to apply an any operation to the entire row at once.
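Alternatively, pd.isna can be called as a module-level function on a single value, which sidesteps the instance-method confusion entirely. A minimal sketch, assuming an invented startTime column of 'YYYY-MM-DD HH:MM:SS' strings:

```python
import re
from datetime import datetime

import pandas as pd

# made-up frame: one valid timestamp string, one missing value
df_with_job_name = pd.DataFrame({'startTime': ['2021-03-01 10:15:00', None]})

parsed = []
for index, row in df_with_job_name.iterrows():
    start_time = row['startTime']
    # pd.isna works on a single scalar: True for None or NaN
    if not pd.isna(start_time):
        # split on every non-digit run and build a datetime from the pieces
        parsed.append(datetime(*map(int, re.split(r'[^\d]+', start_time))))
    else:
        parsed.append(None)
```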