How to use apply function here? - python

import numpy as np
import pandas as pd
PATH = r'C:\Users\ADMIN\Desktop\Net_Present_value.csv'
data1 = pd.read_csv(PATH)
def calc_equity(assets, liabilities):
    return liabilities - assets

data1.apply(calc_equity)
It's giving me an error stating:
calc_equity() missing 1 required positional argument: 'liabilities'
How can I resolve this?

I'm assuming your data has two columns ['assets', 'liabilities'] and you want to calculate the equity as a third column. You don't need the apply function here. You can calculate it as a difference of the two columns:
data1['equity'] = calc_equity(data1['assets'], data1['liabilities'])
This would create a new column 'equity' in your DataFrame.
If you insist on applying a function to the DataFrame, the function in question needs to accept a single argument that is either a column or a row of the DataFrame. In your case you want to take the difference of two values in the same row, so the function to apply needs to take a row as an argument:
def calc_equity(row):
    return row['liabilities'] - row['assets']

data1['equity'] = data1.apply(calc_equity, axis=1)
axis=1 tells apply to work on each row. Inside the function you can access the row's values by column name. Bear in mind that this is slower than the first approach, as it iterates over the rows instead of operating on whole columns as NumPy arrays.
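For reference, here is a minimal self-contained sketch of both approaches, using made-up numbers in place of the CSV:

import pandas as pd

# Hypothetical data standing in for Net_Present_value.csv
data1 = pd.DataFrame({'assets': [100, 250, 40],
                      'liabilities': [80, 300, 10]})

# Vectorized: operates on whole columns at once
data1['equity'] = data1['liabilities'] - data1['assets']

# Row-wise apply: same result, but slower on large frames
data1['equity_rowwise'] = data1.apply(
    lambda row: row['liabilities'] - row['assets'], axis=1)

print(data1)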


Map Value to Specific Row and Column - Python Pandas

I have a data set where I want to match the index row and change the value of a column within that row.
I have looked at map and loc and have been able to locate the data using df.loc, but that filters the data down. All I want to do is change the value in a column on that row when that row is found.
What is the best approach? My original post can be found here:
Original post
It's simple to do in Excel but I'm struggling with Pandas.
Edit:
I have this so far, which seems to work, but the result includes a lot of extra numbers after the total calculation, along with dtype: int64
import pandas as pd
df = pd.read_csv(r'C:\Users\david\Documents\test.csv')
multiply = {2.1: df['Rate'] * df['Quantity']}
df['Total'] = df['Code'].map(multiply)
df.head()
How do I get around this?
The pandas method mask is likely a good option here. mask takes two main arguments: a condition and something with which to replace values that meet that condition.
If you're trying to replace values with a formula that draws on values from multiple DataFrame columns, you'll also want to pass an additional axis argument.
The condition: this would be something like:
df['Code'] == 2.1
The replacement value: this can be a single value, a Series/DataFrame, or a callable. Most useful for your purposes is a Series computed from the other columns. For example:
df['Rate'] * df['Quantity']
The axis: because the replacement values come from another Series, you need to tell mask() how to align them. It might look something like this:
axis=0
So all together, the code would read like this:
df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
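Putting it together on some hypothetical Code/Rate/Quantity values:

import pandas as pd

# Hypothetical data mirroring the columns in the question
df = pd.DataFrame({'Code': [2.1, 3.0, 2.1],
                   'Rate': [10.0, 20.0, 30.0],
                   'Quantity': [2, 3, 4]})

df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
print(df)
# Rows where Code == 2.1 get Rate * Quantity; other rows keep the Code value.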

apply function in pandas to create two columns

I have a Pandas DataFrame called ebola, as seen below. The variable column holds two pieces of information: a status (whether it is Cases or Deaths) and a country name. I am trying to create two new columns, status and country, out of that variable column using the .apply() function. However, since there are two values to extract, it does not work.
# let's create a splitter function
def splitter(column):
    status, country = column.split("_")
    return status, country

# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].apply(splitter)
The error I get is
ValueError: Must have equal len keys and value when setting with an iterable
I want my output to contain separate status and country columns.
Use Series.str.split
ebola[['status', 'country']] = ebola['variable'].str.split(pat='_', expand=True)
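A quick sketch on made-up data shows what the expand=True split returns:

import pandas as pd

# Hypothetical subset of the ebola frame described in the question
ebola = pd.DataFrame({'variable': ['Cases_Guinea', 'Deaths_Liberia']})

# expand=True returns a DataFrame with one column per split part
ebola[['status', 'country']] = ebola['variable'].str.split(pat='_', expand=True)
print(ebola)
# status holds Cases/Deaths, country holds the country name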
This is a very late post to the original question. Thanks to @ansev, the solution was great and worked out well. While going back through my question, I tried to develop a solution based on my first approach. I was able to work it out, and I wanted to share it for anyone who might want to see a different perspective on this.
Update to my code:
# let's create a splitter function
def splitter(column):
    for row in column:
        status, country = row.split("_")
        return status, country

# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].to_frame().apply(splitter, axis=1, result_type='expand')
Two updates to my code were needed so it could work:
1. Instead of going through the Series, I converted it to a DataFrame using the .to_frame() method.
2. In my splitter function, I had to iterate through each row since it was now a DataFrame, so I added the for row in column line.
To replicate all of this:
import numpy as np
import pandas as pd

# create the data
ebola_dict = {'Date': ['3/24/2014', '3/22/2014', '1/15/2015', '1/4/2015'],
              'variable': ['Cases_Guinea', 'Cases_Guinea', 'Cases_Liberia', 'Cases_Liberia']}
ebola = pd.DataFrame(ebola_dict)
print(ebola)

# let's create a splitter function
def splitter(column):
    for row in column:
        status, country = row.split("_")
        return status, country

# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].to_frame().apply(splitter, axis=1, result_type='expand')

# check if it worked
print(ebola)

Column missing after Pandas GroupBy (not the GroupBy column)

I am using the following source code:
import numpy as np
import pandas as pd
# Load data
data = pd.read_csv('C:/Users/user/Desktop/Daily_to_weekly.csv', keep_default_na=True)
print(data.shape[1])
# 18
# Create weekly data
# Aggregate by calculating the sum per store for every week
data_weekly = data.groupby(['STORE_ID', 'WEEK_NUMBER'], as_index=False).agg('sum')
print(data_weekly.shape[1])
# 17
As you can see, for some reason a column is missing after the aggregation, and it is neither of the GroupBy columns ('STORE_ID', 'WEEK_NUMBER').
Why is this happening and how can I fix it?
I've run into this problem numerous times before. The problem is that pandas is dropping one of your columns because it has identified it as a "nuisance" column, meaning the aggregation you are attempting cannot be applied to it (for example, a numeric aggregation on a column of strings). If you wish to preserve this column, I would recommend including it in the groupby.
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns
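A small sketch of the usual fix, using hypothetical column names; note the exact behavior varies by pandas version (older releases silently dropped nuisance columns, newer ones may raise instead):

import pandas as pd

# Hypothetical frame where 'REGION' holds strings, so a numeric
# aggregation cannot sensibly be applied to it
data = pd.DataFrame({'STORE_ID': [1, 1, 2],
                     'WEEK_NUMBER': [1, 1, 1],
                     'SALES': [10.0, 20.0, 5.0],
                     'REGION': ['N', 'N', 'S']})

# Including the column in the groupby keys preserves it
weekly = data.groupby(['STORE_ID', 'WEEK_NUMBER', 'REGION'],
                      as_index=False).agg('sum')
print(weekly.columns.tolist())
# ['STORE_ID', 'WEEK_NUMBER', 'REGION', 'SALES']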

Pandas Mean for Certain Column

I have a pandas DataFrame with a Cluster column and several numeric columns such as page and duration.
How can I calculate the mean (and min/max, median) for a specific column where Cluster == 1 or Cluster == 2?
Thanks!
You can create a new df with only the relevant rows, using:
newdf = df[df['cluster'].isin([1, 2])]
newdf.mean()
In order to calc the mean of a specific column you can:
newdf["page"].mean()
If you meant take the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant take a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupby and agg:
df.groupby("Cluster").mean()
is the simplest and will take means of all columns, grouped by Cluster.
df.groupby("Cluster").agg({"duration" : np.mean})
is an example where you are taking the mean of just one specific column, grouped by cluster. You can also use np.min, np.max, np.median, etc.
The groupby method produces a GroupBy object, which is something like but not like a DataFrame. Think of it as the DataFrame grouped, waiting for aggregation to be applied to it. The GroupBy object has simple built-in aggregation functions that apply to all columns (the mean() in the first example), and also a more general aggregation function (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is passing a dict of column names keyed to functions, so specific functions can be applied to specific columns.
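For illustration, here is a small sketch with made-up column names showing both the dict form and the list form of agg:

import pandas as pd

# Hypothetical data with the columns mentioned above
df = pd.DataFrame({'Cluster': [1, 1, 2, 2],
                   'duration': [10, 20, 30, 40],
                   'page': [1, 2, 3, 4]})

# dict form: a specific function per column
print(df.groupby('Cluster').agg({'duration': 'mean', 'page': 'max'}))

# list form: several statistics for one column
print(df.groupby('Cluster')['duration'].agg(['mean', 'min', 'max', 'median']))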
You can do it in one line, using boolean indexing. For example you can do something like:
import numpy as np
import pandas as pd
# This will just produce an example DataFrame
df = pd.DataFrame({'a': np.arange(30), 'Cluster': np.ones(30, dtype=int)})
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3
# This line is all you need
df.loc[(df['Cluster']==1)|(df['Cluster']==2), 'a'].mean()
The boolean indexing array is True for the correct clusters. a is just the name of the column to compute the mean over.
Simple intuitive answer
First pick the rows of interest, then average, then pick the columns of interest.
clusters_of_interest = [1, 2]
columns_of_interest = ['page']

# rows of interest
newdf = df[df.CLUSTER.isin(clusters_of_interest)]

# average and pick columns of interest
newdf.mean(axis=0)[columns_of_interest]
More advanced
# Create a groups object according to the value in the 'CLUSTER' column
grp = df.groupby('CLUSTER')

# apply functions of interest to all cluster groupings
data_agg = grp.agg(['mean', 'max', 'min'])
This is also a good link which describes aggregation techniques. It should be noted that the "simple answer" averages over clusters 1 AND 2 together (whatever is specified in clusters_of_interest), while the .agg function computes a separate result for each group of rows sharing the same CLUSTER value.
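To make that distinction concrete, here is a small sketch on made-up data:

import pandas as pd

# Hypothetical data with CLUSTER and page columns
df = pd.DataFrame({'CLUSTER': [1, 1, 2, 3],
                   'page': [10, 20, 30, 40]})

# "Simple answer": one pooled mean over clusters 1 and 2
print(df[df.CLUSTER.isin([1, 2])]['page'].mean())  # 20.0

# .agg: separate statistics per cluster
print(df.groupby('CLUSTER')['page'].agg(['mean', 'max', 'min']))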

Manipulate A Group Column in Pandas

I have a data set with columns Dist, Class, and Count.
I want to group that data set by Dist and divide the Count column of each group by the sum of the counts for that group (normalizing it to one).
The following MWE demonstrates my approach thus far. But I wonder: is there a more compact/pandaific way of writing this?
import pandas as pd
import numpy as np

a = np.random.randint(0, 4, (10, 3))
s = pd.DataFrame(a, columns=['Dist', 'Class', 'Count'])

def manipcolumn(x):
    csum = x['Count'].sum()
    x['Count'] = x['Count'].apply(lambda x: x / csum)
    return x

s.groupby('Dist').apply(manipcolumn)
One alternative way to get the normalised 'Count' column is to use groupby and transform to get the sums for each group, and then divide the 'Count' column by the returned Series. You can reassign the result back to your DataFrame:
s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform(np.sum)
This avoids the need for a bespoke Python function and the use of apply. Testing it on the small example DataFrame in your question showed it to be around 8 times faster.
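Here is a quick, reproducible sketch of the transform approach, assuming the same column layout as the question (counts are drawn from 1-3 so no group sum is zero, which would otherwise produce NaNs):

import numpy as np
import pandas as pd

# Reproducible example with the same columns as the question
rng = np.random.default_rng(0)
s = pd.DataFrame(rng.integers(1, 4, (10, 3)),
                 columns=['Dist', 'Class', 'Count'])

# transform('sum') returns a Series aligned with the original index,
# holding each row's group total, so plain division normalizes per group
s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform('sum')
print(s.groupby('Dist')['Count'].sum())  # each group now sums to 1.0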
