Is there a way to optimize the iterrows code in pandas? - python

This is the input data.
The Closing Stock value is calculated only where there is a value for Opening Stock, by adding Opening Stock, Purchase Qty and Sold Qty.
I want to add values to 'Opening Stock' & 'Closing Stock' with two conditions:
When the Opening Stock value is 0 or blank, it should be filled with the Closing Stock of the previous record.
The fill should happen only when the Site and Item Code are the same for this record and the previous record.
for i, row in df.iterrows():
    df['Opening Stock'] = np.where(
        (df['Site'] == df['Site'].shift(1))
        & (df['Item Code'] == df['Item Code'].shift(1))
        & ((df['Opening Stock'] == 0) | (df['Opening Stock'].isna())),
        df['Closing Stock'].shift(1),
        df['Opening Stock'])
    df['Closing Stock'][i] = df['Opening Stock'][i] + df['Purchase Qty'][i] + df['Sold Qty'][i]
This is what the output looks like.
The problem is that since the dataset is large, it takes hours to complete.
Is there a way to optimise this code?

You can do this without any iterative approach. The first step is to convert the 0 values in Opening Stock to np.nan so that we can fill them in the next step.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Site': ['site 1', 'site 1', 'site 2', 'site 2'],
                   'Item Code': ['A', 'A', 'A', 'A'],
                   'Opening Stock': [1000, 0, 2000, 0],
                   'Closing Stock': [1200, 0, 2250, 0],
                   'Purchase Qty': [500, 100, 400, 300],
                   'Sold Qty': [-300, -200, -150, -100]})
df.loc[df['Opening Stock'] == 0, 'Opening Stock'] = np.nan
# shift Closing Stock within each Site/Item Code group so a fill never
# crosses group boundaries, then use it to fill the NaNs
df['Opening Stock'] = df['Opening Stock'].fillna(
    df.groupby(['Site', 'Item Code'])['Closing Stock'].shift(1))
df['Closing Stock'] = df['Opening Stock'] + df['Purchase Qty'] + df['Sold Qty']
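Note that the fill above uses the previous row's original Closing Stock, so it handles a single blank row per run. If a Site/Item Code group can contain several consecutive blank rows, a cumulative sketch like the following avoids chaining fills (it assumes the first row of each group always has a real Opening Stock and that rows are already in chronological order):
flow = df['Purchase Qty'] + df['Sold Qty']
# the running total of stock movements within each group, anchored at the
# group's first Opening Stock, reproduces Closing Stock row by row
df['Closing Stock'] = (df.groupby(['Site', 'Item Code'])['Opening Stock'].transform('first')
                       + flow.groupby([df['Site'], df['Item Code']]).cumsum())
df['Opening Stock'] = df['Closing Stock'] - flow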

One way to apply a condition to every row in a pandas DataFrame is df.apply():
def my_function(row):
    # you can do pretty much whatever you want inside this function
    # note: row is a pandas Series
    row['Closing Stock'] = row['Opening Stock'] + row['Purchase Qty'] + row['Sold Qty']
    return row  # the returned row replaces the original row in the result

# extra positional arguments can be passed through with args=(...)
df = df.apply(my_function, axis=1)
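A quick self-contained run of the same idea (the column names are assumed from the question; only the three relevant columns are stubbed in):
import pandas as pd

df = pd.DataFrame({'Opening Stock': [1000, 1200],
                   'Purchase Qty': [500, 100],
                   'Sold Qty': [-300, -200]})

def my_function(row):
    row['Closing Stock'] = row['Opening Stock'] + row['Purchase Qty'] + row['Sold Qty']
    return row

print(df.apply(my_function, axis=1))
Bear in mind that apply with axis=1 is still a Python-level loop under the hood, so for the performance problem in the question the vectorized answer above will be much faster.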

Related

How to calculate statistics from dataframe and insert them into new dataframe, matching by name?

I have a dataframe (df1) of every single NBA shot taken with columns of time, shot location, player, etc. (there are duplicates of player names because every shot taken is its own row) and I want to create a new dataframe with calculated figures from the original dataframe. The new dataframe will have one row per player with various statistics like "Total Shots" and "Make %".
I have created the new dataframe with just the names of players:
df_names = df1[['Player Name']].drop_duplicates()
And now I would like to know how to go through df1, count the shots taken per player, and insert that into my new df_names as a new column.
Welcome to StackOverflow!
The feature of pandas that I'd recommend diving into is .groupby() (pandas documentation for .groupby()). Rather than doing these in two different operations, you can do them as one:
import pandas as pd

data = {'Player Name': ['Player 1', 'Player 1', 'Player 1', 'Player 2', 'Player 2', 'Player 2']
        , 'Shot Time': ['1/1/2022 10:00:0000', '1/1/2022 10:01:0000', '1/1/2022 10:02:0000', '1/1/2022 10:03:0000', '1/1/2022 10:04:0000', '1/1/2022 10:05:0000']
        , 'Shot Made': [True, True, False, False, True, False]}
df = pd.DataFrame(data)
df_agg = df.groupby(['Player Name'], as_index=False)['Shot Time'].count()
You can group by Player Name and do various aggregations over your columns, such as counts of shot times for number of shots (as above), mode of position, average time between shots, etc.
Eventually, you may need to join multiple of these dataframes together. If so, you would use the .join() or .merge() function. Here's the pandas documentation about various ways to smoosh data together: Pandas Join, Merge, Concatenate and Compare
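For instance, a sketch of attaching the aggregated counts back onto the deduplicated name frame from the question:
# hypothetical follow-up: join the per-player counts onto df_names
df_names = df[['Player Name']].drop_duplicates()
df_names = df_names.merge(df_agg, on='Player Name', how='left')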
You should accept K. Thorspear's solution, as I am just adding to it.
You can aggregate on a groupby of the player name. Count the 'Shot Time' column (this simply counts how many times a player took a shot), then sum the 'Shot Made' column (since True = 1 and False = 0, the sum gives the total made).
Then just divide those out to get a shot percent:
import pandas as pd
data = {'Player Name': ['Player 1', 'Player 1', 'Player 1', 'Player 2', 'Player 2', 'Player 2']
        , 'Shot Time': ['1/1/2022 10:00:0000', '1/1/2022 10:01:0000', '1/1/2022 10:02:0000', '1/1/2022 10:03:0000', '1/1/2022 10:04:0000', '1/1/2022 10:05:0000']
        , 'Shot Made': [True, True, False, False, True, False]}
df = pd.DataFrame(data)
df = (df.groupby(['Player Name'])
        .agg({'Shot Time': 'count', 'Shot Made': 'sum'})
        .rename(columns={'Shot Time': 'Shot Attempt'}))
df['Shot %'] = df['Shot Made'] / df['Shot Attempt']
Output:
print(df)
             Shot Attempt  Shot Made    Shot %
Player Name
Player 1                3          2  0.666667
Player 2                3          1  0.333333

Add a calculated column to a pivot table in pandas

Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a countif statement (similar to Excel), depending on whether a level of the index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'],
                       columns=['Name', 'Unit'],
                       values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)),
                       fill_value='')
and this is the desired output (in screenshot). Thanks in advance!
try:
# the top level of the pivot's columns is 'Assigned' (the values= key)
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
the result:
Based on your edit:
import numpy as np
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(
    lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
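An equivalent, slightly simpler sketch of the same count (booleans sum as 1/0; assumes the same pivot as above):
temp = (pivot.xs('Sales', level=2, axis=1) != '').sum(axis=1)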

how to apply a class function to replace NaN for mean within a subset of pandas df columns?

The class is composed of a set of attributes and functions including:
Attributes:
df : a pandas dataframe.
numerical_feature_names: df columns with a numeric value.
label_column_names: df string columns to be grouped.
Functions:
mean(nums): takes a list of numbers as input and returns the mean
fill_na(df, numerical_feature_names, label_columns): takes class attributes as inputs and returns a transformed df.
And here's the class:
class PLUMBER():
    def __init__(self):
        ################# attributes ################
        self.df = df
        # specify label and numerical features names:
        self.numerical_feature_names = numerical_feature_names
        self.label_column_names = label_column_names

    ##################### mean ##############################
    def mean(self, nums):
        total = 0.0
        for num in nums:
            total = total + num
        return total / len(nums)

    ############ fill the numerical features ##################
    def fill_na(self, df, numerical_feature_names, label_column_names):
        # declaring parameters:
        df = self.df
        numerical_feature_names = self.numerical_feature_names
        label_column_names = self.label_column_names
        # now replacing NaN with group mean
        for numerical_feature_name in numerical_feature_names:
            df[numerical_feature_name] = df.groupby([label_column_names]).transform(lambda x: x.fillna(self.mean(x)))
        return df
When trying to apply it to a pandas df:
if __name__ == "__main__":
    # initialize class
    plumber = PLUMBER()
    # replace NaN with group mean
    df = plumber.fill_na(df=df, numerical_feature_names=numerical_feature_names,
                         label_column_names=label_column_names)
The following error arises:
ValueError: Grouper and axis must be same length
data and class parameters
import numpy as np
import pandas as pd

d = {'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
     'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
     'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
     'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
     'number': [np.nan, 450, 299, np.nan, 19, 29],
     'age': [np.nan, 30, 28, np.nan, 29, 18]}
df = pd.DataFrame(d)
# headers
column_names=df.columns.values.tolist()
column_names= [column_name.strip() for column_name in column_names]
# label_column_names (to be grouped)
label_column_names=['country', 'level', 'job title']
# numerical_features:
numerical_feature_names = [x for x in column_names if x not in label_column_names]
numerical_feature_names.remove('month')
How could I change the class in order to get the transformed df (i.e. the one that replaces np.nan with its group mean)?
First, the error occurs because label_column_names is already a list, so in the groupby you don't need the [] around it: it should be df.groupby(label_column_names)... instead of df.groupby([label_column_names])...
Now, to actually solve your problem, in the fill_na function of your class, replace the for loop (you don't actually need it) with
df[numerical_feature_names] = (
df[numerical_feature_names]
.fillna(
df.groupby(label_column_names)
[numerical_feature_names].transform('mean')
)
)
in which you fillna the numerical_feature_names columns with the result of the groupby.transform, i.e. the group mean of each of these columns broadcast back to the original shape
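Putting both fixes together, the method could look like the sketch below (like the original, it uses the class attributes and ignores the passed-in arguments):
def fill_na(self, df, numerical_feature_names, label_column_names):
    df = self.df
    numerical_feature_names = self.numerical_feature_names
    label_column_names = self.label_column_names
    # group means broadcast back to the original shape, used to fill the NaNs
    df[numerical_feature_names] = df[numerical_feature_names].fillna(
        df.groupby(label_column_names)[numerical_feature_names].transform('mean')
    )
    return df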

Style rows of a dataframe based on column value

I have a csv file that I'm trying to read into a dataframe and style in jupyter notebook. The csv file data is:
[[' ', 'Name', 'Title', 'Date', 'Transaction', 'Price', 'Shares', '$ Value'],
[0, 'Sneed Michael E', 'EVP, Global Corp Aff & COO', 'Dec 09', 'Sale', 152.93, 54662, 8359460],
[1, 'Wengel Kathryn E', 'EVP, Chief GSC Officer', 'Sep 02', 'Sale', 153.52, 16115, 2473938],
[2, 'McEvoy Ashley', 'EVP, WW Chair, Medical Devices', 'Jul 28', 'Sale', 147.47, 29000, 4276630],
[3, 'JOHNSON & JOHNSON', '10% Owner', 'Jun 30', 'Buy', 17.00, 725000, 12325000]]
My goal is to style the background color of the rows so that the row is colored green if the Transaction column value is 'Buy', and red if the Transaction column value is 'Sale'.
The code I've tried is:
import pandas as pd
data = pd.read_csv('/Users/broderickbonelli/Desktop/insider.csv', index_col='Unnamed: 0')
def red_or_green():
    if data.Transaction == 'Sale':
        return ['background-color: red']
    else:
        return ['background-color: green']

data.style.apply(red_or_green, axis=1)
display(data)
When I run the code it outputs an unstyled spreadsheet without giving me an error code:
Dataframe
I'm not really sure what I'm doing wrong; I've tried it a number of different ways but can't seem to make it work. Any help would be appreciated!
If you want to style the entire row when the condition matches, the following is faster than apply on axis=1; here we build the styles for the whole dataframe at once:
import numpy as np
import pandas as pd

def red_or_green(dataframe):
    # boolean mask: True where the row is a Sale
    c = dataframe['Transaction'] == 'Sale'
    # broadcast the mask across every column, then pick the CSS string
    a = np.where(np.repeat(c.to_numpy()[:, None], dataframe.shape[1], axis=1),
                 'background-color: red', 'background-color: green')
    return pd.DataFrame(a, columns=dataframe.columns, index=dataframe.index)

data.style.apply(red_or_green, axis=None)  # .to_excel(.....)
Try
def red_or_green(transaction):
    if transaction == 'Sale':
        return 'red'
    else:
        return 'green'

data['background-color'] = data.apply(lambda x: red_or_green(x.Transaction), axis=1)
or you can use map:
data['background-color'] = data.Transaction.map(red_or_green)
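Note that these snippets add a plain text column rather than rendering any colors. If the goal is the rendered styling itself, one possible sketch is to broadcast a CSS string across each row:
def row_style(row):
    color = 'red' if row['Transaction'] == 'Sale' else 'green'
    return ['background-color: {}'.format(color)] * len(row)

data.style.apply(row_style, axis=1)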

Python pandas transpose data issue

I am having trouble figuring out how to properly transpose data in a DataFrame in order to calculate differences between actuals and targets. Doing something like df['difference'] = df['Revenue'] - df['Target'] is straightforward, so this is more a question of desired output formatting.
Assume you have a DataFrame with the follow columns and values:
Desired outputs would be a roll-up from both sources and a comparison at the Source level (assume there are 30+ additional data points similar to revenue, users, and new users).
Any and all suggestions are very much appreciated.
Setup
df = pd.DataFrame([
    ['2016-06-01', 15000, 10000, 1000, 900, 100, 50, 'US'],
    ['2016-06-01', 16000, 12000, 1500, 1200, 150, 100, 'UK']
], columns=['Date', 'Revenue', 'Target', 'Users', 'Target', 'New Users', 'Target', 'Source'])
df
Your columns are not unique. I'll start with moving Source and Date into the index and renaming the columns.
df1 = df.copy()
df1.Date = pd.to_datetime(df1.Date)
df1 = df1.set_index(['Date', 'Source'])
idx = pd.MultiIndex.from_product([['Revenue', 'Users', 'New Users'], ['Actual', 'Target']])
df1.columns = idx
df1
Then move the first level of columns to the index
df1 = df1.stack(0)
df1
From here, I'm going to sum sources across ['Revenue', 'Users', 'New Users'] and assign the result to df2.
df2 = df1.groupby(level=-1).sum()
df2
Finally:
df2['Difference'] = df2.Actual / df2.Target
df1['Difference'] = df1.Actual / df1.Target
df2
df1.stack().unstack([0, 1, -1])
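For reference, with the setup above the rolled-up frame df2 should come out like this (hand-computed from the two sample rows, so treat it as a sketch; the 'Difference' column here is the Actual/Target ratio):
           Actual  Target  Difference
New Users     250     150    1.666667
Revenue     31000   22000    1.409091
Users        2500    2100    1.190476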
