Style rows of a dataframe based on column value - python

I have a csv file that I'm trying to read into a dataframe and style in jupyter notebook. The csv file data is:
[[' ', 'Name', 'Title', 'Date', 'Transaction', 'Price', 'Shares', '$ Value'],
[0, 'Sneed Michael E', 'EVP, Global Corp Aff & COO', 'Dec 09', 'Sale', 152.93, 54662, 8359460],
[1, 'Wengel Kathryn E', 'EVP, Chief GSC Officer', 'Sep 02', 'Sale', 153.52, 16115, 2473938],
[2, 'McEvoy Ashley', 'EVP, WW Chair, Medical Devices', 'Jul 28', 'Sale', 147.47, 29000, 4276630],
[3, 'JOHNSON & JOHNSON', '10% Owner', 'Jun 30', 'Buy', 17.00, 725000, 12325000]]
My goal is to style the background color of the rows so that the row is colored green if the Transaction column value is 'Buy', and red if the Transaction column value is 'Sale'.
The code I've tried is:
import pandas as pd

data = pd.read_csv('/Users/broderickbonelli/Desktop/insider.csv', index_col='Unnamed: 0')

def red_or_green():
    if data.Transaction == 'Sale':
        return ['background-color: red']
    else:
        return ['background-color: green']

data.style.apply(red_or_green, axis=1)
display(data)
When I run the code, it outputs an unstyled spreadsheet without giving me an error.
I'm not really sure what I'm doing wrong. I've tried it a number of different ways but can't seem to make it work. Any help would be appreciated!

If you want to style the entire row when the condition matches, the following is faster than apply with axis=1, since we build the styles for the whole dataframe at once:
import numpy as np

def red_or_green(dataframe):
    c = dataframe['Transaction'] == 'Sale'
    a = np.where(np.repeat(c.to_numpy()[:, None], dataframe.shape[1], axis=1),
                 'background-color: red', 'background-color: green')
    return pd.DataFrame(a, columns=dataframe.columns, index=dataframe.index)

data.style.apply(red_or_green, axis=None)  # .to_excel(.....)
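Note that style.apply returns a new Styler object; it does not modify the dataframe in place, which is why display(data) in the question still shows the unstyled frame. In a notebook the Styler itself has to be rendered:
styled = data.style.apply(red_or_green, axis=None)
display(styled)  # or make `styled` the last expression in the cell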

Try (defining the function before it is used):
def red_or_green(transaction):
    if transaction == 'Sale':
        return 'red'
    else:
        return 'green'

data['background-color'] = data.apply(lambda x: red_or_green(x.Transaction), axis=1)
or you can use map:
data['background-color'] = data.Transaction.map(red_or_green)
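Note that this stores the color names in an ordinary column rather than coloring the display. To get the colored rows asked for in the question, the same helper can drive Styler.apply row by row; a minimal sketch reusing red_or_green from above:
def highlight(row):
    # one CSS string per cell in the row
    return ['background-color: ' + red_or_green(row['Transaction'])] * len(row)

data.style.apply(highlight, axis=1)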

Related

Add a calculated column to a pivot table in pandas

Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a countif statement (similar to Excel), depending on whether a level of the index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')
and this is the desired output (in screenshot). Thanks in advance!
Try (the top column level is 'Assigned', from values=['Assigned']):
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
Based on your edit:
import numpy as np

temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(
    lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
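Since fill_value='' makes the empty cells empty strings, a vectorised comparison gives the same count without the Python-level loop; an equivalent sketch:
temp = (pivot.xs('Sales', level=2, drop_level=False, axis=1) != '').sum(axis=1)
pivot[('', 'total sales', 'count how many...')] = temp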

Is there a way to optimize the iterrows code in pandas?

This is the input data.
The Closing Stock value is calculated, only where there is a value for Opening Stock, by adding Opening Stock, Purchase Qty and Sold Qty.
I want to add values to 'Opening Stock' and 'Closing Stock' with these conditions:
When the Opening Stock value is 0 or blank, it should be filled by the Closing Stock of the previous record.
The fill should happen only when the Site and Item Code are the same for this record and the previous record.
for i, row in df.iterrows():
    df['Opening Stock'] = np.where((df['Site'] == df['Site'].shift(1)) &
                                   (df['Item Code'] == df['Item Code'].shift(1)) &
                                   ((df['Opening Stock'] == 0) | (df['Opening Stock'].isna())),
                                   df['Closing Stock'].shift(1), df['Opening Stock'])
    df['Closing Stock'][i] = df['Opening Stock'][i] + df['Purchase Qty'][i] + df['Sold Qty'][i]
This is what the output looks like.
The problem is that, since the dataset is large, it takes hours to complete.
Is there a way to optimise this code?
You can do this without any iterative approach. The first step is to convert the 0 values in Opening Stock to np.nan so that we can fill them in the next step.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Site': ['site 1', 'site 1', 'site 2', 'site 2'],
                   'Item Code': ['A', 'A', 'A', 'A'],
                   'Opening Stock': [1000, 0, 2000, 0],
                   'Closing Stock': [1200, 0, 2250, 0],
                   'Purchase Qty': [500, 100, 400, 300],
                   'Sold Qty': [-300, -200, -150, -100]})

df.loc[df['Opening Stock'] == 0, 'Opening Stock'] = np.nan
df['Opening Stock'] = df.groupby(['Site', 'Item Code'])['Opening Stock'].fillna(df['Closing Stock'].shift(1))
df['Closing Stock'] = df['Opening Stock'] + df['Purchase Qty'] + df['Sold Qty']
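With the sample frame above, row 2's Opening Stock is filled with 1200 (row 1's Closing Stock) and its Closing Stock becomes 1200 + 100 - 200 = 1100; likewise row 4 is filled with 2250 from row 3 and closes at 2250 + 300 - 100 = 2450.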
One way to apply a condition on every row in a Pandas dataframe is df.apply():
def my_function(row):
    # you can do pretty much whatever you want inside this function
    # note: row is a pandas Series
    row['Closing Stock'] = row['Opening Stock'] + row['Purchase Qty'] + row['Sold Qty']
    return row  # the modified rows are assembled into a new dataframe

df = df.apply(my_function, axis=1)  # extra arguments can be passed via args=(...)
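Bear in mind that df.apply with axis=1 still calls the function once per row in Python, so on a large dataset it performs much closer to iterrows than to the vectorised answer above.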

Pandas DataFrame creation throws ValueError in loop

I have a nested dictionary (stats) which I'm trying to convert into a Pandas DataFrame. When I run the code below, I get the desired result:
BAL_sp = pd.DataFrame(data = stats['sp']['Orioles'])
However, I need to do this 30 times and then concatenate the results. When I run a for loop, I get a ValueError: DataFrame constructor not properly called! I don't understand; it recognizes the key in stats as valid in the loop:
team_dict = {'LAA': 'Angels', 'ARI': 'Diamondbacks', 'BAL': 'Orioles', 'BOS': 'Red Sox', 'CHC': 'Cubs',
             'CIN': 'Reds', 'CLE': 'Indians', 'COL': 'Rockies', 'DET': 'Tigers', 'HOU': 'Astros',
             'KC': 'Royals', 'LAD': 'Dodgers', 'WSH': 'Nationals', 'NYM': 'Mets', 'OAK': 'Athletics',
             'PIT': 'Pirates', 'SD': 'Padres', 'SEA': 'Mariners', 'SF': 'Giants', 'STL': 'Cardinals',
             'TB': 'Rays', 'TEX': 'Rangers', 'TOR': 'Blue Jays', 'MIN': 'Twins', 'PHI': 'Phillies',
             'ATL': 'Braves', 'CWS': 'White Sox', 'MIA': 'Marlins', 'NYY': 'Yankees', 'MIL': 'Brewers'}

frames = []
for team in team_dict.values():
    temp = pd.DataFrame(data=stats['sp'][team])
    frames.append(temp)
sp_df = pd.concat(frames)
It doesn't throw an error if I do data = [stats['sp'][team]], but that does not produce the desired result. Thank you for any help.
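Without seeing stats it's hard to be certain, but pandas raises exactly ValueError: DataFrame constructor not properly called! when data is a plain string, so the likely cause is that stats['sp'][team] is a string for at least one team; wrapping it in a list is also why data = [stats['sp'][team]] stops the error. A quick check to find the offending entries (a sketch, assuming the stats structure from the question):
for team in team_dict.values():
    value = stats['sp'][team]
    if isinstance(value, str):
        print(team, repr(value))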

Using regex to replace certain keyword in a dataframe value in python

I have a dataframe that stores car-related information, and I am trying to replace the value "[Pink & Blue]" with "[Pink&Blue]" to remove the spaces in between, using regex, but my code fails to do so:
import numpy as np
import pandas as pd

Cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus'],
        'Price': [22000, np.nan, 35000],
        'Product Summary': ['color is [Pink & Blue] size is 14',
                            'color is [Pink & Yellow] size is 10',
                            'color is [Red & Black] size is 11']}
df = pd.DataFrame(Cars, columns=['Brand', 'Price', 'Product Summary'])
df['Product Summary'] = df['Product Summary'].replace(r'\b & ', '&')
Use the Series.str.replace accessor:
df['Product Summary'] = df['Product Summary'].str.replace(' & ', '&')
Or, in your code, pass regex=True:
df['Product Summary'] = df['Product Summary'].replace(r'\b & ', '&', regex=True)
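Note that both versions replace every occurrence of ' & ' in the column. If only the bracketed token should change, the pattern can be anchored to it; a sketch for this sample data:
df['Product Summary'] = df['Product Summary'].str.replace(r'\[Pink & Blue\]', '[Pink&Blue]', regex=True)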

Writing a multi-index dataframe to Excel file

The DataFrame MultiIndex is kicking my butt. After struggling for quite a while, I was able to create a MultiIndex DataFrame with this code:
columns = pd.MultiIndex.from_tuples([('Zip', ''),
    ('All Properties', 'Avg List Price'), ('All Properties', 'Median List Price'),
    ('3 Bedroom', 'Avg List Price'), ('3 Bedroom', 'Median List Price'),
    ('2 Bedroom', 'Avg List Price'), ('2 Bedroom', 'Median List Price'),
    ('1 Bedroom', 'Avg List Price'), ('1 Bedroom', 'Median List Price')])
data = [['11111', 'Val1', 'Val2', 'Val3', 'Val4', 'Val5', 'Val6', 'Val7', 'Val8']]
df = pd.DataFrame(data, columns=columns)
Everything looks fine until I try to write it to an Excel file:
writer = pd.ExcelWriter('testData.xlsx', engine='openpyxl')
df.to_excel(writer, 'Sheet1')
writer.save()
When I open the Excel file, this is what I get.
If I unmerge the columns in Excel, all the data is there.
Here's an image of what I'm trying to create.
I'm guessing that the problem has something to do with the way I'm creating the multi index columns, but I can't figure out what the problem is.
I'm running python 2.7 on a Mac.
Thanks for any input.
This was a bug that will be fixed in version 0.17.1; in the meantime, you can use engine='xlsxwriter':
https://github.com/pydata/pandas/pull/11328
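A minimal version of the workaround, keeping the rest of the code from the question:
writer = pd.ExcelWriter('testData.xlsx', engine='xlsxwriter')
df.to_excel(writer, 'Sheet1')
writer.save()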
This is a great use for itertools.product. Try this instead in your multiindex creation:
from itertools import product

cols = product(
    ['All Properties', '3 Bedroom', '2 Bedroom', '1 Bedroom'],
    ['Avg List Price', 'Median List Price']
)
columns = pd.MultiIndex.from_tuples(list(cols))
ind = pd.Index(['11111'], name='zip')
vals = [['Val1', 'Val2', 'Val3', 'Val4', 'Val5', 'Val6', 'Val7', 'Val8']]
df = pd.DataFrame(vals, index=ind, columns=columns)
The issue is that you included zip (which names your index) in the construction of the MultiIndex for your columns (tragically, nothing called MultiColumns exists to clear up that confusion). You need to create your index (which is a single-level, normal pandas.Index) and your columns (which are a two-level pandas.MultiIndex) separately, as above, and you should get the expected behavior when you write to Excel.
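With the index kept out of the column MultiIndex, a plain write (a sketch using the rebuilt frame above) should produce the expected layout:
df.to_excel('testData.xlsx')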
