Setting a range of values as variables for a barplot - python

In my data, I have a column that contains one of the following values: 'NOT_TESTED', 'NOT_COMPLETED', 'TOO_LOW', or a number between 150 and 190 in steps of 5 (150, 155, 160, etc.).
I am trying to draw a barplot that shows the number of times each of these values appears in the column, including each individual number.
So the x-axis categories should be: 'NOT_TESTED', 'NOT_COMPLETED', 'TOO_LOW', 150, 155, 160 and so on.
The bar height should be the number of times the value appears in the column.
This is the code that has gotten me closest to my goal. However, every number (150-190) gets a count of 1, so all of those bars have the same height.
This does not reflect the data, and I do not know how to move forward.
I am new to this, so any guidance would be greatly appreciated!
num_range = list(range(150,191, 5))
OUTCOMES = ['NOT_TESTED', 'NOT_COMPLETED', 'TOO_LOW']
OUTCOMES.extend(num_range)
df = df.append(pd.DataFrame(num_range,
columns=['PT1']),
ignore_index = True)
df["outcomes_col"] = df["PT1"].astype ("category")
df["outcomes_col"].cat.set_categories(OUTCOMES , inplace = True )
sns.countplot(x= "outcomes_col", data=df, palette='Magma')
plt.xticks(rotation = 90)
plt.ylabel('Amount')
plt.xlabel('Outcomes')
plt.title("Outcomes per Testing")
plt.show()
pd.DataFrame({'ID': {0: 'GF342', 1: 'IF874', 2: 'FH386', 3: 'KJ190', 4: 'TY748', 5: 'YT947', 6: 'DF063', 7: 'ET512', 8: 'GC714', 9: 'SD978', 10: 'EF472', 11: 'PL489', 12: 'AZ315', 13: 'OL821', 14: 'HN765', 15: 'ED589'}, 'Location': {0: 'Q1', 1: 'Q3', 2: 'Q1', 3: 'Q3', 4: 'Q3', 5: 'Q4', 6: 'Q3', 7: 'Q1', 8: 'Q2', 9: 'Q3', 10: 'Q1', 11: 'Q2', 12: 'Q1', 13: 'Q1', 14: 'Q3', 15: 'Q1'}, 'NEW': {0: 'YES', 1: 'NO', 2: 'NO', 3: 'YES', 4: 'YES', 5: 'NO', 6: 'NO', 7: 'YES', 8: 'NO', 9: 'NO', 10: 'NO', 11: 'YES', 12: 'NO', 13: 'YES', 14: 'YES', 15: 'YES'}, 'YEAR': {0: 2021, 1: 2018, 2: 2019, 3: 2021, 4: 2021, 5: 2019, 6: 2019, 7: 2021, 8: 2018, 9: 2019, 10: 2018, 11: 2021, 12: 2018, 13: 2021, 14: 2021, 15: 2021}, 'PT1': {0: '', 1: 'NOT_TESTED', 2: '', 3: 'NOT_FINISHED', 4: '165', 5: '', 6: '180', 7: '145', 8: '155', 9: '', 10: '', 11: '', 12: 'TOO_LOW', 13: '150', 14: '155', 15: ''}, 'PT2': {0: '', 1: '', 2: '', 3: '', 4: '', 5: 'TOO_LOW', 6: '', 7: '', 8: '160', 9: 'TOO_LOW', 10: '', 11: '', 12: '', 13: '', 14: '', 15: ''}, 'PT3': {0: '', 1: 'TOO_LOW', 2: '', 3: 'TOO_LOW', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: 'NOT_FINISHED', 12: '', 13: '185', 14: '', 15: '165'}, 'PT4': {0: '', 1: '', 2: '', 3: '', 4: '', 5: 165.0, 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: 180.0, 13: '', 14: '', 15: ''}})
This is not the whole dataset, just part of it.

Starting from this dataframe
(I replaced NOT_FINISHED with NOT_COMPLETED to match the code in your question; let me know if this replacement is correct):
    ID     Location  NEW  YEAR  PT1            PT2      PT3            PT4
0   GF342  Q1        YES  2021
1   IF874  Q3        NO   2018  NOT_TESTED              TOO_LOW
2   FH386  Q1        NO   2019
3   KJ190  Q3        YES  2021  NOT_COMPLETED           TOO_LOW
4   TY748  Q3        YES  2021  165
5   YT947  Q4        NO   2019                 TOO_LOW                 165
6   DF063  Q3        NO   2019  180
7   ET512  Q1        YES  2021  145
8   GC714  Q2        NO   2018  155            160
9   SD978  Q3        NO   2019                 TOO_LOW
10  EF472  Q1        NO   2018
11  PL489  Q2        YES  2021                          NOT_COMPLETED
12  AZ315  Q1        NO   2018  TOO_LOW                                180
13  OL821  Q1        YES  2021  150                     185
14  HN765  Q3        YES  2021  155
15  ED589  Q1        YES  2021                          165
If you are interested in a count plot of the 'PT1' column, you first have to define the categories to be plotted. You can use pandas.CategoricalDtype, which also lets you fix the order of the categories. The numeric labels must be strings, because the values in 'PT1' are stored as strings.
So you define a new 'outcomes_col' column:
num_range = list(range(150,191, 5))
OUTCOMES = ['NOT_TESTED', 'NOT_COMPLETED', 'TOO_LOW']
OUTCOMES.extend([str(num) for num in num_range])
OUTCOMES = CategoricalDtype(OUTCOMES, ordered = True)
df["outcomes_col"] = df["PT1"].astype(OUTCOMES)
Then you can proceed to plot this column:
# palette names are case-sensitive: 'magma', not 'Magma'
sns.countplot(x="outcomes_col", data=df, palette='magma')
plt.xticks(rotation = 90)
plt.ylabel('Amount')
plt.xlabel('Outcomes')
plt.title("Outcomes per Testing")
plt.show()
Complete Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.api.types import CategoricalDtype
df = pd.DataFrame({'ID': {0: 'GF342', 1: 'IF874', 2: 'FH386', 3: 'KJ190', 4: 'TY748', 5: 'YT947', 6: 'DF063', 7: 'ET512', 8: 'GC714', 9: 'SD978', 10: 'EF472', 11: 'PL489', 12: 'AZ315', 13: 'OL821', 14: 'HN765', 15: 'ED589'}, 'Location': {0: 'Q1', 1: 'Q3', 2: 'Q1', 3: 'Q3', 4: 'Q3', 5: 'Q4', 6: 'Q3', 7: 'Q1', 8: 'Q2', 9: 'Q3', 10: 'Q1', 11: 'Q2', 12: 'Q1', 13: 'Q1', 14: 'Q3', 15: 'Q1'}, 'NEW': {0: 'YES', 1: 'NO', 2: 'NO', 3: 'YES', 4: 'YES', 5: 'NO', 6: 'NO', 7: 'YES', 8: 'NO', 9: 'NO', 10: 'NO', 11: 'YES', 12: 'NO', 13: 'YES', 14: 'YES', 15: 'YES'}, 'YEAR': {0: 2021, 1: 2018, 2: 2019, 3: 2021, 4: 2021, 5: 2019, 6: 2019, 7: 2021, 8: 2018, 9: 2019, 10: 2018, 11: 2021, 12: 2018, 13: 2021, 14: 2021, 15: 2021}, 'PT1': {0: '', 1: 'NOT_TESTED', 2: '', 3: 'NOT_COMPLETED', 4: '165', 5: '', 6: '180', 7: '145', 8: '155', 9: '', 10: '', 11: '', 12: 'TOO_LOW', 13: '150', 14: '155', 15: ''}, 'PT2': {0: '', 1: '', 2: '', 3: '', 4: '', 5: 'TOO_LOW', 6: '', 7: '', 8: '160', 9: 'TOO_LOW', 10: '', 11: '', 12: '', 13: '', 14: '', 15: ''}, 'PT3': {0: '', 1: 'TOO_LOW', 2: '', 3: 'TOO_LOW', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: 'NOT_COMPLETED', 12: '', 13: '185', 14: '', 15: '165'}, 'PT4': {0: '', 1: '', 2: '', 3: '', 4: '', 5: 165.0, 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: 180.0, 13: '', 14: '', 15: ''}})
num_range = list(range(150,191, 5))
OUTCOMES = ['NOT_TESTED', 'NOT_COMPLETED', 'TOO_LOW']
OUTCOMES.extend([str(num) for num in num_range])
OUTCOMES = CategoricalDtype(OUTCOMES, ordered = True)
df["outcomes_col"] = df["PT1"].astype(OUTCOMES)
# palette names are case-sensitive: 'magma', not 'Magma'
sns.countplot(x="outcomes_col", data=df, palette='magma')
plt.xticks(rotation = 90)
plt.ylabel('Amount')
plt.xlabel('Outcomes')
plt.title("Outcomes per Testing")
plt.show()
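As a quick cross-check of the bar heights, the same counts can be computed without plotting. This is only a minimal sketch, assuming just the 'PT1' values from the sample data above; the names pt1, labels and counts are made up for the illustration. It uses value_counts followed by reindex, which is a different route from the CategoricalDtype approach in the answer:

```python
import pandas as pd

# 'PT1' values copied from the sample data (NOT_FINISHED already replaced)
pt1 = pd.Series(['', 'NOT_TESTED', '', 'NOT_COMPLETED', '165', '', '180', '145',
                 '155', '', '', '', 'TOO_LOW', '150', '155', ''])

# The labels to show on the x-axis, in display order; numbers as strings
labels = ['NOT_TESTED', 'NOT_COMPLETED', 'TOO_LOW'] + [str(n) for n in range(150, 191, 5)]

# Count every distinct value, then keep only the wanted labels,
# filling labels that never occur with 0
counts = pt1.value_counts().reindex(labels, fill_value=0)
print(counts)
```

Values outside the label list, such as '145' and the empty string, are dropped by the reindex. counts.plot.bar(rot=90) would draw the same bars with plain pandas plotting.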

Related

Boolean mask unexpected behavior when applying style

I am processing data where values may have the format '<x', and in that case I want to return x/2, so '<5' would be returned as 2.5. I have columns of mixed numbers and text. The problem is that I want to style the values that have been changed. Dummy data and code:
dummy={'Location': {0: 'Perth', 1: 'Perth', 2: 'Perth', 3: 'Perth', 4: 'Perth', 5: 'Perth', 6: 'Perth', 7: 'Perth', 8: 'Perth', 9: 'Perth', 10: 'Perth', 11: 'Perth', 12: 'Perth', 13: 'Perth', 14: 'Perth', 15: 'Perth', 16: 'Perth', 17: 'Perth'}, 'Date': {0: '11/01/2012 0:00', 1: '11/01/2012 0:00', 2: '20/03/2012 0:00', 3: '6/06/2012 0:00', 4: '14/09/2012 0:00', 5: '17/12/2013 0:00', 6: '1/02/2014 0:00', 7: '1/02/2014 0:00', 8: '1/02/2014 0:00', 9: '1/02/2014 0:00', 10: '1/02/2014 0:00', 11: '1/02/2014 0:00', 12: '1/02/2014 0:00', 13: '1/02/2014 0:00', 14: '1/02/2014 0:00', 15: '1/02/2014 0:00', 16: '1/02/2014 0:00', 17: '1/02/2014 0:00'}, 'As µg/L': {0: '9630', 1: '9630', 2: '8580', 3: '4990', 4: '6100', 5: '282', 6: '21', 7: '<1', 8: '<1', 9: '<1', 10: '<1', 11: '<1', 12: '<1', 13: '<1', 14: '<1', 15: '<1', 16: '<1', 17: '<1'}, 'As': {0: '9.63', 1: '9.63', 2: '8.58', 3: '4.99', 4: '6.1', 5: '0.282', 6: '0.021', 7: '<1', 8: '<1', 9: '<1', 10: '<1', 11: '<1', 12: '<1', 13: '<1', 14: '<1', 15: '<1', 16: '<1', 17: '10'}, 'Ba': {0: 1000.0, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'HCO3': {0: '10.00', 1: '0.50', 2: '0.50', 3: '<22', 4: '0.50', 5: '0.50', 6: '0.50', 7: np.nan, 8: np.nan, 9: np.nan, 10: '0.50', 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Cd': {0: 0.0094, 1: 0.0094, 2: 0.011, 3: 0.0035, 4: 0.004, 5: 0.002, 6: 0.0019, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Ca': {0: 248.0, 1: 248.0, 2: 232.0, 3: 108.0, 4: 150.0, 5: 396.0, 6: 472.0, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: np.nan, 14: 472.0, 15: np.nan, 16: np.nan, 17: np.nan}, 'CO3': {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5, 7: np.nan, 8: np.nan, 9: 0.5, 10: np.nan, 11: np.nan, 
12: np.nan, 13: np.nan, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}, 'Cl': {0: 2.0, 1: 2.0, 2: 2.0, 3: 2.0, 4: 0.5, 5: 2.0, 6: 5.0, 7: np.nan, 8: np.nan, 9: np.nan, 10: np.nan, 11: np.nan, 12: np.nan, 13: 5.0, 14: np.nan, 15: np.nan, 16: np.nan, 17: np.nan}}
import pandas as pd
import numpy as np  # both imports must run before the dummy dict above, which uses np.nan
df = pd.DataFrame(dummy)
mask = df.applymap(lambda x: (isinstance(x, str) and x.startswith('<')))
def remove_less_thans(x):
    if type(x) is int:
        return x
    elif type(x) is float:
        return x
    elif type(x) is str and x[0]=="<":
        try:
            return float(x[1:])/2
        except:
            return x
    elif type(x) is str and len(x)<10:
        try:
            return float(x)
        except:
            return x
    else:
        return x
def colour_mask(val):
    colour='color: red; font-weight: bold' if val in df.values[mask] else ''
    return colour
#perform remove less-thans and divide the remainder by two
df=df.applymap(remove_less_thans)
styled_df= df.style.applymap(colour_mask)
styled_df
The mask looks correct and the remove_less_thans function works, but some values are styled when they shouldn't be. In the dummy data, the 0.5 values in the HCO3 column are styled even though they do not start with < and do not appear as True in the mask. I know that they are numbers stored as text, but that is how the real data might appear, and the mask is constructed as expected (the one True is there and the rest of the values in the column are False), so I don't know why they are being styled. The same happens in column CO3: all the non-NaN values are styled when none should be. Why is this happening, and how do I fix it?
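colour_mask checks val in df.values[mask], which tests membership by value rather than by position: once '<1' has been converted to 0.5, every cell equal to 0.5 matches, wherever it is. The idea is to pass the mask itself to Styler.apply with numpy.where:
def colour_mask(x):
    arr = np.where(mask, 'color: red; font-weight: bold', '')
    return pd.DataFrame(arr, index=x.index, columns=x.columns)
styled_df = df.style.apply(colour_mask, axis=None)
Or:
def colour_mask(x, props=''):
    return np.where(mask, props, '')
styled_df = df.style.apply(colour_mask, props='color: red; font-weight: bold', axis=None)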
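The difference between matching by value and matching by position can be seen on a tiny frame. This is a sketch under stated assumptions: the two-column frame, the helper halve_less_than and the names by_value and styles are invented for the illustration, and the Styler call itself is left out so the snippet runs with pandas and NumPy alone:

```python
import numpy as np
import pandas as pd

def halve_less_than(x):
    """Turn '<n' into n/2 and other numeric strings into floats (illustrative helper)."""
    if isinstance(x, str) and x.startswith("<"):
        return float(x[1:]) / 2
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

raw = pd.DataFrame({"As": ["<1", "10"], "HCO3": ["0.50", "<22"]})

# Positional mask: True only where the ORIGINAL cell started with '<'
mask = raw.apply(lambda col: col.map(lambda x: isinstance(x, str) and x.startswith("<")))
converted = raw.apply(lambda col: col.map(halve_less_than))

# Value-based test, as in the question: any cell equal to a changed value matches
changed_values = converted.to_numpy()[mask.to_numpy(dtype=bool)]
by_value = converted.isin(changed_values)

# Position-based styles, as in the answer: one CSS string per cell
styles = pd.DataFrame(
    np.where(mask, "color: red; font-weight: bold", ""),
    index=raw.index, columns=raw.columns,
)
print(by_value)
print(styles)
```

Here the HCO3 value '0.50' becomes 0.5, the same number that '<1' turns into, so by_value flags it while mask and styles do not.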

Pandas - Group by multiple columns and datetime

I have a df of tennis results and I would like to be able to see how many days it has been since each player last won a game.
This is what my df looks like:
Player 1  Player 2  Date        p1_win  p2_win
Murray    Nadal     2022-05-16  1       0
Nadal     Murray    2022-05-25  1       0
and this is what I want it to look like
Player 1  Player 2  Date        p1_win  p2_win  p1_lastwin  p2_lastwin
Murray    Nadal     2022-05-16  1       0       na          na
Nadal     Murray    2022-05-25  1       0       na          9
The result needs to include the days since the last win regardless of whether the player was listed as player 1 or player 2; I think this needs a group by. If possible, it would also be good to have a win percentage for the year.
Any help is much appreciated.
Edit: here is the dict:
{'Player 1': {0: 'Murray',
1: 'Nadal',
2: 'Murray',
3: 'Nadal',
4: 'Murray',
5: 'Nadal',
6: 'Murray',
7: 'Nadal',
8: 'Murray',
9: 'Nadal',
10: 'Murray'},
'Player 2': {0: 'Nadal',
1: 'Murray',
2: 'Nadal',
3: 'Murray',
4: 'Nadal',
5: 'Murray',
6: 'Nadal',
7: 'Murray',
8: 'Nadal',
9: 'Murray',
10: 'Nadal'},
'Date': {0: '2022-05-16',
1: '2022-05-26',
2: '2022-05-27',
3: '2022-05-28',
4: '2022-05-29',
5: '2022-06-01',
6: '2022-06-02',
7: '2022-06-05',
8: '2022-06-09',
9: '2022-06-13',
10: '2022-06-17'},
'p1_win': {0: '1',
1: '1',
2: '0',
3: '1',
4: '0',
5: '0',
6: '1',
7: '0',
8: '1',
9: '0',
10: '1'},
'p2_win': {0: '0',
1: '0',
2: '1',
3: '0',
4: '1',
5: '1',
6: '0',
7: '1',
8: '0',
9: '1',
10: '0'}}
Thanks :)
I used pd.merge_asof to find the latest win before each match, and then merged on the relevant index to get its date.
import pandas as pd
df = pd.DataFrame({'Player 1': {0: 'Murray', 1: 'Nadal', 2: 'Murray', 3: 'Nadal', 4: 'Murray', 5: 'Nadal', 6: 'Murray'}, 'Player 2': {0: 'Nadal', 1: 'Murray', 2: 'Nadal', 3: 'Murray', 4: 'Nadal', 5: 'Murray', 6: 'Nadal'}, 'Date': {0: '2022-05-16', 1: '2022-05-26', 2: '2022-05-27', 3: '2022-05-28', 4: '2022-05-29', 5: '2022-06-01', 6: '2022-06-02'}, 'p1_win': {0: '1', 1: '1', 2: '0', 3: '1', 4: '0', 5: '0', 6: '1'}, 'p2_win': {0: '0', 1: '0', 2: '1', 3: '0', 4: '1', 5: '1', 6: '0'}})
df['p1_win']=df.p1_win.astype(int)
df['p2_win']=df.p2_win.astype(int)
df['Date'] = pd.to_datetime(df['Date'])
df['match'] = [x+'_'+y if x>y else y+'_'+x for x,y in zip(df['Player 1'],df['Player 2'])]
# df['winner'] = np.where(df.p1_win==1,df['Player 1'],df['Player 2'])
# df['looser'] = np.where(df.p1_win==0,df['Player 1'],df['Player 2'])
df = df.reset_index()
df = df.sort_values(by='Date')
df = pd.merge_asof(df,df[df.p1_win==1][['match','Date','index']],by=['match'],on='Date',suffixes=('','_latest_win_p1'),allow_exact_matches=False,direction='backward')
df = pd.merge_asof(df,df[df.p2_win==1][['match','Date','index']],by=['match'],on='Date',suffixes=('','_latest_win_p2'),allow_exact_matches=False,direction='backward')
# df = df[['index','Date','Player 1','Player 2','p1_win','p2_win','match','winner','looser','index_latest_win_p2','index_latest_win_p1']]
df = df.merge(df[['Date','index','match']],how='left',left_on=['index_latest_win_p1','match'],right_on=['index','match'],suffixes=('','_latest_win_winner'))
df = df.merge(df[['Date','index','match']],how='left',left_on=['index_latest_win_p2','match'],right_on=['index','match'],suffixes=('','_latest_win_looser'))
df['days_since_last_win_winner'] = (df['Date']-df.Date_latest_win_winner).dt.days
df['days_since_last_win_looser'] = (df['Date']-df.Date_latest_win_looser).dt.days
I am not sure that this is exactly what you meant, but let me know if you need anything else.
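A different route to the p1_lastwin / p2_lastwin columns from the question is to reshape the data to one row per player per match, so that it no longer matters whether a player was listed first or second. This is a sketch, not the merge_asof method above; it assumes the first four rows of the dict from the question, and the names long, slot, win_date and last_win are made up for the illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Player 1": ["Murray", "Nadal", "Murray", "Nadal"],
    "Player 2": ["Nadal", "Murray", "Nadal", "Murray"],
    "Date": ["2022-05-16", "2022-05-26", "2022-05-27", "2022-05-28"],
    "p1_win": ["1", "1", "0", "1"],
    "p2_win": ["0", "0", "1", "0"],
})
df["Date"] = pd.to_datetime(df["Date"])
df[["p1_win", "p2_win"]] = df[["p1_win", "p2_win"]].astype(int)

# One row per (match, player); 'slot' remembers which side the player was on
p1 = df[["Date", "Player 1", "p1_win"]].rename(
    columns={"Player 1": "player", "p1_win": "win"}).assign(slot="p1")
p2 = df[["Date", "Player 2", "p2_win"]].rename(
    columns={"Player 2": "player", "p2_win": "win"}).assign(slot="p2")
long = pd.concat([p1, p2])
long["match"] = long.index  # original row number of the match
long = long.sort_values(["player", "Date"])

# Date of the match if it was a win, NaT otherwise
long["win_date"] = long["Date"].where(long["win"] == 1)
# Most recent win strictly before the current match, per player
long["last_win"] = long.groupby("player")["win_date"].transform(
    lambda s: s.shift().ffill())
long["days"] = (long["Date"] - long["last_win"]).dt.days

# Back to one row per match, one column per side
wide = long.pivot(index="match", columns="slot", values="days")
df["p1_lastwin"] = wide["p1"]
df["p2_lastwin"] = wide["p2"]
print(df)
```

Players with no earlier win get NaN, which corresponds to the "na" entries in the desired output.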

Python Filter Dataframe with Dynamic arguments

Hi, I want to filter a dataframe dynamically from command-line arguments.
This is my idea so far:
tr = pd.read_csv("sales.csv")
def filtr(*arg2):
    fltr = tr.loc[(tr[arg2[0]] arg2[1] arg2[2]) arg2[3] ....]
    print(fltr)
filtr(*sys.argv[1:])
## python test.py "Unit Cost" "==" 4 & .......
I had the idea of using (tr[arg2[0]] arg2[1] arg2[2]) as the body and iterating over it, but I don't know how.
edit: Data Example:
{'Region': {0: 'Sub-Saharan Africa', 1: 'Europe', 2: 'Middle East and North Africa', 3: 'Sub-Saharan Africa', 4: 'Europe', 5: 'Sub-Saharan Africa', 6: 'Asia', 7: 'Asia', 8: 'Sub-Saharan Africa', 9: 'Central America and the Caribbean', 10: 'Sub-Saharan Africa', 11: 'Europe', 12: 'Europe', 13: 'Asia', 14: 'Middle East and North Africa', 15: 'Australia and Oceania', 16: 'Central America and the Caribbean', 17: 'Europe', 18: 'Middle East and North Africa', 19: 'Europe'}, 'Country': {0: 'Chad', 1: 'Latvia', 2: 'Pakistan', 3: 'Democratic Republic of the Congo', 4: 'Czech Republic', 5: 'South Africa', 6: 'Laos', 7: 'China', 8: 'Eritrea', 9: 'Haiti', 10: 'Zambia', 11: 'Bosnia and Herzegovina', 12: 'Germany', 13: 'India', 14: 'Algeria', 15: 'Palau', 16: 'Cuba', 17: 'Vatican City', 18: 'Lebanon', 19: 'Lithuania'}, 'Item Type': {0: 'Office Supplies', 1: 'Beverages', 2: 'Vegetables', 3: 'Household', 4: 'Beverages', 5: 'Beverages', 6: 'Vegetables', 7: 'Baby Food', 8: 'Meat', 9: 'Office Supplies', 10: 'Cereal', 11: 'Baby Food', 12: 'Office Supplies', 13: 'Household', 14: 'Clothes', 15: 'Snacks', 16: 'Beverages', 17: 'Beverages', 18: 'Personal Care', 19: 'Snacks'}, 'Sales Channel': {0: 'Online', 1: 'Online', 2: 'Offline', 3: 'Online', 4: 'Online', 5: 'Offline', 6: 'Online', 7: 'Online', 8: 'Online', 9: 'Online', 10: 'Offline', 11: 'Offline', 12: 'Online', 13: 'Online', 14: 'Offline', 15: 'Offline', 16: 'Online', 17: 'Online', 18: 'Offline', 19: 'Offline'}, 'Order Priority': {0: 'L', 1: 'C', 2: 'C', 3: 'C', 4: 'C', 5: 'H', 6: 'L', 7: 'C', 8: 'L', 9: 'C', 10: 'M', 11: 'M', 12: 'C', 13: 'C', 14: 'C', 15: 'L', 16: 'H', 17: 'L', 18: 'H', 19: 'H'}, 'Order Date': {0: '1/27/2011', 1: '12/28/2015', 2: '1/13/2011', 3: '9/11/2012', 4: '10/27/2015', 5: '7/10/2012', 6: '2/20/2011', 7: '4/10/2017', 8: '11/21/2014', 9: '7/4/2015', 10: '7/26/2016', 11: '10/20/2012', 12: '2/22/2015', 13: '8/27/2016', 14: '6/21/2011', 15: '9/19/2013', 16: '11/15/2015', 17: '4/6/2015', 18: '4/12/2010', 19: 
'9/26/2011'}, 'Order ID': {0: 292494523, 1: 361825549, 2: 141515767, 3: 500364005, 4: 127481591, 5: 482292354, 6: 844532620, 7: 564251220, 8: 411809480, 9: 327881228, 10: 773452794, 11: 479823005, 12: 498603188, 13: 151717174, 14: 181401288, 15: 500204360, 16: 640987718, 17: 206925189, 18: 221503102, 19: 878520286}, 'Ship Date': {0: '2/12/2011', 1: '1/23/2016', 2: '2/1/2011', 3: '10/6/2012', 4: '12/5/2015', 5: '8/21/2012', 6: '3/20/2011', 7: '5/12/2017', 8: '1/10/2015', 9: '7/20/2015', 10: '8/24/2016', 11: '11/15/2012', 12: '2/27/2015', 13: '9/2/2016', 14: '7/21/2011', 15: '10/4/2013', 16: '11/30/2015', 17: '4/27/2015', 18: '5/19/2010', 19: '10/2/2011'}, 'Units Sold': {0: 4484, 1: 1075, 2: 6515, 3: 7683, 4: 3491, 5: 9880, 6: 4825, 7: 3330, 8: 2431, 9: 6197, 10: 724, 11: 9145, 12: 6618, 13: 5338, 14: 9527, 15: 441, 16: 1365, 17: 2617, 18: 6545, 19: 2530}, 'Unit Price': {0: 651.21, 1: 47.45, 2: 154.06, 3: 668.27, 4: 47.45, 5: 47.45, 6: 154.06, 7: 255.28, 8: 421.89, 9: 651.21, 10: 205.7, 11: 255.28, 12: 651.21, 13: 668.27, 14: 109.28, 15: 152.58, 16: 47.45, 17: 47.45, 18: 81.73, 19: 152.58}, 'Unit Cost': {0: 524.96, 1: 31.79, 2: 90.93, 3: 502.54, 4: 31.79, 5: 31.79, 6: 90.93, 7: 159.42, 8: 364.69, 9: 524.96, 10: 117.11, 11: 159.42, 12: 524.96, 13: 502.54, 14: 35.84, 15: 97.44, 16: 31.79, 17: 31.79, 18: 56.67, 19: 97.44}, 'Total Revenue': {0: 2920025.64, 1: 51008.75, 2: 1003700.9, 3: 5134318.41, 4: 165647.95, 5: 468806.0, 6: 743339.5, 7: 850082.4, 8: 1025614.59, 9: 4035548.37, 10: 148926.8, 11: 2334535.6, 12: 4309707.78, 13: 3567225.26, 14: 1041110.56, 15: 67287.78, 16: 64769.25, 17: 124176.65, 18: 534922.85, 19: 386027.4}, 'Total Cost': {0: 2353920.64, 1: 34174.25, 2: 592408.95, 3: 3861014.82, 4: 110978.89, 5: 314085.2, 6: 438737.25, 7: 530868.6, 8: 886561.39, 9: 3253177.12, 10: 84787.64, 11: 1457895.9, 12: 3474185.28, 13: 2682558.52, 14: 341447.68, 15: 42971.04, 16: 43393.35, 17: 83194.43, 18: 370905.15, 19: 246523.2}, 'Total Profit': {0: 566105.0, 1: 16834.5, 2: 
411291.95, 3: 1273303.59, 4: 54669.06, 5: 154720.8, 6: 304602.25, 7: 319213.8, 8: 139053.2, 9: 782371.25, 10: 64139.16, 11: 876639.7, 12: 835522.5, 13: 884666.74, 14: 699662.88, 15: 24316.74, 16: 21375.9, 17: 40982.22, 18: 164017.7, 19: 139504.2}}
You can use eval(); here is the code:
import pandas as pd
def filter_df(df, args_list):
    constraints = []
    for a in args_list:
        col = a[0]
        symbol = a[1]
        value = a[2]
        constraint = "(df.{}{}{})".format(col, symbol, value)
        constraints.append(constraint)
    filter_str = "&".join(constraints)
    return df[eval(filter_str)]
data = {
    "COL_A": [1,2,3,2,4,6],
    "COL_B": [1,10,100,20,20,40],
    "COL_C": ["aaa", "bbb", "zzz", "xxx", "xxx", "xxx"]
}
df = pd.DataFrame(data)
args_list = [["COL_A", "<=", "4"], ["COL_C", "==", "'xxx'"]]
df2 = filter_df(df, args_list)
With the filter COL_A <= 4 & COL_C == 'xxx', df2 keeps only the rows at index 3 and 4 of df.
How about this?
def filter(df, **args):
    conditions = args["args"]
    for key, value in conditions.items():
        df = df[df[key] > value]
    return df
Invoke it using:
df = filter(df, args={"Unit Cost": 500, "Unit Price": 500})
Result:
print(df.shape)
(5, 14)
Note: This approach only works when every condition uses >. If you need to mix several operators, you will need a different approach.
def filter_df(arg2):
    if arg2[1]==">":
        return tr.loc[(tr[arg2[0]] > int(arg2[2]))]
    elif arg2[1]=="<":
        return tr.loc[(tr[arg2[0]] < int(arg2[2]))]
    elif arg2[1]=="=":
        return tr.loc[(tr[arg2[0]] == int(arg2[2]))]
    else:
        raise ValueError("invalid comparison: %s"%arg2[1])
filter_df(arg2)
Now if, for example, arg2 = ('Unit Cost', '>', '500'), the function returns only the rows with Unit Cost > 500.
If you want to pass multiple conditions, it is more complicated; my suggestion is to apply them one at a time.
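Several conditions with mixed operators can also be combined without eval() by looking each operator up in a table. This is only a sketch of that idea, assuming the arguments arrive as flat (column, operator, value) triples and are all joined with AND; OPS, parse_conditions and the three-row sales frame are hypothetical names and data for the illustration:

```python
import operator
from functools import reduce

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Map the operator strings accepted on the command line to functions
OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}

def parse_conditions(df, args):
    """Group flat strings into (column, op, value) triples.

    Values for numeric columns are converted to float; others stay strings.
    """
    triples = []
    for col, op, raw in zip(args[0::3], args[1::3], args[2::3]):
        value = float(raw) if is_numeric_dtype(df[col]) else raw
        triples.append((col, op, value))
    return triples

def filter_df(df, conditions):
    """Keep the rows that satisfy every (column, op, value) condition."""
    masks = [OPS[op](df[col], value) for col, op, value in conditions]
    return df[reduce(operator.and_, masks)] if masks else df

sales = pd.DataFrame({
    "Unit Cost": [524.96, 31.79, 90.93],
    "Sales Channel": ["Online", "Online", "Offline"],
})
# Stands in for sys.argv[1:]
args = ["Unit Cost", ">", "50", "Sales Channel", "==", "Online"]
result = filter_df(sales, parse_conditions(sales, args))
print(result)
```

An operator that is not in OPS raises a KeyError, so arbitrary input is never executed as code.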

Convert column to date format

I am trying to convert the date to a correct date format. I have tested some of the possibilities that I have read in the forum, but I still don't know how to tackle this issue:
After importing:
df = pd.read_excel(r'/path/df_datetime.xlsb', sheet_name="12FEB22", engine='pyxlsb')
I get the following df:
{'Unnamed: 0': {0: 'Administrative ID',
1: '000002191',
2: '000002382',
3: '000002434',
4: '000002728',
5: '000002826',
6: '000003265',
7: '000004106',
8: '000004333'},
'Unnamed: 1': {0: 'Service',
1: 'generic',
2: 'generic',
3: 'generic',
4: 'generic',
5: 'generic',
6: 'generic',
7: 'generic',
8: 'generic'},
'Unnamed: 2': {0: 'Movement type',
1: 'New',
2: 'New',
3: 'New',
4: 'Modify',
5: 'New',
6: 'New',
7: 'New',
8: 'New'},
'Unnamed: 3': {0: 'Date',
1: 37503,
2: 37475,
3: 37453,
4: 44186,
5: 37711,
6: 37658,
7: 37770,
8: 37820},
'Unnamed: 4': {0: 'Contract Term',
1: '12',
2: '12',
3: '12',
4: '12',
5: '12',
6: '12',
7: '12',
8: '12'}}
However, even though I have tried to convert the 'Date' column (or 'Unnamed: 3', because the header is not read from the first row of the original dataset, so I have to set it afterwards) during the import, it has been unsuccessful.
Is there anything I can do?
Thanks!
Try this:
from xlrd import xldate_as_datetime
def trans_date(x):
    # Excel stores dates as serial day numbers; leave the header string untouched
    if isinstance(x, int):
        return xldate_as_datetime(x, 0).date()
    else:
        return x
print(df['Unnamed: 3'].apply(trans_date))
>>>
0          Date
1    2002-09-04
2    2002-08-07
3    2002-07-16
4    2020-12-21
5    2003-03-31
6    2003-02-06
7    2003-05-29
8    2003-07-18
Name: Unnamed: 3, dtype: object
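The same conversion can be sketched with pandas alone, after promoting the first row to the header. This assumes the numbers are Excel serial dates in the 1900 date system, for which pd.to_datetime with unit="D" and origin="1899-12-30" is the usual conversion; the frame raw below holds only two of the rows and two of the columns from the question:

```python
import pandas as pd

# Cut-down version of the imported frame: the real header sits in row 0
raw = pd.DataFrame({
    "Unnamed: 0": ["Administrative ID", "000002191", "000002728"],
    "Unnamed: 3": ["Date", 37503, 44186],
})

# Promote the first row to column names and drop it from the data
df = raw.iloc[1:].reset_index(drop=True)
df.columns = raw.iloc[0].tolist()

# Excel serial day numbers count days from 1899-12-30
df["Date"] = pd.to_datetime(df["Date"].astype(int), unit="D", origin="1899-12-30")
print(df)
```

This leaves 'Date' as a datetime64 column, whereas the apply-based version above returns Python date objects in an object column.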

Default value not invoked when using np.select

I'm using np.select to evaluate a few conditions and am trying to assign a default value equal to the previous value in the column.
For example, if
row[i-1] = True
row[i] = NaN
then
row[i] = True
I have used the following lines:
entry_conditions = [
    (df['Close'] > df['Open'] + 100),
    (df['Close'] < df['Open'] - 100)
]
entry_choices = [
    True, False
]
df['entry'] = np.nan
# Need to initialize the column with NaN, or else it throws an error because evaluating the first row triggers the default value
df['entry'] = np.select(entry_conditions, entry_choices, default=df['entry'].shift(1))
Sample output of df['entry']
True,
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
'nan',
True,
'nan',
'nan',
'nan',
True,
I don't understand why, even though a default value is specified, the column still shows nan in the final output.
Sample data obtained with df.to_dict():
{'Date': {1: Timestamp('2021-01-01 09:30:00'),
2: Timestamp('2021-01-01 09:45:00'),
3: Timestamp('2021-01-01 10:00:00'),
4: Timestamp('2021-01-01 10:15:00'),
5: Timestamp('2021-01-01 10:30:00'),
6: Timestamp('2021-01-01 10:45:00'),
7: Timestamp('2021-01-01 11:00:00'),
8: Timestamp('2021-01-01 11:15:00'),
9: Timestamp('2021-01-01 11:30:00'),
10: Timestamp('2021-01-01 11:45:00'),
11: Timestamp('2021-01-01 12:00:00'),
12: Timestamp('2021-01-01 12:15:00'),
13: Timestamp('2021-01-01 12:30:00'),
14: Timestamp('2021-01-01 12:45:00'),
15: Timestamp('2021-01-01 13:00:00')},
'Open': {1: 31376.0,
2: 31405.0,
3: 31389.4,
4: 31377.5,
5: 31347.8,
6: 31310.8,
7: 31343.4,
8: 31349.5,
9: 31349.9,
10: 31325.1,
11: 31310.9,
12: 31329.0,
13: 31376.0,
14: 31375.5,
15: 31357.4},
'High': {1: 31425.0,
2: 31411.95,
3: 31389.45,
4: 31382.0,
5: 31350.0,
6: 31354.6,
7: 31359.0,
8: 31370.0,
9: 31364.7,
10: 31350.0,
11: 31337.9,
12: 31378.9,
13: 31419.5,
14: 31377.75,
15: 31360.0},
'Low': {1: 31367.95,
2: 31352.5,
3: 31331.65,
4: 31301.4,
5: 31303.05,
6: 31310.0,
7: 31325.05,
8: 31335.35,
9: 31315.35,
10: 31281.9,
11: 31292.0,
12: 31316.25,
13: 31352.05,
14: 31335.0,
15: 31322.0},
'Close': {1: 31398.3,
2: 31386.0,
3: 31377.0,
4: 31342.3,
5: 31311.7,
6: 31345.0,
7: 31349.0,
8: 31344.2,
9: 31327.6,
10: 31311.3,
11: 31325.6,
12: 31373.0,
13: 31375.0,
14: 31357.4,
15: 31326.0}}
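One likely explanation, offered as a sketch rather than a confirmed diagnosis: the default argument is evaluated once, before np.select runs, so df['entry'].shift(1) is just the all-NaN column shifted by one row and cannot see values that np.select has not produced yet. A common way to carry the previous signal forward is to let np.select emit NaN and then forward-fill. The Open and Close numbers below are made up so that both conditions fire:

```python
import numpy as np
import pandas as pd

# Made-up prices: row 1 triggers the first condition, row 4 the second
df = pd.DataFrame({
    "Open":  [100.0, 100.0, 100.0, 100.0, 100.0, 100.0],
    "Close": [100.0, 250.0, 120.0,  90.0, -50.0, 110.0],
})

entry_conditions = [
    df["Close"] > df["Open"] + 100,
    df["Close"] < df["Open"] - 100,
]

# 1.0 / 0.0 stand in for True / False so that NaN can mark "no signal"
signal = np.select(entry_conditions, [1.0, 0.0], default=np.nan)

# Carry the last signal forward, then map back to booleans;
# rows before the first signal stay NaN
df["entry"] = (
    pd.Series(signal, index=df.index)
    .ffill()
    .map({1.0: True, 0.0: False})
)
print(df)
```

Each row without a signal then inherits the most recent True or False above it, which is the behaviour described with row[i-1] and row[i] in the question.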
