Pandas Column based on values in other columns - python

Basically, I would like to fill in the column Discount_Sub_Dpt with 'Yes' or 'No' depending on whether there is a Discount for that Sub_Dpt for that week, EXCLUDING the product on which that row lands (for instance, I don't want any of the A rows to consider whether there is a Discount for that week for A, but only for the other products in that sub department; in most cases there is more than one other product).
I have tried using groupby with Sub_Dpt and Week to no avail.
Does anyone know how to solve this issue?
The Yellow column is obviously the desired outcome from the code.
Here is some of the code I have used. I am trying to create the column first and then update the values (but it could all potentially be wrong; also, I intentionally named the data frame df1):
df1['Discount_Sub_Dpt'] = np.where((df1['Discount']=='Yes'),'Yes','No')
grps = []
grps.append(df1.Sub_Dpt.unique())
for x in grps:
    x = str(x)
    yes_weeks = df1.loc[(df1.Discount_SubDpt == 'Yes') & (df1.Sub_Dpt_Description == x), 'Week'].unique()
    df1.loc[df1['Week'].isin(yes_weeks) & (df1['Sub_Dpt_Description'] == x), 'Discount_SubDpt'] = 'Yes'

Ok, the following is a bit crazy, but it works pretty nicely, so listen up.
First, we are going to build a NetworkX graph as follows.
import networkx as nx
import numpy as np
import pandas as pd
G = nx.Graph()
Prods = df.Product.unique()
G.add_nodes_from(Prods)
We now add edges between our nodes (which are all of the products) whenever they belong to the same sub_dpt. In this case, since A and B share a department, and C and D do, we add edges AB and CD. If we had A, B and C in the same department, we would add AB, AC, BC. Confusing, I know, but just trust me on this one.
G.add_edges_from([('A','B'),('C','D')])
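If you don't want to hard-code the edges, you can derive them from the dataframe itself; a minimal sketch, assuming df also carries the Product and Sub_Dpt columns from the question:
from itertools import combinations
# one clique per sub-department: connect every pair of products that share a Sub_Dpt
for _, prods in df.groupby('Sub_Dpt')['Product'].unique().items():
    G.add_edges_from(combinations(prods, 2))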
Now comes the fun part. We need to convert your Discount column from Yes/No to 1/0.
df['Disc2']=np.nan
df.loc[df['Discount']=='Yes','Disc2']=1
df.loc[df['Discount']=='No','Disc2']=0
Now we pivot the data
tab = df.pivot(index = 'Week',columns='Product',values = 'Disc2')
And now, we do this
tab = pd.DataFrame(np.dot(tab, nx.adjacency_matrix(G, Prods).todense()), columns=Prods, index=df.Week.unique())
tab = tab.astype(bool)  # True where some other product in the sub-department has a discount that week
df = df.merge(tab.unstack().reset_index(), left_on=['Product', 'Week'], right_on=['level_0', 'level_1'])
df['Discount_Sub_Dpt'] = df[0].map({True: 'Yes', False: 'No'})
print(df[['Product','Week','Sub_Dpt','Discount','Discount_Sub_Dpt']])
You may ask, why go through this trouble? Well, two reasons. First, it's far more stable: the other answers can't handle all possible cases of your problem. Second, it's much faster than the other solutions. I hope this helped!

You can perform a GroupBy to map ('Week', 'Sub_Dpt') to lists of 'Product' only when Discount is "Yes".
Then use a list comprehension to check if any are on Discount apart from the product in question. Finally, map a Boolean series result to "Yes" / "No".
Data from @SahilPuri.
# GroupBy only when Discount == Yes
g = df1[df1['Discount'] == 'Yes'].groupby(['Week', 'Sub_Dpt'])['Product'].unique()
# calculate index by row
idx = df1.set_index(['Week', 'Sub_Dpt']).index
# construct list of Booleans according to criteria
L = [any(x for x in g.get(i, []) if x!=j) for i, j in zip(idx, df1['Product'])]
# map Boolean to strings
df1['Discount_SubDpt'] = pd.Series(L).map({True: 'Yes', False: 'No'})
print(df1)
Product Week Sub_Dpt Discount Discount_SubDpt
0 A 1 Toys Yes No
1 A 2 Toys No Yes
2 A 3 Toys No No
3 A 4 Toys Yes Yes
4 B 1 Toys No Yes
5 B 2 Toys Yes No
6 B 3 Toys No No
7 B 4 Toys Yes Yes
8 C 1 Candy No No
9 C 2 Candy No No
10 C 3 Candy Yes No
11 C 4 Candy Yes No
12 D 1 Candy No No
13 D 2 Candy No No
14 D 3 Candy No Yes
15 D 4 Candy No Yes

Okay, this might not scale well, but should be easy to read.
df1 = pd.DataFrame(data= [[ 'A', 1, 'Toys', 'Yes', ],
[ 'A', 2, 'Toys', 'No', ],
[ 'A', 3, 'Toys', 'No', ],
[ 'A', 4, 'Toys', 'Yes', ],
[ 'B', 1, 'Toys', 'No', ],
[ 'B', 2, 'Toys', 'Yes', ],
[ 'B', 3, 'Toys', 'No', ],
[ 'B', 4, 'Toys', 'Yes', ],
[ 'C', 1, 'Candy', 'No', ],
[ 'C', 2, 'Candy', 'No', ],
[ 'C', 3, 'Candy', 'Yes', ],
[ 'C', 4, 'Candy', 'Yes', ],
[ 'D', 1, 'Candy', 'No', ],
[ 'D', 2, 'Candy', 'No', ],
[ 'D', 3, 'Candy', 'No', ],
[ 'D', 4, 'Candy', 'No', ],], columns=['Product', 'Week', 'Sub_Dpt', 'Discount'])
df2 = df1.set_index(['Product', 'Week', 'Sub_Dpt'])
products = df1.Product.unique()
df1['Discount_SubDpt'] = df1.apply(lambda x: 'Yes' if 'Yes' in df2.loc[(list(products[products != x['Product']]), x['Week'], x['Sub_Dpt']), 'Discount'].tolist() else 'No', axis=1)
The first step creates a MultiIndex DataFrame.
Next, we get the list of all products.
Next, for each row, we take the rows with the same Week and Sub_Dpt and remove the product itself.
If there is a discount anywhere in this list, we select 'Yes', else 'No'.
Edit 1:
If you don't want to create another dataframe (this saves memory, but will be a bit slower):
df1['Discount_SubDpt'] = df1.apply(lambda x: 'Yes' if 'Yes' in df1.loc[(df1['Product'] != x['Product']) & (df1['Week'] == x['Week']) & (df1['Sub_Dpt'] == x['Sub_Dpt']), 'Discount'].tolist() else 'No', axis=1)

It's late, but here's a go. I used the sample df in the comments above.
df1['dis'] = df1['Discount'].apply(lambda x: 1 if x =="Yes" else 0)
df2 = df1.groupby(['Sub_Dpt','Week'])[['dis']].sum()
df2.reset_index(inplace = True)
df3 = pd.merge(df1,df2, left_on=['Sub_Dpt','Week'], right_on =['Sub_Dpt','Week'])
df3['Discount_Sub_Dpt'] = np.where(df3['dis_x'] < df3['dis_y'], 'Yes', 'No')
df3.sort_values(by=['Product'], inplace = True)
df3
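The same idea (compare each row's own discount with the group total) can also be written without the merge, using groupby().transform; a minimal sketch, assuming df1 as above and numpy imported as np:
df1['dis'] = (df1['Discount'] == 'Yes').astype(int)
grp_total = df1.groupby(['Sub_Dpt', 'Week'])['dis'].transform('sum')
# 'Yes' whenever some other product in the same sub-department and week is discounted
df1['Discount_Sub_Dpt'] = np.where(grp_total - df1['dis'] > 0, 'Yes', 'No')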

Related

How do I create a Pivot table in Python for two categorical variables?

My data looks something like this:
Index  Job  Y    Balance
1      A    Yes  1
2      B    No   2
3      A    No   5
4      A    No   0
5      B    Yes  4
I want to summarize the data in the following format, with job in the row and Y in the column:
   Yes  No
A  1    2
B  1    1
I have tried the following code:
pivot = df.pivot_table(index =['job'], columns = ['y'], values = ['balance'], aggfunc ='count')
I am not able to run the pivot without using balance in the value parameter. How do I get the above result?
You can try this:
import pandas as pd

data = {'Index': [1, 2, 3, 4, 5],
        'job': ['A', 'B', 'A', 'A', 'B'],
        'y': ['Yes', 'No', 'No', 'No', 'Yes'],
        'Balance': [1, 2, 5, 0, 4]}
df = pd.DataFrame(data)
pivot = df.groupby(['job', 'y']).size().unstack(fill_value=0)
print(pivot)
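pd.crosstab gives the same counts in one call, if you prefer; a sketch using the same df and column names:
pivot = pd.crosstab(df['job'], df['y'])
print(pivot)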
To be able to do this, you can first group the data on the Job and Y columns with df.groupby() to get the count of yes/no, using the code below:
df2 = df.groupby(['Job', 'Y'], as_index=False).count()
Job Y Balance
0 A No 2
1 A Yes 1
2 B No 1
3 B Yes 1
You can then use df2.pivot() to pivot this grouped table into the desired format:
df2.pivot(index='Job', columns='Y', values='Balance')
Y No Yes
Job
A 2 1
B 1 1
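Alternatively, if you want to stay with pivot_table but avoid passing Balance as a value, aggfunc='size' with no values argument should also give the counts; a small sketch using the Job/Y column names from the question:
df.pivot_table(index='Job', columns='Y', aggfunc='size', fill_value=0)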

Pandas: adjust value of DataFrame that is sliced multiple times

Imagine I have the following pandas DataFrame:
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [1, 2, 3, 4, 5, 6]
})
I want to adjust the first value when type == 'B' to 999, i.e. the fourth row's value should become 999.
Initially I imagined that
df.loc[df['type'] == 'B'].iloc[0, -1] = 999
or something similar would work. But as far as I can see, slicing the df twice does not point to the original df anymore, so the value of the df is not updated.
My other attempt is
df.loc[df.loc[df['type'] == 'B'].index[0], df.columns[-1]] = 999
which works, but is quite ugly.
So I'm wondering -- what would be the best approach in such situation?
You can use idxmax, which returns the index label of the first occurrence of the maximum value. Applied to a boolean Series like this, that is the first True:
df.loc[(df['type'] == 'B').idxmax(), 'value'] = 999
Output:
type value
0 A 1
1 A 2
2 A 3
3 B 999
4 B 5
5 B 6
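One caveat: if no row matches, idxmax on an all-False boolean Series simply returns the first label, so row 0 would be overwritten silently. A guarded sketch:
mask = df['type'].eq('B')
if mask.any():
    df.loc[mask.idxmax(), 'value'] = 999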

Create new rows to pandas dataframe based on condition efficiently

I have two pandas dataframes: one with IDs and values and another that maps IDs with other IDs. The objective is to create a new dataframe that is based on df1. It loops through each sourceId in df1 and looks to df2, a mapping df, for matches in sourceId. If a match is found, a new row is created with the same value as in df1. So if multiple matches are found, the loop creates multiple rows (e.g. with ids A and C). If only one match is found (e.g. with id B), only one row is created.
The code below does exactly what I want, but it does it very slowly. In my original dataset df1 has 440K rows and df2 has mappings for thousands of different IDs - currently the code runs at only 10-25 it/s, which is far too slow.
Is there a faster way to do this that would benefit from matrix calculations/other benefits of numpy/pandas?
import pandas as pd
df1 = pd.DataFrame({
    'SourceId': ['A', 'B', 'C', 'A', 'C', 'B'],
    'value': [1, 5, 12, 30, 32, 55],
    'time': [pd.to_datetime('2020-04-04 08:49:52.166498900+0000'),
             pd.to_datetime('2020-08-14 06:12:40.860460500+0000'),
             pd.to_datetime('2020-05-13 09:20:50.052688900+0000'),
             pd.to_datetime('2020-03-09 13:55:17.335340600+0000'),
             pd.to_datetime('2020-08-14 09:30:56.359635400+0000'),
             pd.to_datetime('2020-01-31 23:03:46.539892900+0000')],
    'otherInfo': ['0A10a', '055jA', 'boAqz', '0t,m5A', '09tjq1', 'akk_1!']})
df2 = pd.DataFrame({'SourceId': ['A', 'A', 'B', 'C', 'C', 'C'],
                    'TargetId': ['A', 'Q', 'B', 'C', 'B', 'X'],
                    'trueIfMatch': [1, 0, 1, 1, 0, 0]})
df3 = pd.DataFrame()
for r in df1.itertuples():
    SourceId = r.SourceId
    value = r.value
    time = r.time
    otherInfo = r.otherInfo
    if SourceId in df2.SourceId.unique():
        entries = df2.loc[df2.SourceId == SourceId].TargetId.tolist()
        for entry in entries:
            df3 = df3.append({
                'sourceId': SourceId,
                'targetId': entry,
                'value': value,
                'time': time,
                'otherInfo': otherInfo
            }, ignore_index=True)
display(df3)
Use df.merge with sort_values:
In [2293]: df3 = df1.merge(df2, on='SourceId').sort_values('value')
In [2294]: df3
Out[2294]:
SourceId value TargetId
0 A 1 A
1 A 1 Q
4 B 5 B
6 C 12 C
7 C 12 B
8 C 12 X
2 A 30 A
3 A 30 Q
9 C 32 C
10 C 32 B
11 C 32 X
5 B 55 B
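If you also want the lower-cased column names that the loop produced (sourceId, targetId), a rename afterwards should do it:
df3 = df3.rename(columns={'SourceId': 'sourceId', 'TargetId': 'targetId'})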

pandas python flag transactions across rows

I have data as below. I would like to flag transactions when the same employee has one of ('Car Rental', 'Car Rental - Gas') in the expense type column and 'Car Mileage' on the same day - so in this case employee a's and c's transactions would be highlighted. Employee b's transactions won't be highlighted, as they don't meet the condition - he doesn't have a 'Car Mileage'.
I want the column zflag. Different numbers in that column indicate groups of instances where the above condition was met.
d = {'emp': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
     'date': ['1', '1', '1', '1', '2', '2', '2', '3', '3', '3', '3'],
     'usd': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
     'expense type': ['Car Mileage', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Rental', 'Car Rental - Gas', 'food', 'Car Mileage', 'Car Rental', 'food', 'wine'],
     'zflag': ['1', '1', '1', ' ', ' ', ' ', ' ', '2', '2', ' ', ' ']}
df = pd.DataFrame(data=d)
df
Out[253]:
date emp expense type usd zflag
0 1 a Car Mileage 1 1
1 1 a Car Rental 2 1
2 1 a Car Rental - Gas 3 1
3 1 a food 4
4 2 b Car Rental 5
5 2 b Car Rental - Gas 6
6 2 b food 7
7 3 c Car Mileage 8 2
8 3 c Car Rental 9 2
9 3 c food 10
10 3 c wine 11
I would appreciate it if I could get pointers regarding which functions to use. I am thinking of using groupby... but I'm not sure.
I understand that date+emp will be my primary key.
Here is an approach. It's not the cleanest, but what you're describing is quite specific. Some of this might be simplified with a function.
temp_df = df.groupby(["emp", "date"])["expense type"].apply(lambda x: 1 if "Car Mileage" in x.values and any([k in x.values for k in ["Car Rental", "Car Rental - Gas"]]) else 0).rename("zzflag")
temp_df = temp_df.loc[temp_df != 0].cumsum()
final_df = pd.merge(df, temp_df.reset_index(), how="left").fillna(0)
Steps:
Groupby emp/date and search for criteria, 1 if met, 0 if not
Remove rows with 0's and cumsum to produce unique values
Join back to the original frame
Edit:
To answer your question below: the join works because .reset_index() takes "emp" and "date" out of the index and turns them into columns.
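A hedged alternative sketch using transform, which also limits the flag to the car-related rows themselves (matching the desired zflag column); column and frame names as in the question, with zflag_alt as an illustrative name:
grp = df.groupby(['emp', 'date'])['expense type']
has_mileage = grp.transform(lambda s: s.eq('Car Mileage').any())
has_rental = grp.transform(lambda s: s.isin(['Car Rental', 'Car Rental - Gas']).any())
row_is_car = df['expense type'].isin(['Car Mileage', 'Car Rental', 'Car Rental - Gas'])
flagged = has_mileage & has_rental & row_is_car
# number the flagged (emp, date) groups consecutively, leave the rest blank
df['zflag_alt'] = ''
df.loc[flagged, 'zflag_alt'] = (df.loc[flagged].groupby(['emp', 'date']).ngroup() + 1).astype(str)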

Frequency tables in pandas (like plyr in R)

My problem is how to calculate frequencies on multiple variables in pandas.
I want to go from this dataframe:
d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
to the following result:
Participated OfWhichpassed
ExamenYear
2007 3 2
2008 4 3
2009 3 2
(1) One possibility I tried is to compute two dataframes and bind them:
t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)
tx = pd.concat([t1, t2] , axis = 1)
Res1 = tx['yes']
(2) The second possibility is to use an aggregation function:
import collections
dg = d1.groupby('ExamenYear')
Res2 = dg.agg({'Participated': len,'Passed': lambda x : collections.Counter(x == 'yes')[True]})
Res2.columns = ['Participated', 'OfWhichpassed']
Both ways are awkward, to say the least. How is this done properly in pandas?
P.S.: I also tried value_counts instead of collections.Counter but could not get it to work.
For reference: a few months ago I asked a similar question for R here, and plyr could help.
---- UPDATE ------
User DSM is right: there was a mistake in the desired table result.
(1) The code for option one is
t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t3 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)
Res1 = pd.DataFrame({'All': t1,
                     'OfWhichParticipated': t2['yes'],
                     'OfWhichPassed': t3['yes']})
It will produce the result
All OfWhichParticipated OfWhichPassed
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
(2) For Option 2, thanks to user herrfz, I figured out how to use value_counts, and the code will be
Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
                                     'Participated': lambda x: x.value_counts()['yes'],
                                     'Passed': lambda x: x.value_counts()['yes']})
Res2.columns = ['All', 'OfWhichParticipated', 'OfWhichPassed']
which will produce the same result as Res1
My question remains, though: using Option 2, would it be possible to use the same variable twice (for another operation)? Can one pass a custom name for the resulting variable?
---- A NEW UPDATE ----
I have finally decided to use apply, which I understand is more flexible.
I am posting what I came up with, hoping that it can be useful for others.
From what I understand from Wes' book "Python for Data Analysis":
apply is more flexible than agg and transform because you can define your own function.
The only requirement is that the function returns a pandas object or a scalar value.
The inner mechanics: the function is called on each piece of the grouped object, and the results are glued together using pandas.concat.
One needs to "hard-code" the structure you want at the end.
Here is what I came up with
def ZahlOccurence_0(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes')})
When I run it:
d1.groupby('ExamenYear').apply(ZahlOccurence_0)
I get the correct results:
All Part Pass
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
This approach would also allow me to combine frequencies with other stats
import numpy as np
d1['testValue'] = np.random.randn(len(d1))
def ZahlOccurence_1(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes'),
                      'test': x['testValue'].mean()})
d1.groupby('ExamenYear').apply(ZahlOccurence_1)
All Part Pass test
ExamenYear
2007 3 2 2 0.358702
2008 4 3 3 1.004504
2009 3 3 2 0.521511
I hope someone else will find this useful
You may use the pandas crosstab function, which by default computes a frequency table of two or more variables. For example,
> import pandas as pd
> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed no yes
ExamenYear
2007 1 2
2008 1 3
2009 1 2
Use the margins=True option if you also want to see the subtotal of each row and column.
> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated no yes All
ExamenYear
2007 1 2 3
2008 1 3 4
2009 0 3 3
All 2 8 10
This:
d1.groupby('ExamenYear').agg({'Participated': len,
                              'Passed': lambda x: sum(x == 'yes')})
doesn't look way more awkward than the R solution, IMHO.
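To the remaining question about custom names (and reusing the same column more than once): newer pandas (0.25+) supports named aggregation, which covers both; a minimal sketch on the same d1:
Res = d1.groupby('ExamenYear').agg(
    All=('StudentID', 'size'),
    OfWhichParticipated=('Participated', lambda x: (x == 'yes').sum()),
    OfWhichPassed=('Passed', lambda x: (x == 'yes').sum()),
)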
There is another approach that I like to use for similar problems; it uses groupby and unstack:
d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
(this is just the raw data from above)
d2 = d1.groupby("ExamenYear").Participated.value_counts().unstack(fill_value=0)['yes']
d3 = d1.groupby("ExamenYear").Passed.value_counts().unstack(fill_value=0)['yes']
d2.name = "Participated"
d3.name = "Passed"
pd.DataFrame(data=[d2,d3]).T
Participated Passed
ExamenYear
2007 2 2
2008 3 3
2009 3 2
This solution is slightly more cumbersome than the one above using apply, but this one is easier to understand and extend, I feel.
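A small variation on the last step, under the same assumptions: pd.concat can combine the two named Series directly.
pd.concat([d2, d3], axis=1)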
