python grouping and transpose - python

I have a dataframe, df, as below:
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
"Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
"Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
"Group_code":['123','123','456','678','123','434','678','365','678','987'],
"Amount":['100','240','450','212','432','123','543','567','232','453'],
"Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123']})
Now I need to group by Customer_no and get the total amount, the top 1, 2, and 3 categories, and each category's percentage of the total amount.
NOTE: In my toy dataset I have only 2 categories; in my original data I have more, and I need to select the top 5 categories.
My dataframe should look like this:
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
"Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
"Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
"Group_code":['123','123','456','678','123','434','678','365','678','987'],
"Amount":['100','240','450','212','432','123','543','567','232','453'],
"Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123'],
"Total amount" :['790','790','790','644','644','123','543','567','232','453'],
"Top-1 Category":['Hardware','Hardware','Hardware','Electrical','Electrical','Car','House','Toy',
'House','Bike123'],
"Top-1 Category %":['57','57','57','67','67','100','100','100','100','100'],
"Top-2 Category":['Electrical','Electrical','Electrical','House','House','','','','',''] ,
"Top-2 Category %":['43','43','43','33','33','0','0','0','0','0']})
I would appreciate your help in achieving this.
NOTE:
1) The top category is selected by grouping the categories for each customer and summing the amount per category. Whichever category has the largest amount is the Top 1 category, the next one is Top 2, and so on.
2) Top 1 category percentage: the total amount of that category divided by the customer's total amount, multiplied by 100. This is computed for each customer. Similarly for the Top 2 category.

Try this:
#Your data
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
"Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
"Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
"Group_code":['123','123','456','678','123','434','678','365','678','987'],
"Amount":['100','240','450','212','432','123','543','567','232','453'],
"Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123']})
#Make some columns numeric
for i in ["Customer_no","Card_no","Group_code","Amount"]:
    df[i] = pd.to_numeric(df[i])
#Total sum
Total_amount = pd.DataFrame(df.groupby(["Customer_no"]).Amount.sum()).reset_index().rename(columns={'Amount':'Total amount'})
#Add some necessary columns and grouping
Top_1_Category = pd.DataFrame(df.groupby(['Customer_no',"Category"]).Amount.sum()).reset_index().rename(columns={'Amount':'group_sum'})
df = df.merge(Total_amount,how='left',on='Customer_no')
df = df.merge(Top_1_Category,how='left',on=['Category','Customer_no'])
group_top_1 = df[['Customer_no','Category','group_sum']].loc[df.groupby('Customer_no').group_sum.agg('idxmax')].rename(columns={'Category':'Top-1 Category','group_sum':'group_sum_1'})
df = df.merge(group_top_1,how='left',on='Customer_no')
#Make columns 'Top-1 Category %'
df['Top-1 Category %'] = round(100*df['group_sum_1']/df['Total amount'],0)
#drop unnecessary columns
df.drop(['group_sum','group_sum_1'],axis=1,inplace=True)
You can add the Top-2 columns in a similar way.
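To extend the same idea to Top 2 through Top N, here is a minimal sketch that ranks the per-customer category sums and maps each rank back onto the rows. The helper name add_top_categories and its parameter n are hypothetical (not from the answer above), and the data is a trimmed version of the question's sample with Amount already numeric.

```python
import pandas as pd

# Trimmed sample from the question, with Amount already numeric
df = pd.DataFrame({
    "Customer_no": ['1', '1', '1', '2', '2', '6'],
    "Amount": [100, 240, 450, 212, 432, 123],
    "Category": ['Electrical', 'Electrical', 'Hardware', 'House', 'Electrical', 'Car'],
})

def add_top_categories(frame, n=2):
    """Hypothetical helper: add 'Total amount' and Top-1..Top-n category columns."""
    out = frame.copy()
    out["Total amount"] = out.groupby("Customer_no")["Amount"].transform("sum")

    # Sum per customer and category, largest category first within each customer
    sums = (out.groupby(["Customer_no", "Category"], as_index=False)["Amount"].sum()
               .sort_values(["Customer_no", "Amount"], ascending=[True, False]))
    sums["rank"] = sums.groupby("Customer_no").cumcount() + 1

    for k in range(1, n + 1):
        top_k = sums[sums["rank"] == k].set_index("Customer_no")
        # Customers with fewer than k categories get '' and 0
        out[f"Top-{k} Category"] = out["Customer_no"].map(top_k["Category"]).fillna("")
        amount_k = out["Customer_no"].map(top_k["Amount"]).fillna(0)
        out[f"Top-{k} Category %"] = (100 * amount_k / out["Total amount"]).round(0)
    return out

result = add_top_categories(df, n=2)
print(result)
```

For the original data you would call it with n=5. Ties between categories are broken arbitrarily by the sort, which may or may not matter for your data.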

Related

Add columns to dataframe based on a dictionary

I have a dataframe like this:
df = pd.DataFrame({
'ID': ['1', '4', '4', '3', '3', '3'],
'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket']
})
and I have a dictionary named tag_dict:
{'1': {'Granted'},
'3': {'Granted'}}
The keys of the dictionary match some of the IDs in the ID column of the dataframe.
Now I want to create a new column "Tag" in the dataframe such that:
if a value in the ID column matches a key of the dictionary, the value for that key is placed in the column; otherwise '-' is placed in that field.
The output should look like this:
df = pd.DataFrame({
'ID': ['1', '4', '4', '3', '3', '3'],
'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket'],
'tag':['Granted','-','-','Granted','Granted','Granted']
})
import pandas as pd
df = pd.DataFrame({
'ID': ['1', '4', '4', '3', '3', '3'],
'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket']})
# I've removed the {} around your items. Feel free to add more key:value pairs
my_dict = {'1': 'Granted', '3': 'Granted'}
# use .map() to match your keys to their values
df['Tag'] = df['ID'].map(my_dict)
# if required, fill in NaN values with '-'
nan_rows = df['Tag'].isna()
df.loc[nan_rows, 'Tag'] = '-'
df
End result:
I'm not sure what the purpose of the curly brackets around Granted is, but you could use apply:
df = pd.DataFrame({
'ID': ['1', '4', '4', '3', '3', '3'],
'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket']
})
tag_dict = {'1': 'Granted',
'3': 'Granted'}
df['tag'] = df['ID'].apply(lambda x: tag_dict.get(x, '-'))
print(df)
Output:
ID club tag
0 1 arts Granted
1 4 math -
2 4 theatre -
3 3 poetry Granted
4 3 dance Granted
5 3 cricket Granted
Solution with .map (tag_dict here is the original dictionary, whose values are sets):
df["tag"] = df["ID"].map(tag_dict).apply(lambda x: "-" if pd.isna(x) else [*x][0])
print(df)
Prints:
ID club tag
0 1 arts Granted
1 4 math -
2 4 theatre -
3 3 poetry Granted
4 3 dance Granted
5 3 cricket Granted
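If you would rather keep the original dictionary with its set values, one option is to flatten the sets first and then combine .map with .fillna. This is only a sketch: the name flat_tags is made up for illustration, and it assumes every set holds exactly one element.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '4', '4', '3', '3', '3'],
    'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket'],
})
tag_dict = {'1': {'Granted'}, '3': {'Granted'}}

# Flatten each one-element set to its single string
# (assumption: exactly one element per set)
flat_tags = {key: next(iter(values)) for key, values in tag_dict.items()}

# IDs missing from the dictionary map to NaN, which fillna turns into '-'
df['tag'] = df['ID'].map(flat_tags).fillna('-')
print(df)
```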

How to transform index values into columns using Pandas?

I have a dictionary like this:
my_dict = {'RuleSet': {'0': {'RuleSetID': '0',
'RuleSetName': 'Allgemein',
'Rules': [{'RulesID': '10',
'RuleName': 'Gemeinde Seiten',
'GroupHits': '2',
'KeyWordGroups': ['100', '101', '102']}]},
'1': {'RuleSetID': '1',
'RuleSetName': 'Portale Berlin',
'Rules': [{'RulesID': '11',
'RuleName': 'Portale Berlin',
'GroupHits': '4',
'KeyWordGroups': ['100', '101', '102', '107']}]},
'6': {'RuleSetID': '6',
'RuleSetName': 'Zwangsvollstr. Berlin',
'Rules': [{'RulesID': '23',
'RuleName': 'Zwangsvollstr. Berlin',
'GroupHits': '1',
'KeyWordGroups': ['100', '101']}]}}}
When using this code snippet it can be transformed into a dataframe:
rules_pd = pd.DataFrame(my_dict['RuleSet'])
rules_pd
The result is:
I would like to make it look like this:
Does anyone know how to tackle this challenge?
Use from_dict with orient 'index':
out = pd.DataFrame.from_dict(my_dict['RuleSet'],'index')
Out[692]:
RuleSetID ... Rules
0 0 ... [{'RulesID': '10', 'RuleName': 'Gemeinde Seite...
1 1 ... [{'RulesID': '11', 'RuleName': 'Portale Berlin...
6 6 ... [{'RulesID': '23', 'RuleName': 'Zwangsvollstr....
[3 rows x 3 columns]
#out.columns
#Out[693]: Index(['RuleSetID', 'RuleSetName', 'Rules'], dtype='object')
You could try using transpose():
rules_pd = pd.DataFrame(my_dict['RuleSet']).transpose()
print(rules_pd)
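Both suggestions should give the same layout, with one row per rule set and the inner keys as columns. A small sketch to check that, using a shortened stand-in for my_dict['RuleSet'] (the name rule_sets and the trimmed values are only for illustration):

```python
import pandas as pd

# Shortened stand-in for my_dict['RuleSet'] (values trimmed for illustration)
rule_sets = {
    '0': {'RuleSetID': '0', 'RuleSetName': 'Allgemein', 'Rules': [{'RulesID': '10'}]},
    '1': {'RuleSetID': '1', 'RuleSetName': 'Portale Berlin', 'Rules': [{'RulesID': '11'}]},
}

# Outer keys become the row index directly
by_from_dict = pd.DataFrame.from_dict(rule_sets, orient='index')

# Outer keys become columns first, then rows after transposing
by_transpose = pd.DataFrame(rule_sets).transpose()

print(by_from_dict.shape, by_transpose.shape)
print(list(by_transpose.columns))
```

The Rules column still holds the nested lists of dictionaries in both cases; flattening those would be a separate step.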

PANDAS create column by iterating row by row checking values in 2nd dataframe until all values are true

df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005'],
'L' : ['5', '65.0', '445.0', '3200', '65000'],
'H' : ['2', '15.5', '150.5', '1500', '54000'],
'W' : ['5', '85.0', '640.0', '1650', '45000']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000'],
'Width' : ['10', '100', '1000', '10000'],
'Height': ['10', '100', '1000', '10000'],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
So here I have two example dataframes. The first dataframe, df1, shows unique item numbers with given dimensions. df2 shows the maximum allowable dimensions for a given rank and code, meaning that no element (length, width, height) may exceed the maximum given dimension. I would like to check the dimensions in df1 against df2 until all dimension criteria are True in order to retrieve its 'rank' and 'code'. So, in essence, iterate down df2 row by row until all the criteria are True.
Make a new df3 as follows:
ItemNo Rank Code
001 1 aa
002 2 bb
003 3 cc
004 4 dd
005 5 ee
Using numpy:
1. Changed the sample data so that it's not just incrementing results
2. Get the index of the row in df2 that matches the required logic
3. Build df3 using the index from step 2
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005'],
'L' : ['5', '65.0', '445.0', '5', '65000'],
'H' : ['2', '15.5', '150.5', '5', '54000'],
'W' : ['5', '85.0', '640.0', '5', '45000']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000',100000],
'Width' : ['10', '100', '1000', '10000',100000],
'Height': ['10', '100', '1000', '10000',100000],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
# fix up datatypes for comparisons
df1.loc[:,["L","H","W"]] = df1.loc[:,["L","H","W"]].astype(float)
df2.loc[:,["Length","Height","Width"]] = df2.loc[:,["Length","Height","Width"]].astype(float)
# row by row comparison, argmax to get first True
idx = [np.argmax((df1.loc[r,["L","H","W"]].values
< df2.loc[:,["Length","Height","Width"]].values).all(axis=1))
for r in df1.index]
# finally the result
pd.concat([df1.ItemNo, df2.loc[idx,["Rank","Code"]].reset_index(drop=True)],axis=1)
  ItemNo Rank Code
0    001    1   aa
1    002    2   bb
2    003    3   cc
3    004    1   aa
4    005    5   ee
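The row-by-row comparison can also be written as a single broadcast comparison. The following is a sketch of that variant, not the code from the answer above: the names items, limits, fits and has_fit are mine, and the dimensions are entered as numbers rather than strings. It also drops items that fit no rank, since argmax alone returns 0 when no row is True.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'ItemNo': ['001', '002', '003', '004', '005'],
    'L': [5, 65.0, 445.0, 5, 65000],
    'H': [2, 15.5, 150.5, 5, 54000],
    'W': [5, 85.0, 640.0, 5, 45000],
})
df2 = pd.DataFrame({
    'Rank': ['1', '2', '3', '4', '5'],
    'Length': [10, 100, 1000, 10000, 100000],
    'Width': [10, 100, 1000, 10000, 100000],
    'Height': [10, 100, 1000, 10000, 100000],
    'Code': ['aa', 'bb', 'cc', 'dd', 'ee'],
})

items = df1[['L', 'H', 'W']].to_numpy(dtype=float)                 # shape (n_items, 3)
limits = df2[['Length', 'Height', 'Width']].to_numpy(dtype=float)  # shape (n_ranks, 3)

# fits[i, j] is True when every dimension of item i is below the limits of rank j
fits = (items[:, None, :] < limits[None, :, :]).all(axis=2)
idx = fits.argmax(axis=1)     # first rank that fits (0 if none fits)
has_fit = fits.any(axis=1)    # guards against the "none fits" case

df3 = pd.concat(
    [df1[['ItemNo']], df2.loc[idx, ['Rank', 'Code']].reset_index(drop=True)],
    axis=1,
)
df3 = df3[has_fit]
print(df3)
```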
I think you can try:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005', '006'],
'L' : ['5', '65.0', '445.0', '3200', '65000', '10'],
'H' : ['2', '15.5', '150.5', '1500', '54000','1000'],
'W' : ['5', '85.0', '640.0', '1650', '45000', '10']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000', '100000'],
'Width' : ['10', '100', '1000', '10000', '100000'],
'Height': ['10', '100', '1000', '10000', '100000'],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
df_sort = pd.DataFrame({'W': np.searchsorted(df2['Width'].astype(float), df1['W'].astype(float)),
'H': np.searchsorted(df2['Height'].astype(float), df1['H'].astype(float)),
'L': np.searchsorted(df2['Length'].astype(float), df1['L'].astype(float))})
df1['Rank'] = df_sort.max(axis=1).map(df2['Rank'])
df1['Code'] = df1['Rank'].map(df2.set_index('Rank')['Code'])
print(df1)
Output:
ItemNo L H W Rank Code
0 001 5 2 5 1 aa
1 002 65.0 15.5 85.0 2 bb
2 003 445.0 150.5 640.0 3 cc
3 004 3200 1500 1650 4 dd
4 005 65000 54000 45000 5 ee
5 006 10 1000 10 3 cc
The core of the code is the np.searchsorted function, which finds, for example, the index at which the value of L would be inserted into the sorted Length column, per the conditions listed in the documentation. I use np.searchsorted for each of the three dimensions, take the largest index using max(axis=1), and assign the rank and code based on that largest index using map.
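To make the np.searchsorted behaviour concrete, here is a small standalone example. The arrays limits and values are made up for illustration; with the default side='left', the result is the index of the first limit that is greater than or equal to the value.

```python
import numpy as np

# Limits must be sorted in ascending order for searchsorted to work
limits = np.array([10.0, 100.0, 1000.0, 10000.0, 100000.0])
values = np.array([5.0, 65.0, 10.0, 1000.0, 54000.0])

# Default side='left': index of the first limit >= value
positions = np.searchsorted(limits, values)
print(positions)  # [0 1 0 2 4]
```

Note that a value larger than every limit returns len(limits), which has no matching row in df2, so such items would need separate handling.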

loop over multiple dataframes and saving multiple lists

I have a bunch of dataframes, like the ones below:
import pandas as pd
data1 = [['1', '2', 'mary', 123], ['1', '3', 'john', 234 ], ['2', '4', 'layla', 345 ]]
data2 = [['2', '6', 'josh', 345], ['1', '2', 'dolores', 987], ['1', '4', 'kate', 843]]
df1 = pd.DataFrame(data1, columns = ['state', 'city', 'name', 'number1'])
df2 = pd.DataFrame(data2, columns = ['state', 'city', 'name', 'number1'])
For some silly reason, I need to transform each of them into a list in this manner (one entry per row):
list(
df1.apply(
lambda x: {
"profile": {"state": x["state"], "city": x["city"], "name": x["name"]},
"number1": x["number1"],
},
axis=1,
)
)
which returns exactly what I need:
[{'profile': {'state': '1', 'city': '2', 'name': 'mary'}, 'number1': 123},
{'profile': {'state': '1', 'city': '3', 'name': 'john'}, 'number1': 234},
{'profile': {'state': '2', 'city': '4', 'name': 'layla'}, 'number1': 345}]
It works if I do it for each dataframe, but I need to write a function so I can use it later. Also, I need to be able to store both df1 and df2 separately after the operation.
I tried something like this:
df_list = [df1, df2]
for row in df_list:
    row = list(row.apply(lambda x: {'send': {'state':x['state'], 'city':x['city'], 'name':x['name']}, 'number1':x['number1']}, axis=1))
but it only keeps the value for the last df in the list (df2) in row.
I also tried something like this (and a lot of other things):
new_values = []
for row in df_list:
    row = list(row.apply(lambda x: {'send': {'state':x['state'],'city':x['city'],'name':x['name']},'number1':x['number1']}, axis=1))
    new_values.append(df_list)
I know it might be because the row value is not being saved anywhere. I've read a lot of posts here similar to my problem, but I couldn't manage to apply them. Any help will be appreciated; I'm really stuck here.
Do you mean this?
def func(df):
    return list(df.apply(lambda x:{'profile' : {'state': x['state'],'city': x['city'],'name':x['name']},'number1': x['number1']}, axis=1))
You can use it like this:
df1 = func(df1)
And if you want to apply it to all of the dataframes:
df1, df2 = [func(df) for df in [df1, df2]]
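If there are many dataframes, keeping the results in a dictionary keyed by name avoids overwriting the originals. This is a sketch of one way to do it, using to_dict("records") instead of apply; the names to_profile_records, frames and records are hypothetical.

```python
import pandas as pd

data1 = [['1', '2', 'mary', 123], ['1', '3', 'john', 234], ['2', '4', 'layla', 345]]
data2 = [['2', '6', 'josh', 345], ['1', '2', 'dolores', 987], ['1', '4', 'kate', 843]]
columns = ['state', 'city', 'name', 'number1']
df1 = pd.DataFrame(data1, columns=columns)
df2 = pd.DataFrame(data2, columns=columns)

def to_profile_records(df):
    """Hypothetical helper: one nested dict per row of df."""
    return [
        {
            "profile": {"state": row["state"], "city": row["city"], "name": row["name"]},
            "number1": row["number1"],
        }
        for row in df.to_dict("records")
    ]

# Results are stored separately per dataframe; df1 and df2 stay untouched
frames = {"df1": df1, "df2": df2}
records = {name: to_profile_records(frame) for name, frame in frames.items()}
print(records["df1"])
```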

Group By with sumproduct

I am working with a df with the following structure:
df = pd.DataFrame({'Date' : ['1', '1', '1', '1'],
'Ref' : ['one', 'one', 'two', 'two'],
'Price' : ['50', '65', '30', '35'],
'MktPrice' : ['63', '63', '32', '32'],
'Quantity' : ['10', '15', '20', '10'],
'MarketQuantity': ['50', '50', '100', '100'],
'Weightings' : ['2', '2', '4', '4'],
'QxWeightings' : ['20', '30', '80', '40'],
'MktQxWeightings': ['100', '100', '400', '400'],
})
I have managed to get the weighted percentage that my Quantity represents out of MarketQuantity when Price is at or above MktPrice (shown by Date and Ref):
def percentage(x):
    return (x.loc[x['Price'] >= x['MktPrice'], ['QxWeightings']].sum()/(x['MktQxWeightings'].sum()/len(x)))
df.groupby(['Date', 'Ref']).apply(percentage)
Date Ref Output
1 one 0.3
1 two 0.1
However, when I am trying to group it only by Date I get:
Date Output
1 0.4
which is the sum of the previous outputs, when it should be 0.14, i.e. (30+40)/(100+400).
How can I do that with groupby?
IIUC, maybe something like this:
def percentage(x):
    return (x.loc[x['Price'] >= x['MktPrice'], ['QxWeightings']].sum()/(x['MktQxWeightings'].sum()/len(x)))
df_new=df.groupby(['Date', 'Ref','MktQxWeightings']).apply(percentage).reset_index()
print(df_new)
Date Ref MktQxWeightings QxWeightings
0 1 one 100 0.3
1 1 two 400 0.1
df_new.groupby('Date')[['MktQxWeightings','QxWeightings']].apply(lambda x: x['QxWeightings'].\
cumsum().sum()*100/x['MktQxWeightings'].sum())
Date
1 0.14
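A more direct way to get the per-Date figure is to compute the numerator and denominator separately and divide at the end. The sketch below is an alternative under stated assumptions, not the method above: the columns are numeric rather than strings, and MktQxWeightings is constant within each (Date, Ref) pair, as it is in the sample.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1', '1', '1', '1'],
    'Ref': ['one', 'one', 'two', 'two'],
    'Price': [50, 65, 30, 35],
    'MktPrice': [63, 63, 32, 32],
    'QxWeightings': [20, 30, 80, 40],
    'MktQxWeightings': [100, 100, 400, 400],
})

# Numerator: QxWeightings of rows priced at or above market, summed per Date
above = df['QxWeightings'].where(df['Price'] >= df['MktPrice'], 0)
numerator = above.groupby(df['Date']).sum()

# Denominator: MktQxWeightings counted once per (Date, Ref), then summed per Date
denominator = (df.groupby(['Date', 'Ref'])['MktQxWeightings'].first()
                 .groupby(level='Date').sum())

result = numerator / denominator
print(result)
```

For the sample this gives (30+40)/(100+400) for Date 1.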
