Add columns to dataframe based on a dictionary - python

I have a dataframe like this:
df = pd.DataFrame({
'ID': ['1', '4', '4', '3', '3', '3'],
'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket']
})
and I have a dictionary named tag_dict:
{'1': {'Granted'},
'3': {'Granted'}}
The keys of the dictionary match some IDs in the ID column of the dataframe.
Now, I want to create a new column "Tag" in the dataframe such that
if a value in the ID column matches a key of the dictionary, we place that key's value in the new column; otherwise we place '-' in that field.
The output should look like this:
df = pd.DataFrame({
'ID': ['1', '4', '4', '3', '3', '3'],
'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket'],
'tag':['Granted','-','-','Granted','Granted','Granted']
})

import pandas as pd
df = pd.DataFrame({
'ID': ['1', '4', '4', '3', '3', '3'],
'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket']})
# I've removed the {} around your items. Feel free to add more key:value pairs
my_dict = {'1': 'Granted', '3': 'Granted'}
# use .map() to match your keys to their values
df['Tag'] = df['ID'].map(my_dict)
# if required, fill in NaN values with '-'
nan_rows = df['Tag'].isna()
df.loc[nan_rows, 'Tag'] = '-'
df
End result: df now has a Tag column with 'Granted' for IDs 1 and 3, and '-' for ID 4.
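As a side note (my own variant, not from the answer above), the map-then-fill steps can be collapsed into one chained expression with .fillna():

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '4', '4', '3', '3', '3'],
    'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket']})
my_dict = {'1': 'Granted', '3': 'Granted'}

# .map() leaves NaN for unmatched IDs; .fillna() replaces them in the same chain
df['Tag'] = df['ID'].map(my_dict).fillna('-')
```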

I'm not sure what the purpose of the curly brackets around 'Granted' is, but you could use apply:
df = pd.DataFrame({
'ID': ['1', '4', '4', '3', '3', '3'],
'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket']
})
tag_dict = {'1': 'Granted',
'3': 'Granted'}
df['tag'] = df['ID'].apply(lambda x: tag_dict.get(x, '-'))
print(df)
Output:
ID club tag
0 1 arts Granted
1 4 math -
2 4 theatre -
3 3 poetry Granted
4 3 dance Granted
5 3 cricket Granted

Solution with .map:
# tag_dict keeps the question's set values: {'1': {'Granted'}, '3': {'Granted'}}
df["tag"] = df["ID"].map(tag_dict).apply(lambda x: "-" if pd.isna(x) else [*x][0])
print(df)
Prints:
ID club tag
0 1 arts Granted
1 4 math -
2 4 theatre -
3 3 poetry Granted
4 3 dance Granted
5 3 cricket Granted
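If you'd rather keep the set values from the question's tag_dict untouched, .map also accepts a callable; a small sketch (assuming each set holds exactly one element):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '4', '4', '3', '3', '3'],
    'club': ['arts', 'math', 'theatre', 'poetry', 'dance', 'cricket']})
tag_dict = {'1': {'Granted'}, '3': {'Granted'}}

# next(iter(...)) pulls the single element out of the set;
# .get() supplies a one-element default set for missing IDs
df['tag'] = df['ID'].map(lambda i: next(iter(tag_dict.get(i, {'-'}))))
```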

Related

How to transform index values into columns using Pandas?

I have a dictionary like this:
my_dict = {'RuleSet': {'0': {'RuleSetID': '0',
'RuleSetName': 'Allgemein',
'Rules': [{'RulesID': '10',
'RuleName': 'Gemeinde Seiten',
'GroupHits': '2',
'KeyWordGroups': ['100', '101', '102']}]},
'1': {'RuleSetID': '1',
'RuleSetName': 'Portale Berlin',
'Rules': [{'RulesID': '11',
'RuleName': 'Portale Berlin',
'GroupHits': '4',
'KeyWordGroups': ['100', '101', '102', '107']}]},
'6': {'RuleSetID': '6',
'RuleSetName': 'Zwangsvollstr. Berlin',
'Rules': [{'RulesID': '23',
'RuleName': 'Zwangsvollstr. Berlin',
'GroupHits': '1',
'KeyWordGroups': ['100', '101']}]}}}
Using this code snippet, it can be transformed into a dataframe:
rules_pd = pd.DataFrame(my_dict['RuleSet'])
rules_pd
The result is a dataframe with one column per rule set (the RuleSetIDs become column labels).
I would like it to have one row per rule set instead, with RuleSetID, RuleSetName and Rules as columns.
Does anyone know how to tackle this challenge?
Doing from_dict with index
out = pd.DataFrame.from_dict(my_dict['RuleSet'],'index')
Out[692]:
RuleSetID ... Rules
0 0 ... [{'RulesID': '10', 'RuleName': 'Gemeinde Seite...
1 1 ... [{'RulesID': '11', 'RuleName': 'Portale Berlin...
6 6 ... [{'RulesID': '23', 'RuleName': 'Zwangsvollstr....
[3 rows x 3 columns]
#out.columns
#Out[693]: Index(['RuleSetID', 'RuleSetName', 'Rules'], dtype='object')
You could try using .transpose():
rules_pd = pd.DataFrame(my_dict['RuleSet']).transpose()
print(rules_pd)
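Since the desired layout isn't shown in the question, one guess is that the nested Rules entries should become flat columns too; a sketch (my own, not from the answers) using explode plus json_normalize (pandas >= 0.25), on a trimmed copy of the question's my_dict:

```python
import pandas as pd

# trimmed copy of the question's my_dict (two rule sets instead of three)
my_dict = {'RuleSet': {
    '0': {'RuleSetID': '0', 'RuleSetName': 'Allgemein',
          'Rules': [{'RulesID': '10', 'RuleName': 'Gemeinde Seiten',
                     'GroupHits': '2', 'KeyWordGroups': ['100', '101', '102']}]},
    '1': {'RuleSetID': '1', 'RuleSetName': 'Portale Berlin',
          'Rules': [{'RulesID': '11', 'RuleName': 'Portale Berlin',
                     'GroupHits': '4', 'KeyWordGroups': ['100', '101', '102', '107']}]}}}

# one row per rule set, then one row per rule, then flatten the rule dicts
rules = pd.DataFrame.from_dict(my_dict['RuleSet'], orient='index').reset_index(drop=True)
exploded = rules.explode('Rules').reset_index(drop=True)
flat = pd.concat([exploded.drop(columns='Rules'),
                  pd.json_normalize(exploded['Rules'].tolist())], axis=1)
```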

PANDAS create column by iterating row by row checking values in 2nd dataframe until all values are true

df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005'],
'L' : ['5', '65.0', '445.0', '3200', '65000'],
'H' : ['2', '15.5', '150.5', '1500', '54000'],
'W' : ['5', '85.0', '640.0', '1650', '45000']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000', '100000'],
'Width' : ['10', '100', '1000', '10000', '100000'],
'Height': ['10', '100', '1000', '10000', '100000'],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
So here I have two example dataframes. The first dataframe shows unique item numbers with given dimensions. df2 shows the maximum allowable dimensions for a given rank and code, meaning no element (length, width, height) may exceed the maximums for that rank. I would like to check the dimensions in df1 against df2 until all dimension criteria are True, in order to retrieve the item's 'Rank' and 'Code'. So, in essence, iterate down df2 row by row until all the criteria are True.
Make a new df3 as follows:
ItemNo Rank Code
001 1 aa
002 2 bb
003 3 cc
004 4 dd
005 5 ee
Using numpy:
1. changed the sample data so that it's not just incrementing results
2. get the index of the row in df2 that matches the required logic
3. build df3 using the index from step 2
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005'],
'L' : ['5', '65.0', '445.0', '5', '65000'],
'H' : ['2', '15.5', '150.5', '5', '54000'],
'W' : ['5', '85.0', '640.0', '5', '45000']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000', '100000'],
'Width' : ['10', '100', '1000', '10000', '100000'],
'Height': ['10', '100', '1000', '10000', '100000'],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
# fix up datatypes for comparisons
df1.loc[:,["L","H","W"]] = df1.loc[:,["L","H","W"]].astype(float)
df2.loc[:,["Length","Height","Width"]] = df2.loc[:,["Length","Height","Width"]].astype(float)
# row by row comparison, argmax to get first True
# "must not exceed" means <=, not <
idx = [np.argmax((df1.loc[r, ["L","H","W"]].values
                  <= df2.loc[:, ["Length","Height","Width"]].values).all(axis=1))
       for r in df1.index]
# finally the result
pd.concat([df1.ItemNo, df2.loc[idx,["Rank","Code"]].reset_index(drop=True)],axis=1)
  ItemNo Rank Code
0    001    1   aa
1    002    2   bb
2    003    3   cc
3    004    1   aa
4    005    5   ee
I think you can try:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'ItemNo' : ['001', '002', '003', '004', '005', '006'],
'L' : ['5', '65.0', '445.0', '3200', '65000', '10'],
'H' : ['2', '15.5', '150.5', '1500', '54000','1000'],
'W' : ['5', '85.0', '640.0', '1650', '45000', '10']
})
df2 = pd.DataFrame({
'Rank' : ['1','2','3','4','5'],
'Length': ['10', '100', '1000', '10000', '100000'],
'Width' : ['10', '100', '1000', '10000', '100000'],
'Height': ['10', '100', '1000', '10000', '100000'],
'Code' : [ 'aa', 'bb', 'cc', 'dd', 'ee']
})
df_sort = pd.DataFrame({'W': np.searchsorted(df2['Width'].astype(float), df1['W'].astype(float)),
'H': np.searchsorted(df2['Height'].astype(float), df1['H'].astype(float)),
'L': np.searchsorted(df2['Length'].astype(float), df1['L'].astype(float))})
df1['Rank'] = df_sort.max(axis=1).map(df2['Rank'])
df1['Code'] = df1['Rank'].map(df2.set_index('Rank')['Code'])
print(df1)
Output:
ItemNo L H W Rank Code
0 001 5 2 5 1 aa
1 002 65.0 15.5 85.0 2 bb
2 003 445.0 150.5 640.0 3 cc
3 004 3200 1500 1650 4 dd
4 005 65000 54000 45000 5 ee
5 006 10 1000 10 3 cc
The core of the code is the np.searchsorted function, which is used to find the index at which, for example, the value of L would be inserted into Length, per the conditions listed in the documentation. So, I use np.searchsorted for each of the three dimensions; then I take the largest of the three indices using max(axis=1) and assign the Rank and Code based on that largest value using map.
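To illustrate np.searchsorted on its own (a toy example, not from the answer): with sorted boundaries, the returned insertion index equals the number of boundaries the value strictly exceeds, which is exactly what makes it usable as a rank lookup here.

```python
import numpy as np

bounds = np.array([10., 100., 1000.])
# side='left' (the default) returns the index of the first boundary >= the value,
# i.e. the smallest rank whose maximum the value does not exceed
idx_small = np.searchsorted(bounds, 5.)    # fits under the first bound -> 0
idx_edge = np.searchsorted(bounds, 100.)   # exactly on the second bound -> 1
idx_big = np.searchsorted(bounds, 640.)    # needs the third bound -> 2
```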

Group By with sumproduct

I am working with a df with the following structure:
df = pd.DataFrame({'Date' : ['1', '1', '1', '1'],
'Ref' : ['one', 'one', 'two', 'two'],
'Price' : ['50', '65', '30', '35'],
'MktPrice' : ['63', '63', '32', '32'],
'Quantity' : ['10', '15', '20', '10'],
'MarketQuantity': ['50', '50', '100', '100'],
'Weightings' : ['2', '2', '4', '4'],
'QxWeightings' : ['20', '30', '80', '40'],
'MktQxWeightings': ['100', '100', '400', '400'],
})
I have managed to get the weighted percentage that represents my Quantity out of MarketQuantity, when Price is above Mkt Price (and showing it by Date and Ref)
def percentage(x):
    return (x.loc[x['Price'] >= x['MktPrice'], ['QxWeightings']].sum()/(x['MktQxWeightings'].sum()/len(x)))
df.groupby(['Date', 'Ref']).apply(percentage)
Date Ref Output
1 one 0.3
1 two 0.1
However, when I am trying to group it only by Date I get:
Date Output
1 0.4
which is the sum of previous outputs, when it should be 0.14 (30+40)/(100+400).
How can I do that with groupby?
IIUC, may be something like this:
def percentage(x):
    return (x.loc[x['Price'] >= x['MktPrice'], ['QxWeightings']].sum()/(x['MktQxWeightings'].sum()/len(x)))
df_new=df.groupby(['Date', 'Ref','MktQxWeightings']).apply(percentage).reset_index()
print(df_new)
Date Ref MktQxWeightings QxWeightings
0 1 one 100 0.3
1 1 two 400 0.1
df_new.groupby('Date')[['MktQxWeightings','QxWeightings']].apply(
    lambda x: (x['QxWeightings']*x['MktQxWeightings']).sum()/x['MktQxWeightings'].sum())
Date
1 0.14
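An alternative sketch (my own, not from the answer above) that computes the Date-level figure directly: sum the matching QxWeightings, and divide by the market totals counted once per Ref. Column names are from the question; the values are entered as numbers since the string columns would not sum correctly.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1', '1', '1', '1'],
                   'Ref': ['one', 'one', 'two', 'two'],
                   'Price': [50, 65, 30, 35],
                   'MktPrice': [63, 63, 32, 32],
                   'QxWeightings': [20, 30, 80, 40],
                   'MktQxWeightings': [100, 100, 400, 400]})

# numerator: weighted quantity where Price beats MktPrice, per Date
num = df.loc[df['Price'] >= df['MktPrice']].groupby('Date')['QxWeightings'].sum()
# denominator: MktQxWeightings repeats within each Ref, so count it once per Ref
den = df.groupby(['Date', 'Ref'])['MktQxWeightings'].first().groupby('Date').sum()
out = num / den   # Date '1' -> (30 + 40) / (100 + 400) = 0.14
```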

python grouping and transpose

I have a dataframe - df as below :
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
"Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
"Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
"Group_code":['123','123','456','678','123','434','678','365','678','987'],
"Amount":['100','240','450','212','432','123','543','567','232','453'],
"Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123']})
Now, I need to group by Customer_no and get the total amount, the top 1, 2, 3 categories, and their percentage of the total amount.
NOTE: in my toy dataset I have only 2 categories; in my original data I have more, and I need to select the top 5 categories.
My Dataframe should look like this :
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
"Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
"Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
"Group_code":['123','123','456','678','123','434','678','365','678','987'],
"Amount":['100','240','450','212','432','123','543','567','232','453'],
"Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123'],
"Total amount" :['790','790','790','644','644','123','543','567','232','453'],
"Top-1 Category":['Hardware','Hardware','Hardware','Electrical','Electrical','Car','House','Toy',
'House','Bike123'],
"Top-1 Category %":['57','57','57','67','67','100','100','100','100','100'],
"Top-2 Category":['Electrical','Electrical','Electrical','House','House','','','','',''] ,
"Top-2 Category %":['43','43','43','33','33','0','0','0','0','0']})
Request your help to achieve it.
NOTE :
1) The top category is selected by grouping the categories for each customer and summing the amount per category, customer-wise. Whichever category has the largest amount is the Top-1 category; the next largest is Top-2, and so on.
2) Top-1 category percentage: the total amount of that category divided by the customer's total amount, multiplied by 100. Similarly for the Top-2 category.
Try this:
#Your data
df = pd.DataFrame({"Customer_no": ['1', '1', '1', '2', '2', '6', '7','8','9','10'],
"Card_no": ['111', '222', '333', '444', '555', '666', '777','888','999','000'],
"Card_name":['AAA','AAA','BBB','CCC','AAA','DDD','EEE','BBB','CCC','CCC'],
"Group_code":['123','123','456','678','123','434','678','365','678','987'],
"Amount":['100','240','450','212','432','123','543','567','232','453'],
"Category" :['Electrical','Electrical','Hardware','House','Electrical','Car','House','Toy','House','Bike123']})
#Make some columns numerical
for i in ["Customer_no","Card_no","Group_code","Amount"]:
    df[i] = pd.to_numeric(df[i])
#Total sum
Total_amount = pd.DataFrame(df.groupby(["Customer_no"]).Amount.sum()).reset_index().rename(columns={'Amount':'Total amount'})
#Add some necessary columns and grouping
Top_1_Category = pd.DataFrame(df.groupby(['Customer_no',"Category"]).Amount.sum()).reset_index().rename(columns={'Amount':'group_sum'})
df = df.merge(Total_amount,how='left',on='Customer_no')
df = df.merge(Top_1_Category,how='left',on=['Category','Customer_no'])
group_top_1 = df[['Customer_no','Category','group_sum']].loc[df.groupby('Customer_no').group_sum.agg('idxmax')].rename(columns={'Category':'Top-1 Category','group_sum':'group_sum_1'})
df = df.merge(group_top_1,how='left',on='Customer_no')
#Make columns 'Top-1 Category %'
df['Top-1 Category %'] = round(100*df['group_sum_1']/df['Total amount'],0)
#drop unnecessary columns
df.drop(['group_sum','group_sum_1'],axis=1,inplace=True)
You can add the Top-2 columns similarly.
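A more compact sketch of the same idea (my own variant, not the answer above): rank the per-customer category sums with cumcount, then map the top-n back onto the rows. Only the columns needed for the calculation are included.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_no": ['1', '1', '1', '2', '2', '6', '7', '8', '9', '10'],
    "Category": ['Electrical', 'Electrical', 'Hardware', 'House', 'Electrical',
                 'Car', 'House', 'Toy', 'House', 'Bike123'],
    "Amount": ['100', '240', '450', '212', '432', '123', '543', '567', '232', '453']})
df['Amount'] = pd.to_numeric(df['Amount'])

# per-customer total, broadcast back onto every row
df['Total amount'] = df.groupby('Customer_no')['Amount'].transform('sum')

# per-customer, per-category sums, largest first; cumcount gives the rank
cat = (df.groupby(['Customer_no', 'Category'])['Amount'].sum()
         .sort_values(ascending=False)
         .reset_index())
cat['n'] = cat.groupby('Customer_no').cumcount() + 1   # 1 = top category
cat = cat[cat['n'] <= 2]                               # keep top 2 (use 5 on real data)

for n in (1, 2):
    sub = cat[cat['n'] == n].set_index('Customer_no')
    df[f'Top-{n} Category'] = df['Customer_no'].map(sub['Category']).fillna('')
    amt = df['Customer_no'].map(sub['Amount']).fillna(0)
    df[f'Top-{n} Category %'] = round(100 * amt / df['Total amount'], 0)
```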

Convert list of lists to custom dictionary

I'm unsuccessfully trying to convert list of lists to a custom dictionary.
I've created the following output saved in two lists:
headers = ['CPU', 'name', 'id', 'cused', 'callc', 'mused', 'mallc']
result = [['1/0', 'aaa', '10', '0.1', '15', '10.73', '16.00'],
['1/0', 'bbb', '10', '0.1', '20', '11.27', '14.00'],
['1/0', 'ccc', '10', '0.2', '10', '11.50', '15.00'],
['1/0', 'aaa', '10', '1.1', '15', '15.10', '23.00']]
Formatted output:
headers:
slot name id cused callc mused mallc
result:
1/0 aaa 10 0.1 15 10.73 16.00
2/0 bbb 25 0.1 20 11.39 14.00
1/0 ccc 10 0.2 10 11.50 15.00
1/0 aaa 10 1.1 15 15.10 23.00
The first n columns (3 in this case) should be used to concatenate key name with all of the remaining columns as output values.
I would like to convert it to a dictionary in the following format:
slot.<slot>.name.<name>.id.<id>.cused:<value>,
slot.<slot>.name.<name>.id.<id>.callc:<value>,
slot.<slot>.name.<name>.id.<id>.mused:<value>,
slot.<slot>.name.<name>.id.<id>.mallc:<value>,
...
for example:
dictionary = {
'slot.1/0.name.aaa.id.10.cused':10,
'slot.1/0.name.aaa.id.25.callc':15,
'slot.1/0.name.aaa.id.10.mused':10.73,
'slot.1/0.name.aaa.id.10.mallc':16.00,
'slot.2/0.name.bbb.id.10.cused':0.1,
...
'slot.<n>.name.<name>.id.<id>.<value_name> <value>
}
Can you show me how that can be done?
Updated - OP added raw lists
Now that you have updated the question to show the raw lists, it's even easier:
headers = ['CPU', 'name', 'id', 'cused', 'callc', 'mused', 'mallc']
result = [['1/0', 'aaa', '10', '0.1', '15', '10.73', '16.00'],
['1/0', 'bbb', '10', '0.1', '20', '11.27', '14.00'],
['1/0', 'ccc', '10', '0.2', '10', '11.50', '15.00'],
['1/0', 'aaa', '10', '1.1', '15', '15.10', '23.00']]
results = {}
for r in result:
    slot, name, _id = r[:3]
    results.update(
        {'slot.{}.name.{}.id.{}.{}'.format(slot, name, _id, k): v
         for k, v in zip(headers[3:], r[3:])})
>>> from pprint import pprint
>>> pprint(results)
{'slot.1/0.name.aaa.id.10.callc': '15',
'slot.1/0.name.aaa.id.10.cused': '1.1',
'slot.1/0.name.aaa.id.10.mallc': '23.00',
'slot.1/0.name.aaa.id.10.mused': '15.10',
'slot.1/0.name.bbb.id.10.callc': '20',
'slot.1/0.name.bbb.id.10.cused': '0.1',
'slot.1/0.name.bbb.id.10.mallc': '14.00',
'slot.1/0.name.bbb.id.10.mused': '11.27',
'slot.1/0.name.ccc.id.10.callc': '10',
'slot.1/0.name.ccc.id.10.cused': '0.2',
'slot.1/0.name.ccc.id.10.mallc': '15.00',
'slot.1/0.name.ccc.id.10.mused': '11.50'}
Original file based answer
The following code will construct the required dictionary (results). The idea is that each non-header line in the file is split by whitespace into fields, and the fields are used in a dictionary comprehension to construct a dictionary for each line, which is then used to update the results dictionary.
with open('data') as f:
    # skip the 3 header lines
    for i in range(3):
        _ = next(f)
    STAT_NAMES = 'cused callc mused mallc'.split()
    results = {}
    for line in f:
        line = line.split()
        slot, name, _id = line[:3]
        results.update(
            {'slot.{}.name.{}.id.{}.{}'.format(slot, name, _id, k): v
             for k, v in zip(STAT_NAMES, line[3:])})
Output
>>> from pprint import pprint
>>> pprint(results)
{'slot.1/0.name.aaa.id.10.callc': '15',
'slot.1/0.name.aaa.id.10.cused': '1.1',
'slot.1/0.name.aaa.id.10.mallc': '23.00',
'slot.1/0.name.aaa.id.10.mused': '15.10',
'slot.1/0.name.ccc.id.10.callc': '10',
'slot.1/0.name.ccc.id.10.cused': '0.2',
'slot.1/0.name.ccc.id.10.mallc': '15.00',
'slot.1/0.name.ccc.id.10.mused': '11.50',
'slot.2/0.name.bbb.id.25.callc': '20',
'slot.2/0.name.bbb.id.25.cused': '0.1',
'slot.2/0.name.bbb.id.25.mallc': '14.00',
'slot.2/0.name.bbb.id.25.mused': '11.39'}
Try this. Note: I changed "CPU" to "slot" in the headers:
headers = ['slot', 'name', 'id', 'cused', 'callc', 'mused', 'mallc']
result = [['1/0', 'aaa', '10', '0.1', '15', '10.73', '16.00'],
['1/0', 'bbb', '10', '0.1', '20', '11.27', '14.00'],
['1/0', 'ccc', '10', '0.2', '10', '11.50', '15.00'],
['1/0', 'aaa', '10', '1.1', '15', '15.10', '23.00']]
#I get: [['1/0', '1/0', '1/0', '1/0'], ['aaa', 'bbb', 'ccc', 'aaa'], ....
transpose_result = map(list, zip(*result))
#I get: {'slot': ['1/0', '1/0', '1/0', '1/0'],
# 'mallc': ['16.00', '14.00', '15.00', '23.00'], ...
data = dict(zip(headers, transpose_result))
d = {}
for reg in ("cused", "callc", "mused", "mallc"):
    for i, val in enumerate(data[reg]):
        key = []
        for reg2 in ("slot", "name", "id"):
            key.append(reg2)
            key.append(data[reg2][i])
        key.append(reg)
        d[".".join(key)] = val
You get in d:
{
'slot.1/0.name.bbb.id.10.cused': '0.1',
'slot.1/0.name.aaa.id.10.cused': '1.1',
'slot.1/0.name.bbb.id.10.callc': '20',
'slot.1/0.name.aaa.id.10.mallc': '23.00',
'slot.1/0.name.aaa.id.10.callc': '15',
'slot.1/0.name.ccc.id.10.mallc': '15.00',
'slot.1/0.name.ccc.id.10.mused': '11.50',
'slot.1/0.name.aaa.id.10.mused': '15.10',
'slot.1/0.name.ccc.id.10.cused': '0.2',
'slot.1/0.name.ccc.id.10.callc': '10',
'slot.1/0.name.bbb.id.10.mallc': '14.00',
'slot.1/0.name.bbb.id.10.mused': '11.27'}
import itertools
headers = 'slot name id cused callc mused mallc'.split()
result = ['1/0 aaa 10 0.1 15 10.73 16.00'.split(),
'2/0 bbb 25 0.1 20 11.39 14.00'.split()]
key_len = 3
d = {}
for row in result:
    key_start = '.'.join(itertools.chain(*zip(headers, row[:key_len])))
    for key_end, val in zip(headers[key_len:], row[key_len:]):
        d[key_start + '.' + key_end] = val
Another solution, with the correct numeric types for cused, callc, mused and mallc:
labels = ['slot','name','id','cused','callc','mused','mallc']
data = ['1/0 aaa 10 0.1 15 10.73 16.00',
'2/0 bbb 25 0.1 20 11.39 14.00',
'1/0 ccc 10 0.2 10 11.50 15.00',
'1/0 aaa 10 1.1 15 15.10 23.00']
data = [tuple(e.split()) for e in data]
data = [list(zip(labels, e)) for e in data]  # list() so the pairs are indexable in Python 3
results = dict()
for e in data:
    s = '%s.%s.%s' % tuple(['.'.join(e[i]) for i in range(3)])
    for i in range(3, 7):
        results['%s.%s' % (s, e[i][0])] = int(e[i][1]) if i == 4 else float(e[i][1])
print(results)
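For comparison (my own condensed variant, not one of the answers), the whole conversion also fits in a single dict comprehension over the raw lists:

```python
headers = ['CPU', 'name', 'id', 'cused', 'callc', 'mused', 'mallc']
result = [['1/0', 'aaa', '10', '0.1', '15', '10.73', '16.00'],
          ['1/0', 'bbb', '10', '0.1', '20', '11.27', '14.00'],
          ['1/0', 'ccc', '10', '0.2', '10', '11.50', '15.00'],
          ['1/0', 'aaa', '10', '1.1', '15', '15.10', '23.00']]

# the first three fields become the dotted key prefix; note that rows sharing
# a prefix (here the two 'aaa' rows) overwrite each other, keeping the last one
d = {'.'.join(f'{h}.{v}' for h, v in zip(['slot', 'name', 'id'], row[:3])) + '.' + k: val
     for row in result
     for k, val in zip(headers[3:], row[3:])}
```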
