From a DataFrame, I build a dictionary whose keys are the distinct values of a given column.
The value for each key is a nested dictionary whose keys are the distinct values of another column.
The values in the nested dictionaries are then filled in by iterating over the DataFrame (using a third column).
Example:
import pandas as pd
data = [['computer',1, 10]
,['computer',2,20]
,['computer',4, 40]
,['laptop',1, 100]
,['laptop',3, 30]
,['printer',2, 200]
]
df = pd.DataFrame(data,columns=['Product','id', 'qtt'])
print (df)
Product   id  qtt
computer  1   10
computer  2   20
computer  4   40
laptop    1   100
laptop    3   30
printer   2   200
kdf_key_dic = {key: None for key in df['id'].unique().tolist()}
product_key_dic = {key: kdf_key_dic for key in df['Product'].unique().tolist()}
print ("product_key_dic: ", product_key_dic)
product_key_dic: {
'computer': {1: None, 2: None, 4: None, 3: None},
'laptop': {1: None, 2: None, 4: None, 3: None},
'printer': {1: None, 2: None, 4: None, 3: None}
}
Now I'd like to update the product_key_dic dictionary, but I can't get it right: it always uses the same nested dict for every key in the main dictionary!
for index, row in df.iterrows():
    product_key_dic[row['Product']].update({row['id']:row['qtt']})
print("\n product_key_dic:\n", product_key_dic)
I get:
product_key_dic:
{ 'computer': {1: 100, 2: 200, 4: 40, 3: 30},
'laptop': {1: 100, 2: 200, 4: 40, 3: 30},
'printer': {1: 100, 2: 200, 4: 40, 3: 30}
}
I expect:
{ 'computer': {1: 10, 2: 20, 4: 40, 3: None},
'laptop': {1: 100, 2: None, 4: None, 3: 30},
'printer': {1: None, 2: 200, 4: None, 3: None}
}
I can't understand the problem; it is as if every key shares the same nested dictionary.
A different approach is to build a MultiIndex with MultiIndex.from_product from the unique values of Product and id, reindex to it, and then reshape so that DataFrame.to_dict can be called directly:
cols = ['Product', 'id']
product_key_dic = (
    df.set_index(cols)
      .reindex(
          pd.MultiIndex.from_product(
              [df[col].unique() for col in cols],
              names=cols
          )
      )  # Reindex to ensure all pairs are present in the DF
      .replace({np.nan: None})  # Replace nan with None
      .unstack('Product')  # Create Columns from Product
      .droplevel(0, axis=1)  # Remove qtt from column MultiIndex
      .to_dict()
)
product_key_dic:
{
'computer': {1: 10.0, 2: 20.0, 3: None, 4: 40.0},
'laptop': {1: 100.0, 2: None, 3: 30.0, 4: None},
'printer': {1: None, 2: 200.0, 3: None, 4: None}
}
Methods Used:
DataFrame.set_index
DataFrame.reindex
MultiIndex.from_product
Series.unique
DataFrame.replace
DataFrame.unstack
DataFrame.droplevel
DataFrame.to_dict
Setup and imports:
import numpy as np
import pandas as pd
data = [['computer', 1, 10], ['computer', 2, 20], ['computer', 4, 40],
['laptop', 1, 100], ['laptop', 3, 30], ['printer', 2, 200]]
df = pd.DataFrame(data, columns=['Product', 'id', 'qtt'])
The initial solution could be modified by adding a copy call to the dictionary in the comprehension to make them separate dictionaries rather than multiple references to the same one (How to copy a dictionary and only edit the copy). However, iterating over DataFrames is discouraged (Does pandas iterrows have performance issues?):
kdf_key_dic = {key: None for key in df['id'].unique().tolist()}
product_key_dic = {key: kdf_key_dic.copy()
for key in df['Product'].unique().tolist()}
for index, row in df.iterrows():
    product_key_dic[row['Product']].update({row['id']: row['qtt']})
product_key_dic:
{
'computer': {1: 10.0, 2: 20.0, 3: None, 4: 40.0},
'laptop': {1: 100.0, 2: None, 3: 30.0, 4: None},
'printer': {1: None, 2: 200.0, 3: None, 4: None}
}
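If the explicit loop is kept, one way to sidestep iterrows is to zip the three columns as plain Python lists. The following is only a minimal sketch under that assumption, reusing the sample data from the question. It swaps the .copy() call for dict.fromkeys, which likewise gives each product its own inner dictionary.

```python
import pandas as pd

data = [['computer', 1, 10], ['computer', 2, 20], ['computer', 4, 40],
        ['laptop', 1, 100], ['laptop', 3, 30], ['printer', 2, 200]]
df = pd.DataFrame(data, columns=['Product', 'id', 'qtt'])

ids = df['id'].unique().tolist()

# dict.fromkeys(ids) builds a fresh {id: None} dict on every iteration,
# so no two products share the same inner dictionary.
product_key_dic = {product: dict.fromkeys(ids)
                   for product in df['Product'].unique().tolist()}

# tolist() yields plain Python values, so the stored quantities stay ints.
for product, id_, qtt in zip(df['Product'].tolist(),
                             df['id'].tolist(),
                             df['qtt'].tolist()):
    product_key_dic[product][id_] = qtt

print(product_key_dic)
```

Because the columns are read as lists, no per-row Series is created, which is usually where iterrows spends its time.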
This happens because you are reusing the same dict object. Take these two statements:
kdf_key_dic = {key: None for key in df['id'].unique().tolist()}
product_key_dic = {key: kdf_key_dic for key in df['Product'].unique().tolist()}
In the second statement you pass kdf_key_dic as the value, which is the same object on every iteration.
Instead, pass a copy of kdf_key_dic when constructing product_key_dic:
product_key_dic = {key: kdf_key_dic.copy() for key in df['Product'].unique().tolist()}
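The difference between sharing one dict and copying it can be checked without pandas. This is a small illustrative sketch; the names template, separate and shared are made up for the demonstration.

```python
ids = [1, 2, 3]
products = ['computer', 'laptop']

template = dict.fromkeys(ids)  # {1: None, 2: None, 3: None}

# Each product gets its own shallow copy of the template.
separate = {product: template.copy() for product in products}
separate['computer'][1] = 10
print(separate['laptop'][1])                       # unaffected by the update
print(separate['computer'] is separate['laptop'])  # two distinct objects

# Every product points at the very same dict object.
shared = {product: template for product in products}
shared['computer'][1] = 10
print(shared['laptop'][1])                         # the update is visible here too
print(shared['computer'] is shared['laptop'])      # one object, two names
```

Note that .copy() is a shallow copy. That is enough here because the inner values are None or numbers; nested mutable values would need copy.deepcopy.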
I am working on an optimization problem and need to create indexing to build a mixed-integer mathematical model. I am using python dictionaries for the task. Below is a sample of my dataset. Full dataset is expected to have about 400K rows if that matters.
# sample input data
pd.DataFrame.from_dict({'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}})
The data frame looks like this -
I am looking to generate a dictionary from each row of this dataframe where the keys and values are tuples made of multiple column values. The output would look like this -
(origin, dest, product, ship_date): (origin, dest, product, truck_in)
# for example, first two rows will become a dictionary key-value pair like
{('perris', 'alexandria', 'bike', '2/25/2022'): ('perris', 'alexandria', 'bike', '3/1/2022'),
('perris', 'alexandria', 'bike', '2/26/2022'): ('perris', 'alexandria', 'bike', '3/2/2022')}
I am very new to python and couldn't figure out how to do this. Any help is appreciated. Thanks!
You can loop through the DataFrame.
Assuming your DataFrame is called "df" this gives you the dict.
result_dict = {}
for idx, row in df.iterrows():
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)
Since looping through 400k rows will take some time, have a look at tqdm (https://tqdm.github.io/) to get a progress bar with a time estimate that quickly tells you if the approach works for your dataset.
Also, note that 400K dictionary entries may take up a lot of memory so you may try to estimate if the dict fits your memory.
Another way, which uses more memory but is faster, is to do it in pandas.
Create a new column with the values for the dictionary:
df['value'] = df.apply(lambda x: (x.origin, x.dest, x['product'], x.truck_in), axis=1)
Then set the index and convert to dict
df.set_index(['origin','dest','product','ship_date'])['value'].to_dict()
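A further option, offered only as a sketch, is to build the key and value tuples by zipping the columns directly, which avoids both iterrows and a row-wise apply. It assumes the column names from the question and uses just two rows of the sample data; row_tuples is a hypothetical helper introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'origin': ['perris', 'perris'],
    'dest': ['alexandria', 'alexandria'],
    'product': ['bike', 'bike'],
    'ship_date': ['02/25/2022', '02/26/2022'],
    'truck_in': ['03/01/2022', '03/02/2022'],
})

shared_cols = ['origin', 'dest', 'product']


def row_tuples(frame, columns):
    """Return one tuple per row, built from the given columns."""
    return list(zip(*(frame[col].tolist() for col in columns)))


keys = row_tuples(df, shared_cols + ['ship_date'])
values = row_tuples(df, shared_cols + ['truck_in'])

# Pair each key tuple with the value tuple from the same row.
result_dict = dict(zip(keys, values))
print(result_dict)
```

If two rows produce the same key tuple, the later row overwrites the earlier one, as with any dict.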
The approach below splits the initial dataframe into two dataframes that will be the source of the keys and values in the dictionary. These are then converted to arrays in order to get away from working with dataframes as soon as possible. The arrays are converted to tuples and zipped together to create the key:value pairs.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(
{'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}}
)
#display(df)
#desired output: (origin, dest, product, ship_date): (origin, dest, product, truck_in)
# Slice df into key and value chunks, then convert each to a record array.
# Chaining set_index avoids modifying a slice in place; using 'origin' as the
# index keeps to_records() from prepending the default integer index.
ship = df[['origin', 'dest', 'product', 'ship_date']].set_index('origin')
keys_array = ship.to_records()
truck = df[['origin', 'dest', 'product', 'truck_in']].set_index('origin')
values_array = truck.to_records()
#array_of_tuples = map(tuple, an_array)
keys_map = map(tuple, keys_array)
values_map = map(tuple, values_array)
#tuple_of_tuples = tuple(array_of_tuples)
keys_tuple = tuple(keys_map)
values_tuple = tuple(values_map)
zipp = zip(keys_tuple, values_tuple)
dict2 = dict(zipp)
print(dict2)
I have a DF that looks like this.
df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3}, 'Value': {0: 'a', 1: 'b', 2: 'c'}})
   ID Value
0   1     a
1   2     b
2   3     c
I'd like to create a dictionary out of it.
So if I run df.to_dict('records'), it gives me
[{'ID': 1, 'Value': 'a'},
 {'ID': 2, 'Value': 'b'},
 {'ID': 3, 'Value': 'c'}]
However, what I want is the following.
{
1: 'a',
2: 'b',
3: 'c'
}
All of the rows in the DF are unique, so it shouldn't run into duplicate key issues.
Try with
d = dict(zip(df.ID, df.Value))
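As a quick check, the zip approach can be compared with going through the index. This is a minimal sketch that assumes the ID values are unique, as stated in the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3], 'Value': ['a', 'b', 'c']})

# Pair the two columns element by element.
d_zip = dict(zip(df['ID'].tolist(), df['Value'].tolist()))

# Alternative: make ID the index and convert the Value column to a dict.
d_series = df.set_index('ID')['Value'].to_dict()

print(d_zip)
print(d_series == d_zip)
```

With dict(zip(...)), a repeated ID would silently keep only the last value, so the uniqueness assumption matters.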
If I have df1
df1 = pd.DataFrame({'Col_Name': {0: 'A', 1: 'b', 2: 'c'}, 'X': {0: 12, 1: 23, 2: 223}, 'Z': {0: 42, 1: 33, 2: 28 }})
and df2
df2 = pd.DataFrame({'Col': {0: 'Y', 1: 'X', 2: 'Z'}, 'Low1': {0: 0, 1: 0, 2: 0}, 'High1': {0: 10, 1: 10, 2: 630}, 'Low2': {0: 10, 1: 10, 2: 630}, 'High2': {0: 50, 1: 50, 2: 3000}, 'Low3': {0: 50, 1: 50, 2: 3000}, 'High3': {0: 100, 1: 100, 2: 8500}, 'Low4': {0: 100, 1: 100, 2: 8500}, 'High4': {0: 'np.inf', 1: 'np.inf', 2: 'np.inf'}})
I want to select the rows of df2 whose Col value appears among the column names of df1.
Expected Output: df3
df3 = pd.DataFrame({'Col': {0: 'X', 1: 'Z'}, 'Low1': {0: 0, 1: 0}, 'High1': {0: 10, 1: 630}, 'Low2': {0: 10, 1: 630}, 'High2': {0: 50, 1: 3000}, 'Low3': {0: 50, 1: 3000}, 'High3': {0: 100, 1: 8500}, 'Low4': {0: 100, 1: 8500}, 'High4': {0: 'np.inf', 1: 'np.inf'}})
How to do it?
You can pass a boolean list to select the rows of df2 that you want. This list can be created by looking at each value in the Col column and asking if the value is in the columns of df1
df3 = df2[[col in df1.columns for col in df2['Col']]]
You can also drop the non-relevant column from df1 and pass the remaining column names to isin:
df3 = df2[df2['Col'].isin(list(df1.drop('Col_Name',axis=1).columns))]
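The isin filter can be tried on a cut-down version of the two frames. This sketch keeps only the first few columns of df2 for brevity, and it adds a reset_index(drop=True) call on the assumption that the result should be renumbered from 0, as in the expected df3.

```python
import pandas as pd

df1 = pd.DataFrame({'Col_Name': ['A', 'b', 'c'],
                    'X': [12, 23, 223],
                    'Z': [42, 33, 28]})
df2 = pd.DataFrame({'Col': ['Y', 'X', 'Z'],
                    'Low1': [0, 0, 0],
                    'High1': [10, 10, 630]})

# True for every row of df2 whose 'Col' value is a column name of df1.
mask = df2['Col'].isin(df1.columns)

# Keep the matching rows and renumber the index from 0.
df3 = df2[mask].reset_index(drop=True)
print(df3)
```

Leaving 'Col_Name' in df1.columns is harmless here because no value in df2['Col'] equals it.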
My code is below
It applies pd.to_numeric to the columns that are supposed to be int or float but come in as object. Can this be done in a more pandas-like way, for example with something like np.where?
if df.dtypes.all() == 'object':
    df=df.apply(pd.to_numeric,errors='coerce').fillna(df)
else:
    df = df
A simple one-liner is assign with select_dtypes, which reassigns the existing columns:
df.assign(**df.select_dtypes('O').apply(pd.to_numeric,errors='coerce').fillna(df))
np.where:
df[:] = (np.where(df.dtypes=='object',
         df.apply(pd.to_numeric,errors='coerce').fillna(df),df))
Example (check Price column) :
d = {'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: '24000', 1: 'a', 2: '900'}}
df = pd.DataFrame(d)
print(df)
CusID Name Shop Price
0 1 Paul Pascal 24000
1 2 Mark Casio a
2 3 Bill Nike 900
df.to_dict()
{'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: '24000', 1: 'a', 2: '900'}}
(df.assign(**df.select_dtypes('O').apply(pd.to_numeric,errors='coerce')
.fillna(df)).to_dict())
{'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: 24000.0, 1: 'a', 2: 900.0}}
The equivalent of your if/else is df.mask:
df_out = df.mask(df.dtypes =='O', df.apply(pd.to_numeric, errors='coerce')
.fillna(df))
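The answers above convert value by value, so a column such as Price ends up holding a mix of numbers and strings. If the goal is instead to convert a column only when every value parses, a per-column loop is one possible sketch. This is a different strategy from the one-liners above, it assumes the columns contain no missing values to begin with, and the Qty column is a hypothetical addition for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({'CusID': [1, 2, 3],
                   'Price': ['24000', 'a', '900'],
                   'Qty': ['1', '2', '3']})  # 'Qty' is a hypothetical extra column

converted = df.copy()
for col in df.columns:
    if is_numeric_dtype(df[col]):
        continue  # already numeric, nothing to do
    numeric = pd.to_numeric(df[col], errors='coerce')
    # Convert the column only when every value parsed successfully.
    if numeric.notna().all():
        converted[col] = numeric

print(converted.dtypes)
```

Here Qty becomes numeric, while Price is left untouched because 'a' cannot be parsed.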
I have this kind of dictionary in python:
x = {'test': {1: 2, 2: 4, 3: 5},
'this': {1: 2, 2: 3, 7: 6},
'is': {1: 2},
'something': {90: 2,92:3}}
I want to set every value in the nested dictionaries to a value of my choice, let's say 100. The method I tried is below:
counter = 1
print(x)
for key,anotherKey in x.items():
    while counter not in x[key]:
        counter+=1
    while counter in x[key]:
        x[key][counter] = 100
        counter+=1
    counter =0
Which got the result below:
{'test': {1: 100, 2: 100, 3: 100},
'this': {1: 100, 2: 100, 7: 6},
'is': {1: 100},
'something': {90: 100,92: 3}}
I know why this is happening: the loop doesn't handle the case where consecutive keys differ by more than 1, as in 'this', where the key jumps from 2 to 7. However, I don't know how to solve this.
You can iterate via a nested for loop:
x = {'test': {1: 2, 2: 4, 3: 5},
'this': {1: 2, 2: 3, 7: 6},
'is': {1: 2},
'something': {90: 2,92:3}}
for a in x:
    for b in x[a]:
        x[a][b] = 100
print(x)
{'is': {1: 100},
'something': {90: 100, 92: 100},
'test': {1: 100, 2: 100, 3: 100},
'this': {1: 100, 2: 100, 7: 100}}
Or for a new dictionary you can use a dictionary comprehension:
res = {a: {b: 100 for b in x[a]} for a in x}
You can use dictionary comprehension in this way:
{a: {b:100 for b in d} for (a,d) in x.items()}
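One practical difference from the nested loop is that a comprehension builds a new dictionary and leaves x unchanged. The sketch below shows this using dict.fromkeys as a variation on the inner comprehension; that shortcut is safe here because 100 is immutable, whereas a mutable default would be shared by all keys.

```python
x = {'test': {1: 2, 2: 4, 3: 5},
     'this': {1: 2, 2: 3, 7: 6},
     'is': {1: 2},
     'something': {90: 2, 92: 3}}

# dict.fromkeys(inner, 100) builds a new inner dict with the same keys,
# each mapped to 100, regardless of gaps between the keys.
res = {outer: dict.fromkeys(inner, 100) for outer, inner in x.items()}

print(res['this'])  # every key of 'this' now maps to 100
print(x['this'])    # the original dictionary is left as it was
```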