python dataframe to dictionary with multiple columns in keys and values - python

I am working on an optimization problem and need to create indexing to build a mixed-integer mathematical model. I am using python dictionaries for the task. Below is a sample of my dataset. Full dataset is expected to have about 400K rows if that matters.
# sample input data
pd.DataFrame.from_dict({'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}})
The data frame looks like this -
I am looking to generate a dictionary from each row of this dataframe where the keys and values are tuples made of multiple column values. The output would look like this -
(origin, dest, product, ship_date): (origin, dest, product, truck_in)
# for example, first two rows will become a dictionary key-value pair like
{('perris', 'alexandria', 'bike', '2/25/2022'): ('perris', 'alexandria', 'bike', '3/1/2022'),
('perris', 'alexandria', 'bike', '2/26/2022'): ('perris', 'alexandria', 'bike', '3/2/2022')}
I am very new to python and couldn't figure out how to do this. Any help is appreciated. Thanks!

You can loop through the DataFrame.
Assuming your DataFrame is called "df" this gives you the dict.
result_dict = {}
for idx, row in df.iterrows():
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)
Since looping through 400k rows will take some time, have a look at tqdm (https://tqdm.github.io/) to get a progress bar with a time estimate that quickly tells you if the approach works for your dataset.
Also note that 400K dictionary entries may take up a lot of memory, so it is worth estimating whether the dict will fit in memory.
Another way, which uses more memory but is faster, is to do it in pandas.
Create a new column with the value for the dictionary:
df['value'] = df.apply(lambda x: (x.origin, x.dest, x['product'], x.truck_in), axis=1)
Then set the index and convert to dict
df.set_index(['origin','dest','product','ship_date'])['value'].to_dict()
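A further option, offered here as a minimal sketch rather than as part of the answer above, is to zip the relevant columns directly. It avoids both the row-by-row loop and the extra 'value' column. The two-row frame below is a shortened copy of the question's sample data, and the names shared, keys and values are illustrative only.

```python
import pandas as pd

# Two rows in the shape of the question's sample data (shortened for illustration).
df = pd.DataFrame({
    'origin': ['perris', 'perris'],
    'dest': ['alexandria', 'alexandria'],
    'product': ['bike', 'bike'],
    'ship_date': ['02/25/2022', '02/26/2022'],
    'truck_in': ['03/01/2022', '03/02/2022'],
})

# Columns that appear in both the key tuple and the value tuple.
shared = [df['origin'], df['dest'], df['product']]

# zip walks the columns in parallel and yields one tuple per row.
keys = zip(*shared, df['ship_date'])
values = zip(*shared, df['truck_in'])

result_dict = dict(zip(keys, values))
print(result_dict)
```

As with the other approaches, rows that share the same key tuple overwrite one another, so only the last such row survives in the dictionary.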

The approach below splits the initial dataframe into two dataframes that will be the source of the keys and values in the dictionary. These are then converted to arrays in order to get away from working with dataframes as soon as possible. The arrays are converted to tuples and zipped together to create the key:value pairs.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(
{'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}}
)
#display(df)
#desired output: (origin, dest, product, ship_date): (origin, dest, product, truck_in)
#slice df to key/value chunks
#list to array
# index=False keeps only the four selected columns in each record,
# without calling set_index in place on a slice of df
ship = df[['origin', 'dest', 'product', 'ship_date']]
keys_array = ship.to_records(index=False)
truck = df[['origin', 'dest', 'product', 'truck_in']]
values_array = truck.to_records(index=False)
#array_of_tuples = map(tuple, an_array)
keys_map = map(tuple, keys_array)
values_map = map(tuple, values_array)
#tuple_of_tuples = tuple(array_of_tuples)
keys_tuple = tuple(keys_map)
values_tuple = tuple(values_map)
zipp = zip(keys_tuple, values_tuple)
dict2 = dict(zipp)
print(dict2)

Related

How to make this to data frame?

I am using Python and I am trying to convert this to a dataframe, but the dictionaries have different lengths.
Do you have any ideas? The keys present (0-6 in total) differ from row to row.
0 {1: 0.14428478, 3: 0.3088169, 5: 0.54362816}
1 {0: 0.41822478, 2: 0.081520624, 3: 0.40189278,...
2 {3: 0.9927109}
3 {0: 0.07826376, 3: 0.9162877}
4 {0: 0.022929467, 1: 0.0127365505, 2: 0.8355256...
...
59834 {1: 0.93473625, 5: 0.055679787}
59835 {1: 0.72145665, 3: 0.022041071, 5: 0.25396}
59836 {0: 0.01922486, 1: 0.019249884, 2: 0.5345934, ...
59837 {0: 0.014184893, 1: 0.23436697, 2: 0.58155864,...
59838 {0: 0.013977169, 1: 0.24653174, 2: 0.60093427,...
I would like to see the Python code for this.

sort the order of dataframes in a list of dataframes based on a value in each dataframe

I have a list of dataframes and I want to sort the order in which they appear in the list.
Each dataframe has the same structure as shown below
df1 = pd.DataFrame.from_dict({'Ch1': {0: -28, 1: -36, 2: -39, 3: -16}, 'Ch2': {0: 543, 1: 547, 2: 559, 3: 561}, 'Ch3': {0: -126, 1: -131, 2: -147, 3: -149}, 'time': {0: '2022-02-10 16.37.25.502', 1: '2022-02-10 16.37.25.502', 2: '2022-02-10 16.37.25.502', 3: '2022-02-10 16.37.25.502'}})
df2 = pd.DataFrame.from_dict({'Ch1': {0: 81, 1: 70, 2: 70, 3: 75}, 'Ch2': {0: 570, 1: 559, 2: 554, 3: 565}, 'Ch3': {0: -103, 1: -120, 2: -131, 3: -122}, 'time': {0: '2022-02-11 05.29.28.116', 1: '2022-02-11 05.29.28.116', 2: '2022-02-11 05.29.28.116', 3: '2022-02-11 05.29.28.116'}})
df3 = pd.DataFrame.from_dict({'Ch1': {0: -887, 1: -887, 2: -890, 3: -898}, 'Ch2': {0: 1307, 1: 1292, 2: 1301, 3: 1307}, 'Ch3': {0: 59, 1: 61, 2: 57, 3: 55}, 'time': {0: '2022-02-08 01.12.54.578', 1: '2022-02-08 01.12.54.578', 2: '2022-02-08 01.12.54.578', 3: '2022-02-08 01.12.54.578'}})
df_list = [df1,df2,df3]
The values in the "time" column do not change from row to row within the same dataframe.
I want the dataframes in the list sorted by time (first to last) so that further processing can match them up with other data.
My attempt thus far:
for i in df_list:
    b = pd.to_datetime(i['time'].iloc[0]) #grab the first cell that contains the time stamp
    b = b.sort_values(by('time'))
returns the following error
ValueError: ('Unknown string format:', '2022-02-05 08.03.09.794')
I would expect the dataframes to appear in the list with df3 first, df1 second and df2 last. The time column is going to be dropped for other operations, so I would like the dataframes already sorted in time order.
Any help, suggestions, or alternative approaches are greatly appreciated.
If you want to sort the rows of each dataframe, you need to provide the exact format of your datetime, and you should sort in place:
for d in df_list:
    d['time'] = pd.to_datetime(d['time'], format='%Y-%m-%d %H.%M.%S.%f')
    d.sort_values(by='time', inplace=True)
Or, if you want to sort the dataframes in the list, which is completely different, use:
df_list.sort(key=lambda d: d['time'].iloc[0])
You should be able to sort using the string due to your particular format (assuming YYYY-MM-DD).
To ensure sorting on datetime (for example if the format was MM-DD-YYYY):
df_list.sort(key=lambda d: pd.to_datetime(d['time'].iloc[0], format='%Y-%m-%d %H.%M.%S.%f'))
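The list-level sort can be checked end to end with a small sketch. The frames below are shortened versions of the question's df1, df2 and df3; make_frame, first_timestamp and TIME_FORMAT are hypothetical helper names introduced only for this illustration.

```python
import pandas as pd

# Format of the timestamps in the question (dots instead of colons).
TIME_FORMAT = '%Y-%m-%d %H.%M.%S.%f'

def make_frame(ch1, stamp):
    """Build a small frame whose 'time' column repeats one timestamp."""
    return pd.DataFrame({'Ch1': ch1, 'time': [stamp] * len(ch1)})

df1 = make_frame([-28, -36], '2022-02-10 16.37.25.502')
df2 = make_frame([81, 70], '2022-02-11 05.29.28.116')
df3 = make_frame([-887, -887], '2022-02-08 01.12.54.578')
df_list = [df1, df2, df3]

def first_timestamp(frame):
    """Parse the first 'time' cell; every row holds the same value."""
    return pd.to_datetime(frame['time'].iloc[0], format=TIME_FORMAT)

# sorted() returns a new list; df_list.sort(key=first_timestamp) would sort in place.
sorted_list = sorted(df_list, key=first_timestamp)

# First Ch1 value of each frame, used here only to show the resulting order.
order = [int(frame['Ch1'].iloc[0]) for frame in sorted_list]
print(order)
```

Because the key is computed from the column before it is dropped, the "time" column can be removed from each frame afterwards without affecting the order.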

Take sum of values before the row's date

I have a dataframe that looks like this:
df = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
'date': {0: '11/11/2018',
1: '11/12/2018',
2: '11/13/2018',
3: '11/14/2018',
4: '11/15/2018',
5: '11/16/2018'},
'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5}})
I need the resulting dataframe to look like this:
output = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
'date': {0: '11/11/2018',
1: '11/12/2018',
2: '11/13/2018',
3: '11/14/2018',
4: '11/15/2018',
5: '11/16/2018'},
'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5},
'total_score_per_id_before_date': {0: 1, 1: 1, 2: 3, 3: 3, 4: 1, 5: 1}})
my code so far:
output= df[["id","score"]].groupby("id").sum()
However, this gives me the total sum of scores for each id. I need the sum of scores before the date in that specific row. Only the first score should not be discarded.
Use the cumulative sum on a series. Then subtract the current values, as you asked for the cumulative sum before the current index. Finally, add back the first values, otherwise they’re zero.
# select the score column before cumsum so the non-numeric date column is not accumulated
previously_accumulated_scores = df.groupby("id").score.cumsum() - df.score
firsts = df.groupby("id").first().reset_index()
df2 = df.merge(firsts, on=["id", "date"], how="left", suffixes=("", "_r"))
df["total_score_per_id_before_date"] = previously_accumulated_scores + df2.score_r.fillna(0)
The merge could be done more elegantly, by changing the index to a MultiIndex, but that’s a style preference.
Note: this assumes your DataFrame is sorted by the date-like column; groupby preserves the order of rows within each group (per the pandas docs).
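The same result can be sketched without the merge. This swaps in a different technique, groupby(...).cumcount(), to flag the first row of each id; it is not the MultiIndex variant mentioned above. It likewise assumes the rows are already sorted by date, and the names before and is_first are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'id': [1, 3, 2, 2, 1, 3],
    'date': ['11/11/2018', '11/12/2018', '11/13/2018',
             '11/14/2018', '11/15/2018', '11/16/2018'],
    'score': [1, 1, 3, 2, 0, 5],
})

# Running total per id, minus the current row, gives the sum of earlier rows.
before = df.groupby('id')['score'].cumsum() - df['score']

# cumcount() numbers the rows within each id starting at 0,
# so 0 marks the first occurrence of that id.
is_first = df.groupby('id').cumcount() == 0

# Keep the "before" sum for later rows; use the row's own score for first rows.
df['total_score_per_id_before_date'] = before.where(~is_first, df['score'])
print(df)
```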

How to change from index to multiindex - pandas

I've got a data frame structured as below:
dict1 = {'id': {0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 19, 7: 18, 8: 17},
'var1': {0: 20.272108843537413,
1: 21.088435374149658,
2: 20.68027210884354,
3: 23.945578231292515,
4: 22.857142857142854,
5: 21.496598639455787,
6: 39.18367346938776,
7: 36.46258503401361,
8: 34.965986394557824},
'var2': {0: 27.731092436974773,
1: 43.907563025210074,
2: 55.67226890756303,
3: 62.81512605042017,
4: 71.63865546218487,
5: 83.40336134453781,
6: 43.48739495798319,
7: 59.243697478991606,
8: 67.22689075630252},
'var3': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}}
ex = pd.DataFrame(dict1).set_index('id')
id is set as an index, but now I would like to create a MultiIndex from var3 and id. But my following attempt fails:
ex.set_index(['var3', 'id'])
How can I set a MultiIndex directly from the existing Index? I know I can reset_index first and then set a MultiIndex, but it feels like there has to be a more elegant way.
DataFrame.set_index has an append argument, which is False by default.
If you have a DataFrame already indexed by "id", and you'd like to append "var3" to that, simply invoke:
new_df = ex.set_index("var3", append=True)
As suggested by @piRSquared in the comments, you can also swap the order if you would like "var3" to come first, by chaining a call to swaplevel, i.e.:
new_df = ex.set_index("var3", append=True).swaplevel(0, 1)
Like this:
ex.set_index(['var3', ex.index])
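The approaches above can be compared side by side. This is a minimal sketch on an abbreviated, rounded version of the question's data; the variable names appended, swapped and direct are illustrative.

```python
import pandas as pd

# Abbreviated version of the question's frame, already indexed by 'id'.
ex = pd.DataFrame({
    'id': [11, 12, 19, 18],
    'var1': [20.3, 21.1, 39.2, 36.5],
    'var3': [1, 1, 2, 2],
}).set_index('id')

# append=True keeps the existing 'id' level and adds 'var3' after it.
appended = ex.set_index('var3', append=True)

# swaplevel moves 'var3' in front of 'id'.
swapped = appended.swaplevel(0, 1)

# Passing the current index object alongside the column name
# builds the ('var3', 'id') MultiIndex in one call.
direct = ex.set_index(['var3', ex.index])

print(list(appended.index.names))
print(list(swapped.index.names))
print(list(direct.index.names))
```

The swaplevel route and the one-call route end up with the same level order, so the choice between them is mostly a matter of taste.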

Probability Density Function using pandas data

I would like to model the probability of an event occurring given the existence of the previous event.
To give you more context, I plan to group my data by anonymous_id, sort the values of the grouped dataset by timestamp (ts) and calculate the probability of the sequence of sources (utm_source) the person goes through. The person is represented by a unique anonymous_id. So the desired end goal is the probability of someone who came from a Facebook source to then come through from a Google source etc
I have been told that something like SciPy's gaussian_kde would be useful for this. However, from playing around with it, it seems to require numerical inputs.
So far I have grouped and sorted the data:
test_sample = test_sample.groupby('anonymous_id').apply(lambda x: x.sort_values(['ts'])).reset_index(drop=True)
but I am not sure what to try next.
I have also tried this, but I don't think that it makes much sense:
stats.gaussian_kde(test_two['utm_source'])
Here is a sample of my data
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9},
'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902',
5: '00022b83-240e-4ef9-aaad-ac84064bb902',
6: '00022b83-240e-4ef9-aaad-ac84064bb902',
7: '00022b83-240e-4ef9-aaad-ac84064bb902',
8: '00022b83-240e-4ef9-aaad-ac84064bb902',
9: '0002ed69-4aff-434d-a626-fc9b20ef1b02'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000',
5: '2017-10-11 08:20:37.345000',
6: '2017-10-11 08:21:01.322000',
7: '2017-10-11 08:21:14.145000',
8: '2017-10-11 08:23:47.526000',
9: '2019-06-12 10:42:50.401000'},
'utm_source': {0: nan,
1: 'facebook',
2: 'facebook',
3: 'google',
4: nan,
5: 'facebook',
6: 'google',
7: 'adwords',
8: 'youtube',
9: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5, 8: 6, 9: 1}}
Note: I converted the dataframe to a dictionary.
Here is one way you can do it (if I understand correctly):
from itertools import chain
from collections import Counter
groups = (df
          .sort_values(by='ts')
          .dropna()
          .groupby('anonymous_id').utm_source
          .agg(list)
          .reset_index()
          )
groups['transitions'] = groups.utm_source.apply(lambda x: list(zip(x,x[1:])))
all_transitions = Counter(chain(*groups.transitions.tolist()))
Which gives you (on your example data):
In [42]: all_transitions
Out[42]:
Counter({('google', 'facebook'): 1,
('facebook', 'google'): 1,
('google', 'adwords'): 1,
('adwords', 'youtube'): 1})
Or are you looking for something different?
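If the goal is a probability rather than a raw count, one possible next step is to divide each transition count by the total number of transitions leaving the same source. This is a sketch of a simple first-order estimate, which is an assumption about what the question is after; the names outgoing and transition_probs are illustrative, and the Counter is copied from the output above.

```python
from collections import Counter, defaultdict

# Transition counts as produced on the example data above.
all_transitions = Counter({
    ('google', 'facebook'): 1,
    ('facebook', 'google'): 1,
    ('google', 'adwords'): 1,
    ('adwords', 'youtube'): 1,
})

# Total number of observed transitions leaving each source.
outgoing = defaultdict(int)
for (source, _target), count in all_transitions.items():
    outgoing[source] += count

# Empirical estimate of P(next source | current source).
transition_probs = {
    pair: count / outgoing[pair[0]]
    for pair, count in all_transitions.items()
}
print(transition_probs)
```

With so few observations the estimates are very coarse; on the full dataset the same calculation would be based on many more transitions per source.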
