I am working on an optimization problem and need to create indexing to build a mixed-integer mathematical model. I am using Python dictionaries for the task. Below is a sample of my dataset. The full dataset is expected to have about 400K rows, if that matters.
# sample input data
pd.DataFrame.from_dict({'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}})
I am looking to generate a dictionary from each row of this dataframe where the keys and values are tuples made of multiple column values. The output would look like this -
(origin, dest, product, ship_date): (origin, dest, product, truck_in)
# for example, first two rows will become a dictionary key-value pair like
{('perris', 'alexandria', 'bike', '2/25/2022'): ('perris', 'alexandria', 'bike', '3/1/2022'),
('perris', 'alexandria', 'bike', '2/26/2022'): ('perris', 'alexandria', 'bike', '3/2/2022')}
I am very new to python and couldn't figure out how to do this. Any help is appreciated. Thanks!
You can loop through the DataFrame.
Assuming your DataFrame is called "df", this gives you the dict:
result_dict = {}
for idx, row in df.iterrows():
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)
Since looping through 400K rows will take some time, have a look at tqdm (https://tqdm.github.io/) to get a progress bar with a time estimate that quickly tells you whether the approach works for your dataset.
Also, note that 400K dictionary entries may take up a lot of memory, so you may want to estimate whether the dict fits in memory.
Another way, which wastes memory but is faster, is to do it in pandas.
First, create a new column holding the dictionary value:
df['value'] = df.apply(lambda x: (x.origin, x.dest, x['product'], x.truck_in), axis=1)
Then set the index and convert to dict
df.set_index(['origin','dest','product','ship_date'])['value'].to_dict()
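For 400K rows, a plain dict comprehension over zipped columns is another option that avoids both the per-row overhead of iterrows and the lambda call of apply. A minimal sketch, assuming the same column names as the sample data:

```python
import pandas as pd

# two sample rows in the same shape as the question's data
df = pd.DataFrame({
    'origin': ['perris', 'perris'],
    'dest': ['alexandria', 'alexandria'],
    'product': ['bike', 'bike'],
    'ship_date': ['02/25/2022', '02/26/2022'],
    'truck_in': ['03/01/2022', '03/02/2022'],
})

# zip the columns once; each zip yields one plain tuple per row
keys = zip(df['origin'], df['dest'], df['product'], df['ship_date'])
values = zip(df['origin'], df['dest'], df['product'], df['truck_in'])
result = dict(zip(keys, values))
```

This builds the tuples directly from the column Series, so no intermediate column or index is needed.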
The approach below splits the initial dataframe into two dataframes that will be the source of the keys and values in the dictionary. These are then converted to record arrays so that the work leaves pandas as early as possible. The arrays are converted to tuples and zipped together to create the key:value pairs.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(
{'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}}
)
#display(df)
#desired output: (origin, dest, product, ship_date): (origin, dest, product, truck_in)
#slice df into key and value chunks (copy to avoid a SettingWithCopyWarning), then convert to record arrays
ship = df[['origin', 'dest', 'product', 'ship_date']].copy()
ship.set_index('origin', inplace=True)
keys_array = ship.to_records()
truck = df[['origin', 'dest', 'product', 'truck_in']].copy()
truck.set_index('origin', inplace=True)
values_array = truck.to_records()
#convert each record to a plain tuple
keys_map = map(tuple, keys_array)
values_map = map(tuple, values_array)
keys_tuple = tuple(keys_map)
values_tuple = tuple(values_map)
zipp = zip(keys_tuple, values_tuple)
dict2 = dict(zipp)
print(dict2)
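An itertuples comprehension reaches the same dictionary with less machinery, and is typically much faster than iterrows. A minimal sketch on a couple of sample rows in the same shape as the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'origin': ['perris', 'perris'],
    'dest': ['alexandria', 'alexandria'],
    'product': ['bike', 'bike'],
    'ship_date': ['02/25/2022', '02/26/2022'],
    'truck_in': ['03/01/2022', '03/02/2022'],
})

# itertuples yields one namedtuple per row; index=False drops the row label
dict3 = {
    (t.origin, t.dest, t.product, t.ship_date): (t.origin, t.dest, t.product, t.truck_in)
    for t in df.itertuples(index=False)
}
```

Because itertuples returns lightweight namedtuples instead of Series objects, this scales much better to 400K rows than the iterrows loop.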
I am trying to concat or append (neither works) two 9-column dataframes. But instead of a normal vertical stack, pandas keeps adding 9 more empty columns as well. Do you know how to stop this?
The output looks like this:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,0,1,10,11,12,13,2,3,4,5,6,7,8,9
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency ,,,,,,,,,,,,,,
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.00,48,48,0,Ken,(903) 280-7878,UrTruckBroker ,,,,,,,,,,,,,,
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc. ,,,,,,,,,,,,,,
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv ,,,,,,,,,,,,,,
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency ,,,,,,,,,,,,,,
...
,,,,,,,,,,,,,,03/02/2021,Knapp,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Jackson,NE,Full,Flatbed / Step Deck,0.0,48.0,48.0
,,,,,,,,,,,,,,03/02/2021,Knapp,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Sterling,IL,Full,Flatbed / Step Deck,0.0,48.0,48.0
,,,,,,,,,,,,,,03/02/2021,Milwaukee,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Great Falls,MT,Full,Flatbed / Step Deck,0.0,45.0,48.0
,,,,,,,,,,,,,,03/02/2021,Algoma,0.0,Dispatch,(763) 432-3680,Fuze Logistics Services USA ,WI,Pamplico,SC,Full,Flatbed / Step Deck,0.0,48.0,48.0
The code makes a web request to get data, which I save to a dataframe; that is then concat-ed with another dataframe that comes from a CSV. I then save all of this back to that CSV:
this_csv = 'freights_trulos.csv'
try:
    old_df = pd.read_csv(this_csv)
except BaseException as e:
    print(e)
    old_df = pd.DataFrame()
state, equip = 'DE', 'Flat'
url = "https://backend-a.trulos.com/load-table/grab_loads.php?state=%s&equipment=%s" % (state, equip)
payload = {}
headers = {
    ...
}
response = requests.request("GET", url, headers=headers, data=payload)
# print(response.text)
parsed = json.loads(response.content)
data = [r[0:13] + [r[-4].split('<br/>')[-2].split('>')[-1]] for r in parsed]
df = pd.DataFrame(data=data)
if not old_df.empty:
    # concatenate old and new and remove duplicates
    # df.reset_index(drop=True, inplace=True)
    # old_df.reset_index(drop=True, inplace=True)
    # df = pd.concat([old_df, df], ignore_index=True)  # <--- CONCAT HAS SAME ISSUES AS APPEND
    df = df.append(old_df, ignore_index=True)
# remove duplicates on cols (drop_duplicates returns a new frame, so reassign)
df = df.drop_duplicates()
df.to_csv(this_csv, index=False)
EDIT: the appended dataframes have different dtypes
df.dtypes
Out[2]:
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
10 object
11 object
12 object
13 object
dtype: object
old_df.dtypes
Out[3]:
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 float64
8 int64
9 int64
10 int64
11 object
12 object
13 object
dtype: object
old_df to csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.0,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.0,48,48,0,Ken,(903) 280-7878,UrTruckBroker
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.0,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc.
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.0,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.0,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
new_df to csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13
10/23/2020,New Castle,DE,Gary,IN,Full,Flatbed,0.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
10/22/2020,Wilmington,DE,METHUEN,MA,Full,Flatbed / Step Deck,0.00,48,48,0,Ken,(903) 280-7878,UrTruckBroker
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,47,1,0,Dispatch,(912) 748-3801,DSV Road Inc.
10/23/2020,WILMINGTON,DE,METHUEN,MA,Full,Flatbed w/Tarps,0.00,48,1,0,Dispatch,(541) 826-4786,Sureway Transportation Co / Anderson Trucking Serv
10/30/2020,New Castle,DE,Gary,IN,Full,Flatbed,945.00,46,48,0,Dispatch,(800) 488-1860,Meadow Lark Agency
I guess the problem could be how you read the data. If I copy your sample data to Excel, split by comma, and then import it to pandas, all is fine. But if I split on commas AND whitespace, I get 9 additional columns. So you could try debugging by replacing all whitespace before creating your dataframe.
I also used your sample data and it worked just fine for me when I initialize it like this:
import pandas as pd
df_new = pd.DataFrame({'0': {0: '10/23/2020',
1: '10/22/2020',
2: '10/23/2020',
3: '10/23/2020',
4: '10/30/2020'},
'1': {0: 'New_Castle',
1: 'Wilmington',
2: 'WILMINGTON',
3: 'WILMINGTON',
4: 'New_Castle'},
'2': {0: 'DE', 1: 'DE', 2: 'DE', 3: 'DE', 4: 'DE'},
'3': {0: 'Gary', 1: 'METHUEN', 2: 'METHUEN', 3: 'METHUEN', 4: 'Gary'},
'4': {0: 'IN', 1: 'MA', 2: 'MA', 3: 'MA', 4: 'IN'},
'5': {0: 'Full', 1: 'Full', 2: 'Full', 3: 'Full', 4: 'Full'},
'6': {0: 'Flatbed',
1: 'Flatbed_/_Step_Deck',
2: 'Flatbed_w/Tarps',
3: 'Flatbed_w/Tarps',
4: 'Flatbed'},
'7': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 945.0},
'8': {0: 46, 1: 48, 2: 47, 3: 48, 4: 46},
'9': {0: 48, 1: 48, 2: 1, 3: 1, 4: 48},
'10': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'11': {0: 'Dispatch', 1: 'Ken', 2: 'Dispatch', 3: 'Dispatch', 4: 'Dispatch'},
'12': {0: '(800)_488-1860',
1: '(903)_280-7878',
2: '(912)_748-3801',
3: '(541)_826-4786',
4: '(800)_488-1860'},
'13': {0: 'Meadow_Lark_Agency_',
1: 'UrTruckBroker_',
2: 'DSV_Road_Inc._',
3: 'Sureway_Transportation_Co_/_Anderson_Trucking_Serv_',
4: 'Meadow_Lark_Agency_'}})
df_old = pd.DataFrame({'0': {0: '10/23/2020',
1: '10/22/2020',
2: '10/23/2020',
3: '10/23/2020',
4: '10/30/2020'},
'1': {0: 'New_Castle',
1: 'Wilmington',
2: 'WILMINGTON',
3: 'WILMINGTON',
4: 'New_Castle'},
'2': {0: 'DE', 1: 'DE', 2: 'DE', 3: 'DE', 4: 'DE'},
'3': {0: 'Gary', 1: 'METHUEN', 2: 'METHUEN', 3: 'METHUEN', 4: 'Gary'},
'4': {0: 'IN', 1: 'MA', 2: 'MA', 3: 'MA', 4: 'IN'},
'5': {0: 'Full', 1: 'Full', 2: 'Full', 3: 'Full', 4: 'Full'},
'6': {0: 'Flatbed',
1: 'Flatbed_/_Step_Deck',
2: 'Flatbed_w/Tarps',
3: 'Flatbed_w/Tarps',
4: 'Flatbed'},
'7': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 945.0},
'8': {0: 46, 1: 48, 2: 47, 3: 48, 4: 46},
'9': {0: 48, 1: 48, 2: 1, 3: 1, 4: 48},
'10': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'11': {0: 'Dispatch', 1: 'Ken', 2: 'Dispatch', 3: 'Dispatch', 4: 'Dispatch'},
'12': {0: '(800)_488-1860',
1: '(903)_280-7878',
2: '(912)_748-3801',
3: '(541)_826-4786',
4: '(800)_488-1860'},
'13': {0: 'Meadow_Lark_Agency_',
1: 'UrTruckBroker_',
2: 'DSV_Road_Inc._',
3: 'Sureway_Transportation_Co_/_Anderson_Trucking_Serv_',
4: 'Meadow_Lark_Agency_'}})
df_new.append(df_old, ignore_index=True)  # note: DataFrame.append was removed in pandas 2.0
# OR
pd.concat([df_new, df_old], ignore_index=True)
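If the real frames still gain columns after cleaning whitespace, another possible cause (an assumption, since the actual CSV isn't shown) is that pd.concat aligns on column labels, and the labels only look identical: read_csv returns string column names '0'..'13', while a frame built from raw lists gets integer labels 0..13. A minimal sketch of that failure mode and its fix:

```python
import pandas as pd

# a freshly built frame gets integer column labels 0..2
df_new = pd.DataFrame([[1, 2, 3]])

# a frame round-tripped through CSV comes back with string labels '0'..'2'
df_old = pd.DataFrame([[4, 5, 6]], columns=['0', '1', '2'])

# concat aligns on labels, so 0 and '0' become separate columns: 6 columns, half NaN
wide = pd.concat([df_new, df_old], ignore_index=True)

# normalizing the labels to one type restores a plain vertical stack of 3 columns
df_old.columns = df_old.columns.astype(int)
stacked = pd.concat([df_new, df_old], ignore_index=True)
```

The symptom matches the question exactly: each row fills only its own half of the widened frame, with the other half empty.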