Generate a new row in pandas if a column value is over a certain value - Python

I'm writing a function to get the HOR_df output below by feeding in 2 dataframes, planned_routes and actual_shippers.
The function should take the actual_pallets (from the actual_shippers df) and check against planned_routes whether the actual pallets exceed truck_max_capacity. If they do, it should generate a new row, like in the picture below.
Visualisation of the inputs and wanted output:
Note
In the above case: S1 had planned 10 pallets, but the new actual_pallets increased the load so that truck_max_capacity is too small to handle it. Therefore a new row is generated with the S2 ID and the 3 extra pallets that are needed.
In this case, HOR_df has made sure that on 1 December 2021 the actual_pallets for shipper S2 were split across routes of 10 and 3 pallets, instead of the 10 pallets that were in the initial planned_routes.
Potential idea of how it could be done
I'm not sure what the most efficient way to do this is; for instance, should I build something that iteratively goes through and "fills the routes up" with the new "actual_pallets" data?
# x = planned_routes
# y = actual_shippers
# z = cost of adhocs and cancellations
# w = truck eligibility
def optimal_trips(x, y, z, w):
    # Step 1: Take in the actual shippers' package and pallet data.
    # Step 2: Feed the actual data into the planned routes and add routes based on demand.
    # Step 3: Return a df with the new optimal routes.
    pass
Code for the dfs (to replicate)
Input 1:
planned_routes = pd.DataFrame({
    'date':['2021-12-01', '2021-12-02'],
    'planned_route_id':['R1', 'R2'],
    'S1_id':['S1', 'S1'],
    'S2_id':['S2', 'S2'],
    'S3_id':['NaN', 'NaN'],
    'S4_id':['NaN', 'NaN'],
    'S1_planned_packages':[110, 100],
    'S2_planned_packages':[120, 100],
    'S3_planned_packages':['NaN', 'NaN'],
    'S4_planned_packages':['NaN', 'NaN'],
    'total_planned_packages':[230, 200],
    'S1_planned_pallets':[11, 10],
    'S2_planned_pallets':[12, 10],
    'S3_planned_pallets':['NaN', 'NaN'],
    'S4_planned_pallets':['NaN', 'NaN'],
    'total_pallets':[23, 20],
    'truck_max_capacity':[24, 24],
    'cost_route':[120, 120]
})
Input 2:
actual_shippers = pd.DataFrame({
    'date':['2021-12-01', '2021-12-01', '2021-12-02', '2021-12-02'],
    'shipper_id':['S1', 'S2', 'S1', 'S2'],
    'actual_packages':[140, 130, 140, 130],
    'shipper_spp':[10, 10, 10, 10],
    'actual_pallets':[14, 13, 14, 13],
    'shipper_max_eligibility':[24, 24, 24, 24],
    'truck_max_capacity':[24, 24, 24, 24]
})
Wanted output:
HOR_df = pd.DataFrame({
    'date':['2021-12-01', '2021-12-01', '2021-12-02', '2021-12-02'],
    'planned_route_id':['R1', 'R3', 'R2', 'R4'],
    'S1_id':['S1', 'S2', 'S1', 'S2'],
    'S2_id':['S2', 'NaN', 'S2', 'NaN'],
    'S3_id':['NaN', 'NaN', 'NaN', 'NaN'],
    'S4_id':['NaN', 'NaN', 'NaN', 'NaN'],
    'S1_actual_packages':[140, 0, 140, 0],
    'S2_actual_packages':[100, 30, 100, 30],
    'S3_actual_packages':['NaN', 'NaN', 'NaN', 'NaN'],
    'S4_actual_packages':['NaN', 'NaN', 'NaN', 'NaN'],
    'total_planned_packages':[240, 30, 240, 30],  # sum(S1_actual_packages, S2_actual_packages, S3..., etc.)
    'S1_actual_pallets':[14, 3, 14, 3],
    'S2_actual_pallets':[10, 'NaN', 10, 'NaN'],
    'S3_actual_pallets':['NaN', 'NaN', 'NaN', 'NaN'],
    'S4_actual_pallets':['NaN', 'NaN', 'NaN', 'NaN'],
    'total_pallets':[24, 3, 24, 3],  # sum(S1_actual_pallets, S2..., etc.)
    'truck_max_capacity':[24, 24, 24, 24],
    'cost_route':[120, 130, 120, 130]
})
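Regarding the "potential idea" above: one way is indeed to greedily "fill the routes up". Below is a minimal sketch (not a full optimal_trips implementation) that, for each planned route's date, packs the shippers' actual_pallets into the planned truck and spills any overflow onto a new ad-hoc route. It uses the planned_routes and actual_shippers frames defined above; the ad-hoc route cost of 130 and the R3/R4 numbering are assumptions taken from the wanted output, and the package columns are left out for brevity.
import pandas as pd

def split_overflow_routes(planned_routes, actual_shippers, adhoc_cost=130):
    """Greedy sketch: fill each planned route up to truck_max_capacity with the
    shippers' actual pallets and push any remainder onto a new route."""
    rows = []
    next_route = len(planned_routes) + 1          # R3, R4, ... (assumed numbering)
    for _, route in planned_routes.iterrows():
        day = actual_shippers[actual_shippers['date'] == route['date']]
        capacity = route['truck_max_capacity']
        base = {'date': route['date'], 'planned_route_id': route['planned_route_id'],
                'truck_max_capacity': capacity, 'cost_route': route['cost_route']}
        extra = {'date': route['date'], 'planned_route_id': f'R{next_route}',
                 'truck_max_capacity': capacity, 'cost_route': adhoc_cost}
        used, spilled = 0, 0
        for slot, (_, shipper) in enumerate(day.iterrows(), start=1):
            pallets = shipper['actual_pallets']
            fits = min(pallets, capacity - used)  # what still fits on the planned truck
            base[f'S{slot}_id'] = shipper['shipper_id']
            base[f'S{slot}_actual_pallets'] = fits
            used += fits
            if pallets > fits:                    # overflow goes on the ad-hoc route
                spilled += 1
                extra[f'S{spilled}_id'] = shipper['shipper_id']
                extra[f'S{spilled}_actual_pallets'] = pallets - fits
        base['total_pallets'] = used
        rows.append(base)
        if spilled:
            extra['total_pallets'] = sum(v for k, v in extra.items()
                                         if k.endswith('_actual_pallets'))
            rows.append(extra)
            next_route += 1
    return pd.DataFrame(rows)

HOR_sketch = split_overflow_routes(planned_routes, actual_shippers)
For the sample inputs this produces, per day, one base route carrying 24 pallets plus one 3-pallet overflow route, matching the pallet split shown in HOR_df.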

Related

Why is this Groupby transform not working?

For a dummy dataset, in which each id corresponds to one match:
df2 = pd.DataFrame(columns=['id', 'score', 'duration', 'user'],
                   data=[[1, 800, 60, 'abc'], [1, 900, 60, 'zxc'],
                         [2, 800, 250, 'abc'], [2, 5000, 250, 'bvc'],
                         [3, 6000, 250, 'zxc'], [3, 8000, 250, 'klp'],
                         [4, 1400, 500, 'kod'], [4, 8000, 500, 'bvc']])
If I want to keep only the records where either one of the rows with the same id has a duration greater than 120 and a score greater than 1500, this works fine:
cond = df2['duration'].gt(120) & df2['score'].gt(1500)
out = df2[cond.groupby(df2['id']).transform('all')]
and returns 2 instances of the same id. However, if I want to keep only the pairs of ids where the user is 'abc', it does not work. I have tried:
out = df2[(df2['user'].eq('abc')).groupby(df2['id']).transform('all')]
out = df2[(df2['user'] == 'abc').groupby(df2['id']).transform('all')]
and they both return empty DataFrames. How can I solve this? The outcome should be any match that user 'abc' played in.
From the comments, you want 'any', not 'all':
out = df2[(df2['user'] == 'abc').groupby(df2['id']).transform('any')]
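For reference, a quick way to see the difference between the two transforms on the sample data above:
import pandas as pd

df2 = pd.DataFrame(columns=['id', 'score', 'duration', 'user'],
                   data=[[1, 800, 60, 'abc'], [1, 900, 60, 'zxc'],
                         [2, 800, 250, 'abc'], [2, 5000, 250, 'bvc'],
                         [3, 6000, 250, 'zxc'], [3, 8000, 250, 'klp'],
                         [4, 1400, 500, 'kod'], [4, 8000, 500, 'bvc']])

mask = df2['user'].eq('abc')
# 'all': True for an id only if every row of that id has user 'abc' -> never here, so empty
print(df2[mask.groupby(df2['id']).transform('all')])
# 'any': True for an id if at least one of its rows has user 'abc' -> keeps all rows of ids 1 and 2
print(df2[mask.groupby(df2['id']).transform('any')])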

Python Pandas exchange column value

While using a pandas.DataFrame, I want to swap whole column values, and I find that DF.loc[wInd, 'Column'] and DF.loc[:, 'Column'] behave differently: the first case exchanges the values, but the second case leaves both columns with the same values. Why are they different? Thank you.
wInd = LineCircPart_df.index
for cWd in ['X', 'Y', 'Angle']:
    (LineCircPart_df.loc[wInd, f'Start{cWd}'], LineCircPart_df.loc[wInd, f'End{cWd}']) = (
        LineCircPart_df.loc[wInd, f'End{cWd}'], LineCircPart_df.loc[wInd, f'Start{cWd}'])
and I need to add .copy() to the values being assigned, like:
wInd = LineCircPart_df.index
for cWd in ['X', 'Y', 'Angle']:
    (LineCircPart_df.loc[:, f'Start{cWd}'], LineCircPart_df.loc[:, f'End{cWd}']) = (
        LineCircPart_df.loc[:, f'End{cWd}'].copy(), LineCircPart_df.loc[:, f'Start{cWd}'].copy())
Any Suggestions?
Example updated as follows:
LineCircPart_df = pd.DataFrame({'StartX': [3000, 4000, 5000], 'StartY': [30, 40, 50],
                                'StartAngle': [3, 4, 5], 'EndX': [6000, 7000, 8000],
                                'EndY': [60, 70, 80], 'EndAngle': [6, 7, 8]})
for cWd in ['X', 'Y', 'Angle']:
    (LineCircPart_df.loc[:, f'Start{cWd}'], LineCircPart_df.loc[:, f'End{cWd}']) = (
        LineCircPart_df.loc[:, f'End{cWd}'], LineCircPart_df.loc[:, f'Start{cWd}'])
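A likely explanation (not verified against your exact pandas version): with a full-column slice like .loc[:, col], the right-hand side Series can be views of the same underlying data, so the first assignment of the tuple unpacking overwrites the values the second assignment is about to read; selecting with a concrete index, or calling .copy(), breaks that link, which is why those variants swap correctly. A version-independent sketch that swaps the columns by materialising the right-hand side as a plain array first, using your example frame:
import pandas as pd

LineCircPart_df = pd.DataFrame({'StartX': [3000, 4000, 5000], 'StartY': [30, 40, 50],
                                'StartAngle': [3, 4, 5], 'EndX': [6000, 7000, 8000],
                                'EndY': [60, 70, 80], 'EndAngle': [6, 7, 8]})

for cWd in ['X', 'Y', 'Angle']:
    start, end = f'Start{cWd}', f'End{cWd}'
    # to_numpy() materialises the values before assignment, so neither column
    # of the swap can observe the other column's write
    LineCircPart_df[[start, end]] = LineCircPart_df[[end, start]].to_numpy()

print(LineCircPart_df)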

Rolling sum for a window of 2 days

I am trying to compute a rolling 2-day sum of the Amount column, based on Trans_Date and grouped by ID, for the table below using Python.
ID  Trans_Date  Trans_Time  Amount
1   03/23/2019  06:51:03      100
1   03/24/2019  12:32:48      600
1   03/24/2019  14:15:35       50
1   06/05/2019  16:18:21       75
2   02/01/2019  18:02:52      200
2   02/02/2019  10:03:02      150
2   02/03/2019  23:47:51      800
3   01/18/2019  11:12:58     1000
3   01/23/2019  22:12:41       15
Ultimately, I am trying to achieve the result below:
ID  Trans_Date  Trans_Time  Amount  2d_Running_Total
1   03/23/2019  06:51:03      100               100
1   03/24/2019  12:32:48      600               700
1   03/24/2019  14:15:35      250               950
1   06/05/2019  16:18:21       75                75
2   02/01/2019  18:02:52      200               200
2   02/02/2019  10:03:02      150               350
2   02/03/2019  23:47:51      800               950
3   01/18/2019  11:12:58     1000              1000
3   01/23/2019  22:12:41       15                15
This link was very close to solving it, but the issue is that for records with multiple transactions on the same day, it provides the same value for each of them.
https://python-forum.io/Thread-Rolling-sum-for-a-window-of-2-days-Pandas
This should do it:
import pandas as pd

# create dummy data
df = pd.DataFrame(
    columns=['ID', 'Trans_Date', 'Trans_Time', 'Amount'],
    data=[
        [1, '03/23/2019', '06:51:03', 100],
        [1, '03/24/2019', '12:32:48', 600],
        [1, '03/24/2019', '14:15:35', 250],
        [1, '06/05/2019', '16:18:21', 75],
        [2, '02/01/2019', '18:02:52', 200],
        [2, '02/02/2019', '10:03:02', 150],
        [2, '02/03/2019', '23:47:51', 800],
        [3, '01/18/2019', '11:12:58', 1000],
        [3, '01/23/2019', '22:12:41', 15]
    ]
)
df_out = pd.DataFrame(
    columns=['ID', 'Trans_Date', 'Trans_Time', 'Amount', '2d_Running_Total'],
    data=[
        [1, '03/23/2019', '06:51:03', 100, 100],
        [1, '03/24/2019', '12:32:48', 600, 700],
        [1, '03/24/2019', '14:15:35', 250, 950],
        [1, '06/05/2019', '16:18:21', 75, 75],
        [2, '02/01/2019', '18:02:52', 200, 200],
        [2, '02/02/2019', '10:03:02', 150, 350],
        [2, '02/03/2019', '23:47:51', 800, 950],
        [3, '01/18/2019', '11:12:58', 1000, 1000]
    ]
)
# convert into a datetime object and set as index
df['Trans_DateTime'] = pd.to_datetime(df['Trans_Date'] + ' ' + df['Trans_Time'])
df = df.set_index('Trans_DateTime')
# group by ID and apply a rolling window to the Amount column
df['2d_Running_Total'] = df.groupby('ID')['Amount'].rolling('2d').sum().values.astype(int)
df.reset_index(drop=True)
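Because the rolling window is applied on the full timestamp (date plus time), two transactions on the same day get their own cumulative values rather than one repeated daily total, which addresses the issue with the linked thread. One caveat worth noting (an assumption about the input, not something stated in the question): rolling('2d') requires the datetime index to be increasing within each ID, so if the raw data may be unordered it is safer to sort first, for example:
df = (df.reset_index()
        .sort_values(['ID', 'Trans_DateTime'])
        .set_index('Trans_DateTime'))
df['2d_Running_Total'] = (df.groupby('ID')['Amount']
                            .rolling('2d').sum()
                            .values.astype(int))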

How to iteratively build a panel data set of past events with data measured at the time of the event?

I am trying to build a model to predict the winner of professional tennis matches. I have scraped data on players and past matches. However, the data I have represents the players' current information, and I would like to know each player's traits at the time of the match.
For example, height isn't subject to change through time, but I would like to know their age at the time of the match, and their record at the time of the match (rather than their current age and record).
I have tried to set up a simple illustrated example below:
import pandas as pd
d = {'winning_player': ['A', 'B', 'C', 'A'],
     'losing_player': ['W', 'X', 'Y', 'Z'],
     'height_winner': [74, 76, 75, 74],
     'height_loser': [69, 69, 70, 78],
     'age_winner': [28, 29, 37, 28],
     'age_loser': [34, 34, 23, 29],
     'date_of_match': ['01-02-2003', '03-10-2005', '20-12-2012', '03-03-2015'],
     'A_player_wins': [0, 0, 0, 1],
     'B_player_wins': [0, 0, 0, 0],
     'A_player_losses': [0, 0, 0, 0],
     'B_player_losses': [0, 0, 0, 0],
     'A_player_birth_year': [1990, 1989, 1981, 1990],
     'B_player_birth_year': [1984, 1984, 1995, 1989],
     'A_age_at_match': [13, 16, 31, 25],
     'B_age_at_match': [19, 21, 17, 26]}
df = pd.DataFrame(data=d)
print(df)
What I would like to do is generate new columns for the records of the winning and losing players at the time of the match. I would ideally do this by tallying which players won and lost previously in the dataset. Additionally, for age, I have the current age of each player, but I would like to know how old they were at the time of the match.
In this example, 'winning_player':'A' is the only individual who appears twice. How can I go about creating the relevant data and adding it to the dataframe? Is this something that using classes would solve? I am very new to Python and come from a background in R, so I am not very familiar with OOP.
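One way to tackle the running record, sketched below under simplifying assumptions: only the winning_player, losing_player and date_of_match columns are used, and the split_overflow-style helper name prior_count is hypothetical. The idea is to sort by date and, for each row, count how many earlier rows a player appears in as winner or loser; age at match can be derived similarly as the match year minus the player's birth year.
import pandas as pd

d = {'winning_player': ['A', 'B', 'C', 'A'],
     'losing_player': ['W', 'X', 'Y', 'Z'],
     'date_of_match': ['01-02-2003', '03-10-2005', '20-12-2012', '03-03-2015']}
df = pd.DataFrame(d)
df['date_of_match'] = pd.to_datetime(df['date_of_match'], format='%d-%m-%Y')
df = df.sort_values('date_of_match').reset_index(drop=True)

# Prior wins/losses for the "same side": cumcount() numbers a player's
# appearances 0, 1, 2, ... in date order, i.e. how many came before this match.
df['winner_prior_wins'] = df.groupby('winning_player').cumcount()
df['loser_prior_losses'] = df.groupby('losing_player').cumcount()

# Cross counts (winner's prior losses, loser's prior wins) need a lookup over
# the other column; a simple, readable way is a per-row scan of earlier matches.
def prior_count(row, result_col, player):
    earlier = df['date_of_match'] < row['date_of_match']
    return (earlier & (df[result_col] == player)).sum()

df['winner_prior_losses'] = df.apply(
    lambda r: prior_count(r, 'losing_player', r['winning_player']), axis=1)
df['loser_prior_wins'] = df.apply(
    lambda r: prior_count(r, 'winning_player', r['losing_player']), axis=1)
print(df)
For the sample data, player A's second match (2015) correctly shows one prior win and no prior losses, so no per-player classes are needed; plain groupby bookkeeping is enough.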

Why does converting np.nan to int result in a huge number?

I have a numpy array like this below:
array([['18.0', '11.0', '5.0', ..., '19.0', '18.0', '20.0'],
['11.0', '14.0', '15.0', ..., '45.0', '26.0', '20.0'],
['1.0', '0.0', '1.0', ..., '3.0', '4.0', '17.0'],
...,
['nan', 'nan', 'nan', ..., 'nan', 'nan', 'nan'],
['nan', 'nan', 'nan', ..., 'nan', 'nan', 'nan'],
['nan', 'nan', 'nan', ..., 'nan', 'nan', 'nan']],
dtype='|S230')
But converting it to an int array turns the np.nan values into weird values:
df[:,4:].astype('float').astype('int')
array([[ 18, 11, 5,
..., 19, 18,
20],
[ 11, 14, 15,
..., 45, 26,
20],
[ 1, 0, 1,
..., 3, 4,
17],
...,
[-9223372036854775808, -9223372036854775808, -9223372036854775808,
..., -9223372036854775808, -9223372036854775808,
-9223372036854775808],
[-9223372036854775808, -9223372036854775808, -9223372036854775808,
..., -9223372036854775808, -9223372036854775808,
-9223372036854775808],
[-9223372036854775808, -9223372036854775808, -9223372036854775808,
..., -9223372036854775808, -9223372036854775808,
-9223372036854775808]])
So how do I fix this?
Converting a floating-point NaN to an integer type is undefined behavior, as far as I know. The number:
-9223372036854775808
is the smallest int64, i.e. -2**63. Note that the same thing happens on my system when I coerce to int32:
>>> arr
array([['18.0', '11.0', '5.0', 'nan']],
dtype='<U4')
>>> arr.astype('float').astype(np.int32)
array([[ 18, 11, 5, -2147483648]], dtype=int32)
>>> -2**31
-2147483648
It all depends on what you expect the result to be. nan is a float, so converting the string 'nan' into float is no problem, but there is no defined conversion from NaN to int values.
I suggest you handle it differently: first choose which specific int you want all the nan values to become (for example 0), and only then convert the whole array to int.
a = np.array(['1','2','3','nan','nan'])
a[a=='nan'] = 0 # this will convert all the nan values to 0, or choose another number
a = a.astype('int')
Now a is equal to
array([1, 2, 3, 0, 0])
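If the array is 2-D (like the one in the question), where an elementwise string comparison is less convenient, an alternative along the same lines is to replace the NaNs after the float conversion and only then cast; the 0 fill value here is just one possible choice:
import numpy as np

arr = np.array([['18.0', '11.0', '5.0', 'nan'],
                ['nan', '0.0', '3.0', '17.0']])
floats = arr.astype(float)
floats[np.isnan(floats)] = 0      # pick whatever fill value suits your data
ints = floats.astype(int)         # now a safe int cast, no -9223372036854775808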