I am trying to combine many pairs of rows in a single run of the code. As my example shows, two rows can be combined under these rules:
The values in the PT, DS and SC columns must be the same.
The time stamps in FS must be the closest pair.
The ID column (string) is combined as ID1,ID2.
The WT and CB columns (numeric) are combined with sum().
The FS column is combined by taking the latest time.
My example is,
df0 = pd.DataFrame({'ID':['1001','1002','1003','1004','2001','2002','2003','2004','3001','3002','3003','3004','4001','4002','4003','4004','5001','5002','5003','5004','6001'],
'PT':['B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','D','D','D','D','F'],
'DS':['AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAA','AAA','AAA','AAB','AAB'],
'SC':['P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P2','P2'],
'FS':['2020-10-16 00:00:00','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:04','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:05','2020-10-16 00:00:07','2020-10-16 00:00:01','2020-10-16 00:00:10','2020-10-16 00:10:00','2020-10-16 00:10:40','2020-10-16 00:00:00','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[1,2,3,4,10,11,12,13,20,21,22,23,30,31,32,33,40,41,42,43,53],
'CB':[0.1,0.2,0.3,0.4,1,1.1,1.2,1.3,2,2.1,2.2,2.3,3,3.1,3.2,3.3,4,4.1,4.2,4.3,5.3]})
After running the code once, the new dataframe df1 should be:
df1 = pd.DataFrame({'ID':['1001,1002','1003,1004','2001,2002','2003,2004','3001,3002','3003,3004','4001,4002','4003,4004','5001,5002','5003','5004','6001'],
'PT':['B','B','B','B','B','B','B','B','D','D','D','F'],
'DS':['AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAA','AAA','AAB','AAB'],
'SC':['P1','P1','P2','P2','P1','P1','P2','P2','P1','P1','P2','P2'],
'FS':['2020-10-16 00:00:02','2020-10-16 00:00:04','2020-10-16 00:00:01','2020-10-16 00:00:03','2020-10-16 00:00:01','2020-10-16 00:00:07','2020-10-16 00:00:10','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[3,7,21,25,41,45,61,65,81,42,43,53],
'CB':[0.3,0.7,2.1,2.5,4.1,4.5,6.1,6.5,8.1,4.2,4.3,5.3]})
After running the code again on df1, the new dataframe df2 should be:
df2 = pd.DataFrame({'ID':['1001,1002,1003,1004','2001,2002,2003,2004','3001,3002,3003,3004','4001,4002,4003,4004','5001,5002,5003','5004','6001'],
'PT':['B','B','B','B','D','D','F'],
'DS':['AAA','AAA','AAB','AAB','AAA','AAB','AAB'],
'SC':['P1','P2','P1','P2','P1','P2','P2'],
'FS':['2020-10-16 00:00:04','2020-10-16 00:00:03','2020-10-16 00:00:07','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[10,46,86,126,123,43,53],
'CB':[1,4.6,8.6,12.6,12.3,4.3,5.3]})
No more combinations can be made on df2, because no pair of rows satisfies the rules.
The reason for doing this is that I have a memory limit and need to reduce the size of the data without losing information. So I am trying to bundle IDs that share the same features and occur close to each other in time. I plan to run the code repeatedly until the memory issue goes away or no more combinations are possible.
This is a good place to use GroupBy operations.
My source was Wes McKinney's Python for Data Analysis.
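# Build the combined ID string once per (PT, DS, SC) group; note this overwrites df0['ID']
df0['ID'] = df0.groupby(['PT', 'DS', 'SC'])['ID'].transform(lambda x: ','.join(x))
# Latest FS per group (select FS explicitly so WT and CB are left out)
max_times = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index=False)['FS'].max()
# Sum only the numeric columns; leaving FS in would also try to sum the timestamp strings
sums_WT_CB = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index=False)[['WT', 'CB']].sum()
df2 = pd.merge(max_times, sums_WT_CB, on=['ID', 'PT', 'DS', 'SC'])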
This code just takes the most recent time for each unique grouping of the columns you specified. If there are other requirements for the FS column, you will have to modify this.
Code to concatenate the IDs came from:
Concatenate strings from several rows using Pandas groupby
Perhaps there's something more straightforward (please comment if so :)
but the following seems to work:
def combine(data):
    # Collapse one (PT, DS, SC) group into a single row
    return pd.DataFrame(
        {
            "ID": ",".join(map(str, data["ID"])),
            "PT": data["PT"].iloc[0],
            "DS": data["DS"].iloc[0],
            "SC": data["SC"].iloc[0],
            "WT": data["WT"].sum(),
            "CB": data["CB"].sum(),
            "FS": data["FS"].max(),
        },
        index=[0],
    ).reset_index(drop=True)

# df0 is the example dataframe from the question
df_agg = (
    df0.sort_values(["PT", "DS", "SC", "FS"])
    .groupby(["PT", "DS", "SC"])
    .apply(combine)
    .reset_index(drop=True)
)
returns
ID PT DS SC WT CB FS
0 1001,1002,1003,1004 B AAA P1 10 1.0 2020-10-16 00:00:04
1 2001,2002,2003,2004 B AAA P2 46 4.6 2020-10-16 00:00:03
2 3001,3002,3003,3004 B AAB P1 86 8.6 2020-10-16 00:00:07
3 4001,4002,4003,4004 B AAB P2 126 12.6 2020-10-16 00:10:40
4 5001,5003,5002 D AAA P1 123 12.3 2020-10-16 00:10:00
5 5004 D AAB P2 43 4.3 2020-10-16 00:00:10
6 6001 F AAB P2 53 5.3 2020-10-16 00:00:05
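Both answers above collapse each (PT, DS, SC) group in one step, whereas the question asks for one round of pairwise merging per run. Here is a minimal sketch of a single pass, assuming that "closest pair" can be approximated by sorting each group on FS and merging neighbouring rows two at a time. The names combine_pairs_once, combine_until_stable, KEYS and demo are made up for illustration. Note that this pairing will not always reproduce the expected df1 from the question: in the D/AAA/P1 group it would pair 5001 with 5003 (40 seconds apart) rather than 5001 with 5002.

```python
import pandas as pd

KEYS = ["PT", "DS", "SC"]  # columns that must match before rows may be combined


def combine_pairs_once(df):
    """One pass: inside each KEYS group, merge time-adjacent rows two at a time."""
    out = df.copy()
    out["FS"] = pd.to_datetime(out["FS"])
    out = out.sort_values(KEYS + ["FS"])
    # cumcount() numbers the rows inside each group 0, 1, 2, ...;
    # integer division by 2 puts rows 0-1 in pair 0, rows 2-3 in pair 1, and so on.
    out["pair"] = out.groupby(KEYS).cumcount() // 2
    return (
        out.groupby(KEYS + ["pair"], as_index=False)
        .agg(
            ID=("ID", lambda s: ",".join(s)),  # ID1,ID2
            FS=("FS", "max"),                  # latest time stamp
            WT=("WT", "sum"),
            CB=("CB", "sum"),
        )
        .drop(columns="pair")
    )


def combine_until_stable(df):
    """Repeat the pass until the row count stops shrinking."""
    current = df
    while True:
        nxt = combine_pairs_once(current)
        if len(nxt) == len(current):
            return nxt
        current = nxt


# Tiny hypothetical input in the same shape as df0
demo = pd.DataFrame({
    "ID": ["1001", "1002", "1003", "5001"],
    "PT": ["B", "B", "B", "D"],
    "DS": ["AAA", "AAA", "AAA", "AAA"],
    "SC": ["P1", "P1", "P1", "P1"],
    "FS": ["2020-10-16 00:00:00", "2020-10-16 00:00:02",
           "2020-10-16 00:00:03", "2020-10-16 00:00:00"],
    "WT": [1, 2, 3, 40],
    "CB": [0.1, 0.2, 0.3, 4.0],
})
once = combine_pairs_once(demo)     # one round of pairing
final = combine_until_stable(demo)  # repeat until nothing more merges
```

In practice you would probably call combine_pairs_once in your own loop and stop as soon as the data fits in memory, rather than always running to the end.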
I have two data frames, df1 and df2. Their first columns hold the same key: SKUCode in df1 corresponds to SKU in df2.
df1:
df2:
I want to update df1 and set SKUStatus=0 wherever SKUCode matches an SKU in df2.
I want to add a new row to df1 for every SKU in df2 that has no matching SKUCode.
So after the operation df1 looks like the following:
One way I could do this is with df2.iterrows(), looping through the values, but I think there must be a neater way?
Thank you
import pandas as pdx
df1=pdx.DataFrame({'SKUCode':['A','B','C','D'],'ListPrice':[1798,2997,1798,999],'SalePrice':[1798,2997,1798,999],'SKUStatus':[1,1,1,0],'CostPrice':[500,773,525,300]})
df2=pdx.DataFrame({'SKUCode':['X','Y','B'],'Status':[0,0,0],'e_date':['31-05-2020','01-06-2020','01-06-2020']})
df1.merge(df2, on='SKUCode')
Try this: an outer merge keeps both the matching and the non-matching records.
In [75]: df_m = df1.merge(df2, on="SKUCode", how='outer')
In [76]: mask = df_m['Status'].isnull()
In [77]: df_m.loc[~mask, 'SKUStatus'] = df_m.loc[~mask, 'Status']
In [78]: df_m[['SKUCode', "ListPrice", "SalePrice", "SKUStatus", "CostPrice"]].fillna(0.0)
output
SKUCode ListPrice SalePrice SKUStatus CostPrice
0 A 1798.0 1798.0 1.0 500.0
1 B 2997.0 2997.0 0.0 773.0
2 C 1798.0 1798.0 1.0 525.0
3 D 999.0 999.0 0.0 300.0
4 X 0.0 0.0 0.0 0.0
5 Y 0.0 0.0 0.0 0.0
I'm not sure I understood you correctly, but I think you can use .loc with a boolean mask, something along the lines of:
df1.loc[df1['SKUCode'].isin(df2['SKU']), 'SKUStatus'] = 0
Have a look at the DataFrame.merge method [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html].
First rename one of the key columns so that both frames use the same name (e.g. rename SKU to SKUCode). Then try:
df1.merge(df2, on='SKUCode')
If you provide input data (not screenshots), I can try with the appropriate parameters.
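The two steps from the question (flag the matching SKUs, then append the unmatched ones) can also be sketched without a merge, using isin and pd.concat. The data below is made up in the same shape as the question describes, with the key named SKU in df2; new_skus, new_rows and result are illustrative names. Columns that df2 does not supply are left as NaN on the appended rows.

```python
import pandas as pd

# Hypothetical data in the shape described in the question
df1 = pd.DataFrame({
    "SKUCode": ["A", "B", "C", "D"],
    "ListPrice": [1798, 2997, 1798, 999],
    "SKUStatus": [1, 1, 1, 0],
})
df2 = pd.DataFrame({"SKU": ["X", "Y", "B"], "Status": [0, 0, 0]})

# Step 1: set SKUStatus to 0 wherever df1's SKUCode also appears in df2
df1.loc[df1["SKUCode"].isin(df2["SKU"]), "SKUStatus"] = 0

# Step 2: collect the SKUs that exist only in df2 and append them as new rows
new_skus = df2.loc[~df2["SKU"].isin(df1["SKUCode"]), "SKU"]
new_rows = pd.DataFrame({"SKUCode": new_skus, "SKUStatus": 0})
result = pd.concat([df1, new_rows], ignore_index=True)
```

Compared with the outer merge, this keeps the update and the append as two separate, readable steps, at the cost of scanning the key columns twice.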
Below is some dummy data that reflects the data I am working with.
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
# Dummy data that represents a percent change
datelist = pd.date_range(start='1983-01-01', end='1994-01-01', freq='Y')
df1 = pd.DataFrame({"P Change_1": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,)),
"P Change_2": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,))})
#This dataframe contains the rows we want to operate on
df2 = pd.DataFrame({
'Loc1': [None, None, None, None, None, None, None, None, None, None, 2.5415],
'Loc2': [None, None, None, None, None, None, None, None, None, None, 3.2126],})
#Set the datetime index
df1 = df1.set_index(datelist)
df2 = df2.set_index(datelist)
df1:
P Change_1 P Change_2
1984-12-31 -0.172080 -0.231574
1985-12-31 -0.328773 -0.247018
1986-12-31 -0.160834 -0.099079
1987-12-31 -0.457924 0.000266
1988-12-31 0.017374 -0.501916
1989-12-31 -0.349052 -0.438816
1990-12-31 0.034711 0.036164
1991-12-31 -0.415445 -0.415372
1992-12-31 -0.206852 -0.413107
1993-12-31 -0.313341 -0.181030
1994-12-31 -0.474234 -0.118058
df2:
Loc1 Loc2
1984-12-31 NaN NaN
1985-12-31 NaN NaN
1986-12-31 NaN NaN
1987-12-31 NaN NaN
1988-12-31 NaN NaN
1989-12-31 NaN NaN
1990-12-31 NaN NaN
1991-12-31 NaN NaN
1992-12-31 NaN NaN
1993-12-31 NaN NaN
1994-12-31 2.5415 3.2126
DataFrame details:
First off, Loc1 will correspond with P Change_1 and Loc2 corresponds to P Change_2, etc. Looking at Loc1 first, I want to either fill up the DataFrame containing Loc1 and Loc2 with the relevant values or compute a new dataframe that has columns Calc1 and Calc2.
The calculation:
I want to start with the 1994 value of Loc1 and calculate a new value for 1993 by taking Loc1 1993 = Loc1 1994 + (Loc1 1994 * P Change_1 1993). With the values filled in it would be 2.5415 +(-0.313341 * 2.5415) which equals about 1.74514.
This 1.74514 value will replace the NaN value in 1993, and then I want to use that calculated value to get a value for 1992. This means we now compute Loc1 1992 = Loc1 1993 + (Loc1 1993 * P Change_1 1992). I want to carry out this operation row-wise until it gets the earliest value in the timeseries.
What is the best way to go about implementing this row-wise equation? I hope this makes some sense and any help is greatly appreciated!
df = pd.merge(df1, df2, how='inner', right_index=True, left_index=True) # merging dataframes on date index
df['count'] = range(len(df)) # creating a column, count for easy operation
# divides dataframe in two parts, one part above the not NaN row and one below
# (.copy() so the in-place operations below do not act on slices of df)
da1 = df[df['count']<=df.dropna().iloc[0]['count']].copy()
da2 = df[df['count']>=df.dropna().iloc[0]['count']].copy()
da1.sort_values(by=['count'],ascending=False, inplace=True)
g=[da1,da2]
num_col=len(df1.columns)
for w in range(len(g)):
    count = 0
    list_of_col=[list() for i in range(len(g[w]))]
    for item, rows in g[w].iterrows():
        n=[]
        if count==0:
            # first row of each part holds the known Loc values
            for p in range(1,num_col+1):
                n.append(rows[f'Loc{p}'])
        else:
            # new value = previous value + previous value * percent change
            for p in range(1,num_col+1):
                n.append(list_of_col[count-1][p-1]+ list_of_col[count-1][p-1]* rows[f'P Change_{p}'])
        list_of_col[count].extend(n)
        count+=1
    # transpose the row-wise results into one list per column
    tmp=[list() for i in range(num_col)]
    for d_ in range(num_col):
        for x_ in range(len(list_of_col)):
            tmp[d_].append(list_of_col[x_][d_])
    z1=[]
    z1.extend(tmp)
    for i in range(num_col):
        g[w][f'Loc{i+1}']=z1[i]
da1.sort_values(by=['count'] ,inplace=True)
final_df = pd.concat([da1, da2[1:]])
calc_df = pd.DataFrame()
for i in range(num_col):
    calc_df[f'Calc{i+1}']=final_df[f'Loc{i+1}']
print(calc_df)
I have tried to explain the less obvious steps in the comments. I have edited my code so that the initial dataframes remain unaffected.
[Edited]: I have edited the code to handle any number of columns in the given dataframes.
[Edited]: If the column names in df1 and df2 are arbitrary, please run this block before running the code above. It renames the columns using list comprehensions:
df1.columns = [f'P Change_{i+1}' for i in range(len(df1.columns))]
df2.columns = [f'Loc{i+1}' for i in range(len(df2.columns))]
[EDITED] Perhaps there are better/more elegant ways to do this, but this worked fine for me:
def fill_values(df1, df2, cols1=None, cols2=None):
    if cols1 is None:
        cols1 = df1.columns
    if cols2 is None:
        cols2 = df2.columns
    # walk backwards from the second-to-last row to the first
    for i in reversed(range(df2.shape[0]-1)):
        for col1, col2 in zip(cols1, cols2):
            if np.isnan(df2[col2].iloc[i]):
                val = df2[col2].iloc[i+1] + df2[col2].iloc[i+1] * df1[col1].iloc[i]
                # assign through a single .iloc call to avoid chained assignment
                df2.iloc[i, df2.columns.get_loc(col2)] = val
    return df1, df2
df1, df2 = fill_values(df1, df2)
print(df2)
Loc1 Loc2
1983-12-31 0.140160 0.136329
1984-12-31 0.169291 0.177413
1985-12-31 0.252212 0.235614
1986-12-31 0.300550 0.261526
1987-12-31 0.554444 0.261457
1988-12-31 0.544976 0.524925
1989-12-31 0.837202 0.935388
1990-12-31 0.809117 0.902741
1991-12-31 1.384158 1.544128
1992-12-31 1.745144 2.631024
1993-12-31 2.541500 3.212600
This assumes that the rows in df1 and df2 correspond exactly (I'm matching by position, not by index label). Hope it helps!
Just to be clear, what you need is Loc1[year]=Loc1[next_year] + PChange[year]*Loc1[next_year], right?
The loop below will do what you are looking for, but it assumes that both dataframes have the same number of rows in the same order (it matches by position rather than by index value). From your description, I think this works for your data.
for i in range(df2.shape[0]-2, -1, -1):
    df2.iloc[i, df2.columns.get_loc('Loc1')] = df2['Loc1'].iloc[i+1] + (df1['P Change_1'].iloc[i] * df2['Loc1'].iloc[i+1])
Hope this helps :)
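Since each step is Loc[year] = Loc[next_year] * (1 + PChange[year]), every earlier value is just the last known value times a running product of (1 + PChange) taken from the bottom up. Here is a rough sketch of that idea with a reversed cumulative product instead of a Python loop. It assumes that only the last row of the Loc frame is known, that every earlier row should be overwritten, and that the columns of the two frames line up by position. backfill_by_growth and the small frames are made up for illustration.

```python
import numpy as np
import pandas as pd

years = [1990, 1991, 1992, 1993]  # hypothetical index, oldest first
pchange = pd.DataFrame({"P Change_1": [0.10, -0.50, -0.313341, 0.0]}, index=years)
loc = pd.DataFrame({"Loc1": [np.nan, np.nan, np.nan, 2.5415]}, index=years)


def backfill_by_growth(pchange, loc):
    """Fill every row above the last one from the last row's known values."""
    growth = 1.0 + pchange.to_numpy(dtype=float)
    # The last row is the known starting point, so its own change is not applied.
    growth[-1, :] = 1.0
    # Reverse, take the cumulative product, reverse back:
    # row i then holds the product of growth[i:] for each column.
    factors = np.cumprod(growth[::-1], axis=0)[::-1]
    last = loc.iloc[-1].to_numpy(dtype=float)
    return pd.DataFrame(factors * last, index=loc.index, columns=loc.columns)


filled = backfill_by_growth(pchange, loc)
print(filled)
```

With the 1993 value of 2.5415 and a 1992 change of -0.313341, the 1992 entry comes out at about 1.74514, which agrees with the hand calculation in the question.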