I'm relatively new to Pandas and have already tried searching, but I couldn't find a solution.
I have a dataframe with Transaction-No., customerID and the date of purchase, which looks like this:
Transaction 12345 12346 12347 12348 12349
customerID
1 NaN 2019-09-01 NaN 2019-09-11 2019-09-22...
2 2019-10-01 NaN NaN NaN 2019-10-07...
3 ...
The dataframe has [6334 rows x 8557 columns].
Every row contains NaN values, since each Transaction-No. is unique.
I would like to calculate the date difference for each row so I get
customerID Datedifference1 Datedifference2 etc.
1 10 11
2 6
3 ...
I'm struggling to get a list with the date differences for every customerID.
Is there a way to ignore NaN in the dataframe and to only calculate on the values that are not NaN?
I would like to have a list with customerID and the date difference between purchase 1 and 2, 2 and 3, etc., to estimate the days until the next purchase will occur.
Is there a solution for that?
The idea is to reshape the data with DataFrame.stack, get the differences per group, remove the leading missing value of each group and reshape back:
df = df.apply(pd.to_datetime)
df1 = (df.stack()
.groupby(level=0)
.diff()
.dropna()
.dt.days
.reset_index(level=1, drop=True)
.to_frame())
df1 = (df1.set_index(df1.groupby(['customerID']).cumcount(), append=True)[0]
.unstack()
.add_prefix('Datedifference'))
print (df1)
Datedifference0 Datedifference1
customerID
1 10.0 11.0
2 6.0 NaN
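For reference, a small wide-format frame like the sample above can be rebuilt to try this out (a sketch; the values are copied from the question's excerpt, and the same code would work on the full 6334 x 8557 frame):
import pandas as pd
import numpy as np

# toy wide frame: one row per customerID, one column per Transaction-No.
df = pd.DataFrame(
    {12345: [np.nan, '2019-10-01'],
     12346: ['2019-09-01', np.nan],
     12347: [np.nan, np.nan],
     12348: ['2019-09-11', np.nan],
     12349: ['2019-09-22', '2019-10-07']},
    index=pd.Index([1, 2], name='customerID'))
df.columns.name = 'Transaction'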
EDIT: If the input data are in long format instead, the solution changes: convert the date column to datetimes, create a new column of differences with DataFrameGroupBy.diff, remove the NaN rows with DataFrame.dropna, and finally reshape with DataFrame.set_index and unstack, using a counter Series from GroupBy.cumcount:
print (df1)
customerID Transaction date
0 1 12346 2019-09-01
1 1 12348 2019-09-11
2 1 12349 2019-09-22
3 2 12345 2019-10-01
4 2 12349 2019-10-07
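For reference, this long-format frame can be recreated from the values shown above (just to make the snippet below reproducible):
import pandas as pd

df1 = pd.DataFrame({'customerID': [1, 1, 1, 2, 2],
                    'Transaction': [12346, 12348, 12349, 12345, 12349],
                    'date': ['2019-09-01', '2019-09-11', '2019-09-22',
                             '2019-10-01', '2019-10-07']})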
df1['date'] = pd.to_datetime(df1['date'])
df1['diff'] = df1.groupby('customerID')['date'].diff().dt.days
df1 = df1.dropna(subset=['diff'])
df2 = (df1.set_index(['customerID', df1.groupby('customerID').cumcount()])['diff']
.unstack()
.add_prefix('Datedifference'))
print (df2)
Datedifference0 Datedifference1
customerID
1 10.0 11.0
2 6.0 NaN
I am getting a KeyError: 'Cust_id_2' when I try to merge the following dataframes.
df =
Cust_id year is_sub
0 4 1516 is_sub
1 4 1920 is_sub
2 4 1819 is_sub
3 4 1718 is_sub
4 4 1617 is_sub
df2 =
Cust_id_2 year_freq_score
0 4 9.0
1 5 6.0
2 7 10.0
3 8 2.0
4 10 1.0
Most recently I have tried this:
result = pd.merge(
df,
df2[['year_freq_score']],
how='left',
left_on='Cust_id',
right_on='Cust_id_2'
)
df has 14,000 rows. df2 has 3,000 rows. df2 is a pivot table derived from df.
My first version had Cust_id as the index of df2 and I tried to use 'right_index=True', which gave a KeyError.
I then reset the index and used 'on' columns having the same name (on='Cust_id'), which gave KeyError: 'Cust_id'.
I then changed the df2 column to 'Cust_id_2' to isolate where the error was coming from, and now receive KeyError: 'Cust_id_2'.
I've read through multiple posts on 'KeyError' but have not found (or understood) the solution to this issue.
Any help or pointers in the right direction greatly appreciated.
You slice df2 to only keep year_freq_score, so there is no more Cust_id_2 column for the merge.
Do instead:
result = pd.merge(
df,
df2,
how='left',
left_on='Cust_id',
right_on='Cust_id_2'
)
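If the intent was to keep only year_freq_score from df2, a variant of the same call (a sketch using the column names above) is to keep Cust_id_2 just for the merge and drop it afterwards:
result = pd.merge(
    df,
    df2[['Cust_id_2', 'year_freq_score']],
    how='left',
    left_on='Cust_id',
    right_on='Cust_id_2'
).drop(columns='Cust_id_2')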
I have the dataframe below, where I am trying to group the rows into a single row per person and purchase_id. The column purchase_date_location contains the name of the column in which the date is located for that purchase. I am trying to use that location to determine the earliest date a purchase was made.
person  purchase_id  purchase_date_location  column_z    column_x    final_purchase_date
a       1            column_z                NaN         NaN
a       1            column_z                2022-01-01  NaN
a       1            column_z                2022-02-01  NaN
b       2            column_x                NaN         NaN
b       2            column_x                NaN         2022-03-03
I have tried this so far:
groupings = {df.purchase_date_location.iloc[0]: 'min'}
df2 = df.groupby('purchase_id', as_index=False).agg(groupings)
My problem is that, due to the iloc[0], the value will always be column_z. How do I make this value change per row instead of being fixed to the first one?
I would try to solve it like this:
# for each row, look up the date stored in the column named by purchase_date_location
df['purchase_date'] = df[['purchase_date_location']].apply(
    lambda x: df.loc[x.name, x.iloc[0]], axis=1)
df2 = df.groupby('purchase_id', as_index=False).agg({"purchase_date": min})
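An alternative to the row-wise apply (a sketch that assumes the column names above and, per the question, groups by person as well; final_purchase_date is the target name used in the question) is a vectorised lookup into the underlying array:
import numpy as np
import pandas as pd

# map each purchase_date_location value to a position among the referenced columns
idx, cols = pd.factorize(df['purchase_date_location'])
# pick, for every row, the value stored in its own date column
df['purchase_date'] = pd.to_datetime(
    df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx])
df2 = (df.groupby(['person', 'purchase_id'], as_index=False)['purchase_date'].min()
         .rename(columns={'purchase_date': 'final_purchase_date'}))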
I have two dataframes. The second dataframe contains the values to be updated in the first one. df1:
data=[[1,"potential"],[2,"lost"],[3,"at risk"],[4,"promising"]]
df=pd.DataFrame(data,columns=['id','class'])
id class
1 potential
2 lost
3 at risk
4 promising
df2:
data2=[[2,"new"],[4,"loyal"]]
df2=pd.DataFrame(data2,columns=['id','class'])
id class
2 new
4 loyal
expected output:
data3=[[1,"potential"],[2,"new"],[3,"at risk"],[4,"loyal"]]
df3=pd.DataFrame(data3,columns=['id','class'])
id class
1 potential
2 new
3 at risk
4 loyal
The code below seems to be working, but I believe there is a more effective solution.
final=df.append([df2])
final = final.drop_duplicates(subset='id', keep="last")
Addition:
Is there a way for me to write the previous value in a new column?
Like this:
id class prev_class modified date
1 potential nan nan
2 new lost 2022.xx.xx
3 at risk nan nan
4 loyal promising 2022.xx.xx
Your solution is good; here is an alternative with concat, with DataFrame.sort_values added:
df = (pd.concat([df, df2])
.drop_duplicates(subset='id', keep="last")
.sort_values('id', ignore_index=True))
print (df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
The solution changes if you need to add the previous class values and today's date:
df3 = pd.concat([df, df2])
mask = df3['id'].duplicated(keep='last')
df31 = df3[mask]
df32 = df3[~mask]
df3 = (df32.merge(df31, on='id', how='left', suffixes=('','_prev'))
.sort_values('id', ignore_index=True))
df3.loc[df3['class_prev'].notna(), 'modified date'] = pd.to_datetime('now').normalize()
print (df3)
id class class_prev modified date
0 1 potential NaN NaT
1 2 new lost 2022-03-31
2 3 at risk NaN NaT
3 4 loyal promising 2022-03-31
We can use DataFrame.update:
df = df.set_index('id')
df.update(df2.set_index('id'))
df = df.reset_index()
Result
print(df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
You can operate along your ids by setting them as your index, and use combine_first to perform this operation. Then assigning your prev_class is extremely straightforward because you've properly used the Index!
df = df.set_index('id')
df2 = df2.set_index('id')
out = (
    df2.combine_first(df)
       .assign(
           # previous class comes from the original df, only for the updated ids
           prev_class=df.loc[df2.index, "class"],
           modified=lambda d:
               d["prev_class"].where(
                   d["prev_class"].isna(), pd.Timestamp.now()
               )
       )
)
print(out)
class prev_class modified
id
1 potential NaN NaN
2 new lost 2022-03-31 06:51:20.832668
3 at risk NaN NaN
4 loyal promising 2022-03-31 06:51:20.832668
I want to merge a separate dataframe (df2) with the main dataframe (df1), but if, for a given row, the date in df1 does not exist in df2, then search for the most recent date before the underlying date in df1.
I tried to use pd.merge, but it would remove rows with unmatched dates, and only keep the rows that matched in both df's.
df1 = [['2007-01-01','A'],
['2007-01-02','B'],
['2007-01-03','C'],
['2007-01-04','B'],
['2007-01-06','C']]
df2 = [['2007-01-01','B',3],
['2007-01-02','A',4],
['2007-01-03','B',5],
['2007-01-06','C',3]]
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
df1[0] = pd.to_datetime(df1[0])
df2[0] = pd.to_datetime(df2[0])
Current df1 | pd.merge():
0 1 2
0 2007-01-06 C 3
This only keeps the dates that match exactly in both df's; it does not consider values from earlier dates.
Expected df1:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3
2 2007-01-03 C NaN
3 2007-01-04 B 3
4 2007-01-06 C 3
The NaNs appear where no data exists on or before that date in df2. For index row 1, the value comes from the day before, while for index row 4 it comes from exactly the same day.
Check your output using merge_asof:
pd.merge_asof(df1,df2,on=0,by=1,allow_exact_matches=True)
Out[15]:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3.0
2 2007-01-03 C NaN
3 2007-01-04 B 5.0 # here it should be 5, since that date is closer; also df2 has two B rows
4 2007-01-06 C 3.0
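One caveat worth noting (a general merge_asof requirement, not specific to this data): both frames must be sorted by the merge key, otherwise pandas raises an error, so sorting first is a safe habit:
df1 = df1.sort_values(0)
df2 = df2.sort_values(0)
out = pd.merge_asof(df1, df2, on=0, by=1, allow_exact_matches=True)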
In your merge code, which I assume you have since it's not present in your question, insert the argument how='left' or how='outer'.
It should look like this:
dfmerged = pd.merge(df1, df2, how='left', left_on=['Date'], right_on=['Date'])
You can then use slicing and renaming to keep the columns you wish.
dfmerged = dfmerged[['Date', 'Letters', 'Numbers']]
Note: I do not know your column names since you haven't shown any code. Substitute as necessary
I have two dataframes in Pandas which are being merged together, df.A and df.B; df.A is the original, and df.B has the new data I want to bring over. The merge works fine, and as expected I get two columns, col_x and col_y, in the merged df.
However, in some rows, the original df.A has values where the other df.B does not. My question is, how can I selectively take the values from col_x and col_y and place them into a new col such as col_z ?
Here's what I mean, how can I merge df.A:
date impressions spend col
1/1/15 100000 3.00 ABC123456
1/2/15 145000 5.00 ABCD00000
1/3/15 300000 15.00 (null)
with df.B
date col
1/1/15 (null)
1/2/15 (null)
1/3/15 DEF123456
To get:
date impressions spend col_z
1/1/15 100000 3.00 ABC123456
1/2/15 145000 5.00 ABCD00000
1/3/15 300000 15.00 DEF123456
Any help or point in the right direction would be really appreciated!
Thanks
OK, assuming that your (null) values are in fact NaN values and not that string, then the following works:
In [10]:
# create the merged df
merged = dfA.merge(dfB, on='date')
merged
Out[10]:
date impressions spend col_x col_y
0 2015-01-01 100000 3 ABC123456 NaN
1 2015-01-02 145000 5 ABCD00000 NaN
2 2015-01-03 300000 15 NaN DEF123456
You can use where to conditionally assign a value from the _x and _y columns:
In [11]:
# now create col_z using where
merged['col_z'] = merged['col_x'].where(merged['col_x'].notnull(), merged['col_y'])
merged
Out[11]:
date impressions spend col_x col_y col_z
0 2015-01-01 100000 3 ABC123456 NaN ABC123456
1 2015-01-02 145000 5 ABCD00000 NaN ABCD00000
2 2015-01-03 300000 15 NaN DEF123456 DEF123456
You can then drop the extraneous columns:
In [13]:
merged = merged.drop(['col_x','col_y'],axis=1)
merged
Out[13]:
date impressions spend col_z
0 2015-01-01 100000 3 ABC123456
1 2015-01-02 145000 5 ABCD00000
2 2015-01-03 300000 15 DEF123456
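A shorter equivalent of the where step above (same merged frame; fillna also takes col_y wherever col_x is missing):
merged['col_z'] = merged['col_x'].fillna(merged['col_y'])
merged = merged.drop(['col_x', 'col_y'], axis=1)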
IMO the shortest and yet readable solution is something like this:
df.A.loc[df.A['col'].isna(), 'col'] = df.A.merge(df.B, how='left', on='date')['col_y']
What it basically does is assign values from the merged table's col_y column to the primary df.A table, for those rows whose col column is empty (the .isna() condition).
If you have data that contains NaNs and you want to fill the NaNs from another dataframe
(one that matches the index and column names), you can do the following:
df_A : the target DataFrame that contains NaN elements
df_B : the source DataFrame that completes the missing elements
df_A = df_A.where(df_A.notnull(),df_B)
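Assuming the indexes and column names line up as stated, combine_first expresses the same idea in a single call: values from df_A take priority and gaps are filled from df_B.
df_A = df_A.combine_first(df_B)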