Merging Dataframe with Different Dates? - python

I want to merge a separate dataframe (df2) into the main dataframe (df1), but if, for a given row, the date in df1 does not exist in df2, then fall back to the most recent df2 date before it.
I tried pd.merge, but it removed the rows with unmatched dates and only kept the rows that matched in both df's.
df1 = [['2007-01-01', 'A'],
       ['2007-01-02', 'B'],
       ['2007-01-03', 'C'],
       ['2007-01-04', 'B'],
       ['2007-01-06', 'C']]
df2 = [['2007-01-01', 'B', 3],
       ['2007-01-02', 'A', 4],
       ['2007-01-03', 'B', 5],
       ['2007-01-06', 'C', 3]]
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
df1[0] = pd.to_datetime(df1[0])
df2[0] = pd.to_datetime(df2[0])
Current result of pd.merge():
0 1 2
0 2007-01-06 C 3
It only keeps rows where the date matches exactly in both df's; it does not fall back to values from recent earlier dates.
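For reference, the merge call that produces this single row is presumably the default inner join on both columns (the exact call is not shown in the question):
pd.merge(df1, df2, on=[0, 1])  # how='inner' by default: keeps exact matches only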
Expected df1:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3
2 2007-01-03 C NaN
3 2007-01-04 B 3
4 2007-01-06 C 3
The NaNs appear because no data exists in df2 on or before that date for that letter. For index row 1, the value comes from the day before, while for index row 4 it comes from the same day.

Check your output by using merge_asof:
pd.merge_asof(df1, df2, on=0, by=1, allow_exact_matches=True)
Out[15]:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3.0
2 2007-01-03 C NaN
3 2007-01-04 B 5.0 # should be 5, since that date is closer; df2 has two B rows
4 2007-01-06 C 3.0
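Note that merge_asof requires both frames to be sorted by the on key (otherwise it raises a ValueError); the sample data above is already sorted, but in general sort first:
df1 = df1.sort_values(0)
df2 = df2.sort_values(0)
pd.merge_asof(df1, df2, on=0, by=1, allow_exact_matches=True)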

Using your merge code (which I assume you have, since it's not present in your question), add the argument how='left' or how='outer'.
It should look like this:
dfmerged = pd.merge(df1, df2, how='left', left_on=['Date'], right_on=['Date'])
You can then use slicing and renaming to keep the columns you wish.
dfmerged = dfmerged[['Date', 'Letters', 'Numbers']]
Note: I do not know your column names since you haven't shown any code. Substitute as necessary
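With the numbered columns actually shown in the question, the left merge would be, for example:
dfmerged = pd.merge(df1, df2, on=[0, 1], how='left')
This keeps every row of df1, but it still only fills values for exact date matches; falling back to the most recent earlier date needs merge_asof as in the answer above.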

Related

Grouping a Python Dataframe containing columns with column locations

I have the dataframe below, where I am trying to group the rows into a single row per person and purchase id. The column purchase_date_location contains the name of the column in which the date is located for that purchase. I am trying to use that location to determine the earliest date a purchase was made.
person  purchase_id  purchase_date_location  column_z    column_x    final_purchase_date
a       1            column_z                NaN         NaN
a       1            column_z                2022-01-01  NaN
a       1            column_z                2022-02-01  NaN
b       2            column_x                NaN         NaN
b       2            column_x                NaN         2022-03-03
I have tried this so far:
groupings = {df.purchase_date_location.iloc[0]: 'min'}
df2 = df.groupby('purchase_id', as_index=False).agg(groupings)
My problem is that, because of the iloc[0], the value will always be column_z. How do I make this value change with each row instead of being fixed to the first one?
I would try to solve it like this:
# for each row, look up the value in the column named by purchase_date_location
df['purchase_date'] = df[['purchase_date_location']].apply(
    lambda x: df.loc[x.name, x.iloc[0]], axis=1)
df2 = df.groupby('purchase_id', as_index=False).agg({"purchase_date": min})
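A minimal runnable sketch of this approach, using the sample data from the question (the frame construction below is an assumption based on the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'person': ['a', 'a', 'a', 'b', 'b'],
    'purchase_id': [1, 1, 1, 2, 2],
    'purchase_date_location': ['column_z', 'column_z', 'column_z', 'column_x', 'column_x'],
    'column_z': [np.nan, '2022-01-01', '2022-02-01', np.nan, np.nan],
    'column_x': [np.nan, np.nan, np.nan, np.nan, '2022-03-03'],
})

# for each row, read the value from the column named in purchase_date_location
df['purchase_date'] = df[['purchase_date_location']].apply(
    lambda x: df.loc[x.name, x.iloc[0]], axis=1)

print(df.groupby('purchase_id', as_index=False).agg({'purchase_date': 'min'}))
#    purchase_id purchase_date
# 0            1    2022-01-01
# 1            2    2022-03-03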

pandas: update dataframe values with another dataframe that is not in the same format

I have two dataframes. The second dataframe contains the values to be updated in the first dataframe. df1:
data = [[1, "potential"], [2, "lost"], [3, "at risk"], [4, "promising"]]
df = pd.DataFrame(data, columns=['id', 'class'])
id class
1 potential
2 lost
3 at risk
4 promising
df2:
data2 = [[2, "new"], [4, "loyal"]]
df2 = pd.DataFrame(data2, columns=['id', 'class'])
id class
2 new
4 loyal
expected output:
data3=[[1,"potential"],[2,"new"],[3,"at risk"],[4,"loyal"]]
df3=pd.DataFrame(data3,columns=['id','class'])
id class
1 potential
2 new
3 at risk
4 loyal
The code below seems to be working, but I believe there is a more effective solution.
final = df.append([df2])
final = final.drop_duplicates(subset='id', keep="last")
addition:
Is there a way for me to write the previous value in a new column?
like this:
id class prev_class modified date
1 potential nan nan
2 new lost 2022.xx.xx
3 at risk nan nan
4 loyal promising 2022.xx.xx
Your solution is good (though note that DataFrame.append is deprecated and was removed in pandas 2.0); here is an alternative with concat plus DataFrame.sort_values:
df = (pd.concat([df, df2])
        .drop_duplicates(subset='id', keep="last")
        .sort_values('id', ignore_index=True))
print(df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
The solution changes if you need to add the previous class values and today's date:
df3 = pd.concat([df, df2])
# rows whose id appears again later hold the previous (superseded) values
mask = df3['id'].duplicated(keep='last')
df31 = df3[mask]   # previous rows
df32 = df3[~mask]  # current rows
df3 = (df32.merge(df31, on='id', how='left', suffixes=('', '_prev'))
           .sort_values('id', ignore_index=True))
df3.loc[df3['class_prev'].notna(), 'modified date'] = pd.to_datetime('now').normalize()
print(df3)
id class class_prev modified date
0 1 potential NaN NaT
1 2 new lost 2022-03-31
2 3 at risk NaN NaT
3 4 loyal promising 2022-03-31
We can use DataFrame.update:
df = df.set_index('id')
df.update(df2.set_index('id'))
df = df.reset_index()
Result
print(df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
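Note that DataFrame.update aligns on the index and modifies df in place; entries that are missing (NaN) in df2 leave the corresponding values in df untouched.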
You can operate along your ids by setting them as your index, and use combine_first to perform this operation. Assigning prev_class is then straightforward because you've properly used the index!
df = df.set_index('id')
df2 = df2.set_index('id')

# new classes aligned to df's index; NaN where df2 has no update
upd = df2['class'].reindex(df.index)

out = (
    df2.combine_first(df)
       .assign(
           prev_class=df['class'].where(upd.notna()),  # old value, only where updated
           modified=lambda d:
               d['prev_class'].where(
                   d['prev_class'].isna(), pd.Timestamp.now()
               )
       )
)
print(out)
        class prev_class                   modified
id
1   potential        NaN                        NaN
2         new       lost 2022-03-31 06:51:20.832668
3     at risk        NaN                        NaN
4       loyal  promising 2022-03-31 06:51:20.832668

How can I match the missing values (nan) of two dataframes?

How can I set all my values in df1 to missing if their positional equivalent is a missing value in df2?
Data df1:
Index Data
1 3
2 8
3 9
Data df2:
Index Data
1 nan
2 2
3 nan
desired output:
Index Data
1 nan
2 8
3 nan
So I would like to keep the data of df1, but only for the positions for which df2 also has data entries. For all nans in df2 I would like to replace the value of df1 with nan as well.
I tried the following, but this replaced all data points with nan.
df1 = df1.where(df2 == np.nan, np.nan)
Thank you very much for your help.
Use mask, which does exactly the inverse of where:
df3 = df1.mask(df2.isna())
output:
Index Data
0 1 NaN
1 2 8.0
2 3 NaN
In your case, you were keeping elements only where df2 == np.nan was True; since equality is not the correct way to check for NaN (np.nan == np.nan yields False), the condition was False everywhere and every value was replaced with NaN.
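A quick demonstration of why the equality check fails:
import numpy as np
print(np.nan == np.nan)  # False: NaN compares unequal to everything, itself included
print(np.isnan(np.nan))  # True: use isna()/notna()/np.isnan() instead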
Change df2 == np.nan to df2.notna():
df3 = df1.where(df2.notna(), np.nan)
print(df3)
# Output
Index Data
0 1 NaN
1 2 8.0
2 3 NaN

How to add value of dataframe to another dataframe?

I want to add a row of one dataframe to every row of another dataframe.
df1 = pd.DataFrame({"a": [1, 2],
                    "b": [3, 4]})
df2 = pd.DataFrame({"a": [4], "b": [5]})
I want to add df2 value to every row of df1.
I use df1 + df2 and get the following result:
a b
0 5.0 8.0
1 NaN NaN
But I want to get the following result
a b
0 5 7
1 7 9
Any help would be dearly appreciated!
If you really need to add the values per column (which requires that the number of columns in df2 equals the number of rows in df1), use:
df = df1.add(df2.loc[0].to_numpy(), axis=0)
print(df)
a b
0 5 7
1 7 9
If you need to add by rows instead (the first value of df2's row is added to the first column of df1, and so on), the output is different:
df = df1.add(df2.loc[0], axis=1)
print(df)
a b
0 5 8
1 6 9
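Equivalently, for the row-wise addition you can let NumPy broadcasting do the alignment, since df2 is a single row:
df = df1 + df2.to_numpy()  # shape (1, 2) broadcasts over df1's rows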

Choosing a different value for NaN entries when appending DataFrames with different columns

I am concatenating multiple months of CSVs, where newer versions have additional columns. As a result, putting them all together fills certain rows of certain columns with NaN.
The issue with this behavior is that it mixes these NaNs with true null entries from the data set which need to be easily distinguishable.
My only solution so far is to replace the original NaNs with a unique string, concatenate the CSVs, replace the new NaNs with a second unique string, and then replace the first unique string with NaN.
Given the amount of data I am processing, this is a very inefficient solution. I thought there was some way to control how pandas fills these entries but couldn't find anything on it.
Updated example:
A B
1 NaN
2 3
And append
A B C
1 2 3
Gives
A B C
1 NaN NaN
2 3 NaN
1 2 3
But I want
A B C
1 NaN 'predated'
2 3 'predated'
1 2 3
In case you have a core set of columns, represented here by df1, you could apply .fillna() to the .difference() between the core set and any new columns in more recent DataFrames.
df1 = pd.DataFrame(data={'A': [1, 2], 'B': [np.nan, 3]})
A B
0 1 NaN
1 2 3
df2 = pd.DataFrame(data={'A': 1, 'B': 2, 'C': 3}, index=[0])
A B C
0 1 2 3
df = pd.concat([df1, df2], ignore_index=True)
new_cols = df2.columns.difference(df1.columns).tolist()
df[new_cols] = df[new_cols].fillna(value='predated')
A B C
0 1 NaN predated
1 2 3 predated
2 1 2 3
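Since the question mentions many monthly CSVs, here is a hedged sketch generalizing the same idea to a list of frames, pre-marking each frame's not-yet-existing columns before concatenating so that genuine NaNs are preserved (the frames list and marker string are assumptions):
import pandas as pd

frames = [df1, df2]  # e.g. one DataFrame per month, oldest first
all_cols = set().union(*(f.columns for f in frames))

parts = []
for f in frames:
    part = f.copy()
    for col in all_cols - set(f.columns):
        part[col] = 'predated'  # this column did not exist yet in this file
    parts.append(part)

df = pd.concat(parts, ignore_index=True)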
