Merge Only When Value is Empty/Null in Pandas - python

I have two dataframes in Pandas that are being merged together, df.A and df.B; df.A is the original, and df.B has the new data I want to bring over. The merge works fine, and as expected I get two columns, col_x and col_y, in the merged df.
However, in some rows the original df.A has values where df.B does not. My question is: how can I selectively take the values from col_x and col_y and place them into a new column such as col_z?
Here's what I mean: how can I merge df.A:
date impressions spend col
1/1/15 100000 3.00 ABC123456
1/2/15 145000 5.00 ABCD00000
1/3/15 300000 15.00 (null)
with df.B
date col
1/1/15 (null)
1/2/15 (null)
1/3/15 DEF123456
To get:
date impressions spend col_z
1/1/15 100000 3.00 ABC123456
1/2/15 145000 5.00 ABCD00000
1/3/15 300000 15.00 DEF123456
Any help or a point in the right direction would be really appreciated!
Thanks

OK, assuming that your (null) values are in fact NaN values and not that literal string, then the following works:
In [10]:
# create the merged df
merged = dfA.merge(dfB, on='date')
merged
Out[10]:
date impressions spend col_x col_y
0 2015-01-01 100000 3 ABC123456 NaN
1 2015-01-02 145000 5 ABCD00000 NaN
2 2015-01-03 300000 15 NaN DEF123456
You can use where to conditionally assign a value from the _x and _y columns:
In [11]:
# now create col_z using where
merged['col_z'] = merged['col_x'].where(merged['col_x'].notnull(), merged['col_y'])
merged
Out[11]:
date impressions spend col_x col_y col_z
0 2015-01-01 100000 3 ABC123456 NaN ABC123456
1 2015-01-02 145000 5 ABCD00000 NaN ABCD00000
2 2015-01-03 300000 15 NaN DEF123456 DEF123456
You can then drop the extraneous columns:
In [13]:
merged = merged.drop(['col_x','col_y'],axis=1)
merged
Out[13]:
date impressions spend col_z
0 2015-01-01 100000 3 ABC123456
1 2015-01-02 145000 5 ABCD00000
2 2015-01-03 300000 15 DEF123456
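An equivalent, slightly shorter way to express the same fallback (a sketch, assuming the same merged frame as above) is combine_first, which takes col_x where it is not NaN and col_y otherwise:
# combine_first keeps col_x where present and falls back to col_y
merged['col_z'] = merged['col_x'].combine_first(merged['col_y'])
merged = merged.drop(['col_x', 'col_y'], axis=1)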

IMO the shortest yet still readable solution is something like this:
df.A.loc[df.A['col'].isna(), 'col'] = df.A.merge(df.B, how='left', on='date')['col_y']
What it basically does is assign the values from the merged table's col_y column back to the primary df.A table, but only for the rows where col is empty (the .isna() condition).
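Here is a minimal runnable sketch of that one-liner, using hypothetical frames dfA and dfB in place of df.A and df.B; the left merge preserves dfA's row order, so the merged col_y series aligns with dfA's index:
import pandas as pd
import numpy as np

dfA = pd.DataFrame({'date': ['1/1/15', '1/2/15', '1/3/15'],
                    'col': ['ABC123456', 'ABCD00000', np.nan]})
dfB = pd.DataFrame({'date': ['1/1/15', '1/2/15', '1/3/15'],
                    'col': [np.nan, np.nan, 'DEF123456']})

# fill only the rows of dfA where col is missing, taking col_y from the merge
dfA.loc[dfA['col'].isna(), 'col'] = dfA.merge(dfB, how='left', on='date')['col_y']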

If your data contains NaNs and you want to fill them from another dataframe (one that matches on index and column names), you can do the following:
df_A : the target DataFrame that contains the NaN elements
df_B : the source DataFrame that completes the missing elements
df_A = df_A.where(df_A.notnull(), df_B)
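A small sketch of that pattern with two hypothetical frames that share the same index and columns (df_A.fillna(df_B) would behave the same way here):
import pandas as pd
import numpy as np

idx = ['1/1/15', '1/2/15', '1/3/15']
df_A = pd.DataFrame({'col': ['ABC123456', 'ABCD00000', np.nan]}, index=idx)
df_B = pd.DataFrame({'col': [np.nan, np.nan, 'DEF123456']}, index=idx)

# keep df_A's values where present, take df_B's values where df_A is NaN
df_A = df_A.where(df_A.notnull(), df_B)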

Related

How to compare the difference of two sets of columns in a dataframe in an efficient and dynamic way

I have two dataframes, df1 and df2, with the same set of columns. I merged the two dfs together and want to:
calculate the score diff for score1 and score2
see if the region column is the same.
The desired result would be to have 'score1_diff', 'score2_diff' and 'regional_diff' in df_final. I created the 'score1_diff', 'score2_diff' and 'regional_diff' columns in df_final using the code shown below.
However, my real dataframe has over 30 score columns and over 10 region columns, and more score and region columns will be added from time to time. Instead of creating these columns one by one, what would be an efficient or dynamic way to achieve the same result?
import pandas as pd
pd.set_option('display.max_columns', None)
df1 = { 'Name':['George','Andrea','micheal','Ann',
'maggie','Ravi','Xien','Jalpa'],
'region':['a','a','a','a','b','b','b','b'],
'score1':[63,42,55,70,38,77,86,99],
'score2':[45,74,44,89,69,49,72,98]}
df2 = { 'Name':['George','Andrea','micheal', 'Matt',
'maggie','Ravi','Xien','Jalpa'],
'region':['a','b','a','a','a','b','b','a'],
'score1':[62,47,55,74,32,77,86,77],
'score2':[45,78,44,89,66,49,72,73]}
df1=pd.DataFrame(df1)
df2=pd.DataFrame(df2)
df_all = pd.merge(df1,df2,how='outer',indicator=True, on='Name',suffixes=('_df1','_df2'))
df_final=df_all.copy()
df_final['score1_diff']=df_final['score1_df1']-df_final['score1_df2']
df_final['score2_diff']=df_final['score2_df1']-df_final['score2_df2']
df_final['regional_diff']=df_final['region_df1']==df_final['region_df2']
Thanks
Start from df_final=df_all.copy()
df_final=df_all.copy()
# You can filter the columns by their suffix and drop the region column. Next, find the number of score columns and rename both sides to the same names so they can be subtracted.
num_columns = len(df_final.filter(regex='_df1').iloc[:, 1:].columns)
cols = ['score{}_diff'.format(i) for i in range(1, 1 + num_columns)]
# get the _df1 and _df2 columns -- NOTE: the column order must match on both sides, or you'll need to align them first; you don't want score1 - score2, for instance. This assumes all columns are aligned correctly.
dfa = df_final.filter(regex='_df1').iloc[:, 1:]
dfa.columns=cols
dfb = df_final.filter(regex='_df2').iloc[:, 1:]
dfb.columns=cols
df_diff = dfa - dfb
df_final = pd.concat([df_final, df_diff], axis=1)
df_final['regional_diff']=df_final['region_df1']==df_final['region_df2']
df_final
Name region_df1 score1_df1 score2_df1 region_df2 score1_df2 \
0 George a 63.000 45.000 a 62.000
1 Andrea a 42.000 74.000 b 47.000
2 micheal a 55.000 44.000 a 55.000
3 Ann a 70.000 89.000 NaN NaN
4 maggie b 38.000 69.000 a 32.000
5 Ravi b 77.000 49.000 b 77.000
6 Xien b 86.000 72.000 b 86.000
7 Jalpa b 99.000 98.000 a 77.000
8 Matt NaN NaN NaN a 74.000
score2_df2 _merge score1_diff score2_diff regional_diff
0 45.000 both 1.000 0.000 True
1 78.000 both -5.000 -4.000 False
2 44.000 both 0.000 0.000 True
3 NaN left_only NaN NaN False
4 66.000 both 6.000 3.000 False
5 49.000 both 0.000 0.000 True
6 72.000 both 0.000 0.000 True
7 73.000 both 22.000 25.000 False
8 89.000 right_only NaN NaN False
Building on Jonathan's idea, below is the solution that works for me. The general idea is to create two dfs holding the columns with the _df1 and _df2 suffixes, respectively, then replace each suffix with _diff so that one df can be subtracted from the other.
This approach is essentially the same as Jonathan's solution, except that I filter the columns by their naming convention rather than selecting them by position with iloc.
df_final = df_all.copy()
# num_col holds the base names of the numeric columns to diff, e.g.:
num_col = ['score1', 'score2']
num_current = [itm + '_df1' for itm in num_col]
num_previous = [itm + '_df2' for itm in num_col]
dfx = df_final[num_current]
dfx.columns = dfx.columns.str.replace('_df1', '_diff')
dfy = df_final[num_previous]
dfy.columns = dfy.columns.str.replace('_df2', '_diff')
df_diff = dfx - dfy
df_all = pd.concat([df_final, df_diff], axis=1)
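If even listing the base column names by hand is too manual, a hedged sketch of a fully dynamic variant is to derive the base names from the suffixed columns of df_all and loop over them (the _df1/_df2 suffixes come from the merge above; everything else is an assumption about the naming convention):
# base names of every column that appears with a _df1 suffix, excluding the region columns
score_cols = [c[:-len('_df1')] for c in df_all.columns
              if c.endswith('_df1') and not c.startswith('region')]

df_final = df_all.copy()
for col in score_cols:
    df_final[col + '_diff'] = df_final[col + '_df1'] - df_final[col + '_df2']
df_final['regional_diff'] = df_final['region_df1'] == df_final['region_df2']
The region columns could be compared with a similar loop if there is more than one of them.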

How to calculate time-difference in pandas with NaN-values

I'm relatively new to Pandas and have tried searching already, but I couldn't find a solution.
I have a dataframe with Transaction-No., customerId and the date of purchase which looks like this:
Transaction 12345 12346 12347 12348 12349
customerID
1 NaN 2019-09-01 NaN 2019-09-11 2019-09-22...
2 2019-10-01 NaN NaN NaN 2019-10-07...
3 ...
The dataframe has [6334 rows x 8557 columns].
Every row has NaN-values, as the Transaction-No. is unique.
I would like to calculate the date difference for each row so I get
customerID Datedifference1 Datedifference2 etc.
1 10 11
2 6
3 ...
I'm struggling to get a list with the date differences for every customerID.
Is there a way to ignore NaN in the dataframe and to only calculate on the values that are not NaN?
I would like to have a list with customerId and the datediff between purchase 1 and 2, 2 and 3, etc. to estimate the days until the next purchase will occur.
Is there a solution for that?
The idea is to reshape the data with DataFrame.stack, compute the differences per group, drop the missing value that diff produces for the first entry of each group, and reshape back:
df = df.apply(pd.to_datetime)
df1 = (df.stack()
         .groupby(level=0)
         .diff()
         .dropna()
         .dt.days
         .reset_index(level=1, drop=True)
         .to_frame())
df1 = (df1.set_index(df1.groupby(['customerID']).cumcount(), append=True)[0]
          .unstack()
          .add_prefix('Datedifference'))
print (df1)
Datedifference0 Datedifference1
Transaction
1 10.0 11.0
2 6.0 NaN
EDIT: If the input data are in long format instead, the solution changes: convert the date column to datetimes, create a new column of differences with DataFrameGroupBy.diff, remove the NaN rows with DataFrame.dropna, and finally reshape with DataFrame.set_index and unstack, using a counter Series from GroupBy.cumcount:
print (df1)
customerID Transaction date
0 1 12346 2019-09-01
1 1 12348 2019-09-11
2 1 12349 2019-09-22
3 2 12345 2019-10-01
4 2 12349 2019-10-07
df1['date'] = pd.to_datetime(df1['date'])
df1['diff'] = df1.groupby('customerID')['date'].diff().dt.days
df1 = df1.dropna(subset=['diff'])
df2 = (df1.set_index(['customerID', df1.groupby('customerID').cumcount()])['diff']
          .unstack()
          .add_prefix('Datedifference'))
print (df2)
Datedifference0 Datedifference1
customerID
1 10.0 11.0
2 6.0 NaN

pandas - downsample a more frequent DataFrame to the frequency of a less frequent DataFrame

I have two DataFrames that hold different data measured at different frequencies, as in these CSV examples:
df1:
i,m1,m2,t
0,0.556529,6.863255,43564.844
1,0.5565576199999884,6.86327749999999,43564.863999999994
2,0.5565559400000003,6.8632764,43564.884
3,0.5565699799999941,6.863286799999996,43564.903999999995
4,0.5565570200000007,6.863277200000001,43564.924
5,0.5565316400000097,6.863257100000007,43564.944
...
df2:
i,m3,m4,t
0,306.81162500000596,-1.2126870045404683,43564.878125
1,306.86175000000725,-1.1705838272666433,43564.928250000004
2,306.77552454544787,-1.1240195386446195,43564.97837499999
3,306.85900545454086,-1.0210345363692084,43565.0285
4,306.8354250000052,-1.0052431772666657,43565.078625
5,306.88397499999286,-0.9468344809917896,43565.12875
...
I would like to obtain a single df that has all the measures of both dfs at the times of the first one (which gets data less frequently).
I tried to do that with a for loop averaging over the df2 measures between two timestamps of df1 but it was extremely slow.
IIUC, i is the index column and you want to put df2['t'] into bins and average the other columns. You can use pd.cut for that:
import numpy as np

# bin df2's timestamps by the (less frequent) timestamps of df1
groups = pd.cut(df2.t, bins=list(df1.t) + [np.inf],
                right=False,
                labels=df1['t'])
# cols to copy
cols = [col for col in df2.columns if col != 't']
# groupby and get the average
new_df = (df2[cols].groupby(groups)
                   .mean()
                   .reset_index())
Then new_df is:
t m3 m4
0 43564.844 NaN NaN
1 43564.864 306.811625 -1.212687
2 43564.884 NaN NaN
3 43564.904 NaN NaN
4 43564.924 306.861750 -1.170584
5 43564.944 306.838482 -1.024283
which you can merge with df1 on t:
df1.merge(new_df, on='t', how='left')
and get:
m1 m2 t m3 m4
0 0.556529 6.863255 43564.8 NaN NaN
1 0.556558 6.863277 43564.9 306.811625 -1.212687
2 0.556556 6.863276 43564.9 NaN NaN
3 0.556570 6.863287 43564.9 NaN NaN
4 0.556557 6.863277 43564.9 306.861750 -1.170584
5 0.556532 6.863257 43564.9 306.838482 -1.024283

pandas how to outer join without creating new columns

I have 2 pandas dataframes like this:
date value
20100101 100
20100102 150
date value
20100102 150.01
20100103 180
The expected output should be:
date value
20100101 100
20100102 150
20100103 180
The 2nd dataframe always contains the newest values that I'd like to add into the 1st dataframe. However, the value on the same day may differ slightly between the two dataframes. I would like to ignore the dates that already exist and focus on adding the new dates and values into the 1st dataframe.
I've tried an outer join in pandas, but it gives me two columns, value_x and value_y, because the values are not exactly the same on the same dates. Any solution to this?
I believe you need concat with drop_duplicates:
df = pd.concat([df1,df2]).drop_duplicates('date', keep='last')
print (df)
date value
0 20100101 100.00
0 20100102 150.01
1 20100103 180.00
Using keep='first' instead keeps the value from df1 on overlapping dates, which matches the desired output:
df = pd.concat([df1,df2]).drop_duplicates('date', keep='first')
print (df)
date value
0 20100101 100.0
1 20100102 150.0
1 20100103 180.0
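Another option, if date can serve as the index, is combine_first: it keeps df1's value on overlapping dates and adds the rows that only exist in df2 (a sketch, assuming unique dates in each frame):
df = (df1.set_index('date')
         .combine_first(df2.set_index('date'))
         .reset_index())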

How to reference groupby index when using apply, transform, agg - Python Pandas?

To be concrete, say we have two DataFrames:
df1:
date A
0 12/1/14 3
1 12/1/14 1
2 12/3/14 2
3 12/3/14 3
4 12/3/14 4
5 12/6/14 5
df2:
B
12/1/14 10
12/2/14 20
12/3/14 10
12/4/14 30
12/5/14 10
12/6/14 20
Now I want to groupby date in df1, and take a sum of value A in each group and then normalize it by the value of B in df2 in the corresponding date. Something like this
df1.groupby('date').agg(lambda x: np.sum(x)/df2.loc[x.date,'B'])
The question is that neither aggregate, apply, nor transform can reference to the index. Any idea how to work around this?
When you call .groupby('column'), that column becomes part of the DataFrameGroupBy index, and inside each group it is accessible through the .index property.
So, in your case, assuming that date is NOT part of the index in either df, this should work:
def f(x):
    return x.sum() / df2.set_index('date').loc[x.index[0], 'B']

df1.set_index('date').groupby(level='date').apply(f)
This produces:
A
date
2014-01-12 0.40
2014-03-12 0.90
2014-06-12 0.25
If date is in index of df2 - just use df2.loc[x.index[0], 'B'] in the above code.
If date is in df1.index change the last line to df1.groupby(level='date').apply(f).
df_grouped = df1.groupby('date').sum()
print(df_grouped['A'] / df2['B'].astype(float))
date
12/1/14 0.40
12/2/14 NaN
12/3/14 0.90
12/4/14 NaN
12/5/14 NaN
12/6/14 0.25
dtype: float64
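A sketch of an alternative that avoids reaching into df2 from inside the aggregation function at all: merge B into df1 on date first, then the normalization is an ordinary groupby (this assumes date is a column in df1 and the index of df2, as shown above):
joined = df1.merge(df2, left_on='date', right_index=True)
g = joined.groupby('date').agg(A_sum=('A', 'sum'), B=('B', 'first'))
result = g['A_sum'] / g['B']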
