Why pandas merge is skipping some rows - python

I have df1 and df2, which share a common column (time), where df2.time ⊆ df1.time. df1's shape is (2353×11) and df2's shape is (57×1). I'm trying to create df3 with the merge method, to extract some rows from df1 based on the rows of df2. The issue is that df3 is missing some rows, even though both time columns are float64 and contain matching values.
df3 should also have 57 rows, but I get only 54!
(previews of df1 and df2 not shown)
import numpy as np
import pandas as pd
import scipy.signal as sig

def pressure_filter(noisydata, reducedtime, filcutoff, tzero):
    b, a = sig.butter(2, filcutoff, btype='low', analog=False)
    noisydata['p_lowcut'] = sig.filtfilt(b, a, noisydata.p_noisy)
    noisydata.at[0, 'p_lowcut'] = noisydata.at[0, 'p_noisy']
    noisydata['p_lowcut_ma'] = noisydata['p_lowcut'].rolling(20, center=True).mean()
    noisydata['p_lowcut_ma'] = noisydata.apply(
        lambda row: row['p_lowcut'] if np.isnan(row['p_lowcut_ma'])
        else row['p_lowcut_ma'], axis=1)
    datared = pd.merge(noisydata, reducedtime, on=['time'], how='inner')
    return datared
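No accepted answer appears above, but a common cause of exactly this symptom is floating-point keys: merge compares float64 values for exact bit-level equality, so times that print identically can differ in the last few bits and silently drop out of an inner join. A minimal sketch (the column names here are made up) showing the failure and a rounding workaround:

```python
import pandas as pd

# 0.1 + 0.2 prints as 0.3 but is stored as 0.30000000000000004,
# so an exact-equality merge on the float key drops that row
df1 = pd.DataFrame({'time': [0.1 + 0.2, 0.5], 'p': [1.0, 2.0]})
df2 = pd.DataFrame({'time': [0.3, 0.5]})

exact = pd.merge(df1, df2, on='time', how='inner')    # only the 0.5 row survives

# rounding both keys to a common precision restores the match
rounded = pd.merge(df1.assign(time=df1['time'].round(6)),
                   df2.assign(time=df2['time'].round(6)),
                   on='time', how='inner')            # both rows survive
```

pd.merge_asof with a small tolerance is another option when the keys are close but not bit-identical.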

Related

How to combine certain columns from one data frame to another?

How to add three columns from one data frame to another at a certain position?
I want to add these columns after a specific column: DF1's ['C','D'] should go after columns A and B in DF2. How do I join columns in between other columns of another dataframe?
df1=pd.read_csv(csvfile)
df2=pd.read_csv(csvfile)
df1['C','D','E'] to df2['K','L','A','B','F']
so it looks like df3= ['K','L','A','B','C','D','F']
Use concat with DataFrame.reindex to change the order of the columns:
df3 = pd.concat([df1, df2], axis=1).reindex(['K','L','A','B','C','D'], axis=1)
More general solution:
df1 = pd.DataFrame(columns=['H','G','C','D','E'])
df2 = pd.DataFrame(columns=['K','L','A','B','F'])
df3 = pd.concat([df1, df2], axis=1)
c = df3.columns.difference(['C', 'D'], sort=False)
pos = c.get_loc('B') + 1
c = list(c)
#https://stackoverflow.com/a/3748092/2901002
c[pos:pos] = ['C', 'D']
df3 = df3.reindex(c, axis=1)
print (df3)
Empty DataFrame
Columns: [H, G, E, K, L, A, B, C, D, F]
Index: []
Try:
df3=pd.DataFrame()
df3[['K','L','A','B']]=df2[['K','L','A','B']]
df3[['C','D','E']]=df1[['C','D','E']]
Finally:
df3=df3[['K','L','A','B','C','D']]
OR
df3=df3.loc[:,['K','L','A','B','C','D']]
This should work:
pd.merge(df1, df2, left_index=True, right_index=True)[['K','L','A','B','C','D']]
Or simply use join, which performs a left join by default:
df1.join(df2)[['K','L','A','B','C','D']]

Merge dataframe dynamic

I have 2 dataframes: df1 and df2. I would like to merge them on the link column in df2, which contains a list of column/value pairs that match rows in df1:
df1 = pd.DataFrame({'p':[1,2,3,4], 'a':[1,2,2,2],'b':['z','z','z','z'],'c':[3,3,4,4],'d':[5,5,5,6]})
df2 = pd.DataFrame({'e':[11,22,33,44], 'link':['a=1,c=3','a=2,c=3','a=2,c=4,d=5','a=2,c=4']})
The result should be a dataframe like this, where column e from df2 is merged into df1:
df_res = pd.DataFrame({'p':[1,2,3,3,4], 'a':[1,2,2,2,2],'b':['z','z','z','z','z'],'c':[3,3,4,4,4],'d':[5,5,5,5,6],'e':[11,22,33,44,44]})
How can this be done in pandas?
df1["e"] = df2["e"]
frames = [df1, df2]
result = pd.concat(frames)
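Note that the answer above assigns e by row position and ignores the link column entirely, so it does not produce the requested result. A sketch that actually parses each link entry into column/value conditions and filters df1 with them (assuming, as in the example, that the condition values are integers):

```python
import pandas as pd

df1 = pd.DataFrame({'p': [1, 2, 3, 4], 'a': [1, 2, 2, 2], 'b': ['z', 'z', 'z', 'z'],
                    'c': [3, 3, 4, 4], 'd': [5, 5, 5, 6]})
df2 = pd.DataFrame({'e': [11, 22, 33, 44],
                    'link': ['a=1,c=3', 'a=2,c=3', 'a=2,c=4,d=5', 'a=2,c=4']})

pieces = []
for _, row in df2.iterrows():
    # 'a=2,c=4' -> {'a': '2', 'c': '4'}
    conds = dict(kv.split('=') for kv in row['link'].split(','))
    mask = pd.Series(True, index=df1.index)
    for col, val in conds.items():
        mask &= df1[col] == int(val)   # assumes integer-valued conditions
    matched = df1[mask].copy()
    matched['e'] = row['e']
    pieces.append(matched)

df_res = pd.concat(pieces, ignore_index=True)
# df_res['e'] -> [11, 22, 33, 44, 44], matching the expected output above
```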

Iterate over DataFrame lines in Python

I want to iterate over df and change values. If the values of A and B in df are equal to A and B in df2, then C in df should become C + E/2 from df2.
df = pd.read_csv('final.csv',names=['A','B','C','D'])
df2 = pd.read_csv('final.csv',names=['A','B','C','D','E'])
for x in df2:
    z = x.loc['A','B']
    df.loc['A','B']
    a = df[['C']]
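Rather than iterating row by row, a left merge on A and B lines up the matching rows in one shot. A sketch, assuming "C+E/2" means C + E/2 and inventing small frames since final.csv isn't available:

```python
import pandas as pd

# stand-in data; the real frames come from pd.read_csv('final.csv')
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20], 'C': [100.0, 200.0], 'D': [0, 0]})
df2 = pd.DataFrame({'A': [1, 3], 'B': [10, 30], 'C': [5, 6], 'D': [0, 0], 'E': [8.0, 9.0]})

# bring E across only where (A, B) matches
merged = df.merge(df2[['A', 'B', 'E']], on=['A', 'B'], how='left')
hit = merged['E'].notna()                 # rows where A and B matched df2
merged.loc[hit, 'C'] = merged.loc[hit, 'C'] + merged.loc[hit, 'E'] / 2
df = merged.drop(columns='E')
```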

Check if text in data frame exists in any of all headers pandas

I have a dataframes from an excel called
df1, df2, df3, df4
I also have df called df5 below.
A B C
df1 df2 df3
df1 df3 df4
How do I check, for each row, whether A, B, and C contain text, then fetch the dataframe with that name and act on it? All dataframes have columns A, B, C.
So for row 1,
go to df1 df1.pop('A')
go to df2 df2.pop('A')
go to df3 df3.pop('A')
I'm aware of solutions that involve columns.
df = pd.DataFrame([[0,1],[2,3],[4,5]], columns=['A', 'B'])
aa = ((df['A'] == 2) & (df['B'] == 3)).any()
Not quite what I desire.
Below could be one way to handle this.
Create a dictionary mapping dataframe names to dataframe objects:
objs = {'df1': df1, 'df2': df2, 'df3': df3}
Define a function which manipulates the dataframes:
def handler(df):
    df.pop('A')
Then apply it over the df columns:
df['A'].apply(lambda x: handler(objs.get(x)))
Maybe not the most elegant way, but it should meet your requirement.
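Putting the pieces together as a runnable sketch (the frame contents are invented stand-ins for the Excel sheets; note that df1 appears twice in df5, so collecting the names into a set avoids popping the same column twice):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
df2 = pd.DataFrame({'A': [4], 'B': [5], 'C': [6]})
df3 = pd.DataFrame({'A': [7], 'B': [8], 'C': [9]})
df4 = pd.DataFrame({'A': [10], 'B': [11], 'C': [12]})
df5 = pd.DataFrame({'A': ['df1', 'df1'], 'B': ['df2', 'df3'], 'C': ['df3', 'df4']})

# map names appearing in df5 to the actual dataframe objects
objs = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}

# each name at most once, so pop('A') never raises on a repeated name
for name in set(df5.values.ravel()):
    objs[name].pop('A')
```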

Compare 2 Pandas dataframes, row by row, cell by cell

I have 2 dataframes, df1 and df2, and want to do the following, storing results in df3:
for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell (column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparison function func(a, b)) and,
                depending on the result of the comparison, write the result into the
                appropriate column of the "df1-1, df2-1" row of df3
For example, something like:
df1
A B C D
foo bar foobar 7
gee whiz herp 10
df2
A B C D
zoo car foobar 8
df3
df1-df2 A B C D
foo-zoo func(foo,zoo) func(bar,car) func(foobar,foobar) func(7,8)
gee-zoo func(gee,zoo) func(whiz,car) func(herp,foobar) func(10,8)
I've started with this:
for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:
but am not sure what to do with it, and would appreciate some help.
So to continue the discussion in the comments, you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you shouldn't ever be calling iterrows(). To be a little more explicit with my suggestion:
# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']
# recall that df2 only has the one row so pandas will broadcast a NaN there
df3
0 foofoofoozoo
1 NaN
Name: A, dtype: object
# more generally
# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns)
for colName in df1:
    df3[colName] = func(df1[colName], df2[colName])
Now, you could even have different functions applied to different columns by, say, creating lambda functions and then zipping them with the column names:
# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y: x - y
....
columnFunctions = [colAFunc, colBFunc, ...]
# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])
The only "gotcha" that comes to mind is that you need to be sure that your function is applicable to the data in your columns. For instance, if you were to do something like df1['A'] - df2['A'] (with df1, df2 as you have provided), that would raise a ValueError as the subtraction of two strings is undefined. Just something to be aware of.
Edit, re: your comment: That is doable as well. Iterate over the dfX.columns that is larger, so you don't run into a KeyError, and throw an if statement in there:
# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan  # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])
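For the original pairwise goal (every row of df1 against every row of df2), a cross join avoids iterrows entirely. A sketch, assuming pandas >= 1.2 for how='cross' and using plain equality as a placeholder func:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['foo', 'gee'], 'B': ['bar', 'whiz'],
                    'C': ['foobar', 'herp'], 'D': [7, 10]})
df2 = pd.DataFrame({'A': ['zoo'], 'B': ['car'], 'C': ['foobar'], 'D': [8]})

def func(a, b):
    return a == b   # placeholder; substitute any vectorized comparison

# every (df1 row, df2 row) pair, with suffixes keeping both sides of each column
pairs = df1.merge(df2, how='cross', suffixes=('_1', '_2'))
df3 = pd.DataFrame({c: func(pairs[c + '_1'], pairs[c + '_2']) for c in df1.columns})
```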
