merging two dataframes while moving column positions [duplicate] - python

This question already has an answer here:
Merge DataFrames based on index columns [duplicate]
(1 answer)
Closed 4 years ago.
I have a dataframe called df1 that is:
               0
103773708  68.50
103773718  57.01
103773730  30.80
103773739  67.62
I have another one called df2 that is:
               0
103773739  37.02
103773708  30.25
103773730  15.50
103773718  60.54
105496332  20.00
I'm wondering how I would get them to combine to end up looking like df3:
               0      1
103773708  30.25  68.50
103773718  60.54  57.01
103773730  15.50  30.80
103773739  37.02  67.62
105496332  20.00   0.00
As you can see, the indices are not in the same order, so the values have to be aligned by index rather than by position. The goal is to join df1's column 0 onto df2 as a new column 1, keeping df2's values in column 0.

# Left-join df1 onto df2, which holds every index; the rename avoids a column collision.
result = df2.join(df1.rename(columns={0: 1})).fillna(0).sort_index()

Simply merge on index (df2 first, so its values land in column 0), then relabel the columns:
df = pd.merge(df2, df1, left_index=True, right_index=True, how='outer')
df.columns = [0, 1]
df = df.fillna(0)

df1.columns = ['1']  # Rename the column from '0' to '1'. I assume names as strings.
df = df2.join(df1).fillna(0)  # Join by default is LEFT
df
               0      1
103773739  37.02  67.62
103773708  30.25  68.50
103773730  15.50  30.80
103773718  60.54  57.01
105496332  20.00   0.00
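For reference, a minimal self-contained sketch of the join approach, reconstructing the sample frames from the question:

import pandas as pd

# Rebuild the sample data; both frames have a single column named 0.
df1 = pd.DataFrame({0: [68.50, 57.01, 30.80, 67.62]},
                   index=[103773708, 103773718, 103773730, 103773739])
df2 = pd.DataFrame({0: [37.02, 30.25, 15.50, 60.54, 20.00]},
                   index=[103773739, 103773708, 103773730, 103773718, 105496332])

# Rename df1's column so the names don't collide, left-join onto df2
# (which contains every index), fill the one missing value, and sort.
result = df2.join(df1.rename(columns={0: 1})).fillna(0).sort_index()
print(result)  # matches df3 above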

Related

Stick the columns based on one column, keeping ids

I have a DataFrame with 100 columns (only three are shown here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd

df = pd.DataFrame()
df['id'] = [1, 2, 3]
df['c1'] = [1, 5, 1]
df['c2'] = [-1, 6, 5]
df
I want to stack the values of all the columns for each id and put them in one column. For example, for id=1 I want to stack 1 and -1 into one column. Here is the DataFrame that I want.
Note: df.melt alone does not solve my question, since I want to keep the ids as well.
Note2: I already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id')
         .stack()
         .droplevel(1)
         .reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
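As a side note, melt can get there too if you pass id_vars; a small sketch:

out = (df.melt(id_vars='id', value_name='c')   # id_vars keeps the id column
         .drop(columns='variable')             # drop the c1/c2 labels
         .sort_values('id', kind='mergesort')  # stable sort keeps c1 before c2 per id
         .reset_index(drop=True))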

Merge dataframes based on column values with duplicated rows

I want to merge two dataframes based on equal column values. The problem is that one of my columns has duplicated values, which cannot be dropped since they are correlated with other columns. Here's an example of my two dataframes:
Essentially, I want to merge these two dataframes on equal values of the FromPatchID (df1) and Id (df2) columns, in order to get something like this:
FromPatchID  ToPatchID  ...  Id    MMM     LB
          1          1  ...   1  26.67  27.67
          1          2  ...   1  26.67  27.67
          1          3  ...   1  26.67  27.67
          2          1  ...   2  26.50  27.50
          3          1  ...   3  26.63  27.63
I already tried a simple merge with df_merged = pd.merge(df1, df2, on=['FromPatchID', 'Id']), but I got a KeyError telling me to check for duplicates in the FromPatchID column.
You have to specify the different column names to match on with left_on and right_on. Also specify how='right' to use only keys from the right frame.
df_merged = pd.merge(df1, df2, left_on='FromPatchID', right_on='Id', how='right')
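A runnable sketch of that answer; df1 and df2 below are hypothetical reconstructions from the expected output above:

import pandas as pd

# Hypothetical frames shaped like the question's data.
df1 = pd.DataFrame({'FromPatchID': [1, 1, 1, 2, 3],
                    'ToPatchID':   [1, 2, 3, 1, 1]})
df2 = pd.DataFrame({'Id':  [1, 2, 3],
                    'MMM': [26.67, 26.50, 26.63],
                    'LB':  [27.67, 27.50, 27.63]})

# Match FromPatchID against Id; how='right' keeps every key from df2,
# and the duplicated FromPatchID values simply fan out over the matches.
df_merged = pd.merge(df1, df2, left_on='FromPatchID', right_on='Id', how='right')
print(df_merged)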

Pandas create column based on index condition [duplicate]

This question already has an answer here:
Am I doing a df.Merge wrong?
(1 answer)
Closed 1 year ago.
I have two different dataframe:
df1 (2003 rows × 1 column; the index is RecordDate)
df2 (927 rows × 1 column; the index is RecordDate)
I'd like to create a new column in df1 with this condition: if df1's RecordDate matches df2's RecordDate, take MoneyDeposited's value for that row; otherwise set the value to zero.
df1['MoneyDeposited'] = df2['MoneyDeposited']
I basically can't do this because df1 is on a daily basis, whereas df2 only contains the days on which investors deposited money; df1 has 2003 rows and df2 has 927.
Desired dataframe:
RecordDate  ActiveAccounts  MoneyDeposited
2013-07-05               1         9000.00
2013-07-06               1               0
...
2013-11-06             500         6190.00
2013-11-07             500               0
A left merge keeps every daily row of df1, and fillna(0) puts zero on the days with no matching deposit:
pd.merge(left=df1, right=df2, how='left', left_on='RecordDate', right_on='RecordDate').fillna(0)
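A minimal sketch of that, assuming RecordDate is the (named) index of both frames and using the column names from the desired output:

import pandas as pd

# Hypothetical data shaped like the question's frames.
df1 = pd.DataFrame({'ActiveAccounts': [1, 1]},
                   index=pd.to_datetime(['2013-07-05', '2013-07-06']))
df2 = pd.DataFrame({'MoneyDeposited': [9000.00]},
                   index=pd.to_datetime(['2013-07-05']))
df1.index.name = df2.index.name = 'RecordDate'

# Left-join on the shared RecordDate index; days with no deposit get 0.
out = df1.join(df2, how='left').fillna({'MoneyDeposited': 0})
print(out)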

Convert two dataframes to numpy arrays for pairwise comparison [duplicate]

This question already has answers here:
set difference for pandas
(12 answers)
Closed 2 years ago.
I have two incredibly large dataframes, df1 and df2. Their sizes are below:
print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062
I know that each value of df2 appears in df1, and what I am looking to do is build a third dataframe that is the difference of the two, meaning all of the rows that appear in df1 but do not appear in df2.
I have tried using the below method from this question:
df3 = (pd.merge(df2,df1, indicator=True, how='outer')
.query('_merge=="left_only"').drop('_merge', axis=1))
But I am continually getting MemoryError failures from this approach.
Thus, I am now trying to do the following:
Loop through each row of df1
See if that row appears in df2
If it does, skip
If not, add it to a list
What I am concerned about is that the rows are equal element-wise, meaning all of the elements match pairwise; for example
[1,2,3]
[1,2,3]
is a match, while:
[1,2,3]
[1,3,2]
is not a match
I am now trying:
for i in notebook.tqdm(range(svm_data.shape[0])):
    real_row = np.asarray(real_data.iloc[[i]].to_numpy())
    synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
    if np.array_equal(real_row, synthetic_row):
        continue
    else:
        list_of_rows.append(list(synthetic_row))
gc.collect()
But for some reason, this is not finding the values in the rows themselves, so I am clearly still doing something wrong.
Note, I also tried:
df3 = df1[~df1.isin(df2)].dropna(how='all')
but that yielded incorrect results.
How can I (in a memory-efficient way) find all of the rows in one of my dataframes that do not appear in the other?
DATA
df1:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
df2:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4
Let's try concat and groupby to identify duplicate rows:
# sample data
df1 = pd.DataFrame([[1,2,3],[1,2,3],[4,5,6],[7,8,9]])
df2 = pd.DataFrame([[4,5,6],[7,8,9]])
s = (pd.concat((df1, df2), keys=(1, 2))
       .groupby(list(df1.columns))
       .ngroup())
# `s.loc[1]` corresponds to rows in df1
# `s.loc[2]` corresponds to rows in df2
df1_in_df2 = s.loc[1].isin(s.loc[2])
df1[df1_in_df2]
Output:
0 1 2
2 4 5 6
3 7 8 9
Update: another option is to merge against df2 with its duplicates dropped:
df1.merge(df2.drop_duplicates(), on=list(df1.columns), indicator=True, how='left')
Output (you should be able to guess which rows you need from there):
0 1 2 _merge
0 1 2 3 left_only
1 1 2 3 left_only
2 4 5 6 both
3 7 8 9 both
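Putting the update together, a short sketch of the full set difference (rows of df1 absent from df2) on the same sample data:

import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

# Left-merge against the de-duplicated df2, then keep only unmatched rows.
merged = df1.merge(df2.drop_duplicates(), on=list(df1.columns),
                   indicator=True, how='left')
df3 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(df3)  # the two [1, 2, 3] rows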

join two pandas dataframe using a specific column

I am new to pandas and I am trying to join two dataframes based on the equality of one specific column. For example, suppose I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns and the value of only one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner')
If you wish to keep column A as a regular column rather than the index, then:
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner').reset_index()
Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
The suffixes then let you keep track of which dataframe each value came from.
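For completeness, the merge variant run on the sample frames from the question:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 2], 'C': [3, 2]})
df2 = pd.DataFrame({'A': [5, 2], 'B': [6, 8], 'C': [7, 9]})

# Inner merge keeps only rows whose 'A' appears in both frames;
# the suffixes mark which frame each B/C column came from.
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
print(df3)  # A  B_1  C_1  B_2  C_2  ->  2  2  2  8  9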
