Convert two dataframes to numpy arrays for pairwise comparison [duplicate] - python

This question already has answers here:
set difference for pandas
(12 answers)
Closed 2 years ago.
I have two incredibly large dataframes, df1 and df2. Their sizes are below:
print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062
I know that every row of df2 appears in df1. What I am looking to do is build a third dataframe that is the set difference of the two: all of the rows that appear in df1 but do not appear in df2.
I have tried using the below method from this question:
df3 = (pd.merge(df2, df1, indicator=True, how='outer')
         .query('_merge == "left_only"')
         .drop('_merge', axis=1))
But I keep getting MemoryError failures with this approach.
Thus, I am now trying to do the following:
Loop through each row of df1
Check whether that row appears in df2
If it does, skip it
If not, add it to a list
What I care about, row-wise, is element-wise equality: all of the elements must match pairwise. For example,
[1,2,3]
[1,2,3]
is a match, while:
[1,2,3]
[1,3,2]
is not a match.
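For instance, NumPy's array_equal captures exactly this element-wise check (a minimal illustration):
import numpy as np
np.array_equal([1, 2, 3], [1, 2, 3])  # True: same length, all elements match in order
np.array_equal([1, 2, 3], [1, 3, 2])  # False: same elements, different order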
I am now trying:
for i in notebook.tqdm(range(svm_data.shape[0])):
    real_row = np.asarray(real_data.iloc[[i]].to_numpy())
    synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
    if np.array_equal(real_row, synthetic_row):
        continue
    else:
        list_of_rows.append(list(synthetic_row))
    gc.collect()
But for some reason this is not finding the matching rows, so I am clearly still doing something wrong.
Note, I also tried:
df3 = df1[~df1.isin(df2)].dropna(how='all')
but that yielded incorrect results.
How can I (in a memory-efficient way) find all of the rows that appear in one of my dataframes but not in the other?
DATA
df1:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
df2:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4

Let's try concat and groupby to identify duplicate rows:
# sample data
df1 = pd.DataFrame([[1,2,3],[1,2,3],[4,5,6],[7,8,9]])
df2 = pd.DataFrame([[4,5,6],[7,8,9]])
s = (pd.concat((df1, df2), keys=(1, 2))
       .groupby(list(df1.columns))
       .ngroup())
# `s.loc[1]` corresponds to rows in df1
# `s.loc[2]` corresponds to rows in df2
df1_in_df2 = s.loc[1].isin(s.loc[2])
df1[df1_in_df2]
Output:
0 1 2
2 4 5 6
3 7 8 9
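Since the question asks for the rows of df1 that do not appear in df2, simply invert the mask:
df1[~df1_in_df2]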
Update: another option is to merge on the deduplicated df2:
df1.merge(df2.drop_duplicates(), on=list(df1.columns), indicator=True, how='left')
Output (you should be able to guess which rows you need from there):
0 1 2 _merge
0 1 2 3 left_only
1 1 2 3 left_only
2 4 5 6 both
3 7 8 9 both
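For instance, to keep only the rows missing from df2 (a small follow-up sketch, assuming the merge result is stored in merged):
merged = df1.merge(df2.drop_duplicates(), on=list(df1.columns),
                   indicator=True, how='left')
df3 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')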

Related

pandas : pd.concat results in duplicated columns

I have a number of large dataframes in a list. I concatenate all of them to produce a single large dataframe.
df_list # This contains a list of dataframes
result = pd.concat(df_list, axis=0)
result.columns.duplicated().any() # This returns True
My expectation was that pd.concat would not produce duplicate columns.
I want to understand when it could result in duplicate columns so that I can debug the source.
I could not reproduce the problem with a toy dataset.
I have verified that the input data frames have unique columns by running df.columns.duplicated().any().
The pandas version used is 1.0.1.
(Pdb) p result_data[0].columns.duplicated().any()
False
(Pdb) p result_data[1].columns.duplicated().any()
False
(Pdb) p result_data[2].columns.duplicated().any()
False
(Pdb) p result_data[3].columns.duplicated().any()
False
(Pdb) p pd.concat(result_data[0:4]).columns.duplicated().any()
True
Check the below behaviour:
In [452]: df1 = pd.DataFrame({'A':[1,2,3], 'B':[2,3,4]})
In [468]: df2 = pd.DataFrame({'A':[1,2,3], 'B':[2,4,5]})
In [460]: df_list = [df1,df2]
This concats and keeps duplicate columns:
In [463]: pd.concat(df_list, axis=1)
Out[474]:
A B A B
0 1 2 1 2
1 2 3 2 4
2 3 4 3 5
pd.concat always concatenates the dataframes as is; it does not drop duplicate columns at all.
If you concatenate without specifying an axis (the default is axis=0), it will append one dataframe below the other along the same columns.
So you can end up with duplicate rows now, but not duplicate columns.
In [477]: pd.concat(df_list)
Out[477]:
A B
0 1 2 ## duplicate row
1 2 3
2 3 4
0 1 2 ## duplicate row
1 2 4
2 3 5
You can remove these duplicate rows by using drop_duplicates():
In [478]: pd.concat(df_list).drop_duplicates()
Out[478]:
A B
0 1 2
1 2 3
2 3 4
1 2 4
2 3 5
Update after OP's comment:
In [507]: df_list[0].columns.duplicated().any()
Out[507]: False
In [508]: df_list[1].columns.duplicated().any()
Out[508]: False
In [510]: pd.concat(df_list[0:2]).columns.duplicated().any()
Out[510]: False
I have the same issue when I get data from IEXCloud. I used IEXFinance functions to grab different data sets, which are all supposed to return dataframes, and then used concat to join them. It looks to have repeated the first column (symbols) into column 97; the data in columns 96 and 98 were from the second dataframe. There are no duplicate columns in df1 or df2, and I can't see any logical reason for duplicating it there. df2 has 70 columns. I suspect some of what was returned as a 'dataframe' is something else, but this doesn't explain the seemingly random position where the concat function chooses to duplicate the first column of the first df!
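A minimal diagnostic sketch for narrowing this down (assuming the concatenated frame is result and the inputs are in df_list): list which labels are duplicated, then check which inputs contain them.
dupes = result.columns[result.columns.duplicated()].unique()
print(dupes)  # the column labels that occur more than once after concat
for i, df in enumerate(df_list):
    print(i, [c for c in dupes if c in df.columns])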

Joining two dataframes that have overlapping columns, without specifying the overlapping columns, but overwriting one of them (in Python)

I'm struggling with the following problem:
I have two dataframes:
df1
A B C
1 5 8
2 1 2
3 2 1
4 3 6
and df2 with same column names, but not as much columns as df1:
A B
1 1
8 2
1 5
6 3
df1 and df2 always have the same number of rows; only the number of columns of df2 is less than or equal to that of df1. Also, the column names are the same, but not necessarily the values in the columns (they can be the same, but this is definitely not always the case).
Now, I want to create a new dataframe, where the overlapping columns between df1 and df2 (column A and B, NOT C) are determined by df2, but has the same shape as df1 (so df1 is dominating in amount of columns, but df2 is dominating in which value to take from the overlapping columns). Important to know is that I don't want to specify which columns are overlapping.
So the result should give:
df3:
A B C
1 1 8
8 2 2
1 5 1
6 3 6
Is this possible, especially with the difficulty of not specifying the overlapping columns upfront? Does anyone have a clever solution? It does not seem possible with any of the variations of merge and join.
As long as there are no column labels in df2 not present in df1, you can use
df3 = df1.copy()
df3.loc[:,df2.columns] = df2
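A self-contained sketch with the sample data from the question (note that the column assignment aligns on the row index, so this assumes df1 and df2 share the same index):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 1, 2, 3], 'C': [8, 2, 1, 6]})
df2 = pd.DataFrame({'A': [1, 8, 1, 6], 'B': [1, 2, 5, 3]})

df3 = df1.copy()
df3.loc[:, df2.columns] = df2  # overwrite only the columns that also exist in df2
print(df3)
#    A  B  C
# 0  1  1  8
# 1  8  2  2
# 2  1  5  1
# 3  6  3  6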

Will passing ignore_index=True to pd.concat preserve index succession within dataframes that I'm concatenating?

I have two dataframes:
df1 =
value
0 a
1 b
2 c
df2 =
value
0 d
1 e
I need to concatenate them across index, but I have to preserve the index of the first dataframe and continue it in the second dataframe, like this:
result =
value
0 a
1 b
2 c
3 d
4 e
My guess is that pd.concat([df1, df2], ignore_index=True) will do the job. However, I'm worried that for large dataframes the order of the rows may be changed and I'll end up with something like this (first two rows changed indices):
result =
value
0 b
1 a
2 c
3 d
4 e
So my question is: does pd.concat with ignore_index=True preserve the row order within the dataframes being concatenated, or is there randomness in the index assignment?
In my experience, pd.concat concatenates the rows in the order the DataFrames are passed to it.
If you want to be safe, specify sort=False which will also avoid sorting on columns:
pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
value
0 a
1 b
2 c
3 d
4 e
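A minimal sketch to check this behaviour on the question's data:
import pandas as pd

df1 = pd.DataFrame({'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'value': ['d', 'e']})

result = pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
# ignore_index only relabels the index 0..n-1; it does not reorder rows
assert result['value'].tolist() == ['a', 'b', 'c', 'd', 'e']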

join two pandas dataframe using a specific column

I am new to pandas and I am trying to join two dataframes based on the equality of one specific column. For example, suppose that I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns and the value of only one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner')
If you wish to maintain column A as a non-index, then:
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner').reset_index()
Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
And then you can keep track of each value's origin.
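A short sketch of the merge option on the question's data; the suffixes mark which frame each column came from:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 2], 'C': [3, 2]})
df2 = pd.DataFrame({'A': [5, 2], 'B': [6, 8], 'C': [7, 9]})

df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
print(df3)
#    A  B_1  C_1  B_2  C_2
# 0  2    2    2    8    9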

Combining DataFrames without Nans

I have two dataframes. One maps values to IDs; the other has multiple entries of these IDs. I want a dataframe where the values from the first are attached to the matching IDs in the second.
df1 =
Val1 Val2 Val3
x 1000 2 0
y 2000 3 9
z 3000 1 8
df2=
foo ID bar
0 something y a
1 nothing y b
2 everything x c
3 who z d
result=
foo ID bar Val1 Val2 Val3
0 something y a 2000 3 9
1 nothing y b 2000 3 9
2 everything x c 1000 2 0
3 who z d 3000 1 8
I've tried merge and join (obviously incorrectly) but I am getting a bunch of NaNs when I do that. It appears that I am getting NaNs on every alternate ID.
I have also tried indexing both DFs by ID but that didn't seem to help either. I am obviously missing something that I am guessing is a core functionality but I can't get my head around it.
merge and join could both get you the result DataFrame you want. Since one of your DataFrames is indexed (by ID) and the other has just an integer index, merge is the logical choice.
Merge:
# use ID as the column to join on in df2 and the index of df1
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
Join:
df2.set_index("ID", inplace=True) # index df2 in place so you can use join, which merges by index by default
result = df2.join(df1, how="inner") # join df1 by index
