pandas: pd.concat results in duplicated columns

I have a number of large dataframes in a list. I concatenate all of them to produce a single large dataframe.
df_list # This contains a list of dataframes
result = pd.concat(df_list, axis=0)
result.columns.duplicated().any() # This returns True
My expectation was that pd.concat would not produce duplicate columns.
I want to understand when it could result in duplicate columns so that I can debug the source.
I could not reproduce the problem with a toy dataset.
I have verified that the input data frames have unique columns by running df.columns.duplicated().any().
The pandas version used is 1.0.1.
(Pdb) p result_data[0].columns.duplicated().any()
False
(Pdb) p result_data[1].columns.duplicated().any()
False
(Pdb) p result_data[2].columns.duplicated().any()
False
(Pdb) p result_data[3].columns.duplicated().any()
False
(Pdb) p pd.concat(result_data[0:4]).columns.duplicated().any()
True
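Since the problem doesn't reproduce on a toy dataset, one way to narrow down the source is to concatenate the frames one at a time and report the first frame whose addition introduces a duplicated label. A minimal debugging sketch (the helper name is hypothetical; result_data stands in for the list of frames):
import pandas as pd

def first_offending_frame(df_list):
    # Concatenate incrementally; stop at the first frame whose addition
    # introduces a duplicated column label in the running result.
    acc = df_list[0]
    for i, df in enumerate(df_list[1:], start=1):
        acc = pd.concat([acc, df], axis=0)
        dups = acc.columns[acc.columns.duplicated()]
        if len(dups):
            return i, list(dups)
    return None

print(first_offending_frame(result_data))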

Check the behaviour below:
In [452]: df1 = pd.DataFrame({'A':[1,2,3], 'B':[2,3,4]})
In [468]: df2 = pd.DataFrame({'A':[1,2,3], 'B':[2,4,5]})
In [460]: df_list = [df1,df2]
This concats and keeps duplicate columns:
In [463]: pd.concat(df_list, axis=1)
Out[463]:
   A  B  A  B
0  1  2  1  2
1  2  3  2  4
2  3  4  3  5
pd.concat always concatenates the dataframes as-is; it never drops duplicate columns. If you concatenate without specifying an axis (the default is axis=0), it appends one dataframe below the other, aligned on the same columns.
So you can end up with duplicate rows, but not duplicate columns.
In [477]: pd.concat(df_list)
Out[477]:
   A  B
0  1  2  ## duplicate row
1  2  3
2  3  4
0  1  2  ## duplicate row
1  2  4
2  3  5
You can remove these duplicate rows by using drop_duplicates():
In [478]: pd.concat(df_list).drop_duplicates()
Out[478]:
   A  B
0  1  2
1  2  3
2  3  4
1  2  4
2  3  5
Update after OP's comment:
In [507]: df_list[0].columns.duplicated().any()
Out[507]: False
In [508]: df_list[1].columns.duplicated().any()
Out[508]: False
In [510]: pd.concat(df_list[0:2]).columns.duplicated().any()
Out[510]: False

I have the same issue when I get data from IEXCloud. I used IEXFinance functions to grab different data sets, which are all supposed to return dataframes, and then used concat to join the dataframes. It looks to have repeated the first column (symbols) into column 97; the data in columns 96 and 98 were from the second dataframe. There are no duplicate columns in df1 or df2, and I can't see any logical reason for duplicating it there. df2 has 70 columns. I suspect some of what was returned as a 'dataframe' is something else, but this doesn't explain the seemingly random position at which the concat function chooses to duplicate the first column of the first df!
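If the suspicion is that some items returned by the API are not actually DataFrames, a quick sanity check before concatenating can rule that out (a hedged sketch; df_list stands in for the list of IEXCloud results):
import pandas as pd

for i, obj in enumerate(df_list):
    # Confirm each item really is a DataFrame with unique column labels
    assert isinstance(obj, pd.DataFrame), f"item {i} is {type(obj)}, not a DataFrame"
    assert not obj.columns.duplicated().any(), f"item {i} has duplicate columns"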

Related

How to compare two dataframes and filter rows and columns where a difference is found

I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which is the same shape as df1 and df2 and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there was at least one differing value.
If I simply do
df1 = df1[df_diff.values]
I get all the rows where there was at least one True in df_diff, but lots of the columns originally contained only False.
As a second step, I would then like to be able to replace all the values (element-wise in the dataframe) which were equal (where df_diff==False) with NaNs.
Example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
to
     1    2
0    2  NaN
1  NaN    6
I think you need DataFrame.any to check for at least one True per row:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both rows and columns:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs with where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
     1    2
0  2.0  NaN
1  NaN  6.0
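For what it's worth, newer pandas (1.1+) ships DataFrame.compare, which produces a similar view of only the differing cells; a short sketch of the equivalent call:
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame(data=[[1, 99, 3], [4, 5, 99], [7, 8, 9]])

# Keeps only the rows/columns with at least one difference and shows
# the values from both frames side by side ('self' vs 'other').
print(df1.compare(df2))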

Python adding column to dataframe causes NaN

I have a series and a df:
s = pd.Series([1,2,3,5])
df = pd.DataFrame()
When I add columns to df like this
df.loc[:, "0-2"] = s.iloc[0:3]
df.loc[:, "1-3"] = s.iloc[1:4]
I get df
   0-2  1-3
0    1  NaN
1    2  2.0
2    3  3.0
Why am I getting NaN? I tried creating a new series with the correct indices, but adding it to df still causes NaN.
What I want is
   0-2  1-3
0    1    2
1    2    3
2    3    5
Try either of the following lines.
df.loc[:, "1-3"] = s.iloc[1:4].values
# -OR-
df.loc[:, "1-3"] = s.iloc[1:4].reset_index(drop=True)
Your original code is trying, unsuccessfully, to match the index of the data frame df to the index of the subset series s.iloc[1:4]. When it can't find the 0 index in the series, it places a NaN value in df at that location. You can get around this either by keeping only the values, so it doesn't try to match on the index, or by resetting the index on the subset series.
>>> s.iloc[1:4]
1 2
2 3
3 5
dtype: int64
Notice the index values since the original, unsubset series is the following.
>>> s
0 1
1 2
2 3
3 5
dtype: int64
The index of the first row in df is 0. By dropping the indices with the values call, you bypass the index matching which is producing the NaN. By resetting the index in the second option, you make the indices the same.
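As a sketch of the same idea, the whole frame can also be built in one step by resetting each slice's index up front so the rows align positionally:
import pandas as pd

s = pd.Series([1, 2, 3, 5])

# Reset each slice's index to 0..2 so pandas aligns the rows as intended
df = pd.DataFrame({
    "0-2": s.iloc[0:3].reset_index(drop=True),
    "1-3": s.iloc[1:4].reset_index(drop=True),
})
print(df)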

Pandas (Python) - Update column of a dataframe from another one with conditions and different columns

I had a problem and found a solution, but I feel it's the wrong way to do it. Maybe there is a more 'canonical' way.
I already had an answer for a really similar problem, but here the dataframes do not have the same number of rows. Sorry for the "double post", but the first one is still valid, so I think it's better to make a new one.
Problem
I have two dataframes that I would like to merge without adding extra columns and without erasing existing info. Example:
Existing dataframe (df)
   A  A2  B
0  1   4  0
1  2   5  1
2  2   5  1
Dataframe to merge (df2)
   A  A2  B
0  1   4  2
1  3   5  2
I would like to update df with df2 where columns 'A' and 'A2' correspond.
The result would be:
   A  A2  B
0  1   4  2   <= updated value ONLY
1  2   5  1
2  2   5  1
Here is my solution, but I think it's not a really good one.
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'].fillna(0, inplace=True)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I tried this solution:
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I get this error because of the mismatched number of rows:
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone have a better way to do this?
Thanks!
I think you can use DataFrame.isin for check where are same rows in both DataFrames. Then create NaN by mask, which is filled by combine_first. Last cast to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
   A  A2  B
0  1   4  2
1  2   5  1
2  2   5  1
With a minor tweak in the way the boolean mask gets created, you can get it to work. Note this compares rows positionally, so it assumes the first len(df2) rows of df line up with the rows of df2:
cols = ['A', 'A2']
# Slice df to match the shape of df2 so they can be compared elementwise
rows = (df[cols].values[:df2.shape[0]] == df2[cols].values).all(1)
df.loc[rows,'B'] = df2.loc[rows,'B']
df
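Another possible route, sketched here as an alternative rather than as either answer above: index both frames by the key columns and let DataFrame.update do the aligned overwrite. update works in place and may upcast to float, hence the cast back to int:
import pandas as pd

df = pd.DataFrame([[1, 4, 0], [2, 5, 1], [2, 5, 1]], columns=['A', 'A2', 'B'])
df2 = pd.DataFrame([[1, 4, 2], [3, 5, 2]], columns=['A', 'A2', 'B'])

keyed = df.set_index(['A', 'A2'])
keyed.update(df2.set_index(['A', 'A2']))  # overwrite B where (A, A2) match
df = keyed.reset_index().astype({'B': int})
print(df)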

Inner Join list of DataFrames on Row Values

I have a list of dataframes in python pandas that have the same row names and row values. What I would like to do is produce one dataframe with them inner-joined on the row values. I have looked online and found the merge function, but this isn't working because my rows aren't a column. Does anyone know the best way to do this? Is the solution to take the row values and turn them into a column, and if so, how do you do that? Thanks for the help.
input:
        "happy"
userid
1             2
2             8
3             9
        "sad"
userid
1           9
2          12
3          11
output:
        "sad"  "happy"
userid
1           9        2
2          12        8
3          11        9
It looks like your DataFrames have indices, in which case your merge() should indicate that's how it wants to proceed:
In [51]: df1
Out[51]:
        "happy"
userid
1             2
2             8
3             9
In [52]: df2
Out[52]:
        "sad"
userid
1           9
2          12
3          11
In [53]: pd.merge(df2, df1, left_index=True, right_index=True)
Out[53]:
        "sad"  "happy"
userid
1           9        2
2          12        8
3          11        9
And if you want to run this over a list of DataFrames, just reduce() them (in Python 3, reduce lives in functools):
from functools import reduce
reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), list_of_dfs)
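Alternatively (not from the original answer, but it avoids the pairwise merges), pd.concat can inner-join an entire list on the index in a single call:
import pandas as pd

# join='inner' keeps only the index values present in every frame
merged = pd.concat(list_of_dfs, axis=1, join='inner')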
Transposing swaps the columns and rows of the DataFrame. If dfs is your list of DataFrames, then:
dfs = [df.T for df in dfs]
will make dfs a list of transposed DataFrames.
Then to merge:
merged = dfs[0]
for df in dfs[1:]:
    merged = pd.merge(merged, df, how='inner')
By default pd.merge merges DataFrames based on all columns shared in common.
Note that transposing requires copying all the data in the original DataFrame into a new DataFrame. It would be more efficient to build the DataFrame in the correct (transposed) format from the beginning (if possible), rather than fixing it later by transposing.

Compare pandas dataframes by multiple columns

What is the best way to figure out how two dataframes differ based on a combination of multiple columns? So if I have the following:
df1:
   A  B  C
0  1  2  3
1  3  4  2
df2:
   A  B  C
0  1  2  3
1  3  5  2
I want to show all rows where there is a difference, such as (3,4,2) vs. (3,5,2) in the example above. I've tried using pd.merge(), thinking that if I used all columns as the key with an outer join I would end up with a dataframe that would help me get what I want, but it doesn't turn out that way.
Thanks to EdChum I was able to use a mask from a boolean diff as below, but first I had to make sure the indexes were comparable.
df1 = df1.set_index('A')
df2 = df2.set_index('A') # this gave me a nice index using one of the keys
# if there are differing rows, I would get nulls
df1 = df1.reindex_like(df2)
df1[~(df1==df2).all(axis=1)] # this gave me all the rows that differed
We can use .all and pass axis=1 to perform row-wise comparisons; we can then show the rows that differ by negating the resulting boolean index with ~:
In [43]:
df[~(df==df1).all(axis=1)]
Out[43]:
   A  B  C
1  3  4  2
breaking this down:
In [44]:
df==df1
Out[44]:
      A      B     C
0  True   True  True
1  True  False  True
In [45]:
(df==df1).all(axis=1)
Out[45]:
0 True
1 False
dtype: bool
We can then pass the above as a boolean index to df and invert it using ~.
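A hedged sketch of the merge-based route the question attempted: an outer merge with indicator=True tags each row by origin, so the rows that don't appear in both frames can be filtered out:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'C': [3, 2]})
df2 = pd.DataFrame({'A': [1, 3], 'B': [2, 5], 'C': [3, 2]})

merged = df1.merge(df2, how='outer', indicator=True)
# 'left_only'/'right_only' rows exist in only one of the frames
print(merged[merged['_merge'] != 'both'])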
