Inner Join list of DataFrames on Row Values - python

I have a list of pandas DataFrames that share the same row labels (the same index name and index values). I would like to produce a single DataFrame by inner-joining them on those row values. I looked online and found the merge function, but it isn't working because my row labels are not a column. Does anyone know the best way to do this? Is the solution to turn the row values into a column, and if so, how do you do that? Thanks for the help.
input:
"happy"
userid
1 2
2 8
3 9
"sad"
userid
1 9
2 12
3 11
output:
"sad" "happy"
userid
1 9 2
2 12 8
3 11 9

It looks like your DataFrames have indices, in which case you should tell merge() to join on the index:
In [51]: df1
Out[51]:
"happy"
userid
1 2
2 8
3 9
In [52]: df2
Out[52]:
"sad"
userid
1 9
2 12
3 11
In [53]: pd.merge(df2, df1, left_index=True, right_index=True)
Out[53]:
"sad" "happy"
userid
1 9 2
2 12 8
3 11 9
And if you want to run this over a list of DataFrames, just reduce() them (in Python 3, reduce has to be imported from functools):
from functools import reduce
reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), list_of_dfs)
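As a minimal, self-contained sketch of the reduce() approach: the frames below are made up to mirror the question, and the third frame (calm) is a hypothetical addition that lacks one userid, to show that the default inner join drops rows missing from any frame.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames mirroring the question: one column each, indexed by userid.
happy = pd.DataFrame({"happy": [2, 8, 9]}, index=pd.Index([1, 2, 3], name="userid"))
sad = pd.DataFrame({"sad": [9, 12, 11]}, index=pd.Index([1, 2, 3], name="userid"))
# Hypothetical third frame with userid 2 missing.
calm = pd.DataFrame({"calm": [5, 7]}, index=pd.Index([1, 3], name="userid"))

list_of_dfs = [sad, happy, calm]

# Fold the list pairwise; pd.merge defaults to how="inner".
merged = reduce(
    lambda left, right: pd.merge(left, right, left_index=True, right_index=True),
    list_of_dfs,
)
print(merged)
```

Only userid 1 and 3 survive, because userid 2 is absent from the last frame.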

Transposing swaps the columns and rows of the DataFrame. If dfs is your list of DataFrames, then:
dfs = [df.T for df in dfs]
will make dfs a list of transposed DataFrames.
Then to merge:
merged = dfs[0]
for df in dfs[1:]:
    merged = pd.merge(merged, df, how='inner')
By default, pd.merge joins on all columns that the two DataFrames have in common.
Note that transposing requires copying all the data in the original DataFrame into a new DataFrame. It would be more efficient to build the DataFrame in the correct (transposed) format from the beginning (if possible), rather than fixing it later by transposing.

Related

Concat values from same column name in "one" dataframe

I want to concatenate the values of columns that share the same name.
I've found solutions that work across different dataframes, but not within a single dataframe.
I also tried splitting the columns into separate dataframes and then concatenating them, but that doesn't seem to work because the column names are shown differently (for example, "apple", "banana", "pizza", "apple.1", "banana.1", ...).
Is there a way to get output like this? Thanks!
You can use melt to flatten your dataframe, then pivot it back into its original layout:
df.columns = df.columns.str.rsplit('.').str[0]
out = df.melt().assign(index=lambda x: x.groupby('variable').cumcount()) \
.pivot_table('value', 'index', 'variable', fill_value=0) \
.rename_axis(index=None, columns=None)[df.columns.unique()]
print(out)
# Output
apple banana pizza
0 1 4 4
1 2 3 7
2 3 2 3
3 5 0 1
4 8 0 5
5 9 0 34
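A variant of the same melt-and-pivot idea, as a minimal sketch. The input frame is an assumption (the question's input is not shown), chosen so that the result matches the output above. Unlike the answer's code, it strips the suffix after melting rather than before, so the frame never carries duplicate column labels, and it uses pivot instead of pivot_table.

```python
import pandas as pd

# Hypothetical input: repeated names appear with ".1" suffixes, as described in the question.
df = pd.DataFrame(
    {
        "apple": [1, 2, 3],
        "banana": [4, 3, 2],
        "pizza": [4, 7, 3],
        "apple.1": [5, 8, 9],
        "pizza.1": [1, 5, 34],
    }
)

# Long format: one row per (original column name, value) pair.
long_df = df.melt()
# Drop the ".1"-style suffix so repeated columns share one name.
long_df["variable"] = long_df["variable"].str.split(".").str[0]
# Number the values within each name; this becomes the new row position.
long_df["index"] = long_df.groupby("variable").cumcount()

out = (
    long_df.pivot(index="index", columns="variable", values="value")
    .fillna(0)      # names with fewer values are padded with 0
    .astype(int)
    .rename_axis(index=None, columns=None)
)
print(out)
```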

Using the items of a df as a header of a different dataframe

I have 2 dataframes
df1=
0 2
1 _A1-Site_0_norm _A1-Site_1_norm
and df2=
0 2
2 0.500000 0.012903
3 0.010870 0.013793
4 0.011494 0.016260
I want to use df1 as the header of df2, so that the row in df1 becomes either the column header or the first row:
1 _A1-Site_0_norm _A1-Site_1_norm
2 0.500000 0.012903
3 0.010870 0.013793
4 0.011494 0.016260
I have many columns, so it is not practical to write them out by hand like this:
df2.columns=["_A1-Site_0_norm", "_A1-Site_1_norm"]
I thought of building a list of all the items in df1 and assigning it to df2.columns, but I am having trouble converting the elements in row 1 of df1 into the items of a list.
I am not married to that approach; any alternative is welcome.
Many thanks
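If I understood your question correctly, this example should work for you:
import pandas as pd
d = {'A': [1], 'B': [2], 'C': [3]}
df = pd.DataFrame(data=d)
d2 = {'1': ['D'], '2': ['E'], '3': ['F']}
df2 = pd.DataFrame(data=d2)
# df2.values.tolist() is nested ([['D', 'E', 'F']]); take the first row to get a flat list of labels
df.columns = df2.iloc[0].tolist()  # this is what you need to implement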
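Applied to the frames from the question, both options the asker mentions might look like the sketch below. The constructors are a reconstruction of the printed frames (an assumption about how they were built), and the names stacked and renamed are hypothetical.

```python
import pandas as pd

# Reconstruction of the question's frames (assumed): columns are labelled 0 and 2.
df1 = pd.DataFrame({0: ["_A1-Site_0_norm"], 2: ["_A1-Site_1_norm"]}, index=[1])
df2 = pd.DataFrame(
    {0: [0.500000, 0.010870, 0.011494], 2: [0.012903, 0.013793, 0.016260]},
    index=[2, 3, 4],
)

# Option 1: df1's values become the first row (columns turn into object dtype).
stacked = pd.concat([df1, df2])

# Option 2: df1's single row becomes the column header.
renamed = df2.copy()
renamed.columns = df1.iloc[0].tolist()

print(stacked)
print(renamed)
```

Option 2 keeps the numeric columns as floats, which is usually preferable if the values are used in calculations later.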

pandas : pd.concat results in duplicated columns

I have a number of large dataframes in a list. I concatenate all of them to produce a single large dataframe.
df_list # This contains a list of dataframes
result = pd.concat(df_list, axis=0)
result.columns.duplicated().any() # This returns True
My expectation was that pd.concat will not produce duplicate columns.
I want to understand when it could result in duplicate columns so that I can debug the source.
I could not reproduce the problem with a toy dataset.
I have verified that the input data frames have unique columns by running df.columns.duplicated().any().
The pandas version used is 1.0.1.
(Pdb) p result_data[0].columns.duplicated().any()
False
(Pdb) p result_data[1].columns.duplicated().any()
False
(Pdb) p result_data[2].columns.duplicated().any()
False
(Pdb) p result_data[3].columns.duplicated().any()
False
(Pdb) p pd.concat(result_data[0:4]).columns.duplicated().any()
True
Check the below behaviour:
In [452]: df1 = pd.DataFrame({'A':[1,2,3], 'B':[2,3,4]})
In [468]: df2 = pd.DataFrame({'A':[1,2,3], 'B':[2,4,5]})
In [460]: df_list = [df1,df2]
This concats and keeps duplicate columns:
In [463]: pd.concat(df_list, axis=1)
Out[474]:
A B A B
0 1 2 1 2
1 2 3 2 4
2 3 4 3 5
pd.concat always concatenates the dataframes as is. It does not drop duplicate columns at all.
If you concatenate without specifying axis (the default is axis=0), one dataframe is appended below the other, aligned on the same columns.
So you can now have duplicate rows, but not duplicate columns.
In [477]: pd.concat(df_list)
Out[477]:
A B
0 1 2 ## duplicate row
1 2 3
2 3 4
0 1 2 ## duplicate row
1 2 4
2 3 5
You can remove these duplicate rows by using drop_duplicates():
In [478]: pd.concat(df_list).drop_duplicates()
Out[478]:
A B
0 1 2
1 2 3
2 3 4
1 2 4
2 3 5
Update after OP's comment:
In [507]: df_list[0].columns.duplicated().any()
Out[507]: False
In [508]: df_list[1].columns.duplicated().any()
Out[508]: False
In [510]: pd.concat(df_list[0:2]).columns.duplicated().any()
Out[510]: False
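For debugging, it may help to list which labels are repeated rather than only checking any(). A minimal sketch using the two toy frames above; the names wide, tall and dup_labels are hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": [2, 4, 5]})

wide = pd.concat([df1, df2], axis=1)  # side by side: column labels repeat
tall = pd.concat([df1, df2], axis=0)  # stacked: columns are aligned by label

# Index.duplicated() marks every repeat of a label after its first occurrence.
dup_labels = wide.columns[wide.columns.duplicated()].unique().tolist()
print(dup_labels)
print(tall.columns.duplicated().any())
```

Running the same check on the real result shows exactly which labels are duplicated, which narrows down the input frames to inspect.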
I have the same issue when I get data from IEXCloud. I used IEXFinance functions to grab different data sets, which are all supposed to return dataframes, and then used concat to join them. It appears to have repeated the first column (symbols) as column 97. The data in columns 96 and 98 were from the second dataframe. There are no duplicate columns in df1 or df2, and I can't see any logical reason for the duplication there. df2 has 70 columns. I suspect some of what was returned as a 'dataframe' is something else, but this doesn't explain the seemingly random position at which concat chooses to duplicate the first column of the first df!

join two pandas dataframe using a specific column

I am new to pandas and I am trying to join two dataframes based on equality in one specific column. For example, suppose that I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns and the value of only one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks
pd.concat([df1.set_index('A'),df2.set_index('A')], axis=1, join='inner')
If you wish to maintain column A as a non-index, then:
pd.concat([df1.set_index('A'),df2.set_index('A')], axis=1, join='inner').reset_index()
Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
The suffixes then let you keep track of which dataframe each value came from.
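A runnable sketch of the merge variant on the question's data (the frame constructors are an assumption about how df1 and df2 were built):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": [3, 2]})
df2 = pd.DataFrame({"A": [5, 2], "B": [6, 8], "C": [7, 9]})

# Only A == 2 appears in both frames, so the inner join keeps a single row.
df3 = df1.merge(df2, on="A", how="inner", suffixes=("_1", "_2"))
print(df3)
```

The left frame's columns come first here; calling df2.merge(df1, ...) instead puts df2's values first, as in the desired output of the question.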

Combining DataFrames without Nans

I have two dataframes. One maps values to IDs. The other has multiple entries for these IDs. I want a dataframe in which the values from the first are assigned to the respective IDs in the second.
df1 =
Val1 Val2 Val3
x 1000 2 0
y 2000 3 9
z 3000 1 8
df2=
foo ID bar
0 something y a
1 nothing y b
2 everything x c
3 who z d
result=
foo ID bar Val1 Val2 Val3
0 something y a 2000 3 9
1 nothing y b 2000 3 9
2 everything x c 1000 2 0
3 who z d 3000 1 8
I've tried merge and join (obviously incorrectly) but I am getting a bunch of NaNs when I do that. It appears that I am getting NaNs on every alternate ID.
I have also tried indexing both DFs by ID but that didn't seem to help either. I am obviously missing something that I am guessing is a core functionality but I can't get my head around it.
merge and join can both give you the result DataFrame you want. Since one of your DataFrames is indexed by ID and the other has just an integer index, merge is the logical choice.
Merge:
# use ID as the column to join on in df2 and the index of df1
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
Join:
df2.set_index("ID", inplace=True) # index df2 in place so you can use join, which merges by index by default
result = df2.join(df1, how="inner") # join df1 by index
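The merge variant on the question's data, as a self-contained sketch (the constructors are an assumption about how the two frames were built):

```python
import pandas as pd

# Lookup table: values indexed by ID.
df1 = pd.DataFrame(
    {"Val1": [1000, 2000, 3000], "Val2": [2, 3, 1], "Val3": [0, 9, 8]},
    index=["x", "y", "z"],
)
# Entries that reference the IDs, with a plain integer index.
df2 = pd.DataFrame(
    {
        "foo": ["something", "nothing", "everything", "who"],
        "ID": ["y", "y", "x", "z"],
        "bar": ["a", "b", "c", "d"],
    }
)

# Match the ID column of df2 against the index of df1.
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
print(result)
```

Every ID in df2 has a match in df1, so no row is dropped and no NaN appears.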
