Join an empty pandas DataFrame with a MultiIndex DataFrame - python

I want to build a large pandas DataFrame in a loop. In the first iteration the DataFrame df1 is still empty. When I join df1 with df2 that has a MultiIndex, the Index gets squashed somehow.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=range(6))
df2 = pd.DataFrame(np.random.randn(6, 3),
                   columns=pd.MultiIndex.from_arrays((['A', 'A', 'A'],
                                                      ['a', 'b', 'c'])))
df1[df2.columns] = df2
df1
df1
(A, a) (A, b) (A, c)
0 -0.673923 1.392369 1.848935
1 1.427368 0.042691 0.130962
2 -0.258589 0.216157 0.196101
3 -1.022283 1.312113 -0.770108
4 0.511127 -0.633477 -0.229149
5 -1.364237 0.713107 2.124274
I was hoping for a DataFrame with the MultiIndex intact like this:
A
a b c
0 -0.673923 1.392369 1.848935
1 1.427368 0.042691 0.130962
2 -0.258589 0.216157 0.196101
3 -1.022283 1.312113 -0.770108
4 0.511127 -0.633477 -0.229149
5 -1.364237 0.713107 2.124274
What am I doing wrong?

The MultiIndex will not always be recognized when we assign into a frame that has plain columns, so create df1 with an empty MultiIndex for its columns up front:
df1 = pd.DataFrame(index=range(6),
                   columns=pd.MultiIndex.from_arrays([[], []]))
df1[df2.columns] = df2
df1
Out[697]:
A
a b c
0 -0.755397 0.574920 0.901570
1 -0.165472 -1.865715 1.583416
2 -0.403287 1.358329 0.706650
3 0.028019 1.432543 -0.586325
4 -0.414851 0.825253 0.745090
5 0.389917 0.940657 0.125837
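Alternatively, since the larger goal is building a big DataFrame in a loop, a common pattern is to collect the per-iteration pieces in a list and concatenate once at the end, which avoids assigning into an empty frame altogether. A minimal sketch, where make_chunk is a hypothetical stand-in for whatever produces each piece:
import numpy as np
import pandas as pd

def make_chunk(i):
    # hypothetical stand-in for whatever produces df2 on each iteration
    return pd.DataFrame(np.random.randn(6, 3),
                        columns=pd.MultiIndex.from_arrays(([f'G{i}'] * 3,
                                                           ['a', 'b', 'c'])))

chunks = [make_chunk(i) for i in range(4)]  # collect the pieces in the loop
result = pd.concat(chunks, axis=1)          # one concat at the end; MultiIndex intact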


How to replace data in one pandas df with the data of another one?

I want to replace some rows of some columns in a bigger pandas df with data from a smaller pandas df. The column names are the same in both.
I tried using combine_first, but it only updates the null values.
For example, say df1.shape is (100, 25) and df2.shape is (10, 5):
df1
A B C D E F G ... X Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now, after replacing, df1 should look like:
A B C D E F G ... X Y Z
1 abc 15.20 1 10 ...
The condition for replacing values in df1 is that df1.A == df2.A and df1.B == df2.B.
How can this be achieved in the most pythonic way? Any help will be appreciated.
I'm not sure I fully understood your question; does this solve your problem?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6
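Note that concat keeps df2's version of the duplicated rows at the bottom. If the original row order matters and the overlapping rows share the same index labels in both frames (an assumption here), you can sort the index afterwards:
new_df = (pd.concat([df1, df2])
            .drop_duplicates(['A', 'B'], keep='last')
            .sort_index())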
You could play with the MultiIndex.
First let us create the dataframes you are working with:
from string import ascii_uppercase

import numpy as np
import pandas as pd

cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100 * len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10, :5], columns=cols[:5])
Then turn A and B into indices:
df = df.set_index(["A", "B"])
df1 = df1.set_index(["A", "B"]) * 1.5  # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1   # aligned assignment on the shared (A, B) index
df = df.reset_index()
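Under the same set_index alignment (i.e., before the reset_index step above), DataFrame.update is another option; it overwrites matching cells in place, but only where the incoming frame has non-NA values:
# assumes df and df1 are both indexed by (A, B), as before the reset above
df.update(df1)
df = df.reset_index()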

How can I find and store how many columns it takes to reach a value greater than the first value in each row?

import pandas as pd
df = {'a': [3,4,5], 'b': [1,2,3], 'c': [4,3,3], 'd': [1,5,4], 'e': [9,4,6]}
df1 = pd.DataFrame(df, columns = ['a', 'b', 'c', 'd', 'e'])
dg = {'b': [2,3,4]}
df2 = pd.DataFrame(dg, columns = ['b'])
The original dataframe is df1. For each row, I want to find how many columns it takes to reach the first value bigger than the value in the first column, and store that count in a new dataframe.
df1
a b c d e
0 3 1 4 1 9
1 4 2 3 5 4
2 5 3 3 4 6
df2 is the resulting dataframe. For example, in the first row of df1 the first value is 3, and the first value bigger than 3 is 4 (column c), so the first row of df2 stores 2 (there are two columns from column a to c). In the second row the first value is 4, and the first value bigger than 4 is 5 (column d), so the second row of df2 stores 3 (three columns from a to d). In the third row the first value is 5, and the first value bigger than 5 is 6 (column e), so the third row of df2 stores 4 (four columns from a to e).
df2
b
0 2
1 3
2 4
I would appreciate the help.
In your case we can sub: subtract column a from the remaining columns, check where the difference is greater than 0, and take the first qualifying column with idxmax; get_indexer then converts the column names into positions:
s = df1.columns.get_indexer(
        df1.drop(columns='a').sub(df1['a'], axis=0).gt(0).idxmax(axis=1))
# array([2, 3, 4])
df2 = pd.DataFrame({'b': s})
You can get the column names by comparing the entire DataFrame against the first column along the rows, replacing False values with NaN, and applying first_valid_index row-wise, e.g.:
names = (
    df1.gt(df1.iloc[:, 0], axis=0)
       .replace(False, pd.NA)  # or use np.nan
       .apply(pd.Series.first_valid_index, axis=1)
)
That'll give you:
0 c
1 d
2 e
Then you can convert those to offsets:
offsets = df1.columns.get_indexer(names)
# array([2, 3, 4])
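If speed matters on wide frames, a NumPy sketch of the same idea (it assumes every row has at least one qualifying value, as in the example; an all-False row would misleadingly yield offset 1):
import numpy as np

mask = df1.values[:, 1:] > df1.values[:, [0]]  # each remaining column vs. the first
offsets = mask.argmax(axis=1) + 1              # +1 since column 'a' was sliced off
# array([2, 3, 4])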

Fast way to get the value of a column element from another dataframe in pandas DataFrame

I have following dataframes:
import pandas as pd
df1 = pd.DataFrame({'Col_1': ('a', 'b', 'c'), 'Col_2': ('a', 'c', 'd')})
df2 = pd.DataFrame({'Col_3': ('a', 'b', 'c', 'd'), 'Val': (1, 2, 3, 4)})
df1:
Col_1 Col_2
0 a a
1 b c
2 c d
df2:
Col_3 Val
0 a 1
1 b 2
2 c 3
3 d 4
I am trying to add two columns containing the values of a, b, c and d looked up from df2. Here is the code I have, but I am not sure whether this is the most efficient approach for large datasets.
df3 = df1.merge(df2, left_on='Col_1', right_on='Col_3').merge(df2, left_on='Col_2', right_on='Col_3')
df3:
Col_1 Col_2 Col_3_x Val_x Col_3_y Val_y
0 a a a 1 a 1
1 b c b 2 c 3
2 c d c 3 d 4
If merge is efficient enough, is there any way to avoid the duplicated Col_3_x and Col_3_y columns?
Thanks for the help.
Since the join is on a single column, you can map twice:
s = df2.set_index('Col_3')['Val']  # Use this to map
for col in df1.columns:
    df1[f'Val_{col}'] = df1[col].map(s)
print(df1)
Col_1 Col_2 Val_Col_1 Val_Col_2
0 a a 1 1
1 b c 2 3
2 c d 3 4
If the join is on multiple columns, you can map with tuples as the keys, though creating them can be slow. merge is more natural there; to avoid the duplicated columns, rename so the key columns have the same name in both DataFrames:
for col in ['Col_1', 'Col_2']:
    df1 = df1.merge(df2.rename(columns={'Col_3': col, 'Val': f'Val_{col}'}),
                    how='left', on=col)
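For completeness, a sketch of the tuple-keyed map mentioned above; Key_1 and Key_2 are hypothetical key columns present in both frames:
lookup = df2.set_index(['Key_1', 'Key_2'])['Val']   # Series with a MultiIndex
keys = pd.Series(list(zip(df1['Key_1'], df1['Key_2'])), index=df1.index)
df1['Val'] = keys.map(lookup)                       # tuples look up the MultiIndex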

Why does concat Series to DataFrame with index matching columns not work?

I want to append a Series to a DataFrame where Series's index matches DataFrame's columns using pd.concat, but it gives me surprises:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 1
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and a transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will look like:
a b
1 1 2
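If that silent drop is a concern, a defensive sketch (this check is not built-in pandas behavior) is to validate the labels before assigning:
extra = sr.index.difference(df.columns)
if not extra.empty:
    raise KeyError(f'Series labels not in df columns: {list(extra)}')
df.loc[1] = sr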

Pandas concatenate alternating columns

I have two dataframes as follows:
df2 = pd.DataFrame(np.random.randn(5,2),columns=['A','C'])
df3 = pd.DataFrame(np.random.randn(5,2),columns=['B','D'])
I wish to get the columns in an alternating fashion such that I get the result below:
df4 = pd.DataFrame()
for i in range(len(df2.columns)):
    df4[df2.columns[i]] = df2[df2.columns[i]]
    df4[df3.columns[i]] = df3[df3.columns[i]]
df4
A B C D
0 1.056889 0.494769 0.588765 0.846133
1 1.536102 2.015574 -1.279769 -0.378024
2 -0.097357 -0.886320 0.713624 -1.055808
3 -0.269585 -0.512070 0.755534 0.855884
4 -2.691672 -0.597245 1.023647 0.278428
I think I'm being really inefficient with this solution. What is the more pythonic/pandas-idiomatic way of doing this?
P.S. In my specific case the column names are not A, B, C, D and aren't alphabetically arranged; I mention them just so you know which two dataframes I want to combine.
If you need something more dynamic, first zip the column names of both DataFrames and then flatten the result:
df5 = pd.concat([df2, df3], axis=1)
print (df5)
A C B D
0 0.874226 -0.764478 1.022128 -1.209092
1 1.411708 -0.395135 -0.223004 0.124689
2 1.515223 -2.184020 0.316079 -0.137779
3 -0.554961 -0.149091 0.179390 -1.109159
4 0.666985 1.879810 0.406585 0.208084
#http://stackoverflow.com/a/10636583/2901002
print (list(sum(zip(df2.columns, df3.columns), ())))
['A', 'B', 'C', 'D']
print (df5[list(sum(zip(df2.columns, df3.columns), ()))])
A B C D
0 0.874226 1.022128 -0.764478 -1.209092
1 1.411708 -0.223004 -0.395135 0.124689
2 1.515223 0.316079 -2.184020 -0.137779
3 -0.554961 0.179390 -0.149091 -1.109159
4 0.666985 0.406585 1.879810 0.208084
How about this?
df4 = pd.concat([df2, df3], axis=1)
Or do they have to be in a specific order? Anyway, you can always reorder them:
df4 = df4[['A','B','C','D']]
And without writing out the columns:
df4 = df4[[item for items in zip(df2.columns, df3.columns) for item in items]]
You could concat and then reindex (reindex_axis is deprecated and has been removed from recent pandas):
df = pd.concat([df2, df3], axis=1)
df.reindex(columns=df.columns[::2].tolist() + df.columns[1::2].tolist())
Append even indices to df2's columns and odd indices to df3's columns as an extra level, then use that new level to sort:
df2_ = df2.T.set_index(np.arange(len(df2.columns)) * 2, append=True).T
df3_ = df3.T.set_index(np.arange(len(df3.columns)) * 2 + 1, append=True).T
df = pd.concat([df2_, df3_], axis=1).sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(1)
df
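If there may be more than two frames (all assumed here to have the same number of columns), the interleaving generalizes with itertools:
from itertools import chain

frames = [df2, df3]  # works for any number of equally wide frames
order = list(chain.from_iterable(zip(*(f.columns for f in frames))))  # A, B, C, D
df4 = pd.concat(frames, axis=1)[order]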
