I have a set of four dataframes: df_all = [df1, df2, df3, df4]
As a sample, they look like this:
df1:
Name Dates a
Apple 5-5-15 NaN
Apple 6-5-15 42
Apple 6-5-16 36
Apple 6-5-17 36
df2:
Name Dates a
Banana 5-5-15 85
Banana 6-5-15 NaN
Banana 6-6-15 100
Banana 6-5-16 18
I want to merge on "Dates", which I achieve in the following manner:
for cols in df_all:
    cols.drop(['Name'], axis=1, inplace=True)

a = df1.merge(df2, how='left', on='Dates').merge(df3, how='left', on='Dates').merge(df4, how='left', on='Dates')
This gives me exactly what I want. However, the columns are renamed to a_x, a_y, a_x, a_y.
The sample below shows what happens when I merge only df1 and df2.
Dates a_x a_y
5-5-15 NaN 85
6-5-15 42 NaN
6-6-15 NaN 100
6-5-16 36 18
6-5-17 36 NaN
Before the merge, I want to rename column 'a' based on the value in 'Name' (Apple or Banana), and I want to automate this as much as possible: in every dataframe, rename column 'a' to the value in that dataframe's 'Name' column.
Try to change the column name in your first for loop before you drop that column.
for cols in df_all:
    # every row in a frame shares the same name, so take it from the first row
    name = cols['Name'].iloc[0]
    cols.drop(['Name'], axis=1, inplace=True)
    cols.rename(columns={'a': name}, inplace=True)

a = df1.merge(df2, how='left', on='Dates').merge(df3, how='left', on='Dates').merge(df4, how='left', on='Dates')
Try concat, reshaping each dataframe first:
df = pd.concat([x.set_index(['Name', 'Dates']).a.unstack(level=0) for x in df_all], axis=1)
Or combine them, then pivot_table:
df = pd.concat([df1, df2]).pivot_table(index='Dates', columns='Name', values='a', aggfunc='first')
Name Apple Banana
Dates
5-5-15 NaN 85.0
6-5-15 42.0 NaN
6-5-16 36.0 18.0
6-5-17 36.0 NaN
6-6-15 NaN 100.0
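One caveat about the output above: 'Dates' is a string column, so the pivoted index sorts lexicographically, which is why 6-6-15 lands after 6-5-17. If chronological order matters, parse the dates first (a sketch; format='%d-%m-%y' is an assumption, since the samples are ambiguous about day/month order):
# parse strings like '5-5-15' into real dates so the index sorts chronologically
for frame in df_all:
    frame['Dates'] = pd.to_datetime(frame['Dates'], format='%d-%m-%y')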
You can rename before you drop the 'Name' column. Since all names are the same within a dataframe, you can take the name from the first row:
for cols in df_all:
    cols.rename(columns={'a': cols.at[0, 'Name']}, inplace=True)
    cols.drop(['Name'], axis=1, inplace=True)
You can automate the process of merging the dataframes by renaming the columns before merging and then using functools.reduce to merge all the dataframes in df_all:
from functools import reduce
# rename column a
df_all = [df.rename(columns={'a': df.pop('Name').iloc[0]}) for df in df_all]
# merge all dataframes
merged = reduce(lambda d1, d2: pd.merge(d1, d2, on='Dates', how='left'), df_all)
# print(merged)
# sample result after merging df1 & df2
Dates Apple Banana
0 5-5-15 NaN 85.0
1 6-5-15 42.0 NaN
2 6-5-16 36.0 18.0
3 6-5-17 36.0 NaN
I have an original dataframe with 1772 columns and 130 rows, shown below. I would like to stack them into multiple target columns.
id   AA_F1R1  BB_F1R1  AA_F1R2  BB_F1R2  ...  AA_F2R1  BB_F2R2  ...  AA_F7R25  BB_F7R25
001  5        xy       xx       xx       ...  zy       1        ...  4         xx
002  6        zzz      yyy      zzz      ...  xw       2        ...  3         zzz
I found two different solutions that seem to work for others, but for me they give errors. I'm not sure if they work with NaN values.
pd.wide_to_long(df, stubnames=['AA', 'BB'], i='id', j='dropme', sep='_')\
.reset_index()\
.drop('dropme', axis=1)\
.sort_values('id')
Output:
0 rows × 1773 columns
Another solution I tried was:
df.set_index('id', inplace=True)
df.columns = pd.MultiIndex.from_tuples(tuple(df.columns.str.split("_")))
df.stack(level = 1).reset_index(level = 1, drop = True).reset_index()
Output:
150677 rows × 2 columns
The problem with this last one is that I couldn't keep the columns I wanted.
I appreciate any inputs!
Use the suffix=r'\w+' parameter in wide_to_long:
df = pd.wide_to_long(df, stubnames=['AA','BB'], i='id', j='dropme', sep='_', suffix=r'\w+')\
.reset_index()\
.drop('dropme', axis=1)\
.sort_values('id')
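The reason the first attempt returned 0 rows is that wide_to_long's default suffix pattern is numeric (r'\d+'), which never matches suffixes like F1R1. A minimal sketch with hypothetical toy data:
import pandas as pd

# toy frame mimicking the question's naming scheme (values are made up)
df = pd.DataFrame({'id': ['001', '002'],
                   'AA_F1R1': ['5', '6'], 'BB_F1R1': ['xy', 'zzz'],
                   'AA_F1R2': ['xx', 'yyy'], 'BB_F1R2': ['xx', 'zzz']})

out = (pd.wide_to_long(df, stubnames=['AA', 'BB'], i='id', j='dropme',
                       sep='_', suffix=r'\w+')
         .reset_index()
         .drop('dropme', axis=1)
         .sort_values('id'))
print(out)  # one row per (id, FxRy) pair, with columns AA and BB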
In the second solution, add dropna=False to DataFrame.stack:
df.set_index('id', inplace=True)
df.columns = df.columns.str.split("_", expand=True)
df = df.stack(level = 1, dropna=False).reset_index(level = 1, drop = True).reset_index()
I have a recurring situation where I have multiple files (or Excel sheets) with data in 8 rows and 12 columns; each file is some attribute or measurement, but the indices line up across all files.
Example: row 5, column 8 of each file holds a different attribute of the same sample.
What I want is one data frame where each row is a sample and the columns are its attributes from the many files, for further analysis.
Currently, I can solve this two ways:
Slow and inefficient: iterate over one dataframe, look up the value at the same index/column in every other dataframe, and build a new dataframe row by row (see the sketch after this list).
I can unstack a dataframe, drop the generated unwanted columns, then merge all the unstacked dataframes. This also seems inefficient but is much faster than method 1.
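For reference, method 1 looks roughly like this (a sketch; df1/df2/df3 are the three per-attribute frames that the method-2 example below reads from Excel):
# walk one frame's cells and look up the same cell in every other
# frame, building the result row by row
rows = []
for r in df1.index:
    for c in df1.columns:
        rows.append({'Rows': df1.at[r, c],
                     'Columns': df2.at[r, c],
                     'Wells': df3.at[r, c]})
result = pd.DataFrame(rows)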
Small example (method 2):
# xls is an open pd.ExcelFile handle for the workbook
df1 = pd.read_excel(xls, 'Q_Rows', index_col=0)
df2 = pd.read_excel(xls, 'Q_Columns', index_col=0)
df3 = pd.read_excel(xls, 'Q_Wells', index_col=0)

df1 = df1.unstack().reset_index()
df1 = df1.rename(columns={0: 'Rows'})
df1 = df1[df1.columns.drop(list(df1.filter(regex='level')))]

df2 = df2.unstack().reset_index()
df2 = df2.rename(columns={0: 'Columns'})
df2 = df2[df2.columns.drop(list(df2.filter(regex='level')))]

df3 = df3.unstack().reset_index()
df3 = df3.rename(columns={0: 'Wells'})
df3 = df3[df3.columns.drop(list(df3.filter(regex='level')))]
df3
Wells
0 A01
1 B01
2 C01
3 D01
4 E01
... ...
91 D12
92 E12
93 F12
94 G12
95 H12
df = df1.merge(df2, right_index=True, left_index=True, how='outer')
df = df.merge(df3, right_index=True, left_index=True, how='outer')
df
Rows Columns Wells
0 A 1 A01
1 B 1 B01
2 C 1 C01
3 D 1 D01
4 E 1 E01
... ... ... ...
91 D 12 D12
92 E 12 E12
93 F 12 F12
94 G 12 G12
95 H 12 H12
These methods "work", but my question is: is there a built-in pandas method or more elegant solution that already does this kind of column-wise merging/transposing? Any suggestions are welcome.
Thank you.
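For what it's worth, the three repeated unstack blocks above can be collapsed into a single concat over a dict of unstacked Series (a sketch, reusing the sheet names from the question):
import pandas as pd

# xls is assumed to be an open pd.ExcelFile handle, as in the question
sheets = {'Rows': 'Q_Rows', 'Columns': 'Q_Columns', 'Wells': 'Q_Wells'}

# unstack each 8x12 sheet into a flat Series, then line the Series up as columns
df = pd.concat(
    {name: pd.read_excel(xls, sheet, index_col=0).unstack().reset_index(drop=True)
     for name, sheet in sheets.items()},
    axis=1,
)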
Input DF:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', np.nan, 'two', np.nan],
                   'B': [np.nan, 22, np.nan, 44],
                   'group': [0, 0, 1, 1]})
print(df)
A B group
0 one NaN 0
1 NaN 22.0 0
2 two NaN 1
3 NaN 44.0 1
I want to merge those rows into one, collapsing each column's cells together, but taking the groups into account.
Currently I have:
df = df.agg(lambda x: ','.join(x.dropna().astype(str))).to_frame().T
print(df)
A B group
0 one,two 22.0,44.0 0,0,1,1
but this collapses all rows at once; it does not respect the groups.
Expected Output:
A B
0 one 22.0
1 two 44.0
If you can simplify the problem to taking the first non-missing value per group, use:
df = df.groupby('group').first()
print(df)
A B
group
0 one 22.0
1 two 44.0
If not, and you need a general solution:
df = pd.DataFrame({'A': ['one', np.nan, 'two', np.nan],
                   'B': [np.nan, 22, np.nan, 44],
                   'group': [0, 0, 0, 1]})

def f(x):
    # align each column's non-missing values to positions 0, 1, ... within the group
    return x.apply(lambda col: pd.Series(col.dropna().to_numpy()))

df = df.set_index('group').groupby('group').apply(f).reset_index(level=1, drop=True).reset_index()
print(df)
group A B
0 0 one 22.0
1 0 two NaN
2 1 NaN 44.0
Another option: split the frame into per-column pieces, drop the NaN rows, and merge them back on group:
df_a = df.drop('B', axis=1).dropna()  # (A, group) pairs
df_b = df.drop('A', axis=1).dropna()  # (B, group) pairs
pd.merge(df_a, df_b, on='group')
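For the four-row input above (groups [0, 0, 1, 1]), this should give:
     A  group     B
0  one      0  22.0
1  two      1  44.0
Note that the merge pairs values by group, so a group holding several values in both columns would produce a cross join; here each group has exactly one value per column.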
As per the pandas 0.19.2 documentation, I can provide the keys argument to create a resulting multi-index DataFrame. An example (from the pandas documentation) is:
result = pd.concat(frames, keys=['x', 'y', 'z'])
How would I concat the dataframes so that I can provide the keys at the column level instead of the index level?
What I basically need is something like this, where df1 and df2 are concatenated side by side, each under its own key.
This is supported by the keys parameter of pd.concat when specifying axis=1:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random((4, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.random((4, 3)), columns=list('BDF'), index=[2, 3, 6, 7])
df = pd.concat([df1, df2], keys=['X', 'Y'], axis=1)
The resulting output:
          X                                       Y
          A         B         C         D         B         D         F
0  0.654406  0.495906  0.601100  0.309276       NaN       NaN       NaN
1  0.020527  0.814065  0.907590  0.924307       NaN       NaN       NaN
2  0.239598  0.089270  0.033585  0.870829  0.882028  0.626650  0.622856
3  0.983942  0.103573  0.370121  0.070442  0.986487  0.848203  0.089874
6       NaN       NaN       NaN       NaN  0.664507  0.319789  0.868133
7       NaN       NaN       NaN       NaN  0.341145  0.308469  0.884074
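A note on usage: with the keys at the column level, selecting a top-level label pulls out the corresponding block:
print(df['X'])       # the four columns that came from df1
print(df['Y', 'B'])  # a single column: df2's 'B'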
I have a data frame and a series that I would like to return a rolling correlation as a new data frame.
So I have 3 columns in df1, and I would like to return a new data frame that is the rolling correlation of each of these columns with a Series object.
import pandas as pd
df1 = pd.read_csv('https://bpaste.net/raw/d0456d3a020b')
df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index(df1['Date'])
del df1['Date']
df2 = pd.read_csv('https://bpaste.net/raw/d5cb455cb091')
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index(df2['Date'])
del df2['Date']
pd.rolling_corr(df1, df2)
The result (https://bpaste.net/show/58b59c656ce4) gives only NaNs and 1s.
pd.rolling_corr(df1['IWM_Close'], spy, window=22)
returns the ideal series, but I didn't want to loop through the columns of the data frame. Is there a better way to do it?
Thanks.
I believe your second input has to be a Series to be correlated with all columns in the first DataFrame.
This works:
from datetime import date

import numpy as np
import pandas as pd

index = pd.DatetimeIndex(start=date(2015, 1, 1), freq='W', periods=100)
df1 = pd.DataFrame(np.random.random((100, 3)), index=index)
df2 = pd.DataFrame(np.random.random((100, 1)), index=index)
print(pd.rolling_corr(df1, df2.squeeze(), window=20).tail())
or, for the same result:
df2 = pd.Series(np.random.random(100), index=index)
print(pd.rolling_corr(df1, df2, window=20).tail())
0 1 2
2016-10-30 -0.170971 -0.039929 -0.091098
2016-11-06 -0.199441 0.000093 -0.096331
2016-11-13 -0.213728 -0.020709 -0.129935
2016-11-20 -0.075859 0.014667 -0.153830
2016-11-27 -0.114041 0.019886 -0.155472
but this doesn't (note the missing .squeeze()): it only correlates the matching columns:
print(pd.rolling_corr(df1, df2, window=20).tail())
0 1 2
2016-10-30 0.019865 NaN NaN
2016-11-06 0.087075 NaN NaN
2016-11-13 0.011679 NaN NaN
2016-11-20 -0.004155 NaN NaN
2016-11-27 0.111408 NaN NaN
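A side note: pd.rolling_corr was deprecated in pandas 0.18 and removed in later versions; on current pandas the same computation goes through the .rolling() accessor. A minimal sketch of the equivalent call:
import numpy as np
import pandas as pd

index = pd.date_range(start='2015-01-01', freq='W', periods=100)
df1 = pd.DataFrame(np.random.random((100, 3)), index=index)
s = pd.Series(np.random.random(100), index=index)

# rolling 20-period correlation of every column of df1 against the Series
print(df1.rolling(window=20).corr(s).tail())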