openpyxl concat three dataframes horizontally


I am using openpyxl to edit three dataframes, df1, df2, and df3 (if necessary, they can also be treated as three independent Excel files):
import pandas as pd
data1 = [[1, 1],[1,1]]
df1 = pd.DataFrame(data1, index = ['I1a','I1b'], columns=['v1a', 'v1b'])
df1.index.name='I1'
data2 = [[2, 2,2,2],[2,2,2,2],[2,2,2,2],[2,2,2,2]]
df2 = pd.DataFrame(data2, index = ['I2a','I2b','I2c','I2d'], columns=['v2a','v2b','v2c','v2d'])
df2.index.name='I2'
data3 = [['a', 'b',3,3],['a','c',3,3],['b','c',3,3],['c','d',3,3]]
df3 = pd.DataFrame(data3, columns=['v3a','v3b','v3c','v3d'])
df3 = df3.groupby(['v3a','v3b']).first()
Here df3 has a MultiIndex. How can I concat them into one Excel file horizontally, with each dataframe starting at the same row?
The index should be treated as a column, and for the MultiIndex the first level should be kept hidden.

Update
IIUC:
>>> pd.concat([df1.reset_index(), df2.reset_index(), df3.reset_index()], axis=1)
    I1  v1a  v1b   I2  v2a  v2b  v2c  v2d v3a v3b  v3c  v3d
0  I1a  1.0  1.0  I2a    2    2    2    2   a   b    3    3
1  I1b  1.0  1.0  I2b    2    2    2    2   a   c    3    3
2  NaN  NaN  NaN  I2c    2    2    2    2   b   c    3    3
3  NaN  NaN  NaN  I2d    2    2    2    2   c   d    3    3
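If the goal is to place the three frames next to each other in a worksheet rather than to build one combined frame, one possible approach is to write each frame with DataFrame.to_excel at an increasing startcol. The following is a minimal sketch, not the method of the answer above: the helper names start_columns and write_side_by_side, the gap parameter, and the output path are hypothetical, and the writing step assumes openpyxl is installed. Each frame occupies one worksheet column per index level plus one per data column.

```python
import pandas as pd

def start_columns(frames, gap=0):
    """Return the 0-based worksheet column at which each frame starts."""
    starts, col = [], 0
    for frame in frames:
        starts.append(col)
        # to_excel writes every index level as a column before the data columns.
        col += frame.index.nlevels + frame.shape[1] + gap
    return starts

def write_side_by_side(frames, path, sheet_name="Sheet1", gap=0):
    """Write all frames into one sheet, each starting at row 0 (hypothetical helper)."""
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for frame, col in zip(frames, start_columns(frames, gap)):
            frame.to_excel(writer, sheet_name=sheet_name, startrow=0, startcol=col)

df1 = pd.DataFrame([[1, 1], [1, 1]], index=["I1a", "I1b"], columns=["v1a", "v1b"])
df1.index.name = "I1"
df2 = pd.DataFrame([[2] * 4] * 4, index=["I2a", "I2b", "I2c", "I2d"],
                   columns=["v2a", "v2b", "v2c", "v2d"])
df2.index.name = "I2"
df3 = pd.DataFrame([["a", "b", 3, 3], ["a", "c", 3, 3], ["b", "c", 3, 3], ["c", "d", 3, 3]],
                   columns=["v3a", "v3b", "v3c", "v3d"]).groupby(["v3a", "v3b"]).first()

frames = [df1, df2, df3]
offsets = start_columns(frames)
# write_side_by_side(frames, "output.xlsx")  # uncomment to create the file
```

Because the index is written as ordinary columns, no reset_index is needed here. How the outer MultiIndex level of df3 is displayed is governed by to_excel's merge_cells option, so it is worth checking the result against the layout you want.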
Old answer
Assuming you know the start row, you can use pandas to remove extra columns:
import pandas as pd
df = pd.read_excel('input.xlsx', header=0, skiprows=3).dropna(how='all', axis=1)
df.to_excel('output.xlsx', index=False)

Related

How to subtract two dataframe column in dask dataframe?

I have two dataframes, df1 and df2.
Each has four columns (A, B, C, D) and two rows.
I want to subtract one dataframe from the other column by column, i.e. df1['A'] - df2['A'] and so on, but I don't know how to do it.

Just do the subtraction, but keep the indexes in mind. For example, say df1 and df2 have the same columns but different indexes:
import numpy as np
import pandas as pd
import dask.dataframe as dd
df1 = dd.from_array(np.arange(8).reshape(2, 4), columns=['A','B','C','D'])
df2 = dd.from_pandas(pd.DataFrame(
    np.arange(8).reshape(2, 4),
    columns=['A','B','C','D'],
    index=[1, 2]
), npartitions=1)
Then:
(df1 - df2).compute()
# A B C D
# 0 NaN NaN NaN NaN
# 1 4.0 4.0 4.0 4.0
# 2 NaN NaN NaN NaN
Alternatively, rebuild the index of df2 so that it matches df1, then subtract:
df2 = df2.assign(idx=1)
df2 = df2.set_index(df2.idx.cumsum() - 1)
df2 = df2.drop(columns=['idx'])
(df1 - df2).compute()
# A B C D
# 0 0 0 0 0
# 1 0 0 0 0
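The alignment rule behind both results can be seen in plain pandas, whose label-based arithmetic dask.dataframe follows. A small sketch, with illustrative names (left, right, by_label, by_position); reset_index(drop=True) is used here as a pandas-side shortcut and is not what the answer uses for dask:

```python
import numpy as np
import pandas as pd

columns = ["A", "B", "C", "D"]
left = pd.DataFrame(np.arange(8).reshape(2, 4), columns=columns)                 # index 0, 1
right = pd.DataFrame(np.arange(8).reshape(2, 4), columns=columns, index=[1, 2])  # index 1, 2

# Subtraction aligns on index labels: only label 1 exists in both frames,
# so labels 0 and 2 become NaN.
by_label = left - right

# Giving both frames the same labels makes the subtraction row by row.
by_position = left - right.reset_index(drop=True)
```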

Outer merge between pandas and imputing NA with preceding row

I have two dataframes containing the same columns:
df1 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,3,4,5,6]})
df2 = pd.DataFrame({'a': [1,3,4],
'b': [2,4,5]})
I want df2 to have the same number of rows as df1. Any values of a that are missing from df2 should be copied over from df1, and the corresponding values of b should be taken from the row before.
In other words, I want df2 to look like this:
df2 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,2,4,5,5]})
EDIT: I'm looking for an answer that is independent of the number of columns

Use DataFrame.merge with only column a from df1, then forward fill the missing values:
df = df1[['a']].merge(df2, how='left', on='a').ffill()
print (df)
a b
0 1 2.0
1 2 2.0
2 3 4.0
3 4 5.0
4 5 5.0
Or use merge_asof:
df = pd.merge_asof(df1[['a']], df2, on='a')
print (df)
a b
0 1 2
1 2 2
2 3 4
3 4 5
4 5 5
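Both variants should carry over to frames with more columns, since only the key column of df1 takes part in the merge. A sketch with an extra column c (the column c and the helper name fill_from_previous are made up for illustration); note that merge_asof additionally expects both frames to be sorted by the key:

```python
import pandas as pd

def fill_from_previous(reference, partial, key):
    """Give `partial` one row per key in `reference`, forward-filling the gaps."""
    return reference[[key]].merge(partial, how="left", on=key).ffill()

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6], "c": list("vwxyz")})
df2 = pd.DataFrame({"a": [1, 3, 4], "b": [2, 4, 5], "c": ["v", "x", "y"]})

via_merge = fill_from_previous(df1, df2, "a")
# merge_asof matches each key to the last key in df2 that is <= it.
via_asof = pd.merge_asof(df1[["a"]], df2, on="a")
```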

Adding a new column to a pandas DataFrame when the new column is longer than the index

I'm having trouble adding a new column to a pandas DataFrame when the new column has more values than the index has entries.
The data may look like this:
import pandas as pd
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
As you can see, the length of this df's index is 3.
Next I want to add a new column. I tried the two ways below:
df["new_col"] = [1,2,3,4]
This raises an error: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
Here only the values [1,2,3] end up in df.
(The new column cannot hold more values than the index has entries.)
What I want is to keep all four values, with df extended by an extra row to fit them.
Is there a better way?
Looking forward to your answer, thanks!

Use DataFrame.join with a renamed Series and a right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is to reindex by the new s.index first:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Or chain the two steps with assign:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col = s)

df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
new_col = pd.Series([1,2,3,4])
df = pd.concat([df,new_col],axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
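In the pd.concat variant the new column ends up labelled 0 because the Series has no name. A sketch of the same call with a named Series (the name new_col follows the question):

```python
import pandas as pd

df = pd.DataFrame({"bar": ["A", "B", "C"], "zoo": [1, 2, 3]})
new_col = pd.Series([1, 2, 3, 4], name="new_col")

# axis=1 aligns on the row index and takes the union of labels,
# so row 3 is added with missing values for bar and zoo.
out = pd.concat([df, new_col], axis=1)
```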

Pandas, merge rows only in same groups

Input DF:
df = pd.DataFrame({'A': ['one',np.nan,'two',np.nan],
'B': [np.nan,22,np.nan,44],
'group':[0,0,1,1]
})
print(df)
A B group
0 one NaN 0
1 NaN 22.0 0
2 two NaN 1
3 NaN 44.0 1
I want to merge those rows into one row per group, combining the cells within each column.
Currently I have:
df=df.agg(lambda x: ','.join(x.dropna().astype(str))
).to_frame().T
print(df)
A B group
0 one,two 22.0,44.0 0,0,1,1
But this merges all rows together instead of merging only within each group.
Expected Output:
A B
0 one 22.0
1 two 44.0

If it is enough to take the first non-missing value per group, use:
df = df.groupby('group').first()
print(df)
A B
group
0 one 22.0
1 two 44.0
If not, and you need a general solution:
df = pd.DataFrame({'A': ['one',np.nan,'two',np.nan],
'B': [np.nan,22,np.nan,44],
'group':[0,0,0,1]
})
def f(x):
    return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))
df = df.set_index('group').groupby('group').apply(f).reset_index(level=1, drop=True).reset_index()
print(df)
group A B
0 0 one 22.0
1 0 two NaN
2 1 NaN 44.0

df_a = df.drop('B', axis=1).dropna()
df_b = df.drop('A', axis=1).dropna()
pd.merge(df_a, df_b, on='group')
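The joining approach from the question can also be kept and simply applied per group. A sketch using the second example frame (groups 0, 0, 0, 1); the helper name join_non_missing is illustrative. A group with no values in a column comes out as an empty string, and whether that is acceptable depends on the use case:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["one", np.nan, "two", np.nan],
                   "B": [np.nan, 22, np.nan, 44],
                   "group": [0, 0, 0, 1]})

def join_non_missing(values):
    """Join the non-missing values of one column within one group."""
    return ",".join(values.dropna().astype(str))

# One row per group; each cell holds the comma-joined values of that group.
joined = df.groupby("group").agg(join_non_missing)
```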

Why does concat Series to DataFrame with index matching columns not work?

I want to append a Series to a DataFrame using pd.concat, where the Series's index matches the DataFrame's columns, but the result surprises me:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?

Convert the Series to a one-row DataFrame first, using to_frame and a transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2

"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will then look like this:
a b
1 1 2
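If the extra entry should be kept rather than dropped, the to_frame-and-transpose route from the answer above widens the frame instead. A sketch, with the illustrative name widened (note also that DataFrame.append, used in the question, was removed in pandas 2.0, so pd.concat is the route that still works there):

```python
import pandas as pd

df = pd.DataFrame(columns=["a", "b"])
sr = pd.Series({"a": 1, "b": 2, "c": 3}, name=1)

# The Series name becomes the row label; column 'c' is added to the result
# because concat takes the union of the column labels.
widened = pd.concat([df, sr.to_frame().T])
```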
