Outer merge between pandas and imputing NA with preceeding row

Outer merge between pandas and imputing NA with preceeding row - python

I have two dataframes containing the same columns:
df1 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,3,4,5,6]})
df2 = pd.DataFrame({'a': [1,3,4],
'b': [2,4,5]})
I want df2 to have the same number of rows as df1. Any values of a not present in df1 should be copied over, and corresponding values of b should be taken from the row before.
In other words, I want df2 to look like this:
df2 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,2,4,5,5]})
EDIT: I'm looking for an answer that is independent of the number of columns

Use DataFrame.merge by only a column from df1 and for replace missing values is added forward filling them:
df = df1[['a']].merge(df2, how='left', on='a').ffill()
print (df)
a b
0 1 2.0
1 2 2.0
2 3 4.0
3 4 5.0
4 5 5.0
Or use merge_asof:
df = pd.merge_asof(df1[['a']], df2, on='a')
print (df)
a b
0 1 2
1 2 2
2 3 4
3 4 5
4 5 5

Related

concatenate 2 dataframes in order to append some columns

I have 2 dataframes(df1 and df2) and I want to append them as follows :
df1 and df2 have some columns in common but I want to append the columns that exist in df2 and not in df1 but keep the columns of df1 as they are
df2 is empty (all rows are nan)
I could just add columns in df1 but in the future, df2 could have new cols added that is why I do not want to hardcode the column names but rather be done automatically. I used to use append but I get the following message
df_new = df1.append(df2)
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead
I tried the following
df_new = pd.concat([df1, df2], axis=1)
but it concatenates all the columns of both dataframes

According to https://pandas.pydata.org/docs/reference/api/pandas.concat.html
join{‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis (or axes).
INNER
df = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])
df2 = pd.DataFrame([[None,None,None,None],[None,None,None,None]], columns=['letter', 'number', 'animal', 'newcol'])
print(pd.concat([df,df2], join='inner').dropna(how='all'))
output:
letter number animal
0 c 3 cat
1 d 4 dog
OUTER
df = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])
df2 = pd.DataFrame([[None,None,None,None],[None,None,None,None]], columns=['letter', 'number', 'animal', 'newcol'])
print(pd.concat([df,df2], join='outer').dropna(how='all'))
output:
letter number animal newcol
0 c 3 cat NaN
1 d 4 dog NaN

You could use pd.concat() with axis=0 (default) and join='outer' (default). I'm illustrating with some examples
df1 = pd.DataFrame({'col1': [3,3,3],
'col2': [4,4,4]})
df2 = pd.DataFrame({'col1': [1,2,3],
'col2': [1,2,3],
'col3': [1,2,3],
'col4': [1,2,3]})
print(df1)
col1 col2
0 3 4
1 3 4
2 3 4
print(df2)
col1 col2 col3 col4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
df3 = pd.concat([df1, df2], axis=0, join='outer')
print(df3)
col1 col2 col3 col4
0 3 4 NaN NaN
1 3 4 NaN NaN
2 3 4 NaN NaN
0 1 1 1.0 1.0
1 2 2 2.0 2.0
2 3 3 3.0 3.0

To concatenate just the columns from df2 that are not present in df1:
pd.concat([df1, df2.loc[:, [c for c in df2.columns if c not in df1.columns]]], axis=1)

openpyxl concat three dataframes horizonally

I am using openpyxl to edit three dataframes, df1, df2, df3 (If it is necessary, we can also regard as three excels independently):
import pandas as pd
data1 = [[1, 1],[1,1]]
df1 = pd.DataFrame(data1, index = ['I1a','I1b'], columns=['v1a', 'v1b'])
df1.index.name='I1'
data2 = [[2, 2,2,2],[2,2,2,2],[2,2,2,2],[2,2,2,2]]
df2 = pd.DataFrame(data2, index = ['I2a','I2b','I2c','I2d'], columns=['v2a','v2b','v2c','v2d'])
df2.index.name='I2'
data3 = [['a', 'b',3,3],['a','c',3,3],['b','c',3,3],['c','d',3,3]]
df3 = pd.DataFrame(data3, columns=['v3a','v3b','v3c','v3d'])
df3 = df3.groupby(['v3a','v3b']).first()
Here df3 is multiindex. How to concat them into one excel horizontally (each dataframe start at the same line) as following:
Here we will regard index as a column and for multiindex, we keep the first level hidden.

Update
IIUC:
>>> pd.concat([df1.reset_index(), df2.reset_index(), df3.reset_index()], axis=1)
I1 v1a v1b I2 v2a v2b v2c v2d v3a v3b v3c v3d
0 I1a 1.0 1.0 I2a 2 2 2 2 a b 3 3
1 I1b 1.0 1.0 I2b 2 2 2 2 a c 3 3
2 NaN NaN NaN I2c 2 2 2 2 b c 3 3
3 NaN NaN NaN I2d 2 2 2 2 c d 3 3
Old answer
Assuming you know the start row, you can use pandas to remove extra columns:
import pandas as pd
df = pd.read_excel('input.xlsx', header=0, skiprows=3).dropna(how='all', axis=1)
df.to_excel('output.xlsx', index=False)
Input:
Output:

How to subtract two dataframe column in dask dataframe?

I have two dataframes like df1, df2.
In df1 i have 4 columns (A,B,C,D) and two rows,
In df2 i have 4 columns (A,B,C,D) and two rows.
Now I want to subtract the two dataframe LIKE df1['A'] - df2['A'] and so on. But I don't know how to do it.
df1-
df2 -

Just do the subtraction but keep in mind the indexes, for example, let's say I have df1 and df2 with same columns but different index:
df1 = dd.from_array(np.arange(8).reshape(2, 4), columns=['A','B','C','D'])
df2 = dd.from_pandas(pd.DataFrame(
np.arange(8).reshape(2, 4),
columns=['A','B','C','D'],
index=[1, 2]
), npartitions=1)
Then:
(df1 - df2).compute()
# A B C D
# 0 NaN NaN NaN NaN
# 1 4.0 4.0 4.0 4.0
# 2 NaN NaN NaN NaN
On the other hand, let's match index from df2 to df1 and subtract
df2 = df2.assign(idx=1)
df2 = df2.set_index(df2.idx.cumsum() - 1)
df2 = df2.drop(columns=['idx'])
(df1 - df2).compute()
# A B C D
# 0 0 0 0 0
# 1 0 0 0 0

Adding new columns to Pandas Data Frame which the length of new column value is bigger than length of index

I'm in a trouble with adding a new column to a pandas dataframe when the length of new column value is bigger than length of index.
Data may like this :
import pandas as pd
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
So, you see, length of this df's index is 3.
And next I wanna add a new column , code may like this two ways below:
df["new_col"] = [1,2,3,4]
It'll raise an error : Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
I will just get values[1,2,3] in my data frame df.
(The count of new column values can't out of the max index).
Now, what I want just like :
Is there a better way ?
Looking forward to your answer,thanks!

Use DataFrame.join with change Series name and right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is add reindex by new s.index:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col = s)

df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
new_col = pd.Series([1,2,3,4])
df = pd.concat([df,new_col],axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4

Pandas, merge rows only in same groups

Input DF:
df = pd.DataFrame({'A': ['one',np.nan,'two',np.nan],
'B': [np.nan,22,np.nan,44],
'group':[0,0,1,1]
})
print(df)
A B group
0 one NaN 0
1 NaN 22.0 0
2 two NaN 1
3 NaN 44.0 1
I want to merge those rows in one, all cells in one in same column. But taking into account groups.
Currently have:
df=df.agg(lambda x: ','.join(x.dropna().astype(str))
).to_frame().T
print(df)
A B group
0 one,two 22.0,44.0 0,0,1,1
but this way is taking all rows, not only groups
Expected Output:
A B
0 one 22.0
1 two 44.0

If possible simplify solution for first non missing value per group use:
df = df.groupby('group').first()
print(df)
A B
group
0 one 22.0
1 two 44.0
If not and need general solution:
df = pd.DataFrame({'A': ['one',np.nan,'two',np.nan],
'B': [np.nan,22,np.nan,44],
'group':[0,0,0,1]
})
def f(x):
return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))
df = df.set_index('group').groupby('group').apply(f).reset_index(level=1, drop=True).reset_index()
print(df)
group A B
0 0 one 22.0
1 0 two NaN
2 1 NaN 44.0

df_a = df.drop('B', axis=1).dropna()
df_b = df.drop('A', axis=1).dropna()
pd.merge(df_a, df_b, on='group')

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Outer merge between pandas and imputing NA with preceeding row - python

Related

concatenate 2 dataframes in order to append some columns

openpyxl concat three dataframes horizonally

How to subtract two dataframe column in dask dataframe?

Adding new columns to Pandas Data Frame which the length of new column value is bigger than length of index

Pandas, merge rows only in same groups

Categories

Resources