Pandas, merge rows only in same groups - python

Input DF:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['one', np.nan, 'two', np.nan],
                   'B': [np.nan, 22, np.nan, 44],
                   'group': [0, 0, 1, 1]})
print(df)
     A     B  group
0  one   NaN      0
1  NaN  22.0      0
2  two   NaN      1
3  NaN  44.0      1
I want to merge these rows into one row per group, combining the non-missing cells within each column.
Currently have:
df = df.agg(lambda x: ','.join(x.dropna().astype(str))).to_frame().T
print(df)
         A          B    group
0  one,two  22.0,44.0  0,0,1,1
but this aggregates over all rows at once instead of per group.
Expected Output:
     A     B
0  one  22.0
1  two  44.0

If a simplified solution is possible, i.e. you only need the first non-missing value per group, use:
df = df.groupby('group').first()
print(df)
         A     B
group
0      one  22.0
1      two  44.0
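If the comma-joined format from the question's attempt is actually wanted, the same aggregation can be applied per group instead of over the whole frame (a sketch reusing the question's own lambda):

out = df.groupby('group').agg(lambda x: ','.join(x.dropna().astype(str)))
print(out)
#          A     B
# group
# 0      one  22.0
# 1      two  44.0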
If not, and a general solution is needed (note group is now [0, 0, 0, 1], so group 0 has two non-missing A values):
df = pd.DataFrame({'A': ['one', np.nan, 'two', np.nan],
                   'B': [np.nan, 22, np.nan, 44],
                   'group': [0, 0, 0, 1]})
def f(x):
    # Collapse each column's non-missing values to the top of the group;
    # shorter columns are padded with NaN.
    return x.apply(lambda y: pd.Series(y.dropna().to_numpy()))

df = (df.set_index('group')
        .groupby('group')
        .apply(f)
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
   group    A     B
0      0  one  22.0
1      0  two   NaN
2      1  NaN  44.0

# Alternative: split into per-column frames, drop the NaN rows, then merge back on group.
df_a = df.drop('B', axis=1).dropna()
df_b = df.drop('A', axis=1).dropna()
pd.merge(df_a, df_b, on='group')
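For reference (not shown in the original answer), with the [0, 0, 0, 1] groups above this inner merge should pair every non-missing A with every non-missing B inside each group, and drop group 1 entirely because it has no A value:

   group    A     B
0      0  one  22.0
1      0  two  22.0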

Convert a dict with keys and lists of values into a pandas DataFrame where the values are column names

Given a dict like this:
d = {'paris': ['a', 'b'],
     'brussels': ['b', 'c'],
     'mallorca': ['a', 'd']}
# when doing:
df = pd.DataFrame(d)
df.T
I don't get the expected result. What I would like is a one-hot-encoding DataFrame, in which the columns are the cities and the value 1 or 0 indicates whether each letter appears in that city's list (paris, mallorca, etc.).
The desired result is:
df = pd.DataFrame([[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]],
                  index=['paris', 'brussels', 'mallorca'], columns=list('abcd'))
df.T
Any clever way to do this without having to multiloop over the first dict to transform it into another one?
Solution 1:
Combine df.apply with Series.value_counts, then use df.fillna to fill the NaN values with zeros.
out = df.apply(pd.Series.value_counts).fillna(0)
print(out)
   paris  brussels  mallorca
a    1.0       0.0       1.0
b    1.0       1.0       0.0
c    0.0       1.0       0.0
d    0.0       0.0       1.0
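If integer 0/1 flags are preferred over the floats left by fillna, the result can be cast afterwards (a small follow-up, not part of the original answer):

out = out.astype(int)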
Solution 2:
Transform your df using df.melt and then use the result inside pd.crosstab.
Again use df.fillna to change NaN values to zeros. Finally, reorder the columns based on the order in the original df.
out = df.melt(value_name='index')
out = pd.crosstab(index=out['index'], columns=out['variable'])\
        .fillna(0).loc[:, df.columns]
print(out)
       paris  brussels  mallorca
index
a          1         0         1
b          1         1         0
c          0         1         0
d          0         0         1
I don't know how 'clever' my solution is, but it works and it is pretty concise and readable.
import pandas as pd

d = {'paris': ['a', 'b'],
     'brussels': ['b', 'c'],
     'mallorca': ['a', 'd']}
df = pd.DataFrame(d).T
df.columns = ['0', '1']
df = pd.concat([df['0'], df['1']])
df = pd.crosstab(df, columns=df.index)
print(df)
print(df)
Yields:
   brussels  mallorca  paris
a         0         1      1
b         1         0      1
c         1         0      0
d         0         1      0
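A further sketch (not from any of the answers above) that reaches the same table via pd.get_dummies, starting again from the dict:

import pandas as pd

d = {'paris': ['a', 'b'], 'brussels': ['b', 'c'], 'mallorca': ['a', 'd']}
df = pd.DataFrame(d)

long = df.melt(var_name='city', value_name='letter')  # one (city, letter) pair per row
out = pd.get_dummies(long['letter']).groupby(long['city']).max().astype(int).T
out = out[df.columns]  # restore the original column order
print(out)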

openpyxl: concat three dataframes horizontally

I am using openpyxl to edit three dataframes, df1, df2, df3 (if necessary, we can also regard them as three independent Excel files):
import pandas as pd

data1 = [[1, 1], [1, 1]]
df1 = pd.DataFrame(data1, index=['I1a', 'I1b'], columns=['v1a', 'v1b'])
df1.index.name = 'I1'

data2 = [[2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2]]
df2 = pd.DataFrame(data2, index=['I2a', 'I2b', 'I2c', 'I2d'],
                   columns=['v2a', 'v2b', 'v2c', 'v2d'])
df2.index.name = 'I2'

data3 = [['a', 'b', 3, 3], ['a', 'c', 3, 3], ['b', 'c', 3, 3], ['c', 'd', 3, 3]]
df3 = pd.DataFrame(data3, columns=['v3a', 'v3b', 'v3c', 'v3d'])
df3 = df3.groupby(['v3a', 'v3b']).first()
Here df3 has a MultiIndex. How can I concat them horizontally into one Excel file (each dataframe starting at the same row)? Here we regard the index as a column, and for the MultiIndex we keep repeated first-level values hidden.
Update
IIUC:
>>> pd.concat([df1.reset_index(), df2.reset_index(), df3.reset_index()], axis=1)
    I1  v1a  v1b   I2  v2a  v2b  v2c  v2d v3a v3b  v3c  v3d
0  I1a  1.0  1.0  I2a    2    2    2    2   a   b    3    3
1  I1b  1.0  1.0  I2b    2    2    2    2   a   c    3    3
2  NaN  NaN  NaN  I2c    2    2    2    2   b   c    3    3
3  NaN  NaN  NaN  I2d    2    2    2    2   c   d    3    3
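To actually write the three frames side by side into one sheet, here is a minimal sketch using pandas' ExcelWriter and the startcol parameter of to_excel (the file name and one-column gap are arbitrary choices, not from the original answer; reset_index writes the MultiIndex as plain columns, so hiding repeated first-level values would need extra openpyxl styling):

import pandas as pd

with pd.ExcelWriter('combined.xlsx', engine='openpyxl') as writer:
    col = 0
    for frame in (df1, df2, df3):
        # reset_index turns the (multi)index into ordinary columns
        flat = frame.reset_index()
        flat.to_excel(writer, sheet_name='Sheet1', startrow=0, startcol=col, index=False)
        col += flat.shape[1] + 1  # leave one empty column between frames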
Old answer
Assuming you know the start row, you can use pandas to remove extra columns:
import pandas as pd
df = pd.read_excel('input.xlsx', header=0, skiprows=3).dropna(how='all', axis=1)
df.to_excel('output.xlsx', index=False)
(Input and output screenshots omitted.)

How to subtract two dataframe columns in a dask dataframe?

I have two dataframes, df1 and df2. In df1 I have 4 columns (A, B, C, D) and two rows, and in df2 I have the same 4 columns and two rows.
Now I want to subtract the two dataframes column by column, like df1['A'] - df2['A'] and so on, but I don't know how to do it.
(Screenshots of df1 and df2 omitted.)
Just do the subtraction, but keep the indexes in mind. For example, let's say I have df1 and df2 with the same columns but different indexes:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = dd.from_array(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
df2 = dd.from_pandas(pd.DataFrame(np.arange(8).reshape(2, 4),
                                  columns=['A', 'B', 'C', 'D'],
                                  index=[1, 2]),
                     npartitions=1)
Then:
(df1 - df2).compute()
#      A    B    C    D
# 0  NaN  NaN  NaN  NaN
# 1  4.0  4.0  4.0  4.0
# 2  NaN  NaN  NaN  NaN
On the other hand, let's match df2's index to df1's and then subtract:
df2 = df2.assign(idx=1)
df2 = df2.set_index(df2.idx.cumsum() - 1)  # builds a global 0..n-1 index across partitions
df2 = df2.drop(columns=['idx'])
(df1 - df2).compute()
#    A  B  C  D
# 0  0  0  0  0
# 1  0  0  0  0
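With the indexes aligned, the per-column subtraction the question asks about works the same way (a small follow-up, not in the original answer):

diff_a = (df1['A'] - df2['A']).compute()  # pandas Series of A-column differences
print(diff_a)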

Outer merge between pandas dataframes and imputing NA with the preceding row

I have two dataframes containing the same columns:
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 3, 4],
                    'b': [2, 4, 5]})
I want df2 to have the same number of rows as df1. Any values of a present in df1 but missing from df2 should be copied over, and the corresponding values of b should be taken from the preceding row.
In other words, I want df2 to look like this:
df2 = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 2, 4, 5, 5]})
EDIT: I'm looking for an answer that is independent of the number of columns
Use DataFrame.merge with only column a from df1, then forward fill the missing values introduced by the left join:
df = df1[['a']].merge(df2, how='left', on='a').ffill()
print(df)
   a    b
0  1  2.0
1  2  2.0
2  3  4.0
3  4  5.0
4  5  5.0
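Because the left join introduces NaN before the fill, b comes back as float; if the original integer dtype matters, it can be cast afterwards (a follow-up note, assuming no values remain missing):

df['b'] = df['b'].astype(int)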
Or use merge_asof:
df = pd.merge_asof(df1[['a']], df2, on='a')
print(df)
   a  b
0  1  2
1  2  2
2  3  4
3  4  5
4  5  5
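One caveat worth noting: pd.merge_asof requires both frames to be sorted by the key column, so for unsorted data a sort comes first (a sketch, not from the original answer):

df = pd.merge_asof(df1[['a']].sort_values('a'),
                   df2.sort_values('a'),
                   on='a')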

Adding new columns to a pandas DataFrame when the length of the new column's values is bigger than the length of the index

I'm having trouble adding a new column to a pandas dataframe when the length of the new column's values is bigger than the length of the index.
The data may look like this:
import pandas as pd

df = pd.DataFrame({
    "bar": ["A", "B", "C"],
    "zoo": [1, 2, 3],
})
So, as you can see, the length of this df's index is 3.
Next I want to add a new column, which may go one of these two ways:
df["new_col"] = [1,2,3,4]
It'll raise an error: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
I will just get the values [1, 2, 3] in my dataframe df; values beyond the last existing index label are silently dropped.
What I want instead is a 4-row frame in which the missing bar and zoo values become NaN.
Is there a better way?
Looking forward to your answer, thanks!
Use DataFrame.join with a renamed Series and a right join:
# if not default index
# df = df.reset_index(drop=True)
df = df.join(pd.Series([1, 2, 3, 4]).rename('new_col'), how='right')
print(df)
   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
Another idea is to reindex df by the new Series' index:
s = pd.Series([1, 2, 3, 4])
df = df.reindex(s.index)
df["new_col"] = s
print(df)
   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
The same as a one-liner:
s = pd.Series([1, 2, 3, 4])
df = df.reindex(s.index).assign(new_col=s)
df = pd.DataFrame({
    "bar": ["A", "B", "C"],
    "zoo": [1, 2, 3],
})
new_col = pd.Series([1, 2, 3, 4])
df = pd.concat([df, new_col], axis=1)
print(df)
   bar  zoo  0
0    A  1.0  1
1    B  2.0  2
2    C  3.0  3
3  NaN  NaN  4
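Since concat labels an unnamed Series 0, giving it a name up front yields a proper column header (a small refinement of the answer above):

new_col = pd.Series([1, 2, 3, 4], name='new_col')
df = pd.concat([df, new_col], axis=1)  # the column now appears as 'new_col'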
