Row-wise concatenation and replacing NaN with common column values - python

Below is the input data
df1
A B C D E F G
Messi Forward Argentina 1 NaN 5 6
Ronaldo Defender Portugal NaN 4 NaN 3
Messi Midfield Argentina NaN 5 NaN 6
Ronaldo Forward Portugal 3 Nan 2 3
Mbappe Forward France 1 3 2 5
Below is the intended output
df
A B C D E F G
Messi Forward,Midfield Argentina 1 5 5 6
Ronaldo Forward,Defender Portugal 3 4 2 3
Mbappe Forward France 1 3 2 5
My try:
df.groupby(['A','C'])['B'].agg(','.join).reset_index()
df.fillna(method='ffill')
Do we have a better way to do this?

You can get the first non-missing value per group for all columns except A and C, and aggregate B by joining its values:
d = dict.fromkeys(df.columns.difference(['A','C']), 'first')
d['B'] = ','.join
df1 = df.groupby(['A','C'], sort=False, as_index=False).agg(d)
print (df1)
A C B D E F G
0 Messi Argentina Forward,Midfield 1.0 5.0 5.0 6
1 Ronaldo Portugal Defender,Forward 3.0 4.0 2.0 3
2 Mbappe France Forward 1.0 3.0 2.0 5
df1 = df.groupby(['A','C'], sort=False, as_index=False).agg(d).convert_dtypes()
print (df1)
A C B D E F G
0 Messi Argentina Forward,Midfield 1 5 5 6
1 Ronaldo Portugal Defender,Forward 3 4 2 3
2 Mbappe France Forward 1 3 2 5
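For reference, the approach above runs end-to-end as follows; the frame is reconstructed from the question's sample data (NaN where the table shows "Nan"):

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample data.
df = pd.DataFrame({
    'A': ['Messi', 'Ronaldo', 'Messi', 'Ronaldo', 'Mbappe'],
    'B': ['Forward', 'Defender', 'Midfield', 'Forward', 'Forward'],
    'C': ['Argentina', 'Portugal', 'Argentina', 'Portugal', 'France'],
    'D': [1, np.nan, np.nan, 3, 1],
    'E': [np.nan, 4, 5, np.nan, 3],
    'F': [5, np.nan, np.nan, 2, 2],
    'G': [6, 3, 6, 3, 5],
})

# 'first' keeps the first non-missing value per group; B is joined instead.
d = dict.fromkeys(df.columns.difference(['A', 'C']), 'first')
d['B'] = ','.join
out = df.groupby(['A', 'C'], sort=False, as_index=False).agg(d)
print(out)
```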

For a generic method without manually listing the columns, you can use the column types to decide whether to aggregate with ', '.join or 'first'; the grouping keys A and C are excluded so they are not joined with themselves:
from pandas.api.types import is_string_dtype

out = (df.groupby(['A', 'C'], as_index=False)
         .agg({c: ', '.join if is_string_dtype(df[c]) else 'first'
               for c in df if c not in ['A', 'C']})
      )
Output:
         A          C                  B    D    E    F  G
0   Mbappe     France            Forward  1.0  3.0  2.0  5
1    Messi  Argentina  Forward, Midfield  1.0  5.0  5.0  6
2  Ronaldo   Portugal  Defender, Forward  3.0  4.0  2.0  3

Calculate the average of two columns based on data availability (missing or NaN values) in pandas

I have df as shown below
df:
player goals_oct goals_nov
messi 2 4
neymar 2 NaN
ronaldo NaN 3
salah NaN NaN
levenoski 2 2
I would like to calculate the average goals scored by each player: the mean of goals_oct and goals_nov when both are available, otherwise the one available value, and NaN when both are missing.
Expected output
player goals_oct goals_nov avg_goals
messi 2 4 3
neymar 2 NaN 2
ronaldo NaN 3 3
salah NaN NaN NaN
levenoski 2 0 1
I tried the code below, but it does not work (Python's and cannot combine two boolean Series elementwise; that needs &):
conditions_g = [(df['goals_oct'].isnull() and df['goals_nov'].notnull()),
(df['goals_oct'].notnull() and df['goals_nov'].isnull())]
choices_g = [df['goals_nov'], df['goals_oct']]
df['avg_goals']=np.select(conditions_g, choices_g, default=(df['goals_oct']+df['goals_nov'])/2)
Simply use mean(axis=1). It will skip NaNs:
columns = df.columns[1:] # all columns except the first
df['avg_goal'] = df[columns].mean(axis=1)
Output:
>>> df
player goals_oct goals_nov avg_goal
0 messi 2.0 4.0 3.0
1 neymar 2.0 NaN 2.0
2 ronaldo NaN 3.0 3.0
3 salah NaN NaN NaN
4 levenoski 2.0 2.0 2.0
Try this; it will work:
df["avg_goals"] = np.where(df.goals_oct.isnull(),
np.where(df.goals_nov.isnull(), np.NaN, df.goals_nov),
np.where(df.goals_nov.isnull(), df.goals_oct, (df.goals_oct + df.goals_nov) / 2))
If you want to treat 0 as a missing value, convert 0 to np.NaN first; the statement above will still work.
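The mean(axis=1) answer can be checked end-to-end on the question's data; a minimal sketch:

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample data.
df = pd.DataFrame({
    'player': ['messi', 'neymar', 'ronaldo', 'salah', 'levenoski'],
    'goals_oct': [2, 2, np.nan, np.nan, 2],
    'goals_nov': [4, np.nan, 3, np.nan, 2],
})

# mean(axis=1) averages across each row and skips NaN by default, so a
# single available month becomes the average and an all-NaN row stays NaN.
df['avg_goals'] = df[['goals_oct', 'goals_nov']].mean(axis=1)
print(df)
```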

Combining the Same Column in Python

I want to combine the same columns. Here is an example:
Name X Name Y Name Z
0 Jack 5 Maria 8 John 12
1 Celine 14 Andrew 14 Jonathan 21
In the above example, I want to combine "Name" columns. It will be like this:
Name X Y Z
0 Jack 5 - -
1 Celine 14 - -
2 Maria - 8 -
3 Andrew - 14 -
4 John - - 12
5 Jonathan - - 21
Type: pandas.core.frame.DataFrame
Not the prettiest solution probably, but does the job. Open to improvements.
>>> df
Name X Name Y Name Z
0 Jack 5 Maria 8 John 12
1 Celine 14 Andrew 14 Jonathan 21
>>> pd.concat([df.iloc[:, i:i+2] for i in range(0, df.shape[1], 2)])
Name X Y Z
0 Jack 5.0 NaN NaN
1 Celine 14.0 NaN NaN
0 Maria NaN 8.0 NaN
1 Andrew NaN 14.0 NaN
0 John NaN NaN 12.0
1 Jonathan NaN NaN 21.0
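To reproduce the exact display asked for in the question (a fresh 0..5 index and '-' for missing values), the same concat can be extended with reset_index and fillna; a self-contained sketch using the sample data:

```python
import pandas as pd

# Rebuild the question's frame; duplicate 'Name' labels are allowed.
data = [['Jack', 5, 'Maria', 8, 'John', 12],
        ['Celine', 14, 'Andrew', 14, 'Jonathan', 21]]
df = pd.DataFrame(data, columns=['Name', 'X', 'Name', 'Y', 'Name', 'Z'])

# Slice the frame into (Name, value) pairs and stack them vertically;
# concat aligns the shared 'Name' column and fills the rest with NaN.
out = (pd.concat([df.iloc[:, i:i + 2] for i in range(0, df.shape[1], 2)])
         .reset_index(drop=True)
         .fillna('-'))
print(out)
```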

How to create null value if value is not the first occurrence based on three columns combination

I have a data frame with three columns and want to keep only the unique values in the last column, 'Cu', based on the combination of all three columns.
import pandas as pd
data = [['Alex','AL',10],['Bob','AB',15],['Clarke','CC',9],['Alex','Ac',11],['Bob','Ay',10],['Clarke','cv',13],['Alex','Ac',11],['Bob','Ay',13],['Clarke','cv',13]]
df = pd.DataFrame(data,columns=['Name','Cat','Cu'],dtype=float)
df
Out[460]:
Name Cat Cu
0 Alex AL 10.0
1 Bob AB 15.0
2 Clarke CC 9.0
3 Alex Ac 11.0
4 Bob Ay 10.0
5 Clarke cv 13.0
6 Alex Ac 11.0
7 Bob Ay 13.0
8 Clarke cv 13.0
For the above data frame I need to set the Cu value to zero whenever the combination is not its first occurrence: basically, identify unique values based on the three columns together while keeping all rows.
OUTPUT:
Name Cat Cu
0 Alex AL 10.0
1 Bob AB 15.0
2 Clarke CC 9.0
3 Alex Ac 11.0
4 Bob Ay 10.0
5 Clarke cv 13.0
6 Alex Ac 0
7 Bob Ay 13.0
8 Clarke cv 0
Use GroupBy.cumcount to number the rows within each (Name, Cat, Cu) group; a count greater than 0 marks a repeat:
df.loc[df.groupby(['Name', 'Cat', 'Cu']).cumcount().gt(0), 'Cu'] = 0
Name Cat Cu
0 Alex AL 10.0
1 Bob AB 15.0
2 Clarke CC 9.0
3 Alex Ac 11.0
4 Bob Ay 10.0
5 Clarke cv 13.0
6 Alex Ac 0.0
7 Bob Ay 13.0
8 Clarke cv 0.0

Pandas: Merging and comparing dataframes

I've got 3 DataFrames I would like to merge or join by "Label", and then be able to compare all the columns.
Examples of df are below:
df1
Label,col1,col2,col3
NF1,1,1,6
NF2,3,2,8
NF3,4,5,4
NF4,5,7,2
NF5,6,2,2
df2
Label,col1,col2,col3
NF1,8,4,5
NF2,4,7,8
NF3,9,7,8
df3
Label,col1,col2,col3
NF1,2,8,8
NF2,6,2,0
NF3,2,2,5
NF4,2,4,9
NF5,2,5,8
and what I'd like to see is similar to
Label,df1_col1,df2_col1,df3_col1,df1_col2,df2_col2,df3_col2,df1_col3,df2_col3,df3_col3
NF1,1,8,2,1,4,8,6,5,8
NF2,3,4,6,2,7,2,8,8,0
NF3,4,9,2,5,7,2,4,8,5
NF4,5,,2,7,,4,2,,9
NF5,6,,2,2,,5,2,,8
but I'm open to suggestions on how to make the comparison more readable.
Thanks!
Use concat with a list of DataFrames, add the keys parameter for prefixes, and sort by column names:
dfs = [df1, df2, df3]
k = ('df1','df2','df3')
df = (pd.concat([x.set_index('Label') for x in dfs], axis=1, keys=k)
.sort_index(axis=1, level=1)
.rename_axis('Label')
.reset_index())
df.columns = df.columns.map('_'.join).str.strip('_')
print (df)
  Label  df1_col1  df2_col1  df3_col1  df1_col2  df2_col2  df3_col2  df1_col3  df2_col3  df3_col3
0   NF1         1       8.0         2         1       4.0         8         6       5.0         8
1   NF2         3       4.0         6         2       7.0         2         8       8.0         0
2   NF3         4       9.0         2         5       7.0         2         4       8.0         5
3   NF4         5       NaN         2         7       NaN         4         2       NaN         9
4   NF5         6       NaN         2         2       NaN         5         2       NaN         8
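A self-contained version of this concat approach (flattening the MultiIndex columns before resetting the index), with the three frames rebuilt from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Label': ['NF1', 'NF2', 'NF3', 'NF4', 'NF5'],
                    'col1': [1, 3, 4, 5, 6], 'col2': [1, 2, 5, 7, 2],
                    'col3': [6, 8, 4, 2, 2]})
df2 = pd.DataFrame({'Label': ['NF1', 'NF2', 'NF3'],
                    'col1': [8, 4, 9], 'col2': [4, 7, 7], 'col3': [5, 8, 8]})
df3 = pd.DataFrame({'Label': ['NF1', 'NF2', 'NF3', 'NF4', 'NF5'],
                    'col1': [2, 6, 2, 2, 2], 'col2': [8, 2, 2, 4, 5],
                    'col3': [8, 0, 5, 9, 8]})

# keys become the outer level of a column MultiIndex; sorting by the inner
# level groups the three versions of each column together.
out = (pd.concat([x.set_index('Label') for x in (df1, df2, df3)],
                 axis=1, keys=('df1', 'df2', 'df3'))
         .sort_index(axis=1, level=1))
out.columns = out.columns.map('_'.join)
out = out.reset_index()
print(out)
```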
You can use df.merge:
In [1965]: res = df1.merge(df2, on='Label', how='left', suffixes=('_df1', '_df2')).merge(df3, on='Label', how='left').rename(columns={'col1': 'col1_df3','col2':'col2_df3','col3':'col3_df3'})
In [1975]: res = res.reindex(sorted(res.columns), axis=1)
In [1976]: res
Out[1965]:
Label col1_df1 col1_df2 col1_df3 col2_df1 col2_df2 col2_df3 col3_df1 col3_df2 col3_df3
0 NF1 1 8.00 2 1 4.00 8 6 5.00 8
1 NF2 3 4.00 6 2 7.00 2 8 8.00 0
2 NF3 4 9.00 2 5 7.00 2 4 8.00 5
3 NF4 5 nan 2 7 nan 4 2 nan 9
4 NF5 6 nan 2 2 nan 5 2 nan 8
We can use Pandas' join method, by setting the Label column as the index and joining the dataframes:
dfs = [df1,df2,df3]
keys = ['df1','df2','df3']
#set Label as index
df1, *others = [frame.set_index("Label").add_prefix(f"{prefix}_")
                for frame, prefix in zip(dfs, keys)]
#join df1 with others
outcome = df1.join(others,how='outer').rename_axis(index='Label').reset_index()
outcome
Label df1_col1 df1_col2 df1_col3 df2_col1 df2_col2 df2_col3 df3_col1 df3_col2 df3_col3
0 NF1 1 1 6 8.0 4.0 5.0 2 8 8
1 NF2 3 2 8 4.0 7.0 8.0 6 2 0
2 NF3 4 5 4 9.0 7.0 8.0 2 2 5
3 NF4 5 7 2 NaN NaN NaN 2 4 9
4 NF5 6 2 2 NaN NaN NaN 2 5 8

How to group, sort and calculate difference in this pandas dataframe?

I created this dataframe and need to group the data into categories with the same number of beds, the same city and the same number of baths, then sort the elements in each group by price (descending).
Secondly, I need to find the difference between each price and the next-ranked price within the same group.
For example the result should be like that:
1 bed, 1 bath, Madrid, 10
1 bed, 1 bath, Madrid, 8
1 bed, 1 bath, Madrid, 5
1 bed, 1 bath, Madrid, 1
I should get 2, 3, 4...
I tried some code, but the result seems far from what I expect to find...
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df
df['gap'] = df.sort_values('price',ascending=False).groupby(['city','beds','baths'])['price'].diff()
print (df)
Many thanks in advance.
I would use pd.to_numeric with errors='coerce' to convert the price column to numbers (the sample data contains the string '10'). I would then calculate the difference without taking into account rooms whose price is unknown (using DataFrame.dropna). Then I show the result both unordered and ordered:
df['price']=pd.to_numeric(df['price'],errors = 'coerce')
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['city', 'beds', 'baths'])['price'].diff(-1))
or using GroupBy.shift:
df['difference_price'] = df['price'].sub(df.dropna()
                                           .sort_values('price', ascending=False)
                                           .groupby(['city', 'beds', 'baths'])
                                           .price
                                           .shift(-1))
Display result
print(df, '\n' * 3, 'Sorted DataFrame: ')
print(df.sort_values(['city','beds','baths','price'],ascending = [True,True,True,False]))
Output
id city beds baths price difference_price
0 1 paris 1 2 10.0 4.0
1 2 madrid 2 2 8.0 1.0
2 3 madrid 2 2 11.0 3.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
5 6 madrid 2 1 7.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
8 9 madrid 1 4 NaN NaN
9 10 paris 2 1 3.0 NaN
10 11 madrid 2 2 7.0 NaN
11 12 paris 2 3 12.0 NaN
12 13 madrid 2 3 7.0 NaN
13 14 madrid 1 1 3.0 NaN
14 15 paris 1 1 3.0 NaN
15 16 madrid 1 1 4.0 1.0
16 17 paris 1 1 5.0 2.0
Sorted DataFrame:
id city beds baths price difference_price
15 16 madrid 1 1 4.0 1.0
13 14 madrid 1 1 3.0 NaN
8 9 madrid 1 4 NaN NaN
5 6 madrid 2 1 7.0 NaN
2 3 madrid 2 2 11.0 3.0
1 2 madrid 2 2 8.0 1.0
10 11 madrid 2 2 7.0 NaN
12 13 madrid 2 3 7.0 NaN
16 17 paris 1 1 5.0 2.0
14 15 paris 1 1 3.0 NaN
0 1 paris 1 2 10.0 4.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
9 10 paris 2 1 3.0 NaN
11 12 paris 2 3 12.0 NaN
If I understand correctly with:
group my data into category with the same number of beds, city, baths and sort(descending)
should all rows that do not fulfill the condition (where beds and baths are different) be deleted? This is my code to provide an answer given your problem:
import numpy as np
import pandas as pd
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df_new = df[df['beds'] == df['baths']]
df_new = df_new.sort_values(['city','price'],ascending=[False,False]).reset_index(drop=True)
df_new['diff_price'] = df_new.groupby(['city','beds','baths'])['price'].diff(-1)
print(df_new)
Output:
id city beds baths price diff_price
0 17 paris 1 1 5 NaN
1 15 paris 1 1 3 -2
2 3 madrid 2 2 11 NaN
3 2 madrid 2 2 8 -3
4 11 madrid 2 2 7 -1
5 16 madrid 1 1 4 NaN
6 14 madrid 1 1 3 -1
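For the question's own mini example (1 bed, 1 bath, Madrid, prices 10, 8, 5, 1), the grouped diff(-1) pattern can be checked in isolation; the expected gaps are 2, 3 and 4:

```python
import pandas as pd

# The mini example from the question.
df = pd.DataFrame({'city': ['madrid'] * 4, 'beds': [1] * 4,
                   'baths': [1] * 4, 'price': [10, 8, 5, 1]})

# After sorting descending, diff(-1) subtracts the next-ranked price
# within each (city, beds, baths) group; the cheapest row gets NaN.
df = df.sort_values('price', ascending=False)
df['gap'] = df.groupby(['city', 'beds', 'baths'])['price'].diff(-1)
print(df)
```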
