Combining the Same Columns in Python

I want to combine the same columns. Here is an example:
Name X Name Y Name Z
0 Jack 5 Maria 8 John 12
1 Celine 14 Andrew 14 Jonathan 21
In the above example, I want to combine the "Name" columns. The result should look like this:
Name X Y Z
0 Jack 5 - -
1 Celine 14 - -
2 Maria - 8 -
3 Andrew - 14 -
4 John - - 12
5 Jonathan - - 21
Type: pandas.core.frame.DataFrame
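For reference, duplicate column labels are legal in pandas, so the example input can be reconstructed like this (an assumed construction, since the original source is not shown):
import pandas as pd

df = pd.DataFrame(
    [['Jack', 5, 'Maria', 8, 'John', 12],
     ['Celine', 14, 'Andrew', 14, 'Jonathan', 21]],
    columns=['Name', 'X', 'Name', 'Y', 'Name', 'Z'],
)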

Probably not the prettiest solution, but it does the job. Open to improvements.
>>> df
Name X Name Y Name Z
0 Jack 5 Maria 8 John 12
1 Celine 14 Andrew 14 Jonathan 21
>>> pd.concat([df.iloc[:, i:i+2] for i in range(0, df.shape[1], 2)])
Name X Y Z
0 Jack 5.0 NaN NaN
1 Celine 14.0 NaN NaN
0 Maria NaN 8.0 NaN
1 Andrew NaN 14.0 NaN
0 John NaN NaN 12.0
1 Jonathan NaN NaN 21.0
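If you also want the fresh 0-5 index (and, optionally, the dash placeholders) shown in the desired output, a small sketch extending the same idea:
out = pd.concat(
    [df.iloc[:, i:i + 2] for i in range(0, df.shape[1], 2)],
    ignore_index=True,  # renumber rows 0..5 instead of repeating 0, 1
)
out = out.fillna('-')   # optional: show '-' instead of NaN (turns numeric columns into object dtype)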


Row-wise concatenation and replacing NaN with common column values

Below is the input data:
df
A B C D E F G
Messi Forward Argentina 1 NaN 5 6
Ronaldo Defender Portugal NaN 4 NaN 3
Messi Midfield Argentina NaN 5 NaN 6
Ronaldo Forward Portugal 3 NaN 2 3
Mbappe Forward France 1 3 2 5
Below is the intended output:
A B C D E F G
Messi Forward,Midfield Argentina 1 5 5 6
Ronaldo Forward,Defender Portugal 3 4 2 3
Mbappe Forward France 1 3 2 5
My try:
df.groupby(['A','C'])['B'].agg(','.join).reset_index()
df.fillna(method='ffill')
Is there a better way to do this?
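For reference, one way to reconstruct the input from the printed table (an assumption, since the original source is not shown):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['Messi', 'Ronaldo', 'Messi', 'Ronaldo', 'Mbappe'],
    'B': ['Forward', 'Defender', 'Midfield', 'Forward', 'Forward'],
    'C': ['Argentina', 'Portugal', 'Argentina', 'Portugal', 'France'],
    'D': [1, np.nan, np.nan, 3, 1],
    'E': [np.nan, 4, 5, np.nan, 3],
    'F': [5, np.nan, np.nan, 2, 2],
    'G': [6, 3, 6, 3, 5],
})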
You can aggregate every column other than A and C with 'first' (which takes the first non-missing value per group), and aggregate B by joining its values:
d = dict.fromkeys(df.columns.difference(['A','C']), 'first')
d['B'] = ','.join
df1 = df.groupby(['A','C'], sort=False, as_index=False).agg(d)
print (df1)
A C B D E F G
0 Messi Argentina Forward,Midfield 1.0 5.0 5.0 6
1 Ronaldo Portugal Defender,Forward 3.0 4.0 2.0 3
2 Mbappe France Forward 1.0 3.0 2.0 5
If you prefer integer dtypes instead of floats in the result, chain convert_dtypes:
df1 = df.groupby(['A','C'], sort=False, as_index=False).agg(d).convert_dtypes()
print (df1)
A C B D E F G
0 Messi Argentina Forward,Midfield 1 5 5 6
1 Ronaldo Portugal Defender,Forward 3 4 2 3
2 Mbappe France Forward 1 3 2 5
For a generic method that avoids listing the columns manually, you can use the column dtypes to decide whether to aggregate with ', '.join or 'first'. The grouping columns must be excluded from the dict, otherwise their values get joined as well:
from pandas.api.types import is_string_dtype
out = (df.groupby(['A', 'C'], as_index=False)
         .agg({c: ', '.join if is_string_dtype(df[c]) else 'first'
               for c in df.columns.difference(['A', 'C'])})
      )
Output:
A C B D E F G
0 Mbappe France Forward 1.0 3.0 2.0 5
1 Messi Argentina Forward, Midfield 1.0 5.0 5.0 6
2 Ronaldo Portugal Defender, Forward 3.0 4.0 2.0 3

How to group, sort and calculate difference in this pandas dataframe?

I created this dataframe and need to group the data into categories with the same number of beds and baths and the same city, then sort the elements in each group by price (descending).
Secondly, I need to find the difference between each price and the one ranked just after it within the same group.
For example, for this group:
1 bed, 1 bath, Madrid, 10
1 bed, 1 bath, Madrid, 8
1 bed, 1 bath, Madrid, 5
1 bed, 1 bath, Madrid, 1
I should get 2, 3, 4...
I tried some code, but it seems far from what I expect to find...
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df
df['gap'] = df.sort_values('price',ascending=False).groupby(['city','beds','baths'])['price'].diff()
print (df)
Many thanks in advance.
I would use pd.to_numeric with errors='coerce' to get rid of the strings in the price column, then calculate the difference while leaving out the rooms whose price is unknown (using DataFrame.dropna). Below, the result is shown both in the original order and sorted:
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['city','beds','baths'])['price'].diff(-1))
or using GroupBy.shift:
df['difference_price'] = df['price'].sub(df.dropna()
                                           .sort_values('price', ascending=False)
                                           .groupby(['city','beds','baths'])
                                           .price
                                           .shift(-1))
Display the result:
print(df, '\n'*3, 'Sorted DataFrame:')
print(df.sort_values(['city','beds','baths','price'], ascending=[True,True,True,False]))
Output
id city beds baths price difference_price
0 1 paris 1 2 10.0 4.0
1 2 madrid 2 2 8.0 1.0
2 3 madrid 2 2 11.0 3.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
5 6 madrid 2 1 7.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
8 9 madrid 1 4 NaN NaN
9 10 paris 2 1 3.0 NaN
10 11 madrid 2 2 7.0 NaN
11 12 paris 2 3 12.0 NaN
12 13 madrid 2 3 7.0 NaN
13 14 madrid 1 1 3.0 NaN
14 15 paris 1 1 3.0 NaN
15 16 madrid 1 1 4.0 1.0
16 17 paris 1 1 5.0 2.0
Sorted DataFrame:
id city beds baths price difference_price
15 16 madrid 1 1 4.0 1.0
13 14 madrid 1 1 3.0 NaN
8 9 madrid 1 4 NaN NaN
5 6 madrid 2 1 7.0 NaN
2 3 madrid 2 2 11.0 3.0
1 2 madrid 2 2 8.0 1.0
10 11 madrid 2 2 7.0 NaN
12 13 madrid 2 3 7.0 NaN
16 17 paris 1 1 5.0 2.0
14 15 paris 1 1 3.0 NaN
0 1 paris 1 2 10.0 4.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
9 10 paris 2 1 3.0 NaN
11 12 paris 2 3 12.0 NaN
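A note on why assigning the result computed on df.dropna() back to df works: pandas aligns on the index during column assignment, so the rows that were dropped simply receive NaN. A minimal sketch of that alignment behaviour (toy values, not from the question):
tmp = pd.DataFrame({'x': [1, 2, 3]})             # index 0, 1, 2
partial = pd.Series([10.0, 30.0], index=[0, 2])  # index 1 is missing
tmp['y'] = partial                               # aligns on index, so row 1 gets NaN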
If I understand correctly, with:
group my data into category with the same number of beds, city, baths and sort(descending)
you mean that all rows where beds and baths differ should be deleted? This is my code to answer the problem read that way:
import numpy as np
import pandas as pd
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df_new = df[df['beds'] == df['baths']]
df_new = df_new.sort_values(['city','price'],ascending=[False,False]).reset_index(drop=True)
df_new['diff_price'] = df_new.groupby(['city','beds','baths'])['price'].diff(-1)
print(df_new)
Output:
id city beds baths price diff_price
0 17 paris 1 1 5 2.0
1 15 paris 1 1 3 NaN
2 3 madrid 2 2 11 3.0
3 2 madrid 2 2 8 1.0
4 11 madrid 2 2 7 NaN
5 16 madrid 1 1 4 1.0
6 14 madrid 1 1 3 NaN

Groupby and find similar or same items in two columns in Python

For the data frame below, if the string in name2 is approximately similar to, or the same as, a string in name1 within each group of type, return Y; otherwise return N.
id type name1 name2
0 1 A James B. James
1 2 A Keras Steven
2 3 A NaN Keras
3 4 B Jack Lucy
4 5 B Lucy Jack
5 6 C Jasica Hoverd
6 7 C Steven Jasica
7 8 C NaN Steven L.
The expected result is like this. For example, in type A, James from name2 has a similar value, James B., in name1, and Keras has the same value in both name1 and name2, so both return Y in the result; Steven does not exist in name1, so it returns N.
id type name1 name2 result
0 1 A James B. James Y
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. Y
Could someone help to do that? Thank you.
If finding similar values is too complicated to realize, then finding only identical values and returning Y will be OK.
Without similarity it is simpler:
import numpy as np

mask = df.groupby('type', group_keys=False).apply(lambda x: x['name2'].isin(x['name1']))
df['new'] = np.where(mask, 'Y','N')
print (df)
id type name1 name2 new
0 1 A James B. James N
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. N
Exact matching returns N for row 0 because 'James' is not literally equal to 'James B.'. For basic similarity, compare only the first token of each name after splitting:
mask = (df.assign(name1 = df['name1'].fillna('|').astype(str).str.split().str[0],
                  name2 = df['name2'].astype(str).str.split().str[0])
          .groupby('type', group_keys=False)
          .apply(lambda x: x['name2'].isin(x['name1'])))
df['new'] = np.where(mask, 'Y','N')
print (df)
id type name1 name2 new
0 1 A James B. James Y
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. Y
For better matching, you can use SequenceMatcher to compute a similarity ratio and filter it by a threshold, e.g. 0.5 here:
from difflib import SequenceMatcher
def f(x):
    comp = [any(SequenceMatcher(None, a, b).ratio() > .5
                for a in x['name1'].fillna('_'))
            for b in x['name2']]
    return pd.Series(comp, index=x.index)
mask = df.groupby('type', group_keys=False).apply(f)
df['new'] = np.where(mask, 'Y','N')
print (df)
id type name1 name2 new
0 1 A James B. James Y
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. Y
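To pick a sensible threshold, it helps to inspect a few ratios directly, e.g.:
from difflib import SequenceMatcher

SequenceMatcher(None, 'James B.', 'James').ratio()  # ~0.77, above 0.5, counts as similar
SequenceMatcher(None, 'Hoverd', 'Jasica').ratio()   # 0.0, no match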
Another approach: per group, test whether the first token of each name2 value appears anywhere in the joined name1 strings:
df['result'] = pd.DataFrame(
    df.groupby('type')
      .apply(lambda x: ['Y' if i in ' '.join(x['name1'].astype(str)) else 'N'
                        for i in list(x['name2'].dropna().str.split().str[0])])
      .tolist()
).stack().reset_index(drop=True)
Output
id type name1 name2 result
0 1 A James B. James Y
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. Y

How to melt a dataframe and get the column name in a field of the melted dataframe

I have a df as below
name 0 1 2 3 4
0 alex NaN NaN aa bb NaN
1 mike NaN rr NaN NaN NaN
2 rachel ss NaN NaN NaN ff
3 john NaN ff NaN NaN NaN
the melt function should return the following:
name code
0 alex 2
1 alex 3
2 mike 1
3 rachel 0
4 rachel 4
5 john 1
Any suggestion is helpful. Thanks.
Just follow these steps: melt, dropna, sort by name, reset the index, and finally drop the unwanted columns:
In [1171]: df.melt(['name'], var_name='code').dropna().sort_values('name').reset_index().drop(columns=['index', 'value'])
Out[1171]:
name code
0 alex 2
1 alex 3
2 john 1
3 mike 1
4 rachel 0
5 rachel 4
This should work — set name as the index, unstack, then clean up:
(df.set_index('name')
   .unstack()
   .reset_index()
   .rename(columns={'level_0': 'Code'})
   .dropna()
   .drop(0, axis=1)[['name', 'Code']]
   .sort_values('name'))
The output will be:
name Code
alex 2
alex 3
john 1
mike 1
rachel 0
rachel 4
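A stack-based variant may also be worth noting (a sketch, not from the original answers): DataFrame.stack drops NaN by default, which removes the need for an explicit dropna:
out = (df.set_index('name')
         .stack()  # NaN entries are dropped by default
         .reset_index()
         .rename(columns={'level_1': 'code'})
         [['name', 'code']]
         .sort_values('name', ignore_index=True))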

pandas - Add values of two or more different DataFrames through a list

I'm looking to add values between three or more DataFrames through a list instead of doing them one by one.
First, I'll use merge as an example.
The following line merges DataFrames (data0, data1, data2) one by one:
final_data = data0.merge(data1, on=['player_id', 'player_name'])
final_data = final_data.merge(data2, on=['player_id', 'player_name'])
However, I could instead merge the DataFrames through a list, which helps significantly when dealing with more DataFrames, such as this:
data_list = [data0, data1, data2]
final_data = reduce(lambda left, right: pd.merge(left, right, on=['player_id', 'player_name']), data_list)
So now I have the three following DataFrames, and I would like to add the values between them.
data0:
player_id player_name ab run hit
0 28920 S. Smith 0 0 0
1 33351 T. Mancini 0 0 0
2 30267 C. Gentry 0 0 0
3 28513 A. Jones 0 0 0
4 31097 M. Machado 0 0 0
5 29170 C. Davis 0 0 0
6 29322 M. Trumbo 0 0 0
7 29564 W. Castillo 0 0 0
8 34885 H. Kim 0 0 0
9 32952 J. Rickard 0 0 0
10 31988 J. Schoop 0 0 0
11 5908 J.J. Hardy 0 0 0
Next,
data1:
player_id player_name ab run hit
0 28920 S. Smith 1 4 6
1 33351 T. Mancini 0 0 2
2 28513 A. Jones 2 1 0
3 31097 M. Machado 1 8 0
4 34885 H. Kim 1 1 2
5 32952 J. Rickard 0 2 0
6 31988 J. Schoop 5 3 4
7 5908 J.J. Hardy 4 2 10
And next,
data2:
player_id player_name ab run hit
0 28920 S. Smith 1 9 2
1 31097 M. Machado 3 3 3
2 29170 C. Davis 9 6 4
3 29322 M. Trumbo 3 5 7
4 32952 J. Rickard 1 3 4
5 5908 J.J. Hardy 0 0 5
The final DataFrame I am looking to get should look like this:
final_data:
player_id player_name ab run hit
0 28920 S. Smith 2 13 8
1 33351 T. Mancini 0 0 2
2 30267 C. Gentry 0 0 0
3 28513 A. Jones 2 1 0
4 31097 M. Machado 4 11 3
5 29170 C. Davis 9 6 4
6 29322 M. Trumbo 3 5 7
7 29564 W. Castillo 0 0 0
8 34885 H. Kim 1 1 2
9 32952 J. Rickard 1 5 4
10 31988 J. Schoop 5 3 4
11 5908 J.J. Hardy 4 2 15
I could get the result through the following code, but that adds the DataFrames one by one.
data0 = pd.read_csv('initial_df.csv')
data1 = pd.read_csv('add_vals1.csv')
data2 = pd.read_csv('add_vals2.csv')
data0 = data0.set_index(['player_id', 'player_name'])
data1 = data1.set_index(['player_id', 'player_name'])
data2 = data2.set_index(['player_id', 'player_name'])
final_data = data0.add(data1, fill_value=0).astype(int).reset_index()
final_data = final_data.set_index(['player_id', 'player_name'])
final_data = final_data.add(data2, fill_value=0).astype(int).reset_index()
Could anyone please help to get the final result through a list as I did with the merge function up on top? Thank you so much!
I believe you need the index_col parameter of read_csv to build a MultiIndex, and then reduce with add:
from functools import reduce
data0 = pd.read_csv('initial_df.csv', index_col=['player_id', 'player_name'])
data1 = pd.read_csv('add_vals1.csv', index_col=['player_id', 'player_name'])
data2 = pd.read_csv('add_vals2.csv', index_col=['player_id', 'player_name'])
data_list = [data0, data1, data2]
final_data = reduce(lambda x, y: x.add(y, fill_value=0), data_list).reset_index()
print (final_data)
player_id player_name ab run hit
0 5908 J.J. Hardy 4.0 2.0 15.0
1 28513 A. Jones 2.0 1.0 0.0
2 28920 S. Smith 2.0 13.0 8.0
3 29170 C. Davis 9.0 6.0 4.0
4 29322 M. Trumbo 3.0 5.0 7.0
5 29564 W. Castillo 0.0 0.0 0.0
6 30267 C. Gentry 0.0 0.0 0.0
7 31097 M. Machado 4.0 11.0 3.0
8 31988 J. Schoop 5.0 3.0 4.0
9 32952 J. Rickard 1.0 5.0 4.0
10 33351 T. Mancini 0.0 0.0 2.0
11 34885 H. Kim 1.0 1.0 2.0
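If the frames are already in memory rather than being read from CSV, you can set the index inside the list and reduce the same way (a sketch using the same variable names; astype(int) undoes the float upcast caused by fill_value):
from functools import reduce

data_list = [d.set_index(['player_id', 'player_name']) for d in (data0, data1, data2)]
final_data = (reduce(lambda x, y: x.add(y, fill_value=0), data_list)
              .astype(int)
              .reset_index())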
Another solution with concat, summing over both index levels (DataFrame.sum(level=...) was deprecated and later removed, so group by the index levels instead):
data_list = [data0, data1, data2]
final_data = pd.concat(data_list).groupby(level=[0, 1], sort=False).sum().reset_index()
print (final_data)
player_id player_name ab run hit
0 28920 S. Smith 2 13 8
1 33351 T. Mancini 0 0 2
2 30267 C. Gentry 0 0 0
3 28513 A. Jones 2 1 0
4 31097 M. Machado 4 11 3
5 29170 C. Davis 9 6 4
6 29322 M. Trumbo 3 5 7
7 29564 W. Castillo 0 0 0
8 34885 H. Kim 1 1 2
9 32952 J. Rickard 1 5 4
10 31988 J. Schoop 5 3 4
11 5908 J.J. Hardy 4 2 15
