Let's suppose I have a dataframe:
import pandas as pd
import numpy as np
a = [['A',np.nan,2,'x|x|x|y'],['B','a|b',56,'b|c'],['C','c|e|e',65,'f|g'],['D','h',98,'j'],['E','g',98,'k|h'],['F','a|a|a|a|a|b',98,np.nan],['G','w',98,'p'],['H','s',98,'t|u']]
df1 = pd.DataFrame(a, columns=['1', '2','3','4'])
df1
1 2 3 4
0 A NaN 2 x|x|x|y
1 B a|b 56 b|c
2 C c|e|e 65 f|g
3 D h 98 j
4 E g 98 k|h
5 F a|a|a|a|a|b 98 NaN
6 G w 98 p
7 H s 98 t|u
and another dataframe:
a = [['x'],['b'],['h'],['v']]
df2 = pd.DataFrame(a, columns=['1'])
df2
1
0 x
1 b
2 h
3 v
I want to compare column 1 of df2 with columns 2 and 4 of df1 (splitting the latter on "|"). If a df2 value matches either or both of columns 2 and 4 after splitting, I want to extract those rows of df1 into another dataframe, with an added column holding the df2 value that matched.
For example, the result would look something like this:
1 2 3 4 5
0 A NaN 2 x|x|x|y x
1 B a|b 56 b|c b
2 F a|a|a|a|a|b 98 NaN b
3 D h 98 j h
4 E g 98 k|h h
The solution: join the values of both columns into one Series with DataFrame.agg, split on "|" with Series.str.split, keep only matching values with DataFrame.where plus DataFrame.isin, join the survivors back together without NaNs, and finally filter out rows left with an empty string:
df11 = df1[['2','4']].fillna('').agg('|'.join, 1).str.split('|', expand=True)
df1['5'] = (df11.where(df11.isin(df2['1'].tolist()))
                .apply(lambda x: ','.join(set(x.dropna())), axis=1))
df1 = df1[df1['5'].ne('')]
print (df1)
1 2 3 4 5
0 A NaN 2 x|x|x|y x
1 B a|b 56 b|c b
3 D h 98 j h
4 E g 98 k|h h
5 F a|a|a|a|a|b 98 NaN b
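An alternative sketch (not part of the answer above) that reaches the same rows by exploding the split tokens into long form and filtering with isin; the reshaping steps and variable names here are my own:

```python
import numpy as np
import pandas as pd

a = [['A', np.nan, 2, 'x|x|x|y'], ['B', 'a|b', 56, 'b|c'],
     ['C', 'c|e|e', 65, 'f|g'], ['D', 'h', 98, 'j'],
     ['E', 'g', 98, 'k|h'], ['F', 'a|a|a|a|a|b', 98, np.nan],
     ['G', 'w', 98, 'p'], ['H', 's', 98, 't|u']]
df1 = pd.DataFrame(a, columns=['1', '2', '3', '4'])
df2 = pd.DataFrame({'1': ['x', 'b', 'h', 'v']})

# Split both columns into one token list per row, explode to long form,
# keep only tokens present in df2, then collapse back to one row per index.
tokens = df1[['2', '4']].fillna('').agg('|'.join, axis=1).str.split('|')
long = df1.assign(**{'5': tokens}).explode('5')
long = long[long['5'].isin(df2['1'])]
out = (long.groupby(level=0)
           .agg({'1': 'first', '2': 'first', '3': 'first', '4': 'first',
                 '5': lambda s: ','.join(sorted(set(s)))}))
```

This keeps the original df1 index on the matched rows, which can be handy if you need to trace results back.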
How can I replace specific row-wise duplicate cells in selected columns without dropping rows (preferably without looping through the rows)?
Basically, I want to keep the first value and replace the remaining duplicates in a row with NAN.
For example:
df_example = pd.DataFrame({'A':['a' , 'b', 'c'], 'B':['a', 'f', 'c'],'C':[1,2,3]})
df_example.head()
Original:
A B C
0 a a 1
1 b f 2
2 c c 3
Expected output:
A B C
0 a nan 1
1 b f 2
2 c nan 3
A bit more complicated example is as follows:
Original:
A B C D
0 a 1 a 1
1 b 2 f 5
2 c 3 c 3
Expected output:
A B C D
0 a 1 nan nan
1 b 2 f 5
2 c 3 nan nan
Use DataFrame.mask with Series.duplicated applied per row via DataFrame.apply:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C
0 a NaN 1
1 b f 2
2 c NaN 3
With new data:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C D
0 a 1 NaN NaN
1 b 2 f 5.0
2 c 3 NaN NaN
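To see why this works, here is a small self-contained run of the same mask/duplicated recipe on the four-column example, with the data reconstructed from the printed output:

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({'A': ['a', 'b', 'c'],
                           'B': [1, 2, 3],
                           'C': ['a', 'f', 'c'],
                           'D': [1, 5, 3]})

# Within each row, duplicated() is True for the second and later occurrences
# of a value, so mask() replaces exactly those cells with NaN and keeps the
# first occurrence intact.
masked = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
```

Note that D becomes a float column once it contains NaN, which is why the printed output shows `5.0`.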
I'm working on a pandas data frame where I want to find the farthest-out non-null value in each row, reverse the order of those values, and output a data frame with each row's values reversed, without leaving null values in the first column. Essentially: reverse the column order, then shift the non-null values to the left.
IN:
   1  2  3  4  5
1  a  b  c  d  e
2  a  b  c
3  a  b  c  d
4  a  b     c
OUT:
   1  2  3  4  5
1  e  d  c  b  a
2  c  b  a
3  d  c  b  a
4  c  b  a
For each row, create a new Series with the same indexes but with the values reversed:
def reverse(s):
    # Strip the NaN on both ends, but not in the middle
    idx1 = s.first_valid_index()
    idx2 = s.last_valid_index()
    idx = s.loc[idx1:idx2].index
    return pd.Series(s.loc[idx[::-1]].values, index=idx)

df.apply(reverse, axis=1)
Result:
1 2 3 4 5
1 e d c b a
2 c b a NaN NaN
3 d c b a NaN
4 c NaN b a NaN
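A self-contained run of the answer, with the sample frame rebuilt (row 4 is assumed to have its gap in column 3, which is what the printed result implies):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['a', 'b', 'c', 'd', 'e'],
     ['a', 'b', 'c', np.nan, np.nan],
     ['a', 'b', 'c', 'd', np.nan],
     ['a', 'b', np.nan, 'c', np.nan]],
    index=[1, 2, 3, 4], columns=['1', '2', '3', '4', '5'])

def reverse(s):
    # Slice from the first to the last valid label, so leading/trailing NaN
    # stay where they are while interior NaN travel with the reversal.
    idx1 = s.first_valid_index()
    idx2 = s.last_valid_index()
    idx = s.loc[idx1:idx2].index
    return pd.Series(s.loc[idx[::-1]].values, index=idx)

out = df.apply(reverse, axis=1)
```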
I would like to know if there's a way in Python to place columns from different dataframes with the same names (or related names) adjacent to each other.
I know there's the option to use join, but I would like to write a function from scratch that achieves the same result.
Example:
Let's assume 2 dataframes df1 and df2
df1 is
id A B
50 1 5
60 2 6
70 3 7
80 4 8
df2 is
id A_1 B_1
50 a b
60 c d
70 e f
80 g h
Expected Output: A new dataframe, say df3, looking like this
id A A_1 B B_1
50 1 a 5 b
60 2 c 6 d
70 3 e 7 f
80 4 g 8 h
You can use sorted() on the column names:
m = pd.concat([df1.set_index('id'), df2.set_index('id')], axis=1)
m[sorted(m.columns)].reset_index()
id A A_1 B B_1
0 50 1 a 5 b
1 60 2 c 6 d
2 70 3 e 7 f
3 80 4 g 8 h
First you join the 2 dataframes on id (set_index avoids the overlapping id column) -
df3 = df1.set_index('id').join(df2.set_index('id'), how='inner')
And then you can sort the index -
df3 = df3.sort_index(axis=1)
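Putting the join-based answer together into a runnable sketch, with the frames reconstructed from the example above:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [50, 60, 70, 80],
                    'A': [1, 2, 3, 4],
                    'B': [5, 6, 7, 8]})
df2 = pd.DataFrame({'id': [50, 60, 70, 80],
                    'A_1': list('aceg'),
                    'B_1': list('bdfh')})

# Join on id, then sort the column labels lexicographically so that
# A sits next to A_1 and B next to B_1.
df3 = df1.set_index('id').join(df2.set_index('id'), how='inner')
df3 = df3.sort_index(axis=1).reset_index()
```

Lexicographic sorting is what places the "related names" adjacently here; column pairs whose names don't sort together would need an explicit ordering instead.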
I have a dataframe with two features: gps_height (numeric) and region (categorical).
The gps_height column contains a lot of 0 values, which are missing values in this case. I want to fill the 0 values with the mean of the corresponding region.
My reasoning is as follows:
1. Drop the zero values and take the mean values of gps_height, grouped by region
df[df.gps_height !=0].groupby(['region']).mean()
But how do I replace the zero values in my dataframe with those mean values?
Sample data:
gps_height region
0 1390 Iringa
1 1400 Mara
2 0 Iringa
3 250 Iringa
...
Use:
df = pd.DataFrame({'region':list('aaabbbccc'),
                   'gps_height':[2,3,0,3,4,5,1,0,0]})
print (df)
region gps_height
0 a 2
1 a 3
2 a 0
3 b 3
4 b 4
5 b 5
6 c 1
7 c 0
8 c 0
Replace 0 with missing values, then fill the NaNs via fillna with per-group means from GroupBy.transform:
df['gps_height'] = df['gps_height'].replace(0, np.nan)
df['gps_height']=df['gps_height'].fillna(df.groupby('region')['gps_height'].transform('mean'))
print (df)
region gps_height
0 a 2.0
1 a 3.0
2 a 2.5
3 b 3.0
4 b 4.0
5 b 5.0
6 c 1.0
7 c 1.0
8 c 1.0
Or filter out 0 values, aggregate means and map all 0 rows:
m = df['gps_height'] != 0
s = df[m].groupby('region')['gps_height'].mean()
df.loc[~m, 'gps_height'] = df['region'].map(s)
#alternative
#df['gps_height'] = np.where(~m, df['region'].map(s), df['gps_height'])
print (df)
region gps_height
0 a 2.0
1 a 3.0
2 a 2.5
3 b 3.0
4 b 4.0
5 b 5.0
6 c 1.0
7 c 1.0
8 c 1.0
I ended up facing the same problem that @ahbon raised in a comment: what if there is more than one column to group by? This was the closest question I found to my problem. After a serious struggle, I came to a solution.
As far as I know it is not an elegant/orthodox one (there are pandas-specific functions that do similar things), so I'd appreciate some feedback.
Here it goes:
import pandas as pd
import random
random.seed(123)
df = pd.DataFrame({"A":list('a'*4+'b'*4+'c'*4+'d'*4),
                   "B":list('xy'*8),
                   "C":random.sample(range(17), 16)})
print(df)
A B C
0 a x 1
1 a y 8
2 a x 16
3 a y 12
4 b x 6
5 b y 4
6 b x 14
7 b y 0
8 c x 13
9 c y 5
10 c x 2
11 c y 9
12 d x 10
13 d y 11
14 d x 3
15 d y 15
First get the indices of the rows where C is non-zero, use them to retrieve the non-zero data, and compute the mean by group.
idx = list(df[df["C"] != 0].index)
data_to_group = df.iloc[idx,]
grouped_data = pd.DataFrame(data_to_group.groupby(["A", "B"])["C"].mean())
And now the tricky part. Here is where I get the impression that a more elegant solution must exist:
Stack, unstack and reset the index.
Then merge with the subset of rows in df where C is 0; drop C from the first frame and keep C from the second.
Finally, update df with this subset, which has no zeros in C.
grouped_data = grouped_data.stack().unstack().reset_index()
zero_rows = df[df.C == 0]
zero_rows_replaced = pd.merge(left = zero_rows, right = grouped_data,
                              how = "left", on=["A", "B"],
                              suffixes=('_x','')).drop('C_x', axis=1)
zero_rows_replaced = zero_rows_replaced.set_index(zero_rows.index.copy())
df.update(zero_rows_replaced)
print(df)
A B C
0 a x 1
1 a y 8
2 a x 16
3 a y 12
4 b x 6
5 b y 4
6 b x 14
7 b y 4
8 c x 13
9 c y 5
10 c x 2
11 c y 9
12 d x 10
13 d y 11
14 d x 3
15 d y 15
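For what it's worth, the transform recipe from the earlier answer extends directly to a list of grouping keys, which sidesteps the merge dance. A sketch with the same data hard-coded (random.sample output is seed- and version-dependent, so the printed values are reused directly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list('a'*4 + 'b'*4 + 'c'*4 + 'd'*4),
                   "B": list('xy'*8),
                   "C": [1, 8, 16, 12, 6, 4, 14, 0, 13, 5, 2, 9, 10, 11, 3, 15]})

# GroupBy.transform accepts a list of keys, so the single-column recipe
# carries over unchanged to grouping by both A and B.
df['C'] = df['C'].replace(0, np.nan)
df['C'] = df['C'].fillna(df.groupby(['A', 'B'])['C'].transform('mean'))
```

The single 0 sits in group (b, y), whose only other value is 4, so it is filled with 4.0, matching the merge-based result above.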
Suppose you create the following pandas data frames:
In[1]: print df1.to_string()
ID value
0 1 a
1 2 b
2 3 c
3 4 d
In[2]: print df2.to_string()
Id_a Id_b
0 1 2
1 4 2
2 2 1
3 3 3
4 4 4
5 2 2
How can I create a frame df_ids_to_values with the following values:
In[2]: print df_ids_to_values.to_string()
value_a value_b
0 a b
1 d b
2 b a
3 c c
4 d d
5 b b
In other words, I would like to replace the ids in df2 with the corresponding values in df1. I have tried doing this with a for loop, but it is very slow, and I am hoping there is a pandas function that lets me do this operation efficiently.
Thanks for your help...
Start by setting an index on df1
df1 = df1.set_index('ID')
then join the two columns
df = df2.join(df1, on='Id_a')
df = df.rename(columns = {'value' : 'value_a'})
df = df.join(df1, on='Id_b')
df = df.rename(columns = {'value' : 'value_b'})
result:
> df
Id_a Id_b value_a value_b
0 1 2 a b
1 4 2 d b
2 2 1 b a
3 3 3 c c
4 4 4 d d
5 2 2 b b
[6 rows x 4 columns]
(and you get to your expected output with df[['value_a','value_b']])
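An equivalent sketch using Series.map with an ID-to-value lookup Series, which avoids the intermediate renames (variable names here are my own):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'value': list('abcd')})
df2 = pd.DataFrame({'Id_a': [1, 4, 2, 3, 4, 2],
                    'Id_b': [2, 2, 1, 3, 4, 2]})

# Series.map with an ID -> value Series replaces every id in one
# vectorized pass per column, no loop needed.
lookup = df1.set_index('ID')['value']
df_ids_to_values = pd.DataFrame({'value_a': df2['Id_a'].map(lookup),
                                 'value_b': df2['Id_b'].map(lookup)})
```

Ids absent from df1 would simply map to NaN, which makes missing keys easy to spot afterwards.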