I'm working on a pandas DataFrame where I want to find the last non-null value in each row, reverse the order of the row's values, and output a DataFrame with the rows reversed without leaving null values in the first column. In other words, reverse the column order and shift the non-null values to the left.
IN:
1 2 3 4 5
1 a b c d e
2 a b c
3 a b c d
4 a b c
OUT:
1 2 3 4 5
1 e d c b a
2 c b a
3 d c b a
4 c b a
For each row, create a new Series with the same indexes but with the values reversed:
def reverse(s):
    # Strip the NaN on both ends, but not in the middle
    idx1 = s.first_valid_index()
    idx2 = s.last_valid_index()
    idx = s.loc[idx1:idx2].index
    # Reverse the values while keeping the original labels in order
    return pd.Series(s.loc[idx[::-1]].values, index=idx)
df.apply(reverse, axis=1)
Result:
1 2 3 4 5
1 e d c b a
2 c b a NaN NaN
3 d c b a NaN
4 c NaN b a NaN
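If the goal is instead to close up gaps in the middle of a row as well (so that row 4 becomes c b a), a variant along these lines might do it. This is only a sketch: `reverse_compact` is a hypothetical helper name, and the sample frame assumes that row 4 has its gap in column 3, which is inferred from the result above.

```python
import numpy as np
import pandas as pd

def reverse_compact(row):
    # Keep only the non-null values and reverse their order
    vals = row.dropna().to_numpy()[::-1]
    # Left-align them and pad the rest of the row with NaN
    out = np.full(len(row), np.nan, dtype=object)
    out[:len(vals)] = vals
    return pd.Series(out, index=row.index)

# Sample frame; the position of the gap in row 4 is an assumption
df = pd.DataFrame(
    [["a", "b", "c", "d", "e"],
     ["a", "b", "c", np.nan, np.nan],
     ["a", "b", "c", "d", np.nan],
     ["a", "b", np.nan, "c", np.nan]],
    index=[1, 2, 3, 4],
    columns=[1, 2, 3, 4, 5],
)

result = df.apply(reverse_compact, axis=1)
print(result)
```

Unlike the function above, this one drops every NaN in the row, so it also shifts values left when a row starts with NaN.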
Related
I would like to fill missing values in a pandas dataframe with the average of the cells directly before and after the missing value considering that there are different IDs.
maskedid test value
1 A 4
1 B NaN
1 C 5
2 A 5
2 B NaN
2 B 2
expected DF
maskedid test value
1 A 4
1 B 4.5
1 C 5
2 A 5
2 B 3.5
2 B 2
Try to interpolate:
df['value'] = df['value'].interpolate()
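And by group, so that values are not interpolated across different IDs (transform keeps the result aligned with the original index):
df['value'] = df.groupby('maskedid')['value'].transform(lambda s: s.interpolate())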
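A self-contained sketch of the per-group variant on the sample data from the question. It uses `transform` rather than `apply`, on the assumption that you want the result to line up with the original row index regardless of pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "maskedid": [1, 1, 1, 2, 2, 2],
    "test": ["A", "B", "C", "A", "B", "B"],
    "value": [4, np.nan, 5, 5, np.nan, 2],
})

# Interpolate inside each ID only; transform returns a Series
# with the same index as df, so it can be assigned back directly
df["value"] = (
    df.groupby("maskedid")["value"]
      .transform(lambda s: s.interpolate())
)
print(df)
```

On this sample the grouped and ungrouped versions give the same numbers. They differ when a gap sits at the start of a group: the grouped version leaves it as NaN instead of borrowing a value from the previous ID.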
Let's suppose I have a dataframe:
import numpy as np
import pandas as pd
a = [['A',np.nan,2,'x|x|x|y'],['B','a|b',56,'b|c'],['C','c|e|e',65,'f|g'],['D','h',98,'j'],['E','g',98,'k|h'],['F','a|a|a|a|a|b',98,np.nan],['G','w',98,'p'],['H','s',98,'t|u']]
df1 = pd.DataFrame(a, columns=['1', '2','3','4'])
df1
1 2 3 4
0 A NaN 2 x|x|x|y
1 B a|b 56 b|c
2 C c|e|e 65 f|g
3 D h 98 j
4 E g 98 k|h
5 F a|a|a|a|a|b 98 NaN
6 G w 98 p
7 H s 98 t|u
and another dataframe:
a = [['x'],['b'],['h'],['v']]
df2 = pd.DataFrame(a, columns=['1'])
df2
1
0 x
1 b
2 h
3 v
I want to compare column 1 of df2 with columns 2 and 4 of df1 (after splitting them on "|"). If a value matches in column 2, column 4, or both, I want to extract those rows of df1 into another dataframe, with an added column holding the df2 value that matched.
For example, the result would look something like this:
1 2 3 4 5
0 A NaN 2 x|x|x|y x
1 B a|b 56 b|c b
2 F a|a|a|a|a|b 98 NaN b
3 D h 98 j h
4 E g 98 k|h h
The solution is to join the values of both columns into a Series with DataFrame.agg, split them with Series.str.split, keep only the matching values using DataFrame.where with DataFrame.isin, and then join the remaining values together without NaNs. Finally, filter out the rows where the new column is an empty string:
df11 = df1[['2','4']].fillna('').agg('|'.join, axis=1).str.split('|', expand=True)
df1['5'] = (df11.where(df11.isin(df2['1'].tolist()))
                .apply(lambda x: ','.join(set(x.dropna())), axis=1))
df1 = df1[df1['5'].ne('')]
print(df1)
1 2 3 4 5
0 A NaN 2 x|x|x|y x
1 B a|b 56 b|c b
3 D h 98 j h
4 E g 98 k|h h
5 F a|a|a|a|a|b 98 NaN b
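The same idea can also be written as a per-row set intersection, which may be easier to follow. This is a sketch only: `matched` and `wanted` are hypothetical names, and the matches are sorted so that a row with several hits always produces the same string.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [["A", np.nan, 2, "x|x|x|y"], ["B", "a|b", 56, "b|c"],
     ["C", "c|e|e", 65, "f|g"], ["D", "h", 98, "j"],
     ["E", "g", 98, "k|h"], ["F", "a|a|a|a|a|b", 98, np.nan],
     ["G", "w", 98, "p"], ["H", "s", 98, "t|u"]],
    columns=["1", "2", "3", "4"],
)
df2 = pd.DataFrame({"1": ["x", "b", "h", "v"]})

wanted = set(df2["1"])

def matched(row):
    # Collect every "|"-separated token from the non-null cells
    tokens = set()
    for cell in row.dropna():
        tokens.update(cell.split("|"))
    # Keep only the tokens that also occur in df2
    return ",".join(sorted(tokens & wanted))

out = df1.assign(**{"5": df1[["2", "4"]].apply(matched, axis=1)})
out = out[out["5"].ne("")]  # drop rows without any match
print(out)
```

This leaves df1 itself untouched and keeps the original row labels, so call `reset_index(drop=True)` on `out` if you need a fresh 0..n index.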
How can I replace specific row-wise duplicate cells in selected columns without dropping rows (preferably without looping through the rows)?
Basically, I want to keep the first value and replace the remaining duplicates in a row with NaN.
For example:
df_example = pd.DataFrame({'A':['a' , 'b', 'c'], 'B':['a', 'f', 'c'],'C':[1,2,3]})
df_example.head()
Original:
A B C
0 a a 1
1 b f 2
2 c c 3
Expected output:
A B C
0 a nan 1
1 b f 2
2 c nan 3
A bit more complicated example is as follows:
Original:
A B C D
0 a 1 a 1
1 b 2 f 5
2 c 3 c 3
Expected output:
A B C D
0 a 1 nan nan
1 b 2 f 5
2 c 3 nan nan
Use DataFrame.mask with a boolean mask built by calling Series.duplicated on each row via DataFrame.apply:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C
0 a NaN 1
1 b f 2
2 c NaN 3
With new data:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C D
0 a 1 NaN NaN
1 b 2 f 5.0
2 c 3 NaN NaN
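For reference, here is a sketch that builds the second example explicitly and also shows how the check could be limited to selected columns, as the question asks. The column choice `["A", "C"]` is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "b", "c"],
    "B": [1, 2, 3],
    "C": ["a", "f", "c"],
    "D": [1, 5, 3],
})

# Whole row: mark every repeat of an earlier value in the same row
dup = df.apply(lambda row: row.duplicated(), axis=1)
all_cols = df.mask(dup)

# Selected columns only: compare A and C, leave B and D untouched
cols = ["A", "C"]
dup_sel = df[cols].apply(lambda row: row.duplicated(), axis=1)
sel = df.copy()
sel[cols] = df[cols].mask(dup_sel)

print(all_cols)
print(sel)
```

Note that `apply(..., axis=1)` still calls the function once per row internally, so on very large frames this may be slow even though there is no explicit loop.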
New to Pandas, not very sure how the 3D DataFrame works. My dataframe, called 'new' looks like this:
  unique    cat    numerical
       a  b   c  d         e  f
0      0  1   2  3         4  5
1      0  1   2  3         4  5
I want to insert column 'z' so that it ends up like this:
  unique       cat    numerical
       a  b  z   c  d         e  f
0      0  1  9   2  3         4  5
1      0  1  9   2  3         4  5
I successfully made the new column after slicing 'unique' out of my dataframe. Doing this:
new_column = new.loc[:,'unique'].assign(z=pd.Series([9,9]).values)
Gets me this:
a b z
0 0 1 9
1 0 1 9
However I have no idea how to put it back into the dataframe. I tried:
new['unique'] = new_column
But I've since found out that it just tries to replace all the values in all the rows and columns found under 'unique', like this:
new['unique'] = 'a'
Gets:
  unique    cat    numerical
       a  b   c  d         e  f
0      a  a   2  3         4  5
1      a  a   2  3         4  5
And using .loc gets this instead:
  unique      cat    numerical
       a    b   c  d         e  f
0    NaN  NaN   2  3         4  5
1    NaN  NaN   2  3         4  5
Here's my full code:
import pandas as pd
import numpy as np
data=[[0,1,2,3,4,5],[0,1,2,3,4,5]]
datatypes=np.array(['unique','unique','cat','cat','numerical','numerical'])
columnnames=np.array(['a','b','c','d','e','f'])
new = pd.DataFrame(data=data, columns=pd.MultiIndex.from_tuples(zip(datatypes,columnnames)))
print('new: ')
print(new)
new_column = new.loc[:,'unique'].assign(z=pd.Series([9,9]).values)
print('\nnew column:')
print(new_column)
new.loc[:,'unique'] = new_column
print('\nattempt 1:')
print(new)
new['unique'] = new_column
print('\nattempt 2:')
print(new)
One way to do this:
# Create your new multiindexed column:
new['unique','z'] = 9
# Re-order your columns in your desired order:
new = new[['unique', 'cat', 'numerical']]
>>> new
  unique       cat    numerical
       a  b  z   c  d         e  f
0      0  1  9   2  3         4  5
1      0  1  9   2  3         4  5
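A self-contained sketch of the above, together with an alternative that places the column by position using DataFrame.insert. The names `by_reorder` and `by_insert` are hypothetical, and position 2 assumes that 'z' should go right after ('unique', 'b').

```python
import pandas as pd

columns = pd.MultiIndex.from_arrays([
    ["unique", "unique", "cat", "cat", "numerical", "numerical"],
    ["a", "b", "c", "d", "e", "f"],
])
data = [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]]

# Approach 1: append the column, then reorder by top-level label
by_reorder = pd.DataFrame(data, columns=columns)
by_reorder["unique", "z"] = 9
by_reorder = by_reorder[["unique", "cat", "numerical"]]

# Approach 2: insert at an explicit position with a tuple label
by_insert = pd.DataFrame(data, columns=columns)
by_insert.insert(2, ("unique", "z"), 9)

print(by_reorder)
print(by_insert)
```

Both frames end up with the same column order here. The insert version does not require listing every top-level group, which may be more convenient when there are many of them.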
I have a bunch of partially overlapping (in rows and columns) pandas DataFrames, exemplified like so:
import pandas
df1 = pandas.DataFrame({'a':['1','2','3'], 'b':['a','b','c']})
df2 = pandas.DataFrame({'c':['q','w','e','r','t','y'], 'b':['a','b','c','d','e','f']})
df3 = pandas.DataFrame({'a':['4','5','6'], 'c':['r','t','y']})
...etc.
I want to merge them all together with as few NaN holes as possible.
Consecutive blind outer merges invariably give some (unfortunately useless to me) hole-and-duplicate-filled variant of:
a b c
0 1 a q
1 2 b w
2 3 c e
3 NaN d r
4 NaN e t
5 NaN f y
6 4 NaN r
7 5 NaN t
8 6 NaN y
My desired output given df1, df2, and df3 above would be this (column order doesn't matter):
a b c
0 1 a q
1 2 b w
2 3 c e
3 4 d r
4 5 e t
5 6 f y
I want the NaNs to be treated as places to insert data from the next dataframe, not obstruct it.
I'm at a loss here. Is there any way to achieve this in a general way?
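I cannot guarantee the speed, but sorting each column with a key that moves the nulls to the end seems to work for your sample data (df here is the outer-merged frame shown in the question):
df.apply(lambda x: sorted(x, key=pd.isnull)).dropna(axis=0)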
Out[47]:
a b c
0 1.0 a q
1 2.0 b w
2 3.0 c e
3 4.0 d r
4 5.0 e t
5 6.0 f y
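To make the trick reproducible, here is a sketch that types in the hole-filled frame from the question by hand and applies the same null-last sort column by column. The names `merged` and `compacted` are hypothetical, and a dict comprehension is used in place of `apply` purely for readability.

```python
import numpy as np
import pandas as pd

# The outer-merged frame, copied from the table in the question
merged = pd.DataFrame({
    "a": ["1", "2", "3", np.nan, np.nan, np.nan, "4", "5", "6"],
    "b": ["a", "b", "c", "d", "e", "f", np.nan, np.nan, np.nan],
    "c": ["q", "w", "e", "r", "t", "y", "r", "t", "y"],
})

# sorted() is stable, so non-null values keep their relative order
# while the nulls sink to the bottom of each column
compacted = pd.DataFrame(
    {col: sorted(merged[col], key=pd.isna) for col in merged.columns}
).dropna(axis=0)

print(compacted)
```

Keep in mind that every column is sorted independently, so the rows only pair up correctly if the non-null values of each column already appear in matching order. That holds for this sample but is not guaranteed in general.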