I have a dataframe like below:
Text Label
a NaN
b NaN
c NaN
1 NaN
2 NaN
b NaN
c NaN
a NaN
b NaN
c NaN
Whenever the values 'a', 'b', 'c' occur consecutively down the Text column, I want to label those rows with a string such as 'Check'. The final dataframe should look like this:
Text Label
a Check
b Check
c Check
1 NaN
2 NaN
b NaN
c NaN
a Check
b Check
c Check
What is the best way to do this? Thank you =)
Here's a NumPy-based approach leveraging broadcasting:
import numpy as np

# Cumulatively concatenate the strings and flag rows where the last three
# characters are 'abc' (the row where a match ends).
w = df.Text.cumsum().str[-3:].eq('abc')  # inefficient for large dfs
# Broadcast each match position against the offsets -2..0 to cover all three rows.
m = (w[w].index.values[:, None] + np.arange(-2, 1)).ravel()
df.loc[m, 'Label'] = 'Check'
Text Label
0 a Check
1 b Check
2 c Check
3 1 NaN
4 2 NaN
5 b NaN
6 c NaN
7 a Check
8 b Check
9 c Check
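The broadcasting step is what expands each match index into the three covered row positions; a small illustration of just that step, with hypothetical match positions 2 and 9:
import numpy as np

idx = np.array([2, 9])                  # indices where a match ends
print(idx[:, None] + np.arange(-2, 1))  # each row lists the three matched positions
# [[0 1 2]
#  [7 8 9]]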
For a general solution that works with any pattern, use a rolling window with numpy.where:
arr = df['Text'].values  # need a NumPy array for the stride trick below
pat = list('abc')
N = len(pat)

def rolling_window(a, window):
    # View `a` as overlapping windows of the given length (no copy).
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# True at each position where the full pattern starts
b = np.all(rolling_window(arr, N) == pat, axis=1)
# expand each start position to all N covered positions
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x + N)]
df['label'] = np.where(np.in1d(np.arange(len(arr)), d), 'Check', np.nan)
print (df)
Text Label label
0 a NaN Check
1 b NaN Check
2 c NaN Check
3 1 NaN nan
4 2 NaN nan
5 b NaN nan
6 c NaN nan
7 a NaN Check
8 b NaN Check
9 c NaN Check
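Note that np.where coerces np.nan to the string 'nan' above, so the label column holds the text 'nan' rather than a missing value. To get real missing values instead, one option is to fill with None, which pandas treats as missing (a small sketch, reusing arr and d from above):
mask = np.in1d(np.arange(len(arr)), d)
df['label'] = np.where(mask, 'Check', None)  # None becomes a real missing value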
Good old shift and bfill work as well (for a small number of steps):
# flag the row where the pattern ends: 'c' preceded by 'b' preceded by 'a'
s = df.Text.eq('c') & df.Text.shift().eq('b') & df.Text.shift(2).eq('a')
df.loc[s, 'Label'] = 'Check'
# back-fill the label onto the two preceding rows of each match
df['Label'] = df['Label'].bfill(limit=2)
Output:
Text Label
0 a Check
1 b Check
2 c Check
3 1 NaN
4 2 NaN
5 b NaN
6 c NaN
7 a Check
8 b Check
9 c Check
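The same idea generalizes to longer patterns with a reduce over shifts; this is a sketch, assuming the df from the question, not part of the original answer:
from functools import reduce
import pandas as pd

pat = list('abc')
# True on the row where the pattern ends: element k from the end must match at shift k
s = reduce(lambda acc, t: acc & df.Text.shift(t[0]).eq(t[1]),
           enumerate(reversed(pat)),
           pd.Series(True, index=df.index))
df.loc[s, 'Label'] = 'Check'
df['Label'] = df['Label'].bfill(limit=len(pat) - 1)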
Related
I have two data frames:
import pandas as pd
from numpy import nan

df1 = pd.DataFrame({'key': [1, 2, 3, 4],
                    'only_at_df1': ['a', 'b', 'c', 'd'],
                    'col2': ['e', 'f', 'g', 'h']})
df2 = pd.DataFrame({'key': [1, 9],
                    'only_at_df2': [nan, 'x'],
                    'col2': ['e', 'z']})
How do I get this:
df3 = pd.DataFrame({'key': [1, 2, 3, 4, 9],
                    'only_at_df1': ['a', 'b', 'c', 'd', nan],
                    'only_at_df2': [nan, nan, nan, nan, 'x'],
                    'col2': ['e', 'f', 'g', 'h', 'z']})
Any help appreciated.
The best is probably to use combine_first after temporarily setting "key" as index; it fills the caller's missing values, rows, and columns with those of the other frame, aligned on the index:
df1.set_index('key').combine_first(df2.set_index('key')).reset_index()
output:
key col2 only_at_df1 only_at_df2
0 1 e a NaN
1 2 f b NaN
2 3 g c NaN
3 4 h d NaN
4 9 z NaN x
This seems like a straightforward use of merge with how="outer":
df1.merge(df2, how="outer")
Output:
key only_at_df1 col2 only_at_df2
0 1 a e NaN
1 2 b f NaN
2 3 c g NaN
3 4 d h NaN
4 9 NaN z x
I have a data frame like
df = pd.DataFrame({"A":[1,np.nan,5],"B":[np.nan,10,np.nan], "C":[2,3,np.nan]})
A B C
0 1 NaN 5
1 NaN 10 NaN
2 2 3 NaN
I want to left shift all the values to occupy the nulls. Desired output:
A B C
0 1 5 NaN
1 10 NaN NaN
2 2 3 NaN
I tried doing this with a chain of df['A'].fillna(df['B'].fillna(df['C'])), but in my actual data there are more than 100 columns. Is there a better way to do this?
Let us do: sorting each row with key=pd.isnull pushes the nulls to the end while keeping the order of the non-null values:
out = df.T.apply(lambda x: sorted(x, key=pd.isnull)).T
Out[41]:
A B C
0 1.0 5.0 NaN
1 10.0 NaN NaN
2 2.0 3.0 NaN
I also figured out another way to do this without the sort:
def shift_null(arr):
    # x == x is False only for NaN, so non-nulls come first, in their original order
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

out = df.T.apply(lambda arr: shift_null(arr)).T
This was faster for big dataframes.
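For large frames, a fully vectorized NumPy variant of the same idea avoids the Python-level sort per row (a sketch; push_left is a hypothetical helper name):
import numpy as np
import pandas as pd

def push_left(df):
    arr = df.to_numpy()
    # stable argsort of the null mask keeps non-nulls first, in original order
    idx = pd.isnull(arr).argsort(axis=1, kind='stable')
    return pd.DataFrame(arr[np.arange(len(arr))[:, None], idx],
                        index=df.index, columns=df.columns)

out = push_left(df)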
I've just found out about this strange behaviour of mask; could someone explain it to me?
A)
[input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3, inplace=True)
[output]
     A    B   C
0  NaN  NaN  hi
1  NaN  3.0  hi
2  4.0  5.0  hi
3  6.0  7.0  hi
4  8.0  9.0  hi
B)
[input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)
[output]
     A    B    C
0  NaN  NaN  NaN
1  NaN  3.0  NaN
2  4.0  5.0  NaN
3  6.0  7.0  NaN
4  8.0  9.0  NaN
Thank you in advance
The root cause of the different results is that you pass a boolean dataframe that does not have the same shape as the dataframe you want to mask; df.mask() fills the missing part of the condition with the value of inplace.
From the source code, you can see pandas.DataFrame.mask() calls pandas.DataFrame.where() internally, and pandas.DataFrame.where() then calls a _where() method that replaces values where the condition is False.
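In other words, df.mask(cond) behaves like df.where(~cond), so it is enough to study where. A quick self-contained check, using the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
out1 = df.mask(df[['A', 'B']] < 3)
out2 = df.where(~(df[['A', 'B']] < 3))
assert out1.equals(out2)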
Let's take df.where() as the example; here is the example code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
df1 = df.where(df[['A', 'B']]<3)
df.where(df[['A', 'B']]<3, inplace=True)
In this example, the df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df[['A', 'B']]<3, the value of cond argument, is
A B
0 True True
1 False False
2 False False
3 False False
Digging into _where() method, the following lines are the key part:
def _where(...):
    # align the cond to same shape as myself
    cond = com.apply_if_callable(cond, self)
    if isinstance(cond, NDFrame):
        cond, _ = cond.align(self, join="right", broadcast_axis=1)
    ...
    # make sure we are boolean
    fill_value = bool(inplace)
    cond = cond.fillna(fill_value)
Since the shapes of cond and df differ, cond.align() fills the missing C column with NaN. After that, cond looks like
A B C
0 True True NaN
1 False False NaN
2 False False NaN
3 False False NaN
Then cond.fillna(fill_value) replaces those NaN values with the value of inplace, so the C column of cond ends up equal to the inplace flag.
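The alignment step can be reproduced with the public API (a self-contained sketch, not the internal _where):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
cond = df[['A', 'B']] < 3
aligned, _ = cond.align(df, join="right")  # the missing C column becomes all NaN
print(aligned.fillna(False))  # what _where() works with when inplace=False
print(aligned.fillna(True))   # what _where() works with when inplace=True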
There is some further code related to inplace (L9048 and L9124-L9145 in the source), but we needn't care about the details: the aim of those lines is simply to replace values where the condition is False.
Recall that the df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df1 = df.where(df[['A', 'B']]<3): the cond C column is False, since the default value of inplace is False. After df.where(), the C column is set to the value of the other argument, which is NaN by default.
df.where(df[['A', 'B']]<3, inplace=True): the cond C column is True, so after df.where() the C column keeps its original values.
# print(df1)
A B C
0 0.0 1.0 NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
# print(df) after df.where(df[['A', 'B']]<3, inplace=True)
A B C
0 0.0 1.0 2
1 NaN NaN 5
2 NaN NaN 8
3 NaN NaN 11
Think of it simply.
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)
The last code line applies the mask to the full dataframe (df). The condition was applied to columns ['A', 'B'] only, so since column 'C' was not part of the condition, NaN is returned for column C.
The below would be the same as df.mask(df[['A', 'B']]<3):
>>> df[["A","B","C"]].mask(df[['A', 'B']]<3)
A B C
0 NaN NaN NaN
1 NaN 3.0 NaN
2 4.0 5.0 NaN
3 6.0 7.0 NaN
4 8.0 9.0 NaN
>>>
And df.mask(df[['A', 'B', 'C']]<3) will raise an error, because column 'C' is of string type:
TypeError: '<' not supported between instances of 'str' and 'int'
Finally, to return only columns "A" and "B"
>>> df[["A","B"]].mask(df[['A', 'B']]<3)
A B
0 NaN NaN
1 NaN 3.0
2 4.0 5.0
3 6.0 7.0
4 8.0 9.0
When you run the command with inplace=True, it does nothing to column C: the NaN in the aligned condition is filled with True, which in the mask method means 'do nothing'.
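A quick check that column C survives the inplace call (same frame as in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
df.mask(df[['A', 'B']] < 3, inplace=True)
print(df['C'].tolist())  # ['hi', 'hi', 'hi', 'hi', 'hi'] -> kept, the gap was filled with True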
I have a bunch of partially overlapping (in rows and columns) pandas DataFrames, exemplified like so:
df1 = pandas.DataFrame({'a':['1','2','3'], 'b':['a','b','c']})
df2 = pandas.DataFrame({'c':['q','w','e','r','t','y'], 'b':['a','b','c','d','e','f']})
df3 = pandas.DataFrame({'a':['4','5','6'], 'c':['r','t','y']})
...etc.
I want to merge them all together with as few NaN holes as possible.
Consecutive blind outer merges invariably give some (unfortunately useless to me) hole-and-duplicate-filled variant of:
a b c
0 1 a q
1 2 b w
2 3 c e
3 NaN d r
4 NaN e t
5 NaN f y
6 4 NaN r
7 5 NaN t
8 6 NaN y
My desired output given a, b, and c above would be this (column order doesn't matter):
a b c
0 1 a q
1 2 b w
2 3 c e
3 4 d r
4 5 e t
5 6 f y
I want the NaNs to be treated as places to insert data from the next dataframe, not as obstructions.
I'm at a loss here. Is there any way to achieve this in a general way?
I cannot guarantee the speed, but sorting with a key seems to work for your sample data (df here is the hole-filled result of the outer merges above):
df.apply(lambda x: sorted(x, key=pd.isnull)).dropna(axis=0)
Out[47]:
a b c
0 1.0 a q
1 2.0 b w
2 3.0 c e
3 4.0 d r
4 5.0 e t
5 6.0 f y
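For completeness, a sketch of the full pipeline, under the assumption that df is built by chained blind outer merges of the frames from the question:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3]
# chained outer merges give the hole-and-duplicate-filled frame from the question...
df = reduce(lambda l, r: l.merge(r, how='outer'), dfs)
# ...then push nulls to the bottom of each column and drop the leftover incomplete rows
out = df.apply(lambda x: sorted(x, key=pd.isnull)).dropna(axis=0)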
I'm new to Python and have a possibly stupid question, but I'm stuck with it and would be thankful for any help.
I have a dataframe A, and some entries in A contain a range (for example: '1,0 - 2,0'). I want to take the maximum of every such entry (in our example: '2,0').
I tried it with two for-loops
nrow = A.shape[0]-1
ncol = A.shape[1]-1
for i in range(0,nrow):
    for j in range(0,ncol):
        if "-" in A[i,j]:
            A[i,j] = A[i,j].split(' - ')[1]
But I get this error: KeyError: (0, 0).
Questions:
Is there a more elegant way to solve my problem?
What is the problem with my solution?
edit: A.head()
You can use a lambda function; taking the last element of the split leaves entries without a range unchanged:
A['column'] = A['column'].apply(lambda x: x.split(' - ')[-1] if isinstance(x, str) else x)
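A vectorized alternative with the .str accessor, touching only the cells that actually contain the separator (a sketch; 'column' is a placeholder name):
has_range = A['column'].str.contains(' - ', na=False)
A.loc[has_range, 'column'] = A.loc[has_range, 'column'].str.split(' - ').str[-1]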
I think that your split is actually working fine, but you are missing an entry in A for (0,0). In the error message it says that the issue is with the key (0,0).
If the [1] in the split were the issue, you would get an IndexError: list index out of range instead.
Looks like there is no (0,0) key in A, which is understandable: according to your image, the top-left corner is empty. Just add a check for key existence to your loop and you're good:
nrow = A.shape[0]-1
ncol = A.shape[1]-1
for i in range(0,nrow):
    for j in range(0,ncol):
        if (i,j) in A and "-" in A[i,j]:
            A[i,j] = A[i,j].split(' - ')[1]
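For reference, a loop-free equivalent over the whole frame (a sketch using applymap, assuming A is a DataFrame; it only touches string cells containing the separator):
A = A.applymap(lambda x: x.split(' - ')[-1] if isinstance(x, str) and ' - ' in x else x)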
I think you can use:
reshape A to a Series with stack
filter only the values containing ' - ' with contains and boolean indexing
split the values and create A2 with the expand=True parameter, replace ',' with '.' if necessary, convert to float with astype and take the row-wise max
reshape back with unstack
replace the missing values in A2 with the values of A via combine_first.
import numpy as np
import pandas as pd

np.random.seed(34)
L = ['1,0 - 2,0','3,0 - 2,0','4,0 - 6,0', 5.0, 'a', 'c']
A = pd.DataFrame(np.random.choice(L, size=(5,5)), columns=list('abcde'))
print (A)
a b c d e
0 3,0 - 2,0 4,0 - 6,0 4,0 - 6,0 3,0 - 2,0 c
1 a 5.0 c 5.0 4,0 - 6,0
2 c 4,0 - 6,0 c 3,0 - 2,0 1,0 - 2,0
3 c c a c 1,0 - 2,0
4 c a 5.0 5.0 c
A1 = A.stack()
# na=False: non-string cells (like the float 5.0) would otherwise yield NaN
A2 = A1[A1.str.contains(' - ', na=False)]
A2 = (A2.str.split(' - ', expand=True)
        .replace(',', '.', regex=True)
        .astype(float)
        .max(axis=1)
        .unstack())
print (A2)
a b c d e
0 3.0 6.0 6.0 3.0 NaN
1 NaN NaN NaN NaN 6.0
2 NaN 6.0 NaN 3.0 2.0
3 NaN NaN NaN NaN 2.0
A = A2.combine_first(A)
print (A)
a b c d e
0 3 6 6 3 c
1 a 5.0 c 5.0 6
2 c 6 c 3 2
3 c c a c 2
4 c a 5.0 5.0 c
If the second value of the range is always the max:
A1 = A.stack()
A2 = A1[A1.str.contains(' - ', na=False)]
A2 = A2.str.split(' - ').str[1].replace(',', '.', regex=True).astype(float).unstack()
print (A2)
a b c d e
0 2.0 6.0 6.0 2.0 NaN
1 NaN NaN NaN NaN 6.0
2 NaN 6.0 NaN 2.0 2.0
3 NaN NaN NaN NaN 2.0
A = A2.combine_first(A)
print (A)
a b c d e
0 2 6 6 2 c
1 a 5.0 c 5.0 6
2 c 6 c 2 2
3 c c a c 2
4 c a 5.0 5.0 c