I implemented a function that goes to the first occurrence of a value in a pandas DataFrame, but I feel the implementation is kind of ugly. Would you have a nicer way to implement it?
mots is a list of strings.
# Without a doubt the worst implementation in the world...
def find_singular_value(self, mots):
    bool_table = self.document.isin(mots)
    for i in range(bool_table.shape[0]):
        for j in range(bool_table.shape[1]):
            boolean = bool_table.iloc[i][j]
            if boolean:
                return self.document.iloc[i][j + 1]
Here's a solution for getting the j+1 value. It uses df.unstack and df.shift
df = self.document.unstack()
vals = df[df.isin(mots).shift().fillna(False)]
vals will contain all of the j+1 values in self.document. You can then select the first one as in my previous answer.
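For example, a minimal sketch of picking that first value (the None fallback for the no-match case is my own assumption):
first = vals.iloc[0] if not vals.empty else None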
Hopefully this works for you.
This one-liner should give you what you need.
self.document[self.document.isin(mots)].melt()["value"].dropna().values[0]
It applies your isin mask to the original df, then finds the first non-NaN value using pd.melt and df.dropna.
Here's a simple breakdown:
>>> df = pd.DataFrame({"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]})
>>> df.isin([4,6])
a b c
0 False True False
1 False False False
2 False True False
>>> df[df.isin([4,6])]
a b c
0 NaN 4.0 NaN
1 NaN NaN NaN
2 NaN 6.0 NaN
>>> df[df.isin([4,6])].melt()
variable value
0 a NaN
1 a NaN
2 a NaN
3 b 4.0
4 b NaN
5 b 6.0
6 c NaN
7 c NaN
8 c NaN
>>> df[df.isin([4,6])].melt()["value"]
0 NaN
1 NaN
2 NaN
3 4.0
4 NaN
5 6.0
6 NaN
7 NaN
8 NaN
Name: value, dtype: float64
>>> df[df.isin([4,6])].melt()["value"].dropna()
3 4.0
5 6.0
Name: value, dtype: float64
>>> df[df.isin([4,6])].melt()["value"].dropna().values
array([ 4., 6.])
>>> df[df.isin([4,6])].melt()["value"].dropna().values[0]
4.0
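Wrapped back into your method, the same chain reads as below (a minimal sketch; the None fallback for the no-match case is an assumption):
def find_singular_value(self, mots):
    # Mask the frame, flatten it with melt, and keep only the matching values.
    matches = self.document[self.document.isin(mots)].melt()["value"].dropna().values
    # Return the first match; fall back to None when nothing matches.
    return matches[0] if len(matches) else None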
I'm still quite new to Python and programming in general. With luck, I have the right idea, but I can't quite get this to work.
With my example df, I want iteration to start when entry == 1.
import pandas as pd
import numpy as np
nan = np.nan
a = [0,0,4,4,4,4,6,6]
b = [4,4,4,4,4,4,4,4]
entry = [nan,nan,nan,nan,1,nan,nan,nan]
df = pd.DataFrame(columns=['a', 'b', 'entry'])
df = pd.DataFrame.assign(df, a=a, b=b, entry=entry)
I wrote a function, with little success. It returns an error, unhashable type: 'slice'. FWIW, I'm applying this function to groups of various lengths.
def exit_row(df):
    start = df.index[df.entry == 1]
    df.loc[start:, (df.a > df.b), 'exit'] = 1
    return df
Ideally, the result would be as below:
a b entry exit
0 0 4 NaN NaN
1 0 4 NaN NaN
2 4 4 NaN NaN
3 4 4 NaN NaN
4 4 4 1.0 NaN
5 4 4 NaN NaN
6 6 4 NaN 1
7 6 4 NaN 1
Any advice much appreciated. I had wondered if I should attempt a For loop instead, though I often find them difficult to read.
You can use boolean indexing:
# what are the rows after entry?
m1 = df['entry'].notna().cummax()
# in which rows is a>b?
m2 = df['a'].gt(df['b'])
# set 1 where both conditions are True
df.loc[m1&m2, 'exit'] = 1
output:
a b entry exit
0 0 4 NaN NaN
1 0 4 NaN NaN
2 4 4 NaN NaN
3 4 4 NaN NaN
4 4 4 1.0 NaN
5 4 4 NaN NaN
6 6 4 NaN 1.0
7 6 4 NaN 1.0
Intermediates:
a b entry notna m1 m2 m1&m2 exit
0 0 4 NaN False False False False NaN
1 0 4 NaN False False False False NaN
2 4 4 NaN False False False False NaN
3 4 4 NaN False False False False NaN
4 4 4 1.0 True True False False NaN
5 4 4 NaN False True False False NaN
6 6 4 NaN False True True True 1.0
7 6 4 NaN False True True True 1.0
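Since you mentioned applying this to groups of various lengths, the same two masks can be built per group. A sketch, assuming a hypothetical grouping column 'group' and a hypothetical helper name flag_exit:
def flag_exit(g):
    # rows at or after this group's entry row
    after_entry = g['entry'].notna().cummax()
    # flag rows where a > b once entry has occurred, NaN elsewhere
    return g.assign(exit=np.where(after_entry & g['a'].gt(g['b']), 1, np.nan))

df = df.groupby('group', group_keys=False).apply(flag_exit)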
I have a data frame like
df = pd.DataFrame({"A":[1,np.nan,2], "B":[np.nan,10,3], "C":[5,np.nan,np.nan]})
A B C
0 1 NaN 5
1 NaN 10 NaN
2 2 3 NaN
I want to left shift all the values to occupy the nulls. Desired output:
A B C
0 1 5 NaN
1 10 NaN NaN
2 2 3 NaN
I tried doing this using a series of df['A'].fillna(df['B'].fillna(df['C'])) but in my actual data there are more than 100 columns. Is there a better way to do this?
Let us do
out = df.T.apply(lambda x : sorted(x,key=pd.isnull)).T
Out[41]:
A B C
0 1.0 5.0 NaN
1 10.0 NaN NaN
2 2.0 3.0 NaN
I also figured out another way to do this without the sort:
def shift_null(arr):
    return [x for x in arr if x == x] + [np.nan for x in arr if x != x]

out = df.T.apply(lambda arr: shift_null(arr)).T
This was faster for big dataframes.
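For very wide frames it may also be worth dropping down to NumPy. A sketch, assuming every column is numeric so that the missing values are real NaNs:
import numpy as np
import pandas as pd

arr = df.to_numpy(dtype=float)
# a stable sort on "is NaN" pushes NaNs to the right while keeping the
# original order of the non-NaN values in each row
order = np.argsort(np.isnan(arr), axis=1, kind="stable")
out = pd.DataFrame(np.take_along_axis(arr, order, axis=1),
                   index=df.index, columns=df.columns)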
I've just found out about this strange behaviour of mask, could someone explain this to me?
A)
[input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3, inplace=True)
[output]
     A    B   C
0  NaN  NaN  hi
1  NaN  3.0  hi
2  4.0  5.0  hi
3  6.0  7.0  hi
4  8.0  9.0  hi
B)
[input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)
[output]
     A    B    C
0  NaN  NaN  NaN
1  NaN  3.0  NaN
2  4.0  5.0  NaN
3  6.0  7.0  NaN
4  8.0  9.0  NaN
Thank you in advance
The root cause of the different results is that you pass a boolean DataFrame that is not the same shape as the DataFrame you want to mask. df.mask() fills the missing part of the condition with the value of inplace.
From the source code, you can see pandas.DataFrame.mask() calls pandas.DataFrame.where() internally. pandas.DataFrame.where() then calls a _where() method that replaces values where the condition is False.
Let's take df.where() as the example; here is the example code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
df1 = df.where(df[['A', 'B']]<3)
df.where(df[['A', 'B']]<3, inplace=True)
In this example, the df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df[['A', 'B']]<3, the value of the cond argument, is
A B
0 True True
1 False False
2 False False
3 False False
Digging into _where() method, the following lines are the key part:
def _where(...):
    # align the cond to same shape as myself
    cond = com.apply_if_callable(cond, self)
    if isinstance(cond, NDFrame):
        cond, _ = cond.align(self, join="right", broadcast_axis=1)
    ...
    # make sure we are boolean
    fill_value = bool(inplace)
    cond = cond.fillna(fill_value)
Since the shapes of cond and df are different, cond.align() fills the missing column with NaN values. After that, cond looks like
A B C
0 True True NaN
1 False False NaN
2 False False NaN
3 False False NaN
Then, with cond.fillna(fill_value), the NaN values are replaced with the value of inplace, so the C column of cond ends up holding the inplace value.
There is still some code (L9048 and L9124-L9145 in the pandas source) related to inplace, but we needn't care about the details, since the aim of those lines is just to replace values where the condition is False.
Recall that the df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df1 = df.where(df[['A', 'B']]<3): the cond C column is False, since the default value of inplace is False. After doing df.where(), the df C column is set to the value of the other argument, which is NaN by default.
df.where(df[['A', 'B']]<3, inplace=True): the cond C column is True. After doing df.where(), the df C column stays the same.
# print(df1)
A B C
0 0.0 1.0 NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
# print(df) after df.where(df[['A', 'B']]<3, inplace=True)
A B C
0 0.0 1.0 2
1 NaN NaN 5
2 NaN NaN 8
3 NaN NaN 11
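If you want the inplace and non-inplace calls to agree, one workaround sketch is to align the condition to the full frame yourself before calling mask (fill_value=False means columns outside the condition are always kept):
cond = (df[['A', 'B']] < 3).reindex(columns=df.columns, fill_value=False)
df.mask(cond, inplace=True)  # now equivalent to df = df.mask(cond)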
Think of it simply.
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)
The last code line asks for the full dataframe (df). The condition was applied to columns ['A', 'B'], so, since column 'C' was not part of the condition, it returns NaN for column C.
The below would be the same as df.mask(df[['A', 'B']]<3):
>>> df[["A","B","C"]].mask(df[['A', 'B']]<3)
A B C
0 NaN NaN NaN
1 NaN 3.0 NaN
2 4.0 5.0 NaN
3 6.0 7.0 NaN
4 8.0 9.0 NaN
And, df.mask(df[['A', 'B', 'C']]<3) will generate an error, because column 'C' is string type
TypeError: '<' not supported between instances of 'str' and 'int'
Finally, to return only columns "A" and "B"
>>> df[["A","B"]].mask(df[['A', 'B']]<3)
A B
0 NaN NaN
1 NaN 3.0
2 4.0 5.0
3 6.0 7.0
4 8.0 9.0
When you run the command inplace, it does nothing to column C: that column's entries in the condition are NaN, which the mask method treats as 'do nothing' in the inplace case.
I have a pd.DataFrame that looks like this:
key_value a b c d e
value_01 1 10 x NaN NaN
value_01 NaN 12 NaN NaN NaN
value_01 NaN 7 NaN NaN NaN
value_02 7 4 y NaN NaN
value_02 NaN 5 NaN NaN NaN
value_02 NaN 6 NaN NaN NaN
value_03 19 15 z NaN NaN
So now based on the key_value,
For columns 'a' & 'c', I want to copy down the last cell's value from the same column, based on the key_value.
For another column 'd', I want to copy the row 'i - 1' cell value from column 'b' into the i'th cell of column 'd'.
Lastly, for column 'e', I want to copy the sum of the previous cells (up to row 'i - 1') of column 'b' into the i'th cell of column 'e'.
For every key_value the columns 'a', 'b' & 'c' have some value in their first row, based on which the next values are being copied over or for different columns the values are being generated for.
key_value a b c d e
value_01 1 10 x NaN NaN
value_01 1 12 x 10 10
value_01 1 7 x 12 22
value_02 7 4 y NaN NaN
value_02 7 5 y 4 4
value_02 7 6 y 5 9
value_03 19 15 z NaN NaN
My current approach:
size = df.key_value.size
for i in range(size):
    if pd.isna(df.a[i]) and df.key_value[i] == output.key_value[i - 1]:
        df.a[i] = df.a[i - 1]
        df.c[i] = df.c[i - 1]
        df.d[i] = df.b[i - 1]
        df.e[i] = df.e[i] + df.b[i - 1]
For columns like 'a' and 'b' the NaN values are all in the same row indexes.
My approach works but takes very long since my DataFrame has over 50,000 records. I was wondering if there is a different way to do this, since I have multiple columns like 'a' & 'b' where values need to be copied over based on 'key_value', and some columns where the values are computed using, say, a column like 'b'.
pd.concat with groupby and assign
pd.concat([
    g.ffill().assign(d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())
    for _, g in df.groupby('key_value')
])
key_value a b c d e
0 value_01 1.0 1 x NaN NaN
1 value_01 1.0 2 x 1.0 1.0
2 value_01 1.0 3 x 2.0 3.0
3 value_02 7.0 4 y NaN NaN
4 value_02 7.0 5 y 4.0 4.0
5 value_02 7.0 6 y 5.0 9.0
6 value_03 19.0 7 z NaN NaN
groupby and apply
def h(g):
    return g.ffill().assign(
        d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())

df.groupby('key_value', as_index=False, group_keys=False).apply(h)
You can use groupby + ffill for the groupwise filling. The other operations require a groupwise shift and cumsum.
In general, note that many common operations have been implemented efficiently in Pandas.
g = df.groupby('key_value')
df['a'] = g['a'].ffill()
df['c'] = g['c'].ffill()
df['d'] = g['b'].shift()
df['e'] = df.groupby('key_value')['d'].cumsum()
print(df)
key_value a b c d e
0 value_01 1.0 1 x NaN NaN
1 value_01 1.0 2 x 1.0 1.0
2 value_01 1.0 3 x 2.0 3.0
3 value_02 7.0 4 y NaN NaN
4 value_02 7.0 5 y 4.0 4.0
5 value_02 7.0 6 y 5.0 9.0
6 value_03 19.0 7 z NaN NaN
I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign a value per condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default fills with NaN where the condition holds:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to: assign np.nan to every column (:) of the dataframe (df) in the rows where the condition df.B > 5 holds.
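A hypothetical variation, in case only some columns should be blanked rather than the whole row:
df.loc[df.B > 5, ['A']] = np.nan  # only column A is set to NaN in the matching rows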
Or using reindex
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN