Classify DataFrame rows based on first matching condition - python

I have a pandas DataFrame in which each column represents a quarter, with the most recent quarters on the right. Not all the information arrives at the same time, so some columns may be missing data (NaN values).
I would like to create a new column containing the number of the first criterion that the row matches, or zero if it matches none.
Each criterion is applied to the 3 most recent columns that have data (an integer, ignoring NaNs), and a row matches a criterion when each of those values is greater than or equal to the corresponding entry in the criterion list.
I tried using apply, but I couldn't make it work, and my failed attempts were slow.
import pandas as pd
import numpy as np
criteria_dict = {
    1: [10, 0, 10],
    2: [0, 10, 10]
}

list_of_tuples = [
    (78, 7, 11, 15),       # classify as 2 since 7 >= 0, 11 >= 10, 15 >= 10
    (98, -5, np.NaN, 18),  # classify as 0: ignoring the NaN, the -5 fails every criterion
    (-78, 20, 64, 28),     # classify as 1 since 20 >= 10, 64 >= 0, 28 >= 10
    (35, 63, 27, np.NaN),  # classify as 1: the NaN is ignored, 35 >= 10, 63 >= 0, 27 >= 10
    (-11, 0, 56, 10)       # classify as 2 since 0 >= 0, 56 >= 10, 10 >= 10
]

df = pd.DataFrame(
    list_of_tuples,
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['2021Q2', '2021Q3', '2021Q4', '2022Q1']
)
print(df)

Applying a custom function to each row should work.
def func(x):
    # Keep the last three non-NaN values of the row.
    x = x.dropna().to_numpy()[-3:]
    if len(x) < 3:
        return 0
    # Return the key of the first criterion the row satisfies in full.
    for k, v in criteria_dict.items():
        if np.all(x >= v):
            return k
    return 0

df.apply(func, axis=1)
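(To end up with a criteria column like the one shown in the output further down, the result would simply be assigned back, e.g. df['criteria'] = df.apply(func, axis=1).)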

Using apply is probably the most straightforward approach, but I wanted to try a solution with NumPy, which should be faster for DataFrames with many rows.
import numpy as np

# Work on the underlying NumPy array.
df_arr = df.to_numpy()
# Find the positions of the NaNs.
nans = np.nonzero(np.isnan(df_arr))
# Roll each affected row so that the latest three columns with valid data all sit on the right.
for row, col in zip(*nans):
    df_arr[row, :] = np.roll(df_arr[row, :], shift=4 - col)
# Check the criteria against the last three columns.
df['criteria'] = np.select(
    [np.all((df_arr[:, 1:] - criteria_dict[crit]) >= 0, axis=1) for crit in criteria_dict],
    [crit for crit in criteria_dict]
)
print(df)
2021Q2 2021Q3 2021Q4 2022Q1 criteria
A 78 7 11.0 15.0 2.0
B 98 -5 NaN 18.0 0.0
C -78 20 64.0 28.0 1.0
D 35 63 27.0 NaN 1.0
E -11 0 56.0 10.0 2.0
Some timings on df = pd.concat([df]*10000):
%timeit numpy(df)
# 103 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pandas_apply(df)
# 1.32 s ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the NumPy version is roughly 10x faster.
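The exact bodies of the two timed functions are not shown above, so this is only an assumed reconstruction of what they presumably wrap, based on the snippets in this answer (the names numpy and pandas_apply are taken from the %timeit calls, and df is assumed to contain only the four quarter columns):

import numpy as np
import pandas as pd

def pandas_apply(df):
    # Assumed wrapper around the row-wise apply solution (`func` defined above).
    return df.apply(func, axis=1)

def numpy(df):
    # Assumed wrapper around the roll + np.select solution above.
    df_arr = df.to_numpy().copy()  # copy so repeated timing runs don't mutate df
    nans = np.nonzero(np.isnan(df_arr))
    for row, col in zip(*nans):
        df_arr[row, :] = np.roll(df_arr[row, :], shift=4 - col)  # 4 = number of quarter columns
    return np.select(
        [np.all((df_arr[:, 1:] - criteria_dict[crit]) >= 0, axis=1) for crit in criteria_dict],
        [crit for crit in criteria_dict]
    )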

It is possible to perform the comparison in a fully vectorized way. Note that the bottleneck is the broadcasting step, which creates an intermediate array of size K*M*N, where M*N is the size of the subset of the DataFrame (here 5*3) and K*N that of the criteria (here 2*3). You need enough memory to hold this array.
Step by step procedure:
First get the last 3 non-NaN values of each row as b:
N = 3
a = df.to_numpy()
from scipy.stats import rankdata
b = a[rankdata(~np.isnan(a), method='ordinal', axis=1) > (a.shape[1] - N)].reshape(-1, N)
array([[ 7., 11., 15.],
[98., -5., 18.],
[20., 64., 28.],
[35., 63., 27.],
[ 0., 56., 10.]])
Then craft an array with the conditions as c:
c = np.array(list(criteria_dict.values()))
array([[10, 0, 10],
[ 0, 10, 10]])
Broadcast the comparison of b against c and check that all values along the last axis satisfy >=:
d = (b >= c[:, None]).all(2)
array([[False, False, True, True, False],
[ True, False, True, True, True]])
Get the key of the first matching criterion per row from criteria_dict (or 0 if none matches):
e = np.where(d.any(0), np.array(list(criteria_dict))[np.argmax(d, axis=0)], 0)
array([2, 0, 1, 1, 2])
Assign to DataFrame:
df['criteria'] = e
2021Q2 2021Q3 2021Q4 2022Q1 criteria
A 78 7 11.0 15.0 2
B 98 -5 NaN 18.0 0
C -78 20 64.0 28.0 1
D 35 63 27.0 NaN 1
E -11 0 56.0 10.0 2
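Putting the walkthrough together, the steps can be wrapped in one helper. This is just a sketch: the name classify is mine, and it assumes every row has at least N non-NaN values, as in the sample.

import numpy as np
import pandas as pd
from scipy.stats import rankdata

def classify(df, criteria_dict, N=3):
    # Last N non-NaN values per row (step b).
    a = df.to_numpy()
    b = a[rankdata(~np.isnan(a), method='ordinal', axis=1) > (a.shape[1] - N)].reshape(-1, N)
    # One row per criterion (step c).
    c = np.array(list(criteria_dict.values()))
    # d[k, m] is True when row m fully satisfies criterion k (step d).
    d = (b >= c[:, None]).all(2)
    # Key of the first matching criterion, or 0 when none matches (step e).
    return np.where(d.any(0), np.array(list(criteria_dict))[np.argmax(d, axis=0)], 0)

# Pass only the quarter columns, since df may already carry a 'criteria' column.
df['criteria'] = classify(df[['2021Q2', '2021Q3', '2021Q4', '2022Q1']], criteria_dict)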

Related

Anything faster than groupby for iterating through groups?

So I've narrowed a previous problem down to this: I have a DataFrame that looks like this
id temp1 temp2
9 10.0 True False
10 10.0 True False
11 10.0 False True
12 10.0 False True
17 15.0 True False
18 15.0 True False
19 15.0 True False
20 15.0 True False
21 15.0 False False
33 27.0 True False
34 27.0 True False
35 27.0 False True
36 27.0 False False
40 31.0 True False
41 31.0 False True
.
.
.
and in reality, it's a few million lines long (and has a few other columns).
What I currently have it doing is
grouped = coinc.groupby('id')
final = grouped.filter(lambda x: ( x['temp2'].any() and x['temp1'].any()))
lanif = final.drop(['temp1','temp2'],axis = 1 )
(coinc is the name of the dataframe)
which only keeps rows (grouped by id) if there is a True in temp1 and a True in temp2 somewhere among the rows sharing that id. For example, with the above dataframe, it would get rid of the rows with id 15, but keep everything else.
This, however, is deathly slow and I was wondering if there was a faster way to do this.
Using filter with a lambda function here is slowing you down a lot. You can speed things up by removing it.
u = coinc.groupby('id')
m = u.temp1.any() & u.temp2.any()                    # per id: any True in temp1 AND any True in temp2
res = coinc.loc[coinc.id.isin(m[m].index), ['id']]   # keep rows whose id satisfies both
Comparing this to your approach on a larger frame.
a = np.random.randint(1, 1000, 100_000)
b = np.random.randint(0, 2, 100_000, dtype=bool)
c = ~b
coinc = pd.DataFrame({'id': a, 'temp1': b, 'temp2': c})
In [295]: %%timeit
...: u = coinc.groupby('id')
...: m = u.temp1.any() & u.temp2.any()
...: res = coinc.loc[coinc.id.isin(m[m].index), ['id']]
...:
13.5 ms ± 476 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [296]: %%timeit
...: grouped = coinc.groupby('id')
...: final = grouped.filter(lambda x: ( x['temp2'].any() and x['temp1'].any()))
...: lanif = final.drop(['temp1','temp2'],axis = 1 )
...:
527 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.array_equal(res.values, lanif.values)
True
i, u = pd.factorize(coinc.id)                       # integer codes per row and the unique ids
t = np.zeros((len(u), 2), bool)                     # per-id flags for temp1 / temp2
c = np.column_stack([coinc.temp1.to_numpy(), coinc.temp2.to_numpy()])
np.logical_or.at(t, i, c)                           # OR-reduce each group's rows into its flags
final = coinc.loc[t.all(1)[i], ['id']]              # keep rows whose id has a True in both columns
final
id
9 10.0
10 10.0
11 10.0
12 10.0
33 27.0
34 27.0
35 27.0
36 27.0
40 31.0
41 31.0
The problem isn't the groupby, it's the lambda. Lambda operations are not vectorized*. You can get the same result faster using agg. I'd do:
groupdf = coinc.groupby('id').agg(any)
# Select the ids where both columns contain at least one True
mask = groupdf[['temp1', 'temp2']].all(axis=1)
lanif = groupdf[mask].drop(['temp1', 'temp2'], axis=1)
*This is a pretty nuanced issue that I'm waaaay oversimplifying, sorry.
Here is another solution, using transform:
f = coinc.groupby('id').transform('any')   # per-id any(), broadcast back to every row
result = coinc.loc[f['temp1'] & f['temp2'], coinc.columns.drop(['temp1', 'temp2'])]

How to find the maximum consecutive number for multiple columns?

I need to identify the highest number of consecutive values that meet a certain criteria for multiple columns.
If my df is:
A B C D E
26 24 21 23 24
26 23 22 15 23
24 19 17 11 15
27 22 28 24 24
26 27 30 23 11
26 26 29 27 29
I want to know the maximum consecutive times that numbers over 25 occur for each column. So the output would be:
A 3
B 2
C 3
D 1
E 1
Using the following code I can obtain the outcome for one column at a time; is there a way to create a table like the one above rather than repeating this for each column? (I have over 40 columns in total.)
df.A.isnull().astype(int).groupby(df.A.notnull().astype(int).cumsum()).sum().max()
Thanks in advance.
Is this what you want? A pandas approach (PS: never thought I could make it a one-liner, LOL):
(df > 25).apply(lambda x: x.groupby(x.diff().ne(0).cumsum()).cumcount() + 1).mask(df < 25).max()
Out[320]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64
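To unpack that one-liner, here is roughly what each stage computes, with the same expressions split into named steps (a sketch on the sample df; the sample contains no values equal to exactly 25, so masking with df < 25 effectively drops the runs that fail the condition):

gt = df > 25                                        # condition per cell
run_pos = gt.apply(lambda x: x.groupby(x.diff().ne(0).cumsum())   # label runs of equal values
                              .cumcount() + 1)      # 1-based position within each run
masked = run_pos.mask(df < 25)                      # keep counts only where the threshold is met
result = masked.max()                               # longest qualifying run per column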
One option using numpy to calculate the max consecutive:
def max_consecutive(arr):
# calculate the indices where the condition changes
split_indices = np.flatnonzero(np.ediff1d(arr.values, to_begin=1, to_end=1))
# calculate the chunk length of consecutive values and pick every other value based on
# the initial value
try:
max_size = np.diff(split_indices)[not arr.iat[0]::2].max()
except ValueError:
max_size = 0
return max_size
df.gt(25).apply(max_consecutive)
#A 3
#B 2
#C 3
#D 1
#E 1
#dtype: int64
Timing compared with the other approach:
%timeit df.gt(25).apply(max_consecutive)
# 520 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
# 10.3 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here's one with NumPy -
# mask is a 2D boolean array representing islands as True values per column
def max_island_len_cols(mask):
    m, n = mask.shape
    out = np.zeros(n, dtype=int)
    b = np.zeros((m + 2, n), dtype=bool)
    b[1:-1] = mask
    for i in range(mask.shape[1]):
        idx = np.flatnonzero(b[1:, i] != b[:-1, i])
        if len(idx) > 0:
            out[i] = (idx[1::2] - idx[::2]).max()
    return out
output = pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Sample run -
In [690]: df
Out[690]:
A B C D E
0 26 24 21 23 24
1 26 23 22 15 23
2 24 19 17 11 15
3 27 22 28 24 24
4 26 27 30 23 11
5 26 26 29 27 29
In [690]:
In [691]: pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Out[691]:
A 3
B 2
C 3
D 1
E 1
dtype: int64
Runtime test
Inspired by the given sample, which has numbers in the range (24, 28), and the mention of 40 columns, let's set up a bigger input dataframe and test all the solutions -
# Input dataframe
In [692]: df = pd.DataFrame(np.random.randint(24,28,(1000,40)))
# Proposed in this post
In [693]: %timeit pd.Series(max_island_len_cols(df.values>25), index=df.columns)
1000 loops, best of 3: 539 µs per loop
# @Psidom's solution
In [694]: %timeit df.gt(25).apply(max_consecutive)
1000 loops, best of 3: 1.81 ms per loop
# @Wen's solution
In [695]: %timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
10 loops, best of 3: 95.2 ms per loop
An approach using pandas and scipy.ndimage.label, for fun.
import pandas as pd
from scipy.ndimage import label
struct = [[0, 1, 0], # Structure used for segmentation
[0, 1, 0], # Equivalent to axis=0 in `numpy`
[0, 1, 0]] # Or 'columns' in `pandas`
labels, nlabels = label(df > 25, structure=struct)
>>> labels # Labels for each column-wise block of consecutive numbers > 25
Out[]:
array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[2, 0, 3, 0, 0],
[2, 4, 3, 0, 0],
[2, 4, 3, 5, 6]])
labels_df = pd.DataFrame(columns=df.columns, data=labels) # Add original columns names
res = (labels_df.apply(lambda x: x.value_counts()) # Execute `value_counts` on each column
.iloc[1:] # slice results for labels > 0
.max()) # and get max value
>>> res
Out[]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64

Removing values in dataframe once threshold (min/max) value has been reached with Pandas

I would like to make a filter for the entire dataframe, which includes many columns beyond the three shown (A1-A3). I'd like this filter to start returning values in each column once a minimum threshold has been reached, and stop once a maximum threshold has been reached. I'd like the min threshold to be 6.5 and the max to be 9.0. It's not as simple as it sounds here, so hang with me...
The dataframe:
Time A1 A2 A3
1 6.305 6.191 5.918
2 6.507 6.991 6.203
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 9.394 9.194 8.994
10 8.803 8.113 9.333
11 8.783 8.783 8.783
The desired result:
Time A1 A2 A3
1 NaN NaN NaN
2 6.507 6.991 NaN
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 NaN NaN 8.994
10 NaN NaN NaN
11 NaN NaN NaN
To drive home the point: in column A1, for example, at Time 3 there is a value of 6.407, which is lower than the 6.5 threshold, but since the threshold was already met at Time 2, I would like to keep the data once the min threshold has been met. As for the upper threshold, in column A1 at Time 9 the value is above the 9.0 threshold, so I would like to omit that value and everything after it, even though the remaining values are below 9.0. I'm hoping to iterate this over many, many more columns.
Thank you!!!
Implementation
Here's a vectorized approach using NumPy boolean indexing -
# Extract values into an array
arr = df.values
# Determine the min, max limits along each column
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)
# Set up the corresponding boolean mask and set those positions in the array to NaN
R = np.arange(arr.shape[0])[:, None]
mask = (R < minl) | (R >= maxl)
arr[mask] = np.nan
# Finally, convert back to a dataframe
df = pd.DataFrame(arr, columns=df.columns)
Please note that, alternatively, one could mask directly into the input dataframe instead of re-creating it (a short sketch of this follows), but the interesting finding here is that boolean indexing into a NumPy array is faster than into a pandas dataframe. Since we are filtering the entire dataframe, we can simply re-create it.
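A sketch of that alternative, reusing the mask built above (this is not part of the original answer):

# Instead of arr[mask] = np.nan and rebuilding, blank the same positions on the original df.
df_masked = df.mask(mask)        # NaN where mask is True, original values elsewhere
# modifying df itself should also work: df[mask] = np.nan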
Closer look
Now, let's take a closer look at the mask creation part, which is the crux of this solution.
1) Input array :
In [148]: arr
Out[148]:
array([[ 6.305, 6.191, 5.918],
[ 6.507, 6.991, 6.203],
[ 6.407, 6.901, 6.908],
[ 6.963, 7.127, 7.116],
[ 7.227, 7.33 , 7.363],
[ 7.445, 7.632, 7.575],
[ 7.71 , 7.837, 7.663],
[ 8.904, 8.971, 8.895],
[ 9.394, 9.194, 8.994],
[ 8.803, 8.113, 9.333],
[ 8.783, 8.783, 8.783]])
2) Min,max limits :
In [149]: # Determine the min,max limits along each column
...: minl = (arr > 6.5).argmax(0)
...: maxl = (arr>9).argmax(0)
...:
In [150]: minl
Out[150]: array([1, 1, 2])
In [151]: maxl
Out[151]: array([8, 8, 9])
3) Use broadcasting to create a mask that spans the entire dataframe/array and selects the elements to be set to NaN:
In [152]: R = np.arange(arr.shape[0])[:,None]
In [153]: R
Out[153]:
array([[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10]])
In [154]: (R < minl) | (R >= maxl)
Out[154]:
array([[ True, True, True],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[ True, True, False],
[ True, True, True],
[ True, True, True]], dtype=bool)
Runtime test
Let's time the approaches listed so far. Since it was mentioned that there would be many columns, let's use a decently big number of them.
Approaches listed as functions :
def cumsum_app(df):  # Listed in the other solution by @Merlin
    df2 = df > 6.5
    df = df[df2.cumsum() > 0]
    df2 = df > 9
    df = df[~(df2.cumsum() > 0)]

def boolean_indexing_app(df):  # Approach listed in this post
    arr = df.values
    minl = (arr > 6.5).argmax(0)
    maxl = (arr > 9).argmax(0)
    R = np.arange(arr.shape[0])[:, None]
    mask = (R < minl) | (R >= maxl)
    arr[mask] = np.nan
    df = pd.DataFrame(arr, columns=df.columns)
Timings :
In [163]: # Create a random array with floating pt numbers between 6 and 10
...: df = pd.DataFrame((np.random.rand(11,10000)*4)+6)
...:
...: # Create copies for testing approaches
...: df1 = df.copy()
...: df2 = df.copy()
In [164]: %timeit cumsum_app(df1)
100 loops, best of 3: 16.4 ms per loop
In [165]: %timeit boolean_indexing_app(df2)
100 loops, best of 3: 2.09 ms per loop
Try this:
df
A1 A2 A3
Time
1 6.305 6.191 5.918
2 6.507 6.991 6.203
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 9.394 9.194 8.994
10 8.803 8.113 9.333
11 8.783 8.783 8.783
df2 = df > 6.5
df = df[df2.cumsum() > 0]      # cumsum() > 0 is True from the first value above 6.5 onward
df2 = df > 9
df = df[~(df2.cumsum() > 0)]   # ...and this drops everything from the first value above 9.0 onward
df
A1 A2 A3
Time
1 NaN NaN NaN
2 6.507 6.991 NaN
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 NaN NaN 8.994
10 NaN NaN NaN
11 NaN NaN NaN

Selecting rows from sparse dataframe by index position

In a typical pandas DataFrame, it's easy to select desired rows based on their index:
df.ix[list_of_inds] or df.loc[list_of_inds]
However, using this method to take a substantial subset of a large, sparse DataFrame (specifically 73,000 rows by 8,000 columns) seems to be extremely intensive - my memory shoots up and my computer crashes.
I did notice that indexing using a range like this..
df.ix[1:N]
works fine, while using a list of indices like this...
df.ix[np.arange(1,N)]
is what makes the memory overload.
Is there another way to select rows from a sparse dataframe that's computationally easier? Or, can I convert this dataframe to an actual sparse matrix...
sparse_df = scipy.sparse.csc(df)
and select only the indices I want from that?
The issue you are facing could be related to view vs copy semantics.
df.ix[1:N] # uses slicing => operates on a view
df.ix[np.arange(1,N)] # uses fancy indexing => "probably" creates a copy first
I created a DataFrame of shape 73000x8000 on my machine and my memory spiked to 4.4 GB, so I wouldn't be surprised by crashes. That said, if you really need a new array built from the index list, you're out of luck. However, to modify the original DataFrame, you should be able to update it one row at a time, or a few sliced rows at a time, at the expense of speed, e.g.:
for i in arbitrary_list_of_indices:
    df.ix[i] = new_values
Btw, you could try working off NumPy arrays directly, which I feel has clearer semantics about which operations result in copies vs. views. You can always create a DataFrame from the array with hardly any memory overhead, since it just creates a reference to the original array (a quick check of that claim is sketched below).
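As a rough check of that claim (not from the original answer; exact behaviour can differ in newer pandas versions, e.g. with copy-on-write enabled):

import numpy as np
import pandas as pd

arr = np.arange(12.0).reshape(4, 3)
wrapped = pd.DataFrame(arr, copy=False)           # wrap the array without requesting a copy
print(np.shares_memory(arr, wrapped.to_numpy()))  # True when the DataFrame references the same buffer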
Also, indexing in NumPy seems much faster, even without slicing. Here's a simple test case:
In [66]: df
Out[66]:
0 1 2 3
0 3 14 5 1
1 9 19 14 4
2 5 4 5 5
3 13 14 4 7
4 8 12 3 16
5 15 3 17 12
6 11 0 12 0
In [68]: df.ix[[1,3,5]] # fancy index version
Out[68]:
0 1 2 3
1 9 19 14 4
3 13 14 4 7
5 15 3 17 12
In [69]: df.ix[1:5:2] # sliced version of the same
Out[69]:
0 1 2 3
1 9 19 14 4
3 13 14 4 7
5 15 3 17 12
In [71]: %timeit df.ix[[1,3,5]] = -1 # use fancy index version
1000 loops, best of 3: 251 µs per loop
In [72]: %timeit df.ix[1:5:2] = -2 # faster sliced version
10000 loops, best of 3: 157 µs per loop
In [73]: arr = df.values
In [74]: arr
Out[74]:
array([[ 3, 14, 5, 1],
[-2, -2, -2, -2],
[ 5, 4, 5, 5],
[-2, -2, -2, -2],
[ 8, 12, 3, 16],
[-2, -2, -2, -2],
[11, 0, 12, 0]])
In [75]: %timeit arr[[1,3,5]] = -1 # much faster than DataFrame
The slowest run took 23.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.56 µs per loop
In [77]: %timeit arr[1:5:2] = -3 # really fast but restricted to slicing
The slowest run took 19.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 821 ns per loop
Good luck!

Python: numpy/pandas change values on condition

I would like to know if there is a faster and more "pythonic" way of doing the following, e.g. using some built-in methods.
Given a pandas DataFrame or NumPy array of floats: if a value is less than or equal to 0.5, I need to calculate its reciprocal, multiply it by -1, and replace the old value with the newly calculated one.
"Transform" is probably a bad choice of words; please tell me if you have a better/more accurate description.
Thank you for your help and support!!
Data:
import numpy as np
import pandas as pd
dicti = {"A" : np.arange(0.0, 3, 0.1),
"B" : np.arange(0, 30, 1),
"C" : list("ELVISLIVES")*3}
df = pd.DataFrame(dicti)
my function:
def transform_colname(df, colname):
    series = df[colname]
    newval_list = []
    for val in series:
        if val <= 0.5:
            newval = (1 / val) * -1
            newval_list.append(newval)
        else:
            newval_list.append(val)
    df[colname] = newval_list
    return df
function call:
transform_colname(df, colname="A")
--> I'm summing up the results here, since comments don't allow posting code (or I don't know how to do it).
Thank you all for your fast and great answers!!
using ipython "%timeit" with "real" data:
my function:
10 loops, best of 3: 24.1 ms per loop
from jojo:
def transform_colname_v2(df, colname):
    series = df[colname]
    df[colname] = np.where(series <= 0.5, 1 / series * -1, series)
    return df
100 loops, best of 3: 2.76 ms per loop
from FooBar:
def transform_colname_v3(df, colname):
    df.loc[df[colname] <= 0.5, colname] = -1 / df[colname][df[colname] <= 0.5]
    return df
100 loops, best of 3: 3.32 ms per loop
from dmvianna:
def transform_colname_v4(df, colname):
    # .where keeps values where the condition is True, so the condition selects the values to leave untouched
    df[colname] = df[colname].where(df[colname] > 0.5, (1 / df[colname]) * -1)
    return df
100 loops, best of 3: 3.7 ms per loop
Please tell/show me if you would implement your code in a different way!
One final QUESTION: (answered)
How could "FooBar"'s and "dmvianna"'s versions be made "generic"? I mean, I had to write the name of the column into the function (since using it as a variable didn't work). Please explain this last point!
--> thanks jojo, ".loc" isn't the right way; plain df[colname] is sufficient. I changed the functions above to be more "generic" (and also flipped ">" to "<=" where the logic required it, and updated the timings).
Thank you very much!!
The typical trick is to write a general mathematical operation to apply to the whole column, but then use indicators to select rows for which we actually apply it:
df.loc[df.A < 0.5, 'A'] = - 1 / df.A[df.A < 0.5]
In[13]: df
Out[13]:
A B C
0 -inf 0 E
1 -10.000000 1 L
2 -5.000000 2 V
3 -3.333333 3 I
4 -2.500000 4 S
5 0.500000 5 L
6 0.600000 6 I
7 0.700000 7 V
8 0.800000 8 E
9 0.900000 9 S
10 1.000000 10 E
11 1.100000 11 L
12 1.200000 12 V
13 1.300000 13 I
14 1.400000 14 S
15 1.500000 15 L
16 1.600000 16 I
17 1.700000 17 V
18 1.800000 18 E
19 1.900000 19 S
20 2.000000 20 E
21 2.100000 21 L
22 2.200000 22 V
23 2.300000 23 I
24 2.400000 24 S
25 2.500000 25 L
26 2.600000 26 I
27 2.700000 27 V
28 2.800000 28 E
29 2.900000 29 S
If we are talking about arrays:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
print(1 / a[a <= 0.5] * (-1))
This will, however, only return the transformed values for the entries <= 0.5, not an array of the original shape.
Alternatively, use np.where:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
print(np.where(a < 0.5, 1 / a * (-1), a))
Talking about a pandas DataFrame:
As in @dmvianna's answer (so give some credit to him ;) ), adapting it to pd.DataFrame:
df.a = df.a.where(df.a > 0.5, (1 / df.a) * (-1))
As in @jojo's answer, but using pandas:
df.A = df.A.where(df.A > 0.5, (1/df.A)*-1)
or
df.A.where(df.A > 0.5, (1/df.A)*-1, inplace=True) # this should be faster
.where docstring:
Definition: df.A.where(self, cond, other=nan, inplace=False,
axis=None, level=None, try_cast=False, raise_on_error=True)
Docstring:
Return an object of same shape as self and whose corresponding entries
are from self where cond is True and otherwise are from other.
