compare a value to every pair of values in a pandas dataframe - python

I have a data frame with intervals like:
begin end
2 4
6 8
9 11
I want to compare a value to each of these interval pairs. If the value falls inside any interval, the result should be 'yes'; otherwise 'no'.
For example: x = 3 => yes (because 2 < x < 4), x = 5 => no
I currently do this with a nested loop over each value of x and each interval. But I have many values of x and many intervals, so the nested loop is really slow. Is there any way I can do this efficiently without a loop? Thank you!

You can use broadcasting to speed up the comparisons:
def check_intervals(x):
    if any((intervals.begin < x) & (x < intervals.end)):
        return 'yes'
    else:
        return 'no'
>>> intervals = pd.DataFrame({'begin': [2, 6, 9], 'end': [4, 8, 11]})
>>> intervals
begin end
0 2 4
1 6 8
2 9 11
>>> check_intervals(3)
'yes'
>>> values = [3, 5, 10, 11]
>>> [check_intervals(x) for x in values]
['yes', 'no', 'yes', 'no']
That should comfortably handle a few thousand comparisons per second.
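If there are many values of x, the same comparison can also be broadcast over all of them at once. A minimal sketch (the values array and the yes/no mapping are my additions, not part of the answer above):
import numpy as np
import pandas as pd

intervals = pd.DataFrame({'begin': [2, 6, 9], 'end': [4, 8, 11]})
values = np.array([3, 5, 10, 11])

# Compare every value against every interval in one shot: shape (len(values), len(intervals))
b = intervals['begin'].to_numpy()
e = intervals['end'].to_numpy()
inside = (b < values[:, None]) & (values[:, None] < e)
result = np.where(inside.any(axis=1), 'yes', 'no')
print(result)  # ['yes' 'no' 'yes' 'no']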

Not sure this improves efficiency at all, but you could create a column indicating whether your value falls in each interval like so,
df = pd.DataFrame({'a':[2, 6, 9], 'b':[4, 8, 11]})
df['a'] = df['a'].astype(str)
df['b'] = df['b'].astype(str)
df['t'] = df['a'].str.cat(df['b'],sep=",")
def in_between(x, val):
    a, b = x.split(',')
    if val > int(a) and val < int(b):
        return 'yes'
    else:
        return 'no'
val = 3
df['bw'] = df['t'].map(lambda x: in_between(x, val))
df

Here is another take on it, just for the sake of benchmarking. It relies only on numpy, which probably makes it the fastest option, at the expense of using more memory. On my machine, it takes around 2 seconds to evaluate 10000 values against 10000 intervals.
import numpy as np
import pandas as pd
N = 10
values = np.random.randn(N) * N # Your x values
df = pd.DataFrame({"begin": [2, 6, 9],
                   "end": [4, 8, 11]})
data = df.to_numpy()
# Broadcast: compare every value against every interval, shape (len(values), len(intervals))
begin = data[:, 0] < values[np.newaxis].T
end = values[np.newaxis].T < data[:, 1]
# Inside an interval if both bounds hold; any(axis=-1) collapses to one flag per value
answer = np.stack((begin, end), axis=-1)
answer = answer.all(axis=-1).any(axis=-1)
Here answer has the same shape as values. Afterwards, you can just do a simple replacement of True and False with yes and no, respectively.
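For example, assuming the answer array from above:
labels = np.where(answer, 'yes', 'no')  # map the boolean flags to 'yes'/'no' strings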


How to create multiple rows of a data frame based on some original values

I am a Python newbie and have a question.
As a simple example, I have three variables:
a = 3
b = 10
c = 1
I'd like to create a data frame with three columns ('a', 'b', and 'c') with:
each column +/- a certain constant from the original value AND also >0 and <=10.
If the constant is 1 then:
the possible values of 'a' will be 2, 3, 4
the possible values of 'b' will be 9, 10
the possible values of 'c' will be 1, 2
The final data frame will consist of all possible combination of a, b and c.
Do you know any Python code to do so?
Here is some code to start with.
import pandas as pd
data = [[3 , 10, 1]]
df1 = pd.DataFrame(data, columns=['a', 'b', 'c'])
You may use itertools.product for this.
Create 3 separate lists with the accepted values. This can be done with a function that returns the list of possible values.
def list_of_values(n):
    if 1 < n < 10:  # interior values: n-1, n, n+1 all stay within (0, 10]
        return [n - 1, n, n + 1]
    elif n == 1:
        return [1, 2]
    elif n == 10:
        return [9, 10]
    return []
So you will have the following:
a = [2, 3, 4]
b = [9, 10]
c = [1,2]
Next, do the following:
from itertools import product
l = product(a,b,c)
data = list(l)
pd.DataFrame(data, columns=['a', 'b', 'c'])
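Putting the pieces together, a minimal end-to-end sketch (reusing list_of_values from above):
import pandas as pd
from itertools import product

def list_of_values(n):
    if 1 < n < 10:
        return [n - 1, n, n + 1]
    elif n == 1:
        return [1, 2]
    elif n == 10:
        return [9, 10]
    return []

# Candidate values per column, then the Cartesian product of all three
a, b, c = list_of_values(3), list_of_values(10), list_of_values(1)
df = pd.DataFrame(list(product(a, b, c)), columns=['a', 'b', 'c'])
print(len(df))  # 3 * 2 * 2 = 12 combinations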

Transform a Pandas series to be monotonic

I'm looking for a way to remove the points that ruin the monotonicity of a series.
For example
s = pd.Series([0,1,2,3,10,4,5,6])
or
s = pd.Series([0,1,2,3,-1,4,5,6])
we would extract
s = pd.Series([0,1,2,3,4,5,6])
NB: we assume that the first element is always correct.
Monotonic can be either increasing or decreasing; the functions below will exclude all values that break monotonicity.
However, there seems to be some confusion in your question: given the series s = pd.Series([0,1,2,3,10,4,5,6]), 10 doesn't break the monotonicity condition; 4, 5, 6 do. So the correct answer there is 0, 1, 2, 3, 10.
import pandas as pd
s = pd.Series([0,1,2,3,10,4,5,6])
def to_monotonic_inc(s):
    return s[s >= s.cummax()]
def to_monotonic_dec(s):
    return s[s <= s.cummin()]
print(to_monotonic_inc(s))
print(to_monotonic_dec(s))
Output is 0, 1, 2, 3, 10 for increasing and 0 for decreasing.
Perhaps you want to find the longest monotonic subsequence? Because that's a completely different search problem.
----- EDIT -----
Below is a simple way of finding the longest monotonic ascending array given your constraints using plain python:
def get_longest_monotonic_asc(s):
    enumerated = sorted([(v, i) for i, v in enumerate(s) if v >= s[0]])[1:]
    output = [s[0]]
    last_index = 0
    for v, i in enumerated:
        if i > last_index:
            last_index = i
            output.append(v)
    return output
s1 = [0,1,2,3,10,4,5,6]
s2 = [0,1,2,3,-1,4,5,6]
print(get_longest_monotonic_asc(s1))
print(get_longest_monotonic_asc(s2))
'''
Output:
[0, 1, 2, 3, 4, 5, 6]
[0, 1, 2, 3, 4, 5, 6]
'''
Note that this solution involves sorting, which is O(n log n), plus a second pass which is O(n).
Here is a way to produce a monotonically increasing series:
import pandas as pd
# create data
s = pd.Series([1, 2, 3, 4, 5, 4, 3, 2, 3, 4, 5, 6, 7, 8])
# find max so far (i.e., running_max)
df = pd.concat([s.rename('orig'),
                s.cummax().rename('running_max'),
                ], axis=1)
# are we at or above max so far?
df['keep?'] = (df['orig'] >= df['running_max'])
# filter out one or many points below max so far
df = df.loc[df['keep?'], 'orig']
# verify that remaining points are monotonically increasing
assert pd.Index(df).is_monotonic_increasing
# print(df.drop_duplicates()) # eliminates ties
print(df) # keeps ties
0 1
1 2
2 3
3 4
4 5
10 5 # <-- same as previous value -- a tie
11 6
12 7
13 8
Name: orig, dtype: int64
You can compare the two graphically with s.plot() and df.plot().
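For instance, a quick sketch assuming matplotlib is installed:
import matplotlib.pyplot as plt

ax = s.plot(style='o-', label='original')
df.plot(style='o-', label='monotonic', ax=ax)
ax.legend()
plt.show()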

pandas dataframe exponential decay summation

I have a pandas dataframe,
[[1, 3],
[4, 4],
[2, 8]...
]
I want to create a column that has this:
1*(a)^(3) # = x
1*(a)^(3 + 4) + 4 * (a)^4 # = y
1*(a)^(3 + 4 + 8) + 4 * (a)^(4 + 8) + 2 * (a)^8 # = z
...
Where "a" is some value.
The coefficients 1, 4, 2 come from column one; the repeated exponents 3, 4, 8 come from column two.
Is this possible using some form of transform/apply?
Essentially getting:
[[1, 3, x],
[4, 4, y],
[2, 8, z]...
]
Where x, y, z is the respective sums from the new column (I want them next to each other)
There is a "groupby" that is being done on the dataframe, and this is what I want to do for a given group
If I'm understanding your question correctly, this should work:
df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
a = 42
new_lst = []
for n in range(len(df)):
    z = 0
    i = 0
    while i <= n:
        z += df['a'][i] * a ** (sum(df['b'][i:n+1]))
        i += 1
    new_lst.append(z)
df['new'] = new_lst
Update:
Saw that you are using pandas and updated with dataframe methods. Not sure there's an easy way to do this with apply since you need a mix of values from different rows. I think this for loop is still the best route.
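That said, the sum can be vectorized by factoring out the cumulative exponent: with c = df.b.cumsum(), each entry is z_n = a**c[n] * sum over i <= n of df.a[i] * a**(df.b[i] - c[i]). A sketch under that algebra (my rearrangement, not the answer above; note a**c can overflow for long columns):
c = df['b'].cumsum()
# float(a) so that negative integer exponents are allowed
w = df['a'] * float(a) ** (df['b'] - c)
df['new'] = float(a) ** c * w.cumsum()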

R's which() and which.min() Equivalent in Python

I read the similar topic here. I think the question is different or at least .index() could not solve my problem.
This is a simple code in R and its answer:
x <- c(1:4, 0:5, 11)
x
#[1] 1 2 3 4 0 1 2 3 4 5 11
which(x==2)
# [1] 2 7
min(which(x==2))
# [1] 2
which.min(x)
#[1] 5
Which simply returns the index of the item which meets the condition.
If x is the input in Python, how can I get the indices of the elements that meet the criterion x==2, and the index of the smallest element in the array (which.min)?
x = [1,2,3,4,0,1,2,3,4,11]
x=np.array(x)
x[x>2].index()
##'numpy.ndarray' object has no attribute 'index'
Numpy does have built-in functions for it
x = [1,2,3,4,0,1,2,3,4,11]
x=np.array(x)
np.where(x == 2)
np.min(np.where(x==2))
np.argmin(x)
np.where(x == 2)
Out[9]: (array([1, 6], dtype=int64),)
np.min(np.where(x==2))
Out[10]: 1
np.argmin(x)
Out[11]: 4
A simple loop will do:
res = []
x = [1,2,3,4,0,1,2,3,4,11]
for i in range(len(x)):
    if check_condition(x[i]):  # check_condition is whatever predicate you need, e.g. lambda v: v == 2
        res.append(i)
One liner with comprehension:
res = [i for i, v in enumerate(x) if check_condition(v)]
NumPy for R users provides a lot of R's functionality in Python.
As to your specific question:
import numpy as np
x = [1,2,3,4,0,1,2,3,4,11]
arr = np.array(x)
print(arr)
# [ 1 2 3 4 0 1 2 3 4 11]
print(arr.argmin(0)) # R's which.min()
# 4
print((arr==2).nonzero()) # R's which()
# (array([1, 6]),)
A method based on Python indexing and NumPy, returning the value of one column at the row where another column is minimal (or maximal):
df.iloc[np.argmin(df['column1'].values)]['column2']
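For example, with a hypothetical frame (the column names are just placeholders):
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [3, 1, 2], 'column2': ['a', 'b', 'c']})
# value of column2 at the row where column1 is smallest
print(df.iloc[np.argmin(df['column1'].values)]['column2'])  # 'b'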
The built-in index method of lists can be used for this purpose:
x = [1,2,3,4,0,1,2,3,4,11]
print(x.index(min(x)))
#4
print(x.index(max(x)))
#9
However, for indices based on a condition, np.where or a manual loop with enumerate may work:
index_greater_than_two1 = [idx for idx, val in enumerate(x) if val>2]
print(index_greater_than_two1)
# [2, 3, 7, 8, 9]
# OR
index_greater_than_two2 = np.where(np.array(x)>2)
print(index_greater_than_two2)
# (array([2, 3, 7, 8, 9], dtype=int64),)
You could also use heapq to find the index of the smallest element. You can then choose to find several at once (for example, the indices of the 2 smallest).
import heapq
x = np.array([1,2,3,4,0,1,2,3,4,11])
heapq.nsmallest(2, range(len(x)), x.take)
Returns
[4, 0]

How to replace only the first n elements in a numpy array that are larger than a certain value?

I have an array myA like this:
array([ 7, 4, 5, 8, 3, 10])
If I want to replace all values that are larger than a value val by 0, I can simply do:
myA[myA > val] = 0
which gives me the desired output (for val = 5):
array([0, 4, 5, 0, 3, 0])
However, my goal is to replace not all but only the first n elements of this array that are larger than a value val.
So, if n = 2 my desired outcome would look like this (10 is the third element and should therefore not been replaced):
array([ 0, 4, 5, 0, 3, 10])
A straightforward implementation would be:
import numpy as np
myA = np.array([7, 4, 5, 8, 3, 10])
n = 2
val = 5
# track the number of replacements
repl = 0
for ind, vali in enumerate(myA):
    if vali > val:
        myA[ind] = 0
        repl += 1
        if repl == n:
            break
That works, but maybe someone can come up with a smart way of masking!?
The following should work:
myA[(myA > val).nonzero()[0][:2]] = 0
since nonzero will return the indexes where the boolean array myA > val is nonzero, i.e. True.
For example:
In [1]: myA = array([ 7, 4, 5, 8, 3, 10])
In [2]: myA[(myA > 5).nonzero()[0][:2]] = 0
In [3]: myA
Out[3]: array([ 0, 4, 5, 0, 3, 10])
Final solution is very simple:
import numpy as np
myA = np.array([7, 4, 5, 8, 3, 10])
n = 2
val = 5
myA[np.where(myA > val)[0][:n]] = 0
print(myA)
Output:
[ 0 4 5 0 3 10]
Here's another possibility (untested), probably no better than nonzero:
def truncate_mask(m, stop):
    m = m.astype(bool, copy=False)  # if we allow non-bool m, the next line becomes nonsense
    return m & (np.cumsum(m) <= stop)
myA[truncate_mask(myA > val, n)] = 0
By avoiding building and using an explicit index you might end up with slightly better performance...but you'd have to test it to find out.
Edit 1: while we're on the subject of possibilities, you could also try:
def truncate_mask(m, stop):
    m = m.astype(bool, copy=True)  # note we need to copy m here to safely modify it
    # side='right' so the slice starts just past the stop-th True rather than on it
    m[np.searchsorted(np.cumsum(m), stop, side='right'):] = 0
    return m
Edit 2 (the next day): I've just tested this and it seems that cumsum is actually worse than nonzero, at least with the kinds of values I was using (so neither of the above approaches is worth using). Out of curiosity, I also tried it with numba:
import numba

@numba.jit
def set_first_n_gt_thresh(a, val, thresh, n):
    ii = 0
    while n > 0 and ii < len(a):
        if a[ii] > thresh:
            a[ii] = val
            n -= 1
        ii += 1
This only iterates over the array once, or rather it only iterates over the necessary part of the array once, never even touching the latter part. This gives you vastly superior performance for small n, but even for the worst case of n>=len(a) this approach is faster.
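A quick usage sketch (note the function mutates its argument in place):
import numpy as np

myA = np.array([7, 4, 5, 8, 3, 10])
set_first_n_gt_thresh(myA, val=0, thresh=5, n=2)
print(myA)  # [ 0  4  5  0  3 10]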
You could use the same kind of solution by converting your np.array to a pd.Series:
s = pd.Series([ 7, 4, 5, 8, 3, 10])
n = 2
m = 5
s[s[s>m].iloc[:n].index] = 0
In [416]: s
Out[416]:
0 0
1 4
2 5
3 0
4 3
5 10
dtype: int64
Step by step explanation:
In [426]: s > m
Out[426]:
0 True
1 False
2 False
3 True
4 False
5 True
dtype: bool
In [428]: s[s>m].iloc[:n]
Out[428]:
0 7
3 8
dtype: int64
In [429]: s[s>m].iloc[:n].index
Out[429]: Int64Index([0, 3], dtype='int64')
In [430]: s[s[s>m].iloc[:n].index]
Out[430]:
0 7
3 8
dtype: int64
The output of In[430] looks the same as In[428], but In[428] is a copy while In[430] indexes into the original series.
If you need an np.array, you can use the values attribute:
In [418]: s.values
Out[418]: array([ 0, 4, 5, 0, 3, 10], dtype=int64)
