Summing until time-condition is reached in Python - python

I want to sum over a certain, but rolling, period within my dynamic model. The formal representation is as follows
A simple code snippet to run the equation is:
import numpy as np
import pandas as pd
import operator
year = np.arange(50)
m_ = [50, 30, 15]
a = [25, 15, 7.5]
ARC_ = [38, 255, 837]
r = 0.03
I tried subtracting list a from m_ with list(map(operator.sub, m_, a)), as found in another post.
My failed attempt looks something like this:
for t in year:
    for i in range(0, 3):
        while t < t+(list(map(operator.sub, m_, a))):
            L_[t] = sum(ARC_[i] / (1+r) ** t)

I am not at all sure that I understood it right, so I tried to base my answer on the equation. Even if it is still a bit off from the result you expect, it might help you solve your issue.
I create a result list to store each value of L[t], i.e. 50 values. Then I compute the start and stop of the sum for every pair (t, i) and compute it.
import numpy as np
years = np.arange(50)
m_ = [50, 30, 15]
a = [25, 15, 7.5]
ARC_ = [38, 255, 837]
r = 0.03
result = []
for t in years:
    s = 0
    for i in range(3):
        t0 = t
        tf = t + m_[i]-a[i]
        for k in range(int(t0), int(tf+1)):
            s += ARC_[i] / (1+r) ** t
    result.append(s)
If what you wanted to do is to compute the difference element-wise between m and a, a simple solution is:
[m_[i] - a[i] for i in range(len(m_))]
Hope it helps.
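As a small sketch of the element-wise difference mentioned above, here are two equivalent ways to compute it. The names diff_zip and diff_np are only illustrative and do not appear in the question.

```python
import numpy as np

m_ = [50, 30, 15]
a = [25, 15, 7.5]

# Pair up the two lists and subtract item by item.
diff_zip = [m - x for m, x in zip(m_, a)]

# NumPy arrays subtract element-wise without an explicit loop.
diff_np = np.array(m_) - np.array(a)

print(diff_zip)
print(diff_np)
```

The zip version avoids indexing, and the NumPy version is convenient if the rest of the model already works with arrays.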

Related

Replace outlier values with NaN in numpy? (preserve length of array)

I have an array of magnetometer data with artifacts every two hours due to power cycling.
I'd like to replace those indices with NaN so that the length of the array is preserved.
Here's a code example, adapted from https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html.
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime
# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_'+ 'PG2',
    start,
    end,
    ['thg_mag_'+ 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']
def reject_outliers(y): # y is the data in a 1D numpy array
    n = 5 # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = [x for x in y if (x > mean - 2 * sd)]
    final_list = [x for x in final_list if (x < mean + 2 * sd)]
    return final_list
px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
px.line(y=y, x=x)
# px.scatter(y) # It looks like the outliers are successfully dropped.
# px.line(y=reject_outliers(y), x=x) # This is the line I'd like to see work.
When I run 'px.scatter(reject_outliers(y))', it looks like the outliers are successfully getting dropped:
...but that's looking at the culled y vector relative to the index, rather than the datetime vector x as in the above plot. As the debugging text indicates, the vector is shortened because the outlier values are dropped rather than replaced.
How can I edit my 'reject_outliers()' function to assign those values to NaN, or to adjacent values, so that the length of the array stays the same and I can plot my data?
Use else in the list comprehension along the lines of:
[x if x_condition else other_value for x in y]
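A minimal sketch of that pattern applied to the outlier problem, assuming a made-up sample array in place of the magnetometer data. The function name reject_outliers_listcomp and the default of 2 standard deviations (taken from the question's code) are choices made for this illustration.

```python
import numpy as np

def reject_outliers_listcomp(y, n=2):
    """Return a list as long as y, with outliers replaced by NaN."""
    mean = np.mean(y)
    sd = np.std(y)
    # Keep values within n standard deviations of the mean, else use NaN.
    return [v if abs(v - mean) <= n * sd else np.nan for v in y]

# Hypothetical data: one obvious spike at index 6.
y = np.array([1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 50.0, 1.0])
cleaned = reject_outliers_listcomp(y)
print(len(y), len(cleaned))
```

Because every element maps to exactly one output value, the result keeps the same length as the input and can be plotted against the original time vector.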
Got a less compact version to work. Full code:
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime
# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_'+ 'PG2',
    start,
    end,
    ['thg_mag_'+ 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']
def reject_outliers(y): # y is the data in a 1D numpy array
    mean = np.mean(y)
    sd = np.std(y)
    final_list = np.copy(y)
    for n in range(len(y)):
        final_list[n] = y[n] if y[n] > mean - 5 * sd else np.nan
        final_list[n] = final_list[n] if final_list[n] < mean + 5 * sd else np.nan
    return final_list
px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
# px.line(y=y, x=x)
px.line(y=reject_outliers(y), x=x) # This is the line I wanted to get working - check!
More compact answer, sent via email by a friend:
In numpy you can select/index based on a Boolean array, and then make assignment with it:
def reject_outliers(y): # y is the data in a 1D numpy array
    n = 5 # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = y.copy()
    final_list[np.abs(y - mean) > n * sd] = np.nan
    return final_list
I also noticed that you didn’t use the value of n in your example code.
Alternatively, you can use the np.where function (https://numpy.org/doc/stable/reference/generated/numpy.where.html):
np.where(np.abs(y - mean) > n * sd, np.nan, y)
You don’t need the .copy() if you don’t mind modifying the input array.
Replace np.mean and np.std with np.nanmean and np.nanstd if you want the function to work on arrays that already contain NaNs, i.e. if you want to apply this function repeatedly to its own output.
The answer about using if else in a list comprehension would work, but avoiding the list comprehension makes the function much faster if the arrays are large.
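The NaN-aware variant suggested above could look roughly like the sketch below. The function name reject_outliers_nan, the threshold of 3 standard deviations and the sample data are assumptions made for this illustration only.

```python
import numpy as np

def reject_outliers_nan(y, n=3):
    """Return a float copy of y with values more than n std devs from the mean set to NaN."""
    y = np.asarray(y, dtype=float)
    # nanmean/nanstd ignore NaNs left over from an earlier pass.
    mean = np.nanmean(y)
    sd = np.nanstd(y)
    # Comparisons with NaN are False, so existing NaNs pass through unchanged.
    return np.where(np.abs(y - mean) > n * sd, np.nan, y)

# Hypothetical data: 30 well-behaved values, one huge spike, one smaller spike.
y = np.array([0.9, 1.1] * 15 + [1000.0, 10.0])
once = reject_outliers_nan(y)
twice = reject_outliers_nan(once)
print(np.isnan(once).sum(), np.isnan(twice).sum())
```

In this sample the huge spike inflates the standard deviation on the first pass and hides the smaller one, which is only caught on the second pass. That is the situation where applying the function to its own output is useful.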

Python - subtraction inside rolling window

I need to make subtractions inside red frames as [20-10,60-40,100-70]
that results in [10,20,30]
Current code makes subtractions but I don't know how to define red frames
seq = [10, 20, 40, 60, 70, 100]
window_size = 2
for i in range(len(seq) - window_size+1):
    x=seq[i: i + window_size]
    y=x[1]-x[0]
    print(y)
You can build a quick solution using the fact that seq[0::2] will give you every other element of seq starting at index zero. So you can subtract seq[0::2] from seq[1::2] element by element to get this result.
Without using any packages you could do:
seq = [10, 20, 40, 60, 70, 100]
sub_seq = [0]*(len(seq)//2)
for i in range(len(sub_seq)):
    sub_seq[i] = seq[1::2][i] - seq[0::2][i]
print(sub_seq)
Instead you could use Numpy. Using the numpy array object you can subtract the arrays rather than explicitly looping:
import numpy as np
seq = np.array([10, 20, 40, 60, 70, 100])
sub_seq = seq[1::2] - seq[0::2]
print(sub_seq)
Here's a solution using numpy which might be useful if you have to process large amounts of data in a short time. We select values based on whether their index is even (index % 2 == 0) or odd (index % 2 != 0).
import numpy as np
seq = [10, 20, 40, 60, 70, 100]
seq = np.array(seq)
index = np.arange(len(seq))
seq[index % 2 != 0] - seq[index % 2 == 0]
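Another way to look at the question's own loop: the windows overlap because the loop index advances by 1. A sketch that keeps the loop but steps by window_size, so the frames do not overlap, might be the following (the list name diffs is only illustrative):

```python
seq = [10, 20, 40, 60, 70, 100]
window_size = 2

diffs = []
# Step by window_size so each window starts where the previous one ended.
for i in range(0, len(seq) - window_size + 1, window_size):
    window = seq[i:i + window_size]
    # Last element of the frame minus the first one.
    diffs.append(window[-1] - window[0])

print(diffs)
```

This keeps window_size as a parameter, whereas the slicing solutions above are specific to frames of two elements.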

Vectorize step-wise function for column in pandas dataframe

I have a slightly complex function that assigns a quality level to given data by a pre-defined step-wise logic (dependent on fixed borders and also on relative borders based on the real value). The function 'get_quality()' below does this for each row and using pandas DataFrame.apply is quite slow for huge datasets. So I'd like to vectorize this calculation. Obviously I could do something like df.groupby(pd.cut(df.ground_truth, [-np.inf, 10.0, 20.0, 50.0, np.inf])) for the inner if-logic and then apply a similar sub-grouping within each group (based on the borders of each group), but how would I do that for the last bisect that depends on the given real/ground_truth value in each row?
Using df['quality'] = np.vectorize(get_quality)(df['measured'], df['ground_truth']) is a lot faster already, but is there a real vectorized way to calculate the same 'quality' column?
import pandas as pd
import numpy as np
from bisect import bisect
quality_levels = ['WayTooLow', 'TooLow', 'OK', 'TooHigh', 'WayTooHigh']
# Note: to make the vertical borders always lead towards the 'better' score we use a small epsilon around them
eps = 0.000001
def get_quality(measured_value, real_value):
    diff = measured_value - real_value
    if real_value <= 10.0:
        i = bisect([-4.0-eps, -2.0-eps, 2.0+eps, 4.0+eps], diff)
        return quality_levels[i]
    elif real_value <= 20.0:
        i = bisect([-14.0-eps, -6.0-eps, 6.0+eps, 14.0+eps], diff)
        return quality_levels[i]
    elif real_value <= 50.0:
        i = bisect([-45.0-eps, -20.0-eps, 20.0+eps, 45.0+eps], diff)
        return quality_levels[i]
    else:
        i = bisect([-0.5*real_value-eps, -0.25*real_value-eps,
                    0.25*real_value+eps, 0.5*real_value+eps], diff)
        return quality_levels[i]
N = 100000
df = pd.DataFrame({'ground_truth': np.random.randint(0, 100, N),
                   'measured': np.random.randint(0, 100, N)})
df['quality'] = df.apply(lambda row: get_quality((row['measured']), (row['ground_truth'])), axis=1)
print(df.head())
print(df.quality.value_counts())
#    ground_truth  measured     quality
# 0            51         1   WayTooLow
# 1             7        25  WayTooHigh
# 2            38        95  WayTooHigh
# 3            76        32   WayTooLow
# 4             0        18  WayTooHigh
# OK            30035
# WayTooHigh    24257
# WayTooLow     18998
# TooLow        14593
# TooHigh       12117
This is possible with np.select.
import numpy as np
quality_levels = ['WayTooLow', 'TooLow', 'OK', 'TooHigh', 'WayTooHigh']
eps = 0.000001  # same epsilon as in the question
def get_quality_vectorized(df):
    # Prepare the first 4 conditions, to match the 4 sets of boundaries.
    gt = df['ground_truth']
    conds = [gt <= 10, gt <= 20, gt <= 50, True]
    lo = np.select(conds, [2, 6, 20, 0.25 * gt])
    hi = np.select(conds, [4, 14, 45, 0.5 * gt])
    # Prepare the inner conditions, to match the 5 quality levels.
    diff = df['measured'] - df['ground_truth']
    quality_conds = [diff < -hi-eps, diff < -lo-eps, diff < lo+eps, diff < hi+eps]
    # The last level is passed as the default, so string choices are not mixed with the numeric default 0.
    df['quality'] = np.select(quality_conds, quality_levels[:-1], default=quality_levels[-1])
    return df
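To make the two-stage np.select idea easier to check by hand, here is a small self-contained sketch on four made-up rows, one per ground-truth band. The sample values and the variable names lo, hi and quality are chosen for this illustration, and the string default is a choice made here rather than something taken from the question.

```python
import numpy as np
import pandas as pd

quality_levels = ['WayTooLow', 'TooLow', 'OK', 'TooHigh', 'WayTooHigh']
eps = 0.000001

# Hypothetical rows: one per ground-truth band (<=10, <=20, <=50, >50).
df = pd.DataFrame({'ground_truth': [5, 15, 40, 80],
                   'measured': [6, 30, 10, 110]})
gt = df['ground_truth'].to_numpy()
diff = df['measured'].to_numpy() - gt

# Stage 1: pick the per-row tolerances; the default covers the relative band.
conds = [gt <= 10, gt <= 20, gt <= 50]
lo = np.select(conds, [2.0, 6.0, 20.0], default=0.25 * gt)
hi = np.select(conds, [4.0, 14.0, 45.0], default=0.5 * gt)

# Stage 2: the first matching condition wins, so the order mirrors bisect.
quality_conds = [diff < -hi - eps, diff < -lo - eps, diff < lo + eps, diff < hi + eps]
quality = np.select(quality_conds, quality_levels[:-1], default=quality_levels[-1])
print(quality.tolist())
```

For the last row the tolerances are 0.25 * 80 = 20 and 0.5 * 80 = 40, so a difference of 30 lands in 'TooHigh'. That row is the one that depends on the ground-truth value itself.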

Expand numbers in a list

I have a list of numbers:
[10,20,30]
What I need is to expand it according to a predefined increment. Thus, let's call x the increment and x=2, my result should be:
[10,12,14,16,18,20,22,24,.....,38]
Right now I am using a for loop, but it is very slow and I am wondering if there is a faster way.
EDIT:
newA = []
for n in array:
    newA = newA + generateNewNumbers(n, p, t)
The function generateNewNumbers simply generates the new numbers to add to the list.
EDIT2:
To better define the problem the first array contains some timestamps:
[10,20,30]
I have two parameters: one is the sampling rate and the other is the sampling time. What I need is to expand the array by adding the correct number of timestamps between two existing timestamps, according to the sampling rate.
For example, if I have a sampling rate 3 and a sampling time 3 the result should be:
[10,13,16,19,20,23,26,29,30,33,36,39]
You can add the same set of increments to each time stamp using np.add.outer and then flatten the result using ravel.
import numpy as np
a = [10,20,35]
inc = 3
ninc = 4
np.add.outer(a, inc * np.arange(ninc)).ravel()
# array([10, 13, 16, 19, 20, 23, 26, 29, 35, 38, 41, 44])
You can use list comprehensions, but I'm not sure I understand the stopping condition for including the last point:
a = [10, 20, 30, 40]
t = 3
sum([[x for x in range(y, z, t)] for y, z in zip(a[:-1], a[1:])], []) + [a[-1]]
will give
[10, 13, 16, 19, 20, 23, 26, 29, 30, 33, 36, 39, 40]
Using range and itertools.chain
l = [10,20,30]
x = 3
from itertools import chain
list(chain(*[range(i,i+10,x) for i in l]))
#Output:
#[10, 13, 16, 19, 20, 23, 26, 29, 30, 33, 36, 39]
There are a bunch of good answers here already, but I would advise numpy and linear interpolation.
# Now, this will give you the desired result with your first specifications
# And in pure Python too
t = [10, 20, 30]
increment = 2
last = int(round(t[-1]+((t[-1]-t[-2])/float(increment))-1)) # Value of last number in array
# Note if you insist on mathematically "incorrect" endpoint, do:
#last = ((t[-1]+(t[-1]-t[-2])) -((t[-1]-t[-2])/float(increment)))+1
newt = list(range(t[0], last+1, increment))
# And, of course, this may skip entered values (e.g. with increment = 3)
# But what you should do instead, when you use the samplerate is
# to use linear interpolation
# If you resample the original signal,
# Then you resample the time too
# And don't expand over the existing time
# Because the time doesn't change if you resampled the original properly
# You only get more or less samples at different time points
# But it lasts the same length of time.
# If you do what you originally meant, you actually shift your datapoints in time
# Which is wrong.
import numpy
t = [10, 20, 30, 40, 50, 60]
oldfs = 4000 # 4 KHz samplerate
newfs = 8000 # 8 KHz sample rate (2 times bigger signal and its time axis)
ratio = max(oldfs*1.0, newfs*1.0)/min(newfs, oldfs)
newlen = round(len(t)*ratio)
numpy.interp(
    numpy.linspace(0.0, 1.0, newlen),
    numpy.linspace(0.0, 1.0, len(t)),
    t)
This code can resample your original signal too (if you have one). If you just want to cram some more time points in between, you can also use interpolation. Again, don't extend past the existing time range. The code below does extend past it, but only to stay compatible with the first snippet and to give you ideas of what you can do.
t = [10, 20, 30]
increment = 2
last = t[-1]+((t[-1]-t[-2])/float(increment))-1 # Value of last number in array
t.append(last)
# How many samples we will get in the end (numpy.linspace needs an integer count)
newlen = int(round((t[-1]-t[0])/float(increment)+1))
ratio = newlen / len(t)
numpy.interp(
    numpy.linspace(0.0, 1.0, newlen),
    numpy.linspace(0.0, 1.0, len(t)),
    t)
This results in an increment of 2.5 instead of 2, but that can be corrected. The point is that this approach works on floating-point time points as well as on integers, and it is fast. It will slow down with a very large number of points, but until then it works pretty fast.

how to write symbol for sum over a variable's subscript in sympy

I want to write a sympy symbol for a summation, but the index summed over also appears as the subscript of a variable name in the summand. For example,
import numpy as np
import sympy
sympy.init_printing()
r = sympy.Symbol('r')
a = sympy.Matrix(sympy.symbols('a:4'))
rpowers = sympy.Matrix([r**i for i in range(len(a))])
long_expr = a.dot(rpowers)
n = sympy.Symbol('n')
a_n = sympy.Symbol('a_n')
short_expr = sympy.Sum(a_n * r**n, (n, 0, 3))
long_expr and short_expr denote the same thing mathematically. But with long_expr, I can substitute in the values for the a's and then lambdify that expression into a numpy function:
coeffed_long_expr = long_expr.subs(zip(a, [-1, 3, 23, 8]))
func_long_expr = sympy.lambdify([r], coeffed_long_expr, 'numpy')
How can I do the same with short_expr? Or is short_expr only useful for displaying the expression with a summation sign in this case? I would like to be able to display using the summation sign, especially for large ns.
You can accomplish this by using sympy.Function:
import sympy
a_seq = [-1, 3, 23, 8]
n, r = sympy.symbols('n, r')
a_n = sympy.Function('a')(n)
terms = 4
short_expr = sympy.Sum(a_n * r**n, (n, 0, terms - 1))
coeffed_short_expr = short_expr.doit().subs(
    (a_n.subs(n, i), a_seq[i]) for i in range(terms)) # 8*r**3 + 23*r**2 + 3*r - 1
func_short_expr = sympy.lambdify(r, coeffed_short_expr, 'numpy')
If you wish for a cleaner, more efficient implementation, I suspect you may be able to define a subclass of sympy.Symbol that implements subs() properly for summations.
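One possible alternative, which is not what the answer above uses, is sympy.IndexedBase. It gives a symbol whose subscript can itself be the summation index. The sketch below reuses the coefficient values from the question, and the names a, short_expr, expanded and func are chosen here for illustration.

```python
import sympy

a_seq = [-1, 3, 23, 8]
n, r = sympy.symbols('n r')
a = sympy.IndexedBase('a')  # a[n] displays as a with subscript n

terms = len(a_seq)
# Compact form: displays with a summation sign.
short_expr = sympy.Sum(a[n] * r**n, (n, 0, terms - 1))

# Expand the finite sum, then substitute each indexed coefficient.
expanded = short_expr.doit()
coeffed = expanded.subs({a[i]: a_seq[i] for i in range(terms)})

func = sympy.lambdify(r, coeffed, 'numpy')
print(coeffed)
print(func(2))
```

The unevaluated short_expr stays available for display, while the expanded and substituted expression is what gets turned into a numeric function.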
