I am trying to formulate the following functions in Python, and I want to plot them:
import matplotlib.pyplot as plt
import numpy as np

ss = np.linspace(300, 1000, 15)

def PT3000(ss):
    if ss < 318.842719019854:
        PT3 = 4.602 + 37440.0/ss
    else:
        PT3 = -0.3 + 3600.0/ss
    return PT3

def PT2000(ss):
    if ss < 318.842719019854:
        PT2 = 4.602 + 37440.0/ss
    elif ss > 945.33959:
        PT2 = -0.3 + 3600.0/ss
    else:
        PT2 = 6.87109574235995e-6*ss**0.5*(-1 + 96000.0/ss) + 62.144
    return PT2

fig = plt.figure()
plt.plot(ss, PT2000(ss))
plt.plot(ss, PT3000(ss))
plt.title('Productietijd [24x12]')
plt.xlabel('Verstijverafstand [mm]')
plt.ylabel('Productijd van een paneel [uur]')
plt.grid(visible=True)
plt.legend()
plt.show()
I run into an error, but I don't understand what to do with it:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
You intend PT2000 and PT3000 to be applied to each element of ss, but the call passes the entire array at once, and NumPy handles it array-wise.
Comparing a NumPy array with a scalar returns an array, so an expression like ss < 318.842719019854 produces an array of Boolean values. The if statement therefore sees something like this:
if np.array([False, True, False, ...]):
# do stuff
...
Hence "the truth value of an array with more than one element is ambiguous": the array can contain both True and False values, so there is no single obvious answer.
The solution that's usually suggested is to "use a.any() or a.all()", which check whether any or all elements of the array are True. This is not what you need here, since piecewise functions like PT2000 and PT3000 act on individual numbers, not entire arrays.
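To see what any() and all() actually do here, consider a small Boolean array (a quick illustration, not part of the original code):

```python
import numpy as np

# comparing an array with a scalar yields a Boolean array, one value per element
mask = np.array([300.0, 500.0, 1000.0]) < 318.842719019854
print(mask)        # [ True False False]
print(mask.any())  # True: at least one element is True
print(mask.all())  # False: not every element is True
```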
You can use numpy.vectorize to apply your functions elementwise:
PT2000_vectorized = np.vectorize(PT2000)
PT3000_vectorized = np.vectorize(PT3000)

plt.plot(ss, PT2000_vectorized(ss))
plt.plot(ss, PT3000_vectorized(ss))
This will iterate over ss and pass its individual elements to the function, so comparisons will simply involve floats.
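If you would rather avoid calling the Python-level function once per element (which is what np.vectorize does internally), np.where evaluates both branch expressions on the whole array and selects elementwise. A sketch of PT3000 rewritten this way (my addition, not from the original answer):

```python
import numpy as np

ss = np.linspace(300, 1000, 15)

# np.where takes the first expression where the condition is True
# and the second where it is False, elementwise
PT3 = np.where(ss < 318.842719019854, 4.602 + 37440.0/ss, -0.3 + 3600.0/ss)
```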
import matplotlib.pyplot as plt
import numpy as np

ss = np.linspace(300, 1000, 15)

def PT3000(ss):
    if ss < 318.842719019854:
        PT3 = 4.602 + 37440.0/ss
    else:
        PT3 = -0.3 + 3600.0/ss
    return PT3

def PT2000(ss):
    if ss < 318.842719019854:
        PT2 = 4.602 + 37440.0/ss
    elif ss > 945.33959:
        PT2 = -0.3 + 3600.0/ss
    else:
        PT2 = 6.87109574235995e-6*ss**0.5*(-1 + 96000.0/ss) + 62.144
    return PT2

# wrap the scalar functions so they are applied elementwise
PT2000_vectorized = np.vectorize(PT2000)
PT3000_vectorized = np.vectorize(PT3000)

fig = plt.figure()
plt.plot(ss, PT2000_vectorized(ss), label='PT2000')
plt.plot(ss, PT3000_vectorized(ss), label='PT3000')
plt.title('Productietijd [24x12]')
plt.xlabel('Verstijverafstand [mm]')
plt.ylabel('Productijd van een paneel [uur]')
plt.grid(visible=True)
plt.legend(loc='upper left')
plt.show()
I have an array of magnetometer data with artifacts every two hours due to power cycling.
I'd like to replace those indices with NaN so that the length of the array is preserved.
Here's a code example, adapted from https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html.
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime
# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']

def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = [x for x in y if (x > mean - 2 * sd)]
    final_list = [x for x in final_list if (x < mean + 2 * sd)]
    return final_list
px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
px.line(y=y, x=x)
# px.scatter(y) # It looks like the outliers are successfully dropped.
# px.line(y=reject_outliers(y), x=x) # This is the line I'd like to see work.
When I run px.scatter(reject_outliers(y)), it looks like the outliers are successfully getting dropped:
...but that plot shows the culled y vector against its index, rather than against the datetime vector x as in the plot above. As the debugging output shows, the vector is shortened because the outlier values are dropped rather than replaced.
How can I edit my reject_outliers() function to assign those values to NaN, or to adjacent values, so that the length of the array stays the same and I can plot my data?
Use else in the list comprehension along the lines of:
[x if x_condition else other_value for x in y]
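Applied to the outlier problem, that pattern might look like this (a sketch with made-up numbers; y and the n-sigma threshold are stand-ins for your data):

```python
import numpy as np

y = [1.0, 50.0, 2.0, 3.0]  # made-up data with one obvious outlier
n = 1                      # threshold in standard deviations
mean = np.mean(y)
sd = np.std(y)

# keep each value, but replace outliers with NaN so the length is preserved
cleaned = [x if abs(x - mean) <= n * sd else np.nan for x in y]
print(len(cleaned) == len(y))  # True
```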
Got a less compact version to work. Full code:
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime
# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']

def reject_outliers(y):  # y is the data in a 1D numpy array
    mean = np.mean(y)
    sd = np.std(y)
    final_list = np.copy(y)
    for n in range(len(y)):
        final_list[n] = y[n] if y[n] > mean - 5 * sd else np.nan
        final_list[n] = final_list[n] if final_list[n] < mean + 5 * sd else np.nan
    return final_list
px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
# px.line(y=y, x=x)
px.line(y=reject_outliers(y), x=x) # This is the line I wanted to get working - check!
More compact answer, sent via email by a friend:
In NumPy you can index based on a Boolean array and assign to the selection:
def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = y.copy()
    final_list[np.abs(y - mean) > n * sd] = np.nan
    return final_list
I also noticed that you didn’t use the value of n in your example code.
Alternatively, you can use the where method (https://numpy.org/doc/stable/reference/generated/numpy.where.html)
np.where(np.abs(y - mean) > n * sd, np.nan, y)
You don’t need the .copy() if you don’t mind modifying the input array.
Replace np.mean and np.std with np.nanmean and np.nanstd if you want the function to work on arrays that already contain nans, i.e. if you want to use this function recursively.
The answer about using if else in a list comprehension would work, but avoiding the list comprehension makes the function much faster if the arrays are large.
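The nanmean/nanstd point can be seen in a couple of lines (a quick illustration):

```python
import numpy as np

y = np.array([1.0, np.nan, 3.0])

# the plain statistics propagate NaN; the nan-aware ones ignore it
print(np.mean(y))     # nan
print(np.nanmean(y))  # 2.0
print(np.nanstd(y))   # 1.0
```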
To make things clearer: I don't want to remove an entire bin from the histogram, I just want to discard enough of the data that each bin is brought below a desired frequency. The line in the image shows the maximum frequency I would like.
For context, I have a dataset containing a number of angles. My question is very similar to the one asked here, Remove data above threshold in histogram, in terms of the data used, but unlike that question I don't wish to get rid of the data entirely, just reduce it.
Can I do this directly from the histogram or will I need to just delete some of the data in the dataset?
Edit (sorry, I am new to coding and formatting here): here is a solution I tried:
bns = 30
hist, bins = np.histogram(dataset['Steering'], bins=bns)
removeddata = []
spb = 700
for j in range(bns):
    rdata = []
    for i in range(len(dataset['Steering'])):
        if dataset['Steering'][i] >= bins[j] and dataset['Steering'][i] <= bins[j+1]:
            rdata.append(i)
    rdata = shuffle(rdata)
    rdata = rdata[spb:]
    removeddata.extend(rdata)
print('removed:', len(removeddata))
dataset.drop(dataset.index[removeddata], inplace=True)
print('remaining:', len(dataset))
center = (bins[:-1] + bins[1:])*0.5
plt.bar(center, hist, width=0.05)
plt.show()
This is somebody else's solution, but it seemed to work for them. Even when copying it directly, it still throws errors. The error I got was "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()". I tried changing and to &, and got "TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]". I am unsure what exactly this refers to, but it points to the line with the if statement. I checked the dtype of everything and they are all float64, so I am unsure of my next step.
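For what it's worth, the TypeError after switching to & typically comes from operator precedence: & binds more tightly than the comparisons, so each comparison needs its own parentheses. A minimal sketch with a plain NumPy array (the same rule applies to a pandas Series):

```python
import numpy as np

steering = np.array([0.1, 0.5, 0.9])

# without parentheses, & would be applied before the comparisons,
# which fails on floats; with them, this is an elementwise range test
mask = (steering >= 0.2) & (steering <= 0.8)
print(mask)  # [False  True False]
```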
This solution takes into account the clarified requirement that the original input data that exceeds the frequency threshold be dropped. I left my other answer because it is simpler and different enough that it may be useful to another user.
To clarify, this answer produces a new 1D array of data with fewer elements and then plots a histogram from that new data. The data are shuffled before the elements are removed (in case the input data were pre-sorted) in order to prevent bias in dropping data from either the low or high side of each bin.
import numpy as np
import matplotlib.pyplot as plt
from random import shuffle
def remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst):
    if to_gate_lst[idx] == 0:
        return data_lst
    else:
        bin_min, bin_max = bins_lst[idx], bins_lst[idx + 1]
        for i in range(len(data_lst)):
            if bin_min <= data_lst[i] < bin_max:
                del data_lst[i]
                to_gate_lst[idx] -= 1
                break
        return remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst)
threshold = 80
fig, ax1 = plt.subplots()
ax1.set_title("Some data")
np.random.seed(30)
data = np.random.randn(1000)
num_bins = 23
raw_hist, raw_bins = np.histogram(data, num_bins)
to_gate = []
for i in range(len(raw_hist)):
    if raw_hist[i] > threshold:
        to_gate.append(raw_hist[i] - threshold)
    else:
        to_gate.append(0)

data_lst = list(data)
shuffle(data_lst)
for idx in range(len(raw_hist)):
    remove_gated_val_recursive(idx, to_gate, raw_bins, data_lst)
new_data = np.array(data_lst)
hist, bins = np.histogram(new_data, num_bins)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) * 0.5
ax1.bar(center, hist, align='center', width=width)
plt.show()
gives the following histogram, plotted from the new_data array.
This answer doesn't re-bin or re-center the data, but I believe it generally achieves what you're asking. Working from the example in the chosen answer of the post you linked, I edit the hist array so that the original input data is not changed as you indicated is your preferred solution:
import numpy as np
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.set_title("Some data")
ax2.set_title("Gated data < threshold")
np.random.seed(10)
data = np.random.randn(1000)
num_bins = 23
avg_samples_per_bin = 200
hist, bins = np.histogram(data, num_bins)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) * 0.5
ax1.bar(center, hist, align='center', width=width)
threshold = 80
gated = np.empty([len(hist)], dtype=np.int64)
for i in range(len(hist)):
    if hist[i] > threshold:
        gated[i] = threshold
    else:
        gated[i] = hist[i]
ax2.bar(center, gated, align="center", width=width)
plt.show()
which gives
I am trying to implement a knn 1D estimate:
# nearest neighbors estimate
def nearest_n(x, k, data):
    # Order dataset
    # data = np.sort(data, kind='mergesort')
    nnb = []
    # iterate over all data and keep the k nearest neighbours around x
    for n in data:
        if len(nnb) < k:
            nnb.append(n)
        else:
            for nb in np.arange(0, k):
                if np.abs(x - n) < np.abs(x - nnb[nb]):
                    nnb[nb] = n
                    break
    nnb = np.array(nnb)
    # get volume (distance) v of the k nearest neighbours around x
    v = nnb.max() - nnb.min()
    v = k/(len(data)*v)
    return v

interval = np.arange(-4.0, 8.0, 0.1)
plt.figure()
for k in (2, 8, 35):
    plt.plot(interval, nearest_n(interval, k, train_data), label=str(k))
plt.legend()
plt.show()
Which throws:
File "x", line 55, in nearest_n
if np.abs(x-n) < np.abs(x-nnb[nb]):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I know the error comes from the array input in plot(), but I am not sure how to avoid this in a function with operators >/==/<
'data' comes from a 1D txt file containing floats.
I tried using vectorize:
nearest_n = np.vectorize(nearest_n)
which results in:
line 50, in nearest_n
for n in data:
TypeError: 'numpy.float64' object is not iterable
Here is an example. Let's say:

data = [0.5, 1.7, 2.3, 1.2, 0.2, 2.2]
k = 2

nearest_n(1.5, 2, data) should then lead to

nnb = [1.2, 1.7]
v = 0.5

and return 2/(6*0.5) = 2/3.
The function itself runs fine: for example, nearest_n(2.0, 4, data) gives 0.0741586011463.
You're passing in np.arange(-4.0, 8.0, 0.1) as your x, which is an array of 120 values. Subtracting a scalar from an array is elementwise, so x - n is an array of the same length as x; the same goes for x - nnb[nb]. The comparison therefore produces a 120-element Boolean array that says, element by element, whether np.abs(x - n) is less than np.abs(x - nnb[nb]). Such an array can't be used directly as a condition; you would need to reduce it to a single Boolean (using all(), any(), or by rethinking your code).
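The shapes involved can be checked in a few lines (illustrative values only):

```python
import numpy as np

x = np.arange(-4.0, 8.0, 0.1)  # 120 elements
diff = np.abs(x - 1.5)         # subtracting a scalar is elementwise
cmp = diff < np.abs(x - 0.3)   # so the comparison is elementwise too
print(cmp.shape)  # (120,) -- a Boolean array, not a single bool
```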
plt.figure()
X = np.arange(-4.0, 8.0, 0.1)
for k in [2, 8, 35]:
    Y = []
    for n in X:
        Y.append(nearest_n(n, k, train_data))
    plt.plot(X, Y, label=str(k))
plt.legend()
plt.show()
is working fine. I thought pyplot.plot would do this exact thing for me already, but I guess it does not...
I have one 3D array, i.e. param:
param.shape = (20, 50, 50)
I want to mask its first axis outside of one interval, i.e. two 2D arrays, bot and top:
bot.shape = (50, 50)
top.shape = (50, 50)
What I have tried is:
bot_n = np.broadcast_to(bot, param.shape)
top_n = np.broadcast_to(top, param.shape)
output = np.ma.masked_outside(param, bot_n, top_n)
But I got the following error:
if v2 < v1:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
In fact, I want to extract the values of param that lie between bot and top.
You could construct the mask yourself; True entries mark the values to hide, i.e. those outside the interval:
output = np.ma.array(param, mask=(param < bot_n) | (param > top_n))
The code for masked_outside is quite simple:
if v2 < v1:
    (v1, v2) = (v2, v1)
xf = filled(x)
condition = (xf < v1) | (xf > v2)
return masked_where(condition, x, copy=copy)
The condition expression should work with your array bot_n, but the if v2 < v1 test only works with scalar limits. The function author was thinking of a simple interval like [3, 9], not your more general 2D one.
So, yes, write your own mask.
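Putting that together, a runnable sketch with small synthetic arrays (the shapes mimic the question; the numbers are made up):

```python
import numpy as np

param = np.arange(20*50*50, dtype=float).reshape(20, 50, 50)
bot = np.full((50, 50), 100.0)    # lower bound per (row, col)
top = np.full((50, 50), 40000.0)  # upper bound per (row, col)

# broadcasting compares each (50, 50) slice of param against bot/top elementwise,
# so the 2D bounds don't even need an explicit broadcast_to
mask = (param < bot) | (param > top)
output = np.ma.array(param, mask=mask)
print(output.count())  # number of unmasked (in-interval) values
```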
Hi, I have a list with x values and a list with y values, and I want to plot them using matplotlib's plt.plot(x, y).
Some of my y values are 0 or 'empty'. How can I make matplotlib skip the 0 or empty points and only connect the rest? Do I have to split the data into different lists?
Thanks in advance!
If you use numpy.nan instead of 0 or 'empty', the line gets disconnected.
See:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 20)
y = np.sin(x)
y[3] = np.nan
y[7] = np.nan
plt.plot(x, y)
plt.show()
Use np.where to set the data not to be plotted to np.nan.
import matplotlib.pyplot as plt
import numpy as np

a = np.linspace(1, 50, 1000)
b = np.sin(a)
# in this example, we plot only the values larger than -0.7
c = np.where(b > -0.7, b, np.nan)
# to skip zeros instead: c = np.where(b != 0, b, np.nan)
plt.plot(c)
plt.show()