I have periodic data with the index being a floating point number like so:
from pandas import DataFrame

time = [0, 0.1, 0.21, 0.31, 0.40, 0.49, 0.51, 0.6, 0.71, 0.82, 0.93]
voltage = [1, -1, 1.1, -0.9, 1, -1, 0.9, -1.2, 0.95, -1.1, 1.11]
df = DataFrame(data=voltage, index=time, columns=['voltage'])
df.plot(marker='o')
I want to create a cross(df, y_val, direction='rise' | 'fall' | 'cross') function that returns an array of times (indexes) with all the interpolated points where the voltage values equal y_val. For 'rise', only crossings where the slope is positive are returned; for 'fall', only crossings with a negative slope are returned; for 'cross', both are returned. So if y_val=0 and direction='cross', an array with 10 values would be returned with the x values of the crossing points (the first one being 0.05, by linear interpolation between the first two samples).
I was thinking this could be done with an iterator but was wondering if there was a better way to do this.
Thanks. I'm loving Pandas and the Pandas community.
To do this I ended up with the following. It is a vectorized version which is 150x faster than one that uses a loop.
import numpy as np

def cross(series, cross=0, direction='cross'):
    """
    Given a Series, return all the index values where the data values equal
    the 'cross' value.

    Direction can be 'rising' (for only rising edges), 'falling' (for only
    falling edges), or 'cross' for both edges.
    """
    # Find whether values are above or below the crossing value:
    above = series.values > cross
    below = np.logical_not(above)
    left_shifted_above = above[1:]
    left_shifted_below = below[1:]
    # Find indexes on the left side of each crossing point:
    if direction == 'rising':
        idxs = (left_shifted_above & below[:-1]).nonzero()[0]
    elif direction == 'falling':
        idxs = (left_shifted_below & above[:-1]).nonzero()[0]
    else:
        rising = left_shifted_above & below[:-1]
        falling = left_shifted_below & above[:-1]
        idxs = (rising | falling).nonzero()[0]
    # Calculate x crossings with linear interpolation using the formula for a line:
    x1 = series.index.values[idxs]
    x2 = series.index.values[idxs + 1]
    y1 = series.values[idxs]
    y2 = series.values[idxs + 1]
    x_crossings = (cross - y1) * (x2 - x1) / (y2 - y1) + x1
    return x_crossings
# Test it out:
import matplotlib.pyplot as plt
from pandas import DataFrame

time = [0, 0.1, 0.21, 0.31, 0.40, 0.49, 0.51, 0.6, 0.71, 0.82, 0.93]
voltage = [1, -1, 1.1, -0.9, 1, -1, 0.9, -1.2, 0.95, -1.1, 1.11]
df = DataFrame(data=voltage, index=time, columns=['voltage'])
x_crossings = cross(df['voltage'])
y_crossings = np.zeros(x_crossings.shape)
plt.plot(time, voltage, '-ob', x_crossings, y_crossings, 'or')
plt.grid(True)
It was quite satisfying when this worked. Any improvements that can be made?
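One possible simplification I can think of (a sketch only, not benchmarked; it relies on np.diff of the boolean mask, where +1 marks a rising edge and -1 a falling edge):

import numpy as np

def cross_diff(series, cross=0, direction='cross'):
    # Same contract as cross() above; np.diff on the boolean mask
    # yields +1 where the signal rises through 'cross' and -1 where it falls.
    above = series.values > cross
    edges = np.diff(above.astype(np.int8))
    if direction == 'rising':
        idxs = (edges == 1).nonzero()[0]
    elif direction == 'falling':
        idxs = (edges == -1).nonzero()[0]
    else:
        idxs = edges.nonzero()[0]
    x1 = series.index.values[idxs]
    x2 = series.index.values[idxs + 1]
    y1 = series.values[idxs]
    y2 = series.values[idxs + 1]
    return (cross - y1) * (x2 - x1) / (y2 - y1) + x1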
I have several lists that can only contain the following values: 0, 0.5, 1, 1.5
I want to efficiently convert each of these lists into probability mass functions. So if a list is as follows: [0.5, 0.5, 1, 1.5], the PMF will look like this: [0, 0.5, 0.25, 0.25].
I need to do this many times (and with very large lists), so avoiding explicit loops would be ideal, if at all possible. What's the most efficient way to make this happen?
Edit: Here's my current system. This feels like a really inefficient, inelegant way to do it:
import numpy as np
from collections import Counter

def get_distribution(samplemodes1):
    # Note: this histogram block is computed but never used in the returned result
    n, bin_edges = np.histogram(samplemodes1, bins=9)
    totalcount = np.sum(n)
    bin_probability = n / totalcount
    bins_per_point = np.fmin(np.digitize(samplemodes1, bin_edges), len(bin_edges) - 1)
    probability_perpoint = [bin_probability[bins_per_point[i] - 1] for i in range(len(samplemodes1))]

    counts = Counter(samplemodes1)
    total = sum(counts.values())
    probability_mass = {k: v / total for k, v in counts.items()}
    #print(probability_mass)

    key_values = {}
    if 0 in probability_mass:
        key_values[0] = probability_mass.get(0)
    else:
        key_values[0] = 0
    if 0.5 in probability_mass:
        key_values[0.5] = probability_mass.get(0.5)
    else:
        key_values[0.5] = 0
    if 1 in probability_mass:
        key_values[1] = probability_mass.get(1)
    else:
        key_values[1] = 0
    if 1.5 in probability_mass:
        key_values[1.5] = probability_mass.get(1.5)
    else:
        key_values[1.5] = 0

    distribution = list(key_values.values())
    return distribution
Here are some solutions for you to benchmark:
Using collections.Counter
from collections import Counter

bins = [0, 0.5, 1, 1.5]
a = [0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5]
denom = len(a)
counts = Counter(a)
pmf = [counts[b] / denom for b in bins]
NumPy-based solution

import numpy as np

bins = [0, 0.5, 1, 1.5]
a = np.array([0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5])
denom = len(a)
pmf = [(a == b).sum() / denom for b in bins]
but you can probably do better by using np.bincount() instead.
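For example, a minimal np.bincount sketch, assuming the values are limited to 0, 0.5, 1, 1.5 so that doubling them yields the integer codes 0-3:

import numpy as np

a = np.array([0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5])
codes = (a * 2).astype(int)               # 0 -> 0, 0.5 -> 1, 1 -> 2, 1.5 -> 3
counts = np.bincount(codes, minlength=4)  # minlength pads bins that never occur
pmf = counts / len(a)                     # array([0., 0.571..., 0.285..., 0.142...])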
Further reading on this idea: https://thispointer.com/count-occurrences-of-a-value-in-numpy-array-in-python/
I have the following dataframe with weights:
import pandas as pd

df = pd.DataFrame({'a': [0.1, 0.5, 0.1, 0.3], 'b': [0.2, 0.4, 0.2, 0.2], 'c': [0.3, 0.2, 0.4, 0.1],
                   'd': [0.1, 0.1, 0.1, 0.7], 'e': [0.2, 0.1, 0.3, 0.4], 'f': [0.7, 0.1, 0.1, 0.1]})
and then I normalize each row using:
df = df.div(df.sum(axis=1), axis=0)
I want to optimize the normalized weights of each row such that no weight is less than 0 or greater than 0.4.
If a weight is greater than 0.4, it should be clipped to 0.4 and the excess weight redistributed to the other entries in a pro-rata fashion (meaning the second-largest weight receives more weight so it gets close to 0.4, and if there is still remaining weight, it goes to the third-largest, and so on).
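For example, take a row [0.5, 0.3, 0.2]: 0.5 is clipped to 0.4, and the excess 0.1 is split pro-rata over the remaining weights 0.3 and 0.2, giving [0.4, 0.36, 0.24].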
Can this be done using the "optimize" function?
Thank you.
UPDATE: I would also like to set a minimum bound for the weights. In my original question, the minimum weight bound was implicitly zero; I would now like to add a constraint such that each weight is at least equal to 0.05, for example.
Unfortunately, I can only find a loop solution to this problem. When you trim off the excess weight and redistribute it proportionally, entries that were previously within bounds may be pushed over the limit. Those then have to be trimmed as well, and the cycle repeats until no value is overweight. The same goes for underweight entries.
import numpy as np
import pandas as pd

# The original data frame. No normalization yet
df = pd.DataFrame(
    {
        "a": [0.1, 0.5, 0.1, 0.3],
        "b": [0.2, 0.4, 0.2, 0.2],
        "c": [0.3, 0.2, 0.4, 0.1],
        "d": [0.1, 0.1, 0.1, 0.7],
        "e": [0.2, 0.1, 0.3, 0.4],
        "f": [0.7, 0.1, 0.1, 0.1],
    }
)

def ensure_min_weight(row: np.ndarray, min_weight: float):
    while True:
        underweight = row < min_weight
        if not underweight.any():
            break
        # Weight needed to lift every underweight entry up to the minimum
        missing_weight = min_weight * underweight.sum() - row[underweight].sum()
        # Take it from the remaining entries, proportionally to their size
        row[~underweight] -= missing_weight / row[~underweight].sum() * row[~underweight]
        row[underweight] = min_weight

def ensure_max_weight(row: np.ndarray, max_weight: float):
    while True:
        overweight = row > max_weight
        if not overweight.any():
            break
        # Excess weight above the cap, summed over all overweight entries
        excess_weight = row[overweight].sum() - max_weight * overweight.sum()
        # Redistribute it to the remaining entries, proportionally to their size
        row[~overweight] += excess_weight / row[~overweight].sum() * row[~overweight]
        row[overweight] = max_weight

values = df.to_numpy()
normalized = values / values.sum(axis=1)[:, None]

min_weight = 0.15  # just for fun
max_weight = 0.4

for i in range(len(values)):
    row = normalized[i]
    ensure_min_weight(row, min_weight)
    ensure_max_weight(row, max_weight)

# Normalized weights
assert np.isclose(normalized.sum(axis=1), 1).all(), "Normalized weights must sum up to 1"
assert ((min_weight <= normalized) & (normalized <= max_weight)).all(), \
    f"Normalized weights must be between {min_weight} and {max_weight}"
print(pd.DataFrame(normalized, columns=df.columns))

# Raw values
# values = normalized * values.sum(axis=1)[:, None]
# print(pd.DataFrame(values, columns=df.columns))
Note that this algorithm will run into an infinite loop if your min_weight and max_weight are infeasible: try min_weight = 0.4 and max_weight = 0.5 (six weights of at least 0.4 each can never sum to 1). You should handle these errors in the two ensure functions.
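As a sketch of such a guard (assuming, as above, that each row's weights must sum to 1, so n weights bounded by [min_weight, max_weight] are feasible only if n*min_weight <= 1 <= n*max_weight):

def check_bounds_feasible(n_weights: int, min_weight: float, max_weight: float):
    # n weights summing to 1 can exist only if n*min <= 1 <= n*max
    if n_weights * min_weight > 1 or n_weights * max_weight < 1:
        raise ValueError(
            f"{n_weights} weights in [{min_weight}, {max_weight}] cannot sum to 1"
        )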
Does anybody have an idea how to get the elements in a list whose values fall within a specific (from - to) range?
I need a loop to check if a list contains elements in a specific range, and if there are any, I need the biggest one to be saved in a variable.
Example:
list = [0.5, 0.56, 0.34, 0.45, 0.53, 0.6]
# range (0.5 - 0.58)
# biggest = 0.56
You could use a filtered comprehension to get only those elements in the range you want, then find the biggest of them using the built-in max():
lst = [0.5, 0.56, 0.34, 0.45, 0.53, 0.6]
biggest = max(e for e in lst if 0.5 < e < 0.58)
# biggest = 0.56
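One caveat: max() raises ValueError when nothing falls in the range. If that can happen, a default can be supplied (Python 3.4+):

biggest = max((e for e in lst if 0.5 < e < 0.58), default=None)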
As an alternative to other answers, you can also use filter and lambda:

lst = [0.5, 0.56, 0.34, 0.45, 0.53, 0.6]
biggest = max(filter(lambda x: 0.5 < x < 0.58, lst))

I suppose a plain if check would be faster, but I'll include this for completeness.
Also, you should not use list = ... as list is a built-in in Python.
You could also go about it a step at a time, as the approach may aid in debugging.
I used numpy in this case, which is also a helpful tool to put in your tool belt.
This should run as is:
import numpy as np

l = [0.5, 0.56, 0.34, 0.45, 0.53, 0.6]
a = np.array(l)
low = 0.5
high = 0.58
# Boolean mask: True where the value is below the upper bound
index_low = (a < high)
print(index_low)
a_low = a[index_low]
print(a_low)
# Of those, keep the values at or above the lower bound
index_in_range = (a_low >= low)
print(index_in_range)
a_in_range = a_low[index_in_range]
print(a_in_range)
a_max = a_in_range.max()
print(a_max)
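For reference, the two masks can also be combined in a single step (like max() on an empty list, this raises ValueError when nothing is in range):

a_max = a[(a >= low) & (a < high)].max()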
I have a numpy array, let's say one with 4 rows and 6 (always an even number) columns:

import numpy as np

m = np.round(np.random.rand(4, 6), 2)
array([[ 0.99, 0.48, 0.05, 0.26, 0.92, 0.44],
       [ 0.81, 0.54, 0.19, 0.38, 0.5 , 0.02],
       [ 0.11, 0.96, 0.04, 0.69, 0.78, 0.31],
       [ 0.5 , 0.53, 0.94, 0.77, 0.6 , 0.75]])
I now want to plot graphs according to the column pairs, in this case
Graph 1: x-values=m[:,1] and y-values=m[:,0]
Graph 2: x-values=m[:,3] and y-values=m[:,2]
Graph 3: x-values=m[:,5] and y-values=m[:,4]
The first two columns are basically a pair of values, the next two are another pair of values and the last two also are a pair of values.
All the graphs should be in the same plot!
I need a general solution for plotting multiple graphs like this with an undefined but EVEN number of columns of the array. Something like a loop!
Hope somebody can help me :)
You can loop over all of the column pairs:

import matplotlib.pyplot as plt

i = 1
while i < len(m[0]):
    x = m[:, i]
    y = m[:, i - 1]
    plt.plot(x, y)
    plt.savefig('placeholderName_%d.png' % i)
    plt.close()
    i = i + 2

Note that I'm starting at 1 and incrementing by two; this conforms to the example you presented.
This gives terrible results with the m array you specified, but if it was just a sample and your data is more realistic, the following should do:

for i in range(m.shape[1] // 2):
    plt.figure()
    plt.plot(m[:, 2 * i + 1], m[:, 2 * i])  # x from the odd column, y from the even one
If you want all the plots in the same figure, just move the plt.figure() out of the loop:

plt.figure()
for i in range(m.shape[1] // 2):
    plt.plot(m[:, 2 * i + 1], m[:, 2 * i])
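If you also want to tell the pairs apart in the shared figure, one option is a label per pair (the label text here is just a placeholder):

plt.figure()
for i in range(m.shape[1] // 2):
    plt.plot(m[:, 2 * i + 1], m[:, 2 * i], marker='o', label='pair %d' % i)
plt.legend()
plt.show()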
I'm trying to align my index values between multiple DataFrames or Series and I'm using
Series.interpolate but it doesn't seem to interpolate correctly. Or perhaps I am misunderstanding something. Here's a small example:
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series

x1 = np.array([0, 0.25, 0.77, 1.2, 1.4, 2.6, 3.1])
y1 = np.array([0, 1.1, 0.5, 1.5, 1.2, 2.1, 2.4])
x2 = np.array([0, 0.25, 0.66, 1.0, 1.2, 1.4, 3.1])
y2 = np.array([0, 0.2, 0.8, 1.1, 2.2, 0.1, 2.4])

df1 = DataFrame(data=y1, index=x1, columns=['A'])
df1.plot(marker='o')
df2 = DataFrame(data=y2, index=x2, columns=['A'])
df2.plot(marker='o')

df3 = df1 - df2
df3.plot(marker='o')
print(df3)
from functools import reduce

def resample(signals):
    # Union of the x values (index values) across all the signals
    aligned_x_vals = reduce(lambda s1, s2: s1.index.union(s2.index), signals)
    # Reindex each signal onto the common grid and interpolate the gaps
    return [s.reindex(aligned_x_vals).apply(Series.interpolate) for s in signals]

sig1, sig2 = resample([df1, df2])
sig3 = sig1 - sig2

plt.plot(df1.index, df1.values, marker='D')
plt.plot(sig1.index, sig1.values, marker='o')
plt.grid()
plt.figure()
plt.plot(df2.index, df2.values, marker='o')
plt.plot(sig2.index, sig2.values, marker='o')
plt.grid()
I expect sig1 and sig2 to have more points than df1 and df2 but with the values interpolated. There are a few points that are not overlapping. Is this a bug or user error? I'm using v0.7.3
Thanks.
It might be a bug. Looking at the source, Series.interpolate doesn't look at the index values while doing interpolation: it assumes they are equally spaced and just uses positions 0..len(series)-1 as the x coordinates. Maybe that is the intention and it's not a bug; I'm not sure.
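As an aside, assuming a much newer pandas than the v0.7.3 mentioned above: Series.interpolate(method='index') interpolates against the actual index values rather than assuming equal spacing, which is exactly the distinction at issue here:

import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 10.0], index=[0.0, 0.9, 1.0])
print(s.interpolate())                # default: assumes equal spacing -> 5.0 at index 0.9
print(s.interpolate(method='index'))  # index-aware -> 9.0 at index 0.9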
I modified the Series.interpolate method and came up with this interpolate function. This will do what you want.
import numpy as np
from pandas import Series, isnull

def interpolate(serie):
    try:
        inds = np.array([float(d) for d in serie.index])
    except ValueError:
        inds = np.arange(len(serie))

    values = serie.values
    invalid = isnull(values)
    valid = ~invalid

    # Skip leading NaNs: interpolation starts at the first valid value
    firstIndex = valid.argmax()
    valid = valid[firstIndex:]
    invalid = invalid[firstIndex:]
    inds = inds[firstIndex:]

    result = values.copy()
    # np.interp does the actual index-aware linear interpolation
    result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid],
                                             values[firstIndex:][valid])
    return Series(result, index=serie.index, name=serie.name)
I don't think the underlying mathematics guarantees that the sum of interpolations equals the interpolation of the sum; that only holds in special cases (for linear interpolation, when both series share the same index).