Possible bug in Series.interpolate - python

I'm trying to align my index values between multiple DataFrames or Series and I'm using
Series.interpolate but it doesn't seem to interpolate correctly. Or perhaps I am misunderstanding something. Here's a small example:
x1 = np.array([0, 0.25, 0.77, 1.2, 1.4, 2.6, 3.1])
y1 = np.array([0, 1.1, 0.5, 1.5, 1.2, 2.1, 2.4])
x2 = np.array([0, 0.25, 0.66, 1.0, 1.2, 1.4, 3.1])
y2 = np.array([0, 0.2, 0.8, 1.1, 2.2, 0.1, 2.4])
df1 = DataFrame(data=y1, index=x1, columns=['A'])
df1.plot(marker='o')
df2 = DataFrame(data=y2, index=x2, columns=['A'])
df2.plot(marker='o')
df3 = df1 - df2
df3.plot(marker='o')
print(df3)
def resample(signals):
    aligned_x_vals = reduce(lambda s1, s2: s1.index.union(s2.index), signals)
    return map(lambda s: s.reindex(aligned_x_vals).apply(Series.interpolate), signals)
sig1, sig2 = resample([df1, df2])
sig3 = sig1 - sig2
plt.plot(df1.index, df1.values, marker='D')
plt.plot(sig1.index, sig1.values, marker='o')
plt.grid()
plt.figure()
plt.plot(df2.index, df2.values, marker='o')
plt.plot(sig2.index, sig2.values, marker='o')
plt.grid()
I expect sig1 and sig2 to have more points than df1 and df2, with the new values interpolated, but a few of the points do not overlap. Is this a bug or user error? I'm using pandas v0.7.3.
Thanks.

It might be a bug. Looking at the source, Series.interpolate doesn't look at the index values while doing interpolation. It assumes they are equally spaced and effectively uses np.arange(len(serie)) as the x-values. Maybe that is the intention and it's not a bug; I'm not sure.
I modified the Series.interpolate method and came up with this interpolate function. This will do what you want.
import numpy as np
from pandas import *

def interpolate(serie):
    try:
        inds = np.array([float(d) for d in serie.index])
    except ValueError:
        inds = np.arange(len(serie))
    values = serie.values
    invalid = isnull(values)
    valid = ~invalid  # note: use ~, not unary minus, to negate a boolean array
    firstIndex = valid.argmax()
    valid = valid[firstIndex:]
    invalid = invalid[firstIndex:]
    inds = inds[firstIndex:]
    result = values.copy()
    result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid],
                                             values[firstIndex:][valid])
    return Series(result, index=serie.index, name=serie.name)
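In modern pandas this workaround is no longer needed: `interpolate(method='index')` interpolates against the actual index values. A sketch of the original example using it (behavior assumed per current pandas; the asker's v0.7.3 predates it):

```python
import numpy as np
import pandas as pd

x1 = np.array([0, 0.25, 0.77, 1.2, 1.4, 2.6, 3.1])
y1 = np.array([0, 1.1, 0.5, 1.5, 1.2, 2.1, 2.4])
x2 = np.array([0, 0.25, 0.66, 1.0, 1.2, 1.4, 3.1])
y2 = np.array([0, 0.2, 0.8, 1.1, 2.2, 0.1, 2.4])
df1 = pd.DataFrame(data=y1, index=x1, columns=['A'])
df2 = pd.DataFrame(data=y2, index=x2, columns=['A'])

# Reindex both frames onto the union of their indices, then interpolate
# using the actual index values rather than assuming equal spacing.
union = df1.index.union(df2.index)
sig1 = df1.reindex(union).interpolate(method='index')
sig2 = df2.reindex(union).interpolate(method='index')
sig3 = sig1 - sig2
```

With both signals on the same index, the subtraction lines up point for point.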

I don't think the underlying mathematics support this: the sum (or difference) of interpolations is not in general equal to the interpolation of the sum. It only holds in special cases.

Related

Sum of sines and cosines from DFT

I have a signal and want to reconstruct it from its spectrum as a sum of sines and/or cosines. I am aware of the inverse FFT but I want to reconstruct the signal in this way.
An example would look like this:
sig = np.array([1, 5, -3, 0.7, 3.1, -5, -0.5, 3.2, -2.3, -1.1, 3, 0.3, -2.05, 2.1, 3.05, -2.3])
fft = np.fft.rfft(sig)
mag = np.abs(fft) * 2 / sig.size
phase = np.angle(fft)
x = np.arange(sig.size)
reconstructed = list()
for x_i in x:
    val = 0
    for i, (m, p) in enumerate(zip(mag, phase)):
        val += ...  # what's the correct form?
    reconstructed.append(val)
What's the correct code to write in the next-to-last line?
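For completeness, a sketch of the missing line (not from the original thread): each rfft bin i contributes m·cos(2πi·x_i/N + p). With the 2/N magnitude scaling used above, the DC bin (i = 0) and, for even N, the Nyquist bin (the last one) must be halved because they appear only once in the two-sided spectrum:

```python
import numpy as np

sig = np.array([1, 5, -3, 0.7, 3.1, -5, -0.5, 3.2, -2.3, -1.1, 3, 0.3, -2.05, 2.1, 3.05, -2.3])
fft = np.fft.rfft(sig)
mag = np.abs(fft) * 2 / sig.size
phase = np.angle(fft)

reconstructed = []
for x_i in np.arange(sig.size):
    val = 0.0
    for i, (m, p) in enumerate(zip(mag, phase)):
        # The DC and Nyquist bins occur only once in the two-sided
        # spectrum, so their doubled magnitudes are halved again.
        if i == 0 or i == len(mag) - 1:
            m = m / 2
        val += m * np.cos(2 * np.pi * i * x_i / sig.size + p)
    reconstructed.append(val)
```

The result matches the original signal to floating-point precision.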

Most efficient way to convert list of values to probability distribution?

I have several lists that can only contain the following values: 0, 0.5, 1, 1.5
I want to efficiently convert each of these lists into probability mass functions. So if a list is as follows: [0.5, 0.5, 1, 1.5], the PMF will look like this: [0, 0.5, 0.25, 0.25].
I need to do this many times (and with very large lists), so avoiding looping will be optimal, if at all possible. What's the most efficient way to make this happen?
Edit: Here's my current system. This feels like a really inefficient/unelegant way to do it:
def get_distribution(samplemodes1):
    n, bin_edges = np.histogram(samplemodes1, bins=9)
    totalcount = np.sum(n)
    bin_probability = n / totalcount
    bins_per_point = np.fmin(np.digitize(samplemodes1, bin_edges), len(bin_edges) - 1)
    probability_perpoint = [bin_probability[bins_per_point[i] - 1] for i in range(len(samplemodes1))]
    counts = Counter(samplemodes1)
    total = sum(counts.values())
    probability_mass = {k: v / total for k, v in counts.items()}
    #print(probability_mass)
    key_values = {}
    if 0 in probability_mass:
        key_values[0] = probability_mass.get(0)
    else:
        key_values[0] = 0
    if 0.5 in probability_mass:
        key_values[0.5] = probability_mass.get(0.5)
    else:
        key_values[0.5] = 0
    if 1 in probability_mass:
        key_values[1] = probability_mass.get(1)
    else:
        key_values[1] = 0
    if 1.5 in probability_mass:
        key_values[1.5] = probability_mass.get(1.5)
    else:
        key_values[1.5] = 0
    distribution = list(key_values.values())
    return distribution
Here are some solutions for you to benchmark:
Using collections.Counter
from collections import Counter

bins = [0, 0.5, 1, 1.5]
a = [0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5]
denom = len(a)
counts = Counter(a)
pmf = [counts[b] / denom for b in bins]
NumPy based solution
import numpy as np

bins = [0, 0.5, 1, 1.5]
a = np.array([0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5])
denom = len(a)
pmf = [(a == b).sum() / denom for b in bins]
but you can probably do better by using np.bincount() instead.
Further reading on this idea: https://thispointer.com/count-occurrences-of-a-value-in-numpy-array-in-python/
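As suggested, `np.bincount` can do the counting in a single pass. A sketch, assuming (as in the question) that all values are exact multiples of 0.5, so they map cleanly to integer bin indices:

```python
import numpy as np

# The question's example list: [0.5, 0.5, 1, 1.5] -> [0, 0.5, 0.25, 0.25]
a = np.array([0.5, 0.5, 1.0, 1.5])

# Map each value to an integer bin index (value / 0.5), count all bins
# at once with bincount, then normalize to a PMF.
idx = np.round(a / 0.5).astype(int)
counts = np.bincount(idx, minlength=4)   # [0, 2, 1, 1]
pmf = counts / len(a)                    # [0.0, 0.5, 0.25, 0.25]
```

This avoids the Python-level loop over bins entirely, which should matter for very large lists.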

how to iterate over certain rows based on condition to calculate cosine distance

I have a data frame that looks like the one below. Notice that the index is not sequential.
df = pd.DataFrame(np.array([[0.1, 0.2, 0.1, 1], [0.4, 0.5, 0, 0], [0.2, 0.4, 0.2, 0],
                            [0.3, 0.1, 0.2, 1], [0.4, 0.2, 0.2, 1]]),
                  columns=['a', 'b', 'c', 'manager'])
df = df.set_index([pd.Index([0, 2, 10, 14, 16])], 'id')
I would like to calculate the cosine distance between each row and the rows that have 1 in manager (excluding itself), then take the average and append it to a new column cos_distance. For example, for row 0 I would take the cosine distance to rows 3 and 4 and average them. How do I add the condition to restrict it to rows with 1 in the manager column only?
I tried running the code below, but probably because we don't have sequential indices, it returned an empty list.
from scipy.spatial.distance import cosine as cos

x = df.iloc[:, :3]
manager = df[df['manager'] == 1].iloc[:, :3]
lead_cos = []
for i in range(0):
    person_cos = []
    for j in range(0, len(manager)):
        person_cos.append(cos(x.loc[i], manager.loc[j]))
    lead_cos.append(np.average(person_cos))
lead_cos
Desired output:
This is what I'm trying. I'm not getting exactly the values in your desired output, probably because for each "manager" I include itself in the cosine calculation (maybe you need to avoid that too, not sure).
EDIT: I managed to avoid repeating the current manager. However, index 14 gives me a value different from yours. I also included rounding to 2 decimal places.
from scipy.spatial.distance import cosine as cos
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[0.1, 0.2, 0.1, 1], [0.4, 0.5, 0, 0], [0.2, 0.4, 0.2, 0],
                            [0.3, 0.1, 0.2, 1], [0.4, 0.2, 0.2, 1]]),
                  columns=['a', 'b', 'c', 'manager'])
df = df.set_index([pd.Index([0, 2, 10, 14, 16])], 'id')
n = df.shape[0]
x = df.iloc[:, :3]
manager = df[df['manager'] == 1].iloc[:, :3]
n_man = manager.shape[0]
lead_cos = []
for i in range(n):
    person_cos = []
    for j in range(n_man):
        if x.index[i] != manager.index[j]:
            person_cos.append(cos(x.values.tolist()[i], manager.values.tolist()[j]))
    lead_cos.append(round(np.average(person_cos), 2))
df['lead_cos'] = lead_cos
print(df)
Output:
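If the double loop becomes a bottleneck, here is a hedged sketch (not from the original answer) of a vectorized variant using `scipy.spatial.distance.cdist`: compute all row-to-manager distances at once, then exclude each manager's own row before averaging:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame(np.array([[0.1, 0.2, 0.1, 1], [0.4, 0.5, 0, 0], [0.2, 0.4, 0.2, 0],
                            [0.3, 0.1, 0.2, 1], [0.4, 0.2, 0.2, 1]]),
                  columns=['a', 'b', 'c', 'manager'])
df = df.set_index(pd.Index([0, 2, 10, 14, 16]))

feats = df.iloc[:, :3].to_numpy()
mgr_pos = np.flatnonzero(df['manager'].to_numpy() == 1)

# All pairwise cosine distances: one row per person, one column per manager.
D = cdist(feats, feats[mgr_pos], metric='cosine')

# Zero out each manager's distance to itself and divide by the number of
# *other* managers, so self-comparisons are excluded from the average.
self_mask = mgr_pos[None, :] == np.arange(len(feats))[:, None]
sums = np.where(self_mask, 0.0, D).sum(axis=1)
n_other = len(mgr_pos) - self_mask.sum(axis=1)
df['lead_cos'] = np.round(sums / n_other, 2)
print(df)
```

The result should match the loop version above, without any Python-level iteration over rows.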

"Unsorting" a Quicksort

(Quick note! While I know there are plenty of options for sorting in Python, this code is more of a generalized proof-of-concept and will later be ported to another language, so I won't be able to use any specific Python libraries or functions.
In addition, the solution you provide doesn't necessarily have to follow my approach below.)
Background
I have a quicksort algorithm and am trying to implement a method to allow later 'unsorting' of the new location of a sorted element. That is, if element A is at index x and is sorted to index y, then the 'pointer' (or, depending on your terminology, reference or mapping) array changes its value at index x from x to y.
In more detail:
You begin the program with an array, arr, with some given set of numbers. This array is later run through a quick sort algorithm, as sorting the array is important for future processing on it.
The ordering of this array is important. As such, you have another array, ref, which contains the indices of the original array such that when you map the reference array to the array, the original ordering of the array is reproduced.
Before the array is sorted, the array and mapping looks like this:
arr = [1.2, 1.5, 1.5, 1.0, 1.1, 1.8]
ref = [0, 1, 2, 3, 4, 5]
--------
map(arr,ref) -> [1.2, 1.5, 1.5, 1.0, 1.1, 1.8]
You can see that index 0 of ref points to index 0 of arr, giving you 1.2. Index 1 of ref points to index 1 of arr, giving you 1.5, and so on.
After the array is sorted, ref should be rearranged such that when you map it according to the above procedure, it generates the pre-sorted arr:
arr = [1.0, 1.1, 1.2, 1.5, 1.5, 1.8]
ref = [2, 3, 4, 0, 1, 5]
--------
map(arr,ref) -> [1.2, 1.5, 1.5, 1.0, 1.1, 1.8]
Again, index 0 of ref is 2, so the first element of the mapped array is arr[2]=1.2. Index 1 of ref is 3, so the second element of the mapped array is arr[3]=1.5, and so on.
The Issue
The current implementation of my code works great for sorting, but horrible for the remapping of ref.
Given the same array arr, the output of my program looks like this:
arr = [1.0, 1.1, 1.2, 1.5, 1.5, 1.8]
ref = [3, 4, 0, 1, 2, 5]
--------
map(arr,ref) -> [1.5, 1.5, 1.0, 1.1, 1.2, 1.8]
This is a problem because this mapping is definitely not equal to the original:
[1.5, 1.5, 1.0, 1.1, 1.2, 1.8] != [1.2, 1.5, 1.5, 1.0, 1.1, 1.8]
My approach has been this:
When elements a and b, at indices x and y in arr are switched,
Then set ref[x] = y and ref[y] = x.
This is not working and I can't think of another solution that doesn't need O(n^2) time.
Thank you!
Minimally Reproducible Example
testing = [1.5, 1.2, 1.0, 1.0, 1.2, 1.2, 1.5, 1.3, 2.0, 0.7, 0.2, 1.4, 1.2, 1.8, 2.0, 2.1]

# This is the 'map(arr,ref) ->' function
def print_links(a, b):
    tt = [a[b[i] - 1] for i in range(0, len(a))]
    print("map(arr,ref) -> {}".format(tt))
    # This tests the re-mapping against an original copy of the array
    f = 0
    for i in range(0, len(testing)):
        if testing[i] == tt[i]:
            f += 1
    print("{}/{}".format(f, len(a)))

def quick_sort(arr, ref, first=None, last=None):
    if first == None:
        first = 0
    if last == None:
        last = len(arr) - 1
    if first < last:
        split = partition(arr, ref, first, last)
        quick_sort(arr, ref, first, split - 1)
        quick_sort(arr, ref, split + 1, last)

def partition(arr, ref, first, last):
    pivot = arr[first]
    left = first + 1
    right = last
    done = False
    while not done:
        while left <= right and arr[left] <= pivot:
            left += 1
        while arr[right] >= pivot and right >= left:
            right -= 1
        if right < left:
            done = True
        else:
            temp = arr[left]
            arr[left] = arr[right]
            arr[right] = temp
            # This is my attempt at preserving indices part 1
            temp = ref[left]
            ref[left] = ref[right]
            ref[right] = temp
    temp = arr[first]
    arr[first] = arr[right]
    arr[right] = temp
    # This is my attempt at preserving indices part 2
    temp = ref[first]
    ref[first] = ref[right]
    ref[right] = temp
    return right

# Main body of code
a = [1.5, 1.2, 1.0, 1.0, 1.2, 1.2, 1.5, 1.3, 2.0, 0.7, 0.2, 1.4, 1.2, 1.8, 2.0, 2.1]
b = list(range(1, len(a) + 1))
print("The following should match:")
print("a = {}".format(a))
a0 = a[:]
print("ref = {}".format(b))
print("----")
print_links(a, b)
print("\nQuicksort:")
quick_sort(a, b)
print(a)
print("\nThe following should match:")
print("arr = {}".format(a0))
print("ref = {}".format(b))
print("----")
print_links(a, b)
You can do what you ask, but when we have to do something like this in real life, we usually mess with the sort's comparison function instead of the swap function. Sorting routines provided with common languages usually have that capability built in so you don't have to write your own sort.
In this procedure, you sort the ref array (called order below) by the value of the arr element it points to. This generates the same ref array you already have, but without modifying arr.
Mapping with this ordering sorts the original array. You expected it to unsort the sorted array, which is why your code isn't working.
You can invert this ordering to get the ref array you were originally looking for, or you can just leave arr unsorted and map it through order when you need it ordered.
arr = [1.5, 1.2, 1.0, 1.0, 1.2, 1.2, 1.5, 1.3, 2.0, 0.7, 0.2, 1.4, 1.2, 1.8, 2.0, 2.1]
order = sorted(range(len(arr)), key=lambda i: arr[i])
new_arr = [arr[order[i]] for i in range(len(arr))]
print("original array = {}".format(arr))
print("sorted ordering = {}".format(order))
print("sorted array = {}".format(new_arr))

ref = [0] * len(order)
for i in range(len(order)):
    ref[order[i]] = i
unsorted = [new_arr[ref[i]] for i in range(len(ref))]
print("unsorted after sorting = {}".format(unsorted))
Output:
original array = [1.5, 1.2, 1.0, 1.0, 1.2, 1.2, 1.5, 1.3, 2.0, 0.7, 0.2, 1.4, 1.2, 1.8, 2.0, 2.1]
sorted ordering = [10, 9, 2, 3, 1, 4, 5, 12, 7, 11, 0, 6, 13, 8, 14, 15]
sorted array = [0.2, 0.7, 1.0, 1.0, 1.2, 1.2, 1.2, 1.2, 1.3, 1.4, 1.5, 1.5, 1.8, 2.0, 2.0, 2.1]
unsorted after sorting = [1.5, 1.2, 1.0, 1.0, 1.2, 1.2, 1.5, 1.3, 2.0, 0.7, 0.2, 1.4, 1.2, 1.8, 2.0, 2.1]
You don't need to maintain a map of indices to elements; just sort the indices alongside the array. For example:
unsortedArray = [1.5, 1.2, 2.1]
unsortedIndexes = [0, 1, 2]
sortedArray = [1.2, 1.5, 2.1]
As you sort unsortedArray you swap indexes 0 and 1 as well, giving sortedIndexes = [1, 0, 2]; you can then recover the original array as sortedArray[1], sortedArray[0], sortedArray[2].
def inplace_quick_sort(s, indexes, start, end):
    if start >= end:
        return
    pivot = getPivot(s, start, end)  # could be any pivot-selection function
    left = start
    right = end - 1
    while left <= right:
        while left <= right and customCmp(pivot, s[left]):
            # s[left] < pivot
            left += 1
        while left <= right and customCmp(s[right], pivot):
            # pivot < s[right]
            right -= 1
        if left <= right:
            s[left], s[right] = s[right], s[left]
            indexes[left], indexes[right] = indexes[right], indexes[left]
            left, right = left + 1, right - 1
    s[left], s[end] = s[end], s[left]
    indexes[left], indexes[end] = indexes[end], indexes[left]
    inplace_quick_sort(s, indexes, start, left - 1)
    inplace_quick_sort(s, indexes, left + 1, end)

def customCmp(a, b):
    return a > b

def getPivot(s, start, end):
    return s[end]
if __name__ == '__main__':
    arr = [1.5, 1.2, 1.0, 1.0, 1.2, 1.2, 1.5, 1.3, 2.0, 0.7, 0.2, 1.4, 1.2, 1.8, 2.0, 2.1]
    indexes = [i for i in range(len(arr))]
    inplace_quick_sort(arr, indexes, 0, len(arr) - 1)
    print("sorted = {}".format(arr))
    ref = [0] * len(indexes)
    for i in range(len(indexes)):
        # The core point of Matt Timmermans' answer about how to construct ref:
        # indexes[i] is the index into the original array and i is the index
        # into the sorted array, so we get the map by ref[indexes[i]] = i
        ref[indexes[i]] = i
    unsorted = [arr[ref[i]] for i in range(len(ref))]
    print("unsorted after sorting = {}".format(unsorted))
It's not that horrible: you've merely reversed your reference usage. Your indices, ref, tell you how to build the sorted list from the original. However, you've used it in the opposite direction: you've applied it to the sorted list, trying to reconstruct the original. You need the inverse mapping.
Is that enough to get you to solve your problem?
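A minimal sketch of that inverse mapping, using 0-based versions of the question's example arrays:

```python
# The question's 0-based example: applying ref to the sorted array
# reproduces the original ordering.
original = [1.2, 1.5, 1.5, 1.0, 1.1, 1.8]
sorted_arr = [1.0, 1.1, 1.2, 1.5, 1.5, 1.8]
ref = [2, 3, 4, 0, 1, 5]

# Invert the permutation in O(n): where ref maps output slots to sorted
# positions, inv maps sorted positions back to output slots.
inv = [0] * len(ref)
for out_pos, sorted_pos in enumerate(ref):
    inv[sorted_pos] = out_pos

restored = [sorted_arr[ref[i]] for i in range(len(ref))]   # original order
resorted = [original[inv[i]] for i in range(len(inv))]     # sorted order
```

Inverting takes a single pass, so you can freely convert between "sorting" and "unsorting" directions of the mapping.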
I think you can just repair your ref array after the fact. From your code sample, just insert the following snippet after the call to quick_sort(a,b):
c = list(range(1, len(b) + 1))
for i in range(0, len(b)):
    c[b[i] - 1] = i + 1
The c array should now contain the correct references.
Stealing/rewording what @Prune writes: what you have in b is the forward transformation, the sorting itself. Applying it to a0 produces the sorted list (print_links(a0,b)).
You just have to revert it via looking up which element went to what position:
c = [b.index(i) + 1 for i in range(1, len(a) + 1)]
print_links(a,c)

Calculating crossing (intercept) points of a Series or DataFrame

I have periodic data with the index being a floating point number like so:
time = [0, 0.1, 0.21, 0.31, 0.40, 0.49, 0.51, 0.6, 0.71, 0.82, 0.93]
voltage = [1, -1, 1.1, -0.9, 1, -1, 0.9,-1.2, 0.95, -1.1, 1.11]
df = DataFrame(data=voltage, index=time, columns=['voltage'])
df.plot(marker='o')
I want to create a cross(df, y_val, direction='rise' | 'fall' | 'cross') function that returns an array of times (indexes) with all the
interpolated points where the voltage values equal y_val. For 'rise' only the values where the slope is positive are returned; for 'fall' only the values with a negative slope are returned; for 'cross' both are returned. So if y_val=0 and direction='cross' then an array with 10 values would be returned with the X values of the crossing points (the first one being 0.05).
I was thinking this could be done with an iterator but was wondering if there was a better way to do this.
Thanks. I'm loving Pandas and the Pandas community.
To do this I ended up with the following. It is a vectorized version which is 150x faster than one that uses a loop.
def cross(series, cross=0, direction='cross'):
    """
    Given a Series, return all the index values where the data values equal
    the 'cross' value.

    Direction can be 'rising' (for rising edge), 'falling' (for only falling
    edge), or 'cross' for both edges.
    """
    # Find if values are above or below the yvalue crossing:
    above = series.values > cross
    below = np.logical_not(above)
    left_shifted_above = above[1:]
    left_shifted_below = below[1:]
    x_crossings = []
    # Find indexes on left side of crossing point
    if direction == 'rising':
        idxs = (left_shifted_above & below[0:-1]).nonzero()[0]
    elif direction == 'falling':
        idxs = (left_shifted_below & above[0:-1]).nonzero()[0]
    else:
        rising = left_shifted_above & below[0:-1]
        falling = left_shifted_below & above[0:-1]
        idxs = (rising | falling).nonzero()[0]
    # Calculate x crossings with interpolation using the formula for a line:
    x1 = series.index.values[idxs]
    x2 = series.index.values[idxs + 1]
    y1 = series.values[idxs]
    y2 = series.values[idxs + 1]
    x_crossings = (cross - y1) * (x2 - x1) / (y2 - y1) + x1
    return x_crossings
# Test it out:
time = [0, 0.1, 0.21, 0.31, 0.40, 0.49, 0.51, 0.6, 0.71, 0.82, 0.93]
voltage = [1, -1, 1.1, -0.9, 1, -1, 0.9,-1.2, 0.95, -1.1, 1.11]
df = DataFrame(data=voltage, index=time, columns=['voltage'])
x_crossings = cross(df['voltage'])
y_crossings = np.zeros(x_crossings.shape)
plt.plot(time, voltage, '-ob', x_crossings, y_crossings, 'or')
plt.grid(True)
It was quite satisfying when this worked. Any improvements that can be made?
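One possible simplification (my sketch, not from the thread): the above/below bookkeeping can be collapsed into a single sign-change test with `np.sign` and `np.diff`; the interpolation formula stays the same. (Exact hits give sign 0 and would need extra care.)

```python
import numpy as np
import pandas as pd

time = [0, 0.1, 0.21, 0.31, 0.40, 0.49, 0.51, 0.6, 0.71, 0.82, 0.93]
voltage = [1, -1, 1.1, -0.9, 1, -1, 0.9, -1.2, 0.95, -1.1, 1.11]
s = pd.Series(voltage, index=time)

level = 0.0
sign = np.sign(s.values - level)
# A nonzero difference of consecutive signs marks a crossing; the sign of
# that difference distinguishes rising (+) from falling (-) edges.
idxs = np.nonzero(np.diff(sign) != 0)[0]
x1, x2 = s.index.values[idxs], s.index.values[idxs + 1]
y1, y2 = s.values[idxs], s.values[idxs + 1]
x_crossings = (level - y1) * (x2 - x1) / (y2 - y1) + x1
```

For this test signal, which alternates sign at every sample, all 10 crossings are found.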
