scipy.signal: filtering a variable-time dataset

I have a large dataset of the form [t, y(t)] to which I want to apply an IIR low-pass filter (first- or second-order Butterworth should suffice) using scipy.signal (in particular scipy.signal.butter and scipy.signal.filtfilt). The problem is that t is not regularly spaced, while regular spacing appears to be a requirement for the functions in scipy.signal.
For any "missing" points, I know that my signal remains unchanged from its previous value (so given two consecutive points t1 and t2 in my t-data and a point T not in the data, such that t1<T<t2, the "real" function Y(t) which I'm sampling would take the value Y(T)=Y(t1)). t is integer-valued, so I could simply add the missing points, but this would cause the size of my dataset to grow by a factor ~10, which is problematic given that it's already very large.
So the question is: is there a (sufficiently simple and low-overhead) way to filter my dataset without inserting all the "missing" points?

You can efficiently "wrap" your data into a function.
If your data is in the form of a list of lists, you'll need to convert it into a dict and create a sorted list of your t values. Then you can look up the value for any missing point using the list bisection algorithm in the bisect module.
Here's some demo code written in Python 2, but it should be straightforward to convert to Python 3 if required.
from random import seed, sample
from bisect import bisect

# Create some fake data
seed(37)
data = dict((u, u/10.) for u in sample(xrange(50), 25))
keys = data.keys()
keys.sort()
print keys

def interp(t):
    i = bisect(keys, t)
    k = keys[max(0, i-1)]
    return data[k]

for i in xrange(50):
    print i, interp(i)
Output:
[2, 4, 8, 10, 14, 15, 19, 21, 22, 23, 26, 27, 29, 30,
32, 33, 34, 35, 37, 38, 39, 42, 43, 44, 48]
0 0.2
1 0.2
2 0.2
3 0.2
4 0.4
5 0.4
6 0.4
7 0.4
8 0.8
9 0.8
10 1.0
11 1.0
12 1.0
13 1.0
14 1.4
15 1.5
16 1.5
17 1.5
18 1.5
19 1.9
20 1.9
21 2.1
22 2.2
23 2.3
24 2.3
25 2.3
26 2.6
27 2.7
28 2.7
29 2.9
30 3.0
31 3.0
32 3.2
33 3.3
34 3.4
35 3.5
36 3.5
37 3.7
38 3.8
39 3.9
40 3.9
41 3.9
42 4.2
43 4.3
44 4.4
45 4.4
46 4.4
47 4.4
48 4.8
49 4.8
(I manually wrapped the output of keys to make it easier to read without horizontal scrolling).
You'll get a tiny speedup by rewriting the body of the interpolation function as one line:
def interp(t):
    return data[keys[max(0, bisect(keys, t)-1)]]
It's much less readable, IMHO, but the speed difference may be worth it if the function gets called a lot.
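For reference, here is a rough Python 3 translation of the same idea (a sketch; dict.keys() returns a view in Python 3, so the sorted key list is built explicitly, and the same seed may produce different sample values than it did on Python 2):
from random import seed, sample
from bisect import bisect

# Create some fake data (range replaces xrange in Python 3)
seed(37)
data = {u: u / 10 for u in sample(range(50), 25)}
keys = sorted(data)  # build the sorted key list explicitly

def interp(t):
    # value at the closest key <= t (zero-order hold)
    return data[keys[max(0, bisect(keys, t) - 1)]]

for i in range(50):
    print(i, interp(i))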

The answer by PM 2Ring works, but assuming that your data are already ordered by t, it is less efficient than it could be: it takes log-linear time and linear additional space. Instead, you can write a generator that produces a transformed dataset with regular sampling intervals in linear time and constant additional space:
# Assumes that dataset rows are lists as described in the question:
# [[t1, Y(t1)], [t2, Y(t2)], [t3, Y(t3)], ..., [tz, Y(tz)]]
# If this assumption is wrong, just extract t and Y(t) in another way.
# The generated range starts at t1 and ends directly after tz.
# Warning: will overgenerate points if the data are more densely sampled
# than the requested sampling interval.
def step_interpolate(dataset, interval):
    left = next(dataset)   # [t1, Y(t1)]
    right = next(dataset)  # [t2, Y(t2)]
    t_regular = left[0]
    while True:
        if left is right:  # same list object
            right = next(dataset)  # iteration stops when dataset stops
        if right[0] <= t_regular:
            left = right
        yield [t_regular, left[1]]
        t_regular += interval
Testing:
data = [[1, 10], [15, 2], [50, 100], [55, 17]]
for item in step_interpolate(iter(data), 10):
    print item[0], item[1]
Output:
1 10
11 10
21 2
31 2
41 2
51 100
61 17
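One caveat: on Python 3.7+, PEP 479 turns a StopIteration escaping a generator into a RuntimeError, so the bare next(dataset) call inside the loop no longer terminates the generator cleanly. A minimal sketch of the adjustment (also switching the test to a Python 3 print):
def step_interpolate(dataset, interval):
    left = next(dataset)
    right = next(dataset)
    t_regular = left[0]
    while True:
        if left is right:
            try:
                right = next(dataset)
            except StopIteration:
                return  # end the generator instead of leaking StopIteration
        if right[0] <= t_regular:
            left = right
        yield [t_regular, left[1]]
        t_regular += interval

data = [[1, 10], [15, 2], [50, 100], [55, 17]]
for item in step_interpolate(iter(data), 10):
    print(item[0], item[1])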


My output is exactly the same as the expected output, but for some reason it is marked wrong

I am trying to write a program in which a function accepts the name of a file and a list of tuples. Each tuple in the list contains 5 entries, where the 1st, 3rd, and 4th entries are ints, the 2nd entry is a string, and the 5th entry is a float. The goal of the program is to write the 5 elements of each tuple on a separate line in the file: all 5 from tuple 1 on one line, all 5 from tuple 2 on the second line, and so on.
Right now, my code is giving me exactly the output expected by the test cases, but for some reason it says that my answer is incorrect. I am pretty sure the issue is that my code is adding an extra space at the end of each line. How can I fix my code?
def write_1301(filename, tuple_list):
    file = open(filename, "w")
    for tup in tuple_list:
        for i in range(0, len(tup)):
            print(tup[i], file=file, end=" ")
        print(file=file)
    file.close()

# The code below will test your function. It's not used
# for grading, so feel free to modify it! You may check
# output.cs1301 to see how it worked.
tuple1 = (1, "exam_1", 90, 100, 0.6)
tuple2 = (2, "exam_2", 95, 100, 0.4)
tupleList = [tuple1, tuple2]
write_1301("output.cs1301", tupleList)
Example of a test case:
We tested your code with filename = "AutomatedTestOutput-MAZIxa.txt", tuple_list = [(1, 'quiz_1', 13, 20, 0.17), (2, 'exam_1', 55, 85, 0.12), (3, 'exam_2', 15, 20, 0.1), (4, 'assignment_1', 15, 30, 0.11), (5, 'test_1', 21, 35, 0.17), (6, 'quiz_2', 44, 70, 0.11), (7, 'test_2', 85, 100, 0.12), (8, 'test_3', 35, 50, 0.1)]. We expected the file to contain this:
1 quiz_1 13 20 0.17
2 exam_1 55 85 0.12
3 exam_2 15 20 0.1
4 assignment_1 15 30 0.11
5 test_1 21 35 0.17
6 quiz_2 44 70 0.11
7 test_2 85 100 0.12
8 test_3 35 50 0.1
However, the file contained this:
1 quiz_1 13 20 0.17
2 exam_1 55 85 0.12
3 exam_2 15 20 0.1
4 assignment_1 15 30 0.11
5 test_1 21 35 0.17
6 quiz_2 44 70 0.11
7 test_2 85 100 0.12
8 test_3 35 50 0.1
I tried using end= in various ways but had no luck and am confused.
Your issue is that you're manually writing each element individually and adding a space after each one. This means that when you write the last element you also add a trailing space, so your output looks like it matches, but actually your lines all have a trailing space.
Instead, you can use str.join to join all the elements with a single space. This adds the space only between elements, not after every element. However, join expects all the values passed to it to be strings, so we can use a list comprehension to convert all the elements of the tuple to strings and pass that list of strings to join.
def write_1301(filename, tuple_list):
    with open(filename, "w") as file:
        for tup in tuple_list:
            print(" ".join([str(t) for t in tup]), file=file)

# The code below will test your function. It's not used
# for grading, so feel free to modify it! You may check
# output.cs1301 to see how it worked.
tuple1 = (1, "exam_1", 90, 100, 0.6)
tuple2 = (2, "exam_2", 95, 100, 0.4)
tupleList = [tuple1, tuple2]
write_1301("output.cs1301", tupleList)

A very fast way to find repeating combinations in Python using pandas?

I have this "DrawsDB.csv" sample file as input:
Day,Hour,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14,N15,N16,N17,N18,N19,N20
1996-03-18,15:00,4,9,10,16,21,22,23,26,27,34,35,41,42,48,62,66,68,73,76,78
1996-03-19,15:00,6,12,15,19,28,33,35,39,44,48,49,59,62,63,64,67,69,71,75,77
1996-03-21,15:00,2,4,6,7,15,16,17,19,20,26,28,45,48,52,54,69,72,73,75,77
1996-03-22,15:00,3,8,15,17,19,25,30,33,34,35,36,38,44,49,60,61,64,67,68,75
1996-03-25,15:00,2,10,11,14,18,22,26,27,29,30,42,44,45,55,60,61,66,67,75,79
2022-01-01,15:00,1,9,12,17,33,34,36,37,38,44,45,46,53,56,58,60,62,63,70,72
2022-01-01,22:50,1,3,4,14,19,22,24,27,32,33,35,36,44,48,53,55,69,70,76,78
2022-01-02,15:00,13,15,16,19,22,24,31,37,38,43,47,58,64,66,70,72,73,75,76,78
2022-01-02,22:50,5,10,11,14,16,28,29,36,41,53,54,56,58,59,61,67,68,71,73,77
2022-01-03,15:00,8,9,10,11,15,20,21,22,26,30,35,36,39,42,52,58,63,64,73,80
2022-01-03,22:50,4,9,17,21,22,32,33,34,36,37,38,41,48,49,50,60,64,69,70,75
2022-01-04,15:00,4,5,7,9,11,16,17,21,22,25,30,37,38,39,44,49,52,60,65,78
2022-01-04,22:50,17,18,22,26,27,30,31,40,43,49,55,62,63,64,65,71,72,73,76,80
2022-01-05,15:00,1,5,8,14,15,20,23,25,26,33,34,35,37,47,54,59,67,70,72,76
2022-01-05,22:50,6,7,14,15,16,18,26,37,39,41,45,51,52,54,55,59,61,70,71,80
2022-01-06,15:00,9,10,11,17,28,30,32,41,42,44,45,49,50,51,55,65,67,72,76,78
2022-01-06,22:50,1,2,6,9,11,15,21,26,31,37,40,43,47,51,52,54,67,68,73,75
This is just a sample; the real csv file is more than 50,000 rows in total.
The N1 to N20 columns contain random values that never repeat within the same row (no duplicates), sorted from the smallest (N1) to the biggest (N20).
I want to get repeating combos (of 5 numbers, let's say) across all rows of the DataFrame, from columns N1 to N20.
So, for the entire .csv file posted above the output should be:
(6, 15, 26, 52, 54) 3
(17, 33, 34, 36, 38) 3
(17, 33, 34, 36, 60) 3
(17, 33, 34, 38, 60) 3
(17, 33, 36, 38, 60) 3
(17, 34, 36, 38, 60) 3
(33, 34, 36, 38, 60) 3
...
This is the full output, which I'm not posting here because of text size limitations:
https://pastebin.com/4EVXXSn1
Please check it out.
Sorry for the long output; I tried to create a shorter one but couldn't get representative combos for it.
This is the Python code I wrote to accomplish what I need (please read its comments too):
import pandas as pd
from itertools import combinations
from collections import Counter

df = pd.read_csv("DrawsDB.csv")
# looping through the DataFrame using the method found here:
# https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
df = df.reset_index()  # make sure indexes pair with number of rows
draws = []
# please read this: https://stackoverflow.com/a/55557758/7710871 (conclusion: iterrows is very slow)
for index, row in df.iterrows():
    draws.append(
        [row['N1'], row['N2'], row['N3'], row['N4'], row['N5'], row['N6'], row['N7'], row['N8'], row['N9'], row['N10'],
         row['N11'], row['N12'], row['N13'], row['N14'], row['N15'], row['N16'], row['N17'], row['N18'], row['N19'],
         row['N20']])
# comparing every pair of rows in order to check for repeating combos:
repeating_combos = []
for i in range(len(draws)):
    for j in draws[i + 1:]:
        repeating_combos.append(sorted(set(draws[i]).intersection(j)))
# e.g. getting any repeating combo of 5 across all rows:
combos_of_5 = []
for each in repeating_combos:
    if len(each) == 5:
        combos_of_5.append(tuple(each))
    elif len(each) > 5:
        # a repeating sequence of 6 numbers contains C(6, 5) = 6 combos of 5 numbers,
        # a repeating sequence of 7 numbers contains C(7, 5) = 21 combos of 5, and so on
        for cmb in combinations(each, 5):
            combos_of_5.append(cmb)  # each is already sorted and unique, so cmb is too
# count how many times each combo appears:
x = Counter(combos_of_5)
sorted_x = dict(sorted(x.items(), key=lambda item: item[1], reverse=True))
for k, v in sorted_x.items():
    print(k, v)
It works as expected, but there is one problem: for a bigger DataFrame it takes a very long time to get the job done. Worse, if you want repeating combinations of more than 5 numbers (say 6, 7, 8 or 9), it will take forever to run.
How can I do this in pure pandas, in a much faster and smarter way than I did?
Also, please note that the code should not generate every possible combo up front and then search the DataFrame for each one, because that would take even longer.
Thank you very much in advance!
P.S. What if the numbers from N1 to N20 were not sorted? Would that make any difference?
I have read this topic and many others already, but none of them asks for the same thing, so I don't think this is a duplicate, and it could help many others with the same or a very similar problem.
Proof of work:
Given this part of your dataframe:
index  Day         Hour   N1  N2  N3  N4  N5  N6  N7  N8  N9  N10  N11  N12  N13  N14  N15  N16  N17  N18
0      1996-03-18  15:00   4   9  10  16  21  22  23  26  27   34   35   41   42   48   62   66   68   73
1      1996-03-19  15:00   6  12  15  19  28  33  35  39  44   48   49   59   62   63   64   67   69   71
2      1996-03-21  15:00   2   4   6   7  15  16  17  19  20   26   28   45   48   52   54   69   72   73
3      1996-03-22  15:00   3   8  15  17  19  25  30  33  34   35   36   38   44   49   60   61   64   67
You can update your code with something similar to the one below:
check = [6, 15]
df['check'] = df.iloc[:, 2:].apply(lambda r: all(s in r.values for s in check), axis=1)
true_count = df.check.sum()
print(f'The following numbers {check} appear {true_count} time(s) in the dataframe.')
Result:
The following numbers [6, 15] appear 2 time(s) in the dataframe.
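The answer above counts occurrences of one given combo; to enumerate all repeating combos faster than the quadratic pairwise loop, one option (a sketch, not benchmarked on the full 50,000-row file) is to count each row's 5-number combinations directly with a single Counter, which is linear in the number of rows. Two caveats: the semantics differ slightly from the pairwise-intersection version (a combo present in k rows is counted k times here versus C(k, 2) times there; the two agree at k = 3), and each row contributes C(20, 5) = 15,504 combinations, so the Counter's memory use is the main constraint.
from itertools import combinations
from collections import Counter

import pandas as pd

df = pd.read_csv("DrawsDB.csv")

counter = Counter()
# itertuples is much faster than iterrows; take only the 20 number columns
for row in df.loc[:, 'N1':'N20'].itertuples(index=False, name=None):
    counter.update(combinations(row, 5))  # rows are sorted, so each tuple is canonical

# keep only combos seen in more than one row, most frequent first
repeats = {c: n for c, n in counter.items() if n > 1}
for combo, n in sorted(repeats.items(), key=lambda kv: kv[1], reverse=True):
    print(combo, n)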

Why is checking for a variable's existence taking more time than copying an array, which should be an O(1) vs O(n) operation?

These numbers don't make sense to me.
Why does checking a list's truthiness or checking len() of a list take longer than a copy()?
They are O(1) vs O(n) operations.
Total time: 3.01392 s
File: all_combinations.py
Function: recurse at line 15

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           #profile
    16                                           def recurse(in_arr, result=[]):
    17                                               nonlocal count
    18
    19   1048576     311204.0      0.3     10.3      if not in_arr:
    20    524288     141102.0      0.3      4.7          return
    21
    22    524288     193554.0      0.4      6.4      in_arr = in_arr.copy()  # Note: this adds an O(n) operation
    23
    24   1572863     619102.0      0.4     20.5      for i in range(len(in_arr)):
    25   1048575     541166.0      0.5     18.0          next = result + [in_arr.pop(0)]
    26   1048575     854453.0      0.8     28.4          recurse(in_arr, next)
    27   1048575     353342.0      0.3     11.7          count += 1

Total time: 2.84882 s
File: all_combinations.py
Function: recurse at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           #profile
    39                                           def recurse(result=[], index=0):
    40                                               nonlocal count
    41                                               nonlocal in_arr
    42
    43                                               # base
    44   1048576     374126.0      0.4     13.1      if index > len(in_arr):
    45                                                   return
    46
    47                                               # recur
    48   2097151     846711.0      0.4     29.7      for i in range(index, len(in_arr)):
    49   1048575     454619.0      0.4     16.0          next_result = result + [in_arr[i]]
    50   1048575     838434.0      0.8     29.4          recurse(next_result, i + 1)
    51   1048575     334930.0      0.3     11.8          count = count + 1
It's not that making the copy takes longer by itself than the O(1) operations you mentioned.
But remember that your base case is running far more often than the recursive case.
I'm not sure what "it" refers to in your question. The generic ("royal") "it" does not take longer; your implementation is what takes longer. In most language implementations, len is O(1), because the length is an instance attribute maintained along with any changes to the object. This existence-check implementation is slower because it recurses instead of simply iterating through the list, although it's still O(N).
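To separate per-call cost from call count, it can help to time the three operations in isolation, outside any recursion; a quick sketch with timeit:
import timeit

arr = list(range(1000))

# Truthiness and len() do constant work regardless of list length (O(1));
# copy() has to touch every element (O(n)).
print(timeit.timeit(lambda: not arr, number=100_000))     # O(1) truthiness check
print(timeit.timeit(lambda: len(arr), number=100_000))    # O(1) length lookup
print(timeit.timeit(lambda: arr.copy(), number=100_000))  # O(n) shallow copy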

Filter cells by value and column index

I searched the web and previous stackoverflow questions and couldn't find a good answer, so I would appreciate your help.
I have a pd.DataFrame:
12 13 14 15 ... 141 142
12 0.000000 3.856802 5.442729 7.637788 ... 31.144092 34.277933
13 3.856802 0.000000 3.825300 4.735988 ... 29.098527 32.350149
14 5.442729 3.825300 0.000000 3.817564 ... 25.837813 29.062540
15 7.637788 4.735988 3.817564 0.000000 ... 25.712116 29.186678
16 10.947102 8.529696 6.548704 3.853627 ... 23.226639 26.856628
17 12.473594 10.760961 7.616927 6.705854 ... 20.499088 4.051315
where the indexes continue as well, to about 140.
These are distances, and I want to find cells with values smaller than 6 whose row and column labels are also more than 5 apart.
So the cell [16, 15], even though its value is smaller than 6, doesn't interest me, but [17, 142] does.
I can easily find the close cells using:
neighbors = df < 6
to get an array of True and False, which I can sum:
neighbors.sum(axis=1)
Currently, when I sum this way, I get a series like this:
12 3
13 4
.
.
.
138 5
139 3
140 6
141 4
142 4
143 5
I would like the same output, but with the sum including only the cells that follow the rule (value smaller than 6, and row and column labels more than 5 apart).
Thanks to anyone who can offer ideas!
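One possible approach (a sketch, assuming the row and column labels are numeric or numeric strings) is to build a second boolean mask from the label differences and combine it with the value mask before summing:
import numpy as np

# True where row and column labels are more than 5 apart
rows = df.index.to_numpy(dtype=float)[:, None]
cols = df.columns.to_numpy(dtype=float)[None, :]
far = np.abs(rows - cols) > 5

# combine with the distance criterion and sum per row as before
neighbors = (df < 6) & far
print(neighbors.sum(axis=1))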

Pandas DataFrame: Complex linear interpolation

I have a dataframe with 4 sections:
Section 1: Product details
Section 2: 6 Potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
I will run through a couple of examples of this calculation to make it clear what my question is.
Product A - sim_2
The input here is 1.0, which is equal to the upper bound for this product, so the simulation value is simply the State_6 value: 60.
Product B - sim_2
The input here is 1.5. The LB-to-UB range is (1, 2), so the 6 states are {1, 1.2, 1.4, 1.6, 1.8, 2}. 1.5 is exactly in the middle of state 3, which has a value of 31, and state 4, which has a value of 41. Therefore the simulation value is 36.
Product C - sim_1
The input here is .61. The LB-to-UB range is (.5, .625), so the 6 states are {.5, .525, .55, .575, .6, .625}. .61 falls between states 5 and 6. Specifically, the bucket it falls in is 5*(.61-.5)/(.625-.5)+1 = 5.4 (multiplied by 5 because that is the number of intervals; you can calculate it other ways and get the same result). Then to calculate the value we use that bucket in a weighting of the values for state 5 and state 6: (62-52)*(5.4-5)+52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1, so we need to extrapolate the value. We use the same formula as above, just with the values of state 1 and state 2. The bucket is 5*(0-1)/(2-1)+1 = -4. The two values used are 11 and 21, so the value is (21-11)*(-4-1)+11 = -39.
I've also simplified the problem to try to visualize the solution, my final code needs to run on 500 values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation, although I'm not committed to them specifically:
Bucket = N*(sim_value - LB)/(UB - LB) + 1
where N is the number of intervals.
Then nLower is the state value directly below the bucket, and nHigher is the state value directly above it. If the bucket falls outside the LB/UB range, force nLower and nHigher to be either the first two or the last two state values.
Final_value = (nHigher - nLower)*(Bucket - state_index_of_nLower) + nLower
To summarize, my question is how I can generate the final results from the combination of input data provided. The most challenging part to me is how to make the connection from the bucket number to the nLower and nHigher values.
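As a quick sanity check, the formulas reproduce the Product C, sim_1 example above in a few lines (N = 5 intervals; the state values 52 and 62 are taken from the table):
# Product C, sim_1: LB = 0.5, UB = 0.625, input 0.61
N, LB, UB = 5, 0.5, 0.625
sim_value = 0.61

bucket = N * (sim_value - LB) / (UB - LB) + 1        # ~5.4
nLower, nHigher = 52, 62                             # State_5_Value, State_6_Value
final = (nHigher - nLower) * (bucket - 5) + nLower   # ~56, matching the worked example
print(bucket, final)                                 # up to float rounding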
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so I'm still interested in better answers or improvements.
Edit: Ran this code on the full dataset, 141 rows, 500 intervals, 10,000 simulations, and it took slightly over 1.5 hours. So not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    # .ix is gone in modern pandas; .loc does the same job here
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
Output:
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2 \
0 40 50 60 1.000 0.00 1.0
1 41 51 61 2.000 0.00 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.00 9.0
Bucket1 lv hv nLower nHigher Final_value_1 Bucket2 Final_value_2
0 3.5 5 6 50 60 35.0 6.0 60.0
1 -4.0 3 4 31 41 -39.0 3.5 36.0
2 5.4 5 6 52 62 56.0 9.0 92.0
3 2.0 3 4 33 43 23.0 3.0 33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})

# buckets for both sim columns at once (.ix replaced with .iloc for modern pandas)
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'], axis=0), axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
# 'Product Type' occupies column 0 of the filtered frame, so State_k sits at index k
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:, None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:, None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
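For comparison, the same per-row numbers can be reproduced with scipy.interpolate.interp1d, which handles the out-of-range extrapolation directly; a sketch that loops over rows, so likely slower than the vectorized version above on large frames:
import numpy as np
from scipy.interpolate import interp1d

states = df.filter(regex='State').to_numpy()  # shape (n_rows, 6)
for lb, ub, ys, x in zip(df['Lower_Bound'], df['Upper_Bound'], states, df['sim_1']):
    f = interp1d(np.linspace(lb, ub, 6), ys, fill_value='extrapolate')
    print(float(f(x)))  # e.g. Product B: sim_1 = 0 gives -39.0, matching the worked example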
