Using NumPy argmax to count vs for loop - python

I currently use something like the following bit of code to determine a comparison count:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10
counter = 1
for number in list_of_numbers:
    if (high >= number) \
            and (counter < lookback):
        counter += 1
    else:
        break
The resulting counter value will be 7. However, this is very taxing on large data arrays, so I looked for a solution and came up with np.argmax(), but there seems to be an issue. For example, the following:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
np_list = np.array(list_of_numbers)
high = 29980.0
print(np.argmax(np_list > high) + 1)
this will output 1, just like argmax is supposed to, but I want it to output 7. Is there another method to do this that will give me the same result as the if statement?

You can get a boolean array for where high >= number using NumPy:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10
boolean_arr = np.less_equal(np.array(list_of_numbers), high)
Then find the first False entry in that array, which corresponds to the break condition in your code. To account for the counter, you can apply np.cumsum to the boolean array and find the first position that reaches the specified lookback. The result is then the smaller of break_arr and lookback_lim:
break_arr = np.where(boolean_arr == False)[0][0] + 1
lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
result = min(break_arr, lookback_lim)
If your list_of_numbers contains no value bigger than your specified high limit (for break_arr), or the specified lookback exceeds every value in np.cumsum(boolean_arr) (for lookback_lim), the code above will fail with an error like the following from np.where:
IndexError: index 0 is out of bounds for axis 0 with size 0
This can be handled with try/except or with if statements, e.g.:
try:
    break_arr = np.where(boolean_arr == False)[0][0] + 1
except IndexError:
    break_arr = len(boolean_arr) + 1

try:
    lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
except IndexError:
    lookback_lim = len(boolean_arr) + 1
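Putting the pieces together, a minimal end-to-end sketch of this approach on the sample data (using explicit emptiness checks instead of try/except; the expected output is noted in the comment):
import numpy as np

list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10

boolean_arr = np.less_equal(np.array(list_of_numbers), high)

# First position that violates high >= number (the break condition), or one past the end.
false_idx = np.flatnonzero(~boolean_arr)
break_arr = false_idx[0] + 1 if false_idx.size else len(boolean_arr) + 1

# First position where the running count of passing values reaches the lookback limit.
hit_idx = np.flatnonzero(np.cumsum(boolean_arr) == lookback)
lookback_lim = hit_idx[0] + 1 if hit_idx.size else len(boolean_arr) + 1

print(min(break_arr, lookback_lim))  # 7, matching the original loop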

You have your less-than sign backwards, no? The following should work like the for loop:
print(np.min([np.sum(np.array(list_of_numbers) < high) + 1, lookback]))
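For completeness: the reason the np.argmax call in the question printed 1 is that argmax over an all-False mask returns 0. A guarded sketch along the same lines (my own variant, not from either answer):
import numpy as np

list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
np_list = np.array(list_of_numbers)
high = 29980.0
lookback = 10

mask = np_list > high                                               # True where the loop would break
first_break = np.argmax(mask) + 1 if mask.any() else len(np_list) + 1
print(min(first_break, lookback))                                   # 7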

In pandas, a look-back can be accomplished using shift, a cumcount can be used to get a running total, and a query can be used as a filter.

Related

Code is executing a process that should have been denied in the initial statement and I don't know why

So I have a data set that contains data similar to this: (left: date, middle: value, right: time difference between dates).
I am developing code that will scan this data set: if the first value in the right column is bigger than 1 and the successive ones are less than 1, it should get the max value (middle), tell me the date it happened, and put those in a new list. So in the above example, it should check the first 5 rows, find the max value to be 13.15, tell me the date it happened, and store it in a new list. However, my code is not doing this; in fact it sometimes produces duplicates, and I'm having trouble finding out why. Code is below:
list_final_multiple = []
for i in range(0, len(file_dates_list)):  # gets all of the rest of the data
    n = 1
    if (file_gap_list[i] > 1 or i == 0) and file_gap_list[i+n] <= 1:
        while ((i + n) < len(file_dates_list)) and (file_gap_list[i + n] <= 1):
            n = n + 1
        max_value = (max(file_hs_list[i:i + n]))
        max_value_location = file_hs_list.index(max_value)
        list_final_multiple.append([file_dates_list[max_value_location], file_hs_list[max_value_location]])
Any help would be appreciated.
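One likely source of the duplicates is that file_hs_list.index(max_value) searches the whole list rather than just the current group, so a repeated maximum can resolve to an earlier date. A rough sketch of that fix, keeping the question's structure and substituting hypothetical sample data for the original table:
# Hypothetical sample data standing in for the question's (date, value, gap) columns.
file_dates_list = ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-10", "2020-01-11"]
file_hs_list    = [10.0, 13.15, 12.0, 9.0, 9.5]
file_gap_list   = [5, 1, 1, 7, 1]

list_final_multiple = []
for i in range(len(file_dates_list)):
    n = 1
    # Guard the look-ahead so the last row does not raise an IndexError.
    if (file_gap_list[i] > 1 or i == 0) and (i + n) < len(file_gap_list) and file_gap_list[i + n] <= 1:
        while (i + n) < len(file_dates_list) and file_gap_list[i + n] <= 1:
            n = n + 1
        window = file_hs_list[i:i + n]
        # Look up the maximum inside the current group only, not in the whole list.
        max_value_location = i + window.index(max(window))
        list_final_multiple.append([file_dates_list[max_value_location],
                                    file_hs_list[max_value_location]])

print(list_final_multiple)  # [['2020-01-02', 13.15], ['2020-01-11', 9.5]] with this sample data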

List of fixed length that sums to a number but minimizes standard deviation

I'm not sure if I am even asking this question the right way, but here goes:
Say I want to create a python list with 20 non-zero integer elements and those elements must sum to 87.
How can I go about this to ensure that the chosen integers minimize the standard deviation of the list as a whole (not sure this is the right metric)?
The following code example works, but I'm thinking there must be a better way to do this
import pandas as pd
import numpy as np

target = 87
target_length = 20
starter_series = pd.Series([1 for val in range(target_length)])
while True:
    current_sum = starter_series.sum()
    if current_sum == target:
        break
    if target - current_sum > 20:
        starter_series += 1
        continue
    else:
        to_be_added = target - current_sum
        index_points = np.random.choice(starter_series.index.to_list(), to_be_added, replace=False)
        starter_series.loc[index_points] += 1
This simple code should work:
n = 20
s = 87
q,r = divmod(s,n)
l = [q+1]*r + [q]*(n-r)
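As a quick check of that construction (expected output in the comments):
n = 20
s = 87
q, r = divmod(s, n)              # q = 4, r = 7
l = [q + 1] * r + [q] * (n - r)  # seven 5s followed by thirteen 4s
print(len(l), sum(l))            # 20 87
print(max(l) - min(l))           # 1, the values differ by at most one
Spreading the remainder one unit at a time keeps the entries within 1 of each other, which is as even as an integer split of the sum can get.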

Speeding up a numpy correlation program using the fact that lists are sorted

I am currently using Python and NumPy to calculate correlations between 2 lists: data_0 and data_1. Each list contains, respectively, sorted times t0 and t1.
I want to calculate all the events where 0 < t1 - t0 < t_max.
for time_0 in np.nditer(data_0):
    delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
    delta_time = delta_time[delta_time >= 0]
    delta_time = delta_time[delta_time < time_max]
Doing so, as the lists are sorted, I am selecting a subarray of data_1 of the form data_1[index_min: index_max].
So in fact I need to find two indexes to get what I want.
And what's interesting is that when I go to the next time_0, as data_0 is also sorted, I just need to find the new index_min / index_max such that new_index_min >= index_min and new_index_max >= index_max.
Meaning that I don't need to scan all of data_1 again from scratch.
I have implemented such a solution without the NumPy methods (just with while loops), and it gives me the same results as before, but it is not as fast as before (15 times longer!).
Since it should normally require less calculation, I think there should be a way to make it faster using NumPy methods, but I don't know how to do it.
Does anyone have an idea?
I am not sure if I am being super clear, so if you have any questions, do not hesitate.
Thank you in advance,
Paul
Here is a vectorized approach using argsort. It uses a strategy similar to your avoid-full-scan idea:
import numpy as np

def find_gt(ref, data, incl=True):
    out = np.empty(len(ref) + len(data) + 1, int)
    total = (data, ref) if incl else (ref, data)
    out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
    out[0] = -1
    split = (out < len(data)) if incl else (out >= len(ref))
    if incl:
        out[~split] -= len(data)
    split[0] = False
    return np.maximum.accumulate(np.where(split, -1, out))[split] + 1

def find_intervals(ref, data, span, incl=(True, True)):
    index_min = find_gt(ref, data, incl[0])
    index_max = len(ref) - find_gt(-ref[::-1], -span-data[::-1], incl[1])[::-1]
    return index_min, index_max

ref = np.sort(np.random.randint(0, 20000, (10000,)))
data = np.sort(np.random.randint(0, 20000, (10000,)))
span = 2
idmn, idmx = find_intervals(ref, data, span, (True, True))

print('checking')
for d, mn, mx in zip(data, idmn, idmx):
    assert mn == len(ref) or ref[mn] >= d
    assert mn == 0 or ref[mn-1] < d
    assert mx == len(ref) or ref[mx] > d+span
    assert mx == 0 or ref[mx-1] <= d+span
print('ok')
It works by:
- indirectly sorting both sets together,
- finding, for each time in one set, the preceding time in the other (this is done using np.maximum.accumulate),
- applying the preceding steps twice; the second time, the times in one set are shifted by span.
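As an aside, since both arrays are already sorted, np.searchsorted can also produce the two indexes for every time_0 in one vectorized call each; a minimal sketch under the question's coded condition 0 <= t1 - t0 < time_max, using made-up data:
import numpy as np

data_0 = np.sort(np.random.randint(0, 20000, 10000))   # sorted t0 times
data_1 = np.sort(np.random.randint(0, 20000, 10000))   # sorted t1 times
time_max = 2

# For each t0: first index where t1 >= t0, and first index where t1 >= t0 + time_max.
index_min = np.searchsorted(data_1, data_0, side='left')
index_max = np.searchsorted(data_1, data_0 + time_max, side='left')

counts = index_max - index_min   # number of qualifying t1 values per t0
print(counts.sum())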

Python code not working as intended

I started learning Python < 2 weeks ago.
I'm trying to make a function to compute a 7 day moving average for data. Something wasn't going right so I tried it without the function.
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i, i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
            if missing == 7:
                moving_average = np.append(moving_average, np.nan)
                break
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    if j == (i+6):
        avg_7 = sum(sum_7)/total
        moving_average = np.append(moving_average, avg_7)
If I run this and look at the value of sum_7, it's just a single value in the numpy array, which makes all the moving_average values wrong. But if I remove the first for loop with the variable i, manually set i = 0 (or any number in the range of the data set), and run the exact same code from the inner for loop, sum_7 comes out as a length-7 numpy array. Originally I just did sum += temp[j], but the same problem occurred: the total sum ended up as just a single value.
I've been staring at this trying to fix it for 3 hours and I'm clueless about what's wrong. Originally I wrote the function in R, so all I had to do was convert it to Python, and I don't know why sum_7 comes out as a single value when there are two for loops. I tried to manually add an index variable to act as i and use it in range(i, i+7), but got some weird error instead; I also don't know why that is.
https://gyazo.com/d900d1d7917074f336567b971c8a5cee
https://gyazo.com/132733df8bbdaf2847944d1be02e57d2
Hey, you can use the rolling() and mean() functions from pandas.
Link to the documentation :
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html
df['moving_avg'] = df['your_column'].rolling(7).mean()
This will give you some NaN values too, but that is part of a rolling mean, because you don't have all 7 past data points for the first 6 values.
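If you also want to mirror the question's handling of missing values (averaging over whatever non-null values are present in each window instead of returning NaN), min_periods should cover that; a small sketch with made-up data:
import numpy as np
import pandas as pd

temp = pd.Series([3.0, np.nan, 5.0, 4.0, np.nan, 6.0, 7.0, 8.0, 2.0])

# Default min_periods equals the window size, so any NaN in a 7-value window yields NaN.
# With min_periods=1, the mean is taken over whatever non-NaN values the window contains.
moving_avg = temp.rolling(7, min_periods=1).mean()
print(moving_avg)
Note that with min_periods=1 the first 6 positions get partial-window averages rather than NaN, which differs slightly from the loop in the question.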
Seems like you misindented the important line:
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i, i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
            if missing == 7:
                moving_average = np.append(moving_average, np.nan)
                break
    # The following condition should be indented one more level
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    if j == (i+6):
        # this ^ condition does not do what you meant
        # you should use a flag instead
        avg_7 = sum(sum_7)/total
        moving_average = np.append(moving_average, avg_7)
Instead of a flag you can use a for-else construct, but this is not readable. Here's the relevant documentation.
Shorter way to do this:
moving_average = np.array([])
for i in range(len(temp)-6):
    ngram_7 = [t for t in temp[i:i+7] if not pd.isnull(t)]
    average = (sum(ngram_7) / len(ngram_7)) if ngram_7 else np.nan
    moving_average = np.append(moving_average, average)
This could be refactored further:
def average(ngram):
    valid = [t for t in ngram if not pd.isnull(t)]
    if not valid:
        return np.nan
    return sum(valid) / len(valid)

def ngrams(seq, n):
    # len(seq) - n + 1 windows of length n, matching range(len(temp) - 6) above
    for i in range(len(seq) - n + 1):
        yield seq[i:i+n]

moving_average = [average(k) for k in ngrams(temp, 7)]

I can't get my code to work. It keeps saying: IndexError: list index out of range

My code uses the lengths of lists to try to find the percentage of scores that are over an entered number. It all makes sense to me, but I think some of the code needs editing because it comes up with that error. How can I fix it?
Here is the code:
result = [("bob",7),("jeff",2),("harold",3)]
score = [7,2,3]
lower = []
higher = []
index2 = len(score)
indexy = int(index2)
index1 = 0
chosen = int(input("the number of marks you want the percentage to be displayed higher than:"))
for counter in score[indexy]:
    if score[index1] >= chosen:
        higher.append(score[index1])
    else:
        lower.append(score[index1])
    index1 = index1 + 1
original = indexy
new = len(higher)
decrease = int(original) - int(new)
finished1 = decrease/original
finished = finished1 * 100
finishedlow = original - finished
print(finished,"% of the students got over",chosen,"marks")
print(finishedlow,"% of the students got under",chosen,"marks")
Just notice one thing:
>>> score = [7, 2, 3]
>>> len(score)
3
but list indexing starts counting from 0, so
>>> score[3]
IndexError: list index out of range
fix your row 9 (the for loop) to:
...
for counter in score:
    if counter >= chosen:
        ...
If you really want to get the indices and use them:
....
for index, number in enumerate(score):
    if score[index] >= chosen:
        ......
Your mistake is in Line 9: for counter in score[indexy]:
counter should iterate over a list, not over an int, and on top of that you are referring to an index that is out of range for your list:
1 - Remember that indexing runs from 0 to len(list)-1.
2 - You cannot iterate over a fixed int value.
So, you should change Line 9 to:
for counter in score:
But I'm not sure of the result you will get from your code; you need to check your code's logic.
There is a lot to optimize in your code.
index2 is an int, so there is no need to convert it to indexy. Indices in Python are counted from 0, so the highest index is len(list)-1.
You already have counter in the for loop, so why use index1? And you cannot iterate over a single number, score[indexy].
results = [("bob",7),("jeff",2),("harold",3)]
chosen = int(input("the number of marks you want the percentage to be displayed higher than:"))
higher = sum(score >= chosen for name, score in results)
finished = higher / len(results)
finishedlow = 1 - finished
print("{0:.0%} of the students got over {1} marks".format(finished, chosen))
print("{0:.0%} of the students got under {1} marks".format(finishedlow, chosen))
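For the question's sample data, entering 3 at the prompt works out as follows (values shown in the comments):
results = [("bob", 7), ("jeff", 2), ("harold", 3)]
chosen = 3
higher = sum(score >= chosen for name, score in results)   # 2 (bob and harold)
finished = higher / len(results)                           # 0.666...
finishedlow = 1 - finished                                 # 0.333...
print("{0:.0%} of the students got over {1} marks".format(finished, chosen))      # 67% of the students got over 3 marks
print("{0:.0%} of the students got under {1} marks".format(finishedlow, chosen))  # 33% of the students got under 3 marks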
