customize step in loop through pandas - python

I know this question has been asked a few times, but I couldn't understand the answers or apply them to my case.
I'm trying to iterate over a dataframe. For each row, if column A is 1, add one to the counter; if it is 0, don't count the row in the counter (but don't skip it either).
When the counter reaches 10, take all the rows seen so far, put them in an array, and restart the counter. After searching a bit, it seems that generators could do the trick, but I have a bit of trouble with them. So far I have something like this, thanks to the help of the SO community:
data = pd.DataFrame(np.random.randint(0,50,size=(50, 4)), columns=list('ABCD'))
data['C'] = np.random.randint(2, size=50)
data
counter = 0
chunk = 10
arrays = []
for x in range(0, len(data), chunk):
    array = data.iloc[x: x+chunk]
    arrays.append(array)
    print(array)
The idea looks something like this:
while counter <= 10:
    if data['A'] == 1:
        counter += 1
        yield counter
    if counter > 10:
        counter = 0
But I don't know how to combine this pseudo code with my current for loop.

With pandas, it's best to avoid explicit for loops where possible. For your question, we can use groupby:
arrays=[frame for _,frame in data.groupby(data.A.eq(1).cumsum().sub(1)//10)]
Explanation:
data.A.eq(1).cumsum() is a running count of the rows where A is 1: it goes up by one on each such row and keeps the previous row's value when A is 0. Subtracting 1 and then floor-dividing by 10 (//) turns that count into a group number, so counts 1 to 10 fall in group 0, counts 11 to 20 in group 1, and so on. As a reminder of how // works, 10//10 returns 1 and 20//10 returns 2.
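To make the grouping key concrete, here is a minimal sketch on a small made-up frame. The column values and the chunk size of 3 (instead of 10) are assumptions chosen only to keep the output readable; the groupby expression is the one from the answer.

```python
import pandas as pd

# Made-up data: column A holds 0/1 flags, B is filler
data = pd.DataFrame({"A": [1, 0, 1, 1, 0, 1, 1, 0, 1], "B": range(9)})
chunk = 3  # the question uses 10; 3 keeps the example short

# Running count of rows where A == 1 (rows with 0 keep the previous count)
ones_seen = data["A"].eq(1).cumsum()

# Counts 1..3 -> group 0, counts 4..6 -> group 1, ...
group_id = ones_seen.sub(1) // chunk

# One sub-frame per group; rows with A == 0 stay in the chunk they fall into
arrays = [frame for _, frame in data.groupby(group_id)]

for frame in arrays:
    print(frame, end="\n\n")
```

One edge case worth knowing: rows with A == 0 that come before the first 1 have a running count of 0, so their group number is -1 and they end up in a small chunk of their own.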

Related

Code is executing process that should have been denied in the initial statement and i dont know why

So I have a data set that contains data similar to: (Left: date, Middle: value, Right: time difference between dates).
I am developing code that scans this data set: if the first value in the right column is bigger than 1 and the successive ones are less than 1, it should get the max value (middle column), tell me the date it happened, and put both in a new list. So in the above example, it should check the first 5 rows, find the max value to be 13.15, tell me the date it happened, and store it in a new list. However, my code is not doing this; in fact, it sometimes produces duplicates and I'm having trouble finding out why. The code is below:
list_final_multiple = []
for i in range(0,len(file_dates_list)): #gets all of the rest of the data
    n = 1
    if (file_gap_list[i] > 1 or i == 0) and file_gap_list[i+n] <= 1:
        while ((i + n) < len(file_dates_list)) and (file_gap_list[i + n] <= 1):
            n = n + 1
        max_value = (max(file_hs_list[i:i + n]))
        max_value_location = file_hs_list.index(max_value)
        list_final_multiple.append([file_dates_list[max_value_location], file_hs_list[max_value_location]])
Any help would be appreciated.
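One likely source of the duplicates, offered as a sketch rather than a confirmed diagnosis: list.index() returns the first position of a value in the whole list, so if the same maximum also occurs in an earlier run, the earlier date is reported again. Separately, file_gap_list[i+n] can read past the end of the list on the last row. The version below searches only inside the current run and checks the bounds first. The three sample lists are made up for illustration.

```python
# Made-up stand-ins for the three lists in the question
file_dates_list = ["d0", "d1", "d2", "d3", "d4", "d5", "d6"]
file_hs_list = [10.0, 13.15, 9.0, 13.15, 8.0, 13.15, 7.0]
file_gap_list = [5.0, 0.5, 0.5, 3.0, 0.2, 4.0, 2.0]

list_final_multiple = []
size = len(file_dates_list)
for i in range(size):
    starts_run = file_gap_list[i] > 1 or i == 0
    # Bounds check first, so index i + 1 is never read past the end
    if starts_run and i + 1 < size and file_gap_list[i + 1] <= 1:
        n = 1
        while i + n < size and file_gap_list[i + n] <= 1:
            n += 1
        window = file_hs_list[i:i + n]
        # Search inside the window only, then shift back by i,
        # instead of taking the first match in the whole list
        max_value_location = i + window.index(max(window))
        list_final_multiple.append(
            [file_dates_list[max_value_location], file_hs_list[max_value_location]]
        )

print(list_final_multiple)
```

With these sample lists, the second run reports its own date ("d3") even though the same maximum value already appeared in the first run.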

How to keep track of various types of streaks

I am writing a script that watches an online coin flip game, and keeps a tally of the results. I would like to find a simpler way of finding out how many times the streak ended after three of the same results, four of the same result etc.
if result == heads:
    headsCount += 1
    headsStreak += 1
    tailsCount = 0
    tailsStreak = 0
headsCount is the total amount of heads results witnessed in a session, and the streak is just so I can display how many heads have appeared in a row. This is updated by:
if headsCount >= headsStreak:
    headsStreak = headsCount
My problem - I wish to keep track of how many times the streak ends at one, ends at two, ends at three etc...
A silly way I have for now:
if headsStreak == 1:
    oneHeadsStreak += 1
if headsStreak == 2:
    twoHeadsStreak += 1
But it is very tedious. So is there an easier way to create the variables... for example:
for i in range (1, 20):
    (i)streak = 0
and then something like
for i in range (1, 20):
    if headsStreak == i:
        (i)streak += 1
Thank you in advance!
You could use a list to keep track of the streak counter. You will have to think about which index is which streak length (e.g. index 0 is for streak length 1, index 1 for length 2 etc.).
Initialize all list elements to zero:
l = [0 for i in range(20)]
Then, whenever a streak ends, increment the list element at the corresponding index: l[3] += 1 for a 4-streak.
Using a defaultdict, you just need three variables:
from collections import defaultdict
current_streak = 0
current_side = "heads"
streak_changes = defaultdict(int)
Then store the values in a dictionary when the streak changes:
if side == current_side:
    current_streak += 1
else:
    current_side = side
    streak_changes[current_streak] += 1
    current_streak = 1
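Put together, the defaultdict approach might look like the sketch below. The function name count_streak_endings and the sample flips list are made up for illustration. Two details are assumptions added here, not part of the answer: the empty streak before the first flip is not recorded, and the streak still running when the data ends is counted too.

```python
from collections import defaultdict

def count_streak_endings(flips):
    """Map streak length -> number of streaks of exactly that length."""
    streak_changes = defaultdict(int)
    current_side = None   # no flip seen yet
    current_streak = 0
    for side in flips:
        if side == current_side:
            current_streak += 1
        else:
            if current_streak:  # skip the empty streak before the first flip
                streak_changes[current_streak] += 1
            current_side = side
            current_streak = 1
    if current_streak:          # count the streak still running at the end
        streak_changes[current_streak] += 1
    return dict(streak_changes)

flips = ["heads", "heads", "tails", "heads", "heads", "heads", "tails", "tails"]
result = count_streak_endings(flips)
print(result)
```

Like the answer, this counts heads and tails streaks together. To track them separately, the dictionary key could be a (side, length) tuple instead. If an unfinished streak should not count as ended, drop the final if block.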

how to refer to the loop value in the loop itself in python panda?

I am trying to write a loop that repeats the following instructions. The loop should cover nivel1, nivel2, nivel3 and nivel4. Is there a smart way to do this? So far I have tried:
for x in range(2, 5):
    n_index = len(VaR_nivel1.index)
    n_columns = len(VaR_nivel1.columns)
    VaR_profit_nivel1=pd.DataFrame(np.random.rand(n_index ,n_columns ))
    VaR_profit_nivel1.columns = VaR_nivel1.columns
    zero_one_nivel1= pd.DataFrame(np.zeros ((n_index, n_columns)))
    columna=0
    indices=0
    while indices<n_index:
        while columna< n_columns:
            VaR_profit_nivel1.iloc[indices,columna]=VaR_nivel1.iloc[indices,columna] + profit_nivel1.iloc[indices,columna]
            if VaR_profit_nivel1.iloc[indices,columna] <0:
                zero_one_nivel1.iloc[indices,columna]=1
            columna += 1
        indices += 1
And then I have to change nivel1 to something like nivelx...
Thank you.
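One common way to handle numbered variable names like these is to keep the frames in dictionaries keyed by level name and loop over the keys. The sketch below assumes that layout; the dictionaries VaR, profit, VaR_profit and zero_one and the random sample data are hypothetical stand-ins for the question's VaR_nivelX and profit_nivelX frames. It also replaces the nested while loops with whole-frame arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for VaR_nivel1..4 and profit_nivel1..4
rng = np.random.default_rng(0)
levels = [f"nivel{k}" for k in range(1, 5)]
VaR = {name: pd.DataFrame(rng.normal(size=(3, 2)), columns=["a", "b"])
       for name in levels}
profit = {name: pd.DataFrame(rng.normal(size=(3, 2)), columns=["a", "b"])
          for name in levels}

VaR_profit = {}
zero_one = {}
for name in levels:
    # Element-wise sum of the two frames for this level
    VaR_profit[name] = VaR[name] + profit[name]
    # 1 where the sum is negative, 0 elsewhere
    zero_one[name] = (VaR_profit[name] < 0).astype(int)

print(zero_one["nivel1"])
```

Unlike the loop in the question, zero_one here keeps the column names of the input frames and holds integers rather than floats.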

Python code not working as intended

I started learning Python < 2 weeks ago.
I'm trying to make a function to compute a 7 day moving average for data. Something wasn't going right so I tried it without the function.
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i,i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
        if missing == 7:
            moving_average = np.append(moving_average, np.nan)
            break
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    if j == (i+6):
        avg_7 = sum(sum_7)/total
        moving_average = np.append(moving_average, avg_7)
If I run this and look at the value of sum_7, it's just a single value in the numpy array, which makes all the moving_average values wrong. But if I remove the outer for loop over i, manually set i = 0 (or any number in the range of the data set), and run the exact same code from the inner for loop, sum_7 comes out as a length-7 numpy array. Originally, I just did sum += temp[j], but the same problem occurred: the total sum ended up as just the single value.
I've been staring at this trying to fix it for 3 hours and I have no idea what's wrong. I originally wrote the function in R, so all I had to do was convert it to Python, and I don't know why sum_7 comes out as a single value when there are two for loops. I also tried manually adding an index variable to act as i in range(i, i+7), but got some weird error instead, and I don't know why that is either.
https://gyazo.com/d900d1d7917074f336567b971c8a5cee
https://gyazo.com/132733df8bbdaf2847944d1be02e57d2
Hey, you can use the rolling() and mean() functions from pandas.
Link to the documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html
df['moving_avg'] = df['your_column'].rolling(7).mean()
This also gives you some NaN values, but that is expected for a rolling mean: the first 6 rows don't have a full window of 7 past data points.
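Note that with the default settings, rolling(7).mean() also returns NaN for any window that contains a missing value, whereas the loop in the question averages over the values that are present. A minimal sketch of one way to get closer to that behaviour, using the min_periods parameter of rolling(); the sample series is made up and stands in for the question's temp:

```python
import numpy as np
import pandas as pd

# Made-up series standing in for the question's `temp`
temp = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, np.nan])

# min_periods=1: a window needs only one non-missing value to produce a mean.
# Missing values are skipped; an all-missing window still yields NaN.
rolled = temp.rolling(window=7, min_periods=1).mean()

# Drop the first 6 positions, whose windows hold fewer than 7 rows,
# so one value remains per full 7-row window.
moving_average = rolled.iloc[6:].to_numpy()
print(moving_average)
```

The result has len(temp) - 6 entries, the same as the loop in the question.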
Seems like you misindented the important line:
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i,i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
        if missing == 7:
            moving_average = np.append(moving_average, np.nan)
            break
    # The following condition should be indented one more level
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    if j == (i+6):
    # this ^ condition does not do what you meant
    # you should use a flag instead
        avg_7 = sum(sum_7)/total
        moving_average = np.append(moving_average, avg_7)
Instead of a flag you can use a for-else construct, but this is not readable. Here's the relevant documentation.
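For reference, a for-else version could look like the sketch below. The else branch of a for loop runs only when the loop finishes without hitting break. The function name moving_average_7 and the sample list are made up for illustration, and the running total is kept as a plain number instead of a numpy array.

```python
import numpy as np
import pandas as pd

def moving_average_7(temp):
    result = []
    for i in range(len(temp) - 6):
        total = 0.0
        missing = 0
        for j in range(i, i + 7):
            if pd.isnull(temp[j]):
                missing += 1
                if missing == 7:
                    result.append(np.nan)  # the whole window is missing
                    break
            else:
                total += temp[j]
        else:
            # Reached only when the inner loop ended without break
            result.append(total / (7 - missing))
    return np.array(result)

temp = [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
averages = moving_average_7(temp)
print(averages)
```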
Shorter way to do this:
moving_average = np.array([])
for i in range(len(temp)-6):
    ngram_7 = [t for t in temp[i:i+7] if not pd.isnull(t)]
    average = (sum(ngram_7) / len(ngram_7)) if ngram_7 else np.nan
    moving_average = np.append(moving_average, average)
This could be refactored further:
def average(ngram):
    # Average the non-missing values of one window; NaN if all are missing
    valid = [t for t in ngram if not pd.isnull(t)]
    if not valid:
        return np.nan
    return sum(valid) / len(valid)
def ngrams(seq, n):
    # Yield every window of n consecutive items, including the last one
    for i in range(len(seq) - n + 1):
        yield seq[i:i+n]
moving_average = [average(k) for k in ngrams(temp, 7)]

Array organizing in Python

I have python code below:
ht_24 = []
ht_23 = []
ht_22 = []
...
all_arr = [ht_24, ht_23, ht_22, ht_21, ht_20, ht_19, ht_18, ht_17, ht_16, ht_15, ht_14, ht_13, ht_12, ht_11, ht_10, ht_09, ht_08, ht_07, ht_06, ht_05, ht_04, ht_03, ht_02, ht_01]
i = 0
j = 0
while i < 24:
    while j < 24864:
        all_arr[i].append(read_matrix[j+i])
        j += 24
        print(j)
    i += 1
    print(i)
where read_matrix is an array of shape (24864, 17).
I want to read every 24th line from different starting indexes (0-23) and append them to the corresponding array for each starting index. Please help, this is so hard!
Two things to learn in Python:
ONE: for loops -- when you know ahead of time how many times you're going through the loop. Your while loops above are both this type. Try these instead:
for i in range(24):
    for j in range(0, 24864, 24):
        all_arr[i].append(read_matrix[j+i])
        print(j)
    print(i)
It's better when you let the language handle the index values for you.
TWO: List comprehensions: sort of a for loop inside a list construction. Your entire posted code can turn into a single statement:
all_arr = [[read_matrix[j+i] \
            for j in range(0, 24864, 24) ] \
           for i in range(24) ]
Your question is a little unclear, but I think
list(zip(*zip(*[iter(read_matrix)]*24)))
may be what you're looking for.
list(zip(*zip(*[iter(range(24864))]*24)))[0][:5]
The above just looks at the indices, and the first few elements of the first sublist are
(0, 24, 48, 72, 96)
Can the numpy library do what you want?
import numpy as np
# 24864 row, 17 columns
read_matrix = np.arange(24864*17).reshape(24864,17)
new_matrices = [[] for i in range(24)]
for i in range(24):
    # a has 17 columns
    a = read_matrix[slice(i,None,24)]
    new_matrices[i].append(a)
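The same idea can be written more compactly with numpy's start::step slicing, or as a single reshape. This is a sketch under the question's stated shape of 24864 rows by 17 columns; the arange data is filler, and the names step and stacked are made up here.

```python
import numpy as np

rows, cols, step = 24864, 17, 24
read_matrix = np.arange(rows * cols).reshape(rows, cols)

# For each starting offset i, take rows i, i+24, i+48, ...
all_arr = [read_matrix[i::step] for i in range(step)]

# Equivalent single-array form: stacked[k, i] is row k*24 + i,
# so stacked[:, i] holds the same rows as all_arr[i]
stacked = read_matrix.reshape(rows // step, step, cols)

print(all_arr[0].shape)
```

The reshape form only works because 24864 is an exact multiple of 24; the slicing form has no such requirement.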
