Add (and calculate) rows to dataframe until condition is met - python

I'm attempting to build a dataframe that adds 1 to the prior row in a column until a condition is met. In this case, I want to continue to add rows until column 'AGE' = 100.
import pandas as pd
import numpy as np

RP = {'AGE' : pd.Series([10]),
      'SI' : pd.Series([60])}
RPdata = pd.DataFrame(RP)
i = RPdata.tail(1)['AGE']
RPdata2 = pd.DataFrame()
while [i < 100]:
    RPdata2['AGE'] = i + 1
    RPdata2['SI'] = RPdata.tail(1)['SI']
    RPdata = pd.concat([RPdata, RPdata2], axis = 0)
    break
print RPdata
Results:
   AGE  SI
0   10  60
0   11  60
I understand that the break statement prevents multiple iterations, but the loop appears to be infinite without it.
I'm attempting to achieve:
   AGE  SI
0   10  60
0   11  60
0   12  60
0   13  60
0   14  60
.    .  60
0  100  60
Is there a way to accomplish this with a while loop? Should I pursue a for loop solution instead?

There may be other problems, but you're going to get into an infinite loop with while [i < 100]:, since a non-empty list always evaluates to True. Change that to while (i < 100): (parens optional) and remove your break statement, which is forcing just one iteration.
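To make the fix concrete, here is a minimal sketch of one corrected loop (my own variant, not the asker's exact code): it re-reads the last row on every pass, so the loop condition actually changes and the loop terminates.

```python
import pandas as pd

RPdata = pd.DataFrame({'AGE': [10], 'SI': [60]})

# Keep appending a copy of the last row, with AGE incremented,
# until the newest AGE value reaches 100.
while RPdata['AGE'].iloc[-1] < 100:
    new_row = RPdata.iloc[[-1]].copy()
    new_row['AGE'] += 1
    RPdata = pd.concat([RPdata, new_row])

print(RPdata.tail(3))
```

Note that concat without ignore_index keeps index 0 on every appended row, matching the repeated 0 index in the desired output above.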


Dropping value in a dataframe in a loop

I have a dataframe with sorted values:
import numpy as np
import pandas as pd
sub_run = pd.DataFrame({'Runoff':[45,10,5,26,30,23,35], 'ind':[3, 10, 25,43,53,60,93]})
I would like to start from the highest value in Runoff (45), drop all rows for which the difference in "ind" is less than 30 (10, 5), re-update the DataFrame, then go to the second highest value (35) and drop the rows for which the difference in "ind" is < 30, then the third highest value (30) and drop 26 and 23...
I wrote the following code:
pre_ind = []
for (idx1, row1) in sub_run.iterrows():
    var = row1.ind
    pre_ind.append(np.array(var))
    for (idx2, row2) in sub_run.iterrows():
        if (row2.ind != var) and (row2.ind not in pre_ind):
            test = abs(row2.ind - var)
            print("test", test)
            if test <= 30:
                sub_run = sub_run.drop(sub_run[sub_run.ind == row2.ind].index)
I expect the output to contain the values [45, 35, 30], but I only find the first one.
Many thanks
Try this:
list_pre_max = []
while True:
    try:
        max_val = sub_run.Runoff.sort_values(ascending=False).iloc[len(list_pre_max)]
    except IndexError:
        break
    max_ind = sub_run.loc[sub_run['Runoff'] == max_val, 'ind'].item()
    list_pre_max.append(max_val)
    dropped_indices = sub_run.loc[(abs(sub_run['ind'] - max_ind) <= 30) &
                                  (sub_run['ind'] != max_ind) &
                                  (~sub_run.Runoff.isin(list_pre_max))].index
    sub_run.drop(index=dropped_indices, inplace=True)
Output:
>>> sub_run
   Runoff  ind
0      45    3
4      30   53
6      35   93
You should never modify something you are iterating over; this is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy, not a view, and writing to it will have no effect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
In your case, the modification of sub_run has no immediate effect on the iteration.
Therefore, in the outer loop, after iterating on (45, 3),
the next row iterated is (35, 93), followed by (30, 53), (26, 43), (23, 60), (10, 10), (5, 25). For the inner loop, your modification works because you re-enter a new loop on each iteration of the outer loop.
Here is my suggested code, inspired by bubble sort.
import pandas as pd

sub_run = pd.DataFrame({'Runoff': [45, 10, 5, 26, 30, 23, 35],
                        'ind': [3, 10, 25, 43, 53, 60, 93]})
sub_run = sub_run.sort_values(by=['Runoff'], ascending=False)
highestRow = 0
while highestRow < len(sub_run) - 1:
    cur_run = sub_run
    highestRunoffInd = cur_run.iloc[highestRow].ind
    for i in range(highestRow + 1, len(cur_run)):
        ind = cur_run.iloc[i].ind
        if abs(ind - highestRunoffInd) <= 30:
            sub_run = sub_run.drop(sub_run[sub_run.ind == ind].index)
    highestRow += 1
print(sub_run)
Output:
   Runoff  ind
0      45    3
6      35   93
4      30   53
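The same pruning rule can also be written as a single greedy pass, which may read more clearly than the positional bookkeeping: sort by Runoff descending and keep a row only if its ind differs by more than 30 from every row already kept (a sketch of an alternative, not the answers' code).

```python
import pandas as pd

sub_run = pd.DataFrame({'Runoff': [45, 10, 5, 26, 30, 23, 35],
                        'ind': [3, 10, 25, 43, 53, 60, 93]})

kept = []  # rows that survive, visited in descending Runoff order
for _, row in sub_run.sort_values('Runoff', ascending=False).iterrows():
    # keep the row only if its 'ind' is more than 30 away from every kept row
    if all(abs(row['ind'] - k['ind']) > 30 for k in kept):
        kept.append(row)

result = pd.DataFrame(kept)
print(result[['Runoff', 'ind']])
```

This yields the rows with Runoff 45, 35, and 30, matching the expected [45, 35, 30].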

Find combinations (without "of size=r") from a set with decreasing sum value using Python

(Revised for clarity 02-08-2021)
This is similar to the question here:
Find combinations of size r from a set with decreasing sum value
This is different from the answer posted in the link above because I am looking for answers without "size r=3".
I have a set (array) of numbers.
I need to have the sums of combinations of the numbers sorted from largest to smallest and show the numbers from the array that were used to get the total for that row.
Any number in the array can only be used once per row but all the numbers don't have to be used in each row as the total decreases.
If a number is not used then zero should be used as a placeholder instead so I can create a CSV file with the columns aligned.
Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]
Desired Output Example #1 format where the last number in each row is the total (sum) of the row:
Beginning of list
30,25,20,15,10,5,1,106
30,25,20,15,10,5,0,105
30,25,20,15,10,0,1,101
30,25,20,15,10,0,0,100
...(all number combinations in between)
30,0,0,0,0,0,0,30
0,25,0,0,0,5,0,30
...(all number combinations in between)
0,0,0,15,0,0,1,16
0,0,0,15,0,0,0,15
0,0,0,0,10,5,0,15
0,0,0,0,10,0,1,11
0,0,0,0,0,5,1,6
0,0,0,0,0,5,0,5
0,0,0,0,0,0,1,1
End of list
Also, duplicate totals are allowed and preferred showing different combinations that have the same total (sum) of the row.
For Example #1:
30,0,0,0,0,0,0,30
0,25,0,0,0,5,0,30
For example this is one row of output based on the Input Example #1 above:
30,25,0,0,0,5,1,61
Last number in the row is the total. The total can also be the first number but the important thing is that the output list is sorted in descending order by the total.
Input Example #2 with 5 numbers in the array: [20,15,10,5,1]
Desired Output Example #2 format where the last number in each row is the total (sum) of the row:
Beginning of list
20,15,10,5,1,51
20,15,10,5,0,50
20,15,10,0,1,46
20,15,10,0,0,45
...(all number combinations in between)
20,0,10,0,0,30
0,15,10,5,0,30
...(all number combinations in between)
0,15,0,0,1,16
0,15,0,0,0,15
0,0,10,5,0,15
0,0,10,0,1,11
0,0,10,0,0,10
0,0,0,5,1,6
0,0,0,5,0,5
0,0,0,0,1,1
End of list
Input Example #1: [30,25,20,15,10,5,1]
Every row of the output should show each number in the array used only once at most per row to get the total for the row.
The rows must be sorted in decreasing order by the sums of the numbers used to get the total.
The first output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 1 = 106
The second output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 0 = 105
The third output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 0 + 1 = 101
...The rest of the rows would continue with the total (sum) of the row getting smaller until it reaches 1...
The third to last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 1 = 6
The second to last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 0 = 5
The last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 0 + 1 = 1
I started with the code provided by user Divyanshu, modified with different input numbers and with () added to the last line (but I need to use all the numbers in the array instead of size=4 as shown here):
import itertools

array = [30,25,20,15,10,5,1]
size = 4
answer = []  # to store all combinations
order = []   # to store order according to sum
number = 0   # index of combination
for comb in itertools.combinations(array, size):
    answer.append(comb)
    order.append([sum(comb), number])  # storing sum and index
    number += 1
order.sort(reverse=True)  # sorting in decreasing order
for key in order:
    print(key[0], answer[key[1]])  # key[0] is sum of combination
So this is what I need as an Input (in this example):
[30,25,20,15,10,5,1]
size=4 in the above code limits the output to 4 of the numbers in the array.
If I take out size=4 I get an error. I need to use the entire array of numbers.
I can manually change size=4 to size=1 and run it then size=2 then run it and so on.
Entering size=1 through size=7 in the code and running it (7 times in this example) to get a list of all possible combinations gives me 7 different outputs.
I could then manually put the lists together but that won't work for larger sets (arrays) of numbers.
Can I modify the code referenced above or do I need to use a different approach?
I think you could do it as follows:
Import the following:
import pandas as pd
import numpy as np
The beginning of the code, as in the question:
import itertools

array = [30,25,20,15,10,5,1]
size = 4
answer = []  # to store all combinations
order = []   # to store order according to sum
number = 0   # index of combination
for comb in itertools.combinations(array, size):
    answer.append(comb)
    order.append([sum(comb), number])  # storing sum and index
    number += 1
order.sort(reverse=True)  # sorting in decreasing order
for key in order:
    print(key[0], answer[key[1]])  # key[0] is sum of combination
The following code, I think, could help obtain the final output:
array_len = len(array)
# Auxiliary dict mapping each value to its position in the original array
dict_array = {}
for i in range(0, array_len):
    print(i)
    dict_array[array[i]] = i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len + 1])
    for i in answer[key[1]]:
        print(i, dict_array[i])
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len] = key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This adds the names to the columns of the dataframe
aux.columns = array + ['total']
aux = aux.astype(int)
print(aux.head().astype(int))
   30  25  20  15  10  5  1  total
0  30  25  20  15   0  0  0     90
1  30  25  20   0  10  0  0     85
2  30  25   0  15  10  0  0     80
3  30  25  20   0   0  5  0     80
4  30  25  20   0   0  0  1     76
Now, the generalization for all sizes:
import itertools
import pandas as pd
import numpy as np

array = [30,25,20,15,10,5,1]
array_len = len(array)
answer = []  # to store all combinations
order = []   # to store order according to sum
number = 0   # index of combination
for size in range(1, array_len + 1):
    print(size)
    for comb in itertools.combinations(array, size):
        answer.append(comb)
        order.append([sum(comb), number])  # storing sum and index
        number += 1
order.sort(reverse=True)  # sorting in decreasing order
for key in order:
    print(key[0], answer[key[1]])  # key[0] is sum of combination
# Auxiliary dict mapping each value to its position in the original array
dict_array = {}
for i in range(0, array_len):
    print(i)
    dict_array[array[i]] = i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len + 1])
    for i in answer[key[1]]:
        print(i, dict_array[i])
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len] = key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This adds the names to the columns of the dataframe
aux.columns = array + ['total']
aux = aux.astype(int)
print(aux.head().astype(int))
   30  25  20  15  10  5  1  total
0  30  25  20  15  10  5  1    106
1  30  25  20  15  10  5  0    105
2  30  25  20  15  10  0  1    101
3  30  25  20  15  10  0  0    100
4  30  25  20  15   0  5  1     96
Thanks to @RafaelValero (Rafael Valero) I was able to learn about pandas, numpy, and dataframes. I looked up pandas options to get the desired output.
Here is the final code with some extra lines left in for reference but commented out:
import itertools
import pandas as pd
import numpy as np

array = [30,25,20,15,10,5,1]
array_len = len(array)
answer = []  # to store all combinations
order = []   # to store order according to sum
number = 0   # index of combination
for size in range(1, array_len + 1):
    # Commented out line below as it was giving extra information
    # print(size)
    for comb in itertools.combinations(array, size):
        answer.append(comb)
        order.append([sum(comb), number])  # storing sum and index
        number += 1
order.sort(reverse=True)  # sorting in decreasing order
# Commented out two lines below as they were from the original code and gave extra information
# for key in order:
#     print(key[0], answer[key[1]])  # key[0] is sum of combination
# Auxiliary dict mapping each value to its position in the original array
dict_array = {}
for i in range(0, array_len):
    # Commented out line below as it was giving extra information
    # print(i)
    dict_array[array[i]] = i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len + 1])
    for i in answer[key[1]]:
        # Commented out line below as it was giving extra information
        # print(i, dict_array[i])
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len] = key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# Update: removed the line below as I didn't need a header
# aux.columns = array + ['total']
aux = aux.astype(int)
# Tried the option below first, but it was not necessary when using to_csv
# pd.set_option('display.max_rows', None)
print(aux.to_csv(index=False, header=None))
Searched references:
Similar question:
Find combinations of size r from a set with decreasing sum value
Pandas references:
https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_csv.html
Online compiler used:
https://www.programiz.com/python-programming/online-compiler/
Output using Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]:
30,25,20,15,10,5,1,106
30,25,20,15,10,5,0,105
30,25,20,15,10,0,1,101
30,25,20,15,10,0,0,100
30,25,20,15,0,5,1,96
30,25,20,15,0,5,0,95
30,25,20,0,10,5,1,91
30,25,20,15,0,0,1,91
30,25,20,0,10,5,0,90
30,25,20,15,0,0,0,90
30,25,0,15,10,5,1,86
30,25,20,0,10,0,1,86
30,25,0,15,10,5,0,85
30,25,20,0,10,0,0,85
30,0,20,15,10,5,1,81
30,25,0,15,10,0,1,81
30,25,20,0,0,5,1,81
30,0,20,15,10,5,0,80
30,25,0,15,10,0,0,80
30,25,20,0,0,5,0,80
0,25,20,15,10,5,1,76
30,0,20,15,10,0,1,76
30,25,0,15,0,5,1,76
30,25,20,0,0,0,1,76
0,25,20,15,10,5,0,75
30,0,20,15,10,0,0,75
30,25,0,15,0,5,0,75
30,25,20,0,0,0,0,75
0,25,20,15,10,0,1,71
30,0,20,15,0,5,1,71
30,25,0,0,10,5,1,71
30,25,0,15,0,0,1,71
0,25,20,15,10,0,0,70
30,0,20,15,0,5,0,70
30,25,0,0,10,5,0,70
30,25,0,15,0,0,0,70
0,25,20,15,0,5,1,66
30,0,20,0,10,5,1,66
30,0,20,15,0,0,1,66
30,25,0,0,10,0,1,66
0,25,20,15,0,5,0,65
30,0,20,0,10,5,0,65
30,0,20,15,0,0,0,65
30,25,0,0,10,0,0,65
0,25,20,0,10,5,1,61
30,0,0,15,10,5,1,61
0,25,20,15,0,0,1,61
30,0,20,0,10,0,1,61
30,25,0,0,0,5,1,61
0,25,20,0,10,5,0,60
30,0,0,15,10,5,0,60
0,25,20,15,0,0,0,60
30,0,20,0,10,0,0,60
30,25,0,0,0,5,0,60
0,25,0,15,10,5,1,56
0,25,20,0,10,0,1,56
30,0,0,15,10,0,1,56
30,0,20,0,0,5,1,56
30,25,0,0,0,0,1,56
0,25,0,15,10,5,0,55
0,25,20,0,10,0,0,55
30,0,0,15,10,0,0,55
30,0,20,0,0,5,0,55
30,25,0,0,0,0,0,55
0,0,20,15,10,5,1,51
0,25,0,15,10,0,1,51
0,25,20,0,0,5,1,51
30,0,0,15,0,5,1,51
30,0,20,0,0,0,1,51
0,0,20,15,10,5,0,50
0,25,0,15,10,0,0,50
0,25,20,0,0,5,0,50
30,0,0,15,0,5,0,50
30,0,20,0,0,0,0,50
0,0,20,15,10,0,1,46
0,25,0,15,0,5,1,46
30,0,0,0,10,5,1,46
0,25,20,0,0,0,1,46
30,0,0,15,0,0,1,46
0,0,20,15,10,0,0,45
0,25,0,15,0,5,0,45
30,0,0,0,10,5,0,45
0,25,20,0,0,0,0,45
30,0,0,15,0,0,0,45
0,0,20,15,0,5,1,41
0,25,0,0,10,5,1,41
0,25,0,15,0,0,1,41
30,0,0,0,10,0,1,41
0,0,20,15,0,5,0,40
0,25,0,0,10,5,0,40
0,25,0,15,0,0,0,40
30,0,0,0,10,0,0,40
0,0,20,0,10,5,1,36
0,0,20,15,0,0,1,36
0,25,0,0,10,0,1,36
30,0,0,0,0,5,1,36
0,0,20,0,10,5,0,35
0,0,20,15,0,0,0,35
0,25,0,0,10,0,0,35
30,0,0,0,0,5,0,35
0,0,0,15,10,5,1,31
0,0,20,0,10,0,1,31
0,25,0,0,0,5,1,31
30,0,0,0,0,0,1,31
0,0,0,15,10,5,0,30
0,0,20,0,10,0,0,30
0,25,0,0,0,5,0,30
30,0,0,0,0,0,0,30
0,0,0,15,10,0,1,26
0,0,20,0,0,5,1,26
0,25,0,0,0,0,1,26
0,0,0,15,10,0,0,25
0,0,20,0,0,5,0,25
0,25,0,0,0,0,0,25
0,0,0,15,0,5,1,21
0,0,20,0,0,0,1,21
0,0,0,15,0,5,0,20
0,0,20,0,0,0,0,20
0,0,0,0,10,5,1,16
0,0,0,15,0,0,1,16
0,0,0,0,10,5,0,15
0,0,0,15,0,0,0,15
0,0,0,0,10,0,1,11
0,0,0,0,10,0,0,10
0,0,0,0,0,5,1,6
0,0,0,0,0,5,0,5
0,0,0,0,0,0,1,1
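As a side note, the full list can also be generated without the zero-placement dictionary: each output row is just a per-position keep-or-zero choice, so iterating over itertools.product([1, 0], ...) and sorting by the trailing total reproduces the same 127 rows for a 7-number array (a compact alternative sketch, not the code used above).

```python
import itertools

array = [30, 25, 20, 15, 10, 5, 1]

rows = []
# Each mask picks, per position, whether the number is kept or zeroed out.
for mask in itertools.product([1, 0], repeat=len(array)):
    if not any(mask):
        continue  # skip the all-zeros row
    row = [a * m for a, m in zip(array, mask)]
    rows.append(row + [sum(row)])

rows.sort(key=lambda r: r[-1], reverse=True)  # descending by total
for r in rows:
    print(','.join(map(str, r)))
```

The sort is stable, so rows with equal totals keep their generation order; 7 numbers give 2^7 - 1 = 127 rows, from total 106 down to 1.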

Add a value in a column as a function of the timestamp and another column

The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics"), and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, there is no value exactly 0.21 seconds after another value, so I'll put the 1 in the outputTics column two rows after: for example, at index 3 there is a 1 at 11.4 seconds, so I put a 1 in the output column at 11.6 seconds.
If there is another 1 in the "inputTics" column within 0.21 seconds, do not put a 1 in the output column: an example would be index 1 in the input column.
Here is an example of the red column I would like to create.
Here is the code to create the dataframe :
A = pd.DataFrame({"Timestamp": [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0,
                                12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13.0],
                  "inputTics": [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
                  "outputTics": [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]})
You can use pd.Timedelta if you want, to avoid Python's floating-point rounding.
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner:
def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    index = df[df == row].dropna().index
    # check for a 1 in input within 0.11 seconds
    t = row['Timestamp'] + pd.Timedelta(seconds=0.11)
    indices = df[df.Timestamp <= t].index
    c = 0
    for i in indices:
        if df.loc[i, 'inputTics'] == 0:
            c = c + 1
        else:
            c = 0
            break
    if c > 0:
        df.loc[indices[-1] + 1, 'outputTics'] = 1
    return 0
Then call the above function using df.apply:
temp = df.apply(set_output_tic, axis=1)  # temp is practically useless
This was actually kind of tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics'] == 1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
    # Compare indices to the full dataframe's timestamps
    # and return the index of the nearest timestamp
    oi = np.argmax((A.index - ii) >= 0)
    output_indices.append(oi)
# Create column of output tics with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to the dataframe
A['outputTics'] = output_tics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values
A[A['outputTics'] < 0] = 0
# The first row becomes 1 because of indexing; change it to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
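Both answers can be condensed further with np.searchsorted, assuming the rule implied by the expected column: drop any tic that is followed by another tic within the window, and place each surviving tic's output on the last timestamp at most 0.21 s later (the 0.21 s window is my reading of the question; the answers above use 0.11 s).

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Timestamp": [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0,
                  12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13.0],
    "inputTics": [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
})

ts = A["Timestamp"].to_numpy()
tic_times = ts[A["inputTics"].to_numpy() == 1]

# Drop tics that are followed by another tic within 0.21 s
gaps = np.diff(tic_times, append=np.inf)
surviving = tic_times[gaps > 0.21]

# Place each output on the last timestamp <= tic time + 0.21 s
targets = np.searchsorted(ts, surviving + 0.21, side="right") - 1

out = np.zeros(len(A), dtype=int)
out[targets] = 1
out[A["inputTics"].to_numpy() == 1] = 0  # never mark an input-tic row itself
A["outputTics"] = out
```

On the example frame this reproduces the outputTics column given in the question (1s at 11.6, 12.0, and 12.5).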

Find specific Row of Data from Pandas Dataframe in While Loop

I am trying to take a csv, and read it as a Pandas Dataframe.
This Dataframe contains 4 columns of numbers.
I want to pick a specific row of data from the Dataframe.
In a While Loop, I want to select a random row from the Dataframe, and compare it to row that I picked.
I want it to continue to run through the while loop until that random row, is 100% equal to the row I picked prior.
Then I want the While Loop to break and I want it to have counted how many tries it took to match the random number.
Here's what I have so far:
This is an example of the Dataframe:
   A  B   C   D
1  2  7  12  14
2  4  5  11  23
3  4  6  14  20
4  4  7  13  50
5  9  6  14  35
Here is an example of my efforts:
import time
import pandas as pd

then = time.time()
count = 0
df = pd.read_csv('Get_Numbers.csv')
df.columns = ['A', 'B', 'C', 'D']
while True:
    df_elements = df.sample(n=1)
    random_row = df_elements
    print(random_row)
    find_this_row = df['A','B','C','D' == '4','7','13,'50']
    print(find_this_row)
    if find_this_row != random_row:
        count += 1
    else:
        break
print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")
The above code gives an obvious error... but I have tried so many different versions now of finding the find_this_row numbers that I just don't know what to do anymore, so I left this attempt in.
What I would like to try to avoid is using the specific index for the row I am trying to find, I would rather use just the values to find this.
I am using df_elements = df.sample(n=1) to select a row at random. This was to avoid using random.choice, as I was not sure whether that would work or which way is more time/memory efficient, but I'm open to advice on that as well.
In my mind it seems simple, randomly select a row of data, if it doesn't match the row of data that I want, keep randomly selecting rows of data until it does match. But I can't seem to execute it.
Any help is EXTREMELY Appreciated!
You can use values, which returns an np.ndarray of shape (1, 2); use values[0] to get just the 1D array.
Then compare the arrays with any().
import time
import pandas as pd

then = time.time()
df = pd.DataFrame(data={'A': [1, 2, 3],
                        'B': [8, 9, 10]})
find_this_row = [2, 9]
print("Looking for: {}".format(find_this_row))
count = 0
while True:
    random_row = df.sample(n=1).values[0]
    print(random_row)
    if any(find_this_row != random_row):
        count += 1
    else:
        break
print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")
How about using values?
values will return you a list of the row's values, and then you can compare two lists easily.
list1 == list2 returns an array of True and False values, comparing the corresponding elements of the two lists. You can then check whether all of the returned values are True.
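A tiny illustration of that elementwise comparison, with made-up values rather than the asker's CSV:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 9], 'B': [7, 5, 6]})
target = df.iloc[1].values       # array([4, 5])

# Elementwise comparison returns an array of booleans;
# .all() collapses it to a single True/False.
print((df.iloc[0].values == target).all())  # False
print((df.iloc[1].values == target).all())  # True
```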
Here's a method that tests one row at a time. We check if the values of the chosen row are equal to the values of the sampled DataFrame. We require that they all match.
row = df.sample(1)
counter = 0
not_a_match = True
while not_a_match:
    not_a_match = ~(df.sample(n=1).values == row.values).all()
    counter += 1
print(f'It took {counter} tries and the numbers were\n{row}')
# It took 9 tries and the numbers were
#    A  B   C   D
# 4  4  7  13  50
If you want it to go a little faster, select one row and then sample the DataFrame with replacement many times. You can then check for the first time the sampled row equals your sampled DataFrame, giving you how many 'tries' it would have taken in a while loop, but in much less time. The loop protects against the unlikely case where no match is found, given that it's sampling with replacement.
row = df.sample(1)
n = 0
none_match = True
k = 10  # Increase to check more matches at once.
while none_match:
    matches = (df.sample(n=len(df) * k, replace=True).values == row.values).all(1)
    none_match = ~matches.any()     # Determine if none still match
    n += k * len(df) * none_match   # Only increment if none match
n = n + matches.argmax() + 1
print(f'It took {n} tries and the numbers were\n{row}')
# It took 3 tries and the numbers were
#    A  B   C   D
# 4  4  7  13  50
A couple of hints first. This line does not work for me:
find_this_row = df['A','B','C','D' == '4','7','13,'50']
for 2 reasons:
a missing " ' " after ,'13
df is a DataFrame, so using keys like the following is not supported:
df['A','B','C','D' ...
Either use keys to return a DataFrame:
df[['A','B','C','D']]
or a Series:
df['A']
Since you need the whole row with multiple columns do this:
df2.iloc[4].values
array(['4', '7', '13', '50'], dtype=object)
Do the same with your sample row:
df2.sample(n=1).values
Comparison between rows needs to be done for all() elements/columns:
df2.sample(n=1).values == df2.iloc[4].values
array([[ True, False, False, False]])
with adding .all() like the following:
(df2.sample(n=1).values == df2.iloc[4].values).all()
which returns
True/False
All together:
import time
import pandas as pd

then = time.time()
count = 0
while True:
    random_row = df2.sample(n=1).values
    find_this_row = df2.iloc[4].values
    if (random_row == find_this_row).all() == False:
        count += 1
    else:
        break
print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")

Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame

I have a Pandas Dataframe of indices and values between 0 and 1, something like this:
6 0.047033
7 0.047650
8 0.054067
9 0.064767
10 0.073183
11 0.077950
I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all over a certain threshold (e.g. 0.5). So that I would have something like this:
[(150, 185), (632, 680), (1500,1870)]
Where the first tuple is of a region that starts at index 150, has 35 values that are all above 0.5 in row, and ends on index 185 non-inclusive.
I started by filtering for only values above 0.5 like so
df = df[df['values'] >= 0.5]
And now I have values like this:
632 0.545700
633 0.574983
634 0.572083
635 0.595500
636 0.632033
637 0.657617
638 0.643300
639 0.646283
I can't show my actual dataset, but the following one should be a good representation
import numpy as np
from pandas import *
np.random.seed(seed=901212)
df = DataFrame(range(1,501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35
yielding:
1 0.491233
2 0.538596
3 0.516740
4 0.381134
5 0.670157
6 0.846366
7 0.495554
8 0.436044
9 0.695597
10 0.826591
...
Where the region (2, 4) has two values above 0.5; however, this would be too short. On the other hand, the region (25, 44), with 19 values above 0.5 in a row, would be added to the list.
You can find the first and last element of each consecutive region by looking at the series and 1-row shifted values, and then filter the pairs which are adequately apart from each other:
# tag rows based on the threshold
df['tag'] = df['values'] > .5
# first row is a True preceded by a False
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
# last row is a True followed by a False
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
# filter those which are adequately apart
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]
so for example the first region would be:
>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True
I think this prints what you want. It is based heavily on Joe Kington's answer here, so I guess it is appropriate to up-vote that.
import numpy as np

# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition, n=1, axis=0)
    idx, _ = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right. -JK
    # LB: this copy to increment is horrible, but I get
    # "ValueError: output array is read-only" without it
    mutable_idx = np.array(idx)
    mutable_idx += 1
    idx = mutable_idx
    if condition[0]:
        # If the start of condition is True, prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]  # Edit
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

def main():
    import pandas as pd
    RUN_LENGTH_THRESHOLD = 5
    VALUE_THRESHOLD = 0.5
    np.random.seed(seed=901212)
    data = np.random.rand(500)*.5 + .35
    df = pd.DataFrame(data=data, columns=['values'])
    match_bools = df.values > VALUE_THRESHOLD
    print('with boolean array')
    for start, stop in contiguous_regions(match_bools):
        if stop - start > RUN_LENGTH_THRESHOLD:
            print(start, stop)

if __name__ == '__main__':
    main()
I would be surprised if there were not more elegant ways.
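For comparison, the same regions can be found with a pandas-only shift/cumsum run-labelling trick (a sketch on the same seeded data; length and value thresholds follow the question):

```python
import numpy as np
import pandas as pd

np.random.seed(seed=901212)
df = pd.DataFrame({'values': np.random.rand(500) * .5 + .35})

tag = df['values'] > 0.5
# Rows in the same consecutive run share a group id:
# the id increments every time the tag flips.
group_id = (tag != tag.shift()).cumsum()

regions = [(grp.index[0], grp.index[-1] + 1)          # end is non-inclusive
           for _, grp in df[tag].groupby(group_id[tag])
           if len(grp) > 5]
print(regions)
```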
