Pseudorandomisation with Python

I currently have a problem with pseudorandomizing my trials. I am using a while loop to create 12 files, each containing 38 rows (trials), that satisfy one criterion:
1) color1expl cannot be identical in 3 consecutive rows (at most 2 in a row)
where color1expl is one of the columns in my dataframe.
When I only have files of 38 rows to create, the following script seems to work perfectly.
import pandas as pd
n_dataset_int = 0
n_dataset = str(n_dataset_int)
df_all_possible_trials = pd.read_excel('GroupF' + n_dataset + '.xlsx')  # this is my dataset with all possible trials
# creating the files
for iterations in range(0, 12):  # I need 12 files with pseudorandom combinations
    n_dataset_int += 1  # this keeps track of the number of iterations
    n_dataset = str(n_dataset_int)
    df_experiment = df_all_possible_trials.sample(n=38)  # 38 is the total number of trials
    df_experiment.reset_index(drop=True, inplace=True)
    # color1expl cannot be identical in 3 consecutive trials (at most 2 consecutive trials)
    randomized = False
    while not randomized:  # this while loop reshuffles the rows of the dataframe on every pass
        experimental_df_2 = df_experiment.sample(frac=1).reset_index(drop=True)
        for i in range(0, len(experimental_df_2)):
            try:
                if i == len(experimental_df_2) - 1:
                    randomized = True
                elif (experimental_df_2['color1expl'][i] != experimental_df_2['color1expl'][i+1]) or (experimental_df_2['color1expl'][i] != experimental_df_2['color1expl'][i+2]):
                    continue
                elif (experimental_df_2['color1expl'][i] == experimental_df_2['color1expl'][i+1]) and (experimental_df_2['color1expl'][i] == experimental_df_2['color1expl'][i+2]):
                    break
            except:
                pass
    # export the excel file
    experimental_df_2.to_excel('GroupF_r' + n_dataset + '.xlsx', index=False)  # creates a new
However, when I run the same procedure with n=228 instead of n=38, the script seems to run indefinitely: after more than a day it had not produced any of the 12 files, probably because there are too many combinations to try.
Is there a way to improve this script so that it works with a larger number of rows?

I think you can change the way you generate the random samples (pseudo-code):
n = 38 # or anything else
my_sample = []
my_sample.append( pop_one_random_from(df_all_possible_trials) )
my_sample.append( pop_one_random_from(df_all_possible_trials) )
while len(my_sample) < n:
    next_one = pop_one_random_from(df_all_possible_trials)
    if next_one is equal to my_sample[-1] and my_sample[-2]:
        put next_one back to df_all_possible_trials
        continue
    else:
        my_sample.append( next_one )
If I get it right, all the different samples (out of the total number of combinations, 'len(df_all_possible_trials) choose n') have the same probability of being chosen, which is what you're looking for. And it should run faster.
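The pseudo-code above could be made concrete roughly as follows. This is a minimal sketch rather than the exact method: it works on a plain list of dicts instead of a DataFrame, the helper name constrained_sample and its parameters are made up for illustration, and it restarts from scratch when only the blocked colour is left in the pool, a case the pseudo-code does not cover. Picking at random among the candidates that do not extend the run has the same effect as drawing, rejecting and putting back. The sketch makes no claim that every valid ordering is equally likely.

```python
import random

_NOTHING_BLOCKED = object()  # sentinel: no value is currently forbidden


def constrained_sample(trials, n, key, max_run=2, seed=None, max_restarts=1000):
    """Draw n trials without replacement so that key(trial) never takes
    the same value more than max_run times in a row.

    Hypothetical helper for illustration; not part of pandas.
    """
    rng = random.Random(seed)
    for _ in range(max_restarts):
        pool = list(trials)
        sample = []
        while len(sample) < n and pool:
            # If the last max_run picks all share one value, that value is blocked.
            tail = [key(t) for t in sample[-max_run:]]
            if len(tail) == max_run and len(set(tail)) == 1:
                blocked = tail[0]
            else:
                blocked = _NOTHING_BLOCKED
            # Positions in the pool that would not create a run of max_run + 1.
            allowed = [i for i, t in enumerate(pool) if key(t) != blocked]
            if not allowed:
                break  # dead end: start over with a fresh pool
            sample.append(pool.pop(rng.choice(allowed)))
        if len(sample) == n:
            return sample
    raise RuntimeError("no valid sample found within max_restarts attempts")


# Made-up data: 228 trials spread over three colours.
trials = [{"trial": i, "color1expl": "rgb"[i % 3]} for i in range(228)]
picked = constrained_sample(trials, n=228, key=lambda t: t["color1expl"], seed=0)
colours = [t["color1expl"] for t in picked]
```

With a DataFrame, one could pass df.to_dict('records') as trials and rebuild a frame with pd.DataFrame(picked). Because each trial is placed once instead of reshuffling the whole table until it happens to be valid, the cost no longer explodes as n grows.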

Related

Why is the array my function is returning not defined outside of my function?

I am working on an assignment that requires I graph an array from one part of the assignment in another part. I cannot get the averages_table array to be defined outside of my function, so I cannot pass it through another. I am pretty new to coding so there may be some other sub-optimal parts of my code, but for the purposes of this assignment I just need my array to be accessible in other parts of my code. I have pasted what I have worked on so far below. Thank you to anyone that takes a look!
import numpy as np
import matplotlib.pyplot as plt
def Section1_5_airportAverages():
    print('5. File aircrafts.csv contains monthly aircraft arrival and departure data recorded at an airport\n'
          'from 2010 to 2015. Open the aircrafts.csv file using any text editor (or using any spreadsheet\n'
          'application) to see the content - use Notepad, TextEdit, or similar text editor to see comma separated\n'
          'values and format. Your task is to write a program to analyze the data using Numpy arrays and output\n'
          'average number of arrivals and departures in each year – in the following format.\n')
    airports = np.genfromtxt('aircrafts.csv', delimiter=',', skip_header=1, dtype=str)
    year = 2010
    count = 0
    total_arrivals_per_year = 0
    total_departures_per_year = 0
    averages_table = np.empty((0),str)
    print('Year Arrivals Departures') # Header for table
    while True:
        for i, value in enumerate(airports): # Iterates through airports and assigns an index to each element.
            if count < 11 and year <= 2015:
                count += 1
                total_arrivals_per_year += airports[i][1].astype(int)
                total_departures_per_year += airports[i][2].astype(int)
            elif count == 11 and year < 2015: # Allows holding variable to be reset, and increments year.
                count += 1
                total_arrivals_per_year += airports[i][1].astype(int)
                average_arrivals_per_year = total_arrivals_per_year/count
                total_departures_per_year += airports[i][2].astype(int)
                average_departures_per_year = total_departures_per_year / count
                print(year, average_arrivals_per_year.round(2), average_departures_per_year.round(2))
                count = 0
                year += 1
                total_arrivals_per_year = 0
                total_departures_per_year = 0
                averages_table = np.append(averages_table,[year, average_arrivals_per_year, average_departures_per_year])
            elif count == 11 and year == 2015: # Ends the function after year 2015 is complete.
                count += 1
                total_arrivals_per_year += airports[i][1].astype(int)
                average_arrivals_per_year = total_arrivals_per_year / count
                total_departures_per_year += airports[i][2].astype(int)
                average_departures_per_year = total_departures_per_year / count
                print(year, average_arrivals_per_year.round(2), average_departures_per_year.round(2))
                averages_table = np.append(averages_table,
                                           [year, average_arrivals_per_year, average_departures_per_year]).reshape((6,3)).astype(float).round(2)
                return averages_table
I will try to explain below, though there may well be a better way.
Your question:
I cannot get the averages_table array to be defined outside of my function, so I cannot pass it through another.
Answer:
The array averages_table is created inside the function Section1_5_airportAverages, so you need to return it from the function to access it elsewhere.
Two things you need to correct:
1) The return statement is inside the last elif branch.
This returns the averages_table array if and only if that particular condition is True.
Correction: write the return statement outside of the for loop. Only one branch of the if statement runs on each iteration, so you don't need to return from an individual branch.
2) I am not sure why you are using a while True loop.
This can make your code run indefinitely (an infinite loop) if the return statement is never reached.
Correction: please remove the while True loop. I think you only need to iterate over the code 11 times (a one-year time frame), in which case the condition should be:
while count <= 11
I would suggest using a pandas DataFrame for your assignment, which simplifies things and is a lot easier to implement.
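To illustrate the two points about returning, here is a minimal sketch with a single return after the loop and the result captured by the caller. It uses plain Python lists and made-up monthly numbers instead of aircrafts.csv, and the function name yearly_averages is hypothetical. A name assigned inside a function is local to it, so the caller only sees the table through the return value.

```python
def yearly_averages(rows, first_year=2010, months_per_year=12):
    """rows: (arrivals, departures) pairs, one per month, in date order."""
    table = []
    for start in range(0, len(rows), months_per_year):
        chunk = rows[start:start + months_per_year]
        year = first_year + start // months_per_year
        avg_arrivals = sum(a for a, _ in chunk) / len(chunk)
        avg_departures = sum(d for _, d in chunk) / len(chunk)
        table.append((year, round(avg_arrivals, 2), round(avg_departures, 2)))
    return table  # one return, after the loop has finished


# Made-up data: two years of monthly (arrivals, departures) counts.
rows = [(100 + m, 90 + m) for m in range(24)]

# The local name `table` does not exist out here; keep the returned value instead.
averages_table = yearly_averages(rows)
```

After the assignment on the last line, averages_table can be passed to any other function, for example one that plots it.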

Making a right triangle by arrays in python using loops and auto filling the arrays

I want to write a program that helps me calculate some values using Python.
The main idea is to make a table in which only half of the cells have a value (that is, a triangle). The triangle I want to make has this format:
12345
1234
123
12
1
But things are not that easy. Please see the table below (the same as picture 1).
6.00 6.45 6.80 7.10 7.36 7.56 7.77
6.90 7.20 7.47 7.70 7.88 8.06
7.50 7.75 7.97 8.12 8.30
8.00 8.20 8.33 8.50
8.40 8.50 8.67
8.60 8.80
9.00
In the above triangle, each row is made from the previous row by a certain formula. Please refer to picture 2 and picture 3.
Also, please note that each cell in the above triangle was multiplied by 100. For example, the 6.00 in the upper left corner is actually 6% (0.06). Please do NOT multiply the figures by 100 in the program; that is only done here for demonstration.
So, I want a program that, once I input the first row, calculates and shows the remaining rows for me (that is, shows the whole triangle).
Below is my code. Please have a look. I have tried to write a framework, but I ran into two problems:
1) I don't know how to automatically fill the other arrays (the other rows) based on the first array (the first row) that I input.
2) I keep getting the error "IndexError: list index out of range" when I try to run my program.
Here is my code:
array1 = [0.06,0.0645,0.068,0.071,0.0736,0.0756,0.0777] # This is the initial array that I input
array2 = [] # This should be the output after the first iteration; I hope my program can auto-fill the remaining arrays (including this one) when I run it
array3 = [] # This is a blank array waiting to be auto-filled when I run my program
array4 = [] # This is a blank array waiting to be auto-filled when I run my program
array5 = [] # This is a blank array waiting to be auto-filled when I run my program
array6 = [] # This is a blank array waiting to be auto-filled when I run my program
array7 = [] # This is a blank array waiting to be auto-filled when I run my program
array8 = [] # This is a blank array waiting to be auto-filled when I run my program
r = 0 # Initial position in the table I want to make, similar to the x-axis. Please refer to picture 2.
s = 1 # Initial position in the table I want to make, similar to the y-axis. Please refer to picture 2.
def SpotRateForcast(i,j,k):
    global r
    global s
    while (r <= len(k)): # I try to make the first loop here
        s = 1 # Reset the value when the first row is completed so that the second loop can work on the second row
        while (s <= len(k)): # This is the second loop, which determines when to stop in each row
            x = ((((1 + k[j - 1])**j) / ((1 + k[i - 1])**i))**(1/(j-i))) - 1 # Calculate the value with the formula
            print(("%.4f" % round(x,4)), end=' , ') # Print the value to 4 decimal places
            s += 1 # Record that one value has been calculated
            SpotRateForcast(i,j+1,k) # Start to calculate and print the next value
        r += 1 # Switch to the next row when the second loop is finished
SpotRateForcast(r,s,array1) # The code to run the program
Thank you for reading my question.
I would like to ask someone to finish the program for me. As you can see, my code may be bad because I am new to programming, but I have already tried my best.
I do not mind if you edit my code. It would be even better if you wrote new code for me, so I can learn how to write better code from you.
Lastly, I have one more request: please add plenty of comments to the code you write, otherwise I may not be able to understand it (I am a beginner). Thank you very much!
You are overthinking it.
If you only need to print a triangle, just print each row in sequence, and for each row print its values in sequence:
S = [0.06,0.0645,0.068,0.071,0.0736,0.0756,0.0777]
n = len(S)
print(('{:0.4f} ' * n).format(*S)) # print initial line
for i in range(1,n): # loop over rows
    for j in range(i+1, n+1): # then over columns
        num = (1 + S[j -1])**j
        den = (1 + S[i - 1])**i
        x = (num/den)**(1/(j - i)) -1 # ok we have f(i, j)
        print('{:0.4f}'.format(x), end=' ') # print it followed with a space
    print('') # add a newline after a full row
It gives:
0.0600 0.0645 0.0680 0.0710 0.0736 0.0756 0.0777
0.0690 0.0720 0.0747 0.0770 0.0787 0.0807
0.0750 0.0775 0.0797 0.0812 0.0830
0.0801 0.0821 0.0833 0.0850
0.0841 0.0849 0.0867
0.0857 0.0880
0.0904
Now, if you want to build the arrays first because you will post-process them later, just append each value to a list for its row, and append each of those lists to a global list of lists:
S = [0.06,0.0645,0.068,0.071,0.0736,0.0756,0.0777]
n = len(S)
result = [S] # add the initial row
for i in range(1,n): # loop over rows
    row = [] # prepare a new list for the row
    result.append(row) # and append it to result
    for j in range(i+1, n+1): # then over columns
        num = (1 + S[j -1])**j
        den = (1 + S[i - 1])**i
        x = (num/den)**(1/(j - i)) -1 # ok we have f(i, j)
        row.append(x) # as a list is modifiable we can append to it
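array = [[0.06,0.0645,0.068,0.071,0.0736,0.0756,0.0777]]
n = len(array[0])
# outer loop: one pass per new row
for i in range(1, n):
    # append a new, empty row
    array.append([])
    for j in range(n-i):
        # I don't understand your calculation, but it would go here.
        # Append the value: assigning to array[i][j] on an empty row raises IndexError.
        array[i].append(...) # replace ... with your formula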

How do I speed up incredibly slow iteration through pandas dataframe?

I have collected 500 ms stock data for several weeks. Now I want to go through each day's data and iterate through it to determine, for any given last price value, how many times a specific lower bound would be passed followed by a specific upper bound being passed during the rest of the day. The lower bound has to be passed before the script starts checking whether the upper bound will be reached.
So essentially, if the last price at row i is 10, the lower and upper bounds are then 9 and 11. The code first looks through the remaining rows i+1, i+2, ... for a moment when the lower bound is reached; as soon as it is reached, the code switches to looking for when the upper bound is reached. If the upper bound is reached, the success count increases by 1 and the code starts looking for the lower bound again, repeating the whole process.
This entire process occurs for every single row, so for each row we end up with a column holding how many times a successful lower-then-upper bound reach occurred in the rows following that row.
The problem I am having is that I have about 14400 rows per day and about 40 days, so around 576000 rows of data. The iteration takes absolutely forever, and to do this across all of my data I would need my computer to run for a few days. Surely I am not doing this in the most efficient way possible? Can anybody point me to a concept that would let me rewrite this code in a much more efficient way? Or am I just stuck waiting forever for it to prepare my data?
range_per = .00069 #the percentage determining lower and upper bound
data['Success?']=np.nan
data['Success? Count']=np.nan
#For every row count how many times the trade in the range would be successful
for i in range(0,len(data)):
    last_price = data.at[i,'lastPrice']
    lower_bound = last_price - last_price*range_per
    upper_bound = last_price + last_price*range_per
    lower_bound_reached = False
    upper_bound_reached = False
    success=0
    for b in range(i+1,len(data)):
        last_price = data.at[b,'lastPrice']
        while lower_bound_reached == False:
            if lower_bound - last_price >=0:
                upper_bound_reached = False
                lower_bound_reached = True
            else:
                break
        while (upper_bound_reached == False and lower_bound_reached ==True):
            if upper_bound - last_price <=0:
                success+=1
                lower_bound_reached = False
                upper_bound_reached = True
            else:
                break
    print('row %s: %s times' %(i, success))
    data['Success? Count'][i] = success
    if success>0:
        data['Success?'][i] = True
    else:
        data['Success?'][i] = False
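As a hedged sketch of the counting rule described above, the inner loop can be written as a small two-state machine over a plain Python list, which avoids one data.at lookup per element. The function name count_round_trips is made up for illustration, it assumes range_per is positive, and the prices are invented. The total work is still quadratic in the number of rows, so this only trims the per-step overhead.

```python
def count_round_trips(prices, start, range_per):
    """Count lower-bound-then-upper-bound crossings after row `start`."""
    last = prices[start]
    lower = last - last * range_per
    upper = last + last * range_per
    waiting_for_upper = False  # False: looking for the lower bound first
    success = 0
    for price in prices[start + 1:]:
        if not waiting_for_upper:
            if price <= lower:
                waiting_for_upper = True
        elif price >= upper:
            success += 1
            waiting_for_upper = False  # start looking for the lower bound again
    return success


# Invented prices; with range_per=0.05 the bounds for row 0 are 9.5 and 10.5.
prices = [10, 9, 11, 9, 11, 10]
counts = [count_round_trips(prices, i, 0.05) for i in range(len(prices))]
```

With the DataFrame from the question, prices could be obtained once with data['lastPrice'].tolist(), and the finished list assigned to the 'Success? Count' column in a single step.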

How do I make sure all of my values are computed in my loop?

I am working on a 'keep the change' assignment where I round purchases up to the whole dollar and add the change to a savings account. However, the loop is not going through all of the values in my external text file; it only computes the last value. I tried splitting the file, but that gives me an error. What might be the issue? My external text file looks like this:
10.90
13.59
12.99
(each on different lines)
def main():
    account1 = BankAccount()
    file1 = open("data.txt","r+") # reading the file; + indicates read and write
    s = 0 # to keep track of the new savings
    for n in file1:
        n = float(n) # lets Python know that the values are floats and not strings
        z = math.ceil(n) # rounds up to the whole dollar
        amount = float(z-n) # subtract the actual total from the rounded sum to get the change
        print(" Saved $",round(amount,2), "on this purchase",file = file1)
        s = amount + s
    x = (account1.makeSavings(s))
I'm fairly sure this happens because you are printing the amount of money you have saved to the same file you are reading. In general, you don't want to alter the length of an object while you are iterating over it, because it can cause problems.
account1 = BankAccount()
file1 = open("data.txt","r+") # reading the file; + indicates read and write
s = 0 # to keep track of the new savings
amount_saved = []
for n in file1:
    n = float(n) # lets Python know that the values are floats and not strings
    z = math.ceil(n) # rounds up to the whole dollar
    amount = float(z-n) # subtract the actual total from the rounded sum to get the change
    amount_saved.append(round(amount,2))
    s = amount + s
x = (account1.makeSavings(s))
for n in amount_saved:
    print(" Saved $", n, "on this purchase", file = file1)
This will print the amounts you have saved at the end of the file after you have finished iterating through it.
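The same read-first, write-afterwards idea can be sketched without BankAccount, which is not shown in the question. This is only an illustration: the function name keep_the_change is hypothetical, and the demo uses a temporary file holding the three purchases from the question instead of data.txt.

```python
import math
import os
import tempfile


def keep_the_change(path):
    """Return (change per purchase, total change) and append one line per purchase."""
    with open(path) as f:  # step 1: read every purchase before writing anything
        purchases = [float(line) for line in f if line.strip()]
    changes = [round(math.ceil(p) - p, 2) for p in purchases]
    with open(path, "a") as f:  # step 2: append the summary lines afterwards
        for change in changes:
            print(" Saved $", change, "on this purchase", file=f)
    return changes, round(sum(changes), 2)


# Demo on a throwaway file with the three purchases from the question.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("10.90\n13.59\n12.99\n")
changes, total = keep_the_change(path)
with open(path) as f:
    lines = f.read().splitlines()
os.remove(path)
```

Note that after one run the file also contains the summary lines, so a second run on the same file would fail at float(); writing the summary to a separate file avoids that.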

How to speed up Python string matching code

I have this code which computes the Longest Common Subsequence between random strings to see how accurately one can reconstruct an unknown region of the input. To get good statistics I need to iterate it many times but my current python implementation is far too slow. Even using pypy it currently takes 21 seconds to run once and I would ideally like to run it 100s of times.
#!/usr/bin/python
import random
import itertools
#test to see how many different unknowns are compatible with a set of LCS answers.
def lcs(x, y):
    n = len(x)
    m = len(y)
    # table is the dynamic programming table
    table = [list(itertools.repeat(0, n+1)) for _ in xrange(m+1)]
    for i in range(n+1): # i=0,1,...,n
        for j in range(m+1): # j=0,1,...,m
            if i == 0 or j == 0:
                table[i][j] = 0
            elif x[i-1] == y[j-1]:
                table[i][j] = table[i-1][j-1] + 1
            else:
                table[i][j] = max(table[i-1][j], table[i][j-1])
    # Now, table[n, m] is the length of LCS of x and y.
    return table[n][m]
def lcses(pattern, text):
    return [lcs(pattern, text[i:i+2*l]) for i in xrange(0,l)]
l = 15
#Create the pattern
pattern = [random.choice('01') for i in xrange(2*l)]
#create text start and end and unknown.
start = [random.choice('01') for i in xrange(l)]
end = [random.choice('01') for i in xrange(l)]
unknown = [random.choice('01') for i in xrange(l)]
lcslist= lcses(pattern, start+unknown+end)
count = 0
for test in itertools.product('01',repeat = l):
    test=list(test)
    testlist = lcses(pattern, start+test+end)
    if (testlist == lcslist):
        count += 1
print count
I tried converting it to numpy but I must have done it badly as it actually ran more slowly. Can this code be sped up a lot somehow?
Update. Following a comment below, it would be better if lcses used a recurrence directly which gave the LCS between pattern and all sublists of text of the same length. Is it possible to modify the classic dynamic programming LCS algorithm somehow to do this?
The recurrence table, table, is being recomputed 15 times on every call to lcses(), when it depends only on m and n, where m has a maximum value of 2*l and n is at most 3*l.
If your program computed table only once, it would be dynamic programming, which it currently is not. A Python idiom for this would be:
table = None
def use_lcs_table(m, n, l):
    global table
    if table is None:
        table = lcs(2*l, 3*l)
    return table[m][n]
Using a class instance would be cleaner and more extensible than a global table declaration, but this gives you an idea of why it's taking so much time.
Added in reply to comment:
Dynamic Programming is an optimization that requires a trade-off of extra space for less time. In your example you appear to be doing a table pre-computation in lcs() but you build the whole list on every single call and then throw it away. I don't claim to understand the algorithm you are trying to implement, but the way you have it coded, it either:
Has no recurrence relation, thus no grounds for DP optimization, or
Has a recurrence relation, the implementation of which you bungled.
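For reference, the recurrence that the question's lcs() fills in can be written for Python 3 as below. This is a sketch of the classic table only, under the name lcs_length, which is made up here; it does not address the sliding-window variant asked about in the update. One difference from the question's code is that the table is sized (n+1) rows by (m+1) columns, so it also works when the two inputs differ in length; the original allocates m+1 rows but indexes rows up to n, which is safe there only because both inputs have length 2*l.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of sequences x and y."""
    n, m = len(x), len(y)
    # table[i][j] holds the LCS length of x[:i] and y[:j]; row 0 and column 0 stay 0.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


example = lcs_length("ABCBDAB", "BDCABA")
```

Each entry depends only on its left, upper and upper-left neighbours, which is the recurrence relation the two cases above refer to.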
