After collecting game input in the form of key combinations of w, a, s and d together with the corresponding screen image, I run into a few issues when I try to balance the data. The original code had just 3 inputs: w, a or d. I scaled this up to 9 possibilities, such as aw, sd or nokeys. Part of balancing the data is making the lists for all inputs the same length, but this is where it seems to go wrong. The original code is commented out.
The balancing code:
# balance_data.py
import numpy as np
import pandas as pd
from collections import Counter
from random import shuffle
import sys
train_data = np.load('training_data-1.npy')
df = pd.DataFrame(train_data)
print(df.head())
print(Counter(df[1].apply(str)))
##lefts = []
##rights = []
##forwards = []
##
##shuffle(train_data)
##
##for data in train_data:
##    img = data[0]
##    choice = data[1]
##
##    if choice == [1,0,0]:
##        lefts.append([img,choice])
##    elif choice == [0,1,0]:
##        forwards.append([img,choice])
##    elif choice == [0,0,1]:
##        rights.append([img,choice])
##    else:
##        print('no matches')
##
##
##forwards = forwards[:len(lefts)][:len(rights)]
##lefts = lefts[:len(forwards)]
##rights = rights[:len(forwards)]
##
##final_data = forwards + lefts + rights
##shuffle(final_data)
w = []
a = []
d = []
s = []
wa = []
wd = []
sd = []
sa = []
nk = []
shuffle(train_data)
for data in train_data:
    img = data[0]
    choice = data[1]
    print(choice)
    if choice == [0,1,0,0]:
        w.append([img,choice])
    elif choice == [1,0,0,0]:
        a.append([img,choice])
    elif choice == [0,0,1,0]:
        d.append([img,choice])
    elif choice == [0,0,0,1]:
        s.append([img,choice])
    elif choice == [1,1,0,0]:
        wa.append([img,choice])
    elif choice == [0,1,1,0]:
        wd.append([img,choice])
    elif choice == [0,0,1,1]:
        sd.append([img,choice])
    elif choice == [1,0,0,1]:
        sa.append([img,choice])
    elif choice == [0,0,0,0]:
        nk.append([img,choice])
    else:
        print('no matches')
min_length = 10000
print (len(w))
print (len(a))
print (len(d))
print (len(s))
print (len(wa))
print (len(wd))
print (len(sd))
print (len(sa))
print (len(nk))
if len(w) < min_length:
    min_length = len(w)
if len(a) < min_length:
    min_length = len(a)
if len(d) < min_length:
    min_length = len(d)
if len(s) < min_length:
    min_length = len(s)
if len(wa) < min_length:
    min_length = len(wa)
if len(wd) < min_length:
    min_length = len(wd)
if len(sd) < min_length:
    min_length = len(sd)
if len(sa) < min_length:
    min_length = len(sa)
w = w[min_length]
a = a[min_length]
d = d[min_length]
s = s[min_length]
wa = wa[min_length]
wd = wd[min_length]
sd = sd[min_length]
sa = sa[min_length]
nk = nk[min_length]
final_data = w + a + d + s + wa + wd + sd + sa + nk
shuffle(final_data)
np.save('training_data-1-balanced.npy', final_data)
And here are the list lengths, followed by the error:
9715
920
510
554
887
1069
132
128
6085
Traceback (most recent call last):
File "C:\Users\StefBrands\Documents\GitHub\pygta5 - Copy\balance_data.py", line 115, in <module>
sa = sa[min_length]
IndexError: list index out of range
So now mainly two things:
1. Did I make a mistake somewhere, probably yes :)
2. Is there a better way of balancing?
The error comes from indexing where you meant to slice. w[min_length] asks for the single element at position min_length, and a list of length n only has indices 0 to n-1: for example, the list [0, 5, 1] has length 3, but maximum index 2. Here min_length ends up as 128 (the length of sa), so sa[128] is out of range. The earlier lines only avoid the error because those lists are longer, and they silently replace each list with one of its elements. To truncate every list to the same length, use a slice instead, e.g. w = w[:min_length].
We can neaten the calculations up significantly. The lines from if len(w) < min_length... to final_data = ... can be replaced with the following:
key_lists = (w, a, d, s, wa, wd, sd, sa, nk)
min_length = min(len(x) for x in key_lists)
final_data = [item for x in key_lists for item in x[:min_length]]
We create a tuple containing each of the lists for each key. We can then use a generator expression to find min_length and a list comprehension to concatenate the truncated lists. The advantage of this is that if an additional key combo is added, we can just add its list variable to key_lists.
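On the second question (a better way of balancing), one option is to group the samples by label in a dictionary, so no per-key list variables are needed at all. The following is only a sketch under some assumptions: the function name balance and its seed parameter are made up for illustration, and each sample is assumed to be an (image, choice) pair like the ones in the question.

```python
import random
from collections import defaultdict

def balance(samples, seed=None):
    """Downsample so that every label occurs equally often.

    samples: iterable of (image, choice) pairs, where choice is a
    sequence of 0/1 flags such as [0, 1, 0, 0] (assumed layout).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for img, choice in samples:
        # Lists are not hashable, so the label is converted to a tuple.
        groups[tuple(choice)].append([img, choice])
    if not groups:
        return []
    # The rarest label decides how many samples each label keeps.
    min_length = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        rng.shuffle(group)
        balanced.extend(group[:min_length])  # slice, not index
    rng.shuffle(balanced)
    return balanced
```

Note that this keeps only as many samples per label as the rarest label has (128 in the output above), so a lot of data is thrown away. Whether that is acceptable depends on the model being trained.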
I need some help with my code. I need to look for the presence of resistance genes in a water sample. That translates into having a huge file of reads coming from the water sample and a file of resistance genes. My problem is making the code run in under 5 minutes, which is not happening right now. The issue probably lies in discarding reads as fast as possible, that is, in having a smart method to analyze only meaningful reads. Do you have any suggestions? I cannot use any non-standard Python library.
This is my code:
import time
def build_lyb(TargetFile):
    TargetFile = open(TargetFile)
    res_gen = {}
    for line in TargetFile:
        if line.startswith(">"):
            header = line[:-1]
            res_gen[header] = ""
        else:
            res_gen[header] += line[:-1]
    return res_gen
def build_kmers(sequence, k_size):
    kmers = []
    n_kmers = len(sequence) - k_size + 1
    for i in range(n_kmers):
        kmer = sequence[i:i + k_size]
        kmers.append(kmer)
    return kmers
def calculation(kmers, g):
    matches = []
    for i in range(0, len(genes[g])):
        matches.append(0)
    k = 0
    while k < len(kmers):
        if kmers[k] in genes[g]:
            pos = genes[g].find(kmers[k])
            for i in range(pos, pos+19):
                matches[i] = 1
            k += 19
        else:
            k += 1
    return matches
def coverage(matches, g):
    counter = 0
    for i in matches[g]:
        if i >= 1:
            counter += 1
    cov = counter/len(res_genes[g])*100
    return cov
st = time.time()
genes = build_lyb("resistance_genes.fsa")
infile = open('test2.txt', 'r')
res_genes = {}
Flag = False
n_line = 0
for line in infile:
    n_line += 1
    if line.startswith("+"):
        Flag = False
    if Flag:
        kmers = build_kmers(line[:-1], 19)
        for g in genes:
            counter = 18
            k = 20
            while k <= 41:
                if kmers[k] in genes[g]:
                    counter += 19
                    k += 19
                else:
                    k += 1
            if counter >= 56:
                print(n_line)
                l1 = calculation(kmers, g)
                if g in res_genes:
                    l2 = res_genes[g]
                    lr = [sum(i) for i in zip(l1, l2)]
                    res_genes[g] = lr
                else:
                    res_genes[g] = l1
    if line.startswith('#'):
        Flag = True
for g in res_genes:
    print(g)
    for i in genes[g]:
        print(i, " ", end='')
    print('')
    for i in res_genes[g]:
        print(i, " ", end='')
    print('')
    print(coverage(res_genes, g))
et = time.time()
elapsed_time = et-st
print("Execution time:", elapsed_time, "s")
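One possible way to discard reads quickly with only the standard library is to index the k-mers of all genes once in a dictionary, so each read k-mer costs a single hash lookup instead of a substring search through every gene. This is just a sketch, not a drop-in replacement: the function names build_kmer_index and candidate_genes and the min_hits threshold are assumptions made for illustration, and genes is assumed to be a dict of name to sequence like the one build_lyb returns.

```python
def build_kmer_index(genes, k=19):
    """Map every k-mer that occurs in any gene to the set of gene names containing it."""
    index = {}
    for name, seq in genes.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def candidate_genes(read, index, k=19, min_hits=2):
    """Return the names of genes sharing at least min_hits k-mers with the read."""
    hits = {}
    for i in range(len(read) - k + 1):
        # One dictionary lookup per read k-mer; unknown k-mers cost almost nothing.
        for name in index.get(read[i:i + k], ()):
            hits[name] = hits.get(name, 0) + 1
    return {name for name, count in hits.items() if count >= min_hits}
```

A read for which candidate_genes returns an empty set can be skipped right away, and the slower per-position coverage bookkeeping then only has to run for the few (read, gene) pairs that survive the filter.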
The code implements the dynamic programming solution for global pairwise alignment of two sequences. I am trying to perform a semi-global alignment between the SARS-CoV-2 reference genome and the first read in the Nanopore sample. The length of the reference genome is 29903 base pairs and the length of the first Nanopore read is 1246 base pairs. When I run the following code, I get this message in my terminal:
import os
import sys
import numpy as np
GAP = -2
MATCH = 5
MISMATCH = -3
MAXLENGTH_A = 29904
MAXLENGTH_B = 1247
# insert sequence files
A = open("SARS-CoV-2 reference genome.txt", "r")
B = open("Nanopore.txt", "r")
def max(A, B, C):
    if (A >= B and A >= C):
        return A
    elif (B >= A and B >= C):
        return B
    else:
        return C
def Tmax(A, B, C):
    if (A > B and A > C):
        return 'D'
    elif (B > A and B > C):
        return 'L'
    else:
        return 'U'
def m(p, q):
    if (p == q):
        return MATCH
    else:
        return MISMATCH
def append(st, c):
    return c + "".join(i for i in st)
if __name__ == "__main__":
    if (len(sys.argv) != 2):
        print("Usage: align <input file>")
        sys.exit()
    if (not os.path.isfile(sys.argv[1])):
        print("input file not found.")
        sys.exit()
    S = np.empty([MAXLENGTH_A, MAXLENGTH_B], dtype = int)
    T = np.empty([MAXLENGTH_A, MAXLENGTH_B], dtype = str)
    with open(sys.argv[1], "r") as file:
        A = str(A.readline())[:-1]
        B = str(B.readline())[:-1]
    print("Sequence A:",A)
    print("Sequence B:",B)
    N = len(A)
    M = len(B)
    S[0][0] = 0
    T[0][0] = 'D'
    for i in range(0, N + 1):
        S[i][0] = GAP * i
        T[i][0] = 'U'
    for i in range(0, M + 1):
        S[0][i] = GAP * i
        T[0][i] = 'L'
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            S[i][j] = max(S[i-1][j-1]+m(A[i-1],B[j-1]),S[i][j-1]+GAP,S[i-1][j]+GAP)
            T[i][j] = Tmax(S[i-1][j-1]+m(A[i-1],B[j-1]),S[i][j-1]+GAP,S[i-1][j]+GAP)
    print("The score of the alignment is :",S[N][M])
    i, j = N, M
    RA = RB = RM = ""
    while (i != 0 or j != 0):
        if (T[i][j]=='D'):
            RA = append(RA,A[i-1])
            RB = append(RB,B[j-1])
            if (A[i-1] == B[j-1]):
                RM = append(RM,'|')
            else:
                RM = append(RM,'*')
            i -= 1
            j -= 1
        elif (T[i][j]=='L'):
            RA = append(RA,'-')
            RB = append(RB,B[j-1])
            RM = append(RM,' ')
            j -= 1
        elif (T[i][j]=='U'):
            RA = append(RA,A[i-1])
            RB = append(RB,'-')
            RM = append(RM,' ')
            i -= 1
    print(RA)
    print(RM)
    print(RB)
This has nothing to do with Python itself. The error message comes from this line:
if (len(sys.argv) != 2):
    print("Usage: align <input file>")
The code expects to be called with exactly one argument, the input file:
align path/to/input/file
You provided a different number of arguments, probably zero.
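To make the argument check easier to see in isolation, here is a small sketch of the same logic wrapped in a function. The name input_path_from_args is hypothetical, and raising ValueError instead of printing and calling sys.exit() is a choice made here so that the check can be exercised without a terminal.

```python
import os

def input_path_from_args(argv):
    """Return the input file path from argv, or raise ValueError with a hint."""
    # argv[0] is the script name, so exactly one user argument means len(argv) == 2.
    if len(argv) != 2:
        raise ValueError("Usage: align <input file>")
    path = argv[1]
    if not os.path.isfile(path):
        raise ValueError("input file not found: " + path)
    return path
```

Called as input_path_from_args(sys.argv), this raises the usage error whenever the script is started without a file argument, which appears to be what happened here.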
Whenever k = 2, the code keeps running in a loop.
If k > 2, it sets all but one of the centroid locations to 0,0.
I've reviewed it a couple of times, and there don't seem to be any errors, so it is probably some sort of logic flaw. The code starts with a class whose methods initialize the centroids, calculate the Euclidean distance, and reassign the centroids to the average positions of the points that are in each cluster. It then runs a loop that alternates between calculating distances and reassigning centroids until two consecutive lists of assignments are equal, and then plots the result.
import math
import random
import matplotlib.pyplot as plt
class Kmeans:
    def __init__(self, K, dataset, centroids, sorting):
        self.K = K
        self.dataset = dataset
        self.centroids = centroids
        self.sorting = sorting
    #sets starting position of centroids
    def initializeCentroids(self):
        bigX = 0
        bigY = 0
        self.centroids = []
        for i in self.dataset:
            if i[0] > bigX:
                bigX = i[0]
            if i[1] > bigY:
                bigY = i[1]
        for q in range(self.K):
            self.centroids.append([random.randint(0, bigX), random.randint(0, bigY)])
        plt.scatter((self.centroids[0][0], self.centroids[1][0]), (self.centroids[0][1], self.centroids[1][1]))
        return self.centroids
    #calculates euclidean distance
    def calcDistance(self):
        self.sorting = []
        for w in self.dataset:
            print(w)
            distances = []
            counter = 0
            for centr in self.centroids:
                distances.append(math.sqrt(abs((centr[0] - w[0] * centr[0] - w[0]) + (centr[1] - w[1] * centr[1] - w[1]))))
                counter += 1
                if counter > 0:
                    try:
                        if distances[0] > distances[1]:
                            distances.pop(0)
                        if distances[1] > distances[0]:
                            distances.pop(1)
                            counter -= 1
                    except IndexError:
                        pass
            self.sorting.append([w, counter, distances[0]])
        return self.sorting
    def reassignCentroids(self):
        counter3 = 1
        for r in range(len(self.centroids)):
            positionsX = []
            positionsY = []
            for t in self.sorting:
                if t[1] == counter3:
                    positionsX.append(t[0][0])
                    positionsY.append(t[0][1])
            population = len(positionsY)
            if population == 0:
                population = 1
            self.centroids.append([sum(positionsX) / population, sum(positionsY) / population])
            counter3 += 1
            self.centroids.pop(0)
        return
k = 4
dataSetSize = input("Enter the amount of tuples you want generated: ")
data_set = []
for o in range(int(dataSetSize)):
    data_set.append((random.randint(0, 1000), random.randint(0, 1000)))
attempt = Kmeans(k, data_set, 0, 0)
attempt.initializeCentroids()
xvals = []
yvals = []
sortCompare = []
# plots
for p in data_set:
    xvals.append(p[0])
    yvals.append(p[1])
running = True
while running:
    if len(sortCompare) > 1:
        centroidChoice0 = []
        centroidChoice1 = []
        for p in sortCompare[0]:
            centroidChoice0.append(p[1])
        for d in sortCompare[1]:
            centroidChoice1.append(d[1])
        print(centroidChoice1)
        print(attempt.centroids)
        if centroidChoice1 == centroidChoice0:
            running = False
            for m in attempt.centroids:
                plt.scatter((attempt.centroids[0][0], attempt.centroids[1][0]), (attempt.centroids[0][1], attempt.centroids[1][1]))
            running = False
        sortCompare.pop(0)
    attempt.calcDistance()
    sortCompare.append(attempt.sorting)
    attempt.reassignCentroids()
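For comparison, the steps described above (assign each point to its nearest centroid, move each centroid to the mean of its points, stop when the assignments no longer change) can be sketched as follows. This is a minimal reference version and not a patch for the class above: the names nearest and kmeans, the seed and max_iter parameters, and the choice to start from k randomly sampled data points are all assumptions made here. It works for any k because the nearest centroid is found with min() over all centroids rather than by comparing the first two distances.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance in 2D)."""
    return min(
        range(len(centroids)),
        key=lambda i: math.hypot(point[0] - centroids[i][0], point[1] - centroids[i][1]),
    )

def kmeans(points, k, seed=None, max_iter=100):
    """Return (centroids, assignment) for a list of 2D points."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points instead of random coordinates.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous position
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids, assignment
```

Two differences from the code above may be worth checking there: the squared difference needs parentheses, as in (centr[0] - w[0]) ** 2, and an empty cluster keeps its old centroid here instead of being moved to 0,0.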
So I started to create a word search generator in Python 3 with the least complexity possible. I ended up with the following code and hope to find out where I need to improve.
The issue:
The main issue is that the code might get stuck in an infinite loop trying to put the last word in the puzzle. My aim is to make the code more efficient by avoiding that.
The main file (wsg.py) and the word database (n_words.txt) are available at:
https://github.com/mannysayah/wordsearchpuzzle_py3
Here's how wsg.py looks:
import random
import string
import numpy as np
import numpy.ma as ma
# How many meaningful words should take up the puzzle
make_up_total = 0.70
# Grid size
width = 10
height = 10
# Probabilities
ph = 0.3
pd = 0.3
pv = 0.4
total_words_letters = int(width * height * make_up_total)
filename = 'n_words.txt'
with open(filename) as f:
    data = f.readlines()
# remove whitespace characters like `\n` at the end of each line
data = [x.strip() for x in data]
list_of_all_words = []
for x in data:
    if len(x) < width and len(x) < height:
        list_of_all_words.append(x)
data = list_of_all_words
total_take = 0
word_list = []
while total_take < total_words_letters - 4:
    item = random.choice(data)
    if len(item) < total_words_letters - total_take:
        word_list.append(item)
        total_take += len(item)
print(word_list)
puzzle = np.zeros((width, height)).astype(int).astype(str)
directions = [[1,0],[0,1],[1,1]]
def existing_word(d,word):
    xsize = width if d[0] == 0 else width - len(word)
    ysize = width if d[1] == 0 else height - len(word)
    x = random.randrange(0,xsize)
    y = random.randrange(0,ysize)
    temp_word = []
    for i in range(0,len(word)):
        temp_word.append(str(puzzle[y+d[1]*i][x+d[0]*i]))
    return("".join(temp_word),x,y)
def compare_them(w1, w2):
    comp = []
    for i,j in zip(w1, w2):
        if j == "0":
            comp.append("Zero")
        else:
            if i == j:
                comp.append("Match")
            else:
                comp.append(False)
    if ( comp.count("Match") <= 1 and comp.count(False) == 0 ):
        return True
    else:
        return False
def fill_with_rand(puzzle,w,h):
    mask = ([puzzle == "0"])
    new_puzzle = np.copy(puzzle)
    alpha_puzzle = np.copy([[random.choice(string.ascii_uppercase) for i in range(0,w)] for j in range(0,h)])
    new_puzzle[tuple(mask)] = alpha_puzzle[tuple(mask)]
    return new_puzzle
for word in word_list:
    while (True):
        word = random.choice([word, word[::-1]]) # word can be backwards
        d = np.random.choice([0,1,2],p=[ph,pv,pd]) # Take a random choice of [0,1,2]
        d = directions[d] # now take that out of [directions]
        from_puzzle = existing_word(d, word) # now take an empty space from the puzzle with the same length as the word
        x = from_puzzle[1]
        y = from_puzzle[2]
        if compare_them(word, from_puzzle[0]):
            for i in range(0,len(word)):
                puzzle[y+d[1]*i][x+d[0]*i] = word[i]
            break
print(fill_with_rand(puzzle, width, height))
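One common way to keep a placement loop like the while (True) above from spinning forever is to cap the number of random attempts per word and report failure, so the caller can skip the word or pick another one. The sketch below illustrates that idea on a plain list-of-lists grid with None for empty cells. The names fits and place_word and the max_attempts default are assumptions, and unlike compare_them it allows any number of matching overlaps, so treat it as an illustration of the retry cap rather than a replacement for wsg.py.

```python
import random

DIRECTIONS = [(1, 0), (0, 1), (1, 1)]  # (dx, dy): horizontal, vertical, diagonal

def fits(grid, word, x, y, dx, dy):
    """True if word can go at (x, y) without leaving the grid or clashing."""
    height, width = len(grid), len(grid[0])
    end_x = x + dx * (len(word) - 1)
    end_y = y + dy * (len(word) - 1)
    if not (0 <= end_x < width and 0 <= end_y < height):
        return False
    # A cell is usable if it is empty or already holds the same letter.
    return all(grid[y + dy * i][x + dx * i] in (None, ch) for i, ch in enumerate(word))

def place_word(grid, word, rng, max_attempts=200):
    """Try random spots; return True once placed, False after max_attempts."""
    height, width = len(grid), len(grid[0])
    for _ in range(max_attempts):
        dx, dy = rng.choice(DIRECTIONS)
        x, y = rng.randrange(width), rng.randrange(height)
        if fits(grid, word, x, y, dx, dy):
            for i, ch in enumerate(word):
                grid[y + dy * i][x + dx * i] = ch
            return True
    return False  # the caller decides: skip the word or draw a different one
```

With a return value instead of an endless loop, the outer for loop can simply drop a word that could not be placed, which should be enough to guarantee termination.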
I have converted the code given at this link into a Python version. The code is supposed to calculate the maximum total value that fits into a knapsack of weight capacity W. I have attached the code below:
#http://www.geeksforgeeks.org/branch-and-bound-set-2-implementation-of-01-knapsack/
from queue import Queue
class Node:
    def __init__(self):
        self.level = None
        self.profit = None
        self.bound = None
        self.weight = None
    def __str__(self):
        return "Level: %s Profit: %s Bound: %s Weight: %s" % (self.level, self.profit, self.bound, self.weight)
def bound(node, n, W, items):
    if(node.weight >= W):
        return 0
    profit_bound = int(node.profit)
    j = node.level + 1
    totweight = int(node.weight)
    while ((j < n) and (totweight + items[j].weight) <= W):
        totweight += items[j].weight
        profit_bound += items[j].value
        j += 1
    if(j < n):
        profit_bound += (W - totweight) * items[j].value / float(items[j].weight)
    return profit_bound
Q = Queue()
def KnapSackBranchNBound(weight, items, total_items):
    items = sorted(items, key=lambda x: x.value/float(x.weight), reverse=True)
    u = Node()
    v = Node()
    u.level = -1
    u.profit = 0
    u.weight = 0
    Q.put(u)
    maxProfit = 0
    while not Q.empty():
        u = Q.get()
        if u.level == -1:
            v.level = 0
        if u.level == total_items - 1:
            continue
        v.level = u.level + 1
        v.weight = u.weight + items[v.level].weight
        v.profit = u.profit + items[v.level].value
        if (v.weight <= weight and v.profit > maxProfit):
            maxProfit = v.profit
        v.bound = bound(v, total_items, weight, items)
        if (v.bound > maxProfit):
            Q.put(v)
        v.weight = u.weight
        v.profit = u.profit
        v.bound = bound(v, total_items, weight, items)
        if (v.bound > maxProfit):
            # print(items[v.level])
            Q.put(v)
    return maxProfit
if __name__ == "__main__":
    from collections import namedtuple
    Item = namedtuple("Item", ['index', 'value', 'weight'])
    input_data = open("test.data").read()
    lines = input_data.split('\n')
    firstLine = lines[0].split()
    item_count = int(firstLine[0])
    capacity = int(firstLine[1])
    print("running from main")
    items = []
    for i in range(1, item_count+1):
        line = lines[i]
        parts = line.split()
        items.append(Item(i-1, int(parts[0]), float(parts[1])))
    kbb = KnapSackBranchNBound(capacity, items, item_count)
    print(kbb)
The program is supposed to calculate a value of 235 for the following items inside the file test.data:
5 10
40 2
50 3.14
100 1.98
95 5
30 3
The first line shows the number of items and the knapsack capacity. The lines below it show the value and weight of each item. The items are built using a namedtuple and sorted by value/weight. For this problem I am getting 135 instead of 235. What am I doing wrong here?
EDIT:
I have solved the problem of finding correct items based on branch and bound. If needed, one can check it here
The problem is that you're inserting multiple references to the same Node() object into your queue. The fix is to initialize two new v objects in each iteration of the while-loop as follows:
    while not Q.empty():
        u = Q.get()
        v = Node() # Added line
        if u.level == -1:
            v.level = 0
        if u.level == total_items - 1:
            continue
        v.level = u.level + 1
        v.weight = u.weight + items[v.level].weight
        v.profit = u.profit + items[v.level].value
        if (v.weight <= weight and v.profit > maxProfit):
            maxProfit = v.profit
        v.bound = bound(v, total_items, weight, items)
        if (v.bound > maxProfit):
            Q.put(v)
        v = Node() # Added line
        v.level = u.level + 1 # Added line
        v.weight = u.weight
        v.profit = u.profit
        v.bound = bound(v, total_items, weight, items)
        if (v.bound > maxProfit):
            # print(items[v.level])
            Q.put(v)
Without these reinitializations, you're modifying the v object that you already inserted into the queue.
This is different from C++ where the Node objects are values that are implicitly copied into the queue to avoid aliasing problems such as these.
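The aliasing effect can be reproduced in a few lines, independent of the knapsack code. In this small demonstration the Node class is reduced to a single attribute and the function names enqueue_shared and enqueue_fresh are made up for illustration.

```python
from queue import Queue

class Node:
    def __init__(self, level=None):
        self.level = level

def enqueue_shared():
    """Put the same Node object into the queue three times."""
    q = Queue()
    v = Node()
    for level in range(3):
        v.level = level  # also changes the entries already in the queue
        q.put(v)         # the queue stores a reference, not a copy
    return [q.get().level for _ in range(3)]

def enqueue_fresh():
    """Put a new Node object into the queue on every iteration."""
    q = Queue()
    for level in range(3):
        q.put(Node(level))
    return [q.get().level for _ in range(3)]
```

Here enqueue_shared() returns [2, 2, 2] because all three queue entries are the same object, while enqueue_fresh() returns [0, 1, 2].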