Query Board challenge on Python, need some pointers - python

So, I have this challenge on CodeEval, but I don't know where to start, so I need some pointers (and answers, if you can) to help me figure it out.
DESCRIPTION:
There is a board (matrix). Every cell of the board contains one integer, which is 0 initially.
The next operations can be applied to the Query Board:
SetRow i x: it means that all values in the cells on row "i" have been changed to value "x" after this operation.
SetCol j x: it means that all values in the cells on column "j" have been changed to value "x" after this operation.
QueryRow i: it means that you should output the sum of values on row "i".
QueryCol j: it means that you should output the sum of values on column "j".
The board's dimensions are 256x256
i and j are integers from 0 to 255
x is an integer from 0 to 31
INPUT SAMPLE:
Your program should accept as its first argument a path to a filename. Each line in this file contains an operation of a query. E.g.
SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2
OUTPUT SAMPLE:
For each query, output the answer of the query. E.g.
5118
34
1792
3571
I'm not that great at Python, but this challenge is pretty interesting, although I don't have any clue how to solve it. So, I need some help from you guys.
Thanks!

You could use a sparse matrix for this, addressed by (row, col) tuples as keys in a dictionary, to save memory. 64k cells is a big list otherwise (2MB+ on a 64-bit system):
matrix = {}
This is way more efficient, as the challenge is unlikely to set values for all rows and columns on the board.
Setting a column or row is then:
def set_col(col, x):
    for i in range(256):
        matrix[i, col] = x

def set_row(row, x):
    for i in range(256):
        matrix[row, i] = x
and summing a row or column is then:
def get_col(col):
    return sum(matrix.get((i, col), 0) for i in range(256))

def get_row(row):
    return sum(matrix.get((row, i), 0) for i in range(256))
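To tie the sparse-matrix pieces together, a minimal driver might look like the following; this is only a sketch assuming the matrix dict and the four functions above, with the input path passed as the first argument, as the challenge requires:

import sys

with open(sys.argv[1]) as f:
    for line in f:
        parts = line.split()
        op, args = parts[0], [int(a) for a in parts[1:]]
        if op == "SetRow":
            set_row(*args)
        elif op == "SetCol":
            set_col(*args)
        elif op == "QueryRow":
            print(get_row(*args))
        elif op == "QueryCol":
            print(get_col(*args))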

WIDTH, HEIGHT = 256, 256
board = [[0] * WIDTH for i in range(HEIGHT)]

def set_row(i, x):
    global board
    board[i] = [x] * WIDTH
Implement each function, then parse each line of input to decide which function to call:
for line in inf:
    dat = line.split()
    if dat[0] == "SetRow":
        set_row(int(dat[1]), int(dat[2]))
    elif ...
Edit: Per Martijn's comments:
total memory usage for board is about 2.1MB. By comparison, after 100 random row/column writes, matrix is 3.1MB (although it tops out there and doesn't get any bigger).
yes, global is unnecessary when modifying a global object (just don't try to assign to it).
while dispatching from a dict is good and efficient, I did not want to inflict it on someone who is "not that great on Python", especially for just four entries.
For the sake of comparison, how about:
time = 0
WIDTH, HEIGHT = 256, 256
INIT = 0

# For every row and column, remember only the last write: the "time"
# it happened and the value written.
rows = [(time, INIT) for _ in range(HEIGHT)]
cols = [(time, INIT) for _ in range(WIDTH)]

def set_row(i, x):
    global time
    time += 1
    rows[int(i)] = (time, int(x))

def set_col(j, x):
    global time
    time += 1
    cols[int(j)] = (time, int(x))

def query_row(i):
    rt, rv = rows[int(i)]
    # The row's value everywhere, corrected for columns written later.
    total = rv * WIDTH + sum(cv - rv for ct, cv in cols if ct > rt)
    print(total)

def query_col(j):
    ct, cv = cols[int(j)]
    total = cv * HEIGHT + sum(rv - cv for rt, rv in rows if rt > ct)
    print(total)

ops = {
    "SetRow": set_row,
    "SetCol": set_col,
    "QueryRow": query_row,
    "QueryCol": query_col
}
inf = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()
for line in inf:
    line = line.split()
    op = line.pop(0)
    ops[op](*line)
which only uses 4.3k of memory for rows[] and cols[].
Edit 2:
Using your code from above for matrix, set_row and set_col:
import sys

for n in range(256):
    set_row(n, 1)
    print("{}: {}".format(2*(n+1)-1, sys.getsizeof(matrix)))
    set_col(n, 1)
    print("{}: {}".format(2*(n+1), sys.getsizeof(matrix)))
which returns (condensed):
1: 12560
2: 49424
6: 196880
22: 786704
94: 3146000
Basically, the allocated memory quadruples at each of these steps. If I change the memory measure to include the key tuples,
def get_matrix_size():
    return sys.getsizeof(matrix) + sum(sys.getsizeof(key) for key in matrix)
it increases more smoothly, but still takes a big jump at the above points:
5 : 127.9k
6 : 287.7k
21 : 521.4k
22 : 1112.7k
60 : 1672.0k
61 : 1686.1k <-- approx expected size on your reported problem set
93 : 2121.1k
94 : 4438.2k


Find where the slope changes in my data as a parameter that can be easily indexed and extracted

I have the following data:
0.8340502011561366
0.8423491600218922
0.8513456021654467
0.8458192388553084
0.8440111276014195
0.8489589671423143
0.8738088120491972
0.8845129900705279
0.8988298998926688
0.924633964692693
0.9544790734065157
0.9908034431246875
1.0236430466543138
1.061619773027915
1.1050038249835414
1.1371449802490126
1.1921182610371368
1.2752207659022576
1.344047620255176
1.4198117350668353
1.507943067143741
1.622137968203745
1.6814098429502085
1.7646810054280595
1.8485457435775694
1.919591124757554
1.9843144220593145
2.030158014640226
2.018184122476175
2.0323466012624207
2.0179200409023874
2.0316932950853723
2.013683870089898
2.03010703506514
2.0216151623726977
2.038855467786505
2.0453923522466093
2.03759031642753
2.019424996752278
2.0441806106428606
2.0607521369415136
2.059310067318373
2.0661157975162485
2.053216429539864
2.0715123971225564
2.0580473413362075
2.055814512721712
2.0808278560688964
2.0601637029377113
2.0539429365156003
2.0609648613513754
2.0585135712612646
2.087674625814453
2.062482961966647
2.066476100210777
2.0568444178944967
2.0587903943282266
2.0506399365756396
The data, when plotted, rises and then levels off. I want to find the point where the slope changes sign (I circled it in black in the plot; it should be around index 26).
I need to find this point of change for several hundred files. So far I tried the recommendation from this post:
Finding the point of a slope change as a free parameter- Python
I think that because my data is a bit noisy, I am not getting a smooth transition in the change of the slope.
This is the code I have tried so far:
import sys
import numpy as np

#load 1-D data file
file = str(sys.argv[1])
y = np.loadtxt(file)

#create X based on file length
x = np.linspace(1, len(y), num=len(y))

#Find first derivative
m = np.diff(y)/np.diff(x)
print(m)

#Find second derivative
b = np.diff(m)
print(b)

#find Index
index = 0
for difference in b:
    index += 1
    if difference < 0:
        print(index, difference)
Since my data is noisy, I am getting some negative values before the index I want. The index I want to retrieve in this case is around 26 (which is where my data becomes constant). Does anyone have any suggestions on what I can do to solve this issue? Thank you!
A gradient approach is not much use here, because you don't care about velocities or vector fields. Knowing the gradient doesn't add extra information for locating the maximum value, since the run (the x spacing) is always positive and therefore doesn't affect the sign of the gradient. A method based entirely on the rise is suggested instead.
Detect the indices at which the data are decreasing, compute the differences between those indices, and locate the largest gap, which marks the end of the long rising run. With a little index manipulation you can then find the index at which the data has its maximum.
data = '0.8340502011561366 0.8423491600218922 0.8513456021654467 0.8458192388553084 0.8440111276014195 0.8489589671423143 0.8738088120491972 0.8845129900705279 0.8988298998926688 0.924633964692693 0.9544790734065157 0.9908034431246875 1.0236430466543138 1.061619773027915 1.1050038249835414 1.1371449802490126 1.1921182610371368 1.2752207659022576 1.344047620255176 1.4198117350668353 1.507943067143741 1.622137968203745 1.6814098429502085 1.7646810054280595 1.8485457435775694 1.919591124757554 1.9843144220593145 2.030158014640226 2.018184122476175 2.0323466012624207 2.0179200409023874 2.0316932950853723 2.013683870089898 2.03010703506514 2.0216151623726977 2.038855467786505 2.0453923522466093 2.03759031642753 2.019424996752278 2.0441806106428606 2.0607521369415136 2.059310067318373 2.0661157975162485 2.053216429539864 2.0715123971225564 2.0580473413362075 2.055814512721712 2.0808278560688964 2.0601637029377113 2.0539429365156003 2.0609648613513754 2.0585135712612646 2.087674625814453 2.062482961966647 2.066476100210777 2.0568444178944967 2.0587903943282266 2.0506399365756396'
data = data.split()

import numpy as np

a = np.array(data, dtype=float)
diff = np.diff(a)
neg_indices = np.where(diff < 0)[0]
neg_diff = np.diff(neg_indices)
i_max_dif = np.where(neg_diff == neg_diff.max())[0][0] + 1
i_max = neg_indices[i_max_dif] - 1  # -1 because each diff value is the difference of two consecutive data values
print(i_max, a[i_max])
Output
26 1.9843144220593145
Some details
print(neg_indices)  # all indices where the data decreases
# [ 2 3 27 29 31 33 36 37 40 42 44 45 47 48 50 52 54 56]
print(neg_diff)  # differences between those indices
# [ 1 24 2 2 2 3 1 3 2 2 1 2 1 2 2 2 2]
print(neg_diff.max())  # the largest gap between decreases
# 24
print(i_max_dif)  # position in neg_indices just after the largest gap -> neg_indices[2] == 27
# 2
print(i_max)  # index of the maximum of the original data
# 26
When the first derivative changes sign, that's when the slope sign changes. I don't think you need the second derivative unless you want to determine the rate of change of the slope. You also aren't really getting the second derivative; you're just getting the difference of the first derivative (you would need to divide by the x spacing again).
Also, you seem to be assigning arbitrary x-values. If your y-values represent points that are equally spaced apart, then that's OK; otherwise the derivative will be wrong.
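As an aside, if your x-values are not evenly spaced, numpy's gradient function can be handed the coordinates themselves and will account for the uneven spacing; a minimal sketch (the sample points here are made up for illustration):

import numpy as np

# Unevenly spaced sample points (made up for illustration)
x = np.array([0.0, 1.0, 1.5, 3.0, 4.5])
y = x ** 2

# np.gradient divides by the actual spacing around each point,
# so the derivative estimate stays consistent for non-uniform x.
dy_dx = np.gradient(y, x)
print(dy_dx)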
Here's an example of how to get the first and second derivatives:
import numpy as np

x = np.linspace(1, 100, 1000)
y = np.cos(x)

# Find first derivative:
m = np.diff(y)/np.diff(x)

# Find second derivative
m2 = np.diff(m)/np.diff(x[:-1])

print(m)
print(m2)

# Get x-values where slope sign changes
c = len(m)
changes_index = []
for i in range(1, c):
    prev_val = m[i-1]
    val = m[i]
    if prev_val < 0 and val > 0:
        changes_index.append(i)
    elif prev_val > 0 and val < 0:
        changes_index.append(i)

for i in changes_index:
    print(x[i])
Notice I had to curtail the x-values for the second derivative. That's because np.diff() returns one less point than its input.
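Since the data is noisy, it can also help to smooth it before differentiating, so spurious sign changes disappear; here is a hedged sketch using a simple moving average (the window size is a knob to tune, and the filename is made up):

import numpy as np

def smooth(y, window=5):
    # Simple moving average; mode='same' keeps the output length equal
    # to the input. Note: 'same' zero-pads, so the first and last
    # window//2 points are distorted.
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode='same')

y = np.loadtxt("data.txt")   # hypothetical file, one value per line
m = np.diff(smooth(y))       # first derivative of the smoothed data

# First index where the smoothed slope stops being positive:
print(int(np.argmax(m <= 0)))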

How can I solve "killed"?

I'm trying to plot clusters for my data, which is stored in a .data file, using the density peak clustering algorithm from this code, but the process gets killed because the file is 8 GB while my RAM is 32 GB. How can I solve this problem, please?
The core problem is loading the whole file with this method:
def density_and_distance(self, distance_file, dc = None):
    print("Begin")
    distance, num, max_dis, min_dis = load_data(distance_file)
    print("end")
    if dc is None:
        dc = auto_select_dc(distance, num, max_dis, min_dis)
    rho = local_density(distance, num, dc)
    delta, nearest_neighbor = min_distance(distance, num, max_dis, rho)
    self.distance = distance
    self.rho = rho
    self.delta = delta
    self.nearest_neighbor = nearest_neighbor
    self.num = num
    self.dc = dc
    return rho, delta
I see "Begin" printed, then the process gets killed after some minutes.
The file contains lines like:
1 2 19.86
1 3 36.66
1 4 87.94
1 5 11.07
1 6 36.94
1 7 52.04
1 8 173.68
1 9 28.10
1 10 74.00
1 11 85.36
1 12 40.04
1 13 95.24
1 14 67.29
....
The method for reading the file is:
def load_data(distance_file):
    distance = {}
    min_dis, max_dis = sys.float_info.max, 0.0
    num = 0
    with open(distance_file, 'r', encoding = 'utf-8') as infile:
        for line in infile:
            content = line.strip().split(' ')
            assert(len(content) == 3)
            idx1, idx2, dis = int(content[0]), int(content[1]), float(content[2])
            num = max(num, idx1, idx2)
            min_dis = min(min_dis, dis)
            max_dis = max(max_dis, dis)
            distance[(idx1, idx2)] = dis
            distance[(idx2, idx1)] = dis
    # (the 'with' block closes the file automatically)
    for i in range(1, num + 1):
        distance[(i, i)] = 0.0
    return distance, num, max_dis, min_dis
I tried changing it to:
import dask.dataframe as dd

def load_data(distance_file):
    distance = {}
    min_dis, max_dis = sys.float_info.max, 0.0
    num = 0
    #with open(distance_file, 'r', encoding = 'utf-8') as infile:
    df_dd = dd.read_csv("ex3.csv")
    print("df_dd", df_dd.head())
    #for line in df_dd:
    #    content = df_dd.strip().split(' ')
    #    print(content)
    idx1, idx2, dis = df_dd.partitions[0], df_dd.partitions[1], df_dd.partitions[2]
    print("df_dd.partitions[0]", df_dd.partitions[0])
    num = max(num, idx1, idx2)
    min_dis = min(min_dis, dis)
    max_dis = max(max_dis, dis)
    distance[(idx1, idx2)] = dis
    distance[(idx2, idx1)] = dis
    for i in range(1, num + 1):
        distance[(i, i)] = 0.0
    return distance, num, max_dis, min_dis
You are using Python native integers and floats: these alone take tens of bytes for each actual number in your data (28 bytes for an integer).
If you simply use Numpy or Pandas for that, your memory consumption might be slashed by a factor of 4 or more, without further adjustments.
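For instance, a minimal sketch of reading the three whitespace-separated columns with compact dtypes in pandas (the filename, column names, and the uint16/float32 choice are assumptions based on the sample above):

import pandas as pd

# Read "idx1 idx2 dis" rows into small fixed-width types instead of
# Python objects; this alone cuts the memory per value dramatically.
# uint16 assumes the point indices stay below 65536.
df = pd.read_csv(
    "distances.data",   # hypothetical path to the 8GB file
    sep=" ",
    names=["idx1", "idx2", "dis"],
    dtype={"idx1": "uint16", "idx2": "uint16", "dis": "float32"},
)
print(df.memory_usage(deep=True).sum(), "bytes")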
Your lines average 10 bytes this early on, so an 8GB file should hold fewer than 800 million records. If you use 16-bit integers and 32-bit floats, your data might fit in 10GB of memory. It is still a tight call, as the default pandas behavior is to copy everything on changes to a column. There are other options:
Since your code depends on indexing the rows as you've done there, you could just offload your data to an SQLite DB and use SQLite indices instead of the dict you are using, as well as its MIN and MAX operators: this would keep memory usage down, and SQLite would do its job with minimal fuss.
Another option would be to use dask instead of Pandas: it will take care of offloading to disk any data that does not fit in memory.
TL;DR: the way your problem is arranged, going to SQLite is probably the route that requires the fewest changes to what you have planned.
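A rough sketch of the SQLite route could look like this (the table, column and file names are made up; the point is that the pair lookups, MIN and MAX all happen inside the database instead of in a giant dict):

import sqlite3

conn = sqlite3.connect("distances.db")   # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS dist (idx1 INTEGER, idx2 INTEGER, dis REAL)")

# Stream the file in without ever holding it all in memory.
with open("distances.data") as infile:   # hypothetical path
    conn.executemany(
        "INSERT INTO dist VALUES (?, ?, ?)",
        ((int(a), int(b), float(d)) for a, b, d in (line.split() for line in infile)),
    )
conn.execute("CREATE INDEX IF NOT EXISTS pair_idx ON dist (idx1, idx2)")
conn.commit()

# min/max/num without a Python-side dict:
max_dis, min_dis = conn.execute("SELECT MAX(dis), MIN(dis) FROM dist").fetchone()
m1, m2 = conn.execute("SELECT MAX(idx1), MAX(idx2) FROM dist").fetchone()
num = max(m1, m2)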

Generate all possible unique peptides (permutants) in Python/Biopython

I have a scenario in which I have a peptide frame of 9 AA. I want to generate all possible peptides by replacing a maximum of 3 AA on this frame, i.e. by replacing only 1, 2 or 3 AA.
The frame is CKASGFTFS and I want to see all the mutants by replacing a maximum of 3 AA from the pool of 20 AA.
We have a pool of 20 different AA (A,R,N,D,E,G,C,Q,H,I,L,K,M,F,P,S,T,W,Y,V).
I am new to coding, so can someone help me out with how to code this in Python or Biopython?
The output is supposed to be a list of unique sequences like the below:
CKASGFTFT, CTTSGFTFS, CTASGKTFS, CTASAFTWS, CTRSGFTFS, CKASEFTFS ... and so on, getting 1, 2, or 3 substitutions from the pool of AA without changing the existing frame.
OK, so after my code finished, I worked the calculations backwards:
Case 1 is 9c1 x 19 = 171
Case 2 is 9c2 x 19 x 19 = 12,996
Case 3 is 9c3 x 19 x 19 x 19 = 576,156
That's a total of 589,323 combinations.
Here is the code for all 3 cases, you can run them sequentially.
You also requested to join the array into a single string, I have updated my code to reflect that.
import copy

original = ['C','K','A','S','G','F','T','F','S']
possibilities = ['A','R','N','D','E','G','C','Q','H','I','L','K','M','F','P','S','T','W','Y','V']
storage = []
counter = 1

# case 1
for i in range(len(original)):
    for x in range(20):
        temp = copy.deepcopy(original)
        if temp[i] == possibilities[x]:
            pass
        else:
            temp[i] = possibilities[x]
            storage.append(''.join(temp))
            print(counter, ''.join(temp))
            counter += 1

# case 2
for i in range(len(original)):
    for j in range(i+1, len(original)):
        for x in range(len(possibilities)):
            for y in range(len(possibilities)):
                temp = copy.deepcopy(original)
                if temp[i] == possibilities[x] or temp[j] == possibilities[y]:
                    pass
                else:
                    temp[i] = possibilities[x]
                    temp[j] = possibilities[y]
                    storage.append(''.join(temp))
                    print(counter, ''.join(temp))
                    counter += 1

# case 3
for i in range(len(original)):
    for j in range(i+1, len(original)):
        for k in range(j+1, len(original)):
            for x in range(len(possibilities)):
                for y in range(len(possibilities)):
                    for z in range(len(possibilities)):
                        temp = copy.deepcopy(original)
                        if temp[i] == possibilities[x] or temp[j] == possibilities[y] or temp[k] == possibilities[z]:
                            pass
                        else:
                            temp[i] = possibilities[x]
                            temp[j] = possibilities[y]
                            temp[k] = possibilities[z]
                            storage.append(''.join(temp))
                            print(counter, ''.join(temp))
                            counter += 1
The output looks like this (just the beginning and the end shown).
The results are also saved to a variable named storage, which is a native Python list.
1 AKASGFTFS
2 RKASGFTFS
3 NKASGFTFS
4 DKASGFTFS
5 EKASGFTFS
6 GKASGFTFS
...
...
...
589318 CKASGFVVF
589319 CKASGFVVP
589320 CKASGFVVT
589321 CKASGFVVW
589322 CKASGFVVY
589323 CKASGFVVV
It takes around 10-20 minutes to run, depending on your computer.
It displays all the combinations, skipping a candidate whenever any of the chosen replacement AAs is the same as the original at its position: one position in case 1, two in case 2, three in case 3.
This code both prints the results and stores them in a list variable, so it can be memory-intensive as well as CPU-intensive.
You could reduce the memory footprint by replacing the letters with numbers, since they might take less space; you could even consider using something like pandas, or appending to a CSV file on disk.
You can iterate over the storage variable to go through the strings if you wish, like this:
for i in storage:
    print(i)
Or you can convert it to a pandas series or dataframe, or write line by line directly to a CSV file on disk.
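For example, a tiny sketch of writing the collected sequences to a CSV file on disk, one sequence per row (the filename is made up):

import csv

# Write each collected sequence as its own row (hypothetical filename).
with open("mutants.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for seq in storage:
        writer.writerow([seq])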
Let's compute the total number of mutations that you are looking for.
Say you want to replace a single AA. There are 9 AAs in your frame, each of which can be changed into one of 19 others. That's 9 * 19 = 171.
If you want to change two AAs, there are 9c2 = 36 combinations of positions in your frame, and 19^2 choices of replacements from the pool. That gives us 36 * 19^2 = 12996.
Finally, if you want to change three, there are 9c3 = 84 combinations of positions and 19^3 choices of replacements. That gives us 84 * 19^3 = 576156.
Put it all together and you get 171 + 12996 + 576156 = 589323 possible mutations. Hopefully, this helps illustrate the scale of the task you are trying to accomplish!
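For completeness, the same set can be generated compactly with itertools; this is a sketch of the counting argument above rather than the code from the earlier answer (it pairs each combination of positions with every choice of replacements that differ from the original residues):

from itertools import combinations, product

frame = "CKASGFTFS"
pool = "ARNDEGCQHILKMFPSTWYV"

mutants = set()
for n in (1, 2, 3):   # replace 1, 2 or 3 positions
    for positions in combinations(range(len(frame)), n):
        # At each chosen position, allow only the 19 AAs that differ
        # from the residue already there.
        choices = [[aa for aa in pool if aa != frame[p]] for p in positions]
        for replacement in product(*choices):
            seq = list(frame)
            for p, aa in zip(positions, replacement):
                seq[p] = aa
            mutants.add("".join(seq))

print(len(mutants))   # 589323, matching the count above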

Solving knapsack problem using a greedy python algorithm

I'm trying to solve the knapsack problem using Python, implementing a greedy algorithm. The result I'm getting back makes no sense to me.
Knapsack:
The first line gives the number of items, in this case 20. The last line gives the capacity of the knapsack, in this case 524. The remaining lines give the index, value and weight of each item.
20
1 91 29
2 60 65
3 61 71
4 9 60
5 79 45
6 46 71
7 19 22
8 57 97
9 8 6
10 84 91
11 20 57
12 72 60
13 32 49
14 31 89
15 28 2
16 81 30
17 55 90
18 43 25
19 100 82
20 27 19
524
Python code:
import os

def constructive():
    knapsack = []
    Weight = 0
    while(Weight <= cap):
        best = max(values)
        i = values.index(best)
        knapsack.append(i)
        Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight
def read_kfile(fname):
    with open(fname, 'rU') as kfile:
        lines = kfile.readlines()  # reads the whole file
    n = int(lines[0])
    c = int(lines[n+1])
    vs = []
    ws = []
    lines = lines[1:n+1]  # Removes the first and last line
    for l in lines:
        numbers = l.split()  # Converts the string into a list
        vs.append(int(numbers[1]))  # Appends value, needs to be converted to int
        ws.append(int(numbers[2]))  # Appends weight, needs to be converted to int
    return n, c, vs, ws
dir_path = os.path.dirname(os.path.realpath(__file__))  # Get the directory where the file is located
os.chdir(dir_path)  # Change the working directory so we can read the file
knapfile = 'knap20.txt'
nitems, cap, values, weights = read_kfile(knapfile)
val1, val2 = constructive()
print('knapsack', val1)
print('weight', val2)
print('cap', cap)
Result:
knapsack [18, 0, 8, 13, 3, 8, 1, 0, 3]
weight 570
cap 524
Welcome. The reason your program is giving a weight over the cap limit is that you aren't checking whether the final item you put in the knapsack can actually fit. To fix this, just add an if statement. You should also check that the list of values is not empty. Note that I append (i+1), since your text file's indexing starts at 1 but Python starts its list indices at 0:
def constructive():
    knapsack = []
    Weight = 0
    while(Weight <= cap and values):
        best = max(values)
        i = values.index(best)
        if weights[i] <= cap - Weight:
            knapsack.append(i+1)
            Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight
The problem is that, in the last step, the best item you find exceeds the maximum weight, but since you have already entered the loop, you add it anyway.
In the next iteration you recognize that you are over the cap and stop.
I am not sure how you want to proceed once the next-best item is too heavy. In case you simply want to stop and not add anything more, you can modify your constructive to look as follows:
def constructive():
    knapsack = []
    Weight = 0
    while(True):
        best = max(values)
        i = values.index(best)
        if Weight + weights[i] > cap:
            break
        knapsack.append(i)
        Weight = Weight + weights[i]
        del values[i]
        del weights[i]
    return knapsack, Weight
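As a side note, greedy heuristics for the knapsack problem usually pick items by value-to-weight ratio rather than by raw value; neither answer above does this, so the following is only a sketch of that variant, reusing the cap, values and weights globals from the question:

def constructive_by_ratio():
    # Visit items in order of value per unit of weight, best first.
    order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
    knapsack, Weight = [], 0
    for i in order:
        if Weight + weights[i] <= cap:
            knapsack.append(i + 1)   # 1-based, matching the file's indexing
            Weight += weights[i]
    return knapsack, Weight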

Python - Shuffling a list with constraints

I've been working for a couple of months, on and off, on a script to shuffle a list in a textfile. I am a beginner in Python (the only language I sort of understand a bit), and after a while I have managed to come up with a few lines of code which do sort of what I need.
The input file I have is a tabbed list. It has 5 words per row, but I'll use numbers so the example looks clearer:
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Now, after a few efforts and a huge amount of work from SO users, I've managed to shuffle these elements so that they don't appear in the same line as their original "partners". This is the code I'm using:
import csv, StringIO
import random
from random import shuffle

datalist = open('lista.txt', 'r')
leyendo = datalist.read()
separando = csv.reader(StringIO.StringIO(leyendo), delimiter = '\t')
macrolist = list(separando)
l = [group[:] for group in macrolist]
random.shuffle(l)
nicendone = []
prev_i = -1
while any(a for a in l):
    new_i = max(((i, a) for i, a in enumerate(l) if i != prev_i), key=lambda x: len(x[1]))[0]
    nicendone.append(l[new_i].pop(random.randint(0, len(l[new_i]) - 1)))
    prev_i = new_i
with open('randolista.txt', 'w') as newdoc:
    for i, m in enumerate(nicendone, 1):
        newdoc.write(m + [', ', '\n'][i % 5 == 0])
datalist.close()
This does the job, but what I actually need is a bit more complicated. I need to shuffle the list with the following restrictions:
The words in the first and second column should be shuffled ONLY within their own column.
The new randomised list should have no two elements appearing in the same line again.
What I'd like to get is something like the following:
01 17 25 19 13
16 22 13 03 20
etc
So items in the first and second column are only shuffled within their own columns, and no two items that shared a row in the input share a row in the output. I realise that in a 5-row example this last constraint is constantly broken, but the real input file has 100 rows.
I really don't know how to even start doing this. My programming abilities are limited, and I can't even come up with pseudocode for it. How can I make Python identify the elements of the first two columns so that it only shuffles them vertically?
Thanks in advance
Shuffling the first two columns in such a way that two values that used to be on the same row never end up on the same row again can be accomplished by shifting each of those columns down by its own random offset, wrapping around at the bottom. For example, you could push the first column 20 rows down and the second column 10 rows down, where 20 and 10 are distinct non-zero random integers less than the number of rows.
A sample code that randomizes the first two columns:
from random import sample

text = \
"""a b c d e
f g h i j
k l m n o
p q r s t"""

# Translate file to matrix (list of lists)
matrix = map(lambda x: x.split(" "), text.split("\n"))

# Determine height and width of matrix
height = len(matrix)
width = len(matrix[0])

# Choose two distinct, non-zero offsets for shifting the first two
# columns, so neither column stays aligned with its original rows
transpose_list = sample(xrange(1, height), 2)

# Now build a new matrix, shifting only the first two columns
new_matrix = []
for y in range(0, height):
    row = []
    for x in range(0, 2):
        transpose = (y + transpose_list[x]) % height
        row.append(matrix[transpose][x])
    for x in range(2, width):
        row.append(matrix[y][x])
    new_matrix.append(row)

# And create a list again
new_text = "\n".join(map(lambda x: " ".join(x), new_matrix))
print new_text
This results in something like:
f l c d e
k q h i j
p b m n o
a g r s t
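If you want to convince yourself that the first two columns really never reunite values that shared a row, a tiny check can be added after building new_matrix (this assumes the height and transpose_list variables from the code above):

# Each value in column 0 comes from row (y + offset0) % height and each
# value in column 1 from (y + offset1) % height; the constraint holds as
# long as those source rows differ from each other and from y itself.
for y in range(height):
    src0 = (y + transpose_list[0]) % height
    src1 = (y + transpose_list[1]) % height
    assert src0 != src1 and src0 != y and src1 != y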
If I understand your post correctly, you already have an algorithm for randomising the rest of the table?
I hope this is of some help :-).
Wout
