Generate all possible unique peptides (permutants) in Python/Biopython

I have a peptide frame of 9 amino acids (AA). I want to generate all possible peptides by replacing at most 3 AA on this frame, i.e. by replacing exactly 1, 2, or 3 AA.
The frame is CKASGFTFS, and I want to see all the mutants obtained by replacing a maximum of 3 AA with residues from the pool of 20 AA.
The pool of 20 different AA is (A,R,N,D,E,G,C,Q,H,I,L,K,M,F,P,S,T,W,Y,V).
I am new to coding, so can someone help me with how to code this in Python or Biopython?
The output is supposed to be a list of unique sequences like the ones below:
CKASGFTFT, CTTSGFTFS, CTASGKTFS, CTASAFTWS, CTRSGFTFS, CKASEFTFS ... and so on, with 1, 2, or 3 substitutions from the pool of AA and without changing the rest of the frame.

OK, so after my code finished, I worked the calculations backwards:
Case 1 is 9c1 x 19 = 171
Case 2 is 9c2 x 19 x 19 = 12,996
Case 3 is 9c3 x 19 x 19 x 19 = 576,156
That's a total of 589,323 combinations.
Here is the code for all 3 cases; you can run them sequentially.
You also requested to join the array into a single string, so I have updated my code to reflect that.
import copy
original = ['C','K','A','S','G','F','T','F','S']
possibilities = ['A','R','N','D','E','G','C','Q','H','I','L','K','M','F','P','S','T','W','Y','V']
storage=[]
counter=1
# case 1
for i in range(len(original)):
    for x in range(20):
        temp = copy.deepcopy(original)
        if temp[i] == possibilities[x]:
            pass
        else:
            temp[i] = possibilities[x]
            storage.append(''.join(temp))
            print(counter,''.join(temp))
            counter += 1
# case 2
for i in range(len(original)):
    for j in range(i+1,len(original)):
        for x in range(len(possibilities)):
            for y in range(len(possibilities)):
                temp = copy.deepcopy(original)
                if temp[i] == possibilities[x] or temp[j] == possibilities[y]:
                    pass
                else:
                    temp[i] = possibilities[x]
                    temp[j] = possibilities[y]
                    storage.append(''.join(temp))
                    print(counter,''.join(temp))
                    counter += 1
# case 3
for i in range(len(original)):
    for j in range(i+1,len(original)):
        for k in range(j+1,len(original)):
            for x in range(len(possibilities)):
                for y in range(len(possibilities)):
                    for z in range(len(possibilities)):
                        temp = copy.deepcopy(original)
                        if temp[i] == possibilities[x] or temp[j] == possibilities[y] or temp[k] == possibilities[z]:
                            pass
                        else:
                            temp[i] = possibilities[x]
                            temp[j] = possibilities[y]
                            temp[k] = possibilities[z]
                            storage.append(''.join(temp))
                            print(counter,''.join(temp))
                            counter += 1
The outputs look like this, (just the beginning and the end).
The results will also be saved to a variable named storage which is a native python list.
1 AKASGFTFS
2 RKASGFTFS
3 NKASGFTFS
4 DKASGFTFS
5 EKASGFTFS
6 GKASGFTFS
...
...
...
589318 CKASGFVVF
589319 CKASGFVVP
589320 CKASGFVVT
589321 CKASGFVVW
589322 CKASGFVVY
589323 CKASGFVVV
It takes around 10 - 20 minutes to run depending on your computer.
It will display all the combinations, skipping any substitution that would leave one of the selected positions unchanged (one position in case 1, two in case 2, three in case 3).
This code both prints the sequences and stores them in a list variable, so it can be memory intensive and CPU intensive.
You could reduce the memory footprint by encoding the letters as numbers, which might take less space; you could also consider using something like pandas, or appending to a CSV file on disk instead of keeping everything in memory.
You can iterate over the storage variable to go through the strings if you wish, like this:
for i in storage:
    print(i)
Or you can convert it to a pandas Series or DataFrame, or write it line by line directly to a CSV file on disk.
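For comparison, the three nested-loop cases can probably be collapsed into one generator built on itertools. This is only a sketch under my own assumptions: the names `mutants`, `AMINO_ACIDS`, and `max_subs` are made up for illustration and are not part of the code above. It picks the positions to change with `combinations`, then the replacement residues with `product`, leaving out the original residue at each position.

```python
from itertools import combinations, islice, product

# Pool of 20 amino acids, in the same order as the question (assumed constant name).
AMINO_ACIDS = "ARNDEGCQHILKMFPSTWYV"


def mutants(frame, max_subs=3, alphabet=AMINO_ACIDS):
    """Yield every sequence that differs from `frame` at 1..max_subs positions."""
    for n_subs in range(1, max_subs + 1):
        # Choose which positions of the frame get replaced.
        for positions in combinations(range(len(frame)), n_subs):
            # At each chosen position, allow every residue except the original one.
            choices = [[aa for aa in alphabet if aa != frame[p]] for p in positions]
            for replacement in product(*choices):
                seq = list(frame)
                for p, aa in zip(positions, replacement):
                    seq[p] = aa
                yield "".join(seq)


# Peek at the first few mutants without building the whole list.
first_five = list(islice(mutants("CKASGFTFS"), 5))
print(first_five)
```

Because it is a generator, nothing is stored unless you ask for it, so you can write each sequence straight to a file instead of holding all of them in memory.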

Let's compute the total number of mutations that you are looking for.
Say you want to replace a single AA. There are 9 AAs in your frame, each of which can be changed into one of 19 other AAs. That's 9 * 19 = 171.
If you want to change two AAs, there are 9c2 = 36 combinations of positions in your frame, and 19^2 ways to pick the two replacements from the pool. That gives us 36 * 19^2 = 12996.
Finally, if you want to change three, there are 9c3 = 84 combinations of positions and 19^3 ways to pick the three replacements. That gives us 84 * 19^3 = 576156.
Put it all together and you get 171 + 12996 + 576156 = 589323 possible mutations. Hopefully, this helps illustrate the scale of the task you are trying to accomplish!
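If you want to double-check this arithmetic, a few lines of Python will do it. This is just a quick check, assuming Python 3.8+ for `math.comb`; the variable names are mine.

```python
from math import comb

FRAME_LENGTH = 9    # residues in CKASGFTFS
ALTERNATIVES = 19   # 20 amino acids minus the one already at that position

# For k substitutions: choose k positions, then one of 19 replacements for each.
counts = {k: comb(FRAME_LENGTH, k) * ALTERNATIVES ** k for k in (1, 2, 3)}
total = sum(counts.values())

print(counts)
print(total)
```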

Related

Optimizing loop sequence

I am trying to check whether an item from a list exists one or more times in a data frame column, and if so, then use some info of that entire row to extract some data.
The data frame has entries like this:
df =
   prefix  value  binary
---------------------------------------------------
0  30      yes    01010000101000000000000000001101
1  29      yes    01010000101001111110111110101011
2  29      no     10000000010011011011110001111011
The current code looks something like this:
list1 = []
list2 = []
for i, binary in enumerate(list_of_binary_numbers):
    print(f"Executing {i+1}")
    list1_tmp = 0
    list2_tmp = 0
    for index, row in df.iterrows():
        if binary == row["binary"][0 : len(binary)]:
            if row["value"] == "yes":
                list1_tmp += 2 ** (32 - int(row["prefix"]))
            elif row["value"] == "no":
                list2_tmp += 2 ** (32 - int(row["prefix"]))
    list1.append(list1_tmp)
    list2.append(list2_tmp)
So basically list_of_binary_numbers is a list with shortened binary numbers, and I need to check whether this shortened part of a full binary number exists in the df. That's why I do the [0 : len(binary)] so they have the same length.
List looks like this:
list_of_binary_numbers =
0 00000010011010000
1 0000001001101000100
2 000000100110101000000110
3 000000100110101000000111
4 00000010011010100010000
The issue is that list_of_binary_numbers has roughly 150,000 items, and so does the data frame. Each main iteration takes roughly 1 second, so this will take forever to complete.
I just can't see any other good way to achieve this, so that's why I am asking for some help.
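One possible way to avoid the nested loop, sketched with plain Python containers rather than pandas: go through the data frame rows once, and for every prefix length that occurs in the list, add the row's weight to a dictionary keyed by that prefix. Each lookup afterwards is a single dictionary access. The function name `prefix_totals` and the `(prefix, value, binary)` tuple layout are assumptions for illustration, not something from the code above.

```python
from collections import defaultdict


def prefix_totals(rows, prefixes):
    """Return (list1, list2): per-prefix sums of 2 ** (32 - prefix) for yes/no rows.

    rows     -- iterable of (prefix, value, binary) tuples, one per data frame row
    prefixes -- the shortened binary strings to look up
    """
    lengths = {len(p) for p in prefixes}  # only these slice lengths can ever match
    yes_totals = defaultdict(int)
    no_totals = defaultdict(int)
    for prefix, value, binary in rows:
        if value == "yes":
            target = yes_totals
        elif value == "no":
            target = no_totals
        else:
            continue  # the original loop ignores any other value
        weight = 2 ** (32 - int(prefix))
        for n in lengths:
            target[binary[:n]] += weight
    list1 = [yes_totals.get(p, 0) for p in prefixes]
    list2 = [no_totals.get(p, 0) for p in prefixes]
    return list1, list2


# The three sample rows from the question, with made-up short prefixes.
sample_rows = [
    (30, "yes", "01010000101000000000000000001101"),
    (29, "yes", "01010000101001111110111110101011"),
    (29, "no", "10000000010011011011110001111011"),
]
list1, list2 = prefix_totals(sample_rows, ["0101", "1", "0000"])
print(list1, list2)
```

With a real data frame, the rows could come from `df[["prefix", "value", "binary"]].itertuples(index=False, name=None)`. The work is then roughly (number of rows) x (number of distinct prefix lengths, at most 32 here) instead of (number of rows) x (number of prefixes).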

Find combinations (without "of size=r") from a set with decreasing sum value using Python

(Revised for clarity 02-08-2021)
This is similar to the question here:
Find combinations of size r from a set with decreasing sum value
This is different from the answer posted in the link above because I am looking for answers without "size r=3".
I have a set (array) of numbers.
I need to have the sums of combinations of the numbers sorted from largest to smallest and show the numbers from the array that were used to get the total for that row.
Any number in the array can only be used once per row but all the numbers don't have to be used in each row as the total decreases.
If a number is not used then zero should be used as a placeholder instead so I can create a CSV file with the columns aligned.
Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]
Desired Output Example #1 format where the last number in each row is the total (sum) of the row:
Beginning of list
30,25,20,15,10,5,1,106
30,25,20,15,10,5,0,105
30,25,20,15,10,0,1,101
30,25,20,15,10,0,0,100
...(all number combinations in between)
30,0,0,0,0,0,0,30
0,25,0,0,0,5,0,30
...(all number combinations in between)
0,0,0,15,0,0,1,16
0,0,0,15,0,0,0,15
0,0,0,0,10,5,0,15
0,0,0,0,10,0,1,11
0,0,0,0,0,5,1,6
0,0,0,0,0,5,0,5
0,0,0,0,0,0,1,1
End of list
Also, duplicate totals are allowed and preferred showing different combinations that have the same total (sum) of the row.
For Example #1:
30,0,0,0,0,0,0,30
0,25,0,0,0,5,0,30
For example this is one row of output based on the Input Example #1 above:
30,25,0,0,0,5,1,61
Last number in the row is the total. The total can also be the first number but the important thing is that the output list is sorted in descending order by the total.
Input Example #2 with 5 numbers in the array: [20,15,10,5,1]
Desired Output Example #2 format where the last number in each row is the total (sum) of the row:
Beginning of list
20,15,10,5,1,51
20,15,10,5,0,50
20,15,10,0,1,46
20,15,10,0,0,45
...(all number combinations in between)
20,0,10,0,0,30
0,15,10,5,0,30
...(all number combinations in between)
0,15,0,0,1,16
0,15,0,0,0,15
0,0,10,5,0,15
0,0,10,0,1,11
0,0,10,0,0,10
0,0,0,5,1,6
0,0,0,5,0,5
0,0,0,0,1,1
End of list
Input Example #1: [30,25,20,15,10,5,1]
Every row of the output should show each number in the array used only once at most per row to get the total for the row.
The rows must be sorted in decreasing order by the sums of the numbers used to get the total.
The first output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 1 = 106
The second output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 0 = 105
The third output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 0 + 1 = 101
...The rest of the rows would continue with the total (sum) of the row getting smaller until it reaches 1...
The third to last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 1 = 6
The second to last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 0 = 5
The last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 0 + 1 = 1
I started with the code provided by user Divyanshu modified with different input numbers and the () added to the last line (but I need to use all the numbers in the array instead of size=4 as shown here):
import itertools
array = [30,25,20,15,10,5,1]
size = 4
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for comb in itertools.combinations(array,size):
    answer.append(comb)
    order.append([sum(comb),number]) # Storing sum and index
    number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
So this is what I need as an Input (in this example):
[30,25,20,15,10,5,1]
size=4 in the above code limits the output to 4 of the numbers in the array.
If I take out size=4 I get an error. I need to use the entire array of numbers.
I can manually change size=4 to size=1 and run it then size=2 then run it and so on.
Entering size=1 through size=7 in the code and running it (7 times in this example) to get a list of all possible combinations gives me 7 different outputs.
I could then manually put the lists together but that won't work for larger sets (arrays) of numbers.
Can I modify the code referenced above or do I need to use a different approach?
I think you could do it as follows:
Import the following:
import pandas as pd
import numpy as np
The beginning of the code is as in the question:
import itertools
array = [30,25,20,15,10,5,1]
size = 4
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for comb in itertools.combinations(array,size):
    answer.append(comb)
    order.append([sum(comb),number]) # Storing sum and index
    number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
The following code, I think, could help to obtain the final result:
array_len = array.__len__()
# Auxiliary to place in reference to the original array
dict_array = {}
for i in range(0,array_len):
    print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
aux.columns=array + ['total']
aux = aux.astype(int)
print(aux.head().astype(int))
   30  25  20  15  10  5  1  total
0  30  25  20  15   0  0  0     90
1  30  25  20   0  10  0  0     85
2  30  25   0  15  10  0  0     80
3  30  25  20   0   0  5  0     80
4  30  25  20   0   0  0  1     76
Now the generalization for all sizes:
import itertools
import pandas as pd
import numpy as np
array = [30,25,20,15,10,5,1]
array_len = array.__len__()
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for size in range(1,array_len+1):
    print(size)
    for comb in itertools.combinations(array,size):
        answer.append(comb)
        order.append([sum(comb),number]) # Storing sum and index
        number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
# Auxiliary to place in reference to the original array
dict_array = {}
for i in range(0,array_len):
    print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
aux.columns=array + ['total']
aux = aux.astype(int)
print(aux.head().astype(int))
   30  25  20  15  10  5  1  total
0  30  25  20  15  10  5  1    106
1  30  25  20  15  10  5  0    105
2  30  25  20  15  10  0  1    101
3  30  25  20  15  10  0  0    100
4  30  25  20  15   0  5  1     96
Thanks to @RafaelValero (Rafael Valero) I was able to learn about pandas, numpy, and dataframes. I looked up options for pandas to get the desired output.
Here is the final code with some extra lines left in for reference but commented out:
import itertools
import pandas as pd
import numpy as np
array = [30,25,20,15,10,5,1]
array_len = array.__len__()
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for size in range(1,array_len+1):
    # Commented out line below as it was giving extra information
    # print(size)
    for comb in itertools.combinations(array,size):
        answer.append(comb)
        order.append([sum(comb),number]) # Storing sum and index
        number += 1
order.sort(reverse=True) # sorting in decreasing order
# Commented out two lines below as it was from the original code and giving extra information
#for key in order:
#    print (key[0],answer[key[1]]) # key[0] is sum of combination
# Auxiliary to place in reference to the original array
dict_array = {}
for i in range(0,array_len):
    # Commented out line below as it was giving extra information
    # print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        # Commented out line below as it was giving extra information
        # print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
# Update: removed this line below as I didn't need a header
# aux.columns=array + ['total']
aux = aux.astype(int)
# Tried option below first but it was not necessary when using to_csv
# pd.set_option('display.max_rows', None)
print(aux.to_csv(index=False,header=None))
Searched references:
Similar question:
Find combinations of size r from a set with decreasing sum value
Pandas references:
https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_csv.html
Online compiler used:
https://www.programiz.com/python-programming/online-compiler/
Output using Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]:
30,25,20,15,10,5,1,106
30,25,20,15,10,5,0,105
30,25,20,15,10,0,1,101
30,25,20,15,10,0,0,100
30,25,20,15,0,5,1,96
30,25,20,15,0,5,0,95
30,25,20,0,10,5,1,91
30,25,20,15,0,0,1,91
30,25,20,0,10,5,0,90
30,25,20,15,0,0,0,90
30,25,0,15,10,5,1,86
30,25,20,0,10,0,1,86
30,25,0,15,10,5,0,85
30,25,20,0,10,0,0,85
30,0,20,15,10,5,1,81
30,25,0,15,10,0,1,81
30,25,20,0,0,5,1,81
30,0,20,15,10,5,0,80
30,25,0,15,10,0,0,80
30,25,20,0,0,5,0,80
0,25,20,15,10,5,1,76
30,0,20,15,10,0,1,76
30,25,0,15,0,5,1,76
30,25,20,0,0,0,1,76
0,25,20,15,10,5,0,75
30,0,20,15,10,0,0,75
30,25,0,15,0,5,0,75
30,25,20,0,0,0,0,75
0,25,20,15,10,0,1,71
30,0,20,15,0,5,1,71
30,25,0,0,10,5,1,71
30,25,0,15,0,0,1,71
0,25,20,15,10,0,0,70
30,0,20,15,0,5,0,70
30,25,0,0,10,5,0,70
30,25,0,15,0,0,0,70
0,25,20,15,0,5,1,66
30,0,20,0,10,5,1,66
30,0,20,15,0,0,1,66
30,25,0,0,10,0,1,66
0,25,20,15,0,5,0,65
30,0,20,0,10,5,0,65
30,0,20,15,0,0,0,65
30,25,0,0,10,0,0,65
0,25,20,0,10,5,1,61
30,0,0,15,10,5,1,61
0,25,20,15,0,0,1,61
30,0,20,0,10,0,1,61
30,25,0,0,0,5,1,61
0,25,20,0,10,5,0,60
30,0,0,15,10,5,0,60
0,25,20,15,0,0,0,60
30,0,20,0,10,0,0,60
30,25,0,0,0,5,0,60
0,25,0,15,10,5,1,56
0,25,20,0,10,0,1,56
30,0,0,15,10,0,1,56
30,0,20,0,0,5,1,56
30,25,0,0,0,0,1,56
0,25,0,15,10,5,0,55
0,25,20,0,10,0,0,55
30,0,0,15,10,0,0,55
30,0,20,0,0,5,0,55
30,25,0,0,0,0,0,55
0,0,20,15,10,5,1,51
0,25,0,15,10,0,1,51
0,25,20,0,0,5,1,51
30,0,0,15,0,5,1,51
30,0,20,0,0,0,1,51
0,0,20,15,10,5,0,50
0,25,0,15,10,0,0,50
0,25,20,0,0,5,0,50
30,0,0,15,0,5,0,50
30,0,20,0,0,0,0,50
0,0,20,15,10,0,1,46
0,25,0,15,0,5,1,46
30,0,0,0,10,5,1,46
0,25,20,0,0,0,1,46
30,0,0,15,0,0,1,46
0,0,20,15,10,0,0,45
0,25,0,15,0,5,0,45
30,0,0,0,10,5,0,45
0,25,20,0,0,0,0,45
30,0,0,15,0,0,0,45
0,0,20,15,0,5,1,41
0,25,0,0,10,5,1,41
0,25,0,15,0,0,1,41
30,0,0,0,10,0,1,41
0,0,20,15,0,5,0,40
0,25,0,0,10,5,0,40
0,25,0,15,0,0,0,40
30,0,0,0,10,0,0,40
0,0,20,0,10,5,1,36
0,0,20,15,0,0,1,36
0,25,0,0,10,0,1,36
30,0,0,0,0,5,1,36
0,0,20,0,10,5,0,35
0,0,20,15,0,0,0,35
0,25,0,0,10,0,0,35
30,0,0,0,0,5,0,35
0,0,0,15,10,5,1,31
0,0,20,0,10,0,1,31
0,25,0,0,0,5,1,31
30,0,0,0,0,0,1,31
0,0,0,15,10,5,0,30
0,0,20,0,10,0,0,30
0,25,0,0,0,5,0,30
30,0,0,0,0,0,0,30
0,0,0,15,10,0,1,26
0,0,20,0,0,5,1,26
0,25,0,0,0,0,1,26
0,0,0,15,10,0,0,25
0,0,20,0,0,5,0,25
0,25,0,0,0,0,0,25
0,0,0,15,0,5,1,21
0,0,20,0,0,0,1,21
0,0,0,15,0,5,0,20
0,0,20,0,0,0,0,20
0,0,0,0,10,5,1,16
0,0,0,15,0,0,1,16
0,0,0,0,10,5,0,15
0,0,0,15,0,0,0,15
0,0,0,0,10,0,1,11
0,0,0,0,10,0,0,10
0,0,0,0,0,5,1,6
0,0,0,0,0,5,0,5
0,0,0,0,0,0,1,1
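For what it's worth, the same kind of listing can probably be produced with the standard library alone. The sketch below is an alternative I am assuming would work for this task, not the code from the answer above: it walks over every keep/drop mask with `itertools.product`, puts a zero where a number is dropped, and sorts the rows by their total. The function name `rows_by_descending_sum` is made up. Rows with equal totals come out in mask order, so ties may be ordered differently from the pandas output above.

```python
import itertools


def rows_by_descending_sum(numbers):
    """Return rows [n1_or_0, ..., nk_or_0, total] sorted by total, largest first."""
    rows = []
    # Each mask says, position by position, whether a number is kept (1) or dropped (0).
    for mask in itertools.product([1, 0], repeat=len(numbers)):
        if not any(mask):
            continue  # skip the empty combination (all zeros)
        row = [n if keep else 0 for n, keep in zip(numbers, mask)]
        rows.append(row + [sum(row)])
    # The sort is stable, so rows with the same total keep their mask order.
    rows.sort(key=lambda r: r[-1], reverse=True)
    return rows


rows = rows_by_descending_sum([30, 25, 20, 15, 10, 5, 1])
for row in rows[:4]:
    print(",".join(str(value) for value in row))
```

Unlike the dictionary-based placement above, this does not rely on the numbers in the array being distinct, since positions come from the mask rather than from a value-to-index lookup.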

Long multiplication of two numbers given as strings

I am trying to solve a multiplication problem. I know that Python supports very large numbers, so it can be done directly, but what I want to do is:
Enter 2 numbers as strings.
Multiply those two numbers in the same manner as we used to do in school.
The basic idea is to convert the code given in the link below to Python, but I am not very good at C++/Java. I want to understand the code given in the link below and apply it in Python.
https://www.geeksforgeeks.org/multiply-large-numbers-represented-as-strings/
I am stuck at the addition step.
I want to do it like in the image given below
So I have made a list which stores the partial products, one for each digit of the second number. Please help me solve the addition part.
def mul(upper_no,lower_no):
    upper_len=len(upper_no)
    lower_len=len(lower_no)
    list_to_add=[] #saves numbers in queue to add in the end
    for lower_digit in range(lower_len-1,-1,-1):
        q='' #A queue to store step by step multiplication of numbers
        carry=0
        for upper_digit in range(upper_len-1,-1,-1):
            num2=int(lower_no[lower_digit])
            num1=int(upper_no[upper_digit])
            print(num2,num1)
            x=(num2*num1)+carry
            if upper_digit==0:
                q=str(x)+q
            else:
                if x>9:
                    q=str(x%10)+q
                    carry=x//10
                else:
                    q=str(x%10)+q
                    carry=0
            num=x%10
        print(q)
        list_to_add.append(int(''.join(q)))
    print(list_to_add)
mul('234','567')
I have [1638,1404,1170] as the result of the function call mul('234','567'). I am supposed to add these numbers, but I am stuck because each number has to be shifted one more place to the left than the previous one. For example, 1404 has to be added to 1638 as 14040, and 1170 as 117000. Like:
  1638
 1404x
1170xx
--------
132678
--------
I think this might help. I've added a place variable to keep track of what power of 10 each intermediate value should be multiplied by, and used the itertools.accumulate function to produce the intermediate accumulated sums that doing so produces (and you want to show).
Note I have also reformatted your code so it closely follows PEP 8 - Style Guide for Python Code in an effort to make it more readable.
from itertools import accumulate
import operator
def mul(upper_no, lower_no):
    upper_len = len(upper_no)
    lower_len = len(lower_no)
    list_to_add = [] # Saves numbers in queue to add in the end
    place = 0
    for lower_digit in range(lower_len-1, -1, -1):
        q = '' # A queue to store step by step multiplication of numbers
        carry = 0
        for upper_digit in range(upper_len-1, -1, -1):
            num2 = int(lower_no[lower_digit])
            num1 = int(upper_no[upper_digit])
            print(num2, num1)
            x = (num2*num1) + carry
            if upper_digit == 0:
                q = str(x) + q
            else:
                if x>9:
                    q = str(x%10) + q
                    carry = x//10
                else:
                    q = str(x%10) + q
                    carry = 0
            num = x%10
        print(q)
        list_to_add.append(int(''.join(q)) * (10**place))
        place += 1
    print(list_to_add)
    print(list(accumulate(list_to_add, operator.add)))
mul('234', '567')
Output:
7 4
7 3
7 2
1638
6 4
6 3
6 2
1404
5 4
5 3
5 2
1170
[1638, 14040, 117000]
[1638, 15678, 132678]
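If the goal is to stay with strings all the way through, the way it is done on paper, the shifted addition can also be done digit by digit instead of converting the partial products to int. The following is only a sketch of that idea; `add_strings` and `multiply_strings` are names I made up, and it assumes both inputs are non-negative decimal strings.

```python
def add_strings(a, b):
    """Add two non-negative decimal strings column by column, right to left."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad with leading zeros so columns line up
    digits = []
    carry = 0
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


def multiply_strings(upper_no, lower_no):
    """School-style long multiplication of two non-negative decimal strings."""
    total = "0"
    for place, lower_digit in enumerate(reversed(lower_no)):
        partial = []
        carry = 0
        for upper_digit in reversed(upper_no):
            carry, digit = divmod(int(lower_digit) * int(upper_digit) + carry, 10)
            partial.append(str(digit))
        if carry:
            partial.append(str(carry))
        # Appending zeros is the string equivalent of shifting left by `place` columns.
        shifted = "".join(reversed(partial)) + "0" * place
        total = add_strings(total, shifted)
    return total.lstrip("0") or "0"


print(multiply_strings("234", "567"))  # prints 132678
```

The `"0" * place` suffix plays the same role as the `x` placeholders in the diagram in the question.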

Query Board challenge on Python, need some pointers

So, I have this challenge on CodeEval, but I don't know where to start, so I need some pointers (and answers if you can) to help me figure it out.
DESCRIPTION:
There is a board (matrix). Every cell of the board contains one integer, which is 0 initially.
The following operations can be applied to the Query Board:
SetRow i x: all values in the cells on row "i" are changed to "x" by this operation.
SetCol j x: all values in the cells on column "j" are changed to "x" by this operation.
QueryRow i: it means that you should output the sum of values on row "i".
QueryCol j: it means that you should output the sum of values on column "j".
The board's dimensions are 256x256
i and j are integers from 0 to 255
x is an integer from 0 to 31
INPUT SAMPLE:
Your program should accept as its first argument a path to a filename. Each line in this file contains an operation of a query. E.g.
SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2
OUTPUT SAMPLE:
For each query, output the answer of the query. E.g.
5118
34
1792
3571
I'm not that great on Python, but this challenge is pretty interesting, although I didn't have any clues on how to solve it. So, I need some help from you guys.
Thanks!
You could use a sparse matrix for this, with (row, col) tuples as keys in a dictionary, to save memory. 64k cells is a big list otherwise (2MB+ on a 64-bit system):
matrix = {}
This is way more efficient, as the challenge is unlikely to set values for all rows and columns on the board.
Setting a column or row is then:
def set_col(col, x):
    for i in range(256):
        matrix[i, col] = x
def set_row(row, x):
    for i in range(256):
        matrix[row, i] = x
and summing a row or column is then:
def get_col(col):
    return sum(matrix.get((i, col), 0) for i in range(256))
def get_row(row):
    return sum(matrix.get((row, i), 0) for i in range(256))
WIDTH, HEIGHT = 256, 256
board = [[0] * WIDTH for i in range(HEIGHT)]
def set_row(i, x):
    global board
    board[i] = [x]*WIDTH
... implement each function, then parse each line of input to decide which function to call,
for line in inf:
    dat = line.split()
    if dat[0] == "SetRow":
        set_row(int(dat[1]), int(dat[2]))
    elif ...
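To make the outline concrete, here is one way the dense-board version might be filled in end to end. Treat it as a sketch under my own assumptions rather than the intended solution: the function name `run_queries` is made up, and it returns the answers in a list instead of printing them so that it is easy to check against the sample.

```python
WIDTH, HEIGHT = 256, 256


def run_queries(lines):
    """Apply SetRow/SetCol/QueryRow/QueryCol lines to a fresh board; return query answers."""
    board = [[0] * WIDTH for _ in range(HEIGHT)]
    results = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        op = parts[0]
        args = [int(p) for p in parts[1:]]
        if op == "SetRow":
            i, x = args
            board[i] = [x] * WIDTH
        elif op == "SetCol":
            j, x = args
            for row in board:
                row[j] = x
        elif op == "QueryRow":
            results.append(sum(board[args[0]]))
        elif op == "QueryCol":
            results.append(sum(row[args[0]] for row in board))
    return results


sample = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()

results = run_queries(sample)
print(results)  # [5118, 34, 1792, 3571]
```

For the actual challenge, the lines would come from the file named in the first command-line argument, e.g. by passing the open file object for `sys.argv[1]` to `run_queries` and printing each result on its own line.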
Edit: Per Martijn's comments:
total memory usage for board is about 2.1MB. By comparison, after 100 random row/column writes, matrix is 3.1MB (although it tops out there and doesn't get any bigger).
yes, global is unnecessary when modifying a global object (just don't try to assign to it).
while dispatching from a dict is good and efficient, I did not want to inflict it on someone who is "not that great on Python", especially for just four entries.
For sake of comparison, how about
time = 0
WIDTH, HEIGHT = 256, 256
INIT = 0
rows = [(time, INIT) for _ in range(WIDTH)]
cols = [(time, INIT) for _ in range(HEIGHT)]
def set_row(i, x):
    global time
    time += 1
    rows[int(i)] = (time, int(x))
def set_col(i, x):
    global time
    time += 1
    cols[int(i)] = (time, int(x))
def query_row(i):
    rt, rv = rows[int(i)]
    total = rv * WIDTH + sum(cv - rv for ct, cv in cols if ct > rt)
    print(total)
def query_col(j):
    ct, cv = cols[int(j)]
    total = cv * HEIGHT + sum(rv - cv for rt, rv in rows if rt > ct)
    print(total)
ops = {
    "SetRow": set_row,
    "SetCol": set_col,
    "QueryRow": query_row,
    "QueryCol": query_col
}
inf = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()
for line in inf:
    line = line.split()
    op = line.pop(0)
    ops[op](*line)
which only uses 4.3k of memory for rows[] and cols[].
Edit2:
using your code from above for matrix, set_row, set_col,
import sys
for n in range(256):
    set_row(n, 1)
    print("{}: {}".format(2*(n+1)-1, sys.getsizeof(matrix)))
    set_col(n, 1)
    print("{}: {}".format(2*(n+1), sys.getsizeof(matrix)))
which returns (condensed):
1: 12560
2: 49424
6: 196880
22: 786704
94: 3146000
... basically the allocated memory quadruples at each step. If I change the memory measure to include key-tuples,
def get_matrix_size():
    return sys.getsizeof(matrix) + sum(sys.getsizeof(key) for key in matrix)
it increases more smoothly, but still takes a big jump at the above points:
5 : 127.9k
6 : 287.7k
21 : 521.4k
22 : 1112.7k
60 : 1672.0k
61 : 1686.1k <-- approx expected size on your reported problem set
93 : 2121.1k
94 : 4438.2k

How to randomly sample from 4 csv files so that no more than 2/3 rows appear in order from each csv file, in Python

Hi I'm very new to python and trying to create a program that takes a random sample from a CSV file and makes a new file with some conditions. What I have done so far is probably highly over-complicated and not efficient (though it doesn't need to be).
I have 4 CSV files that contain 264 rows in total, where each full row is unique, though they all share common values in some columns.
csv1 = 72 rows, csv2 = 72 rows, csv3 = 60 rows, csv4 = 60 rows. I need to take a random sample of 160 rows which will make 4 blocks of 40, where in each block 10 must come from each csv file. The tricky part is that no more than 2 or 3 rows from the same CSV file can appear in order in the final file.
So far I have managed to take a random sample of 40 from each CSV (just using random.sample) and output them to 4 new CSV files. Then I split each csv into 4 new files each containing 10 rows so that I have each in a separate folder(1-4). So I now have 4 folders each containing 4 csv files. Now I need to combine these so that rows that came from the original CSV file don't repeat more than 2 or 3 times and the row order will be as random as possible. This is where I'm completely lost, I'm presuming that I should combine the 4 files in each folder (which I can do) and then re-sample or shuffle in a loop until the conditions are met, or something to that effect but I'm not sure how to proceed or am I going about this in the completely wrong way. Any help anyone can give me would be greatly appreciated and I can provide any further details that are necessary.
import random
import shutil

# splitter and total_condition_amount are not shown in this snippet
var_start = 1
total_condition_amount_start = 1
# Take a random sample of 40 rows from each of the four condition files
while (var_start < 5):
    with open("condition"+str(var_start)+".csv", "rb") as population1:
        conditions1 = [line for line in population1]
    random_selection1 = random.sample(conditions1, 40)
    with open("./temp/40cond"+str(var_start)+".csv", "wb") as temp_output:
        temp_output.write(b"".join(random_selection1))
    var_start = var_start + 1
# Split each 40-row sample into four 10-row files, one per block folder
while (total_condition_amount_start < total_condition_amount):
    folder_no = 1
    splitter.split(open("./temp/40cond"+str(total_condition_amount_start)+".csv", 'rb'))
    shutil.move("./temp/output_1.csv", "./temp/block"+str(folder_no)+"/output_"+str(total_condition_amount_start)+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_2.csv", "./temp/block"+str(folder_no)+"/output_"+str(total_condition_amount_start)+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_3.csv", "./temp/block"+str(folder_no)+"/output_"+str(total_condition_amount_start)+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_4.csv", "./temp/block"+str(folder_no)+"/output_"+str(total_condition_amount_start)+".csv")
    total_condition_amount_start = total_condition_amount_start + 1
You should probably try using the CSV built in lib: http://docs.python.org/3.3/library/csv.html
That way you can handle each file as a list of dictionaries, which will make your task a lot easier.
from random import randint, sample, choice
def create_random_list(length):
    return [randint(0, 100) for i in range(length)]
# This should be your list of four initial csv files
# with the 264 rows in total, read with the csv lib
lists = [create_random_list(264) for i in range(4)]
# Take a randomized sample from the lists
# (list() is needed on Python 3, where map returns an iterator)
lists = list(map(lambda x: sample(x, 40), lists))
# Add some variables to the
lists = list(map(lambda x: {'data': x, 'full_count': 0}, lists))
final = [[] for i in range(4)]
for l in final:
    prev = None
    count = 0
    while len(l) < 40:
        current = choice(lists)
        if current['full_count'] == 10 or (current is prev and count == 3):
            continue
        # Take an item from the chosen list if it hasn't been used 3 times in a
        # row or is already used 10 times. Append that item to the final list
        total_left = 40 - len(l)
        maxx = 0
        for i in lists:
            if i is not current and 10 - i['full_count'] > maxx:
                maxx = 10 - i['full_count']
        current_left = 10 - current['full_count']
        max_left = maxx + maxx/3.0
        if maxx > 3 and total_left <= max_left:
            # Make sure that in the future it can still be split in to sets of
            # max 3
            continue
        l.append(current['data'].pop())
        count += 1
        current['full_count'] += 1
        if current is not prev:
            count = 0
        prev = current
    for li in lists:
        li['full_count'] = 0
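The "shuffle in a loop until the conditions are met" idea from the question can also be written quite compactly. The sketch below is my own illustration of that approach, not a tested replacement for the code above: `build_blocks` and `longest_run` are made-up names, the sources are passed in as a dict of already-loaded rows, and runs are only checked inside each block, not across the boundary between two blocks.

```python
import random
from itertools import groupby


def longest_run(labels):
    """Length of the longest stretch of identical consecutive labels."""
    return max((len(list(group)) for _, group in groupby(labels)), default=0)


def build_blocks(sources, n_blocks=4, per_source=10, max_run=3, seed=None):
    """Return n_blocks lists of (source_name, row) with no run longer than max_run."""
    rng = random.Random(seed)
    # Sample without replacement so that no row is used in more than one block.
    picked = {name: rng.sample(rows, n_blocks * per_source)
              for name, rows in sources.items()}
    blocks = []
    for b in range(n_blocks):
        block = [(name, row)
                 for name, rows in picked.items()
                 for row in rows[b * per_source:(b + 1) * per_source]]
        # Reshuffle until no source appears more than max_run times in a row.
        while True:
            rng.shuffle(block)
            if longest_run(name for name, _ in block) <= max_run:
                break
        blocks.append(block)
    return blocks


# Stand-in data with the row counts from the question; real rows would come from csv.reader.
sources = {
    "csv1": list(range(72)),
    "csv2": list(range(72)),
    "csv3": list(range(60)),
    "csv4": list(range(60)),
}
blocks = build_blocks(sources, seed=0)
print([len(block) for block in blocks])
```

With 10 rows from each of 4 files per block, a plain shuffle satisfies a limit of 3 in a row fairly often, so the retry loop should normally finish after a handful of attempts; a much stricter limit could make it slow.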
