Generate all possible unique peptides (permutants) in Python/Biopython
I have a peptide frame of 9 amino acids (AA). I want to generate all possible peptides by replacing at most 3 AA on this frame, i.e. by replacing exactly 1, 2, or 3 AA.
The frame is CKASGFTFS, and I want to see all the mutants obtained by replacing at most 3 AA with residues from the pool of 20 AA.
The pool of 20 different AA is: A, R, N, D, E, G, C, Q, H, I, L, K, M, F, P, S, T, W, Y, V.
I am new to coding, so can someone help me with how to code this in Python or Biopython?
The output should be a list of unique sequences like the ones below:
CKASGFTFT, CTTSGFTFS, CTASGKTFS, CTASAFTWS, CTRSGFTFS, CKASEFTFS, and so on, with 1, 2, or 3 substitutions from the pool of AA and the rest of the frame unchanged.
Ok, so after my code finished, I worked the calculations backwards,
Case1, is 9c1 x 19 = 171
Case2, is 9c2 x 19 x 19 = 12,996
Case3, is 9c3 x 19 x 19 x 19 = 576,156
That's a total of 589,323 combinations.
Here is the code for all 3 cases; you can run them sequentially.
You also asked for the array to be joined into a single string, so I have updated my code to do that.
import copy
original = ['C','K','A','S','G','F','T','F','S']
possibilities = ['A','R','N','D','E','G','C','Q','H','I','L','K','M','F','P','S','T','W','Y','V']
storage=[]
counter=1
# case 1
for i in range(len(original)):
    for x in range(20):
        temp = copy.deepcopy(original)
        if temp[i] == possibilities[x]:
            pass
        else:
            temp[i] = possibilities[x]
            storage.append(''.join(temp))
            print(counter,''.join(temp))
            counter += 1
# case 2
for i in range(len(original)):
    for j in range(i+1,len(original)):
        for x in range(len(possibilities)):
            for y in range(len(possibilities)):
                temp = copy.deepcopy(original)
                if temp[i] == possibilities[x] or temp[j] == possibilities[y]:
                    pass
                else:
                    temp[i] = possibilities[x]
                    temp[j] = possibilities[y]
                    storage.append(''.join(temp))
                    print(counter,''.join(temp))
                    counter += 1
# case 3
for i in range(len(original)):
    for j in range(i+1,len(original)):
        for k in range(j+1,len(original)):
            for x in range(len(possibilities)):
                for y in range(len(possibilities)):
                    for z in range(len(possibilities)):
                        temp = copy.deepcopy(original)
                        if temp[i] == possibilities[x] or temp[j] == possibilities[y] or temp[k] == possibilities[z]:
                            pass
                        else:
                            temp[i] = possibilities[x]
                            temp[j] = possibilities[y]
                            temp[k] = possibilities[z]
                            storage.append(''.join(temp))
                            print(counter,''.join(temp))
                            counter += 1
The outputs look like this, (just the beginning and the end).
The results will also be saved to a variable named storage which is a native python list.
1 AKASGFTFS
2 RKASGFTFS
3 NKASGFTFS
4 DKASGFTFS
5 EKASGFTFS
6 GKASGFTFS
...
...
...
589318 CKASGFVVF
589319 CKASGFVVP
589320 CKASGFVVT
589321 CKASGFVVW
589322 CKASGFVVY
589323 CKASGFVVV
It takes around 10 - 20 minutes to run depending on your computer.
It will display all the combinations, skipping any substitution where a new AA is the same as the original one: at the one chosen position in case 1, at either of the 2 positions in case 2, or at any of the 3 positions in case 3.
This code both prints the sequences and stores them in a list variable, so it can be memory intensive and CPU intensive.
You might be able to reduce the memory footprint by replacing the letters with numbers, since they might take less space. You could also consider using something like pandas, or appending to a CSV file on disk.
You can iterate over the storage variable to go through the strings if you wish, like this:
for i in storage:
    print(i)
Or you can convert it to a pandas series, dataframe or write line by line directly to a csv file in storage.
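For what it's worth, the three nested-loop cases can probably be collapsed into one loop with itertools. This is only a minimal sketch, not a drop-in replacement for the code above: the function name mutants and its parameters max_subs and pool are made up for illustration. It yields the sequences lazily, so nothing is held in memory unless you collect the results yourself.

```python
from itertools import combinations, product

AMINO_ACIDS = "ARNDEGCQHILKMFPSTWYV"  # the 20-letter pool from the question


def mutants(frame, max_subs=3, pool=AMINO_ACIDS):
    """Yield every sequence that differs from `frame` at 1..max_subs positions.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    for n_subs in range(1, max_subs + 1):
        # choose which positions of the frame get replaced
        for positions in combinations(range(len(frame)), n_subs):
            # at each chosen position allow every residue except the original
            choices = [[aa for aa in pool if aa != frame[p]] for p in positions]
            for replacement in product(*choices):
                seq = list(frame)
                for p, aa in zip(positions, replacement):
                    seq[p] = aa
                yield "".join(seq)


# Single substitutions only, to keep the demo quick
single = list(mutants("CKASGFTFS", max_subs=1))
print(len(single), single[:3])
```

Because the original residue is excluded up front, every yielded sequence is distinct and no de-duplication step is needed. To write the results straight to disk, you could loop over mutants("CKASGFTFS") and write one line per sequence instead of building a list.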
Let's compute the total number of mutations that you are looking for.
Say you want to replace a single AA. Firstly, there are 9 AAs in your frame, each of which can be changed into one of 19 other AA. That's 9 * 19 = 171
If you want to change two AA, there are 9c2 = 36 combinations of AA in your frame, and 19^2 permutations of two of the pool. That gives us 36 * 19^2 = 12996
Finally, if you want to change three, there are 9c3 = 84 combinations and 19^3 permutations of three of the pool. That gives us 84 * 19^3 = 576156
Put it all together and you get 171 + 12996 + 576156 = 589323 possible mutations. Hopefully, this helps illustrate the scale of the task you are trying to accomplish!
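If you want to double-check the arithmetic, the same counts can be reproduced in a few lines. This is just a sketch of the formula above; the variable names are my own, and math.comb needs Python 3.8 or newer.

```python
from math import comb

frame_length = 9   # positions in the frame CKASGFTFS
alternatives = 19  # 20 amino acids minus the one already at a position

# k substituted positions: choose the positions, then 19 options for each
counts = {k: comb(frame_length, k) * alternatives ** k for k in (1, 2, 3)}
total = sum(counts.values())

print(counts)
print(total)
```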
Related
Optimizing loop sequence
I am trying to check whether an item from a list exists one or more times in a data frame column, and if so, then use some info of that entire row to extract some data.
The data frame has entries like this:
df =
   prefix  value  binary
---------------------------------------------------
0  30      yes    01010000101000000000000000001101
1  29      yes    01010000101001111110111110101011
2  29      no     10000000010011011011110001111011
The current code looks something like this:
list1 = []
list2 = []
for i, binary in enumerate(list_of_binary_numbers):
    print(f"Executing {i+1}")
    list1_tmp = 0
    list2_tmp = 0
    for index, row in df.iterrows():
        if binary == row["binary"][0 : len(binary)]:
            if row["value"] == "yes":
                list1_tmp += 2 ** (32 - int(row["prefix"]))
            elif row["value"] == "no":
                list2_tmp += 2 ** (32 - int(row["prefix"]))
    list1.append(list1_tmp)
    list2.append(list2_tmp)
So basically list_of_binary_numbers is a list with shortened binary numbers, and I need to check whether this shortened part of a full binary number exists in the df. That's why I do the [0 : len(binary)] so they have the same length.
List looks like this:
list_of_binary_numbers =
0  00000010011010000
1  0000001001101000100
2  000000100110101000000110
3  000000100110101000000111
4  00000010011010100010000
The issue is that the list_of_binary_numbers are roughly 150.000 items, and so is the data frame. So each main iteration takes roughly 1 sec to do, hence, this will take forever to complete. I just can't see any other good way to achieve this, so that's why I am asking for some help.
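One possible direction, offered only as a sketch under assumptions: since the shortened numbers can have at most 32 distinct lengths, the data frame can be scanned once, crediting each row to every leading substring of a queried length, after which each query is a dictionary lookup. The function name prefix_totals is made up for illustration. It only relies on indexing by column name, so it should accept a DataFrame or a plain dict of lists.

```python
from collections import defaultdict


def prefix_totals(df, prefixes):
    """Return (yes_totals, no_totals), one entry per item of `prefixes`.

    Hypothetical helper; `df` needs "binary", "value" and "prefix" columns.
    """
    lengths = sorted({len(p) for p in prefixes})
    sums = {"yes": defaultdict(int), "no": defaultdict(int)}
    # One pass over the rows instead of one pass per query
    for binary, value, prefix in zip(df["binary"], df["value"], df["prefix"]):
        if value not in sums:
            continue  # neither "yes" nor "no", as in the original if/elif
        weight = 2 ** (32 - int(prefix))
        for n in lengths:
            if n <= len(binary):
                sums[value][binary[:n]] += weight
    yes_totals = [sums["yes"].get(p, 0) for p in prefixes]
    no_totals = [sums["no"].get(p, 0) for p in prefixes]
    return yes_totals, no_totals


rows = {
    "prefix": [30, 29, 29],
    "value": ["yes", "yes", "no"],
    "binary": [
        "01010000101000000000000000001101",
        "01010000101001111110111110101011",
        "10000000010011011011110001111011",
    ],
}
list1, list2 = prefix_totals(rows, ["0101", "1", "0101000010100000"])
print(list1, list2)
```

The work is roughly (number of rows) x (number of distinct query lengths) instead of (number of rows) x (number of queries), which is where the saving would come from if the assumption about lengths holds for your data.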
Find combinations (without "of size=r") from a set with decreasing sum value using Python
(Revised for clarity 02-08-2021) This is similar to the question here: Find combinations of size r from a set with decreasing sum value This is different from the answer posted in the link above because I am looking for answers without "size r=3". I have a set (array) of numbers. I need to have the sums of combinations of the numbers sorted from largest to smallest and show the numbers from the array that were used to get the total for that row. Any number in the array can only be used once per row but all the numbers don't have to be used in each row as the total decreases. If a number is not used then zero should be used as a placeholder instead so I can create a CSV file with the columns aligned. Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1] Desired Output Example #1 format where the last number in each row is the total (sum) of the row: Beginning of list 30,25,20,15,10,5,1,106 30,25,20,15,10,5,0,105 30,25,20,15,10,0,1,101 30,25,20,15,10,0,0,100 ...(all number combinations in between) 30,0,0,0,0,0,0,30 0,25,0,0,0,5,0,30 ...(all number combinations in between) 0,0,0,15,0,0,1,16 0,0,0,15,0,0,0,15 0,0,0,0,10,5,0,15 0,0,0,0,10,0,1,11 0,0,0,0,0,5,1,6 0,0,0,0,0,5,0,5 0,0,0,0,0,0,1,1 End of list Also, duplicate totals are allowed and preferred showing different combinations that have the same total (sum) of the row. For Example #1: 30,0,0,0,0,0,0,30 0,25,0,0,0,5,0,30 For example this is one row of output based on the Input Example #1 above: 30,25,0,0,0,5,1,61 Last number in the row is the total. The total can also be the first number but the important thing is that the output list is sorted in descending order by the total. 
Input Example #2 with 5 numbers in the array: [20,15,10,5,1] Desired Output Example #2 format where the last number in each row is the total (sum) of the row: Beginning of list 20,15,10,5,1,51 20,15,10,5,0,50 20,15,10,0,1,46 20,15,10,0,0,45 ...(all number combinations in between) 20,0,10,0,0,30 0,15,10,5,0,30 ...(all number combinations in between) 0,15,0,0,1,16 0,15,0,0,0,15 0,0,10,5,0,15 0,0,10,0,1,11 0,0,10,0,0,10 0,0,0,5,1,6 0,0,0,5,0,5 0,0,0,0,1,1 End of list Input Example #1: [30,25,20,15,10,5,1] Every row of the output should show each number in the array used only once at most per row to get the total for the row. The rows must be sorted in decreasing order by the sums of the numbers used to get the total. The first output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 1 = 106 The second output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 0 = 105 The third output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 0 + 1 = 101 ...The rest of the rows would continue with the total (sum) of the row getting smaller until it reaches 1... 
The third to last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 1 = 6
The second to last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 0 = 5
The last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 0 + 1 = 1
I started with the code provided by user Divyanshu, modified with different input numbers and the () added to the last line (but I need to use all the numbers in the array instead of size=4 as shown here):
import itertools
array = [30,25,20,15,10,5,1]
size = 4
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for comb in itertools.combinations(array,size):
    answer.append(comb)
    order.append([sum(comb),number]) # Storing sum and index
    number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
So this is what I need as an Input (in this example): [30,25,20,15,10,5,1]
size=4 in the above code limits the output to 4 of the numbers in the array. If I take out size=4 I get an error. I need to use the entire array of numbers.
I can manually change size=4 to size=1 and run it, then size=2 and run it, and so on. Entering size=1 through size=7 in the code and running it (7 times in this example) to get a list of all possible combinations gives me 7 different outputs. I could then manually put the lists together, but that won't work for larger sets (arrays) of numbers.
Can I modify the code referenced above or do I need to use a different approach?
I think you could do it as follows.
Import the following:
import pandas as pd
import numpy as np
The beginning of the code as in the question:
import itertools
array = [30,25,20,15,10,5,1]
size = 4
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for comb in itertools.combinations(array,size):
    answer.append(comb)
    order.append([sum(comb),number]) # Storing sum and index
    number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
The code that I think could help to obtain the final option:
array_len = array.__len__()
# Auxiliary to place in reference to the original array
dict_array = {}
for i in range(0,array_len):
    print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Let's add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
aux.columns=array + ['total']
aux = aux.astype(int)
print(aux.head().astype(int))
   30  25  20  15  10  5  1  total
0  30  25  20  15   0  0  0     90
1  30  25  20   0  10  0  0     85
2  30  25   0  15  10  0  0     80
3  30  25  20   0   0  5  0     80
4  30  25  20   0   0  0  1     76
Now the generalization for all sizes:
import itertools
import pandas as pd
import numpy as np
array = [30,25,20,15,10,5,1]
array_len = array.__len__()
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for size in range(1,array_len+1):
    print(size)
    for comb in itertools.combinations(array,size):
        answer.append(comb)
        order.append([sum(comb),number]) # Storing sum and index
        number += 1
order.sort(reverse=True) # sorting in decreasing order
for key in order:
    print (key[0],answer[key[1]]) # key[0] is sum of combination
# Auxiliary to place in reference to the original array
dict_array = {}
for i in range(0,array_len):
    print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Let's add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
aux.columns=array + ['total']
aux = aux.astype(int)
print(aux.head().astype(int))
   30  25  20  15  10  5  1  total
0  30  25  20  15  10  5  1    106
1  30  25  20  15  10  5  0    105
2  30  25  20  15  10  0  1    101
3  30  25  20  15  10  0  0    100
4  30  25  20  15   0  5  1     96
Thanks to @RafaelValero (Rafael Valero) I was able to learn about pandas, numpy, and dataframes. I looked up options for pandas to get the desired output. Here is the final code, with some extra lines left in for reference but commented out:
import itertools
import pandas as pd
import numpy as np
array = [30,25,20,15,10,5,1]
array_len = array.__len__()
answer = [] # to store all combination
order = [] # to store order according to sum
number = 0 # index of combination
for size in range(1,array_len+1):
    # Commented out line below as it was giving extra information
    # print(size)
    for comb in itertools.combinations(array,size):
        answer.append(comb)
        order.append([sum(comb),number]) # Storing sum and index
        number += 1
order.sort(reverse=True) # sorting in decreasing order
# Commented out two lines below as it was from the original code and giving extra information
#for key in order:
#    print (key[0],answer[key[1]]) # key[0] is sum of combination
# Auxiliary to place in reference to the original array
dict_array = {}
for i in range(0,array_len):
    # Commented out line below as it was giving extra information
    # print(i)
    dict_array[array[i]]=i
# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len+1])
    for i in answer[key[1]]:
        # Commented out line below as it was giving extra information
        # print(i,dict_array[i] )
        array_zeros[0][dict_array[i]] = i
    # Let's add the total
    array_zeros[0][array_len]=key[0]
    aux.append(array_zeros[0])
# Transform into a dataframe
aux = pd.DataFrame(aux)
# This is to add the names to the columns
# for the dataframe
# Update: removed this line below as I didn't need a header
# aux.columns=array + ['total']
aux = aux.astype(int)
# Tried option below first but it was not necessary when using to_csv
# pd.set_option('display.max_rows', None)
print(aux.to_csv(index=False,header=None))
Searched references:
Similar question: Find combinations of size r from a set with decreasing sum value
Pandas references:
https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/ https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_csv.html Online compiler used: https://www.programiz.com/python-programming/online-compiler/ Output using Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]: 30,25,20,15,10,5,1,106 30,25,20,15,10,5,0,105 30,25,20,15,10,0,1,101 30,25,20,15,10,0,0,100 30,25,20,15,0,5,1,96 30,25,20,15,0,5,0,95 30,25,20,0,10,5,1,91 30,25,20,15,0,0,1,91 30,25,20,0,10,5,0,90 30,25,20,15,0,0,0,90 30,25,0,15,10,5,1,86 30,25,20,0,10,0,1,86 30,25,0,15,10,5,0,85 30,25,20,0,10,0,0,85 30,0,20,15,10,5,1,81 30,25,0,15,10,0,1,81 30,25,20,0,0,5,1,81 30,0,20,15,10,5,0,80 30,25,0,15,10,0,0,80 30,25,20,0,0,5,0,80 0,25,20,15,10,5,1,76 30,0,20,15,10,0,1,76 30,25,0,15,0,5,1,76 30,25,20,0,0,0,1,76 0,25,20,15,10,5,0,75 30,0,20,15,10,0,0,75 30,25,0,15,0,5,0,75 30,25,20,0,0,0,0,75 0,25,20,15,10,0,1,71 30,0,20,15,0,5,1,71 30,25,0,0,10,5,1,71 30,25,0,15,0,0,1,71 0,25,20,15,10,0,0,70 30,0,20,15,0,5,0,70 30,25,0,0,10,5,0,70 30,25,0,15,0,0,0,70 0,25,20,15,0,5,1,66 30,0,20,0,10,5,1,66 30,0,20,15,0,0,1,66 30,25,0,0,10,0,1,66 0,25,20,15,0,5,0,65 30,0,20,0,10,5,0,65 30,0,20,15,0,0,0,65 30,25,0,0,10,0,0,65 0,25,20,0,10,5,1,61 30,0,0,15,10,5,1,61 0,25,20,15,0,0,1,61 30,0,20,0,10,0,1,61 30,25,0,0,0,5,1,61 0,25,20,0,10,5,0,60 30,0,0,15,10,5,0,60 0,25,20,15,0,0,0,60 30,0,20,0,10,0,0,60 30,25,0,0,0,5,0,60 0,25,0,15,10,5,1,56 0,25,20,0,10,0,1,56 30,0,0,15,10,0,1,56 30,0,20,0,0,5,1,56 30,25,0,0,0,0,1,56 0,25,0,15,10,5,0,55 0,25,20,0,10,0,0,55 30,0,0,15,10,0,0,55 30,0,20,0,0,5,0,55 30,25,0,0,0,0,0,55 0,0,20,15,10,5,1,51 0,25,0,15,10,0,1,51 0,25,20,0,0,5,1,51 30,0,0,15,0,5,1,51 30,0,20,0,0,0,1,51 0,0,20,15,10,5,0,50 0,25,0,15,10,0,0,50 0,25,20,0,0,5,0,50 30,0,0,15,0,5,0,50 30,0,20,0,0,0,0,50 0,0,20,15,10,0,1,46 0,25,0,15,0,5,1,46 30,0,0,0,10,5,1,46 0,25,20,0,0,0,1,46 30,0,0,15,0,0,1,46 0,0,20,15,10,0,0,45 
0,25,0,15,0,5,0,45 30,0,0,0,10,5,0,45 0,25,20,0,0,0,0,45 30,0,0,15,0,0,0,45 0,0,20,15,0,5,1,41 0,25,0,0,10,5,1,41 0,25,0,15,0,0,1,41 30,0,0,0,10,0,1,41 0,0,20,15,0,5,0,40 0,25,0,0,10,5,0,40 0,25,0,15,0,0,0,40 30,0,0,0,10,0,0,40 0,0,20,0,10,5,1,36 0,0,20,15,0,0,1,36 0,25,0,0,10,0,1,36 30,0,0,0,0,5,1,36 0,0,20,0,10,5,0,35 0,0,20,15,0,0,0,35 0,25,0,0,10,0,0,35 30,0,0,0,0,5,0,35 0,0,0,15,10,5,1,31 0,0,20,0,10,0,1,31 0,25,0,0,0,5,1,31 30,0,0,0,0,0,1,31 0,0,0,15,10,5,0,30 0,0,20,0,10,0,0,30 0,25,0,0,0,5,0,30 30,0,0,0,0,0,0,30 0,0,0,15,10,0,1,26 0,0,20,0,0,5,1,26 0,25,0,0,0,0,1,26 0,0,0,15,10,0,0,25 0,0,20,0,0,5,0,25 0,25,0,0,0,0,0,25 0,0,0,15,0,5,1,21 0,0,20,0,0,0,1,21 0,0,0,15,0,5,0,20 0,0,20,0,0,0,0,20 0,0,0,0,10,5,1,16 0,0,0,15,0,0,1,16 0,0,0,0,10,5,0,15 0,0,0,15,0,0,0,15 0,0,0,0,10,0,1,11 0,0,0,0,10,0,0,10 0,0,0,0,0,5,1,6 0,0,0,0,0,5,0,5 0,0,0,0,0,0,1,1
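As an aside, the same kind of listing can probably be produced without pandas or numpy by treating each row as a keep/drop mask over the array. This is a minimal sketch, not the code from the answers above: the function name rows_by_descending_sum is my own, and rows with equal totals may come out in a different order than in the output shown.

```python
from itertools import product


def rows_by_descending_sum(numbers):
    """Return rows [n1_or_0, ..., nk_or_0, total], largest total first.

    Hypothetical helper; assumes the input numbers are non-zero.
    """
    rows = []
    # Each mask says, per position, whether the number is kept (1) or not (0)
    for mask in product((1, 0), repeat=len(numbers)):
        if not any(mask):
            continue  # skip the empty combination
        picked = [n if keep else 0 for n, keep in zip(numbers, mask)]
        rows.append(picked + [sum(picked)])
    # list.sort is stable, so rows with equal totals keep generation order
    rows.sort(key=lambda row: row[-1], reverse=True)
    return rows


rows = rows_by_descending_sum([30, 25, 20, 15, 10, 5, 1])
for row in rows[:4]:
    print(",".join(str(value) for value in row))
```

The number of rows doubles with every extra input number (2**n - 1 rows for n numbers), so this approach, like the pandas one, is only practical for fairly small arrays.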
Long multiplication of two numbers given as strings
I am trying to solve a problem of multiplication. I know that Python supports very large numbers and it can be done, but what I want to do is:
Enter 2 numbers as strings.
Multiply those two numbers in the same manner as we used to do in school.
The basic idea is to convert the code given in the link below to Python code, but I am not very good at C++/Java. What I want to do is to understand the code given in the link below and apply it for Python.
https://www.geeksforgeeks.org/multiply-large-numbers-represented-as-strings/
I am stuck at the addition point. I want to do it like in the image given below.
So I have made a list which stores the values of ith digit of first number to jth digit of second. Please help me to solve the addition part.
def mul(upper_no,lower_no):
    upper_len=len(upper_no)
    lower_len=len(lower_no)
    list_to_add=[] #saves numbers in queue to add in the end
    for lower_digit in range(lower_len-1,-1,-1):
        q='' #A queue to store step by step multiplication of numbers
        carry=0
        for upper_digit in range(upper_len-1,-1,-1):
            num2=int(lower_no[lower_digit])
            num1=int(upper_no[upper_digit])
            print(num2,num1)
            x=(num2*num1)+carry
            if upper_digit==0:
                q=str(x)+q
            else:
                if x>9:
                    q=str(x%10)+q
                    carry=x//10
                else:
                    q=str(x%10)+q
                    carry=0
            num=x%10
        print(q)
        list_to_add.append(int(''.join(q)))
    print(list_to_add)
mul('234','567')
I have [1638,1404,1170] as a result for the function call mul('234','567').
I am supposed to add these numbers but I am stuck, because these numbers have to be shifted for each list. For example, 1638 is supposed to be added as 16380 + 1404 with 6 aligning with 4, 3 with 0 and 8 with 4 and so on. Like:
    1638
   1404x
  1170xx
--------
  132678
--------
I think this might help. I've added a place variable to keep track of what power of 10 each intermediate value should be multiplied by, and used the itertools.accumulate function to produce the intermediate accumulated sums that doing so produces (and you want to show).
Note I have also reformatted your code so it closely follows PEP 8 - Style Guide for Python Code in an effort to make it more readable.
from itertools import accumulate
import operator

def mul(upper_no, lower_no):
    upper_len = len(upper_no)
    lower_len = len(lower_no)
    list_to_add = []  # Saves numbers in queue to add in the end
    place = 0
    for lower_digit in range(lower_len-1, -1, -1):
        q = ''  # A queue to store step by step multiplication of numbers
        carry = 0
        for upper_digit in range(upper_len-1, -1, -1):
            num2 = int(lower_no[lower_digit])
            num1 = int(upper_no[upper_digit])
            print(num2, num1)
            x = (num2*num1) + carry
            if upper_digit == 0:
                q = str(x) + q
            else:
                if x>9:
                    q = str(x%10) + q
                    carry = x//10
                else:
                    q = str(x%10) + q
                    carry = 0
            num = x%10
        print(q)
        list_to_add.append(int(''.join(q)) * (10**place))
        place += 1
    print(list_to_add)
    print(list(accumulate(list_to_add, operator.add)))

mul('234', '567')
Output:
7 4
7 3
7 2
1638
6 4
6 3
6 2
1404
5 4
5 3
5 2
1170
[1638, 14040, 117000]
[1638, 15678, 132678]
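If the goal is to avoid Python's big integers altogether, including for the final addition, the shifting can be folded into a digit array: the product of digit i of one number and digit j of the other lands in position i + j. The following is a minimal sketch of that schoolbook scheme, not a translation of the linked article; the function name multiply_strings is made up, and it assumes both inputs are non-negative decimal strings.

```python
def multiply_strings(upper_no, lower_no):
    """Multiply two non-negative decimal strings digit by digit."""
    # result[k] holds the digit for 10**k; the product has at most
    # len(upper_no) + len(lower_no) digits
    result = [0] * (len(upper_no) + len(lower_no))
    for i, lower_digit in enumerate(reversed(lower_no)):
        carry = 0
        for j, upper_digit in enumerate(reversed(upper_no)):
            total = result[i + j] + int(lower_digit) * int(upper_digit) + carry
            result[i + j] = total % 10
            carry = total // 10
        # the leftover carry goes one place beyond the last upper digit
        result[i + len(upper_no)] += carry
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"


print(multiply_strings("234", "567"))
```

Here the offset i plays the role of the place variable: each partial product is added directly into the running total at its shifted position, so no separate list of intermediate numbers is needed.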
Query Board challenge on Python, need some pointers
So, I have this challenge on CodeEval, but I don't seem to know where to start, so I need some pointers (and answers if you can) to help me figure out this challenge.
DESCRIPTION:
There is a board (matrix). Every cell of the board contains one integer, which is 0 initially.
The following operations can be applied to the Query Board:
SetRow i x: it means that all values in the cells on row "i" have been changed to value "x" after this operation.
SetCol j x: it means that all values in the cells on column "j" have been changed to value "x" after this operation.
QueryRow i: it means that you should output the sum of values on row "i".
QueryCol j: it means that you should output the sum of values on column "j".
The board's dimensions are 256x256
i and j are integers from 0 to 255
x is an integer from 0 to 31
INPUT SAMPLE:
Your program should accept as its first argument a path to a filename. Each line in this file contains an operation of a query. E.g.
SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2
OUTPUT SAMPLE:
For each query, output the answer of the query. E.g.
5118
34
1792
3571
I'm not that great at Python, but this challenge is pretty interesting, although I didn't have any clue how to solve it. So, I need some help from you guys. Thanks!
You could use a sparse matrix for this, addressed by (row, col) tuples as keys in a dictionary, to save memory. 64k cells is a big list otherwise (2MB+ on a 64-bit system):
matrix = {}
This is way more efficient, as the challenge is unlikely to set values for all rows and columns on the board.
Setting a column or row is then:
def set_col(col, x):
    for i in range(256):
        matrix[i, col] = x

def set_row(row, x):
    for i in range(256):
        matrix[row, i] = x
and summing a row or column is then:
def get_col(col):
    return sum(matrix.get((i, col), 0) for i in range(256))

def get_row(row):
    return sum(matrix.get((row, i), 0) for i in range(256))
WIDTH, HEIGHT = 256, 256
board = [[0] * WIDTH for i in range(HEIGHT)]

def set_row(i, x):
    global board
    board[i] = [x]*WIDTH
... implement each function, then parse each line of input to decide which function to call,
for line in inf:
    dat = line.split()
    if dat[0] == "SetRow":
        set_row(int(dat[1]), int(dat[2]))
    elif ...
Edit: Per Martijn's comments:
total memory usage for board is about 2.1MB. By comparison, after 100 random row/column writes, matrix is 3.1MB (although it tops out there and doesn't get any bigger).
yes, global is unnecessary when modifying a global object (just don't try to assign to it).
while dispatching from a dict is good and efficient, I did not want to inflict it on someone who is "not that great on Python", especially for just four entries.
For sake of comparison, how about
time = 0
WIDTH, HEIGHT = 256, 256
INIT = 0
rows = [(time, INIT) for _ in range(WIDTH)]
cols = [(time, INIT) for _ in range(HEIGHT)]

def set_row(i, x):
    global time
    time += 1
    rows[int(i)] = (time, int(x))

def set_col(i, x):
    global time
    time += 1
    cols[int(i)] = (time, int(x))

def query_row(i):
    rt, rv = rows[int(i)]
    total = rv * WIDTH + sum(cv - rv for ct, cv in cols if ct > rt)
    print(total)

def query_col(j):
    ct, cv = cols[int(j)]
    total = cv * HEIGHT + sum(rv - cv for rt, rv in rows if rt > ct)
    print(total)

ops = {
    "SetRow": set_row,
    "SetCol": set_col,
    "QueryRow": query_row,
    "QueryCol": query_col
}

inf = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()

for line in inf:
    line = line.split()
    op = line.pop(0)
    ops[op](*line)
which only uses 4.3k of memory for rows[] and cols[].
Edit2: using your code from above for matrix, set_row, set_col,
import sys
for n in range(256):
    set_row(n, 1)
    print("{}: {}".format(2*(n+1)-1, sys.getsizeof(matrix)))
    set_col(n, 1)
    print("{}: {}".format(2*(n+1), sys.getsizeof(matrix)))
which returns (condensed:)
1: 12560
2: 49424
6: 196880
22: 786704
94: 3146000
...
basically the allocated memory quadruples at each step.
If I change the memory measure to include key-tuples,
def get_matrix_size():
    return sys.getsizeof(matrix) + sum(sys.getsizeof(key) for key in matrix)
it increases more smoothly, but still takes a big jump at the above points:
5 : 127.9k
6 : 287.7k
21 : 521.4k
22 : 1112.7k
60 : 1672.0k
61 : 1686.1k <-- approx expected size on your reported problem set
93 : 2121.1k
94 : 4438.2k
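For completeness, here is a minimal sketch of the plain list-of-lists approach filled in end to end, so the sample from the question can be checked. The function name run_queries is made up, and it takes an iterable of lines rather than a file path; opening the file named by the first argument is left out.

```python
SIZE = 256  # the board is SIZE x SIZE, per the challenge description


def run_queries(lines):
    """Apply SetRow/SetCol/QueryRow/QueryCol commands; return query answers."""
    board = [[0] * SIZE for _ in range(SIZE)]
    answers = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        op, args = parts[0], [int(p) for p in parts[1:]]
        if op == "SetRow":
            i, x = args
            board[i] = [x] * SIZE
        elif op == "SetCol":
            j, x = args
            for row in board:
                row[j] = x
        elif op == "QueryRow":
            answers.append(sum(board[args[0]]))
        elif op == "QueryCol":
            answers.append(sum(row[args[0]] for row in board))
        else:
            raise ValueError("unknown operation: " + op)
    return answers


sample = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()

print(run_queries(sample))  # prints [5118, 34, 1792, 3571]
```

Every operation touches at most 256 cells, so this brute-force version should be fast enough for the challenge; the timestamp version above is mainly interesting for its smaller memory use.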
How to randomly sample from 4 csv files so that no more than 2/3 rows appear in order from each csv file, in Python
Hi, I'm very new to Python and trying to create a program that takes a random sample from a CSV file and makes a new file with some conditions. What I have done so far is probably highly over-complicated and not efficient (though it doesn't need to be).
I have 4 CSV files that contain 264 rows in total, where each full row is unique, though they all share common values in some columns.
csv1 = 72 rows, csv2 = 72 rows, csv3 = 60 rows, csv4 = 60 rows.
I need to take a random sample of 160 rows which will make 4 blocks of 40, where in each block 10 must come from each CSV file. The tricky part is that no more than 2 or 3 rows from the same CSV file can appear in order in the final file.
So far I have managed to take a random sample of 40 from each CSV (just using random.sample) and output them to 4 new CSV files. Then I split each CSV into 4 new files each containing 10 rows so that I have each in a separate folder (1-4). So I now have 4 folders each containing 4 CSV files. Now I need to combine these so that rows that came from the same original CSV file don't repeat more than 2 or 3 times and the row order will be as random as possible.
This is where I'm completely lost. I'm presuming that I should combine the 4 files in each folder (which I can do) and then re-sample or shuffle in a loop until the conditions are met, or something to that effect, but I'm not sure how to proceed, or am I going about this in the completely wrong way?
Any help anyone can give me would be greatly appreciated and I can provide any further details that are necessary.
var_start = 1
total_condition_amount_start = 1
while (var_start < 5):
    with open("condition"+`var_start`+".csv", "rb") as population1:
        conditions1 = [line for line in population1]
        random_selection1 = random.sample(conditions1, 40)
        with open("./temp/40cond"+`var_start`+".csv", "wb") as temp_output:
            temp_output.write("".join(random_selection1))
    var_start = var_start + 1
while (total_condition_amount_start < total_condition_amount):
    folder_no = 1
    splitter.split(open("./temp/40cond"+`total_condition_amount_start`+".csv", 'rb'));
    shutil.move("./temp/output_1.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_2.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_3.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_4.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    total_condition_amount_start = total_condition_amount_start + 1
You should probably try using the CSV built-in lib: http://docs.python.org/3.3/library/csv.html
That way you can handle each file as a list of dictionaries, which will make your task a lot easier.
from random import randint, sample, choice

def create_random_list(length):
    return [randint(0, 100) for i in range(length)]

# This should be your list of four initial csv files
# with the 264 rows in total, read with the csv lib
lists = [create_random_list(264) for i in range(4)]

# Take a randomized sample from the lists
# (list comprehensions instead of map, so that choice() also works on Python 3)
lists = [sample(x, 40) for x in lists]

# Add some variables to the lists
lists = [{'data': x, 'full_count': 0} for x in lists]

final = [[] for i in range(4)]

for l in final:
    prev = None
    count = 0
    while len(l) < 40:
        current = choice(lists)
        if current['full_count'] == 10 or (current is prev and count == 3):
            continue
        # Take an item from the chosen list if it hasn't been used 3 times in a
        # row or is already used 10 times. Append that item to the final list
        total_left = 40 - len(l)
        maxx = 0
        for i in lists:
            if i is not current and 10 - i['full_count'] > maxx:
                maxx = 10 - i['full_count']
        current_left = 10 - current['full_count']
        max_left = maxx + maxx/3.0
        if maxx > 3 and total_left <= max_left:
            # Make sure that in the future it can still be split into sets of
            # max 3
            continue
        l.append(current['data'].pop())
        count += 1
        current['full_count'] += 1
        if current is not prev:
            count = 0
        prev = current
    for li in lists:
        li['full_count'] = 0
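The "shuffle in a loop until the conditions are met" idea from the question is also workable, since a random ordering of 10 rows from each of 4 files usually has no run longer than 3. Below is a minimal sketch under stated assumptions: the rows of each file are already loaded into a list (for example with the csv module), and the names longest_run and build_blocks, along with all their parameters, are made up for illustration.

```python
import random


def longest_run(labels):
    """Length of the longest stretch of identical consecutive labels."""
    best = run = 0
    previous = object()  # sentinel that equals no label
    for label in labels:
        run = run + 1 if label == previous else 1
        best = max(best, run)
        previous = label
    return best


def build_blocks(sources, n_blocks=4, per_source=10, max_run=3,
                 max_tries=10000, rng=random):
    """Return n_blocks lists of (source_index, row) with no long runs."""
    # Sample once per source so that no row is used in two blocks
    picks = [rng.sample(rows, n_blocks * per_source) for rows in sources]
    blocks, ordered = [], []
    for b in range(n_blocks):
        block = [(label, row)
                 for label, chosen in enumerate(picks)
                 for row in chosen[b * per_source:(b + 1) * per_source]]
        # The end of the previous block counts towards runs as well
        tail = [label for label, _ in ordered[-max_run:]]
        for _ in range(max_tries):
            rng.shuffle(block)
            if longest_run(tail + [label for label, _ in block]) <= max_run:
                break
        else:
            raise RuntimeError("no valid ordering found; raise max_tries")
        blocks.append(block)
        ordered.extend(block)
    return blocks


# Stand-in data with the row counts from the question (72, 72, 60, 60)
sources = [["csv%d_row%d" % (n + 1, i) for i in range(size)]
           for n, size in enumerate([72, 72, 60, 60])]
blocks = build_blocks(sources, rng=random.Random(0))
print([len(block) for block in blocks])
```

Set max_run=2 if "no more than 2 in a row" is the rule you settle on; the loop will just need more attempts on average.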