Optimizing loop sequence - python

I am trying to check whether an item from a list exists one or more times in a data frame column, and if so, then use some info of that entire row to extract some data.
The data frame has entries like this:
df =
   prefix value                            binary
0      30   yes  01010000101000000000000000001101
1      29   yes  01010000101001111110111110101011
2      29    no  10000000010011011011110001111011
The current code looks something like this:
list1 = []
list2 = []

for i, binary in enumerate(list_of_binary_numbers):
    print(f"Executing {i+1}")
    list1_tmp = 0
    list2_tmp = 0
    for index, row in df.iterrows():
        if binary == row["binary"][0 : len(binary)]:
            if row["value"] == "yes":
                list1_tmp += 2 ** (32 - int(row["prefix"]))
            elif row["value"] == "no":
                list2_tmp += 2 ** (32 - int(row["prefix"]))
    list1.append(list1_tmp)
    list2.append(list2_tmp)
So basically list_of_binary_numbers is a list with shortened binary numbers, and I need to check whether this shortened part of a full binary number exists in the df. That's why I do the [0 : len(binary)] so they have the same length.
List looks like this:
list_of_binary_numbers =
0 00000010011010000
1 0000001001101000100
2 000000100110101000000110
3 000000100110101000000111
4 00000010011010100010000
The issue is that list_of_binary_numbers has roughly 150,000 items, and so does the data frame. Each main iteration takes roughly 1 second, so this will take forever to complete.
I just can't see any other good way to achieve this, so that's why I am asking for some help.
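For reference, here is a rough sketch (my own, not tested against the real data) of how the inner iterrows() loop could be replaced with vectorized pandas operations; the column names follow the example above, and str.startswith does the same prefix comparison as the [0 : len(binary)] slice:

weights = 2 ** (32 - df["prefix"].astype(int))   # per-row contribution
is_yes = df["value"] == "yes"
is_no = df["value"] == "no"

list1 = []
list2 = []
for binary in list_of_binary_numbers:
    mask = df["binary"].str.startswith(binary)   # vectorized prefix match
    list1.append(int(weights[mask & is_yes].sum()))
    list2.append(int(weights[mask & is_no].sum()))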

Related

Generate all possible unique peptides (permutants) in Python/Biopython

I have a scenario in which I have a peptide frame of 9 AA. I want to generate all possible peptides by replacing a maximum of 3 AA on this frame, i.e. by replacing only 1, 2 or 3 AA.
The frame is CKASGFTFS and I want to see all the mutants obtained by replacing a maximum of 3 AA from the pool of 20 AA.
We have a pool of 20 different AA (A,R,N,D,E,G,C,Q,H,I,L,K,M,F,P,S,T,W,Y,V).
I am new to coding, so can someone help me out with how to code this in Python or Biopython?
The output is supposed to be a list of unique sequences like below:
CKASGFTFT, CTTSGFTFS, CTASGKTFS, CTASAFTWS, CTRSGFTFS, CKASEFTFS ... and so on, getting 1, 2, or 3 substitutions from the pool of AA without changing the rest of the existing frame.
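For what it's worth, here is a compact sketch (not from the question or the answers below) of the same task using only itertools: the positions to change are chosen with combinations, the replacement letters with product, and any choice where a replacement equals the original residue is skipped so that only true substitutions remain.

from itertools import combinations, product

frame = "CKASGFTFS"
pool = "ARNDEGCQHILKMFPSTWYV"

mutants = set()
for n in (1, 2, 3):
    for positions in combinations(range(len(frame)), n):
        for letters in product(pool, repeat=n):
            # skip choices where a "replacement" equals the original residue
            if any(frame[p] == c for p, c in zip(positions, letters)):
                continue
            seq = list(frame)
            for p, c in zip(positions, letters):
                seq[p] = c
            mutants.add("".join(seq))

print(len(mutants))  # should match the 589,323 total worked out in the answers below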
OK, so after my code finished, I worked the calculations backwards:
Case1, is 9c1 x 19 = 171
Case2, is 9c2 x 19 x 19 = 12,996
Case3, is 9c3 x 19 x 19 x 19 = 576,156
That's a total of 589,323 combinations.
Here is the code for all 3 cases; you can run them sequentially.
You also requested to join the array into a single string, and I have updated my code to reflect that.
import copy

original = ['C','K','A','S','G','F','T','F','S']
possibilities = ['A','R','N','D','E','G','C','Q','H','I','L','K','M','F','P','S','T','W','Y','V']

storage = []
counter = 1

# case 1
for i in range(len(original)):
    for x in range(20):
        temp = copy.deepcopy(original)
        if temp[i] == possibilities[x]:
            pass
        else:
            temp[i] = possibilities[x]
            storage.append(''.join(temp))
            print(counter, ''.join(temp))
            counter += 1

# case 2
for i in range(len(original)):
    for j in range(i+1, len(original)):
        for x in range(len(possibilities)):
            for y in range(len(possibilities)):
                temp = copy.deepcopy(original)
                if temp[i] == possibilities[x] or temp[j] == possibilities[y]:
                    pass
                else:
                    temp[i] = possibilities[x]
                    temp[j] = possibilities[y]
                    storage.append(''.join(temp))
                    print(counter, ''.join(temp))
                    counter += 1

# case 3
for i in range(len(original)):
    for j in range(i+1, len(original)):
        for k in range(j+1, len(original)):
            for x in range(len(possibilities)):
                for y in range(len(possibilities)):
                    for z in range(len(possibilities)):
                        temp = copy.deepcopy(original)
                        if temp[i] == possibilities[x] or temp[j] == possibilities[y] or temp[k] == possibilities[z]:
                            pass
                        else:
                            temp[i] = possibilities[x]
                            temp[j] = possibilities[y]
                            temp[k] = possibilities[z]
                            storage.append(''.join(temp))
                            print(counter, ''.join(temp))
                            counter += 1
The outputs look like this, (just the beginning and the end).
The results will also be saved to a variable named storage which is a native python list.
1 AKASGFTFS
2 RKASGFTFS
3 NKASGFTFS
4 DKASGFTFS
5 EKASGFTFS
6 GKASGFTFS
...
...
...
589318 CKASGFVVF
589319 CKASGFVVP
589320 CKASGFVVT
589321 CKASGFVVW
589322 CKASGFVVY
589323 CKASGFVVV
It takes around 10 - 20 minutes to run depending on your computer.
It will display all the combinations, skipping a candidate whenever the chosen replacement AA is the same as the original at any of the positions being changed (1 position in case 1, 2 in case 2, 3 in case 3).
This code both prints the sequences and stores them in a list variable, so it can be memory intensive as well as CPU intensive.
You could reduce the memory footprint by replacing the letters with numbers, since they might take less space; you could even consider using something like pandas or appending to a CSV file on disk.
You can iterate over the storage variable to go through the strings if you wish, like this.
for i in storage:
    print(i)
Or you can convert it to a pandas Series or DataFrame, or write it line by line directly to a CSV file on disk.
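For example, something along these lines (the filename is just a placeholder) would write the whole list out as a CSV via pandas:

import pandas as pd

pd.Series(storage).to_csv("mutants.csv", index=False, header=False)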
Let's compute the total number of mutations that you are looking for.
Say you want to replace a single AA. Firstly, there are 9 AAs in your frame, each of which can be changed into one of 19 other AA. That's 9 * 19 = 171
If you want to change two AA, there are 9c2 = 36 combinations of AA in your frame, and 19^2 permutations of two of the pool. That gives us 36 * 19^2 = 12996
Finally, if you want to change three, there are 9c3 = 84 combinations and 19^3 permutations of three of the pool. That gives us 84 * 19^3 = 576156
Put it all together and you get 171 + 12996 + 576156 = 589323 possible mutations. Hopefully, this helps illustrate the scale of the task you are trying to accomplish!
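As a quick sanity check of those numbers (not part of the original answer; math.comb needs Python 3.8+):

from math import comb

total = sum(comb(9, k) * 19**k for k in (1, 2, 3))
print(total)  # 589323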

Long multiplication of two numbers given as strings

I am trying to solve a multiplication problem. I know that Python supports very large numbers natively and this could be done directly, but what I want to do is:
Enter the two numbers as strings.
Multiply those two numbers in the same manner as we were taught in school.
The basic idea is to convert the code given in the link below to Python, but I am not very good at C++/Java. I want to understand that code and apply it in Python.
https://www.geeksforgeeks.org/multiply-large-numbers-represented-as-strings/
I am stuck at the addition point.
I want to do it like in the layout shown further down.
So I have made a list which stores the partial products (each digit of the second number multiplied by the whole first number). Please help me solve the addition part.
def mul(upper_no, lower_no):
    upper_len = len(upper_no)
    lower_len = len(lower_no)
    list_to_add = []  # saves numbers in queue to add in the end
    for lower_digit in range(lower_len-1, -1, -1):
        q = ''  # A queue to store step by step multiplication of numbers
        carry = 0
        for upper_digit in range(upper_len-1, -1, -1):
            num2 = int(lower_no[lower_digit])
            num1 = int(upper_no[upper_digit])
            print(num2, num1)
            x = (num2*num1) + carry
            if upper_digit == 0:
                q = str(x) + q
            else:
                if x > 9:
                    q = str(x%10) + q
                    carry = x//10
                else:
                    q = str(x%10) + q
                    carry = 0
            num = x%10
        print(q)
        list_to_add.append(int(''.join(q)))
    print(list_to_add)

mul('234', '567')
I have [1638, 1404, 1170] as the result of the function call mul('234', '567'). I am supposed to add these numbers, but I am stuck because each partial product has to be shifted one more place to the left than the previous one before adding, like:
   1638
  1404x
 1170xx
--------
 132678
--------
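A minimal sketch of that addition step (mine, not the asker's code): shifting partial product i by i places is the same as multiplying it by 10**i before summing.

partials = [1638, 1404, 1170]
total = sum(p * 10**i for i, p in enumerate(partials))
print(total)  # 132678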
I think this might help. I've added a place variable to keep track of what power of 10 each intermediate value should be multiplied by, and used itertools.accumulate to produce the intermediate accumulated sums you want to show.
Note I have also reformatted your code so it more closely follows PEP 8 - Style Guide for Python Code, in an effort to make it more readable.
from itertools import accumulate
import operator


def mul(upper_no, lower_no):
    upper_len = len(upper_no)
    lower_len = len(lower_no)
    list_to_add = []  # Saves numbers in queue to add in the end
    place = 0
    for lower_digit in range(lower_len-1, -1, -1):
        q = ''  # A queue to store step by step multiplication of numbers
        carry = 0
        for upper_digit in range(upper_len-1, -1, -1):
            num2 = int(lower_no[lower_digit])
            num1 = int(upper_no[upper_digit])
            print(num2, num1)
            x = (num2*num1) + carry
            if upper_digit == 0:
                q = str(x) + q
            else:
                if x > 9:
                    q = str(x%10) + q
                    carry = x//10
                else:
                    q = str(x%10) + q
                    carry = 0
            num = x%10
        print(q)
        list_to_add.append(int(''.join(q)) * (10**place))
        place += 1
    print(list_to_add)
    print(list(accumulate(list_to_add, operator.add)))


mul('234', '567')
Output:
7 4
7 3
7 2
1638
6 4
6 3
6 2
1404
5 4
5 3
5 2
1170
[1638, 14040, 117000]
[1638, 15678, 132678]
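As a quick check (not part of the answer), the last accumulated value matches ordinary integer multiplication:

from itertools import accumulate
import operator

assert list(accumulate([1638, 14040, 117000], operator.add))[-1] == 234 * 567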

Find specific Row of Data from Pandas Dataframe in While Loop

I am trying to take a CSV file and read it as a pandas DataFrame.
This DataFrame contains rows of numbers in four columns.
I want to pick a specific row of data from the DataFrame.
In a while loop, I want to select a random row from the DataFrame and compare it to the row that I picked.
I want it to continue running through the while loop until that random row is 100% equal to the row I picked prior.
Then I want the while loop to break, and I want it to have counted how many tries it took to find the matching row.
Here's what I have so far:
This is an example of the Dataframe:
   A  B   C   D
1  2  7  12  14
2  4  5  11  23
3  4  6  14  20
4  4  7  13  50
5  9  6  14  35
Here is an example of my efforts:
import time
import pandas as pd

then = time.time()
count = 0
df = pd.read_csv('Get_Numbers.csv')
df.columns = ['A', 'B', 'C', 'D']

while True:
    df_elements = df.sample(n=1)
    random_row = df_elements
    print(random_row)
    find_this_row = df['A','B','C','D' == '4','7','13,'50']
    print(find_this_row)
    if find_this_row != random_row:
        count += 1
    else:
        break

print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")
The above code gives an obvious error... but I have tried so many different versions now of finding the find_this_row numbers that I just don't know what to do anymore, so I left this attempt in.
What I would like to avoid is using the specific index of the row I am trying to find; I would rather find it using just the values.
I am using df_elements = df.sample(n=1) to select a row at random. This was to avoid using random.choice as I was not sure if that would work or which way is more time/memory efficient, but I'm open to advice on that as well.
In my mind it seems simple, randomly select a row of data, if it doesn't match the row of data that I want, keep randomly selecting rows of data until it does match. But I can't seem to execute it.
Any help is EXTREMELY Appreciated!
You can use values, which returns an np.ndarray of shape (1, 2) here; use values[0] to get just the 1D array.
Then compare the arrays with any():
import time
import pandas as pd

then = time.time()

df = pd.DataFrame(data={'A': [1, 2, 3],
                        'B': [8, 9, 10]})
find_this_row = [2, 9]
print("Looking for: {}".format(find_this_row))

count = 0
while True:
    random_row = df.sample(n=1).values[0]
    print(random_row)
    if any(find_this_row != random_row):
        count += 1
    else:
        break

print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")
How about using values?
values will return you an array of the row's values, and then you can compare two such arrays easily.
Comparing them with == returns an array of True and False values, element by element, and you can then check whether all of the returned values are True.
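A small, self-contained illustration of that idea, using the example data from the question:

import pandas as pd

df = pd.DataFrame({'A': [2, 4, 4, 4, 9],
                   'B': [7, 5, 6, 7, 6],
                   'C': [12, 11, 14, 13, 14],
                   'D': [14, 23, 20, 50, 35]})

target = df.iloc[3].values            # the row [4, 7, 13, 50]
candidate = df.sample(n=1).values[0]  # a randomly sampled row
print((candidate == target).all())    # True only when every column matches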
Here's a method that tests one row at a time. We check if the values of the chosen row are equal to the values of the sampled DataFrame. We require that they all match.
row = df.sample(1)
counter = 0
not_a_match = True

while not_a_match:
    not_a_match = ~(df.sample(n=1).values == row.values).all()
    counter += 1

print(f'It took {counter} tries and the numbers were\n{row}')
# It took 9 tries and the numbers were
#    A  B   C   D
# 4  4  7  13  50
If you want to go a little faster, you can select one row and then sample the DataFrame with replacement many times. You can then check for the first time a sampled row equals your chosen row, which gives you how many 'tries' it would have taken in a while loop, but in much less time. The loop protects against the unlikely case that no match is found, given that it is sampling with replacement.
row = df.sample(1)
n = 0
none_match = True
k = 10  # Increase to check more matches at once.

while none_match:
    matches = (df.sample(n=len(df)*k, replace=True).values == row.values).all(1)
    none_match = ~matches.any()   # Determine if none still match
    n += k*len(df)*none_match     # Only increment if none match

n = n + matches.argmax() + 1
print(f'It took {n} tries and the numbers were\n{row}')
# It took 3 tries and the numbers were
#    A  B   C   D
# 4  4  7  13  50
A couple of hints first. This line does not work for me:
find_this_row = df['A','B','C','D' == '4','7','13,'50']
for two reasons:
there is a missing ' after '13
df is a DataFrame, so indexing it with keys like this is not supported:
df['A','B','C','D' ...
Either use a list of keys to return a DataFrame:
df[['A','B','C','D']]
or a single key to get a Series:
df['A']
Since you need the whole row with multiple columns do this:
df2.iloc[4].values
array(['4', '7', '13', '50'], dtype=object)
Do the same with your sample row:
df2.sample(n=1).values
Comparison between rows needs to be done for all() elements/columns:
df2.sample(n=1).values == df2.iloc[4].values
array([[ True, False, False, False]])
with adding .all() like the following:
(df2.sample(n=1).values == df2.iloc[4].values).all()
which returns
True/False
All together:
import time
import pandas as pd

then = time.time()
count = 0

while True:
    random_row = df2.sample(n=1).values
    find_this_row = df2.iloc[4].values
    if (random_row == find_this_row).all() == False:
        count += 1
    else:
        break

print("You found the correct numbers! And it only took " + str(count) + " tries to get there! Your numbers were: " + str(find_this_row))
now = time.time()
print("It took: ", now-then, " seconds")

Read specific Bits from bitstring.BitArray

I have a BitArray and want to read from a certain position to another position.
I have the int variable length in a for loop, so for example I have:
length = 2
and my BitArray looks something like:
msgstr = bitstring.BitArray('0b11110011001111110')
I then want to read the first two bits and convert them into an int, so that I have:
id == 3
And for the next round, when length has changed in value, it should start from the third bit, and so on.
id = bitstring.BitArray()
m = 0
while 5 != m:
    #############################################
    # length changes in value in this part of the code
    #############################################
    x = 0
    if m == 0:
        while length != x:
            id.append = msgstr[x]  # msgstr is the BitArray that needs to be read
            x = x + 1
    m = m + 1
What you want here is called slicing.
for i in range(0, len(msgstr), length):
    print(msgstr[i:i+length].uint)
This code will get you what you are asking for. It will take the first two bits and convert them into an int, then will take the third and fourth bits and convert them to an int, etc.
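A small, runnable example of that slicing (the bit pattern here is just for illustration):

import bitstring

msgstr = bitstring.BitArray('0b11110011')
print(msgstr[0:2].uint)  # 3 -> first two bits are '11'
print(msgstr[4:6].uint)  # 0 -> bits five and six are '00'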

How to randomly sample from 4 csv files so that no more than 2/3 rows appear in order from each csv file, in Python

Hi, I'm very new to Python and trying to create a program that takes a random sample from a CSV file and makes a new file with some conditions. What I have done so far is probably highly over-complicated and not efficient (though it doesn't need to be).
I have 4 CSV files that contain 264 rows in total, where each full row is unique, though they all share common values in some columns.
csv1 = 72 rows, csv2 = 72 rows, csv3 = 60 rows, csv4 = 60 rows. I need to take a random sample of 160 rows which will make 4 blocks of 40, where in each block 10 rows must come from each CSV file. The tricky part is that no more than 2 or 3 rows from the same CSV file can appear in order in the final file.
So far I have managed to take a random sample of 40 rows from each CSV (just using random.sample) and output them to 4 new CSV files. Then I split each of those into 4 new files of 10 rows each, so that I have each in a separate folder (1-4). So I now have 4 folders, each containing 4 CSV files.
Now I need to combine these so that rows that came from the same original CSV file don't repeat more than 2 or 3 times in a row, and the row order is as random as possible. This is where I'm completely lost. I'm presuming that I should combine the 4 files in each folder (which I can do) and then re-sample or shuffle in a loop until the conditions are met, or something to that effect, but I'm not sure how to proceed. Or am I going about this in completely the wrong way? Any help anyone can give me would be greatly appreciated, and I can provide any further details that are necessary.
import random
import shutil

var_start = 1
total_condition_amount_start = 1

while var_start < 5:
    with open("condition" + str(var_start) + ".csv", "rb") as population1:
        conditions1 = [line for line in population1]
        random_selection1 = random.sample(conditions1, 40)
        with open("./temp/40cond" + str(var_start) + ".csv", "wb") as temp_output:
            temp_output.write("".join(random_selection1))
    var_start = var_start + 1

while total_condition_amount_start < total_condition_amount:
    folder_no = 1
    splitter.split(open("./temp/40cond" + str(total_condition_amount_start) + ".csv", "rb"))
    shutil.move("./temp/output_1.csv", "./temp/block" + str(folder_no) + "/output_" + str(total_condition_amount_start) + ".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_2.csv", "./temp/block" + str(folder_no) + "/output_" + str(total_condition_amount_start) + ".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_3.csv", "./temp/block" + str(folder_no) + "/output_" + str(total_condition_amount_start) + ".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_4.csv", "./temp/block" + str(folder_no) + "/output_" + str(total_condition_amount_start) + ".csv")
    total_condition_amount_start = total_condition_amount_start + 1
You should probably try using the built-in csv lib: http://docs.python.org/3.3/library/csv.html
That way you can handle each file as a list of dictionaries, which will make your task a lot easier.
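For instance, something along these lines (the filename is just a placeholder, not from the question) reads one file into a list of dictionaries keyed by its header row:

import csv

with open("condition1.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0])  # one row as a dict, keyed by the CSV header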
from random import randint, sample, choice


def create_random_list(length):
    return [randint(0, 100) for i in range(length)]


# This should be your list of four initial csv files
# with the 264 rows in total, read with the csv lib
lists = [create_random_list(264) for i in range(4)]

# Take a randomized sample from each list
lists = [sample(x, 40) for x in lists]

# Add some bookkeeping fields to each list
lists = [{'data': x, 'full_count': 0} for x in lists]

final = [[] for i in range(4)]

for l in final:
    prev = None
    count = 0
    while len(l) < 40:
        current = choice(lists)
        if current['full_count'] == 10 or (current is prev and count == 3):
            continue
        # Take an item from the chosen list if it hasn't been used 3 times in a
        # row and isn't already used 10 times. Append that item to the final list.
        total_left = 40 - len(l)
        maxx = 0
        for i in lists:
            if i is not current and 10 - i['full_count'] > maxx:
                maxx = 10 - i['full_count']
        current_left = 10 - current['full_count']
        max_left = maxx + maxx/3.0
        if maxx > 3 and total_left <= max_left:
            # Make sure that in the future it can still be split into sets of
            # max 3
            continue
        l.append(current['data'].pop())
        count += 1
        current['full_count'] += 1
        if current is not prev:
            count = 0
        prev = current
    for li in lists:
        li['full_count'] = 0
