Python, divide string into several substrings

I have an RNA string, e.g.:
AUGGCCAUA
I would like to generate all substrings by the following way:
#starting from 0 character
AUG, GCC, AUA
#starting from 1 character
UGG, CCA
#starting from 2 character
GGC, CAU
I wrote code that solves the first sub-problem:
for i in range(0,len(rna)):
    if fmod(i,3)==0:
        print(rna[i:i+3])
I have tried changing the starting position, i.e.:
for i in range(1,len(rna)):
But it gives me incorrect results:
GCC, UA #instead of UGG, CCA
Could you please give me a hint as to where my mistake is?

The problem with your code is that you always extract the substring at an index divisible by 3, no matter where the loop starts. Instead, try this:
a = 'AUGGCCAUA'
def getSubStrings(RNA, position):
    return [RNA[i:i+3] for i in range(position, len(RNA) - 2, 3)]
print(getSubStrings(a, 0))
print(getSubStrings(a, 1))
print(getSubStrings(a, 2))
Output
['AUG', 'GCC', 'AUA']
['UGG', 'CCA']
['GGC', 'CAU']
Explanation
range(position, len(RNA) - 2, 3) generates numbers with a common difference of 3, starting at position and stopping before len(RNA) - 2, so every start index leaves room for a full three-character substring. For example,
print(list(range(1, 8, 3)))
1 is the starting number, 8 is the exclusive upper bound, 3 is the common difference, and it gives
[1, 4, 7]
These are our starting indices. We then use a list comprehension to generate the new list:
[RNA[i:i+3] for i in range(position, len(RNA) - 2, 3)]
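The range-based slicing described above can be sketched for all three starting offsets at once. This is only an illustration, assuming Python 3; reading_frames and its size parameter are names made up for this example and are not part of the original answer.

```python
def reading_frames(rna, size=3):
    """Return one list of full-length chunks per starting offset 0..size-1."""
    return [
        # Stop at len(rna) - size + 1 so that every chunk has exactly `size` characters.
        [rna[i:i + size] for i in range(offset, len(rna) - size + 1, size)]
        for offset in range(size)
    ]


frames = reading_frames("AUGGCCAUA")
for offset, frame in enumerate(frames):
    print(offset, frame)
```

Returning the frames as a list of lists keeps the function free of printing, so the caller can decide what to do with each frame.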

Is this what you're looking for?
for i in range(len(rna)):
    if rna[i+3:]:
        print(rna[i:i+3])
outputs:
AUG
UGG
GGC
GCC
CCA
CAU

I thought of this oneliner:
a = 'AUGGCCAUA'
[a[x:x+3] for x in range(len(a))][:-2]

def generate(rna, index):
    for i in range(index, len(rna), 3):
        if len(rna[i:i+3]) == 3:
            print(rna[i:i+3])
Example:
In [29]: generate(rna, 1)
UGG
CCA
In [30]: generate(rna, 0)
AUG
GCC
AUA

Related

Python: How can I use a String for a If-Statement?

In Python I have to build a (long) if statement dynamically.
How can I do this?
I tried the following test code, which stores the necessary if-statement in a string built by the function "buildFilterCondition".
But this doesn't work...
Any ideas? What is going wrong?
Thank you very much.
Input = [1,2,3,4,5,6,7]
Filter = [4,7]
FilterCondition = ""
def buildFilterCondition():
    global FilterCondition
    for f in Filter:
        FilterCondition = FilterCondition + "(x==" + str(f) +") | "
    #remove the last "| " sign
    FilterCondition = FilterCondition[:-2]
    print("Current Filter: " + FilterCondition)
buildFilterCondition()
for x in Input:
    if( FilterCondition ):
        print(x)
With my function buildFilterCondition() I want to get the equivalent of the following loop, because the function generates the string "(x==4) | (x==7)", but this doesn't work:
for x in Input:
    if( (x==4) | (x==7) ):
        print(x)
The output, the result should be 4,7 (--> filtered)
The background of my question actually had a different intention than to replace an if-statement.
I need a longer multiple condition to select specific columns of a pandas dataframe.
For example:
df2=df.loc[(df['Discount1'] == 1000) & (df['Discount2'] == 2000)]
I wanted to keep the column names and the values (1000, 2000) in 2 separate lists (or dictionary) to make my code a little more "generic".
colmnHeader = ["Discount1", "Discount2"]
filterValue = [1000, 2000]
To "filter" the data frame, I then only need to adjust the lists.
How do I now rewrite the call to the .loc method so that it works for iterating over the lists?
df2=df.loc[(df[colmnHeader[0]] == filterValue[0]) & (df[colmnHeader[1]] == filterValue[1])]
Unfortunately, my current attempt with the following code does not work, because the conditions have to be applied to the pandas .loc call in parallel rather than sequentially.
So I need ALL the conditions from the lists directly in the .loc call.
#FILTER
colmn = ["colmn1", "colmn2", "colmn3"]
cellContent = ["1000", "2000", "3000"]
# first make sure, the lists have the same size
if( len(colmn) == len(cellContent)):
    curIdx = 0
    for curColmnName in colmn:
        df_columns= df_columns.loc[df_columns [curColmnName]==cellContent[curIdx]]
        curIdx += 1
Thank you again!

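Use the in operator
Because simple is better than complex. (In your loop, if( FilterCondition ) only tests whether the string is non-empty, which is always true; the expression stored inside the string is never evaluated.)
inputs = [1, 2, 3, 4, 5, 6, 7]
value_filter = [4, 7]
for x in inputs:
    if x in value_filter:
        print(x, end=' ')
# 4 7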

Use the operator module
With the operator module, you can build a condition at runtime from a list of (operator, value) pairs that the current value is tested against.
import operator
inputs = [1, 2, 3, 4, 5, 6, 7]
# This list can be dynamically changed if you need to
conditions = [
    (operator.ge, 4), # value needs to be greater than or equal to 4
    (operator.lt, 7), # value needs to be lower than 7
]
for x in inputs:
    # all() combines the conditions with "and"; use any() for "or"
    if all(condition(x, value) for condition, value in conditions):
        print(x, end=' ')
# 4 5 6
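The same runtime-condition idea can be applied to the two parallel lists of column names and values from the question. The following is a rough sketch in plain Python, not pandas: the rows list is hypothetical sample data standing in for the DataFrame, and matches is a helper name invented for this example.

```python
import operator

# Hypothetical sample rows standing in for the DataFrame in the question.
rows = [
    {"Discount1": 1000, "Discount2": 2000},
    {"Discount1": 1000, "Discount2": 500},
    {"Discount1": 300, "Discount2": 2000},
]

# The two parallel lists from the question.
column_names = ["Discount1", "Discount2"]
filter_values = [1000, 2000]

# One (column, operator, value) triple per condition, built at runtime.
conditions = [
    (name, operator.eq, value)
    for name, value in zip(column_names, filter_values)
]


def matches(row, conditions):
    """Return True if the row satisfies every condition (logical AND)."""
    return all(op(row[name], value) for name, op, value in conditions)


selected = [row for row in rows if matches(row, conditions)]
print(selected)
```

To change the filter, only the two lists need to be edited. With a real DataFrame, one boolean mask per pair would presumably be combined with & instead, but that variant is not shown here.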

Getting Index out of range error while trying to do Totaling statements

I am trying to compute several running totals, but it keeps saying index out of range.
Here is the section of code:
for m in range(len(mo)):
    for o in range(len(mag)):
        if mag[o] == 0 and mo[m] ==1 :
            countfujita[m] = countfujita[m] + 1
and I am trying to get the totals into a list such as this:
countfujita = [0,0,0,0,0,0]

I suspect this is because you are looping over mag for every item in mo when this is not what you want. Let me know if the following fixes your issue:
for m in range(len(mo)):
    if mag[m] == 0 and mo[m] ==1 :
        countfujita[m] = countfujita[m] + 1
(this assumes that len(mag) == len(mo))
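If mo and mag really are parallel lists of equal length, as assumed above, the manual indexing can be avoided altogether. A minimal sketch with made-up sample data (the values of mo and mag here are hypothetical):

```python
# Hypothetical sample data; both lists are assumed to have the same length.
mo = [1, 0, 1, 1]
mag = [0, 0, 2, 0]

# One counter slot per position, so countfujita[m] can never be out of range.
countfujita = [0] * len(mo)

# zip pairs the elements up; enumerate supplies the position.
for m, (mo_value, mag_value) in enumerate(zip(mo, mag)):
    if mag_value == 0 and mo_value == 1:
        countfujita[m] += 1

print(countfujita)
```

Sizing countfujita from len(mo) rather than hard-coding six zeros is what removes the IndexError risk.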

In order for your code to run successfully you need to ensure that countfujita is at least as long as the mo list.
The following would be a robust approach:
mo = [1, 0, 3]
mag = [3, 0, 1, 11]
# construct a list of the same length as *mo* and fill with zeroes
countfujita = [0] * len(mo)
for m in range(len(mo)):
    for o in range(len(mag)):
        if mag[o] == 0 and mo[m] == 1:
            countfujita[m] += 1
print(countfujita)
Output:
[1, 0, 0]

Indexing multiple lists inside nested loops is error-prone. In general, we want to leverage Python's standard-library modules as much as possible.
For example (starting from the example defined by Lancelot du Lac) we can use itertools.product to generate all combinations of mo and mag. enumerate gives us the index corresponding to each element of mo:
from itertools import product
for (imo, xmo), xmag in product(enumerate(mo), mag):
    if (xmo, xmag) == (1, 0):
        countfujita[imo] += 1
To push this even further, we can combine this with Counter to first generate a list of all matching mo indices and then count them. This results in a Counter({0: 1}) object, similar to a dict, which might or might not be appropriate depending on what you do with countfujita later on:
from itertools import product
from collections import Counter
Counter([imo for (imo, xmo), xmag
         in product(enumerate(mo), mag)
         if (xmo, xmag) == (1, 0)])
# Counter({0: 1})

How to get inverse of integer?

I am not sure if inverse is the proper name, but I think it is.
This example will clarify what I need:
I have a max height, 5 for example, and so height can range from 0 to 4. In this case we're talking integers, so the options are: 0, 1, 2, 3, 4.
What I need, given an input ranging from 0 up to (and including) 4, is to get the inverse number.
Example:
input: 3
output: 1
visual:
0 1 2 3 4
4 3 2 1 0
I know I can do it like this:
position_list = list(range(5))
index_list = position_list[::-1]
index = index_list[3]
But this will probably use unnecessary memory and CPU time creating two lists. The lists are discarded after these lines of code and recreated every time the code is run (within a method). I'd rather find a way that doesn't need the lists at all.
What is an efficient way to achieve the same? (while still keeping the code readable for someone new to the code)

Isn't it just max - in...?
>>> MAX=4
>>> def calc(in_val):
...     out_val = MAX - in_val
...     print('%s -> %s' % ( in_val, out_val ))
...
>>> calc(3)
3 -> 1
>>> calc(1)
1 -> 3

You just need to subtract from the max:
def return_inverse(n, mx):
    return mx - n
For the proposed example:
position_list = list(range(5))
mx = max(position_list)
[return_inverse(i, mx) for i in position_list]
# [4, 3, 2, 1, 0]

You have a maximum height; let's call it max_h.
Your numbers are counted from 0, so they are in [0; max_h - 1].
You want the complement that, added to the input number, gives max_h - 1.
It is max_h - 1 - your_number:
max_height = 5
for input_number in range(max_height):
    print('IN:', input_number, 'OUT:', max_height - input_number - 1)
IN: 0 OUT: 4
IN: 1 OUT: 3
IN: 2 OUT: 2
IN: 3 OUT: 1
IN: 4 OUT: 0

Simply compute the reverse index and then directly access the corresponding element.
n = 5
inp = 3
position_list = list(range(n))
position_list[n-1-inp]
# 1

You can just derive the index from the list's length and the desired position, to arrive at the "inverse":
position_list = list(range(5))
position = 3
inverse = position_list[len(position_list)-1-position]
And:
for i in position_list:
    print(i, position_list[len(position_list)-1-i])

In this case, you can just use output = 4-input. If the values are consecutive integers up to some number, a simple operation like that should be enough. For example, if the max was 10 and the min was 5, then you could just do 9-input+5. The 9 can be replaced by max-1 and the 5 by min (max being the exclusive upper bound, like the max height of 5 in the question).
So: max-1-input+min
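The formula above can be wrapped in a small helper. This is just a sketch; the name mirror and the half-open [low, high) convention are choices made for this example, with high playing the role of the exclusive max.

```python
def mirror(value, low, high):
    """Mirror value inside the half-open integer range [low, high).

    Implements the formula from the text: (high - 1) - value + low.
    """
    if not low <= value < high:
        raise ValueError(f"{value} is outside [{low}, {high})")
    return (high - 1) - value + low


print(mirror(3, 0, 5))                           # the example from the question
print([mirror(i, 0, 5) for i in range(5)])       # heights 0..4
print([mirror(i, 5, 10) for i in range(5, 10)])  # range with a non-zero minimum
```

No list is created for a single lookup, which addresses the memory concern raised in the question.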

How to sum value of integers based on position?

The situation is as follows. I want to sum specific values based on their positions and eventually calculate their averages. So far I have tried many different things and came up with the following code, but I can't figure out how to match the positions with the values that belong to them.
count_pos = 0
for character in score:
    asci = ord(character)
    count_pos += 1
    print(count_pos,asci)
    if asci == 10 :
        count_pos = 0
The print call generates the following output:
1 35
2 52
3 61
4 68
5 70
6 70
1 35
2 49
3 61
4 68
5 68
6 70
The numbers 1-6 are the positions and the other integers are the values belonging to those positions. So what I am basically trying to do is sum the values at position 1 (35+35), which should give me 70, and the values at position 2 (52+49), which should give me 101, and so on for all positions.
The only thing I have thought of so far was comparing the counter like this:
if count_pos == count_pos:
    #Do calculation
NOTE: This is just a part of the data. The real data goes on like this for more than 1000 of these groups, not just the 2 displayed here.

Solution
This would work:
from collections import defaultdict
score = '#4=DFF\n#1=DDF\n'
res = defaultdict(int)
for entry in score.splitlines():
    for pos, char in enumerate(entry, 1):
        res[pos] += ord(char)
Now:
>>> res
defaultdict(int, {1: 70, 2: 101, 3: 122, 4: 136, 5: 138, 6: 140})
>>> res[1]
70
>>> res[2]
101
In Steps
Your score string looks like this (reconstructed from your asci numbers):
score = '#4=DFF\n#1=DDF\n'
Instead of looking for asci == 10, just split at newline characters with the string method splitlines().
The defaultdict from the collections module gives you a dictionary that you initialize with a factory function. We use int here. It calls int() whenever we access a key that does not exist. So, if you do:
res[pos] += ord(char)
and the key pos does not exist yet, it will call int(), which gives 0, and you can add your number to it. The next time around, when pos is already a key in your dictionary, you get the stored value and add to it, summing up the values for each position.
The enumerate here:
for pos, char in enumerate(entry, 1):
gives you the position in each row, named pos, starting with 1.
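Since the question ultimately asks for averages, the summing approach can be extended with a second defaultdict that counts how many rows contribute to each position. A minimal sketch using the same sample string; the names totals, counts and averages are made up for this example.

```python
from collections import defaultdict

score = '#4=DFF\n#1=DDF\n'  # sample data as reconstructed in the answer

totals = defaultdict(int)  # position -> sum of character codes
counts = defaultdict(int)  # position -> number of rows that reach this position

for entry in score.splitlines():
    for pos, char in enumerate(entry, 1):
        totals[pos] += ord(char)
        counts[pos] += 1

# Divide each sum by the number of rows that contributed to it.
averages = {pos: totals[pos] / counts[pos] for pos in totals}
print(averages)
```

Counting per position, rather than dividing by the number of rows, keeps the averages correct even if some rows are shorter than others.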

If the values to be added are in two lists, you can do this:
Using zip:
[x + y for x, y in zip(List1, List2)]
or
zipped_list = zip(List1,List2)
print([sum(item) for item in zipped_list])
E.g., if the lists were
List1=[1, 2, 3]
List2=[4, 5, 6]
the output would be: [5, 7, 9]
Using Numpy:
import numpy as np
all = [list1,list2,list3 ...]
result = sum(map(np.array, all))
Eg:
>>> li=[1,3]
>>> li1=[1,3]
>>> li2=[1,3]
>>> li3=[1,3]
>>> import numpy as np
>>> all=[li,li1,li2,li3]
>>> mylist = sum(map(np.array, all))
>>> mylist
array([ 4, 12])
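The zip approach also extends to any number of lists without NumPy by unpacking a list of rows. A small sketch, using the two rows of values printed in the question as sample data (rows and column_sums are names chosen for this example):

```python
# The two rows of character codes shown in the question's output.
rows = [
    [35, 52, 61, 68, 70, 70],
    [35, 49, 61, 68, 68, 70],
]

# zip(*rows) regroups the values by position; note that zip stops at the
# shortest row, so all rows are assumed to have the same length.
column_sums = [sum(column) for column in zip(*rows)]
print(column_sums)
```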

Edit Distance in Python

I'm programming a spellcheck program in Python. I have a list of valid words (the dictionary) and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.
I know I need to start by generating a list with an edit distance of one from the invalid word(and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...) that should output a list of words with an edit distance of 1, where inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one less letter, and changes outputs all valid words with one different letter.
I've checked a bunch of places but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time consuming. If anyone could offer some insight, I'd be extremely grateful.

What you are looking for is called an edit distance, and there is a nice explanation on the wiki. There are many ways to define a distance between two words; the one you want is called the Levenshtein distance, and here is a DP (dynamic programming) implementation in Python.
def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
And a couple of more implementations are here.
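To connect a distance function back to the spellcheck task in the question, one pass over the dictionary with a distance threshold is enough. The following is a sketch under stated assumptions: levenshtein is a two-row variant written for this example, and suggestions, max_distance and the word list are hypothetical names and data.

```python
def levenshtein(s1, s2):
    """Levenshtein distance, keeping only two rows of the DP table."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggestions(word, dictionary, max_distance=2):
    """Dictionary words within max_distance edits of word, closest first."""
    scored = []
    for candidate in dictionary:
        # The length difference is a lower bound on the distance,
        # so clearly distant words are skipped without running the DP.
        if abs(len(candidate) - len(word)) > max_distance:
            continue
        distance = levenshtein(word, candidate)
        if distance <= max_distance:
            scored.append((distance, candidate))
    return [candidate for distance, candidate in sorted(scored)]


words = ["apple", "ape", "peach", "puppy", "appeal"]  # hypothetical word list
print(suggestions("appel", words))
```

This scans the dictionary once per misspelled word instead of generating every possible edit, which is usually simpler when the dictionary is of moderate size.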

difflib in the standard library has various utilities for sequence matching, including the get_close_matches method that you could use. It uses an algorithm adapted from Ratcliff and Obershelp.
From the docs
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

Here is my version of the Levenshtein distance:
def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1
    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
    return tbl[i,j]
print(edit_distance("Helloworld", "HalloWorld"))

# this calculates the edit distance with unit costs for insertion, deletion and substitution, i.e. the Levenshtein distance
word1="rice"
word2="ice"
len_1=len(word1)
len_2=len(word2)
x =[[0]*(len_2+1) for _ in range(len_1+1)]#the matrix whose last element ->edit distance
for i in range(0,len_1+1): #initialization of base case values
    x[i][0]=i
for j in range(0,len_2+1):
    x[0][j]=j
for i in range (1,len_1+1):
    for j in range(1,len_2+1):
        if word1[i-1]==word2[j-1]:
            x[i][j] = x[i-1][j-1]
        else :
            x[i][j]= min(x[i][j-1],x[i-1][j],x[i-1][j-1])+1
print(x[len_1][len_2])

Using SequenceMatcher from Python's standard-library difflib module is another way of doing it, but (as correctly pointed out in the comments) the result does not match the definition of an edit distance exactly. Bonus: it supports ignoring "junk" parts (e.g. spaces or punctuation).
from difflib import SequenceMatcher
a = 'kitten'
b = 'sitting'
required_edits = [
    code
    for code in (
        SequenceMatcher(a=a, b=b, autojunk=False)
        .get_opcodes()
    )
    if code[0] != 'equal'
]
required_edits
# [
#    # (tag, i1, i2, j1, j2)
#    ('replace', 0, 1, 0, 1), # replace a[0:1]="k" with b[0:1]="s"
#    ('replace', 4, 5, 4, 5), # replace a[4:5]="e" with b[4:5]="i"
#    ('insert', 6, 6, 6, 7), # insert b[6:7]="g" after a[6:6]="n"
# ]
# the edit distance:
len(required_edits) # == 3

I would recommend not creating this kind of code on your own. There are libraries for that.
For instance the Levenshtein library.
In [2]: Levenshtein.distance("foo", "foobar")
Out[2]: 3
In [3]: Levenshtein.distance("barfoo", "foobar")
Out[3]: 6
In [4]: Levenshtein.distance("Buroucrazy", "Bureaucracy")
Out[4]: 3
In [5]: Levenshtein.distance("Misisipi", "Mississippi")
Out[5]: 3
In [6]: Levenshtein.distance("Misisipi", "Misty Mountains")
Out[6]: 11
In [7]: Levenshtein.distance("Buroucrazy", "Born Crazy")
Out[7]: 4

Similar to Santoshi's solution above, but I made three changes:
One-line initialization instead of five
No need to define cost separately (just use the boolean as an int, 0 or 1)
Use product instead of a double for loop (this last one is only cosmetic; the double loop seems unavoidable)
from itertools import product
def edit_distance(s1,s2):
    d={ **{(i,0):i for i in range(len(s1)+1)},**{(0,j):j for j in range(len(s2)+1)}}
    for i, j in product(range(1,len(s1)+1), range(1,len(s2)+1)):
        d[i,j]=min((s1[i-1]!=s2[j-1]) + d[i-1,j-1], d[i-1,j]+1, d[i,j-1]+1)
    # index explicitly: i and j are undefined if either string is empty
    return d[len(s1),len(s2)]

Instead of computing the Levenshtein distance against every word in the dictionary, use a BK-tree or a trie, as lookups in these structures need less work than a full scan with the edit distance. Reading up on these topics will give you a detailed description.
This link will help you learn more about spell checking.
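As a rough illustration of the BK-tree idea mentioned above, here is a minimal sketch. It is not taken from the linked material; the class BKTree, its methods, the tuple-based node layout and the word list are all assumptions made for this example. It relies on the triangle inequality of the Levenshtein metric to skip whole subtrees during a search.

```python
def levenshtein(s1, s2):
    """Levenshtein distance, keeping only two rows of the DP table."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


class BKTree:
    """Minimal BK-tree over an integer-valued metric."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # a node is (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        """Return sorted (distance, word) pairs within max_distance of word."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # Triangle inequality: matches can only sit below edges whose
            # label lies in [d - max_distance, d + max_distance].
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for w in ["apple", "ape", "peach", "puppy", "appeal"]:  # hypothetical word list
    tree.add(w)
print(tree.search("appel", 2))
```

The tree is built once for the whole dictionary; each lookup then computes the distance only for the nodes it actually visits.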

You need the Minimum Edit Distance (MED) for this task.
The following is my version of MED, a.k.a. Levenshtein distance.
def MED_character(str1,str2):
    len1=len(str1)
    len2=len(str2)
    #output the length of the other string in case the length of either string is zero
    if len1==0:
        return len2
    if len2==0:
        return len1
    #initializing a zero matrix with one extra row and column for the empty prefix
    accumulator = [[0 for x in range(len2+1)] for y in range(len1+1)]
    # initializing the base cases
    for i in range(0,len1+1):
        accumulator[i][0] = i
    for i in range(0,len2+1):
        accumulator[0][i] = i
    # we take the accumulator and iterate through it row by row.
    for i in range(1,len1+1):
        char1=str1[i-1]
        for j in range(1,len2+1):
            char2=str2[j-1]
            cost1=0
            if char1!=char2:
                cost1=2 #cost for substitution (one deletion plus one insertion; use 1 for the classic Levenshtein distance)
            accumulator[i][j]=min(accumulator[i-1][j]+1, accumulator[i][j-1]+1, accumulator[i-1][j-1] + cost1 )
    cost=accumulator[len1][len2]
    return cost

Fine-tuned code based on the version from @Santosh, which should address the issue brought up by @Artur Krajewski. The biggest difference is replacing the dictionary with an actual 2D matrix.
def edit_distance(s1, s2):
    # add a blank character for both strings
    m=len(s1)+1
    n=len(s2)+1
    # create the matrix
    tbl = [[0] * n for i in range(m)]
    for i in range(m): tbl[i][0]=i
    for j in range(n): tbl[0][j]=j
    for i in range(1, m):
        for j in range(1, n):
            #if the strings have the same letter here, the operation cost is 0, otherwise 1
            cost = 0 if s1[i-1] == s2[j-1] else 1
            #take the cheapest of insertion, deletion and substitution
            tbl[i][j] = min(tbl[i][j-1]+1, tbl[i-1][j]+1, tbl[i-1][j-1]+cost)
    #the bottom-right cell holds the distance between the full strings
    return tbl[m-1][n-1]
edit_distance("birthday", "Birthdayyy")

Following up on @krassowski's answer:
from difflib import SequenceMatcher
def sequence_matcher_edits(word_a, word_b):
    required_edits = [code for code in (
        SequenceMatcher(a=word_a, b=word_b, autojunk=False).get_opcodes()
        )
        if code[0] != 'equal'
    ]
    return len(required_edits)
print(f"sequence_matcher_edits {sequence_matcher_edits('kitten', 'sitting')}")
# -> sequence_matcher_edits 3
