Python fast insert multiple characters into all possible places of string - python

I want to insert multiple characters into all possible places of a string. My current implementation uses itertools.combinations_with_replacement to list all possible insertion positions, converts the string to a NumPy array, calls numpy.insert to insert the characters into the array, and finally uses join to convert the resulting array back to a string. Taking the insertion of 2 characters as an example:
import numpy as np
import itertools
string = "stack"
str_array = np.array(list(string), dtype=str)
characters = np.array(["x", "y"], dtype=str)
new_strings = ["".join(np.insert(str_array, ix, characters)) for ix in itertools.combinations_with_replacement(range(len(string)+1), len(characters))]
Outputs:
['xystack', 'xsytack', 'xstyack', 'xstayck', 'xstacyk', 'xstacky', 'sxytack', 'sxtyack', 'sxtayck', 'sxtacyk', 'sxtacky', 'stxyack', 'stxayck', 'stxacyk', 'stxacky', 'staxyck', 'staxcyk', 'staxcky', 'stacxyk', 'stacxky', 'stackxy']
It seems complicated, but I can't find a better way to achieve this if I want to insert any number (e.g., 3) of characters into a string. Did I miss a better and faster way to do this?

A recursive solution:
def mix(s, t, p=''):
    return s and t and mix(s[1:], t, p+s[0]) + mix(s, t[1:], p+t[0]) or [p + s + t]
My p is the prefix built so far. In each recursive step, I extend it with the first character from s or the first character from t. Unless one of them doesn't have a character left, in which case I just return the prefix plus whatever is left.
Demo:
>>> mix('xy', 'stack')
['xystack', 'xsytack', 'xstyack', 'xstayck', 'xstacyk', 'xstacky', 'sxytack',
'sxtyack', 'sxtayck', 'sxtacyk', 'sxtacky', 'stxyack', 'stxayck', 'stxacyk',
'stxacky', 'staxyck', 'staxcyk', 'staxcky', 'stacxyk', 'stacxky', 'stackxy']
It's about 20 times faster than yours on your example case.
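For readers who find the one-liner dense, the same recursion can be spelled out as a generator. This is only a sketch of the idea described above: the name interleavings is made up here, and it assumes Python 3.3+ for yield from.

```python
def interleavings(s, t, prefix=""):
    """Yield every way to merge s into t, keeping each string's own order."""
    # If either string is exhausted, only one completion remains.
    if not s or not t:
        yield prefix + s + t
        return
    # Extend the prefix with the next character of s ...
    yield from interleavings(s[1:], t, prefix + s[0])
    # ... or with the next character of t.
    yield from interleavings(s, t[1:], prefix + t[0])


results = list(interleavings("xy", "stack"))
print(len(results), results[0], results[-1])
```

Because it is a generator, results can be consumed one at a time instead of building the whole list up front.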

Related

String slicing in numpy array

Say we have an numpy.ndarray with numpy.str_ elements. For example, below arr is the numpy.ndarray with two numpy.str_ elements like this:
arr = ['12345"""ABCDEFG' '1A2B3C"""']
Trying to perform string slicing on each numpy element.
For example, how can we slice the first element '12345"""ABCDEFG' so that we replace its 10 last characters with the string REPL, i.e.
arr = ['12345REPL' '1A2B3C"""']
Also, is it possible to perform string substitutions, e.g. substitute all characters after a specific symbol?
Strings are immutable, so you should either create slices and manually recombine or use regular expressions. For example, to replace the last 10 characters of the first element in your array, arr, you could do:
import numpy as np
import re
arr = np.array(['12345"""ABCDEFG', '1A2B3C"""'])
# escape the substring so it is matched literally, and anchor it to the end
arr[0] = re.sub(re.escape(arr[0][-10:]) + '$', 'REPL', arr[0])
print(arr)
#['12345REPL' '1A2B3C"""']
If you want to replace all characters after a specific character you could use a regular expression or find the index of that character in the string and use that as the slicing index.
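A minimal sketch of the index-based option, using plain Python strings. The helper name replace_from is hypothetical, and it assumes the marker itself should be replaced too, as in the desired output from the question.

```python
def replace_from(text, marker, replacement):
    """Replace the first occurrence of marker and everything after it."""
    # partition splits at the first occurrence of marker;
    # found is '' when the marker is absent.
    head, found, _tail = text.partition(marker)
    return head + replacement if found else text


print(replace_from('12345"""ABCDEFG', '"""', 'REPL'))
```

Strings without the marker are returned unchanged, so the helper can be applied to every element of a list or array.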
EDIT: Your comment is more about regular expressions than simply Python slicing, but this is how you could replace everything after the triple quote:
re.sub('["]{3}(.+)', 'REPL', arr[0])
This line essentially says, "Find the triple quote and everything after it, and replace the whole match (triple quote included) with REPL."
In python, strings are immutable. Also, in NumPy, array scalars are immutable; your string is therefore immutable.
What you would want to do in order to slice is to treat your string like a list and access the elements.
Say we had a string where we wanted to slice at the 3rd letter, excluding the third letter:
my_str = 'purple'
sliced_str = my_str[:2] # 'pu'
Now that we have that part of the string, say we want to replace everything after the slice point with another string. We keep the slice we extracted and build a new string by concatenating it with the replacement:
# say I want to replace the end of 'my_str', from where we sliced, with a string named 's'
s = 'dandylion'
new_string = sliced_str + s # returns 'pudandylion'
Because string types are immutable, you have to store elements you want to keep, then combine the stored elements with the elements you would like to add in a new variable.
np.char has a replace function, which applies the corresponding string method to each element of the array:
In [598]: arr = np.array(['12345"""ABCDEFG', '1A2B3C"""'])
In [599]: np.char.replace(arr,'"""ABCDEFG',"REPL")
Out[599]:
array(['12345REPL', '1A2B3C"""'],
dtype='<U9')
In this particular example it can be made to work, but it isn't nearly as general purpose as re.sub. Also, these char functions are only modestly faster than iterating on the array. There are some good examples of that in @Divakar's link.

best way to get an integer from string without using regex

I would like to get some integers from a string (the 3rd one). Preferably without using regex.
I saw a lot of stuff.
my string:
xp = '93% (9774/10500)'
So I would like the code to return a list with the integers from the string. The desired output would be: [93, 9774, 10500]
Some stuff like this doesn't work:
>>> new = [int(s) for s in xp.split() if s.isdigit()]
>>> print new
[]
>>> int(filter(str.isdigit, xp))
93977410500
Since the problem is that you have to split on different characters, you can first replace everything that's not a digit with a space and then split. A one-liner would be:
xp = '93% (9774/10500)'
''.join([ x if x.isdigit() else ' ' for x in xp ]).split() # ['93', '9774', '10500']
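Another regex-free option, not taken from the answers here, is to group consecutive digits with itertools.groupby and convert each run. The function name extract_ints is made up for this sketch. One assumption to note: str.isdigit also accepts some non-ASCII digit characters that int() may reject, so this is meant for ASCII input like the example.

```python
from itertools import groupby


def extract_ints(text):
    """Return the runs of consecutive digits in text as a list of ints."""
    return [
        int("".join(group))
        for is_digit, group in groupby(text, key=str.isdigit)
        if is_digit  # skip the runs of non-digit characters
    ]


print(extract_ints('93% (9774/10500)'))
```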
Using regex (sorry!) to split the string by a non-digit, then filter on digits (can have empty fields) and convert to int.
import re
xp = '93% (9774/10500)'
print([int(x) for x in filter(str.isdigit,re.split(r"\D+",xp))])
result:
[93, 9774, 10500]
Since this is Py2, using str, it looks like you don't need to consider the full Unicode range; since you're doing this more than once, you can slightly improve on polku's answer using str.translate:
# Create a translation table once, up front, that replaces non-digits with spaces
import string
nondigits = ''.join(c for c in map(chr, range(256)) if not c.isdigit())
nondigit_to_space_table = string.maketrans(nondigits, ' ' * len(nondigits))
# Then, when you need to extract integers use the table to efficiently translate
# at C layer in a single function call:
xp = '93% (9774/10500)'
intstrs = xp.translate(nondigit_to_space_table).split() # ['93', '9774', '10500']
myints = map(int, intstrs) # Wrap in `list` constructor on Py3
Performance-wise, for the test string on my 64 bit Linux 2.7 build, using translate takes about 374 nanoseconds to run, vs. 2.76 microseconds for the listcomp and join solution; the listcomp+join takes >7x longer. For larger strings (where the fixed overhead is trivial compared to the actual work), the listcomp+join solution takes closer to 20x longer.
The main advantage of polku's solution is that it requires no changes on Py3 (where it should seamlessly support non-ASCII strings). On Py3, str.translate builds its translation table a different way, and it would be impractical to make a table up front that handled all non-digits in the whole Unicode space.
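One possible workaround on Python 3, offered as an assumption rather than as part of the answer above, is to avoid building the full table and let a dict subclass compute entries lazily through __missing__. The class name LazyDigitTable is hypothetical, and the sketch only treats the ASCII digits 0-9 as digits.

```python
class LazyDigitTable(dict):
    """Translation table that maps every non-digit code point to a space."""

    def __missing__(self, codepoint):
        # str.translate looks up each character's code point (an int).
        ch = chr(codepoint)
        result = ch if ch in "0123456789" else " "
        self[codepoint] = result  # cache so each code point is computed once
        return result


table = LazyDigitTable()
xp = '93% (9774/10500)'
myints = [int(s) for s in xp.translate(table).split()]
print(myints)
```

No timing claim is made for this variant; the per-character Python callback may cost more than the Py2 table lookup until the cache warms up.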
Since the format is fixed, you can use consecutive split().
It's not very pretty, or general, but sometimes the direct and "stupid" solution is not so bad:
a, b = xp.split("%")
x = int(a)
y = int(b.split("/")[0].strip()[1:])
z = int(b.split("/")[1].strip()[:-1])
print(x, y, z) # prints "93 9774 10500"
Edit: Clarified that the poster specifically said that his format is fixed. This solution is not very pretty, but it does what it's supposed to.

Comparing and Combining List Items in Python

I'm working on Advent of Code: Day 2, and I'm having trouble working with lists. My code takes a string, for example 2x3x4, and splits it into a list. Then it checks for an 'x' in the list, removes them, and feeds the values to a function that calculates the area needed. The problem is that before it removes the 'x's I need to find out if there are two digits before the 'x' and combine them, to account for double-digit numbers. I've looked into regular expressions but I don't think I've been using them right. Any ideas?
def CalcAreaBox(l, w, h):
    totalArea = (2*(l*w)) + (2*(w*h))+ (2*(h*l))
    extra = l * w
    toOrder = totalArea + extra
    print(toOrder)
def ProcessString(dimStr):
    #separate chars into a list
    dimStrList = list(dimStr)
    #How to deal with double digit nums?
    #remove any x
    for i in dimStrList:
        if i == 'x':
            dimStrList.remove(i)
    #Feed the list to CalcAreaBox
    CalcAreaBox(int(dimStrList[0]), int(dimStrList[1]), int(dimStrList[2]))
dimStr = "2x3x4"
ProcessString(dimStr)
You could use split on your string (the original string, not the list of characters):
#split on x and put the pieces in a list of ints
dims = [int(dim) for dim in dimStr.split('x')]
#Feed the list to CalcAreaBox
CalcAreaBox(dims[0], dims[1], dims[2])
Of course, you will want to consider handling the cases where there are not exactly two x's in the string.
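A hedged sketch of that validation, assuming the input is always meant to be three integers separated by 'x'. The name parse_dims and the error message are made up for illustration.

```python
def parse_dims(dim_str):
    """Parse a string like '2x3x4' into a tuple of three ints."""
    parts = dim_str.split('x')
    if len(parts) != 3:
        raise ValueError("expected LxWxH, got %r" % dim_str)
    # int() raises ValueError itself if a piece is not a number.
    return tuple(int(part) for part in parts)


print(parse_dims("2x3x4"))
print(parse_dims("10x20x30"))  # multi-digit numbers need no special handling
```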
Your question is more likely to fit on Code Review and not Stack Overflow.
As your task is a little challenge, I would not tell you an exact solution, but give you a hint towards the split method of Python strings (see the documentation).
Additionally, you should check the style of your code against the recommendation in PEP8, e.g. Python usually has function/variable names in all lowercase letters, words separated by underscores (like calc_area_box).

Python text encryption: rot13

I am currently doing an assignment that encrypts text using rot13, but some of my text won't register.
# cgi is to escape html
# import cgi
def rot13(s):
    #string encrypted
    scrypt=''
    alph='abcdefghijklmonpqrstuvwxyz'
    for c in s:
        # check if char is in alphabet
        if c.lower() in alph:
            #find c in alph and return its place
            i = alph.find(c.lower())
            #encrypt char = c incremented by 13
            ccrypt = alph[ i+13 : i+14 ]
            #add encrypted char to string
            if c==c.lower():
                scrypt+=ccrypt
            if c==c.upper():
                scrypt+=ccrypt.upper()
        #dont encrypt special chars or spaces
        else:
            scrypt+=c
    return scrypt
    # return cgi.escape(scrypt, quote = True)
given_string = 'Rot13 Test'
print rot13(given_string)
OUTPUT:
13 r
[Finished in 0.0s]
Hmmm, seems like a bunch of things are not working.
The main problem should be in ccrypt = alph[ i+13 : i+14 ]: you're missing a % len(alph), so if, for example, i is equal to 18, then i+13 ends up past the end of the string.
In your output, in fact, only e is encoded to r because it's the only letter in your test string whose index, moved by 13, stays within bounds.
The rest of this answer is just tips to clean up the code a little:
Instead of alph='abc..., you can import string at the beginning of the script and use string.lowercase (string.ascii_lowercase on Python 3).
Instead of string slicing, for just one character it's better to use string[i]; it gets the work done.
Instead of c == c.upper(), you can use the built-in string method: if c.isupper() ....
The trouble you're having is with your slice. It will be empty if your character is in the second half of the alphabet, because i+13 will be off the end. There are a few ways you could fix it.
The simplest might be to simply double your alphabet string (literally: alph = alph * 2). This means you can use indexes up to 51, rather than just up to 25. This is a pretty crude solution though, and it would be better to just fix the indexing.
A better option would be to subtract 13 from your index, rather than adding 13. Rot13 is symmetric, so both will have the same effect, and it will work because negative indexes are legal in Python (they refer to positions counted backwards from the end).
In either case, it's not actually necessary to do a slice at all. You can simply grab a single value (unlike C, there's no char type in Python, so single characters are strings too). If you were to make only this change, it would probably make it clear why your current code is failing, as trying to access a single value off the end of a string will raise an exception.
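A minimal sketch of the subtract-13 idea with single-character indexing. It assumes a correctly ordered alphabet (string.ascii_lowercase) rather than the hand-typed one from the question.

```python
import string

ALPH = string.ascii_lowercase  # 'abcdefghijklmnopqrstuvwxyz'


def rot13(s):
    out = []
    for c in s:
        i = ALPH.find(c.lower())
        if i == -1:
            out.append(c)  # not a letter: keep as-is
            continue
        # i - 13 may be negative; negative indexes count from the end,
        # which wraps around the 26-letter alphabet for free.
        rotated = ALPH[i - 13]
        out.append(rotated.upper() if c.isupper() else rotated)
    return "".join(out)


print(rot13('Rot13 Test'))
```

Applying rot13 twice returns the original text, which is a quick sanity check.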
Edit: Actually, after thinking about what solution is really best, I'm inclined to suggest avoiding index-math based solutions entirely. A better approach is to use Python's fantastic dictionaries to do your mapping from original characters to encrypted ones. You can build and use a Rot13 dictionary like this:
alph="abcdefghijklmnopqrstuvwxyz"
rot13_table = dict(zip(alph, alph[13:]+alph[:13])) # lowercase character mappings
rot13_table.update((c.upper(),rot13_table[c].upper()) for c in alph) # uppercase
def rot13(s):
    return "".join(rot13_table.get(c, c) for c in s) # non-letters pass through unchanged
First thing that may have caused you some problems - your string list has the n and the o switched, so you'll want to adjust that :) As for the algorithm, when you run:
ccrypt = alph[ i+13 : i+14 ]
Think of what happens when you get 25 back from the first iteration (for z). You are now looking for the index position alph[38:39] (side note: you can actually just say alph[38]), which is far past the bounds of the 26-character string, which will return '':
In [1]: s = 'abcde'
In [2]: s[2]
Out[2]: 'c'
In [3]: s[2:3]
Out[3]: 'c'
In [4]: s[49:50]
Out[4]: ''
As for how to fix it, there are a number of interesting methods. Your code functions just fine with a few modifications. One thing you could do is create a mapping of characters that are already 'rotated' 13 positions:
alph = 'abcdefghijklmnopqrstuvwxyz'
coded = 'nopqrstuvwxyzabcdefghijklm'
All we did here is split the original list into halves of 13 and then swap them - we now know that if we take a letter like a and get its position (0), the same position in the coded list will be the rot13 value. As this is for an assignment I won't spell out how to do it, but see if that gets you on the right track (and @Makoto's suggestion is a perfect way to check your results).
This line
ccrypt = alph[ i+13 : i+14 ]
does not do what you think it does - it returns a string slice from i+13 to i+14, but if these indices are greater than the length of the string, the slice will be empty:
"abc"[5:6] #returns ''
This means your solution turns everything from n onward into an empty string, which produces your observed output.
The correct way of implementing this would be (1.) using a modulo operation to constrain the index to a valid number and (2.) using simple character access instead of string slices, which is easier to read, faster, and throws an IndexError for invalid indices, meaning your error would have been obvious.
ccrypt = alph[(i+13) % 26]
If you're doing this as an exercise for a course in Python, ignore this, but just saying...
>>> import codecs
>>> codecs.encode('Some text', 'rot13')
'Fbzr grkg'
>>>

Looking for elegant glob-like DNA string expansion

I'm trying to make a glob-like expansion of a set of DNA strings that have multiple possible bases.
The base of my DNA strings contains the letters A, C, G, and T. However, I can have special characters like M which could be an A or a C.
For example, say I have the string:
ATMM
I would like to take this string as input and output the four possible matching strings:
ATAA
ATAC
ATCA
ATCC
Rather than brute force a solution, I feel like there must be some elegant Python/Perl/Regular Expression trick to do this.
Thank you for any advice.
Edit, thanks cortex for the product operator. This is my solution:
Still a Python newbie, so I bet there's a better way to handle each dictionary key than another for loop. Any suggestions would be great.
import sys
from itertools import product
baseDict = dict(M=['A','C'],R=['A','G'],W=['A','T'],S=['C','G'],
                Y=['C','T'],K=['G','T'],V=['A','C','G'],
                H=['A','C','T'],D=['A','G','T'],B=['C','G','T'])
def glob(str):
    strings = [str]
    ## this loop visits every possible base in the dictionary
    ## probably a cleaner way to do it
    for base in baseDict:
        oldstrings = strings
        strings = []
        for string in oldstrings:
            strings += map("".join,product(*[baseDict[base] if x == base
                                             else [x] for x in string]))
    return strings
for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    permutations = glob(line)
    for x in permutations:
        print x
Agree with other posters that it seems like a strange thing to want to do. Of course, if you really want to, there is (as always) an elegant way to do it in Python (2.6+):
from itertools import product
map("".join, product(*[['A', 'C'] if x == "M" else [x] for x in "GMTTMCA"]))
Full solution with input handling:
import sys
from itertools import product
base_globs = {"M":['A','C'], "R":['A','G'], "W":['A','T'],
              "S":['C','G'], "Y":['C','T'], "K":['G','T'],
              "V":['A','C','G'], "H":['A','C','T'],
              "D":['A','G','T'], "B":['C','G','T'],
              }
def base_glob(glob_sequence):
    production_sequence = [base_globs.get(base, [base]) for base in glob_sequence]
    return map("".join, product(*production_sequence))
for line in sys.stdin.readlines():
    productions = base_glob(line.strip())
    print "\n".join(productions)
You could probably do something like this in Python using a generator (yield):
def glob(str):
    if str=='':
        yield ''
        return
    if str[0]!='M':
        for tail in glob(str[1:]):
            yield str[0] + tail
    else:
        # M stands for A or C
        for c in ['A','C']:
            for tail in glob(str[1:]):
                yield c + tail
    return
EDIT: As correctly pointed out I was making a few mistakes. Here is a version which I tried out and works.
This isn't really an "expansion" problem and it's almost certainly not doable with any sensible regular expression.
I believe what you're looking for is "how to generate permutations".
You could for example do this recursively. Pseudo-code:
printSequences(sequence s)
    switch "first special character in sequence"
        case ...
        case M:
            s1 = s, but first M replaced with A
            printSequences(s1)
            s2 = s, but first M replaced with C
            printSequences(s2)
        case none:
            print s;
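The pseudo-code above could be turned into runnable Python roughly as follows. This is a sketch, not the answerer's implementation: the name expand and the AMBIGUOUS table are introduced here (the table copies the codes listed in the question), and it yields strings instead of printing them.

```python
# Ambiguity codes as listed in the question's dictionary.
AMBIGUOUS = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT",
    "K": "GT", "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
}


def expand(seq):
    """Yield every plain sequence matched by seq."""
    for i, base in enumerate(seq):
        if base in AMBIGUOUS:
            # Replace the first special character with each alternative
            # and recurse on the result.
            for choice in AMBIGUOUS[base]:
                yield from expand(seq[:i] + choice + seq[i + 1:])
            return
    # No special characters left: this is a finished sequence.
    yield seq


print(list(expand("ATMM")))
```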
Regexps match strings, they're not intended to be turned into every string they might match.
Also, you're looking at a lot of strings being output from this - for instance:
MMMMMMMMMMMMMMMM (16 M's)
produces 65,536 16 character strings - and I'm guessing that DNA sequences are usually longer than that.
Arguably any solution to this is pretty much 'brute force' from a computer science perspective, because your algorithm is O(2^n) on the original string length. There's actually quite a lot of work to be done.
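The size of the output can be checked before generating anything by multiplying the number of alternatives at each position. The sketch below assumes the ambiguity codes from the question; count_expansions and CHOICES are hypothetical names. Two-way codes like M double the count, three-way codes like V triple it.

```python
# Number of alternatives for each ambiguity code listed in the question.
CHOICES = {
    "M": 2, "R": 2, "W": 2, "S": 2, "Y": 2, "K": 2,
    "V": 3, "H": 3, "D": 3, "B": 3,
}


def count_expansions(seq):
    """Return how many plain sequences seq would expand to."""
    total = 1
    for base in seq:
        total *= CHOICES.get(base, 1)  # plain bases contribute a factor of 1
    return total


print(count_expansions("M" * 16))
```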
Why do you want to produce all the combinations? What are you going to do with them? (If you're thinking to produce every string possibility and then look for it in a large DNA sequence, then there are much better ways of doing that.)
