Creating an exhaustive list of `n` substrings - python

I am trying to split a string into n substrings and return a list of tuples of them.
I currently use the code
for (w1, w2) in [(w[:i], w[i:]) for i in range(len(w))]
where w is the variable which contains the word. So if w='house', this gives [('','house'),('h','ouse'), etc.
This works perfectly for splitting a string into all possible pairs of substrings, but now I want other splits (e.g. n=3) such as 'ho','u','se', that is, all possible ways of splitting a single string into n substrings. How can I do that efficiently?

Here is one way to do this, recursively, using a generator:
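def split_str(s, n):
    if n == 1:
        # Base case: the whole remaining string is the last piece
        yield (s,)
    else:
        for i in range(len(s)):
            left, right = s[:i], s[i:]
            # Split the remainder into n - 1 pieces and prepend the left part
            for substrings in split_str(right, n - 1):
                yield (left,) + substrings
for substrings in split_str('house', 3):
    print(substrings)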
This prints out:
('', '', 'house')
('', 'h', 'ouse')
('', 'ho', 'use')
('', 'hou', 'se')
('', 'hous', 'e')
('h', '', 'ouse')
('h', 'o', 'use')
('h', 'ou', 'se')
('h', 'ous', 'e')
('ho', '', 'use')
('ho', 'u', 'se')
('ho', 'us', 'e')
('hou', '', 'se')
('hou', 's', 'e')
('hous', '', 'e')
If you don't want the empty strings, change the loop bounds to
for i in range(1, len(s) - n + 2):
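As a rough sketch of what that change looks like in full, the generator below applies the adjusted loop bounds. The name split_nonempty is hypothetical, and the extra check in the base case is an assumption added here so that an empty input string never yields an empty piece.

```python
def split_nonempty(s, n):
    """Yield every way to cut s into n non-empty substrings."""
    if n == 1:
        if s:  # assumption: skip an empty remainder so no piece is ever ''
            yield (s,)
        return
    # Start at 1 so the left part is non-empty, and stop early enough
    # to leave at least n - 1 characters for the remaining pieces.
    for i in range(1, len(s) - n + 2):
        for rest in split_nonempty(s[i:], n - 1):
            yield (s[:i],) + rest

splits = list(split_nonempty('house', 3))
print(splits)
```

For 'house' and n=3 this should leave only the six splits from the list above that contain no empty string.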

Related

extracting characters from python string using variable index

I'm trying to split a string of letters and numbers into a list of tuples like this:
[(37, 'M'), (1, 'I'), (5, 'M'), (1, 'D'), (25, 'M'), (33, 'S')]
This is what I have so far. It kind of works, but when I try to print "37" (print(cigar[d:pos])) it does not print the entire number, only 3.
#iterate through cigar sequence
print(cigar)
#count position in cigar sequence
pos=0
#count position of last key
d=0
splitCigar=[]
for char in cigar:
    #print(cigar[pos])
    if char.isalpha() == False:
        print("first for-loop")
        print(cigar[d])
        print(cigar[pos])
        print(cigar[d:pos])
        num=(cigar[d:pos])
        pos+=1
    if char.isalpha() == True:
        print("second for-loop")
        splitCigar.append((num,char))
        pos+=1
        d=pos
print(splitCigar)
The output of this code:
37M1I5M1D25M33S
first for-loop
3
3
first for-loop
3
7
3
second for-loop
<and so on...>
second for-loop
[('3', 'M'), ('', 'I'), ('', 'M'), ('', 'D'), ('2', 'M'), ('3', 'S')]
Solution using regexp:
import re
cigar = "37M1I5M1D25M33S"
digits = re.findall('[0-9]+', cigar)
chars = re.findall('[A-Z]+', cigar)
results = list(zip(digits, chars))
Everything printed so you can see what it does:
>>> print(digits)
['37', '1', '5', '1', '25', '33']
>>> print(chars)
['M', 'I', 'M', 'D', 'M', 'S']
>>> print(results)
[('37', 'M'), ('1', 'I'), ('5', 'M'), ('1', 'D'), ('25', 'M'), ('33', 'S')]
I hope this "functional" approach suits you
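A possible variation, sketched here rather than taken from the answer above: a single pattern with two capture groups keeps each count paired with its letter, and int() converts the counts so the result matches the list of (number, letter) tuples asked for in the question. The variable name pairs is only illustrative, and the pattern assumes each count is followed by exactly one letter.

```python
import re

cigar = "37M1I5M1D25M33S"
# Each match is a (digits, letter) tuple, so count and operation stay paired.
pairs = [(int(count), op) for count, op in re.findall(r"(\d+)([A-Za-z])", cigar)]
print(pairs)
# [(37, 'M'), (1, 'I'), (5, 'M'), (1, 'D'), (25, 'M'), (33, 'S')]
```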
The pyparsing library makes parsers more maintainable and readable.
If the format of the data changes, you can modify the parser without too much effort.
import pyparsing as pp
def make_grammar():
    # Number consists of several digits
    num = pp.Word(pp.nums).setName("Num")
    # Convert the num to int
    num = num.setParseAction(
        pp.pyparsing_common.convertToInteger)
    # 1 letter
    letter = pp.Word(pp.alphas, exact=1)\
        .setName("Letter")
    # 1 num followed by letter with possibly
    # some spaces in between
    package = pp.Group(num + letter)
    # 1 or more packages
    grammar = pp.OneOrMore(package)
    return grammar
def main():
    x = "37M1I5M1D25M33S"
    g = make_grammar()
    result = g.parseString(x, parseAll=True)
    print(result)
    # [[37, 'M'], [1, 'I'], [5, 'M'],
    # [1, 'D'], [25, 'M'], [33, 'S']]
    # If you really want tuples:
    print([tuple(r) for r in result])
main()
Sounds like a job for itertools.groupby:
import itertools
inp = '37M1I5M1D25M33S'
e = [''.join(g) for k, g in itertools.groupby(inp, key=lambda l: l.isdigit())]
print(e)
This will give you-
['37', 'M', '1', 'I', '5', 'M', '1', 'D', '25', 'M', '33', 'S']
Basically, groupby collects consecutive elements for which the key function (.isdigit) returns the same value into groups, so runs of digits and runs of letters alternate. Each of those groups is then turned into a string using ''.join.
Now, all you have to do is zip them together-
res = list(zip(e[::2], e[1::2]))
print(res)
That will give you
[('37', 'M'), ('1', 'I'), ('5', 'M'), ('1', 'D'), ('25', 'M'), ('33', 'S')]
If you want integers instead of the string representations of the numbers, that's also simple:
res = list(map(lambda l: (int(l[0]), l[1]), res))
Which yields
[(37, 'M'), (1, 'I'), (5, 'M'), (1, 'D'), (25, 'M'), (33, 'S')]
I'd say this is a pretty pythonic solution for your problem.
You can simply attain the desired output as follows:
cigar= '37M1I5M1D25M33S'
splitCigar=[]
t=[]
num=''
for char in cigar:
    if char.isalpha()==False:
        num+= char
    else:
        t.append(num)
        num=''
        t.append(char)
        splitCigar.append(tuple(t))
        t=[]
print(splitCigar)
Output:
[('37', 'M'), ('1', 'I'), ('5', 'M'), ('1', 'D'), ('25', 'M'), ('33', 'S')]

Python while loop return Nth letter

I have a list of strings
X=['kmo','catlin','mept']
I was trying to write a loop that would return a list that contains lists of every Nth letter of each word:
[['k','c','m'], ['m','a','e'],['o','t','p']]
But all the methods I tried only returned a list of all the letters returned consecutively in one list:
['k','m','o','c','a','t','l','i'.....'t']
Here is one version of my code:
def letters(X):
    prefix=[]
    for i in X:
        j=0
        while j < len(i):
            while j < len(i):
                prefix.append(i[j])
                break
            j+=1
    return prefix
I know I'm looping within each word, but I'm not sure how to correct it.
It seems that the length of the resulting list is dictated by the length of the smallest string in the original list. If that is indeed the case, you could simply do it like this:
X = ['kmo','catlin','mept']
l = len(min(X, key=len))
res = [[x[i] for x in X] for i in range(l)]
which returns:
print(res) # -> [['k', 'c', 'm'], ['m', 'a', 'e'], ['o', 't', 'p']]
or the even simpler (kudos #JonClemens):
res = [list(el) for el in zip(*X)]
with the same result. Note that this works because zip stops as soon as the shortest of its input iterables is exhausted.
If you want to fill the blanks, so to speak, itertools has got your back with its zip_longest function. The fillvalue can be anything you choose; here, '-' is used to demonstrate the use. An empty string '' might be a better option for production code.
from itertools import zip_longest
res = list(zip_longest(*X, fillvalue='-'))
print(res) # -> [('k', 'c', 'm'), ('m', 'a', 'e'), ('o', 't', 'p'), ('-', 'l', 't'), ('-', 'i', '-'), ('-', 'n', '-')]
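For illustration, here is a small sketch of the empty-string variant mentioned above, converted to a list of lists to match the format asked for in the question. The name padded is just an example.

```python
from itertools import zip_longest

X = ['kmo', 'catlin', 'mept']
# Missing letters of the shorter words are filled with '' instead of '-'.
padded = [list(col) for col in zip_longest(*X, fillvalue='')]
print(padded)
# [['k', 'c', 'm'], ['m', 'a', 'e'], ['o', 't', 'p'], ['', 'l', 't'], ['', 'i', ''], ['', 'n', '']]
```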
You can use zip.
output=list(zip(*X))
print(output)
*X unpacks all the elements of X, so each string is passed to zip as a separate argument.
zip() returns a zip object, which is an iterator of tuples: the first items of the passed iterables are paired together, then the second items, and so on. Finally, list() collects everything into a list.
output
[('k', 'c', 'm'), ('m', 'a', 'e'), ('o', 't', 'p')]
If you want the output to be a list of lists, use map.
output=list(map(list,zip(*X)))
print(output)
output
[['k', 'c', 'm'], ['m', 'a', 'e'], ['o', 't', 'p']]
X=['kmo','catlin','mept']
y = []
j=0
for i in X:
    item =''
    for element in X :
        if (len(element) > j):
            item = item + element[j]
    y.append(item)
    j=j+1
print("X = [",X,"]")
print("Y = [",y,"]")
Try this:
def letters(X):
    prefix=[]
    # First, zip the list elements together
    zip_elements = zip(*X)
    for element in zip_elements:
        prefix.append(list(element))
    return prefix

How to get all permutations of string as list of strings (instead of list of tuples)?

The goal was to create a list of all possible combinations of certain letters in a word... Which is fine, except it now ends up as a list of tuples with too many quotes and commas.
import itertools
mainword = input(str("Enter a word: "))
n_word = int((len(mainword)))
outp = (list(itertools.permutations(mainword,n_word)))
What I want:
[yes, yse, eys, esy, sye, sey]
What I'm getting:
[('y', 'e', 's'), ('y', 's', 'e'), ('e', 'y', 's'), ('e', 's', 'y'), ('s', 'y', 'e'), ('s', 'e', 'y')]
Looks to me I just need to remove all the brackets, quotes, and commas.
I've tried:
def remove(old_list, val):
    new_list = []
    for items in old_list:
        if items!=val:
            new_list.append(items)
    return new_list
    print(new_list)
where I just run the function a few times. But it doesn't work.
You can recombine those tuples with a comprehension like:
Code:
new_list = [''.join(d) for d in old_list]
Test Code:
data = [
    ('y', 'e', 's'), ('y', 's', 'e'), ('e', 'y', 's'),
    ('e', 's', 'y'), ('s', 'y', 'e'), ('s', 'e', 'y')
]
data_new = [''.join(d) for d in data]
print(data_new)
Results:
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
You need to call str.join() on each tuple of characters to convert it back to a single string. Your code can be simplified with a list comprehension:
>>> from itertools import permutations
>>> word = 'yes'
>>> [''.join(w) for w in permutations(word)]
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
OR you may also use map() to get the desired result as:
>>> list(map(''.join, permutations(word)))
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
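Since the question mentions using only certain letters of a word, here is a small sketch that also exposes the optional length argument of itertools.permutations. The helper name joined_permutations and its r parameter are hypothetical, chosen only for this example.

```python
from itertools import permutations

def joined_permutations(word, r=None):
    """Return all r-length permutations of word as strings (full length if r is None)."""
    return [''.join(p) for p in permutations(word, r)]

print(joined_permutations('yes'))
# ['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
print(joined_permutations('yes', 2))
# ['ye', 'ys', 'ey', 'es', 'sy', 'se']
```

Note that a word with repeated letters will produce repeated strings, because permutations treats elements as distinct by position.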
You can use the join function. The code below works:
import itertools
mainword = input(str("Enter a word: "))
n_word = int((len(mainword)))
outp = (list(itertools.permutations(mainword,n_word)))
# range(len(outp)) instead of a hard-coded 6, so words of any length work
for i in range(len(outp)):
    outp[i]=''.join(outp[i])
print(outp)

How to force regex stop when hits a 'character' and continue from the start again

import re
match = re.findall(r'(a)(?:.*?(b)|.*?)(?:.*?(c)|.*?)(d)?',
'axxxbxd,axxbxxcd,axxxxxd,axcxxx')
print (match)
output: [('a', 'b', 'c', 'd'), ('a', '', 'c', '')]
I want output as below:
[('a','b','','d'),('a','b','c','d'),('a','','','d'),('a','','c','')]
Each tuple starts with 'a' and has 4 items, taken from the corresponding comma-separated part of the string.
If you want to obtain several matches from a delimited string, either split the string with the delimiters first and run your regex, or replace the . with the [^<YOUR_DELIMITING_CHARS>] (paying attention to \, ^, ] and - that must be escaped). Also note that you can get rid of redundancy in the pattern using optional non-capturing groups.
Note that I assume that a, b and c are placeholders and the real life values can be both single and multicharacter values.
import re
s = 'axxxbxd,axxbxxcd,axxxxxd,axcxxx'
r = r'(a)(?:.*?(b))?(?:.*?(c))?(d)?'
print([re.findall(r, x) for x in s.split(',')])
print ([re.findall(r, x) for x in re.split(r'\W', s)])
# => [[('a', 'b', '', '')], [('a', 'b', 'c', 'd')], [('a', '', '', '')], [('a', '', 'c', '')]]
If your delimiters are non-word chars, use \W.
import re
s = 'axxxbxd,axxbxxcd,axxxxxd,axcxxx'
r = r'(a)(?:.*?(b)|.*?)(?:.*?(c)|.*?)(d)?'
print([re.findall(r, x) for x in s.split(',')])
print ([re.findall(r, x) for x in re.split(r'\W', s)])
# => [[('a', 'b', '', '')], [('a', 'b', 'c', 'd')], [('a', '', '', '')], [('a', '', 'c', '')]]
If the strings can contain line breaks, pass re.DOTALL modifier to the re.findall calls.
Pattern details
(a) - Group 1 capturing a
(?:.*?(b))? - an optional non-capturing group matching a sequence of:
    .*? - any char (other than line break chars if the re.S / re.DOTALL modifier is not used), zero or more occurrences, but as few as possible
    (b) - Group 2: a b value
(?:.*?(c))? - an optional non-capturing group matching a sequence of:
    .*? - any char (other than line break chars if the re.S / re.DOTALL modifier is not used), zero or more occurrences, but as few as possible
    (c) - Group 3: a c value
(d)? - Group 4 (optional): a d.
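The other option mentioned above, replacing . with a negated character class so a match cannot run across the delimiter, can be sketched as follows. This assumes the comma is the only delimiter. The second pattern, loose_d, is a hypothetical variant not taken from the answer: it also lets d follow at a distance, which appears to be what the question's desired output needs.

```python
import re

s = 'axxxbxd,axxbxxcd,axxxxxd,axcxxx'

# [^,] never matches the delimiter, so one findall call over the whole
# string cannot let a match spill into the next comma-separated part.
pattern = r'(a)(?:[^,]*?(b))?(?:[^,]*?(c))?(d)?'
matches = re.findall(pattern, s)
print(matches)
# [('a', 'b', '', ''), ('a', 'b', 'c', 'd'), ('a', '', '', ''), ('a', '', 'c', '')]

# Hypothetical variant: allow other characters before d as well.
loose_d = r'(a)(?:[^,]*?(b))?(?:[^,]*?(c))?(?:[^,]*?(d))?'
loose_matches = re.findall(loose_d, s)
print(loose_matches)
# [('a', 'b', '', 'd'), ('a', 'b', 'c', 'd'), ('a', '', '', 'd'), ('a', '', 'c', '')]
```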
Assuming the letters always appear in the order a ... b ... c ... d within each part, a straightforward approach is to test each part for membership:
import re
s = 'axxxbxd,xxbxxcxxd,xxbxxxd|axcxxx' # extended example
result = []
for seq in re.split(r'\W', s): # split by non-word character
    # keep each letter that occurs in this part, '' otherwise
    result.append([c if c in seq else '' for c in ('a','b','c','d')])
print(result)
The output:
[['a', 'b', '', 'd'], ['', 'b', 'c', 'd'], ['', 'b', '', 'd'], ['a', '', 'c', '']]

How to simultaneously loop over different lists?

I'm trying to loop over a list of strings:
lst = ["Canada", "USA", "England"]
for charNum in range(len(lst)): #charNum the length of the characters
    print(lst[charNum])
    for country in range(len([charNum])): #country is the string
        print(lst[charNum][country])
the output I'm trying to achieve is this:
c U E
a S n
n A g
a   l
d   a
a   n
    d
more details:
for k in range(len(lst[0])):
    print(lst[0][k])
If I run this, it prints:
c
a
n
a
d
a
This is because it only uses the length of the string at index 0. But I need to be able to loop through the other indices as well: 0, 1, 2.
I made some progress by creating a nested for-loop:
for i in range(len(lst)): # length of list
    for j in range(len(lst[i])): # length of each string
        print(lst[i][j])
Use itertools.zip_longest (izip_longest in Python 2) to loop over all the strings simultaneously:
from itertools import zip_longest  # izip_longest in Python 2
places = ["Canada", "USA", "England"]
for chars in zip_longest(*places, fillvalue=' '):
    print(' '.join(chars))
Output:
C U E
a S n
n A g
a   l
d   a
a   n
    d
The Process:
Collected into a list, the output of zip_longest (izip_longest in Python 2) is:
[('C', 'U', 'E'), ('a', 'S', 'n'), ('n', 'A', 'g'), ('a', ' ', 'l'), ('d', ' ', 'a'), ('a', ' ', 'n'), (' ', ' ', 'd')]
The for loop then assigns each "row" to chars sequentially, starting with ('C', 'U', 'E').
' '.join(chars) combines that tuple into a string, with a space between each pair of members. For the first row, that gives 'C U E'.
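To make the process above more concrete, here is a rough index-based equivalent that does not use itertools. The function name column_lines and its fill parameter are hypothetical, and the sketch assumes the list of words is not empty.

```python
def column_lines(words, fill=" "):
    """Return one line per character position, reading the words as columns."""
    width = max(len(w) for w in words)  # assumption: words is non-empty
    lines = []
    for i in range(width):
        # Take the i-th character of each word, or the fill value if the word is too short.
        row = [w[i] if i < len(w) else fill for w in words]
        lines.append(" ".join(row))
    return lines

for line in column_lines(["Canada", "USA", "England"]):
    print(line)
```

The outer loop runs over character positions and the inner comprehension over the words, which is the reverse of the nested loop in the question.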
To iterate over a list, you need to have one.
my_list = ["Paris"]
SomePlaces = ['Canada', 'USA', 'England']  # the list to iterate over
for i in range(len(SomePlaces)):
    print('Places :', SomePlaces[i])
Note: I didn't test it, sorry if I'm wrong.
