Split string into chunks of same letters [duplicate]

This question already has answers here:
Splitting a string with repeated characters into a list
(4 answers)
Closed 6 years ago.
This should be easy, but I just can't get it to work. I want to split a string into chunks of identical adjacent letters. For example, test = "AAATGG" would be split into "AAA", "T", "GG". I've tried several approaches; one attempt is below. I'd appreciate any help.
I know the idea is to walk through the string: if the next letter is the same as the current one, carry on; otherwise, emit the chunk collected so far and start a new one. I just can't implement it properly.
test = "AAATGG"
TestDict = {}
for index,i in enumerate(test[:-1]):
    string = ""
    if test[index] == test[index+1]:
        string = i + test[index]
    else:
        break
    print string

One way is to use groupby from itertools:
from itertools import groupby
[''.join(g) for _, g in groupby(test)]
# ['AAA', 'T', 'GG']
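For comparison, the loop the question describes can also be written by hand. This is only a minimal sketch; the function name split_runs is made up for illustration and does not come from the question or the answers.

```python
def split_runs(text):
    """Split text into chunks of identical adjacent characters."""
    chunks = []
    current = ""
    for ch in text:
        if current and ch != current[-1]:
            # The run ended: store it and start a new one.
            chunks.append(current)
            current = ""
        current += ch
    if current:
        # Store the final run (skipped for an empty input).
        chunks.append(current)
    return chunks


print(split_runs("AAATGG"))  # ['AAA', 'T', 'GG']
```

The key difference from the attempt in the question is that the chunk is only reset when the letter changes, and the last chunk is stored after the loop ends.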

I'd probably just use itertools.groupby:
>>> import itertools as it
>>> s = 'AAATGG'
>>> for k, g in it.groupby(s):
...     print(k, list(g))
...
('A', ['A', 'A', 'A'])
('T', ['T'])
('G', ['G', 'G'])
>>>
>>> # Multiple non-consecutive occurrences of a given value.
>>> s = 'AAATTGGAAA'
>>> for k, g in it.groupby(s):
...     print(k, list(g))
...
('A', ['A', 'A', 'A'])
('T', ['T', 'T'])
('G', ['G', 'G'])
('A', ['A', 'A', 'A'])
As you can see, g becomes an iterable that yields all consecutive occurrences of the given character (k). I used list(g), to consume the iterable, but you could do anything you like with it (including ''.join(g) to get a string, or sum(1 for _ in g) to get the count).
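The two alternatives mentioned above, ''.join(g) and sum(1 for _ in g), could look like this in a standalone script. The variable names chunks and run_lengths are chosen here for illustration.

```python
from itertools import groupby

s = "AAATTGGAAA"

# Join each group back into a string to get the chunks.
chunks = ["".join(g) for _, g in groupby(s)]

# Count each group instead to get a run-length encoding.
run_lengths = [(k, sum(1 for _ in g)) for k, g in groupby(s)]

print(chunks)       # ['AAA', 'TT', 'GG', 'AAA']
print(run_lengths)  # [('A', 3), ('T', 2), ('G', 2), ('A', 3)]
```

Each group can only be consumed once, which is why groupby is called separately for the two results.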

You can use regex:
>>> re.findall(r'((\w)\2*)', test)
[('AAA', 'A'), ('T', 'T'), ('GG', 'G')]
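Because the pattern has two groups, findall returns tuples rather than plain strings. One way around that, sketched here with re.finditer, is to take the whole match from each match object. Note that this variant swaps \w for . (with re.DOTALL) so that any character counts, which is an assumption beyond the original answer.

```python
import re

test = "AAATGG"

# (.) captures one character and \1* extends the match over its repeats;
# m.group(0) is the whole run, so no tuple unpacking is needed.
# re.DOTALL lets "." match newlines too.
chunks = [m.group(0) for m in re.finditer(r"(.)\1*", test, flags=re.DOTALL)]
print(chunks)  # ['AAA', 'T', 'GG']
```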

You could also use re.findall. In this case, I assumed only the letters A, T, C, and G are present.
import re
re.findall('(A+|T+|G+|C+)', test)
['AAA', 'T', 'GG']

Related

Python while loop return Nth letter

I have a list of strings
X=['kmo','catlin','mept']
I was trying to write a loop that would return a list that contains lists of every Nth letter of each word:
[['k','c','m'], ['m','a','e'],['o','t','p']]
But all the methods I tried only returned a list of all the letters returned consecutively in one list:
['k','m','o','c','a','t','l','i'.....'t']
Here is one version of my code:
def letters(X):
    prefix=[]
    for i in X:
        j=0
        while j < len(i):
            while j < len(i):
                prefix.append(i[j])
                break
            j+=1
    return prefix
I know I'm looping within each word, but I'm not sure how to correct it.

It seems that the length of the resulting list is dictated by the length of the smallest string in the original list. If that is indeed the case, you could simply do it like this:
X = ['kmo','catlin','mept']
l = len(min(X, key=len))
res = [[x[i] for x in X] for i in range(l)]
which returns:
print(res) # -> [['k', 'c', 'm'], ['m', 'a', 'e'], ['o', 't', 'p']]
or, even simpler (kudos to @JonClemens):
res = [list(el) for el in zip(*X)]
with the same result. Note that this works because zip automatically stops iterating as soon as one of its elements is depleted.
If you want to fill the blanks, so to speak, itertools has got your back with its zip_longest function. The fillvalue can be anything you choose; here, '-' is used to demonstrate the use. An empty string '' might be a better option for production code.
from itertools import zip_longest
res = list(zip_longest(*X, fillvalue='-'))
print(res) # -> [('k', 'c', 'm'), ('m', 'a', 'e'), ('o', 't', 'p'), ('-', 'l', 't'), ('-', 'i', '-'), ('-', 'n', '-')]
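The two behaviours (stop at the shortest word, or pad the shorter ones) can be wrapped in one helper. This is a sketch only; the name nth_letters and its fill parameter are hypothetical and not part of the answer.

```python
from itertools import zip_longest


def nth_letters(words, fill=None):
    """Group the Nth letters of each word.

    With fill=None the result stops at the shortest word (plain zip);
    otherwise shorter words are padded with `fill` (zip_longest).
    """
    if fill is None:
        columns = zip(*words)
    else:
        columns = zip_longest(*words, fillvalue=fill)
    return [list(col) for col in columns]


X = ['kmo', 'catlin', 'mept']
print(nth_letters(X))
# [['k', 'c', 'm'], ['m', 'a', 'e'], ['o', 't', 'p']]
print(nth_letters(X, fill='')[3:])
# [['', 'l', 't'], ['', 'i', ''], ['', 'n', '']]
```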

You can use zip.
output=list(zip(*X))
print(output)
*X unpacks all the elements of X, so each string is passed to zip as a separate argument.
zip() then returns a zip object, an iterator of tuples in which the first items of the passed iterables are paired together, then the second items, and so on. Finally, I wrapped everything in a list using list.
output
[('k', 'c', 'm'), ('m', 'a', 'e'), ('o', 't', 'p')]
If you want the output to be a list of lists, use map.
output=list(map(list,zip(*X)))
print(output)
output
[['k', 'c', 'm'], ['m', 'a', 'e'], ['o', 't', 'p']]

X=['kmo','catlin','mept']
y = []
j=0
for i in X:
    item =''
    for element in X :
        if (len(element) > j):
            item = item + element[j]
    y.append(item)
    j=j+1
print("X = [",X,"]")
print("Y = [",y,"]")

Try this:
def letters(X):
    prefix=[]
    # First, zip the list elements together
    zip_elements = zip(*X)
    for element in zip_elements:
        prefix.append(list(element))
    return prefix

Itertools Combinations No Repeats: Where rgb is equivalent to rbg etc

I'm trying to use itertools.combinations to return unique combinations. I've searched through several similar questions but have not been able to find an answer.
An example:
>>> import itertools
>>> e = ['r','g','b','g']
>>> list(itertools.combinations(e,3))
[('r', 'g', 'b'), ('r', 'g', 'g'), ('r', 'b', 'g'), ('g', 'b', 'g')]
For my purposes, (r,g,b) is identical to (r,b,g) and so I would want to return only (rgb),(rgg) and (gbg).
This is just an illustrative example and I would want to ignore all such 'duplicates'. The list e could contain up to 5 elements. Each individual element would be either r, g or b. Always looking for combinations of 3 elements from e.
To be concrete, the following are the only combinations I wish to call 'valid': (rrr), (ggg), (bbb), (rgb).
So perhaps the question boils down to how to treat any variation of (rgb) as equal to (rgb) and therefore ignore it.
Can I use itertools to achieve this, or do I need to write my own code to drop the 'duplicates' here? If there is no itertools solution, I can easily check whether each one is a variation of (rgb), but this feels a bit 'un-pythonic'.

You can use a set to discard duplicates.
In your case the number of characters is the way you identify duplicates so you could use collections.Counter. In order to save them in a set you need to convert them to frozensets though (because Counter isn't hashable):
>>> import itertools
>>> from collections import Counter
>>> e = ['r','g','b','g']
>>> result = []
>>> seen = set()
>>> for comb in itertools.combinations(e,3):
...     cnts = frozenset(Counter(comb).items())
...     if cnts in seen:
...         pass
...     else:
...         seen.add(cnts)
...         result.append(comb)
...
>>> result
[('r', 'g', 'b'), ('r', 'g', 'g'), ('g', 'b', 'g')]
If you want to convert them to strings use:
result.append(''.join(comb)) # instead of result.append(comb)
and it will give:
['rgb', 'rgg', 'gbg']
The approach is a variation of the unique_everseen recipe (itertools module documentation) - so it's probably "quite pythonic".
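The same idea can be packaged as a generator. The sketch below uses a sorted tuple as the deduplication key instead of a frozenset of Counter items, which is a swapped-in technique that assumes the elements can be sorted; the name unique_combinations is made up for illustration.

```python
from itertools import combinations


def unique_combinations(items, r):
    """Yield r-combinations of items, skipping reorderings of earlier ones."""
    seen = set()
    for comb in combinations(items, r):
        key = tuple(sorted(comb))  # same letters in any order -> same key
        if key not in seen:
            seen.add(key)
            yield comb


e = ['r', 'g', 'b', 'g']
print(list(unique_combinations(e, 3)))
# [('r', 'g', 'b'), ('r', 'g', 'g'), ('g', 'b', 'g')]
```

As with the Counter version, the first ordering encountered is the one that is kept.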

According to your definition of "valid outputs", you can directly build them like this:
from collections import Counter
# Your distinct values
values = ['r', 'g', 'b']
e = ['r','g','b','g', 'g']
count = Counter(e)
# Counter({'g': 3, 'r': 1, 'b': 1})
# If x appears at least 3 times, 'xxx' is a valid combination
combinations = [x*3 for x in values if count[x] >=3]
# If all values appear at least once, 'rgb' is a valid combination
if all([count[x]>=1 for x in values]):
    combinations.append('rgb')
print(combinations)
#['ggg', 'rgb']
This will be more efficient than creating all possible combinations and filtering the valid ones afterwards.

It is not completely clear what you want to return. It depends on what comes first when iterating. For example if gbr is found first, then rgb will be discarded as a duplicate:
import itertools
e = ['r','g','b','g']
s = set(e)
v = [s] * len(s)
solns = []
for c in itertools.product(*v):
    in_order = sorted(c)
    if in_order not in solns:
        solns.append(in_order)
print(solns)
This would give you something like the following (the order depends on how the set is iterated):
[['r', 'r', 'r'], ['b', 'r', 'r'], ['g', 'r', 'r'], ['b', 'b', 'r'], ['b', 'g', 'r'], ['g', 'g', 'r'], ['b', 'b', 'b'], ['b', 'b', 'g'], ['b', 'g', 'g'], ['g', 'g', 'g']]
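The product-then-sort loop above produces every multiset of three letters drawn from the distinct values. The standard library's itertools.combinations_with_replacement generates the same ten results directly; the sketch below assumes, as this answer does, that the number of times each letter occurs in e does not matter.

```python
from itertools import combinations_with_replacement

e = ['r', 'g', 'b', 'g']
letters = sorted(set(e))  # ['b', 'g', 'r']; sorting fixes the output order

solns = [list(c) for c in combinations_with_replacement(letters, 3)]
print(len(solns))  # 10
print(solns[:3])   # [['b', 'b', 'b'], ['b', 'b', 'g'], ['b', 'b', 'r']]
```

This avoids building all 27 products and the repeated list membership checks.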

How to print a list of tupled tuples in CSV-acceptable format? [duplicate]

This question already has answers here:
Flatten an irregular (arbitrarily nested) list of lists
(51 answers)
Closed 5 years ago.
I have a list of tuples I would like to print in CSV format without quotes or brackets.
[(('a','b','c'), 'd'), ... ,(('e','f','g'), 'h')]
Desired output:
a,b,c,d,e,f,g,h
I can get rid of some of the punctuation using chain, .join() or the *-operator, but my knowledge is not sophisticated enough to get rid of all of it for my particular use case.
Thank you.

So, in your case there is a pattern which makes this relatively easy:
>>> x = [(('a','b','c'), 'd') ,(('e','f','g'), 'h')]
>>> [c for a,b in x for c in (*a, b)]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
Or, an itertools.chain solution:
>>> from itertools import chain
>>> list(chain.from_iterable((*a, b) for a,b in x))
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
And, in case you are on an old version of Python, and can't use (*a, b) you will need something like:
[c for a,b in x for c in a+(b,)]
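If the nesting is not as regular as in the example, a small recursive generator can flatten it before joining with commas to get the CSV-style line the question asks for. This is a sketch; the helper name flatten is hypothetical, and it assumes every leaf is a string.

```python
def flatten(items):
    """Yield the leaves of arbitrarily nested tuples/lists, left to right."""
    for item in items:
        if isinstance(item, (tuple, list)):
            yield from flatten(item)
        else:
            yield item


x = [(('a', 'b', 'c'), 'd'), (('e', 'f', 'g'), 'h')]
print(','.join(flatten(x)))  # a,b,c,d,e,f,g,h
```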

Split a string in Python having parenthesis (multiple splitters)

I have a string, for example:
"ab(abcds)kadf(sd)k(afsd)(lbne)"
I want to split it to a list such that the list is stored like this:
a
b
abcds
k
a
d
f
sd
k
afsd
lbne
I need to get the elements outside the parenthesis in separate rows and the ones inside it in separate ones.
I am not able to think of any solution to this problem.

You can use iter to make an iterator and use itertools.takewhile to extract the strings between the parens:
from itertools import takewhile
s = "ab(abcds)kadf(sd)k(afsd)(lbne)"
it = iter(s)
print([ch if ch != "(" else "".join(takewhile(lambda x: x != ")", it)) for ch in it])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
If ch is not a (, we just take the character; if it is a (, we use takewhile, which keeps taking characters until we hit a ).
Or, using re.findall, capture the strings enclosed in parentheses with \((.+?)\) and every other character with (\w):
import re
print([''.join(tup) for tup in re.findall(r'\((.+?)\)|(\w)', s)])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']

You just need to use the magic of 're.split' and some logic.
import re
string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
temp = []
x = re.split(r'[(]',string)
#x = ['ab', 'abcds)kadf', 'sd)k', 'afsd)', 'lbne)']
for i in x:
    if ')' not in i:
        temp.extend(list(i))
    else:
        t = re.split(r'[)]',i)
        temp.append(t[0])
        temp.extend(list(t[1]))
print(temp)
#temp = ['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
Note the difference between append and extend: append adds its argument as a single element, while extend adds each item of an iterable separately.
I hope this helps.

You have two options. The really easy one is to just iterate over the string. For example:
my_string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
my_list = []
in_parens = False
buffer = ''
for char in my_string:
    if char == '(':
        in_parens = True
    elif char == ')':
        in_parens = False
        my_list.append(buffer)
        buffer = ''
    elif in_parens:
        buffer += char
    else:
        my_list.append(char)
The other option is regex.
I would suggest regex. It is worth practicing.
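A regex version of the same split might look like the sketch below. The pattern is an illustration rather than the answerer's own, and it assumes the parentheses are balanced and not nested.

```python
import re

my_string = "ab(abcds)kadf(sd)k(afsd)(lbne)"

# Alternative 1 captures the text inside a pair of parentheses;
# alternative 2 captures any single character outside them.
pattern = re.compile(r"\(([^()]*)\)|([^()])")

# At most one of the two groups is non-empty for each match.
my_list = [inside or outside for inside, outside in pattern.findall(my_string)]
print(my_list)
# ['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
```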

Try Python's re module. If you are new to re, it may take a bit of time, but you can do all kinds of string manipulation once you get it.
import re
search_string = 'ab(abcds)kadf(sd)k(afsd)(lbne)'
re_pattern = re.compile(r'(\w)|\((\w*)\)') # Match a single character or the characters in parentheses
print([x if x else y for x,y in re_pattern.findall(search_string)])

match the pattern at the end of a string?

Imagine I have the following strings:
['a','b','c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']
Here the "c" string has important sub-categories (L1, L2, L3). These indicate special data for our purposes that have been generated in a program based on a pre-designated string "L". In other words, I know that the special entries should have the form:
name_Lnumber
Knowing that I'm looking for this pattern, and that I am using "L" or more specifically "_L" as my designation of these objects, how could I return a list of entries that meet this condition? In this case:
['c', 'e']

Use a simple filter:
>>> l = ['a','b','c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']
>>> list(filter(lambda x: "_L" in x, l))
['c_L1', 'c_L2', 'c_L3', 'e_L1', 'e_L2']
Alternatively, use a list comprehension
>>> [s for s in l if "_L" in s]
['c_L1', 'c_L2', 'c_L3', 'e_L1', 'e_L2']
Since you need the prefix only, you can just split it:
>>> set(s.split("_")[0] for s in l if "_L" in s)
set(['c', 'e'])

You can use the following generator expression inside set():
>>> set(i.split('_')[0] for i in l if '_L' in i)
set(['c', 'e'])
Or, if you want to match only the elements that end with _L(digit) and not something like _Lm, you can use a regex:
>>> import re
>>> set(i.split('_')[0] for i in l if re.match(r'.*?_L\d$',i))
set(['c', 'e'])
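The regex can also capture the prefix itself, so no separate split is needed. The sketch below is an illustration that uses \d+ to allow more than one digit after _L, which is an assumption about the naming scheme.

```python
import re

l = ['a', 'b', 'c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']

# Capture everything before a trailing "_L<digits>" suffix.
pattern = re.compile(r"(.+)_L\d+")

prefixes = sorted({m.group(1) for m in map(pattern.fullmatch, l) if m})
print(prefixes)  # ['c', 'e']
```

Unlike split('_')[0], this keeps the whole prefix when the name itself contains an underscore, for example 'x_y_L1' gives 'x_y'.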
