Remove duplicates but retain sequence - python

I'm trying to collapse runs of duplicate characters in a string, but I do not want to create a set, because the sequence must be retained. For example:
mystring = 'TTTTTPPPTPTTTTPPPPPPPPP'
The sequence of the letters is 'TPTPTP', so I need a resulting string of
newstring = 'TPTPTP'
I'm sure there is an easy one-liner, but it's evading me.

You're looking for itertools.groupby.
>>> import itertools
>>> mystring = 'TTTTTPPPTPTTTTPPPPPPPPP'
>>> groups = [x for x, y in itertools.groupby(mystring)]
>>> groups
['T', 'P', 'T', 'P', 'T', 'P']
>>> ''.join(groups)
'TPTPTP'
Official documentation

Zip each character with the one before it and keep those that differ:
>>> a
'TTTTTPPPTPTTTTPPPPPPPPP'
>>> ''.join(i for i, j in zip(a, '\0' + a) if i != j)
'TPTPTP'
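The same comparison can also be written as an explicit loop. This avoids relying on '\0' as a sentinel, which would drop a leading '\0' if the string happened to start with one. This is only a sketch; the helper name collapse_runs is made up for illustration.

```python
def collapse_runs(text):
    """Keep a character only when it differs from the one before it."""
    result = []
    previous = None  # no character has been seen yet
    for ch in text:
        if ch != previous:
            result.append(ch)
        previous = ch
    return ''.join(result)


print(collapse_runs('TTTTTPPPTPTTTTPPPPPPPPP'))  # TPTPTP
```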

You can also use regular expressions if you feel like it.
>>> import re
>>> mystring = 'TTTTTPPPTPTTTTPPPPPPPPP'
>>> ''.join(re.findall(r'(.)\1*', mystring))
'TPTPTP'
That looks for any character, followed by the same found character zero or more times.
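A related sketch, not taken from the answer above: re.sub can collapse each run in place instead of collecting matches. Since . does not match a newline by default, re.DOTALL is assumed here so that runs of newlines are handled too. The name squeeze is hypothetical.

```python
import re


def squeeze(text):
    # (.)\1+ matches a character followed by one or more repeats of itself;
    # each such run is replaced by the single captured character.
    return re.sub(r'(.)\1+', r'\1', text, flags=re.DOTALL)


print(squeeze('TTTTTPPPTPTTTTPPPPPPPPP'))  # TPTPTP
```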

Related

Splitting a string and retaining the delimiter with the delimiter appearing contiguously

I have the following string:
bar = 'F9B2Z1F8B30Z4'
I have a function foo that splits the string on F, then adds back the F delimiter.
def foo(my_str):
    res = ['F' + elem for elem in my_str.split('F') if elem != '']
    return res
This works unless there are two "F"s back-to-back in the string. For example,
foo('FF9B2Z1F8B30Z4')
returns
['F9B2Z1', 'F8B30Z4']
(the double "F" at the start of the string is not processed)
I'd like the function to split on the first "F" and add it to the list, as follows:
['F', 'F9B2Z1', 'F8B30Z4']
If there is a double "F" in the middle of the string, then the desired behavior would be:
foo('F9B2Z1FF8B30Z4')
['F9B2Z1', 'F', 'F8B30Z4']
Any help would be greatly appreciated.
Instead of filtering with if, use slicing, because an empty string is a problem only at the beginning:
def foo(my_str):
    res = ['F' + elem for elem in my_str.split('F')]
    return res[1:] if my_str and my_str[0]=='F' else res
Output:
>>> foo('FF9B2Z1F8B30Z4')
['F', 'F9B2Z1', 'F8B30Z4']
>>> foo('FF9B2Z1FF8B30Z4FF')
['F', 'F9B2Z1', 'F', 'F8B30Z4', 'F', 'F']
>>> foo('9B2Z1F8B30Z4')
['F9B2Z1', 'F8B30Z4']
>>> foo('')
['F']
Using regex it can be done with
import re
pattern = r'^[^F]+|(?<=F)[^F]*'
The ^[^F]+ alternative matches a leading run of non-F characters, which exists only when the string does not start with F.
The (?<=F)[^F]* alternative matches whatever follows each F up to the next F, including empty matches.
>>> print(['F' + x for x in re.findall(pattern, 'abcFFFAFF')])
['Fabc', 'F', 'F', 'FA', 'F', 'F']
>>> print(['F' + x for x in re.findall(pattern, 'FFabcFA')])
['F', 'Fabc', 'FA']
>>> print(['F' + x for x in re.findall(pattern, 'abc')])
['Fabc']
Note that this returns an empty list for an empty string. If an empty string needs to return ['F'], the pattern can be changed to pattern = r'^[^F]+|(?<=F)[^F]*|^$', where the added ^$ matches the empty string.
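As a rough sketch, the extended pattern can be wrapped in a function and checked against the inputs from the question. The names PATTERN and split_keep_f are invented here for illustration.

```python
import re

# Pattern from the answer above, including the ^$ branch for empty input.
PATTERN = re.compile(r'^[^F]+|(?<=F)[^F]*|^$')


def split_keep_f(text):
    # Prefix every match (possibly an empty one) with the delimiter 'F'.
    return ['F' + piece for piece in PATTERN.findall(text)]


print(split_keep_f('FF9B2Z1F8B30Z4'))  # ['F', 'F9B2Z1', 'F8B30Z4']
print(split_keep_f('F9B2Z1FF8B30Z4'))  # ['F9B2Z1', 'F', 'F8B30Z4']
```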

Python: replace an exact matching substring with variable

I have a list of strings like 'cdbbdbda', 'fgfghjkbd', 'cdbbd' etc. I also have a variable fed from another list of strings. What I need is to replace a substring in the first list's strings, say b by z, only if it is preceded by a substring from the variable list, leaving all other occurrences untouched.
What I have:
a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
What I do:
for i in a:
    for j in c:
        if j+'b' in i:
            i = re.sub('b', 'z', i)
What I need:
'cdzbdzda'
'fgfghjkbd'
'cdzbd'
What I get:
'cdzzdzda'
'fgfghjkbd'
'cdzzd'
all instances of 'b' are replaced.
I'm new to this, so any help is very welcome. Looking for an answer on Stack Overflow, I found many solutions using regex word boundaries, or using re or str.replace with a count, but I can't use them because the length of the string and the number of occurrences of 'b' can vary.
I think if you include j in the find and replace, you'll get what you want.
>>> import re
>>> for i in a:
...     for j in c:
...         i = re.sub(j+'b', j+'z', i)
...     print(i)
...
cdzbdzda
fgfghjkbd
cdzbd
>>>
I added print(i) because your loop doesn't make in-place changes, so without that output, it's not possible to see what replacements were made.
You should simply use regular expressions with a positive lookbehind assertion.
Like this:
import re
for i in a:
    for j in c:
        i = re.sub('(?<=' + j + ')b', 'z', i)
The base case is:
re.sub('(?<=d)b', 'z', 'cdbbdbda')
You can use a list comprehension:
import re
a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
new_a = [re.sub('|'.join('(?<={})b'.format(i) for i in c), 'z', b) for b in a]
Output:
['cdzbdzda', 'fgfghjkbd', 'cdzbd']
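If the preceding substrings may contain regex metacharacters such as . or (, it is probably safer to escape them before building the pattern. The following is a minimal sketch under that assumption; the function name replace_after and its parameters are hypothetical, and the replacement is assumed to be a plain string without backslashes.

```python
import re


def replace_after(strings, prefixes, old='b', new='z'):
    """Replace `old` with `new` only where it directly follows a prefix."""
    if not prefixes:
        # An empty pattern would match everywhere, so return copies unchanged.
        return list(strings)
    # One fixed-width lookbehind per prefix; re.escape guards against
    # prefixes that contain regex metacharacters.
    pattern = '|'.join(
        '(?<={}){}'.format(re.escape(p), re.escape(old)) for p in prefixes
    )
    return [re.sub(pattern, new, s) for s in strings]


a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
print(replace_after(a, c))  # ['cdzbdzda', 'fgfghjkbd', 'cdzbd']
```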

Split a string in Python having parenthesis (multiple splitters)

I have a string, for example:
"ab(abcds)kadf(sd)k(afsd)(lbne)"
I want to split it to a list such that the list is stored like this:
a
b
abcds
k
a
d
f
sd
k
afsd
lbne
I need to get each character outside the parentheses as a separate row, and each group inside parentheses as a single row.
I am not able to think of any solution to this problem.
You can use iter to make an iterator and use itertools.takewhile to extract the strings between the parens:
from itertools import takewhile
s = "ab(abcds)kadf(sd)k(afsd)(lbne)"
it = iter(s)
print([ch if ch != "(" else "".join(takewhile(lambda x: x!= ")",it)) for ch in it])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
If ch is not (, we just take the char; if ch is (, we use takewhile, which keeps taking chars from the same iterator until it hits a ), consuming that ) as well.
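The trick depends on takewhile and the outer loop sharing one iterator. A small sketch on a made-up input shows that takewhile also consumes the closing paren, so the outer loop never sees it:

```python
from itertools import takewhile

it = iter("ab)cd")

# takewhile reads 'a', 'b', then ')' which fails the test and is discarded.
inside = "".join(takewhile(lambda ch: ch != ")", it))

# Whatever is left in the shared iterator comes after the ')'.
rest = "".join(it)

print(inside)  # ab
print(rest)    # cd
```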
Or, using re.findall, get all strings enclosed in parentheses with \((.+?)\) and all other characters with (\w):
import re
print([''.join(tup) for tup in re.findall(r'\((.+?)\)|(\w)', s)])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
You just need to use the magic of 're.split' and some logic.
import re
string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
temp = []
x = re.split(r'[(]',string)
#x = ['ab', 'abcds)kadf', 'sd)k', 'afsd)', 'lbne)']
for i in x:
    if ')' not in i:
        temp.extend(list(i))
    else:
        t = re.split(r'[)]', i)
        temp.append(t[0])
        temp.extend(list(t[1]))
print(temp)
#temp = ['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
Note the difference between append and extend: append adds its argument as a single element, while extend adds each item of an iterable separately.
I hope this helps.
You have two options. The really easy one is to just iterate over the string. For example:
my_string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
my_list = []
in_parens = False
buffer = ''
for char in my_string:
    if char == '(':
        in_parens = True
    elif char == ')':
        in_parens = False
        my_list.append(buffer)
        buffer = ''
    elif in_parens:
        buffer += char
    else:
        my_list.append(char)
The other option is regex.
I would suggest regex. It is worth practicing.
Have a look at Python's re module. If you are new to re, it may take a bit of time, but you can do all kinds of string manipulation once you get it.
import re
search_string = 'ab(abcds)kadf(sd)k(afsd)(lbne)'
re_pattern = re.compile(r'(\w)|\((\w*)\)') # Match a single character or the characters inside parentheses
print([x if x else y for x,y in re_pattern.findall(search_string)])

match the pattern at the end of a string?

Imagine I have the following strings:
['a','b','c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']
Where the "c" string has important sub-categories (L1, L2, L3). These indicate special data for our purposes that have been generated in a program based on a pre-designated string "L". In other words, I know that the special entries should have the form:
name_Lnumber
Knowing that I'm looking for this pattern, and that I am using "L" or more specifically "_L" as my designation of these objects, how could I return a list of entries that meet this condition? In this case:
['c', 'e']
Use a simple filter:
>>> l = ['a','b','c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']
>>> list(filter(lambda x: "_L" in x, l))
['c_L1', 'c_L2', 'c_L3', 'e_L1', 'e_L2']
Alternatively, use a list comprehension
>>> [s for s in l if "_L" in s]
['c_L1', 'c_L2', 'c_L3', 'e_L1', 'e_L2']
Since you need the prefix only, you can just split it:
>>> sorted(set(s.split("_")[0] for s in l if "_L" in s))
['c', 'e']
You can pass the following generator expression to set and sort the result:
>>> sorted(set(i.split('_')[0] for i in l if '_L' in i))
['c', 'e']
Or, if you want to match only the elements that end with _L followed by digits, and not something like _Lm, you can use a regex:
>>> import re
>>> sorted(set(i.split('_')[0] for i in l if re.match(r'.*?_L\d+$',i)))
['c', 'e']
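Splitting on the first underscore would cut names that themselves contain one. As a hedged alternative, a capture group can pull out everything before the final _L<number> suffix. This sketch assumes the suffix is always _L followed by one or more digits; the variable names are illustrative only.

```python
import re

names = ['a', 'b', 'c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']
suffix = re.compile(r'(.+)_L\d+')

# fullmatch requires the whole string to have the form name_L<number>.
matches = (suffix.fullmatch(name) for name in names)

# dict.fromkeys removes duplicates while keeping first-seen order.
prefixes = list(dict.fromkeys(m.group(1) for m in matches if m))
print(prefixes)  # ['c', 'e']
```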

Removing all elements containing (",") from a list

muutujad = list(input("Muutujad (sisesta formaadis A,B,C,...): "))
while "," in muutujad == True:
    muutujad.remove(",")
print(muutujad)
My brain says that this code should remove all the commas from the list, so that in the end the list contains only ["A","B","C" ....], but it still contains all the elements. When I tried to visualize the code online, it said that "," in muutujad is always False, but when I check the same expression from the console it says it is True. I know it is a simple question, but I would like to understand the basics.
You can use a list comprehension instead of a while loop:
muutujad = [elem for elem in muutujad if elem != ',']
Your while test itself is also wrong. You never need to test for == True in a while or if condition; that is what the statement already does. But in your case you are actually testing the following:
("," in muutujad) and (muutujad == True)
which is always going to be False, because a list never equals True. In Python, comparison operators like in and == are chained. Leaving off the == True would make your while loop work as intended.
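A short sketch with a hard-coded list (standing in for the user's input) shows the chained comparison and the corrected loop side by side:

```python
items = ["A", ",", "B", ",", "C"]

chained = "," in items == True                  # what the question wrote
explicit = ("," in items) and (items == True)   # how Python evaluates it
print(chained, explicit)  # False False

# Without the redundant comparison the loop runs as intended.
while "," in items:
    items.remove(",")
print(items)  # ['A', 'B', 'C']
```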
I'm not sure you understand what happens when you call list() on a string though; it'll split it into individual characters:
>>> list('Some,string')
['S', 'o', 'm', 'e', ',', 's', 't', 'r', 'i', 'n', 'g']
If you wanted to split the input into elements separated by a comma, use the .split() method instead, and you won't have to remove the commas at all:
>>> 'Some,string'.split(',')
['Some', 'string']
The best option here is to simply parse the string in a better way:
>>> muutujad = input("Muutujad (sisesta formaadis A,B,C,...): ").split(",")
Muutujad (sisesta formaadis A,B,C,...): A, B, C
>>> muutujad
['A', ' B', ' C']
str.split() is a much better option for what you are trying to do here.
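As the session above shows, splitting "A, B, C" leaves leading spaces on the elements. One possible way to tidy that up, sketched here with a fixed string in place of the input() call, is to strip each piece after splitting:

```python
raw = "A, B, C"  # stand-in for the value returned by input()

# Split on commas, trim surrounding whitespace, and skip empty pieces.
muutujad = [part.strip() for part in raw.split(",") if part.strip()]
print(muutujad)  # ['A', 'B', 'C']
```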
What about list(input("Muutujad (sisesta formaadis A,B,C,...): ").replace(',', ''))?
To the downvoter: my point is that this is how you remove commas from a string.
You should not convert your input from a string to a list and then remove the commas from the list; it's absurd.
Instead, do: list(input('...').replace(',', ''))
Or use split, as pointed out above.
