Splitting a string and retaining the delimiter with the delimiter appearing contiguously - python

I have the following string:
bar = 'F9B2Z1F8B30Z4'
I have a function foo that splits the string on F, then adds back the F delimiter.
def foo(my_str):
    res = ['F' + elem for elem in my_str.split('F') if elem != '']
    return res
This works unless there are two "F"s back-to-back in the string. For example,
foo('FF9B2Z1F8B30Z4')
returns
['F9B2Z1', 'F8B30Z4']
(the double "F" at the start of the string is not processed)
I'd like the function to split on the first "F" and add it to the list, as follows:
['F', 'F9B2Z1', 'F8B30Z4']
If there is a double "F" in the middle of the string, then the desired behavior would be:
foo('F9B2Z1FF8B30Z4')
['F9B2Z1', 'F', 'F8B30Z4']
Any help would be greatly appreciated.

Instead of filtering with if, slice the result: the unwanted empty string only appears at the beginning, when the input starts with F:
def foo(my_str):
    res = ['F' + elem for elem in my_str.split('F')]
    return res[1:] if my_str and my_str[0]=='F' else res
Output:
>>> foo('FF9B2Z1F8B30Z4')
['F', 'F9B2Z1', 'F8B30Z4']
>>> foo('FF9B2Z1FF8B30Z4FF')
['F', 'F9B2Z1', 'F', 'F8B30Z4', 'F', 'F']
>>> foo('9B2Z1F8B30Z4')
['F9B2Z1', 'F8B30Z4']
>>> foo('')
['F']
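As a rough sketch, the same slicing idea can be wrapped so that the delimiter is a parameter. The name split_keep_delim and the delim argument are illustrative and not part of the answer above.

```python
def split_keep_delim(text, delim='F'):
    """Split text on delim and put delim back in front of every piece."""
    parts = [delim + piece for piece in text.split(delim)]
    # split() yields a leading '' when text starts with delim; drop that one entry.
    return parts[1:] if text.startswith(delim) else parts


print(split_keep_delim('FF9B2Z1F8B30Z4'))   # ['F', 'F9B2Z1', 'F8B30Z4']
print(split_keep_delim('F9B2Z1FF8B30Z4'))   # ['F9B2Z1', 'F', 'F8B30Z4']
```

Using startswith avoids the explicit emptiness check, since an empty string never starts with a non-empty delimiter.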

Using regex it can be done with
import re
pattern = r'^[^F]+|(?<=F)[^F]*'
^[^F]+ matches the leading run of non-F characters when the string does not start with F.
(?<=F)[^F]* matches whatever follows an F, up to the next F. This includes empty matches, which is how back-to-back Fs are preserved.
>>> print(['F' + x for x in re.findall(pattern, 'abcFFFAFF')])
['Fabc', 'F', 'F', 'FA', 'F', 'F']
>>> print(['F' + x for x in re.findall(pattern, 'FFabcFA')])
['F', 'Fabc', 'FA']
>>> print(['F' + x for x in re.findall(pattern, 'abc')])
['Fabc']
Note that this returns an empty list for an empty string. If an empty string needs to return ['F'], the pattern can be changed to pattern = r'^[^F]+|(?<=F)[^F]*|^$', where the added ^$ matches the empty string.
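A minimal sketch of the regex variant wrapped in a function, using the extended pattern with ^$. The names PATTERN and split_keep_f are made up for illustration.

```python
import re

# Leading non-F run, or whatever follows an F (possibly empty), or the empty string.
PATTERN = re.compile(r'^[^F]+|(?<=F)[^F]*|^$')


def split_keep_f(text):
    """Return the pieces of text, each prefixed with 'F'."""
    return ['F' + chunk for chunk in PATTERN.findall(text)]


print(split_keep_f('FF9B2Z1F8B30Z4'))  # ['F', 'F9B2Z1', 'F8B30Z4']
print(split_keep_f('F9B2Z1FF8B30Z4'))  # ['F9B2Z1', 'F', 'F8B30Z4']
print(split_keep_f(''))                # ['F']
```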

Related

Python: replace an exact matching substring with variable

I have a list of strings like 'cdbbdbda', 'fgfghjkbd', 'cdbbd' etc. I also have a variable fed from another list of strings. What I need is to replace a substring in the first list's strings, say b by z, only if it is preceded by a substring from the variable list, leaving all other occurrences untouched.
What I have:
a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
What I do:
for i in a:
    for j in c:
        if j+'b' in i:
            i = re.sub('b', 'z', i)
What I need:
'cdzbdzda'
'fgfghjkbd'
'cdzbd'
What I get:
'cdzzdzda'
'fgfghjkbd'
'cdzzd'
all instances of 'b' are replaced.
I'm new to this, so any help is very welcome. Looking for an answer on Stack Overflow, I found many solutions using regex with word boundaries, or using re or str.replace with a count, but I can't use them because the length of the string and the number of occurrences of 'b' can vary.
I think if you include j in the find and replace, you'll get what you want.
>>> import re
>>> for i in a:
...     for j in c:
...         i = re.sub(j+'b', j+'z', i)
...     print(i)
...
cdzbdzda
fgfghjkbd
cdzbd
>>>
I added print(i) because your loop doesn't make in-place changes, so without that output, it's not possible to see what replacements were made.
You should simply use regular expressions with a positive lookbehind assertion.
Like this:
import re
for i in a:
    for j in c:
        i = re.sub('(?<=' + j + ')b', 'z', i)
The base case is:
re.sub('(?<=d)b', 'z', 'cdbbdbda')
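The base case can be checked directly. The sketch below compares the lookbehind with a captured-group replacement; the two happen to give the same result for this input, but they are not interchangeable in general.

```python
import re

text = 'cdbbdbda'

# Lookbehind: 'd' must precede 'b' but is not part of the match.
with_lookbehind = re.sub(r'(?<=d)b', 'z', text)

# Alternative for this input: match 'db' and put the 'd' back via group 1.
with_group = re.sub(r'(d)b', r'\1z', text)

print(with_lookbehind)  # cdzbdzda
print(with_group)       # cdzbdzda
```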
You can use a list comprehension:
import re
a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
new_a = [re.sub('|'.join('(?<={})b'.format(i) for i in c), 'z', b) for b in a]
Output:
['cdzbdzda', 'fgfghjkbd', 'cdzbd']
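A hedged variant of the list comprehension above: it compiles the pattern once and passes each prefix through re.escape, so that prefixes containing regex metacharacters are treated literally. The function name replace_after and its parameters are hypothetical.

```python
import re


def replace_after(strings, prefixes, target='b', repl='z'):
    """Replace target with repl only where it directly follows one of the prefixes."""
    # One fixed-width lookbehind per prefix, joined with alternation.
    pattern = re.compile('|'.join(
        '(?<={}){}'.format(re.escape(prefix), re.escape(target))
        for prefix in prefixes
    ))
    return [pattern.sub(repl, s) for s in strings]


a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
print(replace_after(a, c))  # ['cdzbdzda', 'fgfghjkbd', 'cdzbd']
```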

split string into list by regex

I need a regex that splits an input string into a list according to these rules:
1) Split on dots;
2) Do not split inside a quoted expression.
Examples:
'a.b.c' -> ['a', 'b', 'c'];
'a."b.c".d' -> ['a', 'b.c', 'd'];
'a.'b.c'.d' -> ['a', 'b.c', 'd'];
'a.'b c'.d' -> ['a', 'b c', 'd'];
You could leverage the newer regex module with the following expression:
(["']).*?\1(*SKIP)(*FAIL)|\.
This matches a quoted section up to its closing quote and then makes that match fail via (*SKIP)(*FAIL), so dots inside quotes are never used as split points. The other alternative is the dot itself.
In Python:
import regex as re
data = """
a.b.c
a."b.c".d
a.'b.c'.d
a.'b c'.d
"""
rx = re.compile(r"""(["']).*?\1(*SKIP)(*FAIL)|\.""")
for line in data.split("\n"):
    if line:
        parts = [part.strip("'").strip('"') for part in rx.split(line) if part]
        print(parts)
Which yields
['a', 'b', 'c']
['a', 'b.c', 'd']
['a', 'b.c', 'd']
['a', 'b c', 'd']
See a demo on regex101.com.
If you want to stick with the re module, you could first replace the unquoted dots with a placeholder and then split on that placeholder.
import re
data = """
a.b.c
a."b.c".d
a.'b.c'.d
a.'b c'.d
"""
rx = re.compile(r"""(["']).*?\1|(?P<dot>\.)""")
needle = "SUPERMAN"
def replacer(match):
    if match.group('dot') is not None:
        return needle
    else:
        return match.group(0)
for line in data.split("\n"):
    if line:
        line = rx.sub(replacer, line)
        parts = [part.strip("'").strip('"') for part in line.split(needle) if part]
        print(parts)
This yields the exact same output as above. Please note that neither approach handles escaped quotes.
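Another option with the standard re module is to tokenize instead of split. The sketch below assumes quotes are balanced, never escaped, and always wrap a whole segment; TOKEN and split_dots are illustrative names.

```python
import re

# A double-quoted segment, a single-quoted segment, or a run of non-dot characters.
TOKEN = re.compile(r'''"([^"]*)"|'([^']*)'|([^.]+)''')


def split_dots(text):
    """Return the dot-separated segments of text, with surrounding quotes removed."""
    # Exactly one of the three groups is filled per match; joining picks it out.
    return [''.join(groups) for groups in TOKEN.findall(text)]


print(split_dots('a.b.c'))       # ['a', 'b', 'c']
print(split_dots('a."b.c".d'))   # ['a', 'b.c', 'd']
print(split_dots("a.'b c'.d"))   # ['a', 'b c', 'd']
```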
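This can also be done without a regex, with a little extra work.
Split on '.' first, then rejoin the pieces that belong to a quoted section.
string_data = 'a."b.c".d'
data = string_data.split('.')
result = []  # renamed from list to avoid shadowing the built-in
skip_next = False
for i in range(len(data)):
    if skip_next:
        # This piece was already merged into the previous quoted value.
        skip_next = False
    elif '"' in data[i]:
        # Rejoin the two halves of the quoted section and drop the quotes.
        result.append((data[i] + '.' + data[i + 1]).strip('"'))
        skip_next = True
    else:
        result.append(data[i])
print(result)
This prints ['a', 'b.c', 'd'] for the input above. It assumes double quotes and exactly one dot inside the quoted section.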
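The manual approach can be generalized by scanning character by character and tracking whether the scan is currently inside a quote. This is only a sketch: split_outside_quotes is a made-up name, and it assumes quotes are balanced and never escaped.

```python
def split_outside_quotes(text, sep='.'):
    """Split text on sep, ignoring separators inside single or double quotes."""
    parts = []
    current = []
    quote = None  # the quote character we are inside, or None

    for ch in text:
        if quote:
            if ch == quote:
                quote = None          # closing quote: leave quoted mode
            else:
                current.append(ch)    # keep everything inside the quotes
        elif ch in ('"', "'"):
            quote = ch                # opening quote: enter quoted mode
        elif ch == sep:
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)

    parts.append(''.join(current))
    return parts


print(split_outside_quotes('a."b.c".d'))  # ['a', 'b.c', 'd']
```

Unlike the split-and-rejoin version, this handles any number of dots inside the quoted section.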
As an alternative, you could use alternation (|) with a positive lookbehind (?<= and a positive lookahead (?= for the single and double quotes:
(?<=").*?(?=")|(?<=').*?(?=')|[a-z]+
import re
regex = r"(?<=\").*?(?=\")|(?<=').*?(?=')|[a-z]+"
line = "a.\"b.t\".qq.d.d.'d'.'d.g.r'.d.d"
print(re.findall(regex, line))
['a', 'b.t', 'qq', 'd', 'd', 'd', '.', 'd.g.r', 'd', 'd']
Note the stray '.' in the result: the dot between the closing quote of 'd' and the opening quote of 'd.g.r' also sits between two single quotes, so the second alternative matches it.
Here is a regex for you:
\.?([^\"\'\.]+)|\"(.+)\"|\'(.+)\'\.?
Implementation:
import re
regex = re.compile(r"""\.?([^\"\'\.]+)|\"(.+)\"|\'(.+)\'\.?""")
def str2list(string):
    # findall returns one tuple of groups per match; keep the non-empty group.
    result = []
    for groups in regex.findall(string):
        for group in groups:
            if group:
                result.append(group)
    return result
print(str2list('a.b.c'))
print(str2list('a."b.c".d'))
print(str2list("a.'b.c'.d"))
output:
['a', 'b', 'c']
['a', 'b.c', 'd']
['a', 'b.c', 'd']

Remove duplicates but retain sequence

I'm trying to reduce a string with duplicates however I do not want to create a set. For example
mystring = 'TTTTTPPPTPTTTTPPPPPPPPP'
The sequence of the letters is 'TPTPTP', so I need a resulting string of
newstring = 'TPTPTP'
I'm sure there is an easy one-liner, but it's evading me.
You're looking for itertools.groupby.
>>> import itertools
>>> mystring = 'TTTTTPPPTPTTTTPPPPPPPPP'
>>> groups = [x for x, y in itertools.groupby(mystring)]
>>> groups
['T', 'P', 'T', 'P', 'T', 'P']
>>> ''.join(groups)
'TPTPTP'
Official documentation
Zip each character with the one before it and keep those that differ:
>>> a
'TTTTTPPPTPTTTTPPPPPPPPP'
>>> ''.join(i for i, j in zip(a, '\0' + a) if i != j)
'TPTPTP'
You can also use regular expressions if you feel like it.
>>> import re
>>> mystring = 'TTTTTPPPTPTTTTPPPPPPPPP'
>>> ''.join(re.findall(r'(.)\1*', mystring))
'TPTPTP'
That looks for any character, followed by the same found character zero or more times.
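The three approaches can be checked against each other with a small sketch. The function names are illustrative. Two caveats worth noting: the zip variant uses '\0' as a placeholder, so it would drop a leading '\0' character, and the regex variant does not match newlines unless re.DOTALL is passed.

```python
import itertools
import re


def collapse_groupby(text):
    # groupby yields one (key, run) pair per run of equal characters.
    return ''.join(key for key, _run in itertools.groupby(text))


def collapse_zip(text):
    # Pair each character with its predecessor; '\0' stands in before the first one.
    return ''.join(cur for cur, prev in zip(text, '\0' + text) if cur != prev)


def collapse_regex(text):
    # With a single group, findall returns the first character of every run.
    return ''.join(re.findall(r'(.)\1*', text))


mystring = 'TTTTTPPPTPTTTTPPPPPPPPP'
print(collapse_groupby(mystring))  # TPTPTP
print(collapse_zip(mystring))      # TPTPTP
print(collapse_regex(mystring))    # TPTPTP
```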

Split a string in Python having parenthesis (multiple splitters)

I have a string, for example:
"ab(abcds)kadf(sd)k(afsd)(lbne)"
I want to split it to a list such that the list is stored like this:
a
b
abcds
k
a
d
f
sd
k
afsd
lbne
I need to get the elements outside the parenthesis in separate rows and the ones inside it in separate ones.
I am not able to think of any solution to this problem.
You can use iter to make an iterator and use itertools.takewhile to extract the strings between the parens:
from itertools import takewhile
s = "ab(abcds)kadf(sd)k(afsd)(lbne)"
it = iter(s)
print([ch if ch != "(" else "".join(takewhile(lambda x: x != ")", it)) for ch in it])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
If ch is not equal to ( we just take the char else if ch is a ( we use takewhile which will keep taking chars until we hit a ) .
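The trick relies on the comprehension and takewhile sharing one iterator, so characters consumed inside the parentheses are never seen by the outer loop. A small sketch on a shorter, made-up input shows the consumption step by step:

```python
from itertools import takewhile

it = iter("ab(cd)e")

first = next(it)     # 'a'
second = next(it)    # 'b'
opening = next(it)   # '('

# takewhile pulls 'c' and 'd', then pulls ')' which fails the test and is discarded.
inside = "".join(takewhile(lambda ch: ch != ")", it))

# Only the characters after the closing parenthesis are left on the iterator.
rest = "".join(it)

print(inside)  # cd
print(rest)    # e
```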
Or use re.findall to get all strings enclosed in parentheses with \((.+?)\) and all other characters with (\w):
import re
print([''.join(tup) for tup in re.findall(r'\((.+?)\)|(\w)', s)])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
You just need to use the magic of 're.split' and some logic.
import re
string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
temp = []
x = re.split(r'[(]',string)
#x = ['ab', 'abcds)kadf', 'sd)k', 'afsd)', 'lbne)']
for i in x:
    if ')' not in i:
        temp.extend(list(i))
    else:
        t = re.split(r'[)]',i)
        temp.append(t[0])
        temp.extend(list(t[1]))
print(temp)
#temp = ['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
Note the difference between append and extend: append adds its argument as a single element, while extend adds each item of the iterable separately.
I hope this helps.
You have two options. The really easy one is to just iterate over the string. For example:
my_string = 'ab(abcds)kadf(sd)k(afsd)(lbne)'
my_list = []
in_parens = False
buffer = ''
for char in my_string:
    if char == '(':
        in_parens = True
    elif char == ')':
        in_parens = False
        my_list.append(buffer)
        buffer = ''
    elif in_parens:
        buffer += char
    else:
        my_list.append(char)
print(my_list)
The other option is regex.
I would suggest regex. It is worth practicing.
Try: Python re. If you are new to re it may take a bit of time but you can do all kind of string manipulations once you get it.
import re
search_string = 'ab(abcds)kadf(sd)k(afsd)(lbne)'
re_pattern = re.compile(r'(\w)|\((\w*)\)')  # Match a single character, or the characters inside parentheses
print([x if x else y for x, y in re_pattern.findall(search_string)])
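The answer above does not show its output, so here is a quick check wrapped in a function. TOKEN and split_parens are hypothetical names, and the sketch assumes parentheses are balanced and not nested.

```python
import re

# Either a parenthesized group of word characters, or one word character on its own.
TOKEN = re.compile(r'\((\w*)\)|(\w)')


def split_parens(text):
    """Return single characters outside parentheses and whole strings inside them."""
    return [inside if inside else single for inside, single in TOKEN.findall(text)]


print(split_parens('ab(abcds)kadf(sd)k(afsd)(lbne)'))
# ['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
```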

match the pattern at the end of a string?

Imagine I have the following strings:
['a','b','c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']
Where the "c" string has important sub-categories (L1, L2, L3). These indicate special data for our purposes that have been generated in a program based on a pre-designated string "L". In other words, I know that the special entries should have the form:
name_Lnumber
Knowing that I'm looking for this pattern, and that I am using "L" or more specifically "_L" as my designation of these objects, how could I return a list of entries that meet this condition? In this case:
['c', 'e']
Use a simple filter:
>>> l = ['a','b','c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']
>>> filter(lambda x: "_L" in x, l)
['c_L1', 'c_L2', 'c_L3', 'e_L1', 'e_L2']
Alternatively, use a list comprehension
>>> [s for s in l if "_L" in s]
['c_L1', 'c_L2', 'c_L3', 'e_L1', 'e_L2']
Since you need the prefix only, you can just split it:
>>> set(s.split("_")[0] for s in l if "_L" in s)
set(['c', 'e'])
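The transcripts above are from Python 2. A rough Python 3 equivalent, with the variable names entries, tagged and prefixes chosen for illustration, might look like this:

```python
entries = ['a', 'b', 'c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']

# In Python 3, filter() returns a lazy iterator, so wrap it in list() to see the items.
tagged = list(filter(lambda s: "_L" in s, entries))

# sorted() gives a deterministic list; a bare set has no guaranteed order.
prefixes = sorted({s.split("_")[0] for s in tagged})

print(tagged)    # ['c_L1', 'c_L2', 'c_L3', 'e_L1', 'e_L2']
print(prefixes)  # ['c', 'e']
```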
You can pass a generator expression to set:
>>> set(i.split('_')[0] for i in l if '_L' in i)
set(['c', 'e'])
Or, if you want to match only the elements that end with _L(digit) and not something like _Lm, you can use a regex:
>>> import re
>>> set(i.split('_')[0] for i in l if re.match(r'.*?_L\d$',i))
set(['c', 'e'])
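If the order of first appearance matters, or the number after _L can have more than one digit, a sketch along these lines may help. SUFFIX and special_prefixes are illustrative names, and the order-preserving behaviour of dict.fromkeys relies on Python 3.7 or later.

```python
import re

# Capture everything before a trailing _L followed by one or more digits.
SUFFIX = re.compile(r'^(?P<name>.+)_L\d+$')


def special_prefixes(entries):
    """Return the distinct name prefixes of entries shaped like name_Lnumber."""
    names = (m.group('name') for m in map(SUFFIX.match, entries) if m)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(names))


entries = ['a', 'b', 'c_L1', 'c_L2', 'c_L3', 'd', 'e', 'e_L1', 'e_L2']
print(special_prefixes(entries))  # ['c', 'e']
```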
