split string python with changing indicator - python

lets say I have a string .
a = '!!!!!!a1######a2&&&&&&a3::::'
It naturally splits by: a1,a2 and a3 to
['!!!!!!','######','&&&&&&','::::']
I want to use the split function, something like:
>>> a.split('a*')
The * indicates that it doesn't matter what character comes after a. Is there an immediate way to do this?

s = '!!!!!!a1######a2&&&&&&a3::::'
import re
print(re.split(r'a[0-9]+', s))

By using regex, with module re, you can try like this:
import re
a = re.split(r'a\d','!!!!!!a1######a2&&&&&&a3::::')
If you want to be more specific in your split keys, try this:
a = re.split(r'a1|a2|a3','!!!!!!a1######a2&&&&&&a3::::')
and make your custom condition, as you wish.

While not as efficient as #Menglong's solution, you could technically do it with just list and string operations, without importing re:
>>> a = '!!!!!!a1######a2&&&&&&a3::::'
>>> s = a.split('a')
>>> s[:1] + [x[1:] for x in s[1:] if x]
['!!!!!!', '######', '&&&&&&', '::::']
This works because if you split on 'a', the first character of every segment after the first will be the * character you want to get rid of.
This solution is not preferable, just something I did as an exercise.

Related

Python - how to substitute a substring using regex with n occurrencies

I have a string with a lot of recurrencies of a single pattern like
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
and I have another string like
b = 'rerTTTytu'
I want to substitute the entire second string having as a reference the 'QQQ' and the 'TTT', and I want to find in this case 3 different results:
'ererTTTytuohnQQQjkhjhnmQQQlkj'
'eresQQQutnrerTTTytujhnmQQQlkj'
'eresQQQutnohnQQQjkhjrerTTTytu'
I've tried using re.sub
re.sub('\w{3}QQQ\w{3}' ,b,a)
but I obtain only the first one, and I don't know how to get the other two solutions.
Edit: As you requested, the two characters surrounding 'QQQ' will be replaced as well now.
I don't know if this is the most elegant or simplest solution for the problem, but it works:
import re
# Find all occurences of ??QQQ?? in a - where ? is any character
matches = [x.start() for x in re.finditer('\S{2}QQQ\S{2}', a)]
# Replace each ??QQQ?? with b
results = [a[:idx] + re.sub('\S{2}QQQ\S{2}', b, a[idx:], 1) for idx in matches]
print(results)
Output
['errerTTTytunohnQQQjkhjhnmQQQlkj',
'eresQQQutnorerTTTytuhjhnmQQQlkj',
'eresQQQutnohnQQQjkhjhrerTTTytuj']
Since you didn't specify the output format, I just put it in a list.

How to split a string and keeping the pattern

This is how the string splitting works for me right now:
output = string.encode('UTF8').split('}/n}')[0]
output += '}\n}'
But I am wondering if there is a more pythonic way to do it.
The goal is to get everything before this '}/n}' including '}/n}'.
This might be a good use of str.partition.
string = '012za}/n}ddfsdfk'
parts = string.partition('}/n}')
# ('012za', '}/n}', 'ddfsdfk')
''.join(parts[:-1])
# 012za}/n}
Or, you can find it explicitly with str.index.
repl = '}/n}'
string[:string.index(repl) + len(repl)]
# 012za}/n}
This is probably better than using str.find since an exception will be raised if the substring isn't found, rather than producing nonsensical results.
It seems like anything "more elegant" would require regular expressions.
import re
re.search('(.*?}/n})', string).group(0)
# 012za}/n}
It can be done with with re.split() -- the key is putting parens around the split pattern to preserve what you split on:
import re
output = "".join(re.split(r'(}/n})', string.encode('UTF8'))[:2])
However, I doubt that this is either the most efficient nor most Pythonic way to achieve what you want. I.e. I don't think this is naturally a split sort of problem. For example:
tag = '}/n}'
encoded = string.encode('UTF8')
output = encoded[:encoded.index(tag)] + tag
or if you insist on a one-liner:
output = (lambda string, tag: string[:string.index(tag)] + tag)(string.encode('UTF8'), '}/n}')
or returning to regex:
output = re.match(r".*}/n}", string.encode('UTF8')).group(0)
>>> string_to_split = 'first item{\n{second item'
>>> sep = '{\n{'
>>> output = [item + sep for item in string_to_split.split(sep)]
NOTE: output = ['first item{\n{', 'second item{\n{']
then you can use the result:
for item_with_delimiter in output:
...
It might be useful to look up os.linesep if you're not sure what the line ending will be. os.linesep is whatever the line ending is under your current OS, so '\r\n' under Windows or '\n' under Linux or Mac. It depends where input data is from, and how flexible your code needs to be across environments.
Adapted from Slice a string after a certain phrase?, you can combine find and slice to get the first part of the string and retain }/n}.
str = "012za}/n}ddfsdfk"
str[:str.find("}/n}")+4]
Will result in 012za}/n}

Python: Dividing a string into substrings

I have a bunch of mathematical expressions stored as strings. Here's a short one:
stringy = "((2+2)-(3+5)-6)"
I want to break this string up into a list that contains ONLY the information in each "sub-parenthetical phrase" (I'm sure there's a better way to phrase that.) So my yield would be:
['2+2','3+5']
I have a couple of ideas about how to do this, but I keep running into a "okay, now what" issue.
For example:
for x in stringy:
substring = stringy[stringy.find('('+1 : stringy.find(')')+1]
stringlist.append(substring)
Works just peachy to return 2+2, but that's about as far as it goes, and I am completely blanking on how to move through the remainder...
One way using regex:
import re
stringy = "((2+2)-(3+5)-6)"
for exp in re.findall("\(([\s\d+*/-]+)\)", stringy):
print exp
Output
2+2
3+5
You could use regular expressions like the following:
import re
x = "((2+2)-(3+5)-6)"
re.findall(r"(?<=\()[0-9+/*-]+(?=\))", x)
Result:
['2+2', '3+5']

Python String stripping and splitting

I'm working with image metadata and able to extract a string that looks like this
Cube1[visible:true, mode:Normal]{r:Cube1.R, g:Cube1.G, b:Cube1.B, a:Cube1.A},
Ground[visible:true, mode:Normal]{r:Ground.R, g:Ground.G, b:Ground.B, a:Ground.A},
Cube3[visible:true, mode:Normal]{r:Cube3.R, g:Cube3.G, b:Cube3.B, a:Cube3.A},
Cube4[visible:true, mode:Normal]{r:Cube4.R, g:Cube4.G, b:Cube4.B, a:Cube4.A},
Sphere[visible:true, mode:Normal]{r:Sphere.R, g:Sphere.G, b:Sphere.B, a:Sphere.A},
OilTank[visible:true, mode:Normal]{r:OilTank.R, g:OilTank.G, b:OilTank.B, a:OilTank.A},
Cube2[visible:true, mode:Normal]{r:Cube2.R, g:Cube2.G, b:Cube2.B, a:Cube2.A}
I what convert that large mess to only the layer names. I also need for the order to stay the same. So, in this case it would be:
Cube1
Ground
Cube3
Cube4
Sphere
OilTank
Cube2
I've tried using "split" and "slice". I'm assuming there is a hierarchy here but I'm not sure where to go next.
If the data is indeed formated like that:
import re
i = [the listed string]
names = [j.strip('[') for j in re.findall("\w+\[\.*", i)]
Output:
['Cube1', 'Ground', 'Cube3', 'Cube4', 'Sphere', 'OilTank', 'Cube2']
If you just need the left-most portion, I would use:
name, _ = line.split("[", 1)
If you need something more complex, I'd look into using regular expressions with the re moduleā€¦ Let me know and I can suggest something.
>>> mess = 'Cube1[visible:true, mode:Normal]{r:Cube1.R, g:Cube1.G, b:Cube1.B, a:Cube1.A},\nGround[visible:true, mode:Normal]{r:Ground.R, g:Ground.G, b:Ground.B, a:Ground.A},\nCube3[visible:true, mode:Normal]{r:Cube3.R, g:Cube3.G, b:Cube3.B, a:Cube3.A},\nCube4[visible:true, mode:Normal]{r:Cube4.R, g:Cube4.G, b:Cube4.B, a:Cube4.A},\nSphere[visible:true, mode:Normal]{r:Sphere.R, g:Sphere.G, b:Sphere.B, a:Sphere.A},\nOilTank[visible:true, mode:Normal]{r:OilTank.R, g:OilTank.G, b:OilTank.B, a:OilTank.A},\nCube2[visible:true, mode:Normal]{r:Cube2.R, g:Cube2.G, b:Cube2.B, a:Cube2.A}'
>>> names = "\n".join(line.split("[", 1)[0] for line in mess.split("\n"))
>>> print names
Cube1
Ground
Cube3
Cube4
Sphere
OilTank
Cube2
I don't know a lot about python, but my thoughts in terms of logic would be this:
Split on the comma character
Loop on the resulting array and cut off everything after the first '[' using substring(indexOf) or similar python manipulation.
Then loop though the array again to concatenate the strings back together.
Sorry I don't know the specific commands for doing this. Hope it helps!
Regexes are unecessary, assuming that really is the exact format of your data.
[i.split('[', 1)[0] for i in lst]
With string split:
names = [ x.split('[')[0] for x in your_text.split('\n') ]
With regular expressions:
import re
names = re.findall(r'^\w+', your_text, re.MULTILINE)

Replacing leading text in Python

I use Python 2.6 and I want to replace each instance of certain leading characters (., _ and $ in my case) in a string with another character or string. Since in my case the replacement string is the same, I came up with this:
def replaceLeadingCharacters(string, old, new = ''):
t = string.lstrip(old)
return new * (len(string) - len(t)) + t
which seems to work fine:
>>> replaceLeadingCharacters('._.!$XXX$._', '._$', 'Y')
'YYY!$XXX$._'
Is there a better (simpler or more efficient) way to achieve the same effect in Python ?
Is there a way to achieve this effect with a string instead of characters? Something like str.replace() that stops once something different than the string-to-be-replaced comes up in the input string? Right now I've come up with this:
def replaceLeadingString(string, old, new = ''):
n = 0
o = 0
s = len(old)
while string.startswith(old, o):
n += 1
o += s
return new * n + string[o:]
I am hoping that there is a way to do this without an explicit loop
EDIT:
There are quite a few answers using the re module. I have a couple of questions/issues with it:
Isn't it significantly slower than the str methods when used as a replacement for them?
Is there an easy way to properly quote/escape strings that will be used in a regular expression? For example if I wanted to use re for replaceLeadingCharacters, how would I ensure that the contents of the old variable will not mess things up in ^[old]+ ? I'd rather have a "black-box" function that does not require its users to pay attention to the list of characters that they provide.
Your replaceLeadingCharacters() seems fine as is.
Here's replaceLeadingString() implementation that uses re module (without the while loop):
#!/usr/bin/env python
import re
def lreplace(s, old, new):
"""Return a copy of string `s` with leading occurrences of
substring `old` replaced by `new`.
>>> lreplace('abcabcdefabc', 'abc', 'X')
'XXdefabc'
>>> lreplace('_abc', 'abc', 'X')
'_abc'
"""
return re.sub(r'^(?:%s)+' % re.escape(old),
lambda m: new * (m.end() / len(old)),
s)
Isn't it significantly slower than the str methods when used as a replacement for them?
Don't guess. Measure it for expected input.
Is there an easy way to properly quote/escape strings that will be used in a regular expression?
re.escape()
re.sub(r'^[._$]+', lambda m: 'Y' * m.end(0), '._.!$XXX$._')
But IMHO your first solution is good enough.

Categories