Python: Dividing a string into substrings - python

I have a bunch of mathematical expressions stored as strings. Here's a short one:
stringy = "((2+2)-(3+5)-6)"
I want to break this string up into a list that contains ONLY the information in each "sub-parenthetical phrase" (I'm sure there's a better way to phrase that.) So my yield would be:
['2+2','3+5']
I have a couple of ideas about how to do this, but I keep running into a "okay, now what" issue.
For example:
for x in stringy:
substring = stringy[stringy.find('('+1 : stringy.find(')')+1]
stringlist.append(substring)
Works just peachy to return 2+2, but that's about as far as it goes, and I am completely blanking on how to move through the remainder...

One way using regex:
import re
stringy = "((2+2)-(3+5)-6)"
for exp in re.findall("\(([\s\d+*/-]+)\)", stringy):
print exp
Output
2+2
3+5

You could use regular expressions like the following:
import re
x = "((2+2)-(3+5)-6)"
re.findall(r"(?<=\()[0-9+/*-]+(?=\))", x)
Result:
['2+2', '3+5']

Related

Search for any number of unknown substrings in place of * in a list of string

First of all, sorry if the title isn't very explicit, it's hard for me to formulate it properly. That's also why I haven't found if the question has already been asked, if it has.
So, I have a list of string, and I want to perform a "procedural" search replacing every * in my target-substring by any possible substring.
Here is an example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('mesh_*')
# should return: ['mesh_1_TMP', 'mesh_2_TMP']
In this case where there is just one * I just split each string with * and use startswith() and/or endswith(), so that's ok.
But I don't know how to do the same thing if there are multiple * in the search string.
So my question is, how do I search for any number of unknown substrings in place of * in a list of string?
For example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('*_1_*')
# should return: ['obj_1_mesh', 'mesh_1_TMP']
Hope everything is clear enough. Thanks.
Consider using 'fnmatch' which provides Unix-like file pattern matching. More info here http://docs.python.org/2/library/fnmatch.html
from fnmatch import fnmatch
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
resultSubList = [ strList[i] for i,x in enumerate(strList) if fnmatch(x,searchFor) ]
This should do the trick
I would use the regular expression package for this if I were you. You'll have to learn a little bit of regex to make correct search queries, but it's not too bad. '.+' is pretty similar to '*' in this case.
import re
def search_strings(str_list, search_query):
regex = re.compile(search_query)
result = []
for string in str_list:
match = regex.match(string)
if match is not None:
result+=[match.group()]
return result
strList= ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
print search_strings(strList, '.+_1_.+')
This should return ['obj_1_mesh', 'mesh_1_TMP']. I tried to replicate the '*_1_*' case. For 'mesh_*' you could make the search_query 'mesh_.+'. Here is the link to the python regex api: https://docs.python.org/2/library/re.html
The simplest way to do this is to use fnmatch, as shown in ma3oun's answer. But here's a way to do it using Regular Expressions, aka regex.
First we transform your searchFor pattern so it uses '.+?' as the "wildcard" instead of '*'. Then we compile the result into a regex pattern object so we can efficiently use it multiple tests.
For an explanation of regex syntax, please see the docs. But briefly, the dot means any character (on this line), the + means look for one or more of them, and the ? means do non-greedy matching, i.e., match the smallest string that conforms to the pattern rather than the longest, (which is what greedy matching does).
import re
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
pat = re.compile(searchFor.replace('*', '.+?'))
result = [s for s in strList if pat.match(s)]
print(result)
output
['obj_1_mesh', 'mesh_1_TMP']
If we use searchFor = 'mesh_*' the result is
['mesh_1_TMP', 'mesh_2_TMP']
Please note that this solution is not robust. If searchFor contains other characters that have special meaning in a regex they need to be escaped. Actually, rather than doing that searchFor.replace transformation, it would be cleaner to just write the pattern using regex syntax in the first place.
If the string you are looking for looks always like string you can just use the find function, you'll get something like:
for s in strList:
if s.find(searchFor) != -1:
do_something()
If you have more than one string to look for (like abc*123*test) you gonna need to look for the each string, find the second one in the same string starting at the index you found the first + it's len and so on.

How to split a string and keeping the pattern

This is how the string splitting works for me right now:
output = string.encode('UTF8').split('}/n}')[0]
output += '}\n}'
But I am wondering if there is a more pythonic way to do it.
The goal is to get everything before this '}/n}' including '}/n}'.
This might be a good use of str.partition.
string = '012za}/n}ddfsdfk'
parts = string.partition('}/n}')
# ('012za', '}/n}', 'ddfsdfk')
''.join(parts[:-1])
# 012za}/n}
Or, you can find it explicitly with str.index.
repl = '}/n}'
string[:string.index(repl) + len(repl)]
# 012za}/n}
This is probably better than using str.find since an exception will be raised if the substring isn't found, rather than producing nonsensical results.
It seems like anything "more elegant" would require regular expressions.
import re
re.search('(.*?}/n})', string).group(0)
# 012za}/n}
It can be done with with re.split() -- the key is putting parens around the split pattern to preserve what you split on:
import re
output = "".join(re.split(r'(}/n})', string.encode('UTF8'))[:2])
However, I doubt that this is either the most efficient nor most Pythonic way to achieve what you want. I.e. I don't think this is naturally a split sort of problem. For example:
tag = '}/n}'
encoded = string.encode('UTF8')
output = encoded[:encoded.index(tag)] + tag
or if you insist on a one-liner:
output = (lambda string, tag: string[:string.index(tag)] + tag)(string.encode('UTF8'), '}/n}')
or returning to regex:
output = re.match(r".*}/n}", string.encode('UTF8')).group(0)
>>> string_to_split = 'first item{\n{second item'
>>> sep = '{\n{'
>>> output = [item + sep for item in string_to_split.split(sep)]
NOTE: output = ['first item{\n{', 'second item{\n{']
then you can use the result:
for item_with_delimiter in output:
...
It might be useful to look up os.linesep if you're not sure what the line ending will be. os.linesep is whatever the line ending is under your current OS, so '\r\n' under Windows or '\n' under Linux or Mac. It depends where input data is from, and how flexible your code needs to be across environments.
Adapted from Slice a string after a certain phrase?, you can combine find and slice to get the first part of the string and retain }/n}.
str = "012za}/n}ddfsdfk"
str[:str.find("}/n}")+4]
Will result in 012za}/n}

How to find values in string, manipulate and replace them

How to find values in string, add specific value to each of them and replace output with fixed string.
import re
def _replace(content):
#x = float(content.group(4))+20
#y = float(content.group(6))+20
return content.group(6)
print re.sub('<g(\s)transform="matrix\((.*)(\s)(.*)(\s)(.*)\)\"', _replace, '<g transform="matrix(0.412445 -0.0982513 0.0982513 0.412445 -5.77618 67.0025)">')
First off, I should repeat the usual warning about not parsing XML with regexes. It's a bad idea, and it will never work for all cases. If you're actually trying to parse the full xml document, use an XML parser.
That having been said, I'm guilty of doing quick and dirty stuff like this all the time. If you really just need a one-off solution, a simple regex can often get the job done. Just be aware that it will come back to haunt you as soon as you run into something more complex!
Next, I confess to not being much of a regex wiz, but here's how I'd modify your code snippet:
import re
def _replace(content):
values = [float(val) for val in content.group(2).split()]
values[3] += 20
values[5] += 100
values = ['{0}'.format(val) for val in values]
return content.group(1) + ' '.join(values) + content.group(3)
test_string = '<g transform="matrix(0.412445 -0.0982513 0.0982513 0.412445 -5.77618 67.0025)">'
pattern = r'(transform=\"matrix\()(.*?)(\))'
print test_string
print re.sub(pattern, _replace, test_string)

Replacing leading text in Python

I use Python 2.6 and I want to replace each instance of certain leading characters (., _ and $ in my case) in a string with another character or string. Since in my case the replacement string is the same, I came up with this:
def replaceLeadingCharacters(string, old, new = ''):
t = string.lstrip(old)
return new * (len(string) - len(t)) + t
which seems to work fine:
>>> replaceLeadingCharacters('._.!$XXX$._', '._$', 'Y')
'YYY!$XXX$._'
Is there a better (simpler or more efficient) way to achieve the same effect in Python ?
Is there a way to achieve this effect with a string instead of characters? Something like str.replace() that stops once something different than the string-to-be-replaced comes up in the input string? Right now I've come up with this:
def replaceLeadingString(string, old, new = ''):
n = 0
o = 0
s = len(old)
while string.startswith(old, o):
n += 1
o += s
return new * n + string[o:]
I am hoping that there is a way to do this without an explicit loop
EDIT:
There are quite a few answers using the re module. I have a couple of questions/issues with it:
Isn't it significantly slower than the str methods when used as a replacement for them?
Is there an easy way to properly quote/escape strings that will be used in a regular expression? For example if I wanted to use re for replaceLeadingCharacters, how would I ensure that the contents of the old variable will not mess things up in ^[old]+ ? I'd rather have a "black-box" function that does not require its users to pay attention to the list of characters that they provide.
Your replaceLeadingCharacters() seems fine as is.
Here's replaceLeadingString() implementation that uses re module (without the while loop):
#!/usr/bin/env python
import re
def lreplace(s, old, new):
"""Return a copy of string `s` with leading occurrences of
substring `old` replaced by `new`.
>>> lreplace('abcabcdefabc', 'abc', 'X')
'XXdefabc'
>>> lreplace('_abc', 'abc', 'X')
'_abc'
"""
return re.sub(r'^(?:%s)+' % re.escape(old),
lambda m: new * (m.end() / len(old)),
s)
Isn't it significantly slower than the str methods when used as a replacement for them?
Don't guess. Measure it for expected input.
Is there an easy way to properly quote/escape strings that will be used in a regular expression?
re.escape()
re.sub(r'^[._$]+', lambda m: 'Y' * m.end(0), '._.!$XXX$._')
But IMHO your first solution is good enough.

Returning all characters before the first underscore

Using re in Python, I would like to return all of the characters in a string that precede the first appearance of an underscore. In addition, I would like the string that is being returned to be in all uppercase and without any non-alpanumeric characters.
For example:
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
I am pretty sure I know how to return a string in all uppercase using string.upper() but I'm sure there are several ways to remove the . efficiently. Any help would be greatly appreciated. I am still learning regular expressions slowly but surely. Each tip gets added to my notes for future use.
To further clarify, my above examples aren't the actual strings. The actual string would look like:
AG.av08_binloop_v6
With my desired output looking like:
AGAV08
And the next example would be the same. String:
TL.av1_binloopv2
Desired output:
TLAV1
Again, thanks all for the help!
Even without re:
text.split('_', 1)[0].replace('.', '').upper()
Try this:
re.sub("[^A-Z\d]", "", re.search("^[^_]*", str).group(0).upper())
Since everyone is giving their favorite implementation, here's mine that doesn't use re:
>>> for s in ('AG.av08_binloop_v6', 'TL.av1_binloopv2'):
... print ''.join(c for c in s.split('_',1)[0] if c.isalnum()).upper()
...
AGAV08
TLAV1
I put .upper() on the outside of the generator so it is only called once.
You don't have to use re for this. Simple string operations would be enough based on your requirements:
tests = """
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
"""
for t in tests.splitlines():
print t[:t.find('_')].replace('.', '').upper()
# Returns:
# AGAV08
# TLAV1
Or if you absolutely must use re:
import re
pat = r'([a-zA-Z0-9.]+)_.*'
pat_re = re.compile(pat)
for t in tests.splitlines():
print re.sub(r'\.', '', pat_re.findall(t)[0]).upper()
# Returns:
# AGAV08
# TLAV1
He, just for fun, another option to get text before the first underscore is:
before_underscore, sep, after_underscore = str.partition('_')
So all in one line could be:
re.sub("[^A-Z\d]", "", str.partition('_')[0].upper())
import re
re.sub("[^A-Z\d]", "", yourstr.split('_',1)[0].upper())

Categories