python regex: capture parts of multiple strings that contain spaces

I am trying to capture sub-strings from a string that looks similar to
'some string, another string, '
I want the result match group to be
('some string', 'another string')
My current solution
>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')
works, but is not practical - what I am showing here is of course massively reduced in complexity compared to what I'm doing in the real project; I want to use one 'straight' (non-computed) regex pattern only. Unfortunately, my attempts have failed so far:
This doesn't match (None as result), because {2} is applied to the space only, not to the whole string:
>>> match('.*?, {2}', 'some string, another string, ')
Adding parentheses around the repeated string leaves the comma and space in the result:
>>> match('(.*?, ){2}', 'some string, another string, ').groups()
('another string, ',)
Adding another set of parentheses does fix that, but captures too much:
>>> match('((.*?), ){2}', 'some string, another string, ').groups()
('another string, ', 'another string')
Adding a non-capturing group improves the result, but still misses the first string:
>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)
I feel like I'm close, but I can't really seem to find the proper way.
Can anyone help me? Any other approaches I'm not seeing?
Update after the first few responses:
First up, thank you very much everyone, your help is greatly appreciated! :-)
As I said in the original post, I have omitted a lot of complexity in my question for the sake of depicting the actual core problem. For starters, in the project I am working on, I am parsing large amounts of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly in the hundreds later) of different line-based formats. There are also XML, JSON, binary and some other data file formats, but let's stay focused.
In order to cope with the multitude of file formats and to exploit the fact that many of them are line-based, I have created a somewhat generic Python module that loads one file after the other, applies a regex to every line and returns a large data structure with the matches. This module is a prototype; the production version will require a C++ implementation for performance reasons, which will be connected via Boost::Python and will probably add the subject of regex dialects to the list of complexities.
Also, there are not 2 repetitions, but a number currently varying between zero and 70 (or so); the comma is not always a comma, and despite what I said originally, some parts of the regex pattern will have to be computed at runtime. Let's just say I have reason to try to reduce the 'dynamic' part and have as much 'fixed' pattern as possible.
So, in a word: I must use regular expressions.
Attempt to rephrase: I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture
'some string, another string, '
into
('some string', 'another string')
?
Hmmm, that probably narrows it down too far - but then, any way you do it is wrong :-D
Second attempt to rephrase: Why do I not see the first string ('some string') in the result? Why does the regex produce a match (indicating there's gotta be 2 of something), but only return 1 string (the second one)?
The problem remains the same even if I use non-numeric repetition, i.e. using + instead of {2}:
>>> match('(?:(.*?), )+', 'some string, another string, ').groups()
('another string',)
Also, it's not the second string that's returned, it is the last one:
>>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
('third string',)
Again, thanks for your help, never ceases to amaze me how helpful peer review is while trying to find out what I actually want to know...

Unless there's much more to this problem than you've explained, I don't see the point in using regexes. This is very simple to deal with using basic string methods:
[s.strip() for s in mys.split(',') if s.strip()]
Or if it has to be a tuple:
tuple(s.strip() for s in mys.split(',') if s.strip())
The code is more readable too. Please tell me if this fails to apply.
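For instance, with the string from the question (mys here is just an illustrative variable name):
>>> mys = 'some string, another string, '
>>> [s.strip() for s in mys.split(',') if s.strip()]
['some string', 'another string']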
EDIT: Ok, there is indeed more to this problem than it initially seemed. Leaving this for historical purposes though. (Guess I'm not 'disciplined' :) )

As described, I think this regex works fine:
import re
thepattern = re.compile("(.+?)(?:,|$)") # lazy non-empty match
thepattern.findall("a, b, asdf, d") # until comma or end of line
# Result:
Out[19]: ['a', ' b', ' asdf', ' d']
The key here is to use findall rather than match. The phrasing of your question suggests you prefer match, but it isn't the right tool for the job here -- it is designed to return exactly one string for each corresponding group ( ) in the regex. Since your 'number of strings' is variable, the right approach is to use either findall or split.
If this isn't what you need, then please make the question more specific.
Edit: And if you must use tuples rather than lists:
tuple(Out[19])
# Result
Out[20]: ('a', ' b', ' asdf', ' d')

import re
regex = " *((?:[^, ]| +[^, ])+) *, *((?:[^, ]| +[^, ])+) *, *"
print re.match(regex, 'some string, another string, ').groups()
# ('some string', 'another string')
print re.match(regex, ' some string, another string, ').groups()
# ('some string', 'another string')
print re.match(regex, ' some string , another string, ').groups()
# ('some string', 'another string')

No offense, but you obviously have a lot to learn about regexes, and what you're going to learn, ultimately, is that regexes can't handle this job. I'm sure this particular task is doable with regexes, but then what? You say you have potentially hundreds of different file formats to parse! You even mentioned JSON and XML, which are fundamentally incompatible with regexes.
Do yourself a favor: forget about regexes and learn pyparsing instead. Or skip Python entirely and use a standalone parser generator like ANTLR. In either case, you'll probably find that grammars for most of your file formats have already been written.
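For a taste of what that looks like, here is a rough sketch of my own, assuming the pyparsing package is installed; the grammar below is illustrative, not taken from any existing format:
>>> from pyparsing import Regex, delimitedList
>>> phrase = Regex(r'[^,]+')         # any run of characters up to the next comma
>>> grammar = delimitedList(phrase)  # comma-separated list; pyparsing skips surrounding whitespace
>>> grammar.parseString('some string, another string, ').asList()
['some string', 'another string']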

I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture 'some string, another string, ' ?
I don't think there is such a notation.
But regexes are not only a matter of NOTATION, that is to say the RE string used to define a regex. They are also a matter of TOOLS, that is to say functions.
Unfortunately, I can't use findall as the string from the initial question is only a part of the problem, the real string is a lot longer, so findall only works if I do multiple regex findalls / matches / searches.
You should give more information right away: that way we can understand the constraints more quickly. In my opinion, to answer the problem as you have presented it, findall() is indeed OK:
import re
for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):
    print re.findall('(.+?), *', line)
Result
['string one', 'string two']
['some string', 'another string', 'third string']
['Topaz', 'Turquoise', 'Moss Agate', 'Obsidian', 'Tigers-Eye', 'Tourmaline', 'Lapis Lazuli']
Now, since you "have omitted a lot of complexity" in your question, findall() could turn out to be insufficient to handle that complexity. In that case finditer() can be used, because it allows more flexibility in selecting the groups of each match:
import re
for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):
    print [mat.group(1) for mat in re.finditer('(.+?), *', line)]
gives the same result and can be made more complex by writing other expressions in place of mat.group(1).

In order to sum this up, it seems I am already using the best solution by constructing the regex pattern in a 'dynamic' manner:
>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')
The
2 * '(.*?), '
is what I mean by 'dynamic'. The alternative approach
>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)
fails to return the desired result because (as Glenn and Alan kindly explained) with match, the captured content gets overwritten with each repetition of the capturing group.
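Side note, not from the answers above: the third-party regex module (pip install regex) keeps every repetition of a capturing group and exposes them via .captures(). A minimal sketch, assuming that module is available:
>>> import regex  # third-party module, not the stdlib re
>>> regex.match(r'(?:(.*?), )+', 'some string, another string, ').captures(1)
['some string', 'another string']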
Thanks for your help everyone! :-)

Related

Why is there an empty string, and only one, created when I split a string by "?" [duplicate]

What is the point of '/segment/segment/'.split('/') returning ['', 'segment', 'segment', '']?
Notice the empty elements. If you're splitting on a delimiter that happens to be at position one and at the very end of a string, what extra value does it give you to have the empty string returned from each end?
str.split complements str.join, so
"/".join(['', 'segment', 'segment', ''])
gets you back the original string.
If the empty strings were not there, the first and last '/' would be missing after the join().
More generally, to remove empty strings returned in split() results, you may want to look at the filter function.
Example:
f = filter(None, '/segment/segment/'.split('/'))
s_all = list(f)
returns
['segment', 'segment']
There are two main points to consider here:
Expecting the result of '/segment/segment/'.split('/') to be equal to ['segment', 'segment'] is reasonable, but then this loses information. If split() worked the way you wanted, if I tell you that a.split('/') == ['segment', 'segment'], you can't tell me what a was.
What should the result of 'a//b'.split('/') be? ['a', 'b'], or ['a', '', 'b']? I.e., should split() merge adjacent delimiters? If it should, then it will be very hard to parse data that's delimited by a character where some of the fields can be empty. I am fairly sure there are many people who do want the empty values in the result for the above case!
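Indeed, Python keeps the empty middle field:
>>> 'a//b'.split('/')
['a', '', 'b']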
In the end, it boils down to two things:
Consistency: if I have n delimiters in a, I get n+1 values back after the split().
It should be possible to do complex things, and easy to do simple things: if you want to ignore empty strings as a result of the split(), you can always do:
def mysplit(s, delim=None):
    return [x for x in s.split(delim) if x]
but if one doesn't want to ignore the empty values, one should be able to.
The language has to pick one definition of split()—there are too many different use cases to satisfy everyone's requirement as a default. I think that Python's choice is a good one, and is the most logical. (As an aside, one of the reasons I don't like C's strtok() is because it merges adjacent delimiters, making it extremely hard to do serious parsing/tokenization with it.)
There is one exception: a.split() without an argument squeezes consecutive whitespace, but one can argue that this is the right thing to do in that case. If you don't want that behavior, you can always do a.split(' ').
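A quick comparison of the two behaviors:
>>> '  a  b '.split()
['a', 'b']
>>> '  a  b '.split(' ')
['', '', 'a', '', 'b', '']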
I'm not sure what kind of answer you're looking for. You get four parts because you have three delimiters. If you don't want the empty ones, just use:
'/segment/segment/'.strip('/').split('/')
Having x.split(y) always return a list of 1 + x.count(y) items is a precious regularity -- as #gnibbler's already pointed out it makes split and join exact inverses of each other (as they obviously should be), it also precisely maps the semantics of all kinds of delimiter-joined records (such as csv file lines [[net of quoting issues]], lines from /etc/group in Unix, and so on), it allows (as #Roman's answer mentioned) easy checks for (e.g.) absolute vs relative paths (in file paths and URLs), and so forth.
Another way to look at it is that you shouldn't wantonly toss information out of the window for no gain. What would be gained in making x.split(y) equivalent to x.strip(y).split(y)? Nothing, of course -- it's easy to use the second form when that's what you mean, but if the first form was arbitrarily deemed to mean the second one, you'd have lot of work to do when you do want the first one (which is far from rare, as the previous paragraph points out).
But really, thinking in terms of mathematical regularity is the simplest and most general way you can teach yourself to design passable APIs. To take a different example, it's very important that for any valid x and y x == x[:y] + x[y:] -- which immediately indicates why one extreme of a slicing should be excluded. The simpler the invariant assertion you can formulate, the likelier it is that the resulting semantics are what you need in real life uses -- part of the mystical fact that maths is very useful in dealing with the universe.
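That slicing invariant is easy to check in the interpreter:
>>> x = 'abcdef'
>>> all(x == x[:y] + x[y:] for y in range(-10, 10))
True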
Try formulating the invariant for a split dialect in which leading and trailing delimiters are special-cased... counter-example: string methods such as isspace are not maximally simple -- x.isspace() is equivalent to x and all(c in string.whitespace for c in x) -- that silly leading x and is why you so often find yourself coding not x or x.isspace(), to get back to the simplicity which should have been designed into the is... string methods (whereby an empty string "is" anything you want -- contrary to man-in-the-street horse-sense, maybe [[empty sets, like zero &c, have always confused most people;-)]], but fully conforming to obvious well-refined mathematical common-sense!-).
Well, it lets you know there was a delimiter there. So, seeing 4 results lets you know you had 3 delimiters. This gives you the power to do whatever you want with this information, rather than having Python drop the empty elements, and then making you manually check for starting or ending delimiters if you need to know it.
Simple example: Say you want to check for absolute vs. relative filenames. This way you can do it all with the split, without also having to check what the first character of your filename is.
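For instance (my own illustration of that idea):
>>> '/usr/bin'.split('/')[0] == ''   # absolute path: empty first element
True
>>> 'usr/bin'.split('/')[0] == ''    # relative path
False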
Consider this minimal example:
>>> '/'.split('/')
['', '']
split must give you what's before and after the delimiter '/', but there are no other characters. So it has to give you the empty string, which technically precedes and follows the '/', because '' + '/' + '' == '/'.
If you don't want empty strings to be returned by split, use it without arguments:
>>> " this is a sentence ".split()
['this', 'is', 'a', 'sentence']
>>> " this is a sentence ".split(" ")
['', '', 'this', '', '', 'is', '', 'a', 'sentence', '']
Always use the strip function before split if you want to ignore blank lines:
youroutput.strip().split('splitter')
Example:
yourstring =' \nhey\njohn\nhow\n\nare\nyou'
yourstring.strip().split('\n')

how to remove or translate multiple strings from strings?

I have a long string like this:
'[("He tended to be helpful, enthusiastic, and encouraging, even to studentsthat didn\'t have very much innate talent.\\n",), (\'Great instructor\\n\',), (\'He could always say something nice and was always helpful.\\n\',), (\'He knew what he was doing.\\n\',), (\'Likes art\\n\',), (\'He enjoys the classwork.\\n\',), (\'Good discussion of ideas\\n\',), (\'Open-minded\\n\',), (\'We learned stuff without having to take notes, we just applied it to what we were doing; made it an interesting and fun class.\\n\',), (\'Very kind, gave good insight on assignments\\n\',), (\' Really pushed me in what I can do; expanded how I thought about art, the materials used, and how it was visually.\\n\',)
and I want to remove all [, (, ", \, \n from this string at once. Somehow I can do it one by one, but I always fail with '\n'. Is there any efficient way to remove or translate all these characters or blank-line symbols?
Since my sentences are not long, I do not want to use dictionary methods like in earlier questions.
Maybe you could use a regex to find all the characters that you want to replace:
import re

s = s.strip().replace("\\n", "")    # remove the literal backslash-n pairs first
r = re.compile(r"[\[\]()\\\"',]")   # character class of the remaining unwanted characters
s = re.sub(r, '', s)
print s
I had some problems with the "\n" pairs, so they are replaced before the regex runs; the remaining characters are then easy to remove in one pass.
If the string is a valid Python expression, you can use literal_eval from the ast module to transform the string into tuples, and after that you can process every tuple.
from ast import literal_eval
' '.join(el[0].strip() for el in literal_eval(your_string))
If not, you can use this:
import re

def get_part_string(your_string):
    for part in re.findall(r'\((.+?)\)', your_string):
        # remove literal \n pairs, quotes and backslashes, then trim commas/spaces
        yield re.sub(r'\\n|[\'"\\]', '', part).strip(', ')

''.join(get_part_string(your_string))
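Another option the question's wording ("remove or translate") hints at: in Python 2, str.translate takes a deletechars argument. A rough sketch of my own (the literal backslash-n pairs must be replaced first, or stray 'n' characters would be left behind):
>>> s = your_string.replace('\\n', '')   # drop the two-character \n sequences first
>>> s = s.translate(None, '[]()"\'\\')   # Python 2 str.translate: delete the remaining unwanted characters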

A simple python confusion about format string

New to python and I am learning this tutorial:
http://learnpythonthehardway.org/book/ex8.html
I just cannot see why the line "But it didn't sing." got printed out with double quotes while all the others got printed with single quotes. I cannot see any difference in the code...
The quotes depend on the string: if it contains no quotes, Python will use single quotes:
>>> """no quotes"""
'no quotes'
if there is a single quote, it will use double quotes:
>>> """single quote:'"""
"single quote:'"
if there is a double quote, it will use single quotes:
"""double quote:" """
'double quote:" '
if there are both, it will use single quotes, hence escaping the single one:
>>> """mix quotes:'" """
'mix quotes:\'" '
>>> """mix quotes:"' """
'mix quotes:"\' '
>>> '''mix quotes:"' '''
'mix quotes:"\' '
There won't be a difference though when you print the string:
>>> print '''mix quotes:"' '''
mix quotes:"'
The surrounding quotes are part of the representation of the string:
>>> print str('''mix quotes:"' ''')
mix quotes:"'
>>> print repr('''mix quotes:"' ''')
'mix quotes:"\' '
You might want to check the python tutorial on strings.
The representation of a value should be equivalent to the Python code required to generate it. Since the string "But it didn't sing." contains a single quote, using single quotes to delimit it would create invalid code. Therefore double quotes are used instead.
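You can check that equivalence yourself:
>>> s = "But it didn't sing."
>>> s
"But it didn't sing."
>>> eval(repr(s)) == s
True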
Python has several rules for outputting the repr of strings.
Normally, it uses ' to surround them, except if there is a ' within the string - then it uses " to remove the need for escaping.
If a string contains both ' and " characters, it uses ' and escapes the ' characters.
As there can be several valid and equivalent representations of a string, these rules might change from version to version.
BTW, in the site you linked to the answer is given as well:
Q: Why does %r sometimes print things with single-quotes when I wrote them with double-quotes?
A: Python is going to print the strings in the most efficient way it can, not replicate exactly the way you wrote them. This is perfectly fine since %r is used for debugging and inspection, so it's not necessary that it be pretty.

The last line of this python program uses both " and ' but I don't know why

OK, on this link it shows the last line of output with ' around everything except the third sentence, and I do not know why. This bothered me at the beginning and I thought it was just a weird mistake, but it's in the "extra credit" section, so now I am even more curious.
This is because the %r formatter prints the argument in the form you may use in source code, which, for strings, means that it is quote-delimited and escaped. For boolean values, this is just True or False. To print the string as it is, use %s instead.
>>> print '%s' % '"Hello, you\'re"'
"Hello, you're"
>>> print '%r' % '"Hello, you\'re"'
'"Hello, you\'re"'
Python's repr() function, which is invoked by interpolating the %r formatting directive, has the approximate effect of printing objects the way they would appear in source code.
There are several ways to format strings in Python source, using single or double quotes, with backslash escapes or as raw strings, as simple, single-line strings or multi-line strings (in any combination). Python picks only two ways to format strings: as single- or double-quoted, single-line strings with escapes instead of raw.
Python makes a crude attempt at picking a minimal format, with a slight bias in favor of the single-quote version (since that would be one fewer keystroke on most keyboards).
The rules are very simple. If a string contains a single quote but no double quotes, Python prints the string as it would appear in Python source if it were double quoted. Otherwise it uses single quotes.
Some examples to illustrate. Note for simplicity all of the inputs use triple quotes to avoid backslash escapes.
>>> ''' Hello world '''
' Hello world '
>>> ''' "Hello world," he said. '''
' "Hello world," he said. '
>>> ''' You don't say? '''
" You don't say? "
>>> ''' "Can't we all just get along?" '''
' "Can\'t we all just get along?" '
>>>

Is there a way to make python str.partition ignore case?

I am trying to make Python's str.partition function ignore case during the search, so
>>> partition_tuple = 'Hello moon'.partition('hello')
('', 'Hello', ' moon')
and
>>> partition_tuple = 'hello moon'.partition('hello')
('', 'hello', ' moon')
return as shown above.
Should I be using regular expressions instead?
Thanks,
EDIT:
Pardon, I should have been more specific. I want to find a keyword in a string and change it (by adding stuff around it), then put it back in. My plan was to partition the string, change the middle section, then put it all back together.
Example:
'this is a contrived example'
with keyword 'contrived' would become:
'this is a <<contrived>> example'
and I need it to perform the <<>> even if 'contrived' was spelled with a capital 'C.'
Note that any letter in the word could be capitalized, not just the starting one.
The case needs to be preserved.
Another unique point to this problem is that there can be several keywords. In fact, there can even be a key phrase. That is to say, in the above example, the keywords could have been 'a contrived' and 'contrived' in which case the output would need to look like:
'this is <<a contrived>> example.'
How about
re.split('[Hh]ello', 'Hello moon')
This gives
['', ' moon']
Now you have the pieces and you can put them back together as you please.
And it preserves the case.
[Edit]
You can put multiple keywords in one regex (but read caution below)
re.split(r'[Hh]ello | moon', 'Hello moon')
Caution: re will use the FIRST one that matches and then ignore the rest.
So, putting in multiple keywords is only useful if there is a SINGLE keyword in each target.
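As an aside (my own sketch, not part of this answer): for the wrapping goal described in the question, re.sub with an inline (?i) flag and a backreference handles the case-insensitive match while preserving the original case; the longer key phrase is listed first so it wins over the plain keyword:
>>> import re
>>> re.sub(r'(?i)(a contrived|contrived)', r'<<\1>>', 'this is a Contrived example')
'this is <<a Contrived>> example'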
How about
'Hello moon'.lower().partition('hello')
What is the actual problem you are trying to solve using partition()?
No, partition() is case-sensitive and there is no way around it except by normalizing the primary string.
You can do this if you don't need to preserve the case:
>>> partition_tuple = 'Hello moon'.lower().partition('hello')
>>> partition_tuple
('', 'hello', ' moon')
>>>
However as you can see, this makes the resulting tuple lowercase as well. You cannot make partition case insensitive.
Perhaps more info on the task would help us give a better answer.
For example, is Bastien's answer sufficient, or does case need to be preserved?
If the string has the embedded space you could just use the str.split(sep) function.
But I am guessing you have a more complex task in mind. Please describe it more.
You could also do this by writing your own case_insensitive_partition which could look something like this (barely tested but it did work at least in trivial cases):
def case_partition(text, sep):
    ltext = text.lower()
    lsep = sep.lower()
    ind = ltext.find(lsep)
    if ind == -1:
        # separator not found: mimic str.partition's behaviour
        return (text, '', '')
    seplen = len(lsep)
    return (text[:ind], text[ind:ind+seplen], text[ind+seplen:])
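Hypothetical usage, reproducing the desired output from the question:
>>> case_partition('Hello moon', 'hello')
('', 'Hello', ' moon')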
