re.split() with special cases


I am new to regular expressions and have a problem with re.split.
In my case the split has to respect "special escapes".
The text should be separated at ;, except where the ; is preceded by a ?.
Edit: In that case the two parts shouldn't be split, and the ? has to be removed.
Here is an example and the result I want:
import re
txt = 'abc;vwx?;yz;123'
re.split(r'magical pattern', txt)
['abc', 'vwx;yz', '123']
So far I have tried this:
re.split(r'(?<!\?);', txt)
and got:
['abc', 'vwx?;yz', '123']
Sadly, the ? that is not consumed causes trouble, and the following list comprehension is too slow for my performance-critical code:
[part.replace('?;', ';') for part in re.split(r'(?<!\?);', txt)]
['abc', 'vwx;yz', '123']
Is there a "fast" way to reproduce that behavior with re?
Could the re.findall function be the solution to take?
For example, an extended version of this code:
re.findall(r'[^;]+', txt)
I am using python 2.7.3.
Thanking you in anticipation!

Regex is not the tool for the job. Use the csv module instead:
>>> import csv
>>> txt = 'abc;vwx?;yz;123'
>>> r = csv.reader([txt], delimiter=';', escapechar='?')
>>> next(r)
['abc', 'vwx;yz', '123']
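A slightly more defensive variant of the csv approach, as a sketch. It assumes the text contains no line breaks and that double quotes should be treated as ordinary characters, which is why quoting is switched off. The helper name split_escaped is made up for illustration.

```python
import csv

def split_escaped(txt, delimiter=";", escapechar="?"):
    """Split txt at delimiter; the character after escapechar is taken literally."""
    # QUOTE_NONE is an assumption of this sketch: it stops the reader
    # from giving '"' any special meaning.
    reader = csv.reader(
        [txt],
        delimiter=delimiter,
        escapechar=escapechar,
        quoting=csv.QUOTE_NONE,
    )
    return next(reader)

print(split_escaped("abc;vwx?;yz;123"))
```

Whether this is faster than the list comprehension from the question depends on the data, so it is worth timing on real input.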

You cannot do what you want with one regular expression. Unescaping ?; after splitting is a separate task altogether, not one that you can get the re module to do for you while splitting at the same time.
Just keep the task separate; you could use a generator to do the unescaping for you:
def unescape(iterable):
    for item in iterable:
        yield item.replace('?;', ';')
for elem in unescape(re.split(r'(?<!\?);', txt)):
    print(elem)
but that won't be faster than your list comprehension.

I would do it like this:
re.sub('(?<!\?);',r'|', txt).replace('?;',';').split('|')

Try this :-)
def split( txt, sep, esc, escape_chars):
    ''' Split a string
        txt - string to split
        sep - separator, one character
        esc - escape character
        escape_chars - List of characters allowed to be escaped
    '''
    l = []
    tmp = []
    i = 0
    while i < len(txt):
        if len(txt) > i + 1 and txt[i] == esc and txt[i+1] in escape_chars:
            i += 1
            tmp.append(txt[i])
        elif txt[i] == sep:
            l.append("".join(tmp))
            tmp = []
        elif txt[i] == esc:
            print('Escape Error')
        else:
            tmp.append(txt[i])
        i += 1
    l.append("".join(tmp))
    return l
if __name__ == "__main__":
    txt = 'abc;vwx?;yz;123'
    print(split(txt, ';', '?', [';','\\','?']))
Returns:
['abc', 'vwx;yz', '123']

Related

Is there a wildcard character in python

To be more specific, it's for an "if" condition
I have a list of strings which have 5 spaces then the last character
Is there a character that can replace the last character of every string
Like:
if string == "     &":
    do something
And the condition would be true if & == any type of character

You can access the last character by indexing; -1 is the last one:
lst = ['&', 'A', 'B', 'C']
s = 'some random string which ends on &'
if s[-1] in lst:
    print('hurray!')
#hurray!
Alternatively, you can use .endswith() if there are only a few entries:
s = 'some random string which ends on &'
if s.endswith('&') or s.endswith('A'):
    print('hurray!')
#hurray!
Since you also asked how to replace the last character, this can be done like this:
s = s[:-1] + '!'
#Out[72]: 'some random string which ends on !'
As per your comment, here is a wildcard solution:
import re
s = r'     &'
pattern = r'     .{1}$'
if re.search(pattern, s):
    print('hurray!')
#hurray!

Try this:
if string[-1] == 'A' or string[-1] == '1':
    pass  # do something

You may use a regular expression along with re.search, for example:
import re
vals = ["validA", "valid1", "invalid"]
for val in vals:
    if re.search(r'[A1]$', val):
        print(val + ": MATCH")
This prints:
validA: MATCH
valid1: MATCH

Perhaps you're looking for the .endswith() function? For example:
if "waffles".endswith("s"):
    ...
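To connect the answers above to the original question (exactly five spaces followed by any one character), here is a minimal sketch using re.fullmatch. The names FIVE_SPACES_THEN_ANY and matches_wildcard are made up for illustration, and treating a newline as a valid last character is an assumption.

```python
import re

# Five literal spaces followed by exactly one character of any kind.
# re.DOTALL lets '.' also match a newline (an assumption about what "any" means).
FIVE_SPACES_THEN_ANY = re.compile(r" {5}.", re.DOTALL)

def matches_wildcard(s):
    """Return True if s is five spaces plus one arbitrary character."""
    return FIVE_SPACES_THEN_ANY.fullmatch(s) is not None

print(matches_wildcard("     &"))
print(matches_wildcard("    &"))
```

fullmatch is used instead of search so that longer strings which merely end in this pattern do not count as a match.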

Reverse marked substrings in a string

I have a string in which every marked substring within < and >
has to be reversed (the brackets don't nest). For example,
"hello <wolfrevokcats>, how <t uoy era>oday?"
should become
"hello stackoverflow, how are you today?"
My current idea is to loop over the string and find pairs of indices
where < and > are. Then simply slice the string and put the slices
together again with everything that was in between the markers reversed.
Is this a correct approach? Is there an obvious/better solution?
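The index-based idea described in the question can be sketched as follows. This is only one possible reading of it: the function name reverse_marked_by_index is hypothetical, and leaving an unmatched < untouched is an assumption.

```python
def reverse_marked_by_index(text):
    """Reverse every <...> span by locating marker indices with str.find()."""
    parts = []
    pos = 0
    while True:
        start = text.find("<", pos)
        if start == -1:
            break
        end = text.find(">", start + 1)
        if end == -1:
            # Unmatched '<': keep the remainder as it is (an assumption).
            break
        parts.append(text[pos:start])            # text before the marker
        parts.append(text[start + 1:end][::-1])  # marked text, reversed
        pos = end + 1
    parts.append(text[pos:])                     # trailing text
    return "".join(parts)

print(reverse_marked_by_index("hello <wolfrevokcats>, how <t uoy era>oday?"))
```

Collecting the slices in a list and joining once avoids repeatedly rebuilding the string.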

It's pretty simple with regular expressions. re.sub takes a function as an argument to which the match object is passed.
>>> import re
>>> s = 'hello <wolfrevokcats>, how <t uoy era>oday?'
>>> re.sub('<(.*?)>', lambda m: m.group(1)[::-1], s)
'hello stackoverflow, how are you today?'
Explanation of the regex:
<(.*?)> will match everything between < and > in matching group 1. To ensure that the regex engine will stop at the first > symbol occurrence, the lazy quantifier *? is used.
The function lambda m: m.group(1)[::-1] that is passed to re.sub takes the match object, extracts group 1, and reverses the string. Finally re.sub inserts this return value.

Or, use re.sub() and a replacing function:
>>> import re
>>> s = 'hello <wolfrevokcats>, how <t uoy era>oday?'
>>> re.sub(r"<(.*?)>", lambda match: match.group(1)[::-1], s)
'hello stackoverflow, how are you today?'
where .*? matches any characters any number of times in a non-greedy fashion. The parentheses around it capture the match in a group, which we then refer to in the replacing function as match.group(1). The [::-1] slice notation reverses a string.

I'm going to assume this is a coursework assignment and the use of regular expressions isn't allowed. So I'm going to offer a solution that doesn't use it.
content = "hello <wolfrevokcats>, how <t uoy era>oday?"
insert_pos = -1
result = []
placeholder_count = 0
for pos, ch in enumerate(content):
    if ch == '<':
        insert_pos = pos
    elif ch == '>':
        insert_pos = -1
        placeholder_count += 1
    elif insert_pos >= 0:
        result.insert(insert_pos - (placeholder_count * 2), ch)
    else:
        result.append(ch)
print("".join(result))
The gist of the code is to have just a single pass at the string one character at a time. When outside the brackets, simply append the character at the end of the result string. When inside the brackets, insert the character at the position of the opening bracket (i.e. pre-pend the character).

I agree that regular expressions are the proper tool to solve this problem, and I like the gist of Dmitry B.'s answer. However, I used this question to practice generators and functional programming, and I am posting my solution just to share it.
msg = "<,woN> hello <wolfrevokcats>, how <t uoy era>oday?"
def traverse(s, d=">"):
    for c in s:
        if c in "<>": d = c
        else: yield c, d
def group(tt, dc=None):
    for c, d in tt:
        if d != dc:
            if dc is not None:
                yield dc, l
            l = [c]
            dc = d
        else:
            l.append(c)
    else: yield dc, l
def direct(groups):
    func = lambda d: list if d == ">" else reversed
    fst = lambda t: t[0]
    snd = lambda t: t[1]
    for gr in groups:
        yield func(fst(gr))(snd(gr))
def concat(groups):
    return "".join("".join(gr) for gr in groups)
print(concat(direct(group(traverse(msg)))))
#Now, hello stackoverflow, how are you today?

Here's another one without using regular expressions:
def reverse_marked(str0):
    separators = ['<', '>']
    reverse = 0
    str1 = ['', str0]
    res = ''
    while len(str1) == 2:
        str1 = str1[1].split(separators[reverse], maxsplit=1)
        res = ''.join((res, str1[0][::-1] if reverse else str1[0]))
        reverse = 1 - reverse  # toggle 0 - 1 - 0 ...
    return res
print(reverse_marked('hello <wolfrevokcats>, how <t uoy era>oday?'))
Output:
hello stackoverflow, how are you today?

python string manipulation

I have a string s with nested brackets: s = "AX(p>q)&E((-p)Ur)"
I want to remove all characters between all pairs of brackets and store in a new string like this: new_string = AX&E
I tried doing this:
p = re.compile(r"\(.*?\)", re.DOTALL)
new_string = p.sub("", s)
It gives output: AX&EUr)
Is there any way to correct this, rather than iterating each element in the string?

Another simple option is removing the innermost parentheses at every stage, until there are no more parentheses:
p = re.compile(r"\([^()]*\)")
count = 1
while count:
    s, count = p.subn("", s)
Working example: http://ideone.com/WicDK

You can just use string manipulation without regular expression
>>> s = "AX(p>q)&E(qUr)"
>>> [ i.split("(")[0] for i in s.split(")") ]
['AX', '&E', '']
I leave it to you to join the strings up.

>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> re.compile("""\([^\)]*\)""").sub('', s)
'AX&E'

Yeah, it should be:
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> p = re.compile("\(.*?\)", re.DOTALL)
>>> new_string = p.sub("", s)
>>> new_string
'AX&E'

Nested brackets (or tags, ...) cannot be handled in a general way using regex. See http://www.amazon.de/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&s=gateway&qid=1304230523&sr=8-1-spell for details on why. You would need a real parser.
It's possible to construct a regex which can handle two levels of nesting, but it is already ugly; three levels will be quite long. And you don't want to think about four levels. ;-)
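For this particular task a full parser is not strictly required: a single pass with a nesting-depth counter handles any depth. The following is a minimal sketch, not a general parser. The name strip_bracketed is made up for illustration, and keeping an unmatched closing bracket as a literal character is an assumption.

```python
def strip_bracketed(s, open_ch="(", close_ch=")"):
    """Remove every bracketed span, including nested ones, in one pass."""
    depth = 0   # how many brackets are currently open
    out = []
    for ch in s:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
        elif depth == 0:
            # Only characters outside all brackets are kept.
            out.append(ch)
    return "".join(out)

print(strip_bracketed("AX(p>q)&E((-p)Ur)"))
```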

You can use PyParsing to parse the string:
from pyparsing import nestedExpr
import sys
s = "AX(p>q)&E((-p)Ur)"
expr = nestedExpr('(', ')')
result = expr.parseString('(' + s + ')').asList()[0]
s = ''.join(filter(lambda x: isinstance(x, str), result))
print(s)
Most code is from: How can a recursive regexp be implemented in python?

You could use re.subn():
import re
s = 'AX(p>q)&E((-p)Ur)'
while True:
    s, n = re.subn(r'\([^)(]*\)', '', s)
    if n == 0:
        break
print(s)
Output
AX&E

this is just how you do it:
# strings
# double and single quotes use in Python
"hey there! welcome to CIP"
'hey there! welcome to CIP'
"you'll understand python"
'i said, "python is awesome!"'
'i can\'t live without python'
# use of 'r' before string
print(r"\new code", "\n")
first = "code in"
last = "python"
first + last #concatenation
# slicing of strings
user = "code in python!"
print(user)
print(user[5]) # print an element
print(user[-3]) # print an element from rear end
print(user[2:6]) # slicing the string
print(user[:6])
print(user[2:])
print(len(user)) # length of the string
print(user.upper()) # convert to uppercase
print(user.lstrip())
print(user.rstrip())
print(max(user)) # max alphabet from user string
print(min(user)) # min alphabet from user string
print(user.join(["1", "2", "3", "4"])) # join() needs strings, not ints
input()

What is the pythonic way to remove trailing spaces from a string?

The parameter to the function satisfy these rules:
It does not have any leading whitespace
It might have trailing whitespaces
There might be interleaved whitespaces in the string.
Goal: remove duplicate whitespaces that are interleaved & strip trailing whitespaces.
This is how I am doing it now:
# toks - a priori no leading space
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    a = p.split( toks )
    for i in range(0, len(a)):
        if len(a[i]) == 0:
            del a[i]
    return ' '.join(a)
>>> squeeze( ' Mary Decker is hot ' )
Mary Decker is hot
Can this be improved ? Pythonic enough ?

This is how I would do it:
" ".join(toks.split())
PS. Is there a subliminal message in this question? ;-)

Can't you use rstrip()?
some_string.rstrip()
or strip() for stripping the string from both sides?
In addition: the strip() methods also accept an argument specifying which characters to strip:
string.strip = strip(s, chars=None)
strip(s [,chars]) -> string
Related: if you need to strip whitespaces in-between: split the string, strip the terms and re-join it.
Reading the API helps!

To answer your questions literally:
Yes, it could be improved. The first improvement would be to make it work.
>>> squeeze('x ! y')
'x y' # oops
Problem 1: You are using \W+ (non-word characters) when you should be using \s+ (whitespace characters)
>>> toks = 'x ! y z '
>>> re.split('\W+', toks)
['x', 'y', 'z', '']
>>> re.split('\s+', toks)
['x', '!', 'y', 'z', '']
Problem 2: The loop to delete empty strings works, but only by accident. If you wanted a general-purpose loop to delete empty strings in situ, you would need to work backwards, otherwise your subscript i would get out of whack with the number of elements remaining. It works here because re.split() without a capturing group can produce empty elements only at the start and end. You have defined away the start problem, and the end case doesn't cause a problem because there have been no prior deletions. So you are left with a very ugly loop which could be replaced by two lines:
if a and not a[-1]: # guard against empty list
    del a[-1]
However unless your string is very long and you are worried about speed (in which case you probably shouldn't be using re), you'd probably want to allow for leading whitespace (assertions like "my data doesn't have leading whitespace" are ignored by convention) and just do it in a loop on the fly:
a = [x for x in p.split(toks) if x]
Next step is to avoid building the list a:
return ' '.join(x for x in p.split(toks) if x)
Now you did mention "Pythonic" ... so let's throw out all that re import and compile overhead stuff, and the genexp, and just do this:
return ' '.join(toks.split())
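The point made under Problem 2 about deleting while iterating can be illustrated with a small sketch. The helper name drop_empty_in_place is hypothetical; it walks the indices backwards so that a deletion never shifts an element that has not been visited yet.

```python
def drop_empty_in_place(items):
    """Delete empty strings from the list itself, iterating from the end."""
    for i in range(len(items) - 1, -1, -1):
        if not items[i]:
            del items[i]
    return items

parts = ['', 'a', '', '', 'b', '']
drop_empty_in_place(parts)
print(parts)
```

A forward loop over range(len(items)) on the same input would eventually index past the shortened list, which is why the filtering comprehension is usually the simpler choice.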

Well, I tend not to use the re module if I can do the job reasonably with
the built-in functions and features. For example:
def toks(s):
    return ' '.join([x for x in s.split(' ') if x])
... seems to accomplish the same goal with only the built-in split, join, and a list comprehension to filter out empty elements of the split string.
Is that more "Pythonic?" I think so. However my opinion is hardly authoritative.
This could be done as a lambda expression as well; and I think that would not be Pythonic.
Incidentally this assumes that you want to ONLY squeeze out duplicate spaces and trim leading and trailing spaces. If your intent is to munge all whitespace sequences into single spaces (and trim leading and trailing) then change s.split(' ') to s.split() -- passing no argument, or None, to the split() method is different than passing it a space.
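The difference between s.split(' ') and s.split() described above can be seen in a short sketch. The function names and the sample string are made up for illustration.

```python
def squeeze_spaces_only(s):
    # split(' ') splits on single space characters only; tabs and
    # newlines stay inside the pieces.
    return " ".join(part for part in s.split(" ") if part)

def squeeze_all_whitespace(s):
    # split() with no argument splits on any run of whitespace and
    # drops leading and trailing whitespace.
    return " ".join(s.split())

sample = "Mary  Decker\tis \n hot   "
print(repr(squeeze_spaces_only(sample)))
print(repr(squeeze_all_whitespace(sample)))
```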

To make your code more Pythonic, realize that instead of deleting a[i] when a[i] == '', it is better to keep a[i] when a[i] != ''.
So, instead of
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    a = p.split( toks )
    for i in range(0, len(a)):
        if len(a[i]) == 0:
            del a[i]
    return ' '.join(a)
write
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    a = p.split( toks )
    a = [x for x in a if x]
    return ' '.join(a)
and then
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join([x for x in p.split( toks ) if x])
Then, taking into account that a function can receive a generator as well as a list:
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join((x for x in p.split( toks ) if x))
and that doubling the parentheses isn't obligatory:
def squeeze(toks):
    import re
    p = re.compile(r'\W+')
    return ' '.join(x for x in p.split( toks ) if x)
Additionally, instead of making Python execute the import statement inside squeeze() each time the function is called (which is what it does), it would be better to import re once at module level and pass it as a default argument:
import re
def squeeze(toks,re = re):
    p = re.compile(r'\W+')
    return ' '.join(x for x in p.split( toks ) if x)
and, even better:
import re
def squeeze(toks,p = re.compile(r'\W+')):
    return ' '.join(x for x in p.split( toks ) if x)
Remark: the if x part in the expression is useful only to drop the leading '' and the trailing '' that occur in the list p.split( toks ) when toks begins and ends with whitespace.
But instead of splitting, it is just as good to keep what is desired:
import re
def squeeze(toks,p = re.compile(r'\w+')):
    return ' '.join(p.findall(toks))
All that said, the pattern r'\W+' in your question is wrong for your purpose, as John Machin pointed out.
If you want to compress internal whitespace and remove trailing whitespace, with whitespace taken in its strict sense as the set of characters ' ', '\f', '\n', '\r', '\t', '\v' (see \s in re), you must replace your splitting with this one:
import re
def squeeze(toks,p = re.compile(r'\s+')):
    return ' '.join(x for x in p.split( toks ) if x)
or, keeping the right substrings:
import re
def squeeze(toks,p = re.compile(r'\S+')):
    return ' '.join(p.findall(toks))
which is nothing other than the simpler and faster expression ' '.join(toks.split())
But if in fact you just want to compress internal and remove trailing ' ' and '\t' characters, keeping the newlines untouched, you will use
import re
def squeeze(toks,p = re.compile(r'[^ \t]+')):
    return ' '.join(p.findall(toks))
and that can't be replaced by anything else.

I know this question is old. But why not use regex?
import re
result = ' Mary Decker is hot '
print(f"=={result}==")
result = re.sub(r'\s+$', '', result)
print(f"=={result}==")
result = re.sub(r'^\s+', '', result)
print(f"=={result}==")
result = re.sub(r'\s+', ' ', result)
print(f"=={result}==")
The output is
== Mary Decker is hot ==
== Mary Decker is hot==
==Mary Decker is hot==
==Mary Decker is hot==

Is there a generator version of `string.split()` in Python?

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

It is highly probable that re.finditer uses fairly minimal memory overhead.
import re
def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
Demo:
>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']
edit: I have just confirmed that this takes constant memory in python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).
More general version:
In reply to a comment "I fail to see the connection with str.split", here is a more general version:
import re
def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep=='':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
# alternatively, more verbosely, as the body of a generator function:
#     regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
#     for match in re.finditer(regex, string):
#         fragment = match.group(1)
#         yield fragment
The idea is that ((?!pat).)* 'negates' a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state-machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit end of string)
Demo:
>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>
>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']
>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']
>>> list(splitStr('.......A...b...c....', '\.\.\.'))
['', '', '.A', 'b', 'c', '.']
>>> list(splitStr(' A b c. '))
['', 'A', 'b', 'c.', '']
(One should note that str.split has an ugly behavior: it special-cases having sep=None as first doing str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last example where sep="\s+".)
(I ran into various bugs (including an internal re.error) when trying to implement this... Negative lookbehind will restrict you to fixed-length delimiters so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge-cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''] with an extraneous empty string at the end); one can look at the edit history for another seemingly-correct regex that actually has subtle bugs.)
(If you want to implement this yourself for higher performance (although they are heavyweight, regexes most importantly run in C), you'd write some code (with ctypes? not sure how to get generators working with it?), with the following pseudocode for fixed-length delimiters: Hash your delimiter of length L. Keep a running hash of length L as you scan the string using a running hash algorithm, O(1) update time. Whenever the hash might equal your delimiter, manually check if the past few characters were the delimiter; if so, then yield the substring since the last yield. Special case for beginning and end of string. This would be a generator version of the textbook algorithm to do O(N) text search. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc. etc.)

The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids heavy memory use, and avoids the overhead of a regexp when it's not needed.
[edit 2016-8-2: updated this to optionally support regex separators]
import re
def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()
    :param source:
        source string (unicode or bytes)
    :param sep:
        separator to split on.
    :param regex:
        if True, will treat sep as regular expression.
    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
This can be used like you want...
>>> print(list(isplit("abcb", "b")))
['a', 'c', '']
While there is a little bit of cost seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.

Did some performance testing on the various methods proposed (I won't repeat them here). Some results:
str.split (default) = 0.3461570239996945
manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
re.finditer (ninjagecko's answer) = 0.698872097000276
str.find (one of Eli Collins's answers) = 0.7230395330007013
itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
str.split(..., maxsplit=1) recursion = N/A†
†The recursion answers (string.split with maxsplit = 1) fail to complete in a reasonable time. Given string.split's speed they may work better on shorter strings, but then I can't see the use-case for short strings where memory isn't an issue anyway.
Tested using timeit on:
the_text = "100 " * 9999 + "100"
def test_function( method ):
    def fn( ):
        total = 0
        for x in method( the_text ):
            total += int( x )
        return total
    return fn
This raises another question as to why string.split is so much faster despite its memory usage.

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.
import re
def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()
sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["
assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
EDIT: Corrected handling of surrounding whitespace if no separator chars are given.

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.
I'll just copy the docstring of the main str_split function:
str_split(s, *delims, empty=None)
Split the string s by the rest of the arguments, possibly omitting
empty parts (empty keyword argument is responsible for that).
This is a generator function.
When only one delimiter is supplied, the string is simply split by it.
empty is then True by default.
str_split('[]aaa[][]bb[c', '[]')
-> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
-> 'aaa', 'bb[c'
When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if empty is set to
True, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
str_split('aaa, bb : c;', ' ', ',', ':', ';')
-> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
-> 'aaa', '', 'bb', '', '', 'c', ''
When no delimiters are supplied, string.whitespace is used, so the effect
is the same as str.split(), except this function is a generator.
str_split('aaa\\t bb c \\n')
-> 'aaa', 'bb', 'c'
import string
def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]
def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]
def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]
def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]
def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.
    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'
    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''
    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)
This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both 2 and 3 versions. The first lines of the function should be changed to:
def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

No, but it should be easy enough to write one using itertools.takewhile().
EDIT:
Very simple, half-broken implementation:
import itertools
import string
def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
            r.append(c)
        else:
            if r:
                yield ''.join(r)
                continue
            else:
                # a plain return ends the generator; raising StopIteration
                # here becomes a RuntimeError on Python 3.7+
                return

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you're not going to save any memory by having a generator.
If you wanted to write one it would be fairly easy though:
import string
def gsplit(s,sep=string.whitespace):
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).
import re
def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)
    Multiple character delimiters are not handled.
    """
    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"
    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)
    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)
    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
Here are the tests I used (in both python 3 and python 2):
# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))
# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]
# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]
# Surrounding whitespace dropped
assert helper(" 1 2 3 ") == ["1", "2", "3"]
# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]
# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
python's regex module says that it does "the right thing" for unicode whitespace, but I haven't actually tested it.
Also available as a gist.

If you would also like to be able to read an iterator (as well as return one) try this:
import itertools as it
def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)
Usage
>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']

more_itertools.split_at offers an analog to str.split for iterators.
>>> import more_itertools as mit
>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]
>>> "abcdcba".split("b")
['a', 'cdc', 'a']
more_itertools is a third-party package.

I wanted to show how to use the re.finditer solution to get an iterator over the given delimiters, and then use the pairwise recipe from itertools to walk over (previous, current) match pairs, which gives the actual words as in the original split method.
from more_itertools import pairwise
import re
string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])
note:
I use prev & curr instead of prev & next because overriding next in python is a very bad idea
This is quite efficient

Dumbest method, without regex / itertools:
def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)
        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            # skip the whole separator, not just one character
            text = text[end + len(split):]

Very old question, but here is my humble contribution with an efficient algorithm:
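from typing import Iterable
def str_split(text: str, separator: str) -> Iterable[str]:
    if not separator:
        raise ValueError("empty separator")  # same behavior as str.split
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        # advance past the whole separator, not just one character
        i = j + len(separator)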

def split_generator(f,s):
    """
    f is a string, s is the single character we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            # yield the substring itself, not a one-element list
            yield f[j:i]
            j=i+1
            i=i+1

here is a simple response
def gen_str(some_string, sep):
    j=0
    guard = len(some_string)-1
    for i,s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j=i+1
        elif i!=guard:
            continue
        else:
            yield some_string[j:]

import re
def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')
    if maxsplit == 0 or not text:
        yield text
        return
    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))

Here is an answer that is based on split and maxsplit. This does not use recursion.
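def gsplit(todo):
    chunk = 100
    while todo:
        splits = todo.split(maxsplit=chunk)
        # split(maxsplit=chunk) returns at most chunk + 1 items;
        # the extra last item is the unsplit remainder
        if len(splits) == chunk + 1:
            todo = splits.pop()
        else:
            todo = None
        for item in splits:
            yield item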
