Is there a generator version of `string.split()` in Python? - python

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

It is highly probable that re.finditer uses fairly minimal memory overhead.
import re
def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
Demo:
>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']
edit: I have just confirmed that this takes constant memory in python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was a growth in memory, it was far far less than the 1GB string).
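A measurement of this kind can be repeated on a smaller scale with the standard-library tracemalloc module. The following is only a sketch under assumptions: the whitespace-based pattern, the helper names (peak_bytes, use_list, use_gen) and the roughly 1 MB test string are illustration choices, not the original 1 GB test.

```python
import re
import tracemalloc

def split_iter(string):
    # assumption: split on runs of non-whitespace instead of the word pattern above
    return (m.group(0) for m in re.finditer(r"\S+", string))

def peak_bytes(consume, text):
    """Return the peak number of bytes allocated while consume(text) runs."""
    tracemalloc.start()
    consume(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def use_list(text):
    for word in text.split():       # builds the whole list first
        pass

def use_gen(text):
    for word in split_iter(text):   # one match object at a time
        pass

# the text is created before tracing starts, so it is not counted
text = "word " * 200_000
peak_list = peak_bytes(use_list, text)
peak_gen = peak_bytes(use_gen, text)
print(f"str.split peak: {peak_list} bytes, generator peak: {peak_gen} bytes")
```

The absolute numbers depend on the interpreter; the point of the sketch is only the comparison between the two peaks.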
More general version:
In reply to a comment "I fail to see the connection with str.split", here is a more general version:
def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))
# alternatively, more verbosely (as the body of a generator function):
#     regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
#     for match in re.finditer(regex, string):
#         fragment = match.group(1)
#         yield fragment
The idea is that ((?!pat).)* 'negates' a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state-machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit end of string)
Demo:
>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>
>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']
>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']
>>> list(splitStr('.......A...b...c....', r'\.\.\.'))
['', '', '.A', 'b', 'c', '.']
>>> list(splitStr(' A b c. '))
['', 'A', 'b', 'c.', '']
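To see the ((?!pat).)* idiom from the explanation in isolation, here is a small sketch. The names sep and until_sep are made up for this illustration; the pattern is the same building block used inside splitStr.

```python
import re

sep = r"\.\.\."  # delimiter pattern: three literal dots
# (?:(?!sep).)* consumes one character at a time, but only while the
# delimiter would NOT match at the current position.
until_sep = re.compile(f"(?:(?!{sep}).)*")

m = until_sep.match("ab.c...d")
print(m.group(0))  # ab.c  (a single dot is consumed, the run of three dots is not)

# If the delimiter starts right away, nothing is consumed at all.
empty = until_sep.match("...x").group(0)
print(repr(empty))  # ''
```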
(One should note that str.split has an ugly behavior: it special-cases having sep=None as first doing str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last example where sep="\s+".)
(I ran into various bugs (including an internal re.error) when trying to implement this. Negative lookbehind would restrict you to fixed-length delimiters, so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''] with an extraneous empty string at the end); one can look at the edit history for another seemingly-correct regex that actually has subtle bugs.)
(If you want to implement this yourself for higher performance (although they are heavyweight, regexes most importantly run in C), you'd write some code (with ctypes? not sure how to get generators working with it?), with the following pseudocode for fixed-length delimiters: Hash your delimiter of length L. Keep a running hash of length L as you scan the string using a rolling hash algorithm, with O(1) update time. Whenever the hash might equal your delimiter, manually check if the past few characters were the delimiter; if so, then yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm to do O(N) text search. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc.)
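The rolling-hash pseudocode above can be sketched in pure Python as follows. This is a Rabin-Karp style illustration only: the function name rolling_hash_split and the base and mod constants are assumptions chosen for the example, and being pure Python it is not expected to beat the C implementations.

```python
def rolling_hash_split(text, sep, base=257, mod=(1 << 61) - 1):
    """Yield the pieces of text separated by the fixed string sep."""
    L = len(sep)
    if L == 0:
        raise ValueError("empty separator")
    n = len(text)
    if n < L:
        yield text
        return
    # hash of the delimiter and of the first window text[0:L]
    target = 0
    window = 0
    for k in range(L):
        target = (target * base + ord(sep[k])) % mod
        window = (window * base + ord(text[k])) % mod
    high = pow(base, L - 1, mod)  # weight of the character leaving the window

    start = 0  # start of the piece currently being accumulated
    i = 0      # the window covers text[i:i+L]
    while True:
        # hash hit: verify manually, and ignore windows overlapping the last delimiter
        if window == target and i >= start and text[i:i + L] == sep:
            yield text[start:i]
            start = i + L
        if i + L >= n:
            break
        # slide the window one character to the right in O(1)
        window = ((window - ord(text[i]) * high) * base + ord(text[i + L])) % mod
        i += 1
    yield text[start:]

print(list(rolling_hash_split("a,b,,c", ",")))
```

On these inputs the pieces agree with str.split for the same separator, including the empty strings at the edges.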

The most efficient way I can think of is to write one using the start (offset) parameter of the str.find() method. This avoids lots of memory use, and avoids relying on the overhead of a regexp when it's not needed.
[edit 2016-8-2: updated this to optionally support regex separators]
import re
def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()
    :param source:
        source string (unicode or bytes)
    :param sep:
        separator to split on.
    :param regex:
        if True, will treat sep as regular expression.
    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
This can be used like you want...
>>> print(list(isplit("abcb", "b")))
['a', 'c', '']
While there is a little bit of cost seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.

Did some performance testing on the various methods proposed (I won't repeat them here). Some results:
str.split (default) = 0.3461570239996945
manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
re.finditer (ninjagecko's answer) = 0.698872097000276
str.find (one of Eli Collins's answers) = 0.7230395330007013
itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
str.split(..., maxsplit=1) recursion = N/A†
†The recursion answers (str.split with maxsplit=1) fail to complete in a reasonable time. Given str.split's speed they may work better on shorter strings, but then I can't see the use case for short strings, where memory isn't an issue anyway.
Tested using timeit on:
the_text = "100 " * 9999 + "100"
def test_function( method ):
    def fn( ):
        total = 0
        for x in method( the_text ):
            total += int( x )
        return total
    return fn
This raises another question as to why string.split is so much faster despite its memory usage.

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.
import re
def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()
sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["
assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
EDIT: Corrected handling of surrounding whitespace if no separator chars are given.

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.
I'll just copy the docstring of the main str_split function:
str_split(s, *delims, empty=None)
Split the string s by the rest of the arguments, possibly omitting
empty parts (empty keyword argument is responsible for that).
This is a generator function.
When only one delimiter is supplied, the string is simply split by it.
empty is then True by default.
str_split('[]aaa[][]bb[c', '[]')
-> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
-> 'aaa', 'bb[c'
When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if empty is set to
True, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
str_split('aaa, bb : c;', ' ', ',', ':', ';')
-> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
-> 'aaa', '', 'bb', '', '', 'c', ''
When no delimiters are supplied, string.whitespace is used, so the effect
is the same as str.split(), except this function is a generator.
str_split('aaa\\t bb c \\n')
-> 'aaa', 'bb', 'c'
import string
def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]
def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]
def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]
def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]
def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.
    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'
    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''
    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    # validate before joining: after ''.join() every element is a single
    # character, so the length check could never fail
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)
This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both 2 and 3 versions. The first lines of the function should be changed to:
def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

No, but it should be easy enough to write one using itertools.takewhile().
EDIT:
Very simple, half-broken implementation:
import itertools
import string
def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
            r.append(c)
        else:
            if r:
                yield ''.join(r)
                continue
            else:
                # PEP 479: raising StopIteration inside a generator is a
                # RuntimeError in Python 3.7+, so end with a plain return
                return

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you're not going to save any memory by having a generator.
If you wanted to write one it would be fairly easy though:
import string
def gsplit(s,sep=string.whitespace):
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)

I wrote a version of @ninjagecko's answer that behaves more like str.split (i.e. whitespace-delimited by default, and you can specify a delimiter).
import re
def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)
    Multiple character delimiters are not handled.
    """
    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"
    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)
    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)
    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
Here are the tests I used (in both python 3 and python 2):
# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))
# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]
# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]
# Surrounding whitespace dropped
assert helper(" 1 2 3 ") == ["1", "2", "3"]
# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]
# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
Python's re module documentation says that it does "the right thing" for Unicode whitespace, but I haven't actually tested it.
Also available as a gist.

If you would also like to be able to read an iterator (as well as return one) try this:
import itertools as it
def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)
Usage
>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']

more_itertools.split_at offers an analog to str.split for iterators.
>>> import more_itertools as mit
>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]
>>> "abcdcba".split("b")
['a', 'cdc', 'a']
more_itertools is a third-party package.

I wanted to show how to use the re.finditer solution to return a generator over the given delimiters, and then use the pairwise recipe from itertools (imported here from more_itertools) to build a previous/next iteration that recovers the actual words, as in the original split method.
from more_itertools import pairwise
import re
string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])
note:
I use prev & curr instead of prev & next because overriding next in python is a very bad idea
This is quite efficient

Dumbest method, without regex / itertools:
def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)
        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            # skip the whole separator, not just one character
            text = text[end + len(split):]

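Very old question, but here is my humble contribution with an efficient algorithm:
from typing import Iterable
def str_split(text: str, separator: str) -> Iterable[str]:
    if not separator:
        raise ValueError("empty separator")
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        # advance past the whole separator so multi-character separators work
        i = j + len(separator)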

def split_generator(f,s):
    """
    f is a string, s is the single character we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i=0
    j=0
    while j<len(f):
        if i>=len(f):
            yield f[j:]
            j=i
        elif f[i] != s:
            i=i+1
        else:
            # yield the substring itself, not a one-element list
            yield f[j:i]
            j=i+1
            i=i+1

Here is a simple response:
def gen_str(some_string, sep):
    j=0
    guard = len(some_string)-1
    for i,s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j=i+1
        elif i!=guard:
            continue
        else:
            yield some_string[j:]

import re
def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')
    if maxsplit == 0 or not text:
        yield text
        return
    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    # note: re.split() builds the complete list before anything is yielded,
    # so this mimics the str.split() interface but is not lazy
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))

Here is an answer that is based on split and maxsplit. This does not use recursion.
def gsplit(todo):
    chunk = 100
    while todo:
        splits = todo.split(maxsplit=chunk)
        # split(maxsplit=chunk) returns at most chunk + 1 items; when it does,
        # the last item is the still-unsplit remainder
        if len(splits) > chunk:
            todo = splits.pop()
        else:
            todo = None
        for item in splits:
            yield item

Related

Why is this dictionary lookup method of processing a string into words slower than .split()?

When I wrote my first func to split a string up into words based on a list of separator characters, I thought I was being sloppy/lazy, since this would seemingly iterate over the whole string again for each character in the separator list, not to mention joining the list back into a string:
def splitText(text, list):
    for str in list:
        text = " ".join(text.split(str))
    arr = text.split(" ")
    cleaned = []
    for word in arr:
        if word != "":
            cleaned.append(word)
    return cleaned
So I wrote another version using a dictionary lookup as it iterated through the string a single time, expecting far better performance since a single O(1) lookup would be performed for each character in the string rather than the many splits and joins going on in the above func:
def splitTextFast(text, dict):
    arr = []
    progressIndex = 0
    for i, char in enumerate(text):
        if char in dict:
            arr.append(text[progressIndex:i])
            progressIndex = i + 1
    return arr
But to my surprise, splitTextFast was significantly slower. Why? Or, what mistake can I fix to create the most optimal version of this function? I just started using Python today, so maybe I'm missing something obvious.
usage:
splitCharsArr = [",", ".", ";", "'", " "]
splitCharsDict = {",": True, ".": True, ";": True, "'": True, " ": True }
mockData = "one. two; three four.five'six ;seven,eight,nine"
wordList1 = splitText(mockData, splitCharsArr)
wordList2 = splitTextFast(mockData, splitCharsDict)

How do I reverse words in a string with Python

I am trying to reverse the words of a string, but am having difficulty. Any assistance will be appreciated:
S = " what is my name"
def reversStr(S):
    for x in range(len(S)):
        return S[::-1]
        break
What I get now is: eman ym si tahw
However, I am trying to get: tahw si ym eman (individual words reversed)
def reverseStr(s):
    return ' '.join([x[::-1] for x in s.split(' ')])
orig = "what is my name"
reverse = ""
for word in orig.split():
    reverse = "{} {}".format(reverse, word[::-1])
print(reverse)
Since everyone else's covered the case where the punctuation moves, I'll cover the one where you don't want the punctuation to move.
import re
def reverse_words(sentence):
    return re.sub(r'[a-zA-Z]+', lambda x : x.group()[::-1], sentence)
Breaking this down.
re is Python's regex module, and re.sub is the function in that module that handles substitutions. It has three required parameters.
The first is the regex you're matching by. In this case, I'm using r'[a-zA-Z]+'. The r denotes a raw string, [a-zA-Z] matches all letters, and + means "at least one".
The second is either a string to substitute in, or a function that takes in a match object and outputs a string. I'm using a lambda (or nameless) function that simply outputs the matched string, reversed.
The third is the string you want to do the find and replace in.
So "What is my name?" -> "tahW si ym eman?"
Addendum:
I considered a regex of r'\w+' initially, because of better Unicode support (if the right flags are given), but \w also includes numbers and underscores. Matching - might also be desired behavior: the regexes would be r'[a-zA-Z-]+' (note trailing hyphen) and r'[\w-]+', but then you'd probably want to not match double dashes (i.e. --), so more regex modifications might be needed.
The built-in reversed outputs a reversed object, which you have to convert back to a string, so I generally prefer the [::-1] option.
In-place refers to modifying the object without creating a copy. Yes, as many of us have already pointed out, Python strings are immutable, so technically we cannot reverse a Python str object in place. However, if you use a mutable datatype, say bytearray, for storing the string characters, you can actually reverse it in place.
# slicing creates a copy; implies not-in-place reversing
def rev(x):
    return x[-1::-1]
# in-place reversing, if input is bytearray datatype
def rev_inplace(x: bytearray):
    i = 0; j = len(x)-1
    while i<j:
        t = x[i]
        x[i] = x[j]
        x[j] = t
        i += 1; j -= 1
    return x
Input:
x = bytearray(b'some string to reverse')
rev_inplace(x)
Output:
bytearray(b'esrever ot gnirts emos')
Try splitting each word in the string into a list (see: https://docs.python.org/2/library/stdtypes.html#str.split).
Example:
>>> string = "This will be split up"
>>> string_list = string.split(" ")
>>> string_list
['This', 'will', 'be', 'split', 'up']
Then iterate through the list and reverse each constituent list item (i.e. word) which you have working already.
def reverse_in_place(phrase):
    res = []
    phrase = phrase.split(" ")
    for word in phrase:
        word = word[::-1]
        res.append(word)
    res = " ".join(res)
    return res
[thread has been closed, but IMO, not well answered]
Python's str type doesn't include an in-place reverse() method.
So use the built-in reversed() function to accomplish the same thing.
>>> S = " what is my name"
>>> ("").join(reversed(S))
'eman ym si tahw '
There is no obvious way of reversing a string "truly" in-place with Python. However, you can do something like:
def reverse_string_inplace(string):
    w = len(string)-1
    p = w
    while True:
        q = string[p]
        string = ' ' + string + q
        w -= 1
        if w < 0:
            break
    return string[(p+1)*2:]
Hope this makes sense.
In Python, strings are immutable. This means you cannot change the string once you have created it. So in-place reverse is not possible.
There are many ways to reverse the string in python, but memory allocation is required for that reversed string.
print(' '.join(word[::-1] for word in string.split()))
s1 = input("Enter a string with multiple words:")
print(f'Original:{s1}')
print(f'Reverse is:{s1[::-1]}')
each_word_new_list = []
s1_split = s1.split()
for i in range(0,len(s1_split)):
    each_word_new_list.append(s1_split[i][::-1])
print(f'New Reverse as List:{each_word_new_list}')
each_word_new_string=' '.join(each_word_new_list)
print(f'New Reverse as String:{each_word_new_string}')
If the sentence contains multiple spaces then usage of split() function will cause trouble because you won't know then how many spaces you need to rejoin after you reverse each word in the sentence. Below snippet might help:
# Sentence having multiple spaces
given_str = "I know this country runs by mafia "
tmp = ""
tmp_list = []
for i in given_str:
    if i != ' ':
        tmp = tmp + i
    else:
        if tmp == "":
            tmp_list.append(i)
        else:
            tmp_list.append(tmp)
            tmp_list.append(i)
            tmp = ""
print(tmp_list)
rev_list = []
for x in tmp_list:
    rev = x[::-1]
    rev_list.append(rev)
print(rev_list)
print(''.join(rev_list))
output:
def rev(a):
    if a == "":
        return ""
    else:
        z = rev(a[1:]) + a[0]
        return z
Reverse string --> gnirts esreveR
def rev_words(k):
    # renamed so it does not shadow (and endlessly recurse into) rev() above
    y = rev(k).split()
    for i in range(len(y)-1,-1,-1):
        print(y[i], end=' ')
-->esreveR gnirts

Removing non numeric characters from a string

I have been given the task of removing all non-numeric characters, including spaces, from either a text file or a string, and then printing the new result next to the old characters, for example:
Before:
sd67637 8
After:
676378
As I am a beginner, I do not know where to start with this task. Please help.
The easiest way is with a regexp
import re
a = 'lkdfhisoe78347834 (())&/&745 '
result = re.sub('[^0-9]','', a)
print(result)
# 78347834745
Loop over your string, char by char and only include digits:
new_string = ''.join(ch for ch in your_string if ch.isdigit())
Or use a regex on your string (if at some point you wanted to treat non-contiguous groups separately)...
import re
s = 'sd67637 8'
new_string = ''.join(re.findall(r'\d+', s))
# 676378
Then just print them out:
print(old_string, '=', new_string)
There is a builtin for this. (This is the Python 2 API; in Python 3, string.translate is gone and str.translate takes only a table.)
string.translate(s, table[, deletechars])
    Delete all characters from s
    that are in deletechars (if present), and then translate the
    characters using table, which must be a 256-character string giving
    the translation for each character value, indexed by its ordinal. If
    table is None, then only the character deletion step is performed.
>>> import string
>>> non_numeric_chars = ''.join(set(string.printable) - set(string.digits))
>>> non_numeric_chars = string.printable[10:] # more effective method. (choose one)
>>> 'sd67637 8'.translate(None, non_numeric_chars)
'676378'
Or you could do it with no imports (but there is no reason for this):
>>> chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>> 'sd67637 8'.translate(None, chars)
'676378'
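The translate calls above use the Python 2 form. In Python 3 the same deletion can be sketched with str.maketrans, whose third argument lists the characters to delete. The variable names here are illustrative, and as in the original only ASCII printable characters are removed.

```python
import string

# every printable ASCII character that is not a digit
non_digits = "".join(set(string.printable) - set(string.digits))
# maketrans("", "", chars) builds a table that maps each of chars to None (delete)
delete_table = str.maketrans("", "", non_digits)

cleaned = "sd67637 8".translate(delete_table)
print(cleaned)  # 676378
```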
I would not use RegEx for this. It is a lot slower!
Instead let's just use a simple for loop.
TLDR;
This function will get the job done fast...
def filter_non_digits(string: str) -> str:
    result = ''
    for char in string:
        if char in '1234567890':
            result += char
    return result
The Explanation
Let's create a very basic benchmark to test a few different methods that have been proposed. I will test three methods...
For loop method (my idea).
List Comprehension method from Jon Clements' answer.
RegEx method from Moradnejad's answer.
# filters.py
import re
# For loop method
def filter_non_digits_for(string: str) -> str:
    result = ''
    for char in string:
        if char in '1234567890':
            result += char
    return result
# Comprehension method
def filter_non_digits_comp(s: str) -> str:
    return ''.join(ch for ch in s if ch.isdigit())
# RegEx method
def filter_non_digits_re(string: str) -> str:
    return re.sub(r'[^\d]','', string)
Now that we have an implementation of each way of removing non-digits, let's benchmark each one.
Here is some very basic and rudimentary benchmark code. However, it will do the trick and give us a good comparison of how each method performs.
# tests.py
import time, platform
# parentheses are required for an import list that spans several lines
from filters import (filter_non_digits_re,
                     filter_non_digits_comp,
                     filter_non_digits_for)
def benchmark_func(func):
    start = time.time()
    # the "_" in the number just makes it more readable
    for i in range(100_000):
        func('afes098u98sfe')
    end = time.time()
    return (end-start)/100_000
def bench_all():
    print(f'# System ({platform.system()} {platform.machine()})')
    print(f'# Python {platform.python_version()}\n')
    tests = [
        filter_non_digits_re,
        filter_non_digits_comp,
        filter_non_digits_for,
    ]
    for t in tests:
        duration = benchmark_func(t)
        ns = round(duration * 1_000_000_000)
        print(f'{t.__name__.ljust(30)} {str(ns).rjust(6)} ns/op')
if __name__ == "__main__":
    bench_all()
Here is the output from the benchmark code.
# System (Windows AMD64)
# Python 3.9.8
filter_non_digits_re 2920 ns/op
filter_non_digits_comp 1280 ns/op
filter_non_digits_for 660 ns/op
As you can see, the filter_non_digits_for() function is more than four times faster than using RegEx, and about twice as fast as the comprehension method. Sometimes simple is best.
You can use string.ascii_letters to identify your non-digits:
from string import ascii_letters
a = 'sd67637 8'
a = a.replace(' ', '')
for i in ascii_letters:
    a = a.replace(i, '')
In case you also want to remove an apostrophe, wrap it in double quotes " instead of single quotes '.
To extract integers
Example: sd67637 8 ==> 676378
import re
def extract_int(x):
    return re.sub(r'[^\d]','', x)
To extract a single float/int number (possible decimal separator)
Example: sd7512.sd23 ==> 7512.23
import re
def extract_single_float(x):
    # inside a character class "|" is a literal pipe, so it is left out here
    return re.sub(r'[^\d.]','', x)
To extract multiple float/int numbers (returned as strings)
Example: 123.2 xs12.28 4 ==> ['123.2', '12.28', '4']
import re
def extract_floats(x):
    # the fractional part is optional so that bare integers such as "4" match too
    return re.findall(r"\d+(?:\.\d+)?", x)
Adding to @Moradnejad's answer: you can use the following code to extract integer values, floats and even signed values.
a = re.findall(r"[-+]?\d*\.\d+|\d+", "Over th44e same pe14.1riod of time, p-0.8rices also rose by 82.8p")
And then you can convert the list items to numeric data type effectively using map.
print(list(map(float, a)))
[44.0, 14.1, -0.8, 82.8]
import re
result = re.sub(r'\D','','sd67637 8')
# result == '676378'
Convert all numeric strings with or without unit abbreviations. You must indicate that the source string is a decimal comma notation by parameter dec=',' Converting to floats as well as integer is possible. Default conversion is float, but set the parameter toInt=True and the result is an integer. Automatic recognition of unit abbreviations that can be edited in the md dictionary. The key is the unit abbreviation and the value is the multiplier. In this way, the applications of this function are endless. The result is always a number you can calculate with. This all in one function is not the fastest method, but you don't have to worry anymore and it always returns a reliable result.
import re
'''
units: gr=grams, K=thousands, M=millions, B=billions, ms=milli-seconds, mt=metric-tonnes
'''
md = {'gr': 0.001, '%': 0.01, 'K': 1000, 'M': 1000000, 'B': 1000000000, 'ms': 0.001, 'mt': 1000}
seps = {'.': True, ',': False}
kl = list(md.keys())
def to_Float_or_Int(strVal, toInt=None, dec=None):
    toInt = False if toInt is None else toInt
    dec = '.' if dec is None else dec
    def chck_char_in_string(strVal):
        rs = None
        for el in kl:
            if el in strVal:
                rs = el
                break
        return rs
    if dec in seps.keys():
        dcp = seps[dec]
        strVal = strVal.strip()
        mpk = chck_char_in_string(strVal)
        mp = 1 if mpk is None else md[mpk]
        strVal = re.sub(r'[^\de.,-]+', '', strVal)
        if dcp:
            strVal = strVal.replace(',', '')
        else:
            strVal = strVal.replace('.', '')
            strVal = strVal.replace(',', '.')
        dcnm = float(strVal)
        dcnm = dcnm * mp
        dcnm = int(round(dcnm)) if toInt else dcnm
    else:
        print('wrong decimal separator')
        dcnm = None
    return dcnm
Call the function as follows:
pvals = ['-123,456', '-45,145.01 K', '753,159.456', '1,000,000', '985 ms' , '888 745.23', '1.753 e-04']
cvals = ['-123,456', '1,354852M', '+10.000,12 gr', '-87,24%', '10,2K', '985 ms', '(mt) 0,475', ' ,159']
print('decimal point strings')
for val in pvals:
    result = to_Float_or_Int(val)
    print(result)
print()
print('decimal comma strings')
for val in cvals:
    result = to_Float_or_Int(val, dec=',')
    print(result)
exit()
The output results:
decimal point strings
-123456.0
-45145010.0
753159.456
1000000.0
0.985
888745.23
0.0001753
decimal comma strings
-123.456
1354852.0
10.00012
-0.8724
10200.0
0.985
475.0
0.159

re.split() with special cases

I am new to regular expressions and have a problem with the re.split functionality.
In my case the split has to take care of "special escapes".
The text should be separated at ;, except where there is a leading ?.
Edit: In that case the two parts shouldn't be split and the ? has to be removed.
Here an example and the result I wish:
import re
txt = 'abc;vwx?;yz;123'
re.split(r'magical pattern', txt)
['abc', 'vwx;yz', '123']
I tried so far these attempt:
re.split(r'(?<!\?);', txt)
and got:
['abc', 'vwx?;yz', '123']
Sadly, the unconsumed ? causes trouble, and the following list comprehension is too performance-critical:
[part.replace('?;', ';') for part in re.split(r'(?<!\?);', txt)]
['abc', 'vwx;yz', '123']
Is there a "fast" way to reproduce that behavior with re?
Could the re.findall function be the solution?
For example, an extended version of this code:
re.findall(r'[^;]+', txt)
I am using python 2.7.3.
Thanking you in anticipation!
Regex is not the tool for the job. Use the csv module instead:
>>> import csv
>>> txt = 'abc;vwx?;yz;123'
>>> r = csv.reader([txt], delimiter=';', escapechar='?')
>>> next(r)
['abc', 'vwx;yz', '123']
You cannot do what you want with one regular expression. Unescaping ?; after splitting is a separate task altogether, not one that you can get the re module to do for you while splitting at the same time.
Just keep the task separate; you could use a generator to do the unescaping for you:
def unescape(iterable):
    for item in iterable:
        yield item.replace('?;', ';')
for elem in unescape(re.split(r'(?<!\?);', txt)):
    print(elem)
but that won't be faster than your list comprehension.
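If you want to check the speed claim on your own data, here is a minimal timing sketch using timeit. The function names with_list_comprehension and with_generator are made up for this comparison, and the numbers will vary by machine and Python version, so no timings are quoted here:

```python
import re
import timeit

txt = 'abc;vwx?;yz;123'
SPLIT_RE = re.compile(r'(?<!\?);')  # split on ; unless preceded by ?


def with_list_comprehension(text):
    return [part.replace('?;', ';') for part in SPLIT_RE.split(text)]


def unescape(iterable):
    for item in iterable:
        yield item.replace('?;', ';')


def with_generator(text):
    return list(unescape(SPLIT_RE.split(text)))


if __name__ == "__main__":
    for func in (with_list_comprehension, with_generator):
        seconds = timeit.timeit(lambda: func(txt), number=100000)
        print('{0}: {1:.3f}s'.format(func.__name__, seconds))
```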
I would do it like this:
re.sub(r'(?<!\?);', '|', txt).replace('?;', ';').split('|')
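One caveat with the placeholder trick: it breaks if the text already contains the placeholder character. A hedged sketch that guards against this, assuming a NUL character never appears in your data (the function name split_with_placeholder is hypothetical):

```python
import re


def split_with_placeholder(txt, placeholder='\x00'):
    """Split on unescaped ; by first swapping it for a placeholder."""
    if placeholder in txt:
        # Assumption: the caller picks a placeholder absent from the data.
        raise ValueError('placeholder occurs in the input text')
    marked = re.sub(r'(?<!\?);', placeholder, txt)
    return marked.replace('?;', ';').split(placeholder)


print(split_with_placeholder('abc;vwx?;yz;123'))  # ['abc', 'vwx;yz', '123']
print(split_with_placeholder('a|b;c'))            # ['a|b', 'c']
```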
Try this :-)
def split(txt, sep, esc, escape_chars):
    ''' Split a string
    txt - string to split
    sep - separator, one character
    esc - escape character
    escape_chars - List of characters allowed to be escaped
    '''
    l = []
    tmp = []
    i = 0
    while i < len(txt):
        if len(txt) > i + 1 and txt[i] == esc and txt[i+1] in escape_chars:
            # skip the escape character and keep the escaped one
            i += 1
            tmp.append(txt[i])
        elif txt[i] == sep:
            l.append("".join(tmp))
            tmp = []
        elif txt[i] == esc:
            print('Escape Error')
        else:
            tmp.append(txt[i])
        i += 1
    l.append("".join(tmp))
    return l
if __name__ == "__main__":
    txt = 'abc;vwx?;yz;123'
    print(split(txt, ';', '?', [';','\\','?']))
Returns:
['abc', 'vwx;yz', '123']
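The same character-by-character idea can also be written as a generator, so fields are produced one at a time instead of being collected in a list. This is only a sketch: iter_split is a hypothetical name, and unlike the function above it lets any character be escaped rather than checking a list of allowed ones.

```python
def iter_split(txt, sep=';', esc='?'):
    """Yield the fields of txt lazily, honouring a one-character escape."""
    current = []
    chars = iter(txt)
    for ch in chars:
        if ch == esc:
            # Take the next character literally; a trailing escape
            # character with nothing after it is kept as-is.
            current.append(next(chars, esc))
        elif ch == sep:
            yield ''.join(current)
            current = []
        else:
            current.append(ch)
    yield ''.join(current)


print(list(iter_split('abc;vwx?;yz;123')))  # ['abc', 'vwx;yz', '123']
```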

How can I simplify this conversion from underscore to camelcase in Python?

I have written the function below, which converts underscore-separated names to camelCase with the first word in lowercase, e.g. "get_this_value" -> "getThisValue". I also have a requirement to preserve leading and trailing underscores, as well as double (triple, etc.) underscores, if any, e.g.
"_get__this_value_" -> "_get_ThisValue_".
The code:
def underscore_to_camelcase(value):
    output = ""
    first_word_passed = False
    for word in value.split("_"):
        if not word:
            output += "_"
            continue
        if first_word_passed:
            output += word.capitalize()
        else:
            output += word.lower()
        first_word_passed = True
    return output
The code above works as expected, but it feels non-Pythonic to me, so I am looking for a way to simplify it, for example with list comprehensions.
This one works except for leaving the first word as lowercase.
def convert(word):
    return ''.join(x.capitalize() or '_' for x in word.split('_'))
(I know this isn't exactly what you asked for, and this thread is quite old, but since it's quite prominent when searching for such conversions on Google I thought I'd add my solution in case it helps anyone else).
Your code is fine. The problem I think you're trying to solve is that if first_word_passed looks a little bit ugly.
One option for fixing this is a generator. We can easily make this return one thing for first entry and another for all subsequent entries. As Python has first-class functions we can get the generator to return the function we want to use to process each word.
We then just need to use the conditional operator so we can handle the blank entries returned by double underscores within a list comprehension.
So if we have a word we call the generator to get the function to use to set the case, and if we don't we just use _ leaving the generator untouched.
def underscore_to_camelcase(value):
    def camelcase():
        yield str.lower
        while True:
            yield str.capitalize
    c = camelcase()
    # next(c) works on Python 2.6+ and 3.x; c.next() is Python 2 only
    return "".join(next(c)(x) if x else '_' for x in value.split("_"))
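The same "one function for the first word, another for the rest" idea can be sketched without writing a generator function, using itertools.chain and itertools.repeat. This is a variation on the approach above, not the answerer's code; the name casers is made up here.

```python
from itertools import chain, repeat


def underscore_to_camelcase(value):
    # str.lower for the first non-empty word, str.capitalize afterwards
    casers = chain([str.lower], repeat(str.capitalize))
    return "".join(
        next(casers)(word) if word else "_"  # empty chunks become '_'
        for word in value.split("_")
    )


print(underscore_to_camelcase("get_this_value"))     # getThisValue
print(underscore_to_camelcase("_get__this_value_"))  # _get_ThisValue_
```

As before, next(casers) is only called for non-empty chunks, so empty chunks from leading or doubled underscores do not use up the lowercase slot.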
I prefer a regular expression, personally. Here's one that is doing the trick for me:
import re
def to_camelcase(s):
    return re.sub(r'(?!^)_([a-zA-Z])', lambda m: m.group(1).upper(), s)
Using unutbu's tests:
tests = [('get__this_value', 'get_ThisValue'),
         ('_get__this_value', '_get_ThisValue'),
         ('_get__this_value_', '_get_ThisValue_'),
         ('get_this_value', 'getThisValue'),
         ('get__this__value', 'get_This_Value')]
for test, expected in tests:
    assert to_camelcase(test) == expected
Here's a simpler one. Might not be perfect for all situations, but it meets my requirements, since I'm just converting python variables, which have a specific format, to camel-case. This does capitalize all but the first word.
def underscore_to_camelcase(text):
    """
    Converts underscore_delimited_text to camelCase.
    Useful for JSON output
    """
    return ''.join(word.title() if i else word for i, word in enumerate(text.split('_')))
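To make "might not be perfect for all situations" concrete, here is a small sketch comparing str.title() with str.capitalize() in this one-liner. The two function names are made up for the comparison. title() uppercases every letter that follows a non-letter, which matters when a word contains digits, and both variants drop underscores instead of preserving them.

```python
def camel_with_title(text):
    return ''.join(word.title() if i else word
                   for i, word in enumerate(text.split('_')))


def camel_with_capitalize(text):
    return ''.join(word.capitalize() if i else word
                   for i, word in enumerate(text.split('_')))


print(camel_with_title('get_this_value'))       # getThisValue
print(camel_with_title('parse_utf8data'))       # parseUtf8Data
print(camel_with_capitalize('parse_utf8data'))  # parseUtf8data
print(camel_with_title('_get__this'))           # GetThis (underscores lost)
```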
I think the code is fine. You've got a fairly complex specification, so if you insist on squashing it into the Procrustean bed of a list comprehension, then you're likely to harm the clarity of the code.
The only changes I'd make would be:
To use the join method to build the result in O(n) space and time, rather than repeated applications of += which is O(n²).
To add a docstring.
Like this:
def underscore_to_camelcase(s):
    """Take the underscore-separated string s and return a camelCase
    equivalent. Initial and final underscores are preserved, and medial
    pairs of underscores are turned into a single underscore."""
    def camelcase_words(words):
        first_word_passed = False
        for word in words:
            if not word:
                yield "_"
                continue
            if first_word_passed:
                yield word.capitalize()
            else:
                yield word.lower()
            first_word_passed = True
    return ''.join(camelcase_words(s.split('_')))
Depending on the application, another change I would consider making would be to memoize the function. I presume you're automatically translating source code in some way, and you expect the same names to occur many times. So you might as well store the conversion instead of re-computing it each time. An easy way to do that would be to use the @memoized decorator from the Python Decorator Library.
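The memoizing decorator mentioned above is a recipe, not part of the standard library. On Python 3.2+ a similar effect can be sketched with functools.lru_cache. Note that this is a swapped-in technique, not the decorator this answer refers to:

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache; assumes a limited set of names
def underscore_to_camelcase(s):
    """Return a camelCase equivalent of the underscore-separated string s."""
    def camelcase_words(words):
        first_word_passed = False
        for word in words:
            if not word:
                yield "_"
                continue
            if first_word_passed:
                yield word.capitalize()
            else:
                yield word.lower()
            first_word_passed = True
    return ''.join(camelcase_words(s.split('_')))


print(underscore_to_camelcase('_get__this_value_'))  # _get_ThisValue_
print(underscore_to_camelcase('_get__this_value_'))  # served from the cache
print(underscore_to_camelcase.cache_info())
```

Whether caching pays off depends on how often names repeat; for a function this cheap it is worth measuring before adopting it.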
This algorithm also handles digits well:
import re
PATTERN = re.compile(r'''
    (?<!\A) # not at the start of the string
    _
    (?=[a-zA-Z]) # followed by a letter
    ''', re.X)
def camelize(value):
    tokens = PATTERN.split(value)
    response = tokens.pop(0).lower()
    for remain in tokens:
        response += remain.capitalize()
    return response
Examples:
>>> camelize('Foo')
'foo'
>>> camelize('_Foo')
'_foo'
>>> camelize('Foo_')
'foo_'
>>> camelize('Foo_Bar')
'fooBar'
>>> camelize('Foo__Bar')
'foo_Bar'
>>> camelize('9')
'9'
>>> camelize('9_foo')
'9Foo'
>>> camelize('foo_9')
'foo_9'
>>> camelize('foo_9_bar')
'foo_9Bar'
>>> camelize('foo__9__bar')
'foo__9_Bar'
Here's mine, relying mainly on list comprehension, split, and join. Plus optional parameter to use different delimiter:
def underscore_to_camel(in_str, delim="_"):
    chunks = in_str.split(delim)
    chunks[1:] = [_.title() for _ in chunks[1:]]
    return "".join(chunks)
Also, for the sake of completeness, here is the reverse conversion, which was referenced earlier as the solution to another question (NOT my own code, just repeated for easy reference):
import re
first_cap_re = re.compile('(.)([A-Z][a-z]+)')
all_cap_re = re.compile('([a-z0-9])([A-Z])')
def camel_to_underscore(in_str):
    s1 = first_cap_re.sub(r'\1_\2', in_str)
    return all_cap_re.sub(r'\1_\2', s1).lower()
I agree with Gareth that the code is ok. However, if you really want a shorter, yet readable approach you could try something like this:
def underscore_to_camelcase(value):
    # Make a list of capitalized words and underscores to be preserved
    capitalized_words = [w.capitalize() if w else '_' for w in value.split('_')]
    # Convert the first word to lowercase
    for i, word in enumerate(capitalized_words):
        if word != '_':
            capitalized_words[i] = word.lower()
            break
    # Join all words to a single string and return it
    return "".join(capitalized_words)
The problem calls for a function that returns a lowercase word the first time, but capitalized words afterwards. You can do that with an if clause, but then the if clause has to be evaluated for every word. An appealing alternative is to use a generator. It can return one thing on the first call, and something else on successive calls, and it does not require as many ifs.
def lower_camelcase(seq):
    it=iter(seq)
    for word in it:
        yield word.lower()
        if word.isalnum(): break
    for word in it:
        yield word.capitalize()
def underscore_to_camelcase(text):
    return ''.join(lower_camelcase(word if word else '_' for word in text.split('_')))
Here is some test code to show that it works:
tests=[('get__this_value','get_ThisValue'),
       ('_get__this_value','_get_ThisValue'),
       ('_get__this_value_','_get_ThisValue_'),
       ('get_this_value','getThisValue'),
       ('get__this__value','get_This_Value'),
       ]
for test,answer in tests:
    result=underscore_to_camelcase(test)
    try:
        assert result==answer
    except AssertionError:
        print('{r!r} != {a!r}'.format(r=result,a=answer))
Here is a list comprehension style generator expression.
from itertools import count
def underscore_to_camelcase(value):
    words = value.split('_')
    counter = count()
    # next(counter) is 0 (falsy) for the first non-empty word only
    return ''.join('_' if w == '' else w.capitalize() if next(counter) else w for w in words)
def convert(word):
    if not isinstance(word, str):
        return word
    if word.startswith("_"):
        word = word[1:]
    words = word.split("_")
    _words = []
    for idx, _word in enumerate(words):
        if idx == 0:
            _words.append(_word)
            continue
        _words.append(_word.capitalize())
    return ''.join(_words)
This is a compact way to do it (note that it drops all underscores, so leading, trailing, and doubled underscores are not preserved):
def underscore_to_camelcase(value):
    words = [word.capitalize() for word in value.split('_')]
    words[0]=words[0].lower()
    return "".join(words)
Another regexp solution:
import re
def conv(s):
    """Convert underscore-separated strings to camelCase equivalents.
    >>> conv('get')
    'get'
    >>> conv('_get')
    '_get'
    >>> conv('get_this_value')
    'getThisValue'
    >>> conv('__get__this_value_')
    '_get_ThisValue_'
    >>> conv('_get__this_value__')
    '_get_ThisValue_'
    >>> conv('___get_this_value')
    '_getThisValue'
    """
    # convert case:
    s = re.sub(r'(_*[A-Z])', lambda m: m.group(1).lower(), s.title(), count=1)
    # remove/normalize underscores:
    s = re.sub(r'__+|^_+|_+$', '|', s).replace('_', '').replace('|', '_')
    return s
if __name__ == "__main__":
    import doctest
    doctest.testmod()
It works for your examples, but it might fail for names containing digits, depending on how you want to capitalize them.
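To illustrate the digit caveat, here is a short self-contained sketch that repeats conv (without its doctests) and feeds it a name with a digit in it. The input 'get_2nd_value' is just an example I picked; the surprise comes from str.title(), which uppercases a letter that follows a digit.

```python
import re


def conv(s):
    # convert case: title-case everything, then lowercase the first word
    s = re.sub(r'(_*[A-Z])', lambda m: m.group(1).lower(), s.title(), count=1)
    # remove/normalize underscores:
    s = re.sub(r'__+|^_+|_+$', '|', s).replace('_', '').replace('|', '_')
    return s


print('get_2nd_value'.title())  # Get_2Nd_Value
print(conv('get_2nd_value'))    # get2NdValue, probably not what you want
print(conv('get_this_value'))   # getThisValue
```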
For regexp's sake!
import re
def underscore_to_camelcase(value):
    def rep(m):
        if m.group(1) is not None:
            return m.group(2) + m.group(3).lower() + '_'
        else:
            return m.group(3).capitalize()
    ret, nb_repl = re.subn(r'(^)?(_*)([a-zA-Z]+)', rep, value)
    return ret if (nb_repl > 1) else ret[:-1]
A slightly modified version:
import re
def underscore_to_camelcase(value):
    first = True
    res = []
    for u,w in re.findall('([_]*)([^_]*)',value):
        if first:
            res.append(u+w)
            first = False
        elif len(w)==0: # trailing underscores
            res.append(u)
        else: # trim an underscore and capitalize
            res.append(u[:-1] + w.title())
    return ''.join(res)
I know this has already been answered, but I came up with some syntactic sugar that handles a special case that the selected answer does not (words with dunders in them i.e. "my_word__is_____ugly" to "myWordIsUgly"). Obviously this can be broken up into multiple lines but I liked the challenge of getting it on one. I added line breaks for clarity.
import re
def underscore_to_camel(in_string):
    return "".join(
        list(
            map(
                lambda index_word:
                    index_word[1].lower() if index_word[0] == 0
                    # slicing avoids an IndexError on empty chunks (e.g. a trailing underscore)
                    else index_word[1][:1].upper() + index_word[1][1:],
                list(enumerate(re.split(re.compile(r"_+"), in_string)))
            )
        )
    )
pydash may also work for this purpose (https://pydash.readthedocs.io/en/latest/)
>>> from pydash.strings import snake_case
>>> snake_case('needToBeSnakeCased')
'need_to_be_snake_cased'
>>> from pydash.strings import camel_case
>>> camel_case('_get__this_value_')
'getThisValue'
