python string substitute letter by its occurrence number - python

Suppose I have a string Hello, world, and I'd like to replace all instances of o by its occurrence number, i.e. get a sting: Hell1, w2rld?
I have found how to reference a numbered group, \g<1>, but it does require a group number.
Is there are way to do what I want in python?
Update: Sorry for not mentioning that I was indeed looking for a regexp solution, not just a string. I have marked the solution I liked the best, but thank for all contributions, they were cool!

For a regular expression solution:
import re
class Replacer:
def __init__(self):
self.counter = 0
def __call__(self, mo):
self.counter += 1
return str(self.counter)
s = 'Hello, World!'
print(re.sub('o', Replacer(), s))

Split the string on the letter "o" and reassemble it by adding the index of the part in front of each part (except the first one):
string = "Hello World"
result = "".join(f"{i or ''}"+part for i,part in enumerate(string.split("o")))
output:
print(result)
# Hell1 W2rld

Using itertools.count
Ex:
import re
from itertools import count
c = count(1)
s = "Hello, world"
print(re.sub(r"o", lambda x: "{}".format(next(c)), s))
#or
print(re.sub(r"o", lambda x: f"{next(c)}", s))
# --> Hell1, w2rld

You don't need regex for that, a simple loop will suffice:
sample = "Hello, world"
pattern = 'o'
output = ''
count = 1
for char in sample:
if char == pattern:
output += str(count)
count += 1
else:
output += char
print(output)
>>> Hell1, w2rld

Related

Remove all types of brackets in a string in python (with content) [duplicate]

I have a very long string of text with () and [] in it. I'm trying to remove the characters between the parentheses and brackets but I cannot figure out how.
The list is similar to this:
x = "This is a sentence. (once a day) [twice a day]"
This list isn't what I'm working with but is very similar and a lot shorter.
You can use re.sub function.
>>> import re
>>> x = "This is a sentence. (once a day) [twice a day]"
>>> re.sub("([\(\[]).*?([\)\]])", "\g<1>\g<2>", x)
'This is a sentence. () []'
If you want to remove the [] and the () you can use this code:
>>> import re
>>> x = "This is a sentence. (once a day) [twice a day]"
>>> re.sub("[\(\[].*?[\)\]]", "", x)
'This is a sentence. '
Important: This code will not work with nested symbols
Explanation
The first regex groups ( or [ into group 1 (by surrounding it with parentheses) and ) or ] into group 2, matching these groups and all characters that come in between them. After matching, the matched portion is substituted with groups 1 and 2, leaving the final string with nothing inside the brackets. The second regex is self explanatory from this -> match everything and substitute with the empty string.
-- modified from comment by Ajay Thomas
Run this script, it works even with nested brackets.
Uses basic logical tests.
def a(test_str):
ret = ''
skip1c = 0
skip2c = 0
for i in test_str:
if i == '[':
skip1c += 1
elif i == '(':
skip2c += 1
elif i == ']' and skip1c > 0:
skip1c -= 1
elif i == ')'and skip2c > 0:
skip2c -= 1
elif skip1c == 0 and skip2c == 0:
ret += i
return ret
x = "ewq[a [(b] ([c))]] This is a sentence. (once a day) [twice a day]"
x = a(x)
print x
print repr(x)
Just incase you don't run it,
Here's the output:
>>>
ewq This is a sentence.
'ewq This is a sentence. '
Here's a solution similar to #pradyunsg's answer (it works with arbitrary nested brackets):
def remove_text_inside_brackets(text, brackets="()[]"):
count = [0] * (len(brackets) // 2) # count open/close brackets
saved_chars = []
for character in text:
for i, b in enumerate(brackets):
if character == b: # found bracket
kind, is_close = divmod(i, 2)
count[kind] += (-1)**is_close # `+1`: open, `-1`: close
if count[kind] < 0: # unbalanced bracket
count[kind] = 0 # keep it
else: # found bracket to remove
break
else: # character is not a [balanced] bracket
if not any(count): # outside brackets
saved_chars.append(character)
return ''.join(saved_chars)
print(repr(remove_text_inside_brackets(
"This is a sentence. (once a day) [twice a day]")))
# -> 'This is a sentence. '
This should work for parentheses. Regular expressions will "consume" the text it has matched so it won't work for nested parentheses.
import re
regex = re.compile(".*?\((.*?)\)")
result = re.findall(regex, mystring)
or this would find one set of parentheses, simply loop to find more:
start = mystring.find("(")
end = mystring.find(")")
if start != -1 and end != -1:
result = mystring[start+1:end]
You can split, filter, and join the string again. If your brackets are well defined the following code should do.
import re
x = "".join(re.split("\(|\)|\[|\]", x)[::2])
You can try this. Can remove the bracket and the content exist inside it.
import re
x = "This is a sentence. (once a day) [twice a day]"
x = re.sub("\(.*?\)|\[.*?\]","",x)
print(x)
Expected ouput :
This is a sentence.
For anyone who appreciates the simplicity of the accepted answer by jvallver, and is looking for more readability from their code:
>>> import re
>>> x = 'This is a sentence. (once a day) [twice a day]'
>>> opening_braces = '\(\['
>>> closing_braces = '\)\]'
>>> non_greedy_wildcard = '.*?'
>>> re.sub(f'[{opening_braces}]{non_greedy_wildcard}[{closing_braces}]', '', x)
'This is a sentence. '
Most of the explanation for why this regex works is included in the code. Your future self will thank you for the 3 additional lines.
(Replace the f-string with the equivalent string concatenation for Python2 compatibility)
The RegEx \(.*?\)|\[.*?\] removes bracket content by finding pairs, first it remove paranthesis and then square brackets. I also works fine for the nested brackets as it acts in sequence. Ofcourse, it would break in case of bad brackets scenario.
_brackets = re.compile("\(.*?\)|\[.*?\]")
_spaces = re.compile("\s+")
_b = _brackets.sub(" ", "microRNAs (miR) play a role in cancer ([1], [2])")
_s = _spaces.sub(" ", _b.strip())
print(_s)
# OUTPUT: microRNAs play a role in cancer

Regex replace in Spyder with case conversion [duplicate]

This question's answers are a community effort. Edit existing answers to improve this post. It is not currently accepting new answers or interactions.
Example:
>>> convert('CamelCase')
'camel_case'
Camel case to snake case
import re
name = 'CamelCaseName'
name = re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
print(name) # camel_case_name
If you do this many times and the above is slow, compile the regex beforehand:
pattern = re.compile(r'(?<!^)(?=[A-Z])')
name = pattern.sub('_', name).lower()
To handle more advanced cases specially (this is not reversible anymore):
def camel_to_snake(name):
name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()
print(camel_to_snake('camel2_camel2_case')) # camel2_camel2_case
print(camel_to_snake('getHTTPResponseCode')) # get_http_response_code
print(camel_to_snake('HTTPResponseCodeXYZ')) # http_response_code_xyz
To add also cases with two underscores or more:
def to_snake_case(name):
name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
name = re.sub('__([A-Z])', r'_\1', name)
name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name)
return name.lower()
Snake case to pascal case
name = 'snake_case_name'
name = ''.join(word.title() for word in name.split('_'))
print(name) # SnakeCaseName
There's an inflection library in the package index that can handle these things for you. In this case, you'd be looking for inflection.underscore():
>>> inflection.underscore('CamelCase')
'camel_case'
I don't know why these are all so complicating.
for most cases, the simple expression ([A-Z]+) will do the trick
>>> re.sub('([A-Z]+)', r'_\1','CamelCase').lower()
'_camel_case'
>>> re.sub('([A-Z]+)', r'_\1','camelCase').lower()
'camel_case'
>>> re.sub('([A-Z]+)', r'_\1','camel2Case2').lower()
'camel2_case2'
>>> re.sub('([A-Z]+)', r'_\1','camelCamelCase').lower()
'camel_camel_case'
>>> re.sub('([A-Z]+)', r'_\1','getHTTPResponseCode').lower()
'get_httpresponse_code'
To ignore the first character simply add look behind (?!^)
>>> re.sub('(?!^)([A-Z]+)', r'_\1','CamelCase').lower()
'camel_case'
>>> re.sub('(?!^)([A-Z]+)', r'_\1','CamelCamelCase').lower()
'camel_camel_case'
>>> re.sub('(?!^)([A-Z]+)', r'_\1','Camel2Camel2Case').lower()
'camel2_camel2_case'
>>> re.sub('(?!^)([A-Z]+)', r'_\1','getHTTPResponseCode').lower()
'get_httpresponse_code'
If you want to separate ALLCaps to all_caps and expect numbers in your string you still don't need to do two separate runs just use | This expression ((?<=[a-z0-9])[A-Z]|(?!^)[A-Z](?=[a-z])) can handle just about every scenario in the book
>>> a = re.compile('((?<=[a-z0-9])[A-Z]|(?!^)[A-Z](?=[a-z]))')
>>> a.sub(r'_\1', 'getHTTPResponseCode').lower()
'get_http_response_code'
>>> a.sub(r'_\1', 'get2HTTPResponseCode').lower()
'get2_http_response_code'
>>> a.sub(r'_\1', 'get2HTTPResponse123Code').lower()
'get2_http_response123_code'
>>> a.sub(r'_\1', 'HTTPResponseCode').lower()
'http_response_code'
>>> a.sub(r'_\1', 'HTTPResponseCodeXYZ').lower()
'http_response_code_xyz'
It all depends on what you want so use the solution that best suits your needs as it should not be overly complicated.
nJoy!
Avoiding libraries and regular expressions:
def camel_to_snake(s):
return ''.join(['_'+c.lower() if c.isupper() else c for c in s]).lstrip('_')
>>> camel_to_snake('ThisIsMyString')
'this_is_my_string'
stringcase is my go-to library for this; e.g.:
>>> from stringcase import pascalcase, snakecase
>>> snakecase('FooBarBaz')
'foo_bar_baz'
>>> pascalcase('foo_bar_baz')
'FooBarBaz'
I think this solution is more straightforward than previous answers:
import re
def convert (camel_input):
words = re.findall(r'[A-Z]?[a-z]+|[A-Z]{2,}(?=[A-Z][a-z]|\d|\W|$)|\d+', camel_input)
return '_'.join(map(str.lower, words))
# Let's test it
test_strings = [
'CamelCase',
'camelCamelCase',
'Camel2Camel2Case',
'getHTTPResponseCode',
'get200HTTPResponseCode',
'getHTTP200ResponseCode',
'HTTPResponseCode',
'ResponseHTTP',
'ResponseHTTP2',
'Fun?!awesome',
'Fun?!Awesome',
'10CoolDudes',
'20coolDudes'
]
for test_string in test_strings:
print(convert(test_string))
Which outputs:
camel_case
camel_camel_case
camel_2_camel_2_case
get_http_response_code
get_200_http_response_code
get_http_200_response_code
http_response_code
response_http
response_http_2
fun_awesome
fun_awesome
10_cool_dudes
20_cool_dudes
The regular expression matches three patterns:
[A-Z]?[a-z]+: Consecutive lower-case letters that optionally start with an upper-case letter.
[A-Z]{2,}(?=[A-Z][a-z]|\d|\W|$): Two or more consecutive upper-case letters. It uses a lookahead to exclude the last upper-case letter if it is followed by a lower-case letter.
\d+: Consecutive numbers.
By using re.findall we get a list of individual "words" that can be converted to lower-case and joined with underscores.
Personally I am not sure how anything using regular expressions in python can be described as elegant. Most answers here are just doing "code golf" type RE tricks. Elegant coding is supposed to be easily understood.
def to_snake_case(not_snake_case):
final = ''
for i in xrange(len(not_snake_case)):
item = not_snake_case[i]
if i < len(not_snake_case) - 1:
next_char_will_be_underscored = (
not_snake_case[i+1] == "_" or
not_snake_case[i+1] == " " or
not_snake_case[i+1].isupper()
)
if (item == " " or item == "_") and next_char_will_be_underscored:
continue
elif (item == " " or item == "_"):
final += "_"
elif item.isupper():
final += "_"+item.lower()
else:
final += item
if final[0] == "_":
final = final[1:]
return final
>>> to_snake_case("RegularExpressionsAreFunky")
'regular_expressions_are_funky'
>>> to_snake_case("RegularExpressionsAre Funky")
'regular_expressions_are_funky'
>>> to_snake_case("RegularExpressionsAre_Funky")
'regular_expressions_are_funky'
''.join('_'+c.lower() if c.isupper() else c for c in "DeathToCamelCase").strip('_')
re.sub("(.)([A-Z])", r'\1_\2', 'DeathToCamelCase').lower()
Here's my solution:
def un_camel(text):
""" Converts a CamelCase name into an under_score name.
>>> un_camel('CamelCase')
'camel_case'
>>> un_camel('getHTTPResponseCode')
'get_http_response_code'
"""
result = []
pos = 0
while pos < len(text):
if text[pos].isupper():
if pos-1 > 0 and text[pos-1].islower() or pos-1 > 0 and \
pos+1 < len(text) and text[pos+1].islower():
result.append("_%s" % text[pos].lower())
else:
result.append(text[pos].lower())
else:
result.append(text[pos])
pos += 1
return "".join(result)
It supports those corner cases discussed in the comments. For instance, it'll convert getHTTPResponseCode to get_http_response_code like it should.
I don't get idea why using both .sub() calls? :) I'm not regex guru, but I simplified function to this one, which is suitable for my certain needs, I just needed a solution to convert camelCasedVars from POST request to vars_with_underscore:
def myFunc(...):
return re.sub('(.)([A-Z]{1})', r'\1_\2', "iTriedToWriteNicely").lower()
It does not work with such names like getHTTPResponse, cause I heard it is bad naming convention (should be like getHttpResponse, it's obviously, that it's much easier memorize this form).
For the fun of it:
>>> def un_camel(input):
... output = [input[0].lower()]
... for c in input[1:]:
... if c in ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
... output.append('_')
... output.append(c.lower())
... else:
... output.append(c)
... return str.join('', output)
...
>>> un_camel("camel_case")
'camel_case'
>>> un_camel("CamelCase")
'camel_case'
Or, more for the fun of it:
>>> un_camel = lambda i: i[0].lower() + str.join('', ("_" + c.lower() if c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" else c for c in i[1:]))
>>> un_camel("camel_case")
'camel_case'
>>> un_camel("CamelCase")
'camel_case'
Using regexes may be the shortest, but this solution is way more readable:
def to_snake_case(s):
snake = "".join(["_"+c.lower() if c.isupper() else c for c in s])
return snake[1:] if snake.startswith("_") else snake
This is not a elegant method, is a very 'low level' implementation of a simple state machine (bitfield state machine), possibly the most anti pythonic mode to resolve this, however re module also implements a too complex state machine to resolve this simple task, so i think this is a good solution.
def splitSymbol(s):
si, ci, state = 0, 0, 0 # start_index, current_index
'''
state bits:
0: no yields
1: lower yields
2: lower yields - 1
4: upper yields
8: digit yields
16: other yields
32 : upper sequence mark
'''
for c in s:
if c.islower():
if state & 1:
yield s[si:ci]
si = ci
elif state & 2:
yield s[si:ci - 1]
si = ci - 1
state = 4 | 8 | 16
ci += 1
elif c.isupper():
if state & 4:
yield s[si:ci]
si = ci
if state & 32:
state = 2 | 8 | 16 | 32
else:
state = 8 | 16 | 32
ci += 1
elif c.isdigit():
if state & 8:
yield s[si:ci]
si = ci
state = 1 | 4 | 16
ci += 1
else:
if state & 16:
yield s[si:ci]
state = 0
ci += 1 # eat ci
si = ci
print(' : ', c, bin(state))
if state:
yield s[si:ci]
def camelcaseToUnderscore(s):
return '_'.join(splitSymbol(s))
splitsymbol can parses all case types: UpperSEQUENCEInterleaved, under_score, BIG_SYMBOLS and cammelCasedMethods
I hope it is useful
Take a look at the excellent Schematics lib
https://github.com/schematics/schematics
It allows you to created typed data structures that can serialize/deserialize from python to Javascript flavour, eg:
class MapPrice(Model):
price_before_vat = DecimalType(serialized_name='priceBeforeVat')
vat_rate = DecimalType(serialized_name='vatRate')
vat = DecimalType()
total_price = DecimalType(serialized_name='totalPrice')
So many complicated methods...
Just find all "Titled" group and join its lower cased variant with underscore.
>>> import re
>>> def camel_to_snake(string):
... groups = re.findall('([A-z0-9][a-z]*)', string)
... return '_'.join([i.lower() for i in groups])
...
>>> camel_to_snake('ABCPingPongByTheWay2KWhereIsOurBorderlands3???')
'a_b_c_ping_pong_by_the_way_2_k_where_is_our_borderlands_3'
If you don't want make numbers like first character of group or separate group - you can use ([A-z][a-z0-9]*) mask.
A horrendous example using regular expressions (you could easily clean this up :) ):
def f(s):
return s.group(1).lower() + "_" + s.group(2).lower()
p = re.compile("([A-Z]+[a-z]+)([A-Z]?)")
print p.sub(f, "CamelCase")
print p.sub(f, "getHTTPResponseCode")
Works for getHTTPResponseCode though!
Alternatively, using lambda:
p = re.compile("([A-Z]+[a-z]+)([A-Z]?)")
print p.sub(lambda x: x.group(1).lower() + "_" + x.group(2).lower(), "CamelCase")
print p.sub(lambda x: x.group(1).lower() + "_" + x.group(2).lower(), "getHTTPResponseCode")
EDIT: It should also be pretty easy to see that there's room for improvement for cases like "Test", because the underscore is unconditionally inserted.
Lightely adapted from https://stackoverflow.com/users/267781/matth
who use generators.
def uncamelize(s):
buff, l = '', []
for ltr in s:
if ltr.isupper():
if buff:
l.append(buff)
buff = ''
buff += ltr
l.append(buff)
return '_'.join(l).lower()
This simple method should do the job:
import re
def convert(name):
return re.sub(r'([A-Z]*)([A-Z][a-z]+)', lambda x: (x.group(1) + '_' if x.group(1) else '') + x.group(2) + '_', name).rstrip('_').lower()
We look for capital letters that are precedeed by any number of (or zero) capital letters, and followed by any number of lowercase characters.
An underscore is placed just before the occurence of the last capital letter found in the group, and one can be placed before that capital letter in case it is preceded by other capital letters.
If there are trailing underscores, remove those.
Finally, the whole result string is changed to lower case.
(taken from here, see working example online)
Here's something I did to change the headers on a tab-delimited file. I'm omitting the part where I only edited the first line of the file. You could adapt it to Python pretty easily with the re library. This also includes separating out numbers (but keeps the digits together). I did it in two steps because that was easier than telling it not to put an underscore at the start of a line or tab.
Step One...find uppercase letters or integers preceded by lowercase letters, and precede them with an underscore:
Search:
([a-z]+)([A-Z]|[0-9]+)
Replacement:
\1_\l\2/
Step Two...take the above and run it again to convert all caps to lowercase:
Search:
([A-Z])
Replacement (that's backslash, lowercase L, backslash, one):
\l\1
I was looking for a solution to the same problem, except that I needed a chain; e.g.
"CamelCamelCamelCase" -> "Camel-camel-camel-case"
Starting from the nice two-word solutions here, I came up with the following:
"-".join(x.group(1).lower() if x.group(2) is None else x.group(1) \
for x in re.finditer("((^.[^A-Z]+)|([A-Z][^A-Z]+))", "stringToSplit"))
Most of the complicated logic is to avoid lowercasing the first word. Here's a simpler version if you don't mind altering the first word:
"-".join(x.group(1).lower() for x in re.finditer("(^[^A-Z]+|[A-Z][^A-Z]+)", "stringToSplit"))
Of course, you can pre-compile the regular expressions or join with underscore instead of hyphen, as discussed in the other solutions.
Concise without regular expressions, but HTTPResponseCode=> httpresponse_code:
def from_camel(name):
"""
ThisIsCamelCase ==> this_is_camel_case
"""
name = name.replace("_", "")
_cas = lambda _x : [_i.isupper() for _i in _x]
seq = zip(_cas(name[1:-1]), _cas(name[2:]))
ss = [_x + 1 for _x, (_i, _j) in enumerate(seq) if (_i, _j) == (False, True)]
return "".join([ch + "_" if _x in ss else ch for _x, ch in numerate(name.lower())])
Without any library :
def camelify(out):
return (''.join(["_"+x.lower() if i<len(out)-1 and x.isupper() and out[i+1].islower()
else x.lower()+"_" if i<len(out)-1 and x.islower() and out[i+1].isupper()
else x.lower() for i,x in enumerate(list(out))])).lstrip('_').replace('__','_')
A bit heavy, but
CamelCamelCamelCase -> camel_camel_camel_case
HTTPRequest -> http_request
GetHTTPRequest -> get_http_request
getHTTPRequest -> get_http_request
Very nice RegEx proposed on this site:
(?<!^)(?=[A-Z])
If python have a String Split method, it should work...
In Java:
String s = "loremIpsum";
words = s.split("(?<!^)(?=[A-Z])");
Just in case someone needs to transform a complete source file, here is a script that will do it.
# Copy and paste your camel case code in the string below
camelCaseCode ="""
cv2.Matx33d ComputeZoomMatrix(const cv2.Point2d & zoomCenter, double zoomRatio)
{
auto mat = cv2.Matx33d::eye();
mat(0, 0) = zoomRatio;
mat(1, 1) = zoomRatio;
mat(0, 2) = zoomCenter.x * (1. - zoomRatio);
mat(1, 2) = zoomCenter.y * (1. - zoomRatio);
return mat;
}
"""
import re
def snake_case(name):
s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
def lines(str):
return str.split("\n")
def unlines(lst):
return "\n".join(lst)
def words(str):
return str.split(" ")
def unwords(lst):
return " ".join(lst)
def map_partial(function):
return lambda values : [ function(v) for v in values]
import functools
def compose(*functions):
return functools.reduce(lambda f, g: lambda x: f(g(x)), functions, lambda x: x)
snake_case_code = compose(
unlines ,
map_partial(unwords),
map_partial(map_partial(snake_case)),
map_partial(words),
lines
)
print(snake_case_code(camelCaseCode))
Wow I just stole this from django snippets. ref http://djangosnippets.org/snippets/585/
Pretty elegant
camelcase_to_underscore = lambda str: re.sub(r'(?<=[a-z])[A-Z]|[A-Z](?=[^A-Z])', r'_\g<0>', str).lower().strip('_')
Example:
camelcase_to_underscore('ThisUser')
Returns:
'this_user'
REGEX DEMO
def convert(name):
return reduce(
lambda x, y: x + ('_' if y.isupper() else '') + y,
name
).lower()
And if we need to cover a case with already-un-cameled input:
def convert(name):
return reduce(
lambda x, y: x + ('_' if y.isupper() and not x.endswith('_') else '') + y,
name
).lower()
Not in the standard library, but I found this module that appears to contain the functionality you need.
If you use Google's (nearly) deterministic Camel case algorithm, then one does not need to handle things like HTMLDocument since it should be HtmlDocument, then this regex based approach is simple. It replace all capitals or numbers with an underscore. Note does not handle multi digit numbers.
import re
def to_snake_case(camel_str):
return re.sub('([A-Z0-9])', r'_\1', camel_str).lower().lstrip('_')
def convert(camel_str):
temp_list = []
for letter in camel_str:
if letter.islower():
temp_list.append(letter)
else:
temp_list.append('_')
temp_list.append(letter)
result = "".join(temp_list)
return result.lower()
Use: str.capitalize() to convert first letter of the string (contained in variable str) to a capital letter and returns the entire string.
Example:
Command: "hello".capitalize()
Output: Hello

Python isspace function

I'm having difficulty with the isspace function. Any idea why my code is wrong and how to fix it?
Here is the problem:
Implement the get_num_of_non_WS_characters() function. get_num_of_non_WS_characters() has a string parameter and returns the number of characters in the string, excluding all whitespace.
Here is my code:
def get_num_of_non_WS_characters(s):
count = 0
for char in s:
if char.isspace():
count = count + 1
return count
You want non whitespace, so you should use not
def get_num_of_non_WS_characters(s):
count = 0
for char in s:
if not char.isspace():
count += 1
return count
>>> get_num_of_non_WS_characters('hello')
5
>>> get_num_of_non_WS_characters('hello ')
5
For completeness, this could be done more succinctly using a generator expression
def get_num_of_non_WS_characters(s):
return sum(1 for char in s if not char.isspace())
A shorter version of #CoryKramer answer:
def get_num_of_non_WS_characters(s):
return sum(not c.isspace() for c in s)
As an alternative you could also simple do:
def get_num_of_non_WS_characters(s):
return len(''.join(s.split()))
Then
s = 'i am a string'
get_num_of_non_WS_characters(s)
will return 10
This will also remove tabs and new line characters:
s = 'i am a string\nwith line break'
''.join(s.split())
will give
'iamastringwithlinebreak'
I would just use n=s.replace(" " , "") and then len(n).
Otherwise I think you should increase the count after the if statement and put a continue inside it.

Reverse marked substrings in a string

I have a string in which every marked substring within < and >
has to be reversed (the brackets don't nest). For example,
"hello <wolfrevokcats>, how <t uoy era>oday?"
should become
"hello stackoverflow, how are you today?"
My current idea is to loop over the string and find pairs of indices
where < and > are. Then simply slice the string and put the slices
together again with everything that was in between the markers reversed.
Is this a correct approach? Is there an obvious/better solution?
It's pretty simple with regular expressions. re.sub takes a function as an argument to which the match object is passed.
>>> import re
>>> s = 'hello <wolfrevokcats>, how <t uoy era>oday?'
>>> re.sub('<(.*?)>', lambda m: m.group(1)[::-1], s)
'hello stackoverflow, how are you today?'
Explanation of the regex:
<(.*?)> will match everything between < and > in matching group 1. To ensure that the regex engine will stop at the first > symbol occurrence, the lazy quantifier *? is used.
The function lambda m: m.group(1)[::-1] that is passed to re.sub takes the match object, extracts group 1, and reverses the string. Finally re.sub inserts this return value.
Or, use re.sub() and a replacing function:
>>> import re
s = 'hello <wolfrevokcats>, how <t uoy era>oday?'
>>> re.sub(r"<(.*?)>", lambda match: match.group(1)[::-1], s)
'hello stackoverflow, how are you today?'
where .*? would match any characters any number of times in a non-greedy fashion. The parenthesis around it would help us to capture it in a group which we then refer to in the replacing function - match.group(1). [::-1] slice notation reverses a string.
I'm going to assume this is a coursework assignment and the use of regular expressions isn't allowed. So I'm going to offer a solution that doesn't use it.
content = "hello <wolfrevokcats>, how <t uoy era>oday?"
insert_pos = -1
result = []
placeholder_count = 0
for pos, ch in enumerate(content):
if ch == '<':
insert_pos = pos
elif ch == '>':
insert_pos = -1
placeholder_count += 1
elif insert_pos >= 0:
result.insert(insert_pos - (placeholder_count * 2), ch)
else:
result.append(ch)
print("".join(result))
The gist of the code is to have just a single pass at the string one character at a time. When outside the brackets, simply append the character at the end of the result string. When inside the brackets, insert the character at the position of the opening bracket (i.e. pre-pend the character).
I agree that regular expressions is the proper tool to solve this problem, and I like the gist of Dmitry B.'s answer. However, I used this question to practice about generators and functional programming, and I post my solution just for sharing it.
msg = "<,woN> hello <wolfrevokcats>, how <t uoy era>oday?"
def traverse(s, d=">"):
for c in s:
if c in "<>": d = c
else: yield c, d
def group(tt, dc=None):
for c, d in tt:
if d != dc:
if dc is not None:
yield dc, l
l = [c]
dc = d
else:
l.append(c)
else: yield dc, l
def direct(groups):
func = lambda d: list if d == ">" else reversed
fst = lambda t: t[0]
snd = lambda t: t[1]
for gr in groups:
yield func(fst(gr))(snd(gr))
def concat(groups):
return "".join("".join(gr) for gr in groups)
print(concat(direct(group(traverse(msg)))))
#Now, hello stackoverflow, how are you today?
Here's another one without using regular expressions:
def reverse_marked(str0):
separators = ['<', '>']
reverse = 0
str1 = ['', str0]
res = ''
while len(str1) == 2:
str1 = str1[1].split(separators[reverse], maxsplit=1)
res = ''.join((res, str1[0][::-1] if reverse else str1[0]))
reverse = 1 - reverse # toggle 0 - 1 - 0 ...
return res
print(reverse_marked('hello <wolfrevokcats>, how <t uoy era>oday?'))
Output:
hello stackoverflow, how are you today?

Remove text between () and []

I have a very long string of text with () and [] in it. I'm trying to remove the characters between the parentheses and brackets but I cannot figure out how.
The list is similar to this:
x = "This is a sentence. (once a day) [twice a day]"
This list isn't what I'm working with but is very similar and a lot shorter.
You can use re.sub function.
>>> import re
>>> x = "This is a sentence. (once a day) [twice a day]"
>>> re.sub("([\(\[]).*?([\)\]])", "\g<1>\g<2>", x)
'This is a sentence. () []'
If you want to remove the [] and the () you can use this code:
>>> import re
>>> x = "This is a sentence. (once a day) [twice a day]"
>>> re.sub("[\(\[].*?[\)\]]", "", x)
'This is a sentence. '
Important: This code will not work with nested symbols
Explanation
The first regex groups ( or [ into group 1 (by surrounding it with parentheses) and ) or ] into group 2, matching these groups and all characters that come in between them. After matching, the matched portion is substituted with groups 1 and 2, leaving the final string with nothing inside the brackets. The second regex is self explanatory from this -> match everything and substitute with the empty string.
-- modified from comment by Ajay Thomas
Run this script, it works even with nested brackets.
Uses basic logical tests.
def a(test_str):
ret = ''
skip1c = 0
skip2c = 0
for i in test_str:
if i == '[':
skip1c += 1
elif i == '(':
skip2c += 1
elif i == ']' and skip1c > 0:
skip1c -= 1
elif i == ')'and skip2c > 0:
skip2c -= 1
elif skip1c == 0 and skip2c == 0:
ret += i
return ret
x = "ewq[a [(b] ([c))]] This is a sentence. (once a day) [twice a day]"
x = a(x)
print x
print repr(x)
Just incase you don't run it,
Here's the output:
>>>
ewq This is a sentence.
'ewq This is a sentence. '
Here's a solution similar to #pradyunsg's answer (it works with arbitrary nested brackets):
def remove_text_inside_brackets(text, brackets="()[]"):
count = [0] * (len(brackets) // 2) # count open/close brackets
saved_chars = []
for character in text:
for i, b in enumerate(brackets):
if character == b: # found bracket
kind, is_close = divmod(i, 2)
count[kind] += (-1)**is_close # `+1`: open, `-1`: close
if count[kind] < 0: # unbalanced bracket
count[kind] = 0 # keep it
else: # found bracket to remove
break
else: # character is not a [balanced] bracket
if not any(count): # outside brackets
saved_chars.append(character)
return ''.join(saved_chars)
print(repr(remove_text_inside_brackets(
"This is a sentence. (once a day) [twice a day]")))
# -> 'This is a sentence. '
This should work for parentheses. Regular expressions will "consume" the text it has matched so it won't work for nested parentheses.
import re
regex = re.compile(".*?\((.*?)\)")
result = re.findall(regex, mystring)
or this would find one set of parentheses, simply loop to find more:
start = mystring.find("(")
end = mystring.find(")")
if start != -1 and end != -1:
result = mystring[start+1:end]
You can split, filter, and join the string again. If your brackets are well defined the following code should do.
import re
x = "".join(re.split("\(|\)|\[|\]", x)[::2])
You can try this. Can remove the bracket and the content exist inside it.
import re
x = "This is a sentence. (once a day) [twice a day]"
x = re.sub("\(.*?\)|\[.*?\]","",x)
print(x)
Expected ouput :
This is a sentence.
For anyone who appreciates the simplicity of the accepted answer by jvallver, and is looking for more readability from their code:
>>> import re
>>> x = 'This is a sentence. (once a day) [twice a day]'
>>> opening_braces = '\(\['
>>> closing_braces = '\)\]'
>>> non_greedy_wildcard = '.*?'
>>> re.sub(f'[{opening_braces}]{non_greedy_wildcard}[{closing_braces}]', '', x)
'This is a sentence. '
Most of the explanation for why this regex works is included in the code. Your future self will thank you for the 3 additional lines.
(Replace the f-string with the equivalent string concatenation for Python2 compatibility)
The RegEx \(.*?\)|\[.*?\] removes bracket content by finding pairs, first it remove paranthesis and then square brackets. I also works fine for the nested brackets as it acts in sequence. Ofcourse, it would break in case of bad brackets scenario.
_brackets = re.compile("\(.*?\)|\[.*?\]")
_spaces = re.compile("\s+")
_b = _brackets.sub(" ", "microRNAs (miR) play a role in cancer ([1], [2])")
_s = _spaces.sub(" ", _b.strip())
print(_s)
# OUTPUT: microRNAs play a role in cancer

Categories