Regex subbing in Python leads to ASCII characters appearing - python

I am trying to use regex to replace some issues in some text.
Strings look like this:
a = "Here is a shortString with various issuesWith spacing"
My regex looks like this right now:
new_string = re.sub("[a-z][A-Z]", "\1 \2", a)
This takes those places with missing spaces (there is always a capital letter after a lowercase letter), and adds a space.
Unfortunately, the output looks like this:
Here is a shor\x01 \x02tring with various issue\x01 \x02ith spacing
I want it to look like this:
b = "Here is a short String with various issues With spacing"
It seems that the regex is properly matching the correct instances of things I want to change, but there is something wrong with my substitution. I thought \1 \2 meant replace with the first part of the regex, add a space, and then add the second matched item. But for some reason I get something else?

>>> a = "Here is a shortString with various issuesWith spacing"
>>> re.sub("([a-z])([A-Z])", r"\1 \2", a)
'Here is a short String with various issues With spacing'
The capturing groups and the backslash escaping (a raw string) were missing.
You can go even further:
>>> a = "Here is a shortString with various issuesWith spacing"
>>> re.sub('([a-z])([A-Z])', r'\1 \2', a).lower().capitalize()
'Here is a short string with various issues with spacing'

You need to define capturing groups, and use raw string literals:
import re
a = "Here is a shortString with various issuesWith spacing"
new_string = re.sub(r"([a-z])([A-Z])", r"\1 \2", a)
print(new_string)
Note that without the r'' prefix Python interpreted the \1 and \2 as characters rather than as backreferences since the \ was parsed as part of an escape sequence. In raw string literals, \ is parsed as a literal backslash.
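To see the difference concretely, here is a small sketch comparing the two literals. The variable names plain, raw and fixed are only illustrative.

```python
import re

a = "Here is a shortString with various issuesWith spacing"

# In a normal string literal, "\1" is an octal escape: a single character with code point 1.
plain = "\1 \2"
# In a raw string literal the backslash is kept, so re.sub sees the backreferences \1 and \2.
raw = r"\1 \2"

print(len(plain), repr(plain))  # 3 '\x01 \x02'
print(len(raw), repr(raw))      # 5 '\\1 \\2'

# The pattern also needs two capturing groups for \1 and \2 to refer to.
fixed = re.sub(r"([a-z])([A-Z])", raw, a)
print(fixed)  # Here is a short String with various issues With spacing
```

The \x01 and \x02 characters in the question's output are exactly the two control characters that the non-raw literal produced.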

You can try it like this:
>>> import re
>>> a = "Here is a shortString with various issuesWith spacing"
>>> re.sub(r"(?<=[a-z])(?=[A-Z])", " ", a)
'Here is a short String with various issues With spacing'

Related

Regex to Select All Math Operator not Hyphen

I am trying to write a regex that splits a string on math operators, but not on hyphens that sit inside a name (such as The-Meaning):
dummy_string= "I Dont_Know The-Meaning_2018-You Know_Meaning_2017+You Know_Meaning_2017"
string_list = re.split("[0-9][+/*\-][A-Za-z]", dummy_string)
print(string_list)
>>['I Dont_Know The-Meaning_201', 'ou Know_Meaning_201', 'ou Know_Meaning_2017']
Expected output:
>>['I Dont_Know The-Meaning_2018', 'You Know_Meaning_2017', 'You Know_Meaning_2017']
I am using the re package for this.
You may use (?<=[0-9]) and (?=[A-Za-z]) lookarounds instead of consuming patterns:
import re
dummy_string= "I Dont_Know The-Meaning_2018-You Know_Meaning_2017+You Know_Meaning_2017"
string_list = re.split("(?<=[0-9])[+/*-](?=[A-Za-z])", dummy_string)
print(string_list)
# => ['I Dont_Know The-Meaning_2018', 'You Know_Meaning_2017', 'You Know_Meaning_2017']
When you use [0-9][+/*\-][A-Za-z] to split a string, the digit before the operator and the letter after it are consumed, i.e. added to the match value, and re.split removes this text from the resulting output. With lookarounds, the neighbouring characters remain "unconsumed": they are not added to the match value and thus stay in the re.split output.
Note that you do not have to escape - when it is at the end of the character class, [+/*-] = [+/*\-]. If you plan to add more chars into the class, you may keep - escaped to avoid further issues.
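A short sketch of both points, using the string from the question. The names consumed, kept and range_pitfall are made up for illustration.

```python
import re

dummy_string = "I Dont_Know The-Meaning_2018-You Know_Meaning_2017+You Know_Meaning_2017"

# Consuming pattern: the digit and the letter belong to each match, so re.split drops them.
consumed = re.split(r"[0-9][+/*\-][A-Za-z]", dummy_string)
print(consumed)
# ['I Dont_Know The-Meaning_201', 'ou Know_Meaning_201', 'ou Know_Meaning_2017']

# Lookarounds: only the operator is matched; the digit and the letter are merely checked.
kept = re.split(r"(?<=[0-9])[+/*-](?=[A-Za-z])", dummy_string)
print(kept)
# ['I Dont_Know The-Meaning_2018', 'You Know_Meaning_2017', 'You Know_Meaning_2017']

# Why escaping "-" can matter: in the middle of a class it forms a range.
# [+-/] covers every character from "+" to "/", which includes "," and ".".
range_pitfall = re.findall(r"[+-/]", "1,2.3")
print(range_pitfall)  # [',', '.']
```

The hyphen in The-Meaning is left alone in both cases because it is not preceded by a digit.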

How to remove substrings marked with special characters from a string?

I have a string in Python:
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print Tt
'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'
As you can see, some words are repeated inside the <\" \"> parts.
My question is: how do I delete those repeated parts (delimited by these characters)?
The result should be like:
'This is a string, It should be changed to a nummber.'
Use regular expressions:
import re
Tt = re.sub('<\".*?\">', '', Tt)
Note the ? after *. It makes the expression non-greedy,
so it matches as few characters as possible between <\" and \">.
James's solution works only when each delimiter
is a single character (< and >). In that case you can use a negated character class such as [^>]. If you want to remove a substring delimited by character sequences (e.g. by begin and end), you should use a non-greedy regular expression (i.e. .*?).
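The difference is easy to check on the string from the question. This is only a sketch; the names greedy, lazy and negated are illustrative.

```python
import re

Tt = 'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'

# Greedy: .* runs from the first <" to the LAST ">, swallowing everything in between.
greedy = re.sub(r'<".*">', '', Tt)
print(greedy)  # This is a a nummber.

# Non-greedy: .*? stops at the nearest "> after each <".
lazy = re.sub(r'<".*?">', '', Tt)
print(lazy)    # This is a string, It should be changed to a nummber.

# Negated class: works here because the closing delimiter ends in the single character >.
negated = re.sub(r'<[^>]*>', '', Tt)
print(negated == lazy)  # True
```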
I'd use a quick regular expression:
import re
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(re.sub("<[^<]+>", "", Tt))
#Out: This is a string, It should be changed to a nummber.
Ah, this is similar to Igor's post; he beat me to it by a bit. Rather than making the expression non-greedy, I don't match an expression if it contains another start tag "<" in it, so it will only match a start tag that's followed by an end tag ">".

Split stacked entities using regex re.split in python

I am having trouble splitting continuous strings into more reasonable parts:
E.g. 'MarieMüller' should become 'Marie Müller'
So far I've used this, which works if no special characters occur:
' '.join([a for a in re.split(ur'([A-Z][a-z]+)', ''.join(entity)) if a])
This outputs for e.g. 'TinaTurner' -> 'Tina Turner', but doesn't work
for 'MarieMüller', which outputs: 'MarieMüller' -> 'Marie M \utf8 ller'
Now I came across the \p{L} regex syntax:
' '.join([a for a in re.split(ur'([\p{Lu}][\p{Ll}]+)', ''.join(entity)) if a])
But this produces weird things like:
'JenniferLawrence' -> 'Jennifer L awrence'
Could anyone give me a hand?
If you work with Unicode and need to use Unicode categories, you should consider using the PyPI regex module, which supports all the Unicode categories:
>>> import regex
>>> p = regex.compile(ur'(?<=\p{Ll})(?=\p{Lu})')
>>> test_str = u"Tina Turner\nMarieM\u00FCller\nJacek\u0104cki"
>>> result = p.sub(u" ", test_str)
>>> result
u'Tina Turner\nMarie M\xfcller\nJacek \u0104cki'
      ^             ^                ^
Here, the (?<=\p{Ll})(?=\p{Lu}) regex finds all locations between a lowercase (\p{Ll}) and an uppercase (\p{Lu}) letter, and regex.sub then inserts a space there. Note that the regex module automatically compiles the regex with the regex.UNICODE flag if the pattern is a Unicode string (u-prefixed).
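If installing a third-party module is not an option, a rough stand-in can be sketched in Python 3 with the built-in string methods, which are Unicode-aware. This is a different technique from the \p{..} matching above: str.islower() and str.isupper() only approximate the Ll and Lu categories, and the helper name insert_case_spaces is made up for this example.

```python
def insert_case_spaces(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    out = []
    prev = ""  # the empty string is neither lower nor upper, so nothing is inserted at the start
    for ch in text:
        if prev.islower() and ch.isupper():
            out.append(" ")
        out.append(ch)
        prev = ch
    return "".join(out)


result = insert_case_spaces("Tina Turner\nMarieM\u00FCller\nJacek\u0104cki")
print(result)
```

For the test string from the answer this gives the same text as the regex version, with spaces inserted in "Marie Müller" and "Jacek Ącki".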
You can use re.sub() for this; it is much simpler. Note that it won't work for extended (non-ASCII) characters, because it relies on [A-Z]:
(?=(?!^)[A-Z])
To handle spaces:
print(re.sub(r'(?<=[^\s])(?=(?!^)[A-Z])', ' ', ' Tina Turner'.strip()))
To handle consecutive capital letters:
print(re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', ' TinaTXYurner'.strip()))
Regex breakdown:
(?=        # Lookahead to find the position before each capital letter
  (?!^)    # Skip the first capital letter (the position at the start of the string)
  [A-Z]
)
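The effect of each piece can be checked with a few calls. This is a sketch for Python 3; the variable names are illustrative.

```python
import re

# Lookahead only: a space goes before every capital letter, including the first one.
no_guard = re.sub(r"(?=[A-Z])", " ", "TinaTurner")
print(repr(no_guard))      # ' Tina Turner'

# (?!^) rejects the position at the very start of the string.
with_guard = re.sub(r"(?=(?!^)[A-Z])", " ", "TinaTurner")
print(repr(with_guard))    # 'Tina Turner'

# Runs of capitals are still split letter by letter ...
run_split = re.sub(r"(?=(?!^)[A-Z])", " ", "TinaTXYurner")
print(repr(run_split))     # 'Tina T X Yurner'

# ... unless a lowercase letter is required just before the position.
run_kept = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", "TinaTXYurner")
print(repr(run_kept))      # 'Tina TXYurner'
```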
Using a function built from Python's string operations instead of regular expressions, this should work:
def split_combined_words(combined):
    # Start with the first character; the loop handles the rest.
    separated = [combined[0]]
    for letter in combined[1:]:
        # Lowercase letters and runs of capitals stay attached to the current word.
        if letter.islower() or (letter.isupper() and separated[-1].isupper()):
            separated.append(letter)
        else:
            # Anything else starts a new word.
            separated.extend((" ", letter))
    return "".join(separated)

Python unescaping string in regex replacements

The output of the code below:
rpl = 'This is a nicely escaped newline \\n'
my_string = 'I hope this apple is replaced with a nicely escaped string'
reg = re.compile('apple')
reg.sub( rpl, my_string )
..is:
'I hope this This is a nicely escaped newline \n is replaced with a nicely escaped string'
..so when printed:
I hope this This is a nicely escaped newline
is replaced with a nicely escaped string
So Python is unescaping the string when it replaces 'apple' in the other string? For now I've just done
reg.sub( rpl.replace('\\','\\\\'), my_string )
Is this safe? Is there a way to stop Python from doing that?
From help(re.sub) [emphasis mine]:
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
One way to get around this is to pass a lambda:
>>> reg.sub(rpl, my_string )
'I hope this This is a nicely escaped newline \n is replaced with a nicely escaped string'
>>> reg.sub(lambda x: rpl, my_string )
'I hope this This is a nicely escaped newline \\n is replaced with a nicely escaped string'
Python's re module processes backslash escapes in all the patterns it is given, both search patterns and replacement patterns. This is why the r modifier is generally used with regex patterns in Python, as it reduces the amount of "backwhacking" necessary to write usable patterns.
The r modifier appears before a string constant and basically makes all \ characters (except those before string delimiters) verbatim. So, r'\\' == '\\\\', and r'\n' == '\\n'.
Writing your example as
rpl = r'This is a nicely escaped newline \\n'
my_string = 'I hope this apple is replaced with a nicely escaped string'
reg = re.compile(r'apple')
reg.sub( rpl, my_string )
works as expected.
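The three behaviours discussed here can be compared side by side. This is a sketch with illustrative variable names. It suggests that, for this input, the asker's workaround of doubling the backslashes gives the same result as passing a callable.

```python
import re

rpl = 'This is a nicely escaped newline \\n'  # ends with a backslash followed by "n"
my_string = 'I hope this apple is replaced with a nicely escaped string'
reg = re.compile('apple')

# String replacement: re.sub processes the backslash + "n" into a real newline.
processed = reg.sub(rpl, my_string)

# Callable replacement: the returned string is inserted as-is, with no escape processing.
literal = reg.sub(lambda m: rpl, my_string)

# Doubling each backslash first, so that escape processing restores the single backslash.
doubled = reg.sub(rpl.replace('\\', '\\\\'), my_string)

print(repr(processed))
print(repr(literal))
print(doubled == literal)  # True
```

The callable form has the advantage that the replacement text never needs to be pre-escaped, whatever it contains.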

Match single quotes from python re

How can I match the following? I want all the names within the single quotes:
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How do I extract only the names within the single quotes?
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
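For a self-contained version, with s defined as the sentence from the question, the role of the capture group can be seen by running the pattern with and without it. The variable names are illustrative.

```python
import re

s = "This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'"

# With a capture group, findall returns only the captured part (no quotes).
with_group = re.findall(r"'(\w+)'", s)
print(with_group)     # ['Tom', 'Harry', 'rock']

# Without a group, findall returns the whole match, quotes included.
without_group = re.findall(r"'\w+'", s)
print(without_group)  # ["'Tom'", "'Harry'", "'rock'"]

# Requiring a capital first letter drops 'rock'.
capitalised = re.findall(r"'([A-Z]\w*)'", s)
print(capitalised)    # ['Tom', 'Harry']
```

The apostrophes in hasn't and turn's are not matched because the letters after them are followed by a space rather than a closing quote.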
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
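A quick check of what \B buys you, using a made-up test sentence that contains Rock'n'Roll:

```python
import re

text = "Rock'n'Roll fans: 'Tom' and 'Harry'."

# Without \B, the 'n' inside Rock'n'Roll is picked up as well.
plain = re.findall(r"'\w+'", text)
print(plain)    # ["'n'", "'Tom'", "'Harry'"]

# With \B, a quote that directly touches a word character on the outside is rejected.
guarded = re.findall(r"\B'\w+'\B", text)
print(guarded)  # ["'Tom'", "'Harry'"]
```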
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve; this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"
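One thing to watch for with [^']* on this particular sentence: the apostrophes in hasn't and turn's pair up with each other, so the first result is not a name. A small sketch, with s assumed to be the sentence from the question:

```python
import re

s = "This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'"

# [^']* accepts spaces, so the text between two unrelated apostrophes also matches.
loose = re.findall(r"'([^']*)'", s)
print(loose)
# ['t been much that much of a twist and turn', 'Tom', 'Harry', 'rock']

# Restricting the content to word characters avoids that here.
strict = re.findall(r"'(\w+)'", s)
print(strict)  # ['Tom', 'Harry', 'rock']
```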
